commit 257d3cbc7f7401f8ec59a24f0b6378cbb1eff126
parent 0a40a42daa93a0cd9f1f7824fcd49ed4b7ae45a6
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Sat, 4 Apr 2026 10:47:44 +0200
Document pipeline flags and workflow in README

Covers: --reeval, --analyze, --full-pipeline, -j, --model flags.
Profiles (smoke, all-on, all-off, core) and DOE designs documented.
Pipeline auto-commits and pushes results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat:
README.md | 55 ++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 42 insertions(+), 13 deletions(-)
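
Note on the flags documented in the diff below: the README states that `--full-pipeline` is equivalent to `--reeval` + `--analyze`, and that `-j N` and `--model MODEL` take a value. The commit does not include the source of `harness/run.py`, so the following is only a minimal sketch of how such a command line might be parsed. The function names `build_parser` and `resolve_flags`, the `jobs` attribute, and the default of one parallel job are assumptions made for illustration, not the harness's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in the README diff."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("grid", help="grid definition, e.g. grid.yaml")
    parser.add_argument("design", help="profile or design, e.g. smoke or main_effects")
    parser.add_argument("--reeval", action="store_true",
                        help="re-evaluate all runs with the current eval scripts")
    parser.add_argument("--analyze", action="store_true",
                        help="save main effects analysis")
    parser.add_argument("--full-pipeline", action="store_true",
                        help="shorthand for --reeval plus --analyze")
    # Assumption: the default of 1 job is illustrative; the README does not state it.
    parser.add_argument("-j", dest="jobs", type=int, default=1,
                        help="number of experiments to run in parallel")
    parser.add_argument("--model", default=None,
                        help="baseline model for the main_effects design")
    return parser


def resolve_flags(argv: list[str]) -> argparse.Namespace:
    """Parse argv and expand --full-pipeline into --reeval and --analyze."""
    args = build_parser().parse_args(argv)
    if args.full_pipeline:
        args.reeval = True
        args.analyze = True
    return args
```

Under these assumptions, `resolve_flags(["grid.yaml", "smoke", "--full-pipeline"])` yields a namespace with both `reeval` and `analyze` set, matching the equivalence the README describes.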
diff --git a/README.md b/README.md
@@ -16,30 +16,59 @@ Define the variables that make up a coding loop (model, tools, prompt style, etc
### Running experiments
+The harness handles everything: run experiments, evaluate, analyze, commit, and push.
+
```bash
-# 1. Screen: which variables matter? (~53 cells, vary one axis at a time)
-python3 harness/run.py grid.yaml main_effects
+# Run a sweep (auto-analyzes and commits results when done)
+python3 harness/run.py grid.yaml main_effects -j 6
-# 2. Analyze: rank variables by effect size
-python3 harness/lib/experiment_design.py analyze results main_effects score
+# Run with a different baseline model
+python3 harness/run.py grid.yaml main_effects --model sonnet -j 6
-# 3. Deep dive: full factorial on the top variables that matter
-python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write"
+# Deep dive: full factorial on specific variables
+python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write" -j 6
+```
-# 4. Check for interactions between variables
-python3 harness/lib/experiment_design.py analyze results interactions model effort score
+### Pipeline flags
+
+```bash
+# Normal sweep: run -> evaluate -> analyze -> commit -> push
+python3 harness/run.py grid.yaml main_effects -j 6
+
+# Changed eval scripts? Re-evaluate ALL existing runs with latest code
+python3 harness/run.py grid.yaml main_effects -j 6 --reeval
+
+# Just re-evaluate + analyze, no new runs
+python3 harness/run.py grid.yaml smoke --reeval --analyze
+
+# Just analyze existing data
+python3 harness/run.py grid.yaml smoke --analyze
```
-### Other run modes
+What each flag does:
+- **No flags**: run experiments, evaluate new runs, analyze, commit, push
+- **`--reeval`**: re-evaluate ALL runs with current eval scripts (use when you changed tests)
+- **`--analyze`**: save main effects analysis to `results/analysis/`
+- **`--full-pipeline`**: `--reeval` + `--analyze`
+- **`-j N`**: run N experiments in parallel
+- **`--model MODEL`**: set baseline model for main_effects design
+
+### Profiles and designs
```bash
-# Profile-based (predefined subsets of the grid)
+# Profiles (predefined grid subsets)
python3 harness/run.py grid.yaml smoke # 6 cells, 1 run each
+python3 harness/run.py grid.yaml all-on # everything enabled, 3 runs
+python3 harness/run.py grid.yaml all-off # Bash only, 3 runs
python3 harness/run.py grid.yaml core # 30 cells, 3 runs each
-python3 harness/run.py grid.yaml full # 204,800 cells (don't)
-# Plackett-Burman screening (efficient multi-factor screening)
-python3 harness/run.py grid.yaml plackett_burman
+# DOE designs (statistically efficient sampling)
+python3 harness/run.py grid.yaml main_effects # 18 cells, vary one axis at a time
+python3 harness/run.py grid.yaml plackett_burman # Plackett-Burman screening
+
+# Manual analysis
+python3 harness/lib/experiment_design.py analyze results main_effects score
+python3 harness/lib/experiment_design.py analyze results interactions model effort score
```
### Building the dashboard