Document pipeline flags and workflow in README - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit 257d3cbc7f7401f8ec59a24f0b6378cbb1eff126
parent 0a40a42daa93a0cd9f1f7824fcd49ed4b7ae45a6
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Sat,  4 Apr 2026 10:47:44 +0200

Document pipeline flags and workflow in README

Covers: --reeval, --analyze, --full-pipeline, -j, --model flags.
Profiles (smoke, all-on, all-off, core) and DOE designs documented.
Pipeline auto-commits and pushes results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M README.md  | 55 ++++++++++++++++++++++++++++++++++++++++++-------------

1 file changed, 42 insertions(+), 13 deletions(-)
diff --git a/README.md b/README.md
@@ -16,30 +16,59 @@ Define the variables that make up a coding loop (model, tools, prompt style, etc
 
 ### Running experiments
 
+The harness handles everything: run experiments, evaluate, analyze, commit, and push.
+
 ```bash
-# 1. Screen: which variables matter? (~53 cells, vary one axis at a time)
-python3 harness/run.py grid.yaml main_effects
+# Run a sweep (auto-analyzes and commits results when done)
+python3 harness/run.py grid.yaml main_effects -j 6
 
-# 2. Analyze: rank variables by effect size
-python3 harness/lib/experiment_design.py analyze results main_effects score
+# Run with a different baseline model
+python3 harness/run.py grid.yaml main_effects --model sonnet -j 6
 
-# 3. Deep dive: full factorial on the top variables that matter
-python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write"
+# Deep dive: full factorial on specific variables
+python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write" -j 6
+```
 
-# 4. Check for interactions between variables
-python3 harness/lib/experiment_design.py analyze results interactions model effort score
+### Pipeline flags
+
+```bash
+# Normal sweep: run -> evaluate -> analyze -> commit -> push
+python3 harness/run.py grid.yaml main_effects -j 6
+
+# Changed eval scripts? Re-evaluate ALL existing runs with latest code
+python3 harness/run.py grid.yaml main_effects -j 6 --reeval
+
+# Just re-evaluate + analyze, no new runs
+python3 harness/run.py grid.yaml smoke --reeval --analyze
+
+# Just analyze existing data
+python3 harness/run.py grid.yaml smoke --analyze
 ```
 
-### Other run modes
+What each flag does:
+- **No flags**: run experiments, evaluate new runs, analyze, commit, push
+- **`--reeval`**: re-evaluate ALL runs with current eval scripts (use when you changed tests)
+- **`--analyze`**: save main effects analysis to `results/analysis/`
+- **`--full-pipeline`**: `--reeval` + `--analyze`
+- **`-j N`**: run N experiments in parallel
+- **`--model MODEL`**: set baseline model for main_effects design
+
+### Profiles and designs
 
 ```bash
-# Profile-based (predefined subsets of the grid)
+# Profiles (predefined grid subsets)
 python3 harness/run.py grid.yaml smoke          # 6 cells, 1 run each
+python3 harness/run.py grid.yaml all-on          # everything enabled, 3 runs
+python3 harness/run.py grid.yaml all-off         # Bash only, 3 runs
 python3 harness/run.py grid.yaml core           # 30 cells, 3 runs each
-python3 harness/run.py grid.yaml full           # 204,800 cells (don't)
 
-# Plackett-Burman screening (efficient multi-factor screening)
-python3 harness/run.py grid.yaml plackett_burman
+# DOE designs (statistically efficient sampling)
+python3 harness/run.py grid.yaml main_effects    # 18 cells, vary one axis at a time
+python3 harness/run.py grid.yaml plackett_burman # Plackett-Burman screening
+
+# Manual analysis
+python3 harness/lib/experiment_design.py analyze results main_effects score
+python3 harness/lib/experiment_design.py analyze results interactions model effort score
 ```
 
 ### Building the dashboard
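
Note on the flag list in this patch: the README states that `--full-pipeline` is equivalent to `--reeval` plus `--analyze`. A minimal sketch of how that expansion could be parsed is shown below. It is not taken from `harness/run.py`. The function names `build_parser` and `expand_flags`, the `jobs` attribute, and the default of one parallel job are all assumptions made for illustration. The sketch deliberately does not model whether new runs are launched, since the patch does not spell that out for every flag combination.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags documented in the README."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("grid", help="grid definition file, e.g. grid.yaml")
    parser.add_argument("design", help="profile or design name, e.g. main_effects")
    # Default of 1 is an assumption; the README only says "run N experiments in parallel".
    parser.add_argument("-j", dest="jobs", type=int, default=1)
    parser.add_argument("--model", default=None)
    parser.add_argument("--reeval", action="store_true")
    parser.add_argument("--analyze", action="store_true")
    parser.add_argument("--full-pipeline", action="store_true")
    return parser


def expand_flags(args: argparse.Namespace) -> argparse.Namespace:
    """Expand --full-pipeline into --reeval + --analyze, as the README describes."""
    if args.full_pipeline:
        args.reeval = True
        args.analyze = True
    return args


if __name__ == "__main__":
    demo = expand_flags(
        build_parser().parse_args(["grid.yaml", "smoke", "--full-pipeline"])
    )
    print(demo.reeval, demo.analyze)
```

Expanding the shorthand in one place means any later stage only has to check `reeval` and `analyze`, never `full_pipeline` itself.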

	git clone https://git.shiptheloop.com/loop-benchmarking.git