loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 257d3cbc7f7401f8ec59a24f0b6378cbb1eff126
parent 0a40a42daa93a0cd9f1f7824fcd49ed4b7ae45a6
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Sat,  4 Apr 2026 10:47:44 +0200

Document pipeline flags and workflow in README

Covers: --reeval, --analyze, --full-pipeline, -j, --model flags.
Profiles (smoke, all-on, all-off, core) and DOE designs documented.
Pipeline auto-commits and pushes results.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M README.md | 55 ++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
@@ -16,30 +16,59 @@ Define the variables that make up a coding loop (model, tools, prompt style, etc
 
 ### Running experiments
 
+The harness handles everything: run experiments, evaluate, analyze, commit, and push.
+
 ```bash
-# 1. Screen: which variables matter? (~53 cells, vary one axis at a time)
-python3 harness/run.py grid.yaml main_effects
+# Run a sweep (auto-analyzes and commits results when done)
+python3 harness/run.py grid.yaml main_effects -j 6
 
-# 2. Analyze: rank variables by effect size
-python3 harness/lib/experiment_design.py analyze results main_effects score
+# Run with a different baseline model
+python3 harness/run.py grid.yaml main_effects --model sonnet -j 6
 
-# 3. Deep dive: full factorial on the top variables that matter
-python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write"
+# Deep dive: full factorial on specific variables
+python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write" -j 6
+```
 
-# 4. Check for interactions between variables
-python3 harness/lib/experiment_design.py analyze results interactions model effort score
+### Pipeline flags
+
+```bash
+# Normal sweep: run -> evaluate -> analyze -> commit -> push
+python3 harness/run.py grid.yaml main_effects -j 6
+
+# Changed eval scripts? Re-evaluate ALL existing runs with latest code
+python3 harness/run.py grid.yaml main_effects -j 6 --reeval
+
+# Just re-evaluate + analyze, no new runs
+python3 harness/run.py grid.yaml smoke --reeval --analyze
+
+# Just analyze existing data
+python3 harness/run.py grid.yaml smoke --analyze
 ```
 
-### Other run modes
+What each flag does:
+- **No flags**: run experiments, evaluate new runs, analyze, commit, push
+- **`--reeval`**: re-evaluate ALL runs with current eval scripts (use when you changed tests)
+- **`--analyze`**: save main effects analysis to `results/analysis/`
+- **`--full-pipeline`**: `--reeval` + `--analyze`
+- **`-j N`**: run N experiments in parallel
+- **`--model MODEL`**: set baseline model for main_effects design
+
+### Profiles and designs
 
 ```bash
-# Profile-based (predefined subsets of the grid)
+# Profiles (predefined grid subsets)
 python3 harness/run.py grid.yaml smoke  # 6 cells, 1 run each
+python3 harness/run.py grid.yaml all-on  # everything enabled, 3 runs
+python3 harness/run.py grid.yaml all-off  # Bash only, 3 runs
 python3 harness/run.py grid.yaml core  # 30 cells, 3 runs each
-python3 harness/run.py grid.yaml full  # 204,800 cells (don't)
 
-# Plackett-Burman screening (efficient multi-factor screening)
-python3 harness/run.py grid.yaml plackett_burman
+# DOE designs (statistically efficient sampling)
+python3 harness/run.py grid.yaml main_effects  # 18 cells, vary one axis at a time
+python3 harness/run.py grid.yaml plackett_burman  # Plackett-Burman screening
+
+# Manual analysis
+python3 harness/lib/experiment_design.py analyze results main_effects score
+python3 harness/lib/experiment_design.py analyze results interactions model effort score
 ```
 
 ### Building the dashboard
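
The README text in this diff contrasts two ways of sampling the grid: `main_effects` varies one axis at a time around a baseline, while `interaction_hunt:...` runs a full factorial over the named axes. The sketch below illustrates that difference in cell counts. It is not the harness's implementation: the `GRID` contents, the level names, and both function names are hypothetical, and the real `grid.yaml` schema is not shown in this commit.

```python
from itertools import product

# Hypothetical grid: axis names echo the README, the levels are invented.
GRID = {
    "model": ["opus", "sonnet"],
    "effort": ["low", "medium", "high"],
    "tool_write": [True, False],
}


def main_effects_cells(grid, baseline=None):
    """One-factor-at-a-time design: the baseline cell, plus one cell
    for every non-baseline level of every axis."""
    # Assumption: the first level of each axis is the default baseline.
    base = {axis: levels[0] for axis, levels in grid.items()}
    base.update(baseline or {})  # e.g. what a --model override might change
    cells = [dict(base)]
    for axis, levels in grid.items():
        for level in levels:
            if level != base[axis]:
                cells.append({**base, axis: level})
    return cells


def full_factorial_cells(grid, axes):
    """Every combination of the chosen axes; all other axes stay at
    their first level."""
    base = {axis: levels[0] for axis, levels in grid.items()}
    combos = product(*(grid[axis] for axis in axes))
    return [{**base, **dict(zip(axes, combo))} for combo in combos]


if __name__ == "__main__":
    print(len(main_effects_cells(GRID)))                       # 5
    print(len(full_factorial_cells(GRID, ["model", "effort"])))  # 6
    print(len(full_factorial_cells(GRID, list(GRID))))          # 12
```

For a grid with levels n_1, ..., n_k, the one-at-a-time design needs 1 + sum(n_i - 1) cells while the full factorial needs the product of the n_i, which is why screening first and running a factorial only on the axes that matter keeps the run count manageable.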
