# Loop Benchmarking

An open benchmark for comparing agentic coding loop configurations. Same task, different setups, all data public.

## What this does

Define the variables that make up a coding loop (model, tools, prompt style, etc.), and the system generates every permutation. Each is run against a set of tasks in a clean-room environment with deterministic evaluation. No LLM grading.

## Quick start

### Prerequisites

- Node.js 22+
- Python 3.12+ with PyYAML
- Claude Code CLI (authenticated via `claude login`)

### Running experiments

The harness handles everything: run experiments, evaluate, analyze, commit, and push.

```bash
# Run a sweep (auto-analyzes and commits results when done)
python3 harness/run.py grid.yaml main_effects -j 6

# Run with a different baseline model
python3 harness/run.py grid.yaml main_effects --model sonnet -j 6

# Deep dive: full factorial on specific variables
python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write" -j 6
```

### Pipeline flags

```bash
# Normal sweep: run -> evaluate -> analyze -> commit -> push
python3 harness/run.py grid.yaml main_effects -j 6

# Changed eval scripts? Re-evaluate ALL existing runs with latest code
python3 harness/run.py grid.yaml main_effects -j 6 --reeval

# Just re-evaluate + analyze, no new runs
python3 harness/run.py grid.yaml smoke --reeval --analyze

# Just analyze existing data
python3 harness/run.py grid.yaml smoke --analyze
```

What each flag does:
- **No flags**: run experiments, evaluate new runs, analyze, commit, push
- **`--reeval`**: re-evaluate ALL runs with current eval scripts (use when you changed tests)
- **`--analyze`**: save main effects analysis to `results/analysis/`
- **`--full-pipeline`**: `--reeval` + `--analyze`
- **`-j N`**: run N experiments in parallel
- **`--model MODEL`**: set baseline model for main_effects design

### Profiles and designs

```bash
# Profiles (predefined grid subsets)
python3 harness/run.py grid.yaml smoke            # 6 cells, 1 run each
python3 harness/run.py grid.yaml all-on           # everything enabled, 3 runs
python3 harness/run.py grid.yaml all-off          # Bash only, 3 runs
python3 harness/run.py grid.yaml core             # 30 cells, 3 runs each

# DOE designs (statistically efficient sampling)
python3 harness/run.py grid.yaml main_effects     # 18 cells, vary one axis at a time
python3 harness/run.py grid.yaml plackett_burman  # Plackett-Burman screening

# Manual analysis
python3 harness/lib/experiment_design.py analyze results main_effects score
python3 harness/lib/experiment_design.py analyze results interactions model effort score
```

### Building the dashboard

```bash
cd dashboard
npm install
npm run build   # Static site in dashboard/dist/
npm run dev     # Dev server for local preview
```

## Project structure

```
grid.yaml                   # Experiment grid: axes, values, exclusions, profiles
harness/
  run.py                    # Main orchestrator (Python)
  lib/
    compute_grid.py         # Cartesian product + exclusions
    experiment_design.py    # DOE plans + analysis (main effects, PB, interactions)
    get-oauth-token.sh      # Extracts OAuth token for --bare mode
    invoke.sh               # Claude CLI invocation (bash, used by run.sh)
    evaluate.sh             # Evaluation dispatch (bash, used by run.sh)
    workspace.sh            # Workspace creation (bash, used by run.sh)
tasks/
  tetris/                   # Agent-friendly: build a game
  bookmarks-api/            # Medium: REST API with auth
  data-pipeline/            # Hard: CSV processing with edge cases
  Each task has:
    prompts/                # simple/detailed x en/es
    eval/                   # Deterministic test suites the agent never sees
    context.md              # Rules file (used when context_file=provided)
    scoring.yaml            # Category weights
results/
  runs/{run_id}/            # One directory per experiment run
    meta.json               # Config, timing, exit code
    transcript.jsonl        # Full conversation (every tool call and response)
    claude_output.json      # Summary metrics (cost, turns, tokens)
    eval_results.json       # Structural, functional, quality scores
    workspace.tar.gz        # Archived agent output
dashboard/                  # Astro + React static site
  Grid overview, insights (tornado charts, heatmaps), run detail with transcript viewer
```

## Configuration dimensions (16 axes)

| Axis | Values |
|---|---|
| model | haiku, sonnet, opus |
| effort | high, max (extended thinking) |
| prompt_style | simple, detailed |
| language | typescript, javascript |
| human_language | en, es |
| tool_read | on, off |
| tool_write | on, off |
| tool_edit | on, off |
| tool_glob | on, off |
| tool_grep | on, off |
| linter | on, off |
| playwright | on, off |
| context_file | none, provided |
| sub_agents | on, off |
| web_search | on, off |
| max_budget | low ($0.50), high ($5.00) |

## Evaluation

All scoring is deterministic code. The agent never sees the test suite.

- **Structural**: Does it build? Do expected files exist?
- **Functional**: Pre-written test suites (Playwright, vitest, golden file diff)
- **Quality**: Lint, type check, accessibility, security, performance

## Experiment design

Instead of running the full 204,800-cell grid, use statistical designs:

- **Main effects sweep**: Vary one axis at a time from a baseline. Identifies which variables matter.
- **Plackett-Burman**: Screening design that tests many binary factors efficiently.
- **Interaction hunt**: Full factorial on a small subset of axes to find interactions.

The dashboard's Insights page visualizes main effects as tornado charts and interactions as heatmaps.

## Metrics

All analyses can target different metrics. Switch between them in the dashboard or via CLI:

```bash
# Which variables most affect quality?
python3 harness/lib/experiment_design.py analyze results main_effects score

# Which variables most affect cost?
python3 harness/lib/experiment_design.py analyze results main_effects cost

# Which variables most affect speed?
python3 harness/lib/experiment_design.py analyze results main_effects wall_time

# Which variables most affect iteration count?
python3 harness/lib/experiment_design.py analyze results main_effects turns
```

Available metrics: `score`, `cost`, `turns`, `wall_time`, `pass_rate`.

These metrics often conflict. A config that maximizes score may also maximize cost. A future addition is Pareto frontier analysis to identify configurations that are not dominated on any metric (e.g., "highest score at each cost level"). This would let you answer questions like "what's the cheapest config that still passes?" or "is Opus worth 5x the cost of Haiku for this task?"
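
Since Pareto analysis is not part of the harness yet, the following is only a minimal sketch of how it might work for two metrics, `score` (higher is better) and `cost` (lower is better). The `RunSummary` class, the `dominates` and `pareto_frontier` helpers, the config labels, and all numbers are hypothetical and made up for illustration; none of them come from the repository or from real benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-config summary; field names are illustrative only."""
    config: str
    score: float  # higher is better
    cost: float   # USD, lower is better


def dominates(a: RunSummary, b: RunSummary) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(runs: list[RunSummary]) -> list[RunSummary]:
    """Keep the configs that no other config dominates, cheapest first."""
    frontier = [r for r in runs if not any(dominates(o, r) for o in runs)]
    return sorted(frontier, key=lambda r: r.cost)


# Made-up numbers for illustration, not benchmark results.
example = [
    RunSummary("haiku-simple", score=0.62, cost=0.40),
    RunSummary("haiku-detailed", score=0.60, cost=0.55),
    RunSummary("sonnet-simple", score=0.81, cost=1.20),
    RunSummary("opus-simple", score=0.80, cost=3.90),
    RunSummary("opus-detailed", score=0.83, cost=4.10),
]

frontier = pareto_frontier(example)
for run in frontier:
    print(f"{run.config}: score={run.score:.2f} cost=${run.cost:.2f}")
```

With this made-up data, `haiku-detailed` and `opus-simple` drop out because a cheaper config scores at least as well. The pairwise check is quadratic in the number of configs, which is likely acceptable for sweeps of a few dozen cells; a larger grid might call for sorting by cost first and scanning once.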