loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

# Loop Benchmarking

An open benchmark for comparing agentic coding loop configurations. Same task, different setups, all data public.

## What this does

Define the variables that make up a coding loop (model, tools, prompt style, etc.), and the system generates every combination of their values. Each combination is run against a set of tasks in a clean-room environment and scored by deterministic evaluation. No LLM grading.
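
As a rough illustration of "every combination", here is a minimal sketch of a Cartesian product with exclusions. The three axes, the exclusion rule, and the function names are assumptions made for this example; the real grid and its exclusions are defined in `grid.yaml`, and this is not the code in `harness/lib/compute_grid.py`.

```python
from itertools import product

# Hypothetical subset of axes for illustration; the real axes live in grid.yaml.
axes = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "tool_write": ["on", "off"],
}

# Hypothetical exclusion rule: axis values that must not occur together.
exclusions = [{"model": "haiku", "effort": "max"}]


def is_excluded(cell, rules):
    """True if the cell matches every key/value pair of at least one rule."""
    return any(all(cell[k] == v for k, v in rule.items()) for rule in rules)


def compute_cells(axes, rules):
    """Build the full Cartesian product, then drop excluded cells."""
    names = list(axes)
    cells = (dict(zip(names, combo)) for combo in product(*axes.values()))
    return [cell for cell in cells if not is_excluded(cell, rules)]


cells = compute_cells(axes, exclusions)
print(len(cells))  # 3 * 2 * 2 = 12 combinations, minus 2 excluded = 10
```

The same product, restricted to a handful of axes, is what a full factorial over a small subset amounts to.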

## Quick start

### Prerequisites

- Node.js 22+
- Python 3.12+ with PyYAML
- Claude Code CLI (authenticated via `claude login`)

### Running experiments

The harness handles everything: run experiments, evaluate, analyze, commit, and push.

```bash
# Run a sweep (auto-analyzes and commits results when done)
python3 harness/run.py grid.yaml main_effects -j 6

# Run with a different baseline model
python3 harness/run.py grid.yaml main_effects --model sonnet -j 6

# Deep dive: full factorial on specific variables
python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write" -j 6
```

### Pipeline flags

```bash
# Normal sweep: run -> evaluate -> analyze -> commit -> push
python3 harness/run.py grid.yaml main_effects -j 6

# Changed eval scripts? Re-evaluate ALL existing runs with latest code
python3 harness/run.py grid.yaml main_effects -j 6 --reeval

# Just re-evaluate + analyze, no new runs
python3 harness/run.py grid.yaml smoke --reeval --analyze

# Just analyze existing data
python3 harness/run.py grid.yaml smoke --analyze
```

What each flag does:
- **No flags**: run experiments, evaluate new runs, analyze, commit, push
- **`--reeval`**: re-evaluate ALL runs with the current eval scripts (use after changing tests)
- **`--analyze`**: save the main effects analysis to `results/analysis/`
- **`--full-pipeline`**: shorthand for `--reeval` plus `--analyze`
- **`-j N`**: run N experiments in parallel
- **`--model MODEL`**: set the baseline model for the `main_effects` design

### Profiles and designs

```bash
# Profiles (predefined grid subsets)
python3 harness/run.py grid.yaml smoke            # 6 cells, 1 run each
python3 harness/run.py grid.yaml all-on           # everything enabled, 3 runs
python3 harness/run.py grid.yaml all-off          # Bash only, 3 runs
python3 harness/run.py grid.yaml core             # 30 cells, 3 runs each

# DOE designs (statistically efficient sampling)
python3 harness/run.py grid.yaml main_effects     # 18 cells, vary one axis at a time
python3 harness/run.py grid.yaml plackett_burman  # Plackett-Burman screening

# Manual analysis
python3 harness/lib/experiment_design.py analyze results main_effects score
python3 harness/lib/experiment_design.py analyze results interactions model effort score
```

### Building the dashboard

```bash
cd dashboard
npm install
npm run build        # Static site in dashboard/dist/
npm run dev          # Dev server for local preview
```

## Project structure

```
grid.yaml                    # Experiment grid: axes, values, exclusions, profiles
harness/
  run.py                     # Main orchestrator (Python)
  lib/
    compute_grid.py          # Cartesian product + exclusions
    experiment_design.py     # DOE plans + analysis (main effects, PB, interactions)
    get-oauth-token.sh       # Extracts OAuth token for --bare mode
    invoke.sh                # Claude CLI invocation (bash, used by run.sh)
    evaluate.sh              # Evaluation dispatch (bash, used by run.sh)
    workspace.sh             # Workspace creation (bash, used by run.sh)
tasks/
  tetris/                    # Agent-friendly: build a game
  bookmarks-api/             # Medium: REST API with auth
  data-pipeline/             # Hard: CSV processing with edge cases
  Each task has:
    prompts/                 # simple/detailed x en/es
    eval/                    # Deterministic test suites the agent never sees
    context.md               # Rules file (used when context_file=provided)
    scoring.yaml             # Category weights
results/
  runs/{run_id}/             # One directory per experiment run
    meta.json                # Config, timing, exit code
    transcript.jsonl         # Full conversation (every tool call and response)
    claude_output.json       # Summary metrics (cost, turns, tokens)
    eval_results.json        # Structural, functional, quality scores
    workspace.tar.gz         # Archived agent output
dashboard/                   # Astro + React static site
  Grid overview, insights (tumor charts, heatmaps), run detail with transcript viewer
```
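
Because every run gets its own directory, simple checks over `results/runs/` are easy to script. The sketch below lists runs that have no `eval_results.json` yet. Only the directory layout and file name come from the tree above; the function name and the throwaway demo directory are assumptions for this example, and the harness may track pending evaluations differently.

```python
import tempfile
from pathlib import Path


def runs_missing_eval(results_dir):
    """Return sorted run IDs under <results_dir>/runs that lack eval_results.json."""
    runs_root = Path(results_dir) / "runs"
    if not runs_root.is_dir():
        return []
    return sorted(
        run_dir.name
        for run_dir in runs_root.iterdir()
        if run_dir.is_dir() and not (run_dir / "eval_results.json").is_file()
    )


# Demo against a throwaway directory with two made-up run IDs.
with tempfile.TemporaryDirectory() as tmp:
    for run_id in ("run-a", "run-b"):
        (Path(tmp) / "runs" / run_id).mkdir(parents=True)
    (Path(tmp) / "runs" / "run-a" / "eval_results.json").write_text("{}")
    missing = runs_missing_eval(tmp)

print(missing)  # only run-b has no evaluation yet
```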

## Configuration dimensions (16 axes)

| Axis | Values |
|---|---|
| model | haiku, sonnet, opus |
| effort | high, max (extended thinking) |
| prompt_style | simple, detailed |
| language | typescript, javascript |
| human_language | en, es |
| tool_read | on, off |
| tool_write | on, off |
| tool_edit | on, off |
| tool_glob | on, off |
| tool_grep | on, off |
| linter | on, off |
| playwright | on, off |
| context_file | none, provided |
| sub_agents | on, off |
| web_search | on, off |
| max_budget | low ($0.50), high ($5.00) |
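
To see how the 18 cells of the `main_effects` design follow from this table, here is a minimal sketch of a one-axis-at-a-time sweep: one baseline cell, plus one cell for each non-baseline value of each axis. Treating the first listed value of every axis as the baseline is an assumption made for this example, as is the function name; the harness picks its own baseline (and `--model` overrides the model).

```python
# Axis values copied from the table above.
AXES = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "prompt_style": ["simple", "detailed"],
    "language": ["typescript", "javascript"],
    "human_language": ["en", "es"],
    "tool_read": ["on", "off"],
    "tool_write": ["on", "off"],
    "tool_edit": ["on", "off"],
    "tool_glob": ["on", "off"],
    "tool_grep": ["on", "off"],
    "linter": ["on", "off"],
    "playwright": ["on", "off"],
    "context_file": ["none", "provided"],
    "sub_agents": ["on", "off"],
    "web_search": ["on", "off"],
    "max_budget": ["low", "high"],
}


def main_effects_cells(axes, baseline=None):
    """Baseline cell plus one cell per non-baseline value of each axis."""
    # Assumption: default baseline is the first listed value of every axis.
    base = {name: values[0] for name, values in axes.items()}
    base.update(baseline or {})
    cells = [dict(base)]
    for name, values in axes.items():
        for value in values:
            if value != base[name]:
                cells.append({**base, name: value})
    return cells


cells = main_effects_cells(AXES)
print(len(AXES), len(cells))  # 16 axes -> 1 baseline + 2 (model) + 15 (binary) = 18 cells
```

Changing the baseline, for example `main_effects_cells(AXES, {"model": "sonnet"})`, keeps the cell count at 18 and only changes which cells are compared against.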

## Evaluation

All scoring is deterministic code. The agent never sees the test suite.

- **Structural**: Does it build? Do expected files exist?
- **Functional**: Pre-written test suites (Playwright, vitest, golden file diff)
- **Quality**: Lint, type check, accessibility, security, performance
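
Each task's `scoring.yaml` holds category weights. One plausible way to fold the three category scores into a single number is a weighted average, sketched below. The weights, the 0 to 1 score range, and the function name are assumptions for illustration; the actual `scoring.yaml` schema and aggregation rule may differ.

```python
def weighted_score(category_scores, weights):
    """Weighted average of category scores; categories without a score count as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(
        weight * category_scores.get(name, 0.0) for name, weight in weights.items()
    )
    return weighted_sum / total_weight


# Hypothetical weights and scores, not taken from any real task.
weights = {"structural": 1, "functional": 3, "quality": 1}
scores = {"structural": 1.0, "functional": 0.5, "quality": 0.75}

overall = weighted_score(scores, weights)
print(round(overall, 2))  # (1*1.0 + 3*0.5 + 1*0.75) / 5 = 0.65
```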

## Experiment design

Instead of running the full 204,800-cell grid, use statistical designs:

- **Main effects sweep**: Vary one axis at a time from a baseline. Identifies which variables matter.
- **Plackett-Burman**: Screening design that tests many binary factors efficiently.
- **Interaction hunt**: Full factorial on a small subset of axes to find interactions.
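
For intuition about why Plackett-Burman screening is efficient, here is a minimal sketch of the classic 12-run design, which screens up to 11 two-level factors. It is built from the standard generator row by cyclic shifts plus one all-low run. The choice of the 12-run size, the four factor names picked from the axes table, and the mapping of +1/-1 to `on`/`off` are assumptions for this example; the design produced by `experiment_design.py` may use a different size or construction.

```python
# Standard 12-run Plackett-Burman generator row (+1 = high level, -1 = low level).
GENERATOR_12 = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def plackett_burman_12(n_factors):
    """Return 12 runs x n_factors columns of +1/-1 levels (1 <= n_factors <= 11)."""
    if not 1 <= n_factors <= 11:
        raise ValueError("a 12-run design screens between 1 and 11 factors")
    k = len(GENERATOR_12)
    # Rows 1..11 are cyclic shifts of the generator; row 12 sets every factor low.
    rows = [[GENERATOR_12[(col - shift) % k] for col in range(k)] for shift in range(k)]
    rows.append([-1] * k)
    return [row[:n_factors] for row in rows]


# Hypothetical selection of binary axes to screen.
factors = ["tool_read", "tool_write", "tool_edit", "linter"]
design = plackett_burman_12(len(factors))
runs = [
    {name: ("on" if level > 0 else "off") for name, level in zip(factors, row)}
    for row in design
]
for run in runs:
    print(run)
```

Every column is balanced (six high, six low) and every pair of columns is orthogonal, so each factor's main effect can be estimated from 12 runs instead of the 16 a full factorial over these four factors would need, and the gap widens quickly as factors are added.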

The dashboard's Insights page visualizes main effects as tornado charts and interactions as heatmaps.

## Metrics

All analyses can target different metrics. Switch between them in the dashboard or via CLI:

```bash
# Which variables most affect the overall score?
python3 harness/lib/experiment_design.py analyze results main_effects score

# Which variables most affect cost?
python3 harness/lib/experiment_design.py analyze results main_effects cost

# Which variables most affect speed?
python3 harness/lib/experiment_design.py analyze results main_effects wall_time

# Which variables most affect iteration count?
python3 harness/lib/experiment_design.py analyze results main_effects turns
```

Available metrics: `score`, `cost`, `turns`, `wall_time`, `pass_rate`.
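
As a sketch of what a main effects analysis computes for a chosen metric: group runs by each axis level, average the metric per level, and rank axes by the spread between their best and worst level. The record layout (`config` plus metric keys), the function name, and the four sample runs are invented for this example; they do not reflect the harness's actual data format or the output of `experiment_design.py`.

```python
from collections import defaultdict
from statistics import mean


def main_effects(runs, axes, metric):
    """Per axis: mean of the metric at each level, and the range across levels."""
    effects = {}
    for axis in axes:
        by_level = defaultdict(list)
        for run in runs:
            by_level[run["config"][axis]].append(run[metric])
        level_means = {level: mean(values) for level, values in by_level.items()}
        effects[axis] = {
            "level_means": level_means,
            "range": max(level_means.values()) - min(level_means.values()),
        }
    return effects


# Invented sample data, for illustration only.
runs = [
    {"config": {"model": "haiku", "linter": "on"}, "score": 0.5, "cost": 0.25},
    {"config": {"model": "haiku", "linter": "off"}, "score": 0.25, "cost": 0.25},
    {"config": {"model": "opus", "linter": "on"}, "score": 1.0, "cost": 2.0},
    {"config": {"model": "opus", "linter": "off"}, "score": 0.75, "cost": 2.0},
]

effects = main_effects(runs, ["model", "linter"], "score")
ranked = sorted(effects, key=lambda axis: effects[axis]["range"], reverse=True)
print(ranked)  # axes ordered from largest to smallest effect on the metric
```

Passing `"cost"` instead of `"score"` reuses the same grouping for a different metric, which is the kind of switch the CLI commands above make.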

These metrics often conflict: a config that maximizes score may also maximize cost. A planned addition is Pareto frontier analysis, which identifies the configurations that are not dominated by any other, meaning no other configuration is at least as good on every metric and strictly better on one (e.g., "highest score at each cost level"). This would let you answer questions like "what's the cheapest config that still passes?" or "is Opus worth 5x the cost of Haiku for this task?"
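
Since Pareto analysis is not implemented yet, the following is only a sketch of the idea on two metrics, with score treated as higher-is-better and cost as lower-is-better. The config names, numbers, and function names are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on both metrics and strictly better on one."""
    at_least_as_good = a["score"] >= b["score"] and a["cost"] <= b["cost"]
    strictly_better = a["score"] > b["score"] or a["cost"] < b["cost"]
    return at_least_as_good and strictly_better


def pareto_frontier(configs):
    """Keep only the configs that no other config dominates."""
    return [c for c in configs if not any(dominates(other, c) for other in configs)]


# Invented sample data, for illustration only.
configs = [
    {"name": "A", "score": 0.60, "cost": 0.20},
    {"name": "B", "score": 0.80, "cost": 1.00},
    {"name": "C", "score": 0.70, "cost": 1.50},  # dominated by B: lower score, higher cost
    {"name": "D", "score": 0.95, "cost": 4.00},
]

frontier = pareto_frontier(configs)
print([c["name"] for c in frontier])  # ['A', 'B', 'D']

# "Cheapest config that still passes", with a hypothetical pass threshold of 0.75.
cheapest_passing = min((c for c in frontier if c["score"] >= 0.75), key=lambda c: c["cost"])
print(cheapest_passing["name"])  # B
```

The pairwise check is quadratic in the number of configs, which is fine for a few hundred cells; a sort-based sweep would be the usual choice for larger sets.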
