CLAUDE.md - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

# Loop Benchmarking

## What This Is

An open benchmark for comparing agentic coding loop configurations. Define the variables that make up a loop setup, and the system generates every combination automatically. Each combination is run against each task multiple times (default 3) in a clean-room environment. All data is public.

## The Research Grid

The grid is a cartesian product of configuration variables. You define the axes and their possible values; the harness explodes them into the full experiment matrix.

23 axes (see `grid.yaml` for full definition):
- **Model**: Haiku, Sonnet, Opus (Anthropic) / glm-4.5-air, glm-4.7, glm-5.1 (Z.AI)
- **Provider**: anthropic, zai (controls API endpoint)
- **Effort**: high, max (extended thinking)
- **Prompt style**: simple, detailed
- **Programming language**: TypeScript, JavaScript, unspecified
- **Human language**: English, Spanish
- **Individual tools**: Read, Write, Edit, Glob, Grep (each on/off)
- **Tooling**: Playwright (off/available/instructed), linter (eslint) on/off
- **Context**: rules file provided or not
- **Web search**: enabled/disabled
- **Budget**: low ($2.00), high ($10.00)
- **Strategy**: none, plan_first, iterate, creative_validate, use_subagents, delegate, review, compete (deferred), split_work
- **Tests provided**: none, a_few (4 tests), many (16 tests)
- **Design guidance**: none, vague, specific
- **Architecture**: none, separation, best_practices
- **Error checking**: none, self_verify
- **Context noise**: clean, wikipedia_25/50/75, lorem_25/50/75
- **Renderer**: none, canvas, svg, dom, webgl

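The expansion described above can be sketched with `itertools.product`. This is a minimal illustration, not the harness's actual code: the `axes` dict is a hypothetical three-axis subset written inline, whereas the real definition lives in `grid.yaml`, whose schema is not shown here.

```python
from itertools import product

# Hypothetical subset of axes; the real grid is defined in grid.yaml.
axes = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "prompt_style": ["simple", "detailed"],
}

def expand_grid(axes):
    """Return one dict per cell in the cartesian product of all axis values."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]

cells = expand_grid(axes)
print(len(cells))  # 3 * 2 * 2 = 12 cells
print(cells[0])
```

With all 23 axes the full product runs into the millions of cells, which is presumably why the harness also offers the DOE designs listed under Test Harness.
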
## Test Harness

- Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
- `--provider` flag required (anthropic or zai) to prevent cross-provider contamination
- `--model` accepts actual model names (glm-4.5-air, not haiku) for non-anthropic providers
- `-n N` / `--max-runs N` limits total runs
- `--runs-per-cell N` overrides runs_per_cell from grid.yaml (use 1 for broad coverage)
- `--commit-every N` analyzes and pushes results every N completed runs
- OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
- Z.AI auth via `ZAI_API_KEY` env var, sets `ANTHROPIC_BASE_URL` per-subprocess (isolated)
- Auth keepalive: background process pings claude every 5 min to refresh token
- DOE designs: main_effects, plackett_burman, interaction_hunt
- Modular prompt builder: PROMPT_SNIPPETS dict with EN+ES translations per axis
- Re-eval existing runs: `python3 harness/reeval.py -j 4`
- Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
- Auto-extracts workspace artifacts for dashboard iframe preview
- Auto-commits and pushes results after sweep completes
- Invalid run detection: auto-discards 0-turn runs, skips completed runs on resume
- Quick analysis: `python3 harness/analyze-and-push.py` (no re-eval, seconds)

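The invalid-run and resume rules in the list above could look roughly like this. The record fields (`cell_id`, `run_index`, `num_turns`) and the `plan_runs` helper are hypothetical names chosen for the sketch; the harness's real result schema is not documented here.

```python
def plan_runs(cells, runs_per_cell, existing):
    """Return (cell_id, run_index) pairs that still need to run.

    `existing` is a list of dicts with hypothetical keys:
    cell_id, run_index, num_turns.
    """
    # A run with zero agent turns is treated as invalid and discarded.
    valid = {(r["cell_id"], r["run_index"]) for r in existing if r["num_turns"] > 0}
    todo = []
    for cell_id in cells:
        for run_index in range(1, runs_per_cell + 1):
            if (cell_id, run_index) not in valid:
                todo.append((cell_id, run_index))
    return todo

existing = [
    {"cell_id": "a", "run_index": 1, "num_turns": 12},
    {"cell_id": "a", "run_index": 2, "num_turns": 0},  # invalid, will be rerun
]
print(plan_runs(["a", "b"], runs_per_cell=2, existing=existing))
# [('a', 2), ('b', 1), ('b', 2)]
```

Under these assumptions a backfill needs no special mode: rerunning with a higher runs-per-cell value simply schedules whichever runs are missing.
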
### Sweep workflow
```bash
# Broad coverage (1 run per cell, explore the variable space)
python3 harness/run.py grid.yaml main_effects --provider anthropic --model haiku -j 4 --runs-per-cell 1 --commit-every 20

# Backfill (add runs 2+3 to existing cells for variance measurement)
python3 harness/run.py grid.yaml main_effects --provider anthropic --model haiku -j 4 --commit-every 20

# Z.AI models
export ZAI_API_KEY="..."
python3 harness/run.py grid.yaml main_effects --provider zai --model glm-4.5-air -j 2 --runs-per-cell 1 --commit-every 20
```
- Serve process cleanup: Popen with start_new_session + os.killpg prevents orphaned HTTP servers

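The serve-process cleanup noted above relies on two standard-library features: `subprocess.Popen(..., start_new_session=True)` puts the child in its own session and process group, and `os.killpg` signals that whole group. A minimal POSIX-only sketch follows, using a sleeping Python child as a stand-in for the HTTP server; the harness's actual serve command is not shown here, and `start_server` / `stop_server` are hypothetical helper names.

```python
import os
import signal
import subprocess
import sys

def start_server(cmd):
    """Start cmd in its own session so its whole process group can be killed."""
    return subprocess.Popen(cmd, start_new_session=True)

def stop_server(proc, timeout=5.0):
    """Terminate the child's process group, escalating to SIGKILL if needed."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()

# Stand-in for a long-running server process.
proc = start_server([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_server(proc)
print(exit_code)  # negative signal number on POSIX
```

Signalling the group rather than the single PID matters when the serve command is a wrapper (a shell or package runner) that forks the real server: killing only the wrapper would leave the server holding its port.
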
## Scoring (Input/Output/Outcome Framework)

All evaluation is deterministic code. No LLM grading.

**Inputs** (experiment variables - the grid axes):
- Model, provider, effort, language, tools, prompt style, strategy, etc.

**Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
- **Gameplay bot** (50%): two-phase (mechanics + play-to-win), 60 pieces / 45s, 60ms polling, 16 tests
- **SonarQube** (50%): cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings

**Outputs** (tracked and displayed, but not in headline score):
- **Build quality**: lint, type check, bundle size
- **Structural**: entry point exists, build succeeds
- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
- **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing

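As a worked example of the 50/50 weighting above, assume both outcome components are already normalized to a 0 to 100 scale. That normalization is an assumption of this sketch; the actual rules are defined in `tasks/tetris/scoring.yaml`, and the component scores below are made up.

```python
# Weights taken from the Outcomes list above; component scores are made up.
WEIGHTS = {"gameplay_bot": 0.5, "sonarqube": 0.5}

def headline_score(components):
    """Weighted mean of outcome components, each assumed to be in [0, 100]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

print(headline_score({"gameplay_bot": 80.0, "sonarqube": 60.0}))  # 70.0
```

Output metrics such as lint or bundle size are deliberately absent from `WEIGHTS`, mirroring the rule that they are displayed but do not affect the headline score.
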
## Eval Pipeline (per run, in order)

1. `structural.sh` - entry point, build, TS compilation
2. `quality.sh` - ESLint, typecheck, bundle size
3. `code-analysis.py` - 14 code quality metrics
4. `transcript-analysis.py` - agent behavior from conversation log
5. `gameplay-bot/` - two-phase: mechanics test then play-to-win. Uses Pierre Dellacherie's 4-heuristic Tetris AI (2003) with Colin Fahey's GA-optimized weights. Reference implementations: LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI (both MIT)
6. `sonarqube-scan.py` - automated code quality scan (requires SonarQube at localhost:9000)

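To make the play-to-win phase more concrete, here is a sketch of the four board features used by the LeeYiyuan/tetrisai reference implementation named above: aggregate height, complete lines, holes, and bumpiness. Whether the benchmark's bot uses exactly these features is an assumption, and the weights below are illustrative placeholders, not the GA-optimized values.

```python
def board_features(grid):
    """grid: list of rows (top to bottom), each a list of 0/1 cells."""
    rows, cols = len(grid), len(grid[0])
    heights = []
    holes = 0
    for c in range(cols):
        column = [grid[r][c] for r in range(rows)]
        # Height = distance from the topmost filled cell to the floor.
        top = next((r for r, cell in enumerate(column) if cell), rows)
        heights.append(rows - top)
        # Holes = empty cells below the topmost filled cell of the column.
        holes += sum(1 for cell in column[top:] if not cell)
    return {
        "aggregate_height": sum(heights),
        "complete_lines": sum(1 for row in grid if all(row)),
        "holes": holes,
        "bumpiness": sum(abs(a - b) for a, b in zip(heights, heights[1:])),
    }

def evaluate(grid, weights):
    """Linear score of a board; higher is better under the given weights."""
    feats = board_features(grid)
    return sum(weights[name] * value for name, value in feats.items())

# Placeholder weights: penalize height, holes, bumpiness; reward full lines.
PLACEHOLDER_WEIGHTS = {
    "aggregate_height": -0.5, "complete_lines": 0.75,
    "holes": -0.35, "bumpiness": -0.2,
}

grid = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
print(board_features(grid))
print(evaluate(grid, PLACEHOLDER_WEIGHTS))
```

A bot built this way would typically score every legal placement of the current piece with `evaluate` and play the best one. Reading `grid` out of an arbitrary agent-built game is the harder part and is outside this sketch.
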
## Dashboard

Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.

Pages:
- **Grid** (`/`): per-task summary, box plots for score distribution, filterable cell/run table with sorting
- **Insights** (`/insights`): convex hull scatter plots (4 density levels) with model toggles, variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
- **Surprises** (`/surprises`): aggregate surprise stats, breakdown by type (model upsets, prompt upsets, individual outliers), grouped surprise cards with run links
- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
- **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
- **PCA** (`/pca`): principal component analysis scatter plot (PC1/PC2/PC3 selectable axes), model-colored points sized by score, loadings interpretation tables, variance explained bars
- **Run detail** (`/run/{id}` or `/r/{short_id}`): outcome/output score separation, all config pills, SonarQube detail card, 6 detail cards, transcript viewer, artifact iframe, link to cell
- **Cell detail** (`/cell/{id}` or `/c/{short_id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
- **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations

Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages.

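A short ID of this kind can be derived with `hashlib`. The sketch assumes the ID is the first 8 hex characters of the SHA-256 digest of the full run or cell ID; which input string the dashboard actually hashes, and whether it truncates the hex digest this way, is not stated here. `short_id` is a hypothetical helper name.

```python
import hashlib

def short_id(full_id: str, length: int = 8) -> str:
    """First `length` hex characters of the SHA-256 digest of full_id."""
    return hashlib.sha256(full_id.encode("utf-8")).hexdigest()[:length]

print(short_id("abc"))  # ba7816bf (prefix of the standard SHA-256 test vector)
```

Eight hex characters carry 32 bits, so collisions become plausible once the number of IDs reaches the tens of thousands; a uniqueness check at build time is cheap insurance.
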
## Tech

- Harness: Python orchestrator, bash eval scripts
- Auth: OAuth token from ~/.claude/.credentials.json (Anthropic), ZAI_API_KEY env var (Z.AI)
- Z.AI gateway: api.z.ai/api/anthropic, Anthropic-compatible API, accepts GLM model names directly
- Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
- Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
- Deploy: Forgejo CI to research subdomain (blue/green)
- Results committed to repo for dashboard build
- SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
- Test fixtures: tasks/tetris/fixtures/tests-few/ and tests-full/ (Playwright tests given to agent)
- Noise files: tasks/tetris/noise/ (wikipedia/lorem at calibrated sizes, generated by generate.py)

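The per-subprocess isolation for Z.AI can be sketched as building a fresh environment dict for each child instead of mutating `os.environ`. `build_env` is a hypothetical helper, and the `https://` scheme on the gateway URL is an assumption (the entry above gives only host and path). How the harness hands the key itself to the CLI is not shown here, so the sketch only checks that `ZAI_API_KEY` is present.

```python
import os

ZAI_BASE_URL = "https://api.z.ai/api/anthropic"  # scheme assumed

def build_env(provider, base_env=None):
    """Return a copy of the environment tailored to one provider's subprocess."""
    env = dict(os.environ if base_env is None else base_env)
    if provider == "zai":
        if not env.get("ZAI_API_KEY"):
            raise RuntimeError("ZAI_API_KEY must be set for --provider zai")
        env["ANTHROPIC_BASE_URL"] = ZAI_BASE_URL
    elif provider == "anthropic":
        # Design choice of this sketch: drop any stale override so it
        # cannot redirect Anthropic runs to another endpoint.
        env.pop("ANTHROPIC_BASE_URL", None)
    else:
        raise ValueError(f"unknown provider: {provider}")
    return env

parent = {"PATH": "/usr/bin", "ZAI_API_KEY": "placeholder"}
child = build_env("zai", parent)
print(child["ANTHROPIC_BASE_URL"])
print("ANTHROPIC_BASE_URL" in parent)  # False: the parent mapping is untouched
```

The returned dict would be passed as `subprocess.Popen(cmd, env=child)`, so parallel workers for different providers never see each other's endpoint settings.
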
## Conventions

- Source control: Forgejo (not GitHub)
- Start conservative with resource-intensive settings
- Never use emdashes
- Pre-push hook verifies dashboard build
- Provider must be explicit (--provider flag required)
- GLM models use real names (glm-4.5-air), never mapped to haiku/sonnet/opus
- **Gameplay bot driver MUST NOT hard-code language strings.** No text matching
  for start buttons, game over detection, restart buttons, score/level labels,
  or any other UI element. Detection must be purely structural (DOM structure,
  element properties, visual changes, behavioral response to input). The bot
  should work for games in any language without code changes.

## TODO

### Analysis
- [x] PCA analysis: `harness/pca-analysis.py` generates `results/analysis/pca.json`, dashboard at `/pca`
- [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)

### Eval
- [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
- [ ] Gameplay bot does NOT test: wall kicks, lock delay (sliding at collision line), T-spins, hold piece, ghost piece, next piece preview, level/speed progression, DAS. Known limitation for methodology page.
- [ ] Gameplay bot start detection checks canvas click before start buttons, causing false "started" on start screens. Reorder to check buttons first.
- [ ] Gameplay bot false positives: piece_locks and game_over can pass on static start screens when grid reader misidentifies UI chrome as game state.
- [ ] Some agents build working games that require a build step (Vite/webpack) but don't run the build, so the artifact is source code not a playable game. The eval scores 0 but the game "works" if you build it.
- [ ] Games with minor UI bugs (CSS z-index, overflow, missing start button handler) can mask fully working gameplay logic. The bot scores 0 because it can't access the game, even though the code is correct. A "start game" button that doesn't work prevents testing all other mechanics.
- [ ] Memory leak detection via Playwright heap snapshots
- [ ] Frame rate measurement during gameplay
- [ ] Dead code detection (knip)
- [ ] Wall kick rotation testing

### Dashboard
- [ ] Cell detail page shows 0%/0s for cells with no runs (should show "no data")
- [ ] Show gameplay bot test results on cell detail page
- [ ] Inline Tetris artifact previews in grid table (thumbnails)
- [ ] Re-eval button in UI
- [ ] Update methodology page with new axes, provider system, GLM models

### Harness
- [ ] OAuth token refresh: verify the refresh endpoint works reliably
- [ ] Compete strategy implementation (two parallel invocations, pick best)
- [ ] Add more tasks beyond Tetris
- [ ] irr_code noise type (user will provide Claude session transcript)
- [ ] poison noise type (user will provide Java coding session)

### Data
- [ ] Complete Z.AI sweeps (glm-4.7, glm-5.1 main_effects)
- [ ] Interaction hunt on top variables
- [ ] Test new axes (strategy, tests_provided, design_guidance, etc.) in runs
	git clone https://git.shiptheloop.com/loop-benchmarking.git