CLAUDE.md (10519B)
# Loop Benchmarking

## What This Is

An open benchmark for comparing agentic coding loop configurations. Define the variables that make up a loop setup, and the system generates every permutation automatically. Each permutation is run against each task multiple times (default 3) in a clean-room environment. All data is public.

## The Research Grid

The grid is a cartesian product of configuration variables. You define the axes and their possible values; the harness explodes them into the full experiment matrix.

23 axes (see `grid.yaml` for full definition):
- **Model**: Haiku, Sonnet, Opus (Anthropic) / glm-4.5-air, glm-4.7, glm-5.1 (Z.AI)
- **Provider**: anthropic, zai (controls API endpoint)
- **Effort**: high, max (extended thinking)
- **Prompt style**: simple, detailed
- **Programming language**: TypeScript, JavaScript, unspecified
- **Human language**: English, Spanish
- **Individual tools**: Read, Write, Edit, Glob, Grep (each on/off)
- **Tooling**: Playwright (off/available/instructed), linter (eslint) on/off
- **Context**: rules file provided or not
- **Web search**: enabled/disabled
- **Budget**: low ($2.00), high ($10.00)
- **Strategy**: none, plan_first, iterate, creative_validate, use_subagents, delegate, review, compete (deferred), split_work
- **Tests provided**: none, a_few (4 tests), many (16 tests)
- **Design guidance**: none, vague, specific
- **Architecture**: none, separation, best_practices
- **Error checking**: none, self_verify
- **Context noise**: clean, wikipedia_25/50/75, lorem_25/50/75
- **Renderer**: none, canvas, svg, dom, webgl

## Test Harness

- Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
- `--provider` flag required (anthropic or zai) to prevent cross-provider contamination
- `--model` accepts actual model names (glm-4.5-air, not haiku) for non-anthropic providers
- `-n N` / `--max-runs N` limits total runs
- `--runs-per-cell N` overrides runs_per_cell from grid.yaml (use 1 for broad coverage)
- `--commit-every N` analyzes and pushes results every N completed runs
- OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
- Z.AI auth via `ZAI_API_KEY` env var, sets `ANTHROPIC_BASE_URL` per-subprocess (isolated)
- Auth keepalive: background process pings claude every 5 min to refresh token
- DOE designs: main_effects, plackett_burman, interaction_hunt
- Modular prompt builder: PROMPT_SNIPPETS dict with EN+ES translations per axis
- Re-eval existing runs: `python3 harness/reeval.py -j 4`
- Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
- Auto-extracts workspace artifacts for dashboard iframe preview
- Auto-commits and pushes results after sweep completes
- Invalid run detection: auto-discards 0-turn runs, skips completed runs on resume
- Quick analysis: `python3 harness/analyze-and-push.py` (no re-eval, seconds)

### Sweep workflow
```bash
# Broad coverage (1 run per cell, explore the variable space)
python3 harness/run.py grid.yaml main_effects --provider anthropic --model haiku -j 4 --runs-per-cell 1 --commit-every 20

# Backfill (add runs 2+3 to existing cells for variance measurement)
python3 harness/run.py grid.yaml main_effects --provider anthropic --model haiku -j 4 --commit-every 20

# Z.AI models
export ZAI_API_KEY="..."
python3 harness/run.py grid.yaml main_effects --provider zai --model glm-4.5-air -j 2 --runs-per-cell 1 --commit-every 20
```
- Serve process cleanup: Popen with start_new_session + os.killpg prevents orphaned HTTP servers

## Scoring (Input/Output/Outcome Framework)

All evaluation is deterministic code. No LLM grading.

**Inputs** (experiment variables - the grid axes):
- Model, provider, effort, language, tools, prompt style, strategy, etc.
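
Since the inputs are the grid axes, a minimal sketch of how a cartesian-product grid can be expanded into cells and runs may help. It is illustrative only: the `AXES` dict is a hypothetical three-axis subset, `expand_grid` and `plan_runs` are made-up helper names, and none of this reflects the actual `grid.yaml` schema or the code in `harness/run.py`.

```python
from itertools import product

# Hypothetical subset of axes for illustration; the real definitions
# live in grid.yaml, whose schema is not shown in this document.
AXES = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "prompt_style": ["simple", "detailed"],
}


def expand_grid(axes):
    """Return one dict per cell of the cartesian product of the axes."""
    names = list(axes)
    value_lists = [axes[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def plan_runs(cells, runs_per_cell=3):
    """Pair every cell with a 1-based run index (3 runs per cell by default)."""
    return [(cell, run) for cell in cells for run in range(1, runs_per_cell + 1)]


cells = expand_grid(AXES)
runs = plan_runs(cells)
print(len(cells), len(runs))  # 12 36
```

The cell count multiplies with every axis added, so a full product over all 23 axes would be very large; a sweep would presumably sample it, for instance through the DOE designs listed under Test Harness.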

**Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
- **Gameplay bot** (50%): two-phase (mechanics + play-to-win), 60 pieces / 45s, 60ms polling, 16 tests
- **SonarQube** (50%): cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings

**Outputs** (tracked and displayed, but not in headline score):
- **Build quality**: lint, type check, bundle size
- **Structural**: entry point exists, build succeeds
- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
- **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing

## Eval Pipeline (per run, in order)

1. `structural.sh` - entry point, build, TS compilation
2. `quality.sh` - ESLint, typecheck, bundle size
3. `code-analysis.py` - 14 code quality metrics
4. `transcript-analysis.py` - agent behavior from conversation log
5. `gameplay-bot/` - two-phase: mechanics test then play-to-win. Uses Pierre Dellacherie's 4-heuristic Tetris AI (2003) with Colin Fahey's GA-optimized weights. Reference implementations: LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI (both MIT)
6. `sonarqube-scan.py` - automated code quality scan (requires SonarQube at localhost:9000)

## Dashboard

Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.

Pages:
- **Grid** (`/`): per-task summary, box plots for score distribution, filterable cell/run table with sorting
- **Insights** (`/insights`): convex hull scatter plots (4 density levels) with model toggles, variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
- **Surprises** (`/surprises`): aggregate surprise stats, breakdown by type (model upsets, prompt upsets, individual outliers), grouped surprise cards with run links
- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
- **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
- **PCA** (`/pca`): principal component analysis scatter plot (PC1/PC2/PC3 selectable axes), model-colored points sized by score, loadings interpretation tables, variance explained bars
- **Run detail** (`/run/{id}` or `/r/{short_id}`): outcome/output score separation, all config pills, SonarQube detail card, 6 detail cards, transcript viewer, artifact iframe, link to cell
- **Cell detail** (`/cell/{id}` or `/c/{short_id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
- **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations

Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages.

## Tech

- Harness: Python orchestrator, bash eval scripts
- Auth: OAuth token from ~/.claude/.credentials.json (Anthropic), ZAI_API_KEY env var (Z.AI)
- Z.AI gateway: api.z.ai/api/anthropic, Anthropic-compatible API, accepts GLM model names directly
- Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
- Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
- Deploy: Forgejo CI to research subdomain (blue/green)
- Results committed to repo for dashboard build
- SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
- Test fixtures: tasks/tetris/fixtures/tests-few/ and tests-full/ (Playwright tests given to agent)
- Noise files: tasks/tetris/noise/ (wikipedia/lorem at calibrated sizes, generated by generate.py)

## Conventions

- Source control: Forgejo (not GitHub)
- Start conservative with resource-intensive settings
- Never use emdashes
- Pre-push hook verifies dashboard build
- Provider must be explicit (--provider flag required)
- GLM models use real names (glm-4.5-air), never mapped to haiku/sonnet/opus
- **Gameplay bot driver MUST NOT hard-code language strings.** No text matching
  for start buttons, game over detection, restart buttons, score/level labels,
  or any other UI element. Detection must be purely structural (DOM structure,
  element properties, visual changes, behavioral response to input). The bot
  should work for games in any language without code changes.

## TODO

### Analysis
- [x] PCA analysis: `harness/pca-analysis.py` generates `results/analysis/pca.json`, dashboard at `/pca`
- [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)

### Eval
- [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
- [ ] Gameplay bot does NOT test: wall kicks, lock delay (sliding at collision line), T-spins, hold piece, ghost piece, next piece preview, level/speed progression, DAS. Known limitation for methodology page.
- [ ] Gameplay bot start detection checks canvas click before start buttons, causing false "started" on start screens. Reorder to check buttons first.
- [ ] Gameplay bot false positives: piece_locks and game_over can pass on static start screens when grid reader misidentifies UI chrome as game state.
- [ ] Some agents build working games that require a build step (Vite/webpack) but don't run the build, so the artifact is source code not a playable game. The eval scores 0 but the game "works" if you build it.
- [ ] Games with minor UI bugs (CSS z-index, overflow, missing start button handler) can mask fully working gameplay logic. The bot scores 0 because it can't access the game, even though the code is correct. A "start game" button that doesn't work prevents testing all other mechanics.
- [ ] Memory leak detection via Playwright heap snapshots
- [ ] Frame rate measurement during gameplay
- [ ] Dead code detection (knip)
- [ ] Wall kick rotation testing

### Dashboard
- [ ] Cell detail page shows 0%/0s for cells with no runs (should show "no data")
- [ ] Show gameplay bot test results on cell detail page
- [ ] Inline Tetris artifact previews in grid table (thumbnails)
- [ ] Re-eval button in UI
- [ ] Update methodology page with new axes, provider system, GLM models

### Harness
- [ ] OAuth token refresh: verify the refresh endpoint works reliably
- [ ] Compete strategy implementation (two parallel invocations, pick best)
- [ ] Add more tasks beyond Tetris
- [ ] irr_code noise type (user will provide Claude session transcript)
- [ ] poison noise type (user will provide Java coding session)

### Data
- [ ] Complete Z.AI sweeps (glm-4.7, glm-5.1 main_effects)
- [ ] Interaction hunt on top variables
- [ ] Test new axes (strategy, tests_provided, design_guidance, etc.) in runs