loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit cfb04f1f1d0b27c76825f4289a9de007f7b8bf00
parent ae769a448ed5b0539181c89a2b3fc575989123ae
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Mon,  6 Apr 2026 10:36:59 +0200

Update CLAUDE.md with complete project state

Full documentation of: 16 axes, eval pipeline (6 steps), scoring
framework (input/output/outcome), dashboard pages, tech stack,
conventions, and comprehensive TODO list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
1 file changed, 53 insertions(+), 34 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -19,87 +19,106 @@ The grid is a cartesian product of configuration variables. You define the axes
 - **Context**: rules file provided or not
 - **Sub-agents**: enabled/disabled
 - **Web search**: enabled/disabled
-- **Budget**: low ($0.50), high ($5.00)
+- **Budget**: low ($2.00), high ($10.00)
 
 ## Test Harness
 
 - Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
-- OAuth auth via `--bare` + `apiKeyHelper` script
+- OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
+- Auth keepalive: background process pings claude every 5 min to refresh token
 - DOE designs: main_effects, plackett_burman, interaction_hunt
 - Re-eval existing runs: `python3 harness/reeval.py -j 4`
+- Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
 - Auto-extracts workspace artifacts for dashboard iframe preview
+- Auto-commits and pushes results after sweep completes
+- Invalid run detection: auto-deletes runs with null cost, 1 turn, timeout, invalid API key, or short transcript on resume
 
-## Scoring
+## Scoring (Input/Output/Outcome Framework)
 
 All evaluation is deterministic code. No LLM grading.
 
-Outcome score (the headline number, defined in `tasks/tetris/scoring.yaml`):
-- **Gameplay bot** (50%): auto-calibrating Tetris player that tests all mechanics
+**Inputs** (experiment variables - the grid axes):
+- Model, effort, language, tools, prompt style, etc.
+
+**Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
+- **Gameplay bot** (50%): auto-calibrating Tetris player, 16 Playwright tests via continuous 150ms grid polling
 - **Quality** (50%): lint, type check, bundle size
 
-Output metrics (tracked and displayed, but not in headline score):
+**Outputs** (tracked and displayed, but not in headline score):
 - **Structural**: entry point exists, build succeeds
-- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation
-- **Transcript analysis**: agent efficiency, wasted turns, self-testing
-- **SonarQube**: automated code quality scan
+- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
+- **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing
+- **SonarQube**: cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings (requires SonarQube at localhost:9000)
+
+## Eval Pipeline (per run, in order)
+
+1. `structural.sh` - entry point, build, TS compilation
+2. `quality.sh` - ESLint, typecheck, bundle size
+3. `code-analysis.py` - 14 code quality metrics
+4. `transcript-analysis.py` - agent behavior from conversation log
+5. `gameplay-bot/` - 16 Playwright tests via continuous grid scanning (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
+6. `sonarqube-scan.py` - automated code quality scan
 
 ## Dashboard
 
-Static Astro site with React islands. Pages:
-- **Grid** (`/`): summary stats, bar charts, filterable run table
-- **Insights** (`/insights`): surprise cards, scatter plots, tornado chart, interaction heatmap
-- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
-- **Compare** (`/compare`): aggregate stats per axis value
-- **Run detail** (`/run/{id}`): metrics, config pills, 7 score bars, code analysis, agent behavior, transcript viewer, artifact iframe
+Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.
 
-Light/dark theme toggle. SMUI design system (JetBrains Mono, Nord palette).
+Pages:
+- **Grid** (`/`): per-task summary, bar charts with error bars, filterable cell/run table with sorting
+- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels), variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
+- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
+- **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
+- **Run detail** (`/run/{id}`): outcome/output score separation, config pills, 6 detail cards, transcript viewer, artifact iframe
+- **Cell detail** (`/cell/{id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
+- **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations
 
 ## Tech
 
 - Harness: Python orchestrator, bash eval scripts
-- Auth: OAuth token from ~/.claude/.credentials.json, auto-refresh
-- Dashboard: Astro + React + recharts
+- Auth: OAuth token from ~/.claude/.credentials.json, keepalive background process
+- Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
+- Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
 - Deploy: Forgejo CI to research subdomain (blue/green)
 - Results committed to repo for dashboard build
+- SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
 
 ## Conventions
 
 - Source control: Forgejo (not GitHub)
 - Start conservative with resource-intensive settings
 - Never use emdashes
+- Pre-push hook verifies dashboard build
 
 ## TODO
 
 ### Analysis
-- [ ] PCA analysis: add when 100+ runs exist. One-hot encode categoricals, identify principal components explaining variance. Show which variable combinations matter most.
+- [ ] PCA analysis: add when 100+ runs exist with new scoring. One-hot encode categoricals, identify principal components explaining variance.
 - [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)
-- [ ] Per-task analysis: when more tasks are added, compare how variables affect different tasks differently
 
 ### Eval
-- [ ] Wire functional eval (Playwright tests from gameplay bot) into the score more robustly
-- [ ] Cyclomatic complexity measurement (escomplex or typhonjs-escomplex)
+- [ ] Fix gameplay bot false positives: line_clear and score_changes need piecesSpawned > 0 check
+- [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
+- [ ] Add SonarQube to outcome_weights once validated
+- [ ] Cyclomatic complexity measurement (or use SonarQube cognitive complexity)
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
-- [ ] Wall kick rotation testing (position piece against wall, try rotate)
-- [ ] I-piece detection in rotation test needs tuning (not reliably found in 60 attempts)
+- [ ] Wall kick rotation testing
+- [ ] I-piece detection in rotation test needs tuning
 
 ### Dashboard
-- [ ] Sortable columns on grid table
-- [ ] Show context file content on run detail page when context_file=provided
-- [ ] Add human_language=unspecified option
-- [ ] Run detail: show gameplay bot test results (16 individual tests with pass/fail)
-- [ ] Run detail: show game screenshots from bot at key moments
+- [ ] Show SonarQube details on run page (bugs, smells, ratings)
+- [ ] Show gameplay bot test results on cell detail page
 - [ ] Inline Tetris artifact previews in grid table (thumbnails)
-- [ ] Re-eval button in UI (trigger reeval.py from dashboard)
+- [ ] Re-eval button in UI
 
 ### Harness
 - [ ] OAuth token refresh: verify the refresh endpoint works reliably
-- [ ] Auto-commit and push results after sweep completes
 - [ ] Support non-Anthropic models via LiteLLM/Ollama
 - [ ] Add more tasks beyond Tetris
+- [ ] CodeScene integration when MCP token available ($9/mo waitlist)
 
 ### Data
-- [ ] Complete sonnet sweep (many timed out)
-- [ ] Run opus sweep
-- [ ] Interaction hunt on top variables (model x language x context_file x prompt_style)
+- [ ] Complete sonnet and opus sweeps (some timed out or budget-killed)
+- [ ] Interaction hunt on top variables
+- [ ] Full re-eval with new outcome scoring
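
Note on the scoring change above: a minimal Python sketch of how the headline outcome score could be combined from the two 50% components, and why three binary quality checks only yield 0/33/67/100%. The function and constant names are hypothetical, and weighting each of the 16 gameplay tests equally is an assumption. The real weights live in `tasks/tetris/scoring.yaml`, which this commit does not show.

```python
# Illustrative only: names and the equal-per-test weighting are assumptions,
# not taken from the repository.
OUTCOME_WEIGHTS = {"gameplay_bot": 0.5, "quality": 0.5}  # 50% / 50% per CLAUDE.md
QUALITY_CHECKS = ("lint", "typecheck", "bundle_size")    # three binary checks
GAMEPLAY_TESTS = 16                                      # Playwright tests per run


def quality_score(passed: dict) -> float:
    """Percentage of the three pass/fail checks that passed."""
    hits = sum(1 for name in QUALITY_CHECKS if passed.get(name, False))
    return 100.0 * hits / len(QUALITY_CHECKS)


def gameplay_score(tests_passed: int) -> float:
    """Percentage of gameplay tests passed (assumes equal weight per test)."""
    if not 0 <= tests_passed <= GAMEPLAY_TESTS:
        raise ValueError("tests_passed must be between 0 and 16")
    return 100.0 * tests_passed / GAMEPLAY_TESTS


def outcome_score(tests_passed: int, passed: dict) -> float:
    """Weighted headline score on a 0-100 scale."""
    return (
        OUTCOME_WEIGHTS["gameplay_bot"] * gameplay_score(tests_passed)
        + OUTCOME_WEIGHTS["quality"] * quality_score(passed)
    )


if __name__ == "__main__":
    # The only quality levels reachable with three binary checks.
    levels = [
        round(quality_score(dict.fromkeys(QUALITY_CHECKS[:k], True)))
        for k in range(len(QUALITY_CHECKS) + 1)
    ]
    print(levels)
    print(outcome_score(16, dict.fromkeys(QUALITY_CHECKS, True)))
```

Under these assumptions a single failed quality check moves the headline score by roughly 17 points, which is the coarseness the new Eval TODO item refers to.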
