Update CLAUDE.md with complete project state - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit cfb04f1f1d0b27c76825f4289a9de007f7b8bf00
parent ae769a448ed5b0539181c89a2b3fc575989123ae
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Mon,  6 Apr 2026 10:36:59 +0200

Update CLAUDE.md with complete project state

Full documentation of: 16 axes, eval pipeline (6 steps), scoring
framework (input/output/outcome), dashboard pages, tech stack,
conventions, and comprehensive TODO list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md  | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------

1 file changed, 53 insertions(+), 34 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -19,87 +19,106 @@ The grid is a cartesian product of configuration variables. You define the axes 
 - **Context**: rules file provided or not
 - **Sub-agents**: enabled/disabled
 - **Web search**: enabled/disabled
-- **Budget**: low ($0.50), high ($5.00)
+- **Budget**: low ($2.00), high ($10.00)
 
 ## Test Harness
 
 - Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
-- OAuth auth via `--bare` + `apiKeyHelper` script
+- OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
+- Auth keepalive: background process pings claude every 5 min to refresh token
 - DOE designs: main_effects, plackett_burman, interaction_hunt
 - Re-eval existing runs: `python3 harness/reeval.py -j 4`
+- Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
 - Auto-extracts workspace artifacts for dashboard iframe preview
+- Auto-commits and pushes results after sweep completes
+- Invalid run detection: auto-deletes runs with null cost, 1 turn, timeout, invalid API key, or short transcript on resume
 
-## Scoring
+## Scoring (Input/Output/Outcome Framework)
 
 All evaluation is deterministic code. No LLM grading.
 
-Outcome score (the headline number, defined in `tasks/tetris/scoring.yaml`):
-- **Gameplay bot** (50%): auto-calibrating Tetris player that tests all mechanics
+**Inputs** (experiment variables - the grid axes):
+- Model, effort, language, tools, prompt style, etc.
+
+**Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
+- **Gameplay bot** (50%): auto-calibrating Tetris player, 16 Playwright tests via continuous 150ms grid polling
 - **Quality** (50%): lint, type check, bundle size
 
-Output metrics (tracked and displayed, but not in headline score):
+**Outputs** (tracked and displayed, but not in headline score):
 - **Structural**: entry point exists, build succeeds
-- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation
-- **Transcript analysis**: agent efficiency, wasted turns, self-testing
-- **SonarQube**: automated code quality scan
+- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
+- **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing
+- **SonarQube**: cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings (requires SonarQube at localhost:9000)
+
+## Eval Pipeline (per run, in order)
+
+1. `structural.sh` - entry point, build, TS compilation
+2. `quality.sh` - ESLint, typecheck, bundle size
+3. `code-analysis.py` - 14 code quality metrics
+4. `transcript-analysis.py` - agent behavior from conversation log
+5. `gameplay-bot/` - 16 Playwright tests via continuous grid scanning (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
+6. `sonarqube-scan.py` - automated code quality scan
 
 ## Dashboard
 
-Static Astro site with React islands. Pages:
-- **Grid** (`/`): summary stats, bar charts, filterable run table
-- **Insights** (`/insights`): surprise cards, scatter plots, tornado chart, interaction heatmap
-- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
-- **Compare** (`/compare`): aggregate stats per axis value
-- **Run detail** (`/run/{id}`): metrics, config pills, 7 score bars, code analysis, agent behavior, transcript viewer, artifact iframe
+Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.
 
-Light/dark theme toggle. SMUI design system (JetBrains Mono, Nord palette).
+Pages:
+- **Grid** (`/`): per-task summary, bar charts with error bars, filterable cell/run table with sorting
+- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels), variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
+- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
+- **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
+- **Run detail** (`/run/{id}`): outcome/output score separation, config pills, 6 detail cards, transcript viewer, artifact iframe
+- **Cell detail** (`/cell/{id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
+- **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations
 
 ## Tech
 
 - Harness: Python orchestrator, bash eval scripts
-- Auth: OAuth token from ~/.claude/.credentials.json, auto-refresh
-- Dashboard: Astro + React + recharts
+- Auth: OAuth token from ~/.claude/.credentials.json, keepalive background process
+- Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
+- Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
 - Deploy: Forgejo CI to research subdomain (blue/green)
 - Results committed to repo for dashboard build
+- SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
 
 ## Conventions
 
 - Source control: Forgejo (not GitHub)
 - Start conservative with resource-intensive settings
 - Never use emdashes
+- Pre-push hook verifies dashboard build
 
 ## TODO
 
 ### Analysis
-- [ ] PCA analysis: add when 100+ runs exist. One-hot encode categoricals, identify principal components explaining variance. Show which variable combinations matter most.
+- [ ] PCA analysis: add when 100+ runs exist with new scoring. One-hot encode categoricals, identify principal components explaining variance.
 - [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)
-- [ ] Per-task analysis: when more tasks are added, compare how variables affect different tasks differently
 
 ### Eval
-- [ ] Wire functional eval (Playwright tests from gameplay bot) into the score more robustly
-- [ ] Cyclomatic complexity measurement (escomplex or typhonjs-escomplex)
+- [ ] Fix gameplay bot false positives: line_clear and score_changes need piecesSpawned > 0 check
+- [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
+- [ ] Add SonarQube to outcome_weights once validated
+- [ ] Cyclomatic complexity measurement (or use SonarQube cognitive complexity)
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
-- [ ] Wall kick rotation testing (position piece against wall, try rotate)
-- [ ] I-piece detection in rotation test needs tuning (not reliably found in 60 attempts)
+- [ ] Wall kick rotation testing
+- [ ] I-piece detection in rotation test needs tuning
 
 ### Dashboard
-- [ ] Sortable columns on grid table
-- [ ] Show context file content on run detail page when context_file=provided
-- [ ] Add human_language=unspecified option
-- [ ] Run detail: show gameplay bot test results (16 individual tests with pass/fail)
-- [ ] Run detail: show game screenshots from bot at key moments
+- [ ] Show SonarQube details on run page (bugs, smells, ratings)
+- [ ] Show gameplay bot test results on cell detail page
 - [ ] Inline Tetris artifact previews in grid table (thumbnails)
-- [ ] Re-eval button in UI (trigger reeval.py from dashboard)
+- [ ] Re-eval button in UI
 
 ### Harness
 - [ ] OAuth token refresh: verify the refresh endpoint works reliably
-- [ ] Auto-commit and push results after sweep completes
 - [ ] Support non-Anthropic models via LiteLLM/Ollama
 - [ ] Add more tasks beyond Tetris
+- [ ] CodeScene integration when MCP token available ($9/mo waitlist)
 
 ### Data
-- [ ] Complete sonnet sweep (many timed out)
-- [ ] Run opus sweep
-- [ ] Interaction hunt on top variables (model x language x context_file x prompt_style)
+- [ ] Complete sonnet and opus sweeps (some timed out or budget-killed)
+- [ ] Interaction hunt on top variables
+- [ ] Full re-eval with new outcome scoring

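The scoring lines in the diff above describe a headline outcome score split 50/50 between the gameplay bot and quality, and the Eval TODO notes that quality is a binary pass/fail on 3 checks, which yields only 0/33/67/100%. A minimal sketch of that arithmetic follows. The function names, dictionary keys, and the 0-100 scale for the gameplay input are assumptions made for illustration; the actual weights are defined in `tasks/tetris/scoring.yaml` and the harness code may differ.

```python
# Hypothetical sketch of the outcome score described in CLAUDE.md.
# Names and keys below are illustrative, not taken from the harness.

# Assumed mirror of the 50/50 split stated in the diff.
OUTCOME_WEIGHTS = {"gameplay": 0.5, "quality": 0.5}

# The three quality checks named in the diff: lint, type check, bundle size.
QUALITY_CHECKS = ("lint", "typecheck", "bundle_size")


def quality_score(results: dict[str, bool]) -> float:
    """Percentage of quality checks that passed (missing checks count as failed)."""
    passed = sum(1 for name in QUALITY_CHECKS if results.get(name, False))
    return 100.0 * passed / len(QUALITY_CHECKS)


def outcome_score(gameplay_pct: float, quality_results: dict[str, bool]) -> float:
    """Weighted headline score; gameplay_pct is assumed to be on a 0-100 scale."""
    return (
        OUTCOME_WEIGHTS["gameplay"] * gameplay_pct
        + OUTCOME_WEIGHTS["quality"] * quality_score(quality_results)
    )


def quality_levels() -> list[int]:
    """All distinct rounded quality scores reachable with binary checks."""
    levels = []
    for n_passed in range(len(QUALITY_CHECKS) + 1):
        results = {name: i < n_passed for i, name in enumerate(QUALITY_CHECKS)}
        levels.append(round(quality_score(results)))
    return levels


if __name__ == "__main__":
    print(quality_levels())
    print(outcome_score(80.0, {"lint": True, "typecheck": True, "bundle_size": True}))
```

Under these assumptions, a single failed quality check moves the headline score by roughly 16.7 points, which illustrates why the TODO calls the quality scoring too coarse.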
	git clone https://git.shiptheloop.com/loop-benchmarking.git