commit cfb04f1f1d0b27c76825f4289a9de007f7b8bf00
parent ae769a448ed5b0539181c89a2b3fc575989123ae
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Mon, 6 Apr 2026 10:36:59 +0200

Update CLAUDE.md with complete project state

Full documentation of: 16 axes, eval pipeline (6 steps), scoring
framework (input/output/outcome), dashboard pages, tech stack,
conventions, and comprehensive TODO list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
 CLAUDE.md | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
1 file changed, 53 insertions(+), 34 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -19,87 +19,106 @@ The grid is a cartesian product of configuration variables. You define the axes
- **Context**: rules file provided or not
- **Sub-agents**: enabled/disabled
- **Web search**: enabled/disabled
-- **Budget**: low ($0.50), high ($5.00)
+- **Budget**: low ($2.00), high ($10.00)
## Test Harness
- Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
-- OAuth auth via `--bare` + `apiKeyHelper` script
+- OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
+- Auth keepalive: background process pings claude every 5 min to refresh token
- DOE designs: main_effects, plackett_burman, interaction_hunt
- Re-eval existing runs: `python3 harness/reeval.py -j 4`
+- Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
- Auto-extracts workspace artifacts for dashboard iframe preview
+- Auto-commits and pushes results after sweep completes
+- Invalid run detection: auto-deletes runs with null cost, 1 turn, timeout, invalid API key, or short transcript on resume
-## Scoring
+## Scoring (Input/Output/Outcome Framework)
All evaluation is deterministic code. No LLM grading.
-Outcome score (the headline number, defined in `tasks/tetris/scoring.yaml`):
-- **Gameplay bot** (50%): auto-calibrating Tetris player that tests all mechanics
+**Inputs** (experiment variables - the grid axes):
+- Model, effort, language, tools, prompt style, etc.
+
+**Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
+- **Gameplay bot** (50%): auto-calibrating Tetris player, 16 Playwright tests via continuous 150ms grid polling
- **Quality** (50%): lint, type check, bundle size
-Output metrics (tracked and displayed, but not in headline score):
+**Outputs** (tracked and displayed, but not in headline score):
- **Structural**: entry point exists, build succeeds
-- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation
-- **Transcript analysis**: agent efficiency, wasted turns, self-testing
-- **SonarQube**: automated code quality scan
+- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
+- **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing
+- **SonarQube**: cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings (requires SonarQube at localhost:9000)
+
+## Eval Pipeline (per run, in order)
+
+1. `structural.sh` - entry point, build, TS compilation
+2. `quality.sh` - ESLint, typecheck, bundle size
+3. `code-analysis.py` - 14 code quality metrics
+4. `transcript-analysis.py` - agent behavior from conversation log
+5. `gameplay-bot/` - 16 Playwright tests via continuous grid scanning (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
+6. `sonarqube-scan.py` - automated code quality scan
## Dashboard
-Static Astro site with React islands. Pages:
-- **Grid** (`/`): summary stats, bar charts, filterable run table
-- **Insights** (`/insights`): surprise cards, scatter plots, tornado chart, interaction heatmap
-- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
-- **Compare** (`/compare`): aggregate stats per axis value
-- **Run detail** (`/run/{id}`): metrics, config pills, 7 score bars, code analysis, agent behavior, transcript viewer, artifact iframe
+Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.
-Light/dark theme toggle. SMUI design system (JetBrains Mono, Nord palette).
+Pages:
+- **Grid** (`/`): per-task summary, bar charts with error bars, filterable cell/run table with sorting
+- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels), variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
+- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
+- **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
+- **Run detail** (`/run/{id}`): outcome/output score separation, config pills, 6 detail cards, transcript viewer, artifact iframe
+- **Cell detail** (`/cell/{id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
+- **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations
## Tech
- Harness: Python orchestrator, bash eval scripts
-- Auth: OAuth token from ~/.claude/.credentials.json, auto-refresh
-- Dashboard: Astro + React + recharts
+- Auth: OAuth token from ~/.claude/.credentials.json, keepalive background process
+- Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
+- Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
- Deploy: Forgejo CI to research subdomain (blue/green)
- Results committed to repo for dashboard build
+- SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
## Conventions
- Source control: Forgejo (not GitHub)
- Start conservative with resource-intensive settings
- Never use emdashes
+- Pre-push hook verifies dashboard build
## TODO
### Analysis
-- [ ] PCA analysis: add when 100+ runs exist. One-hot encode categoricals, identify principal components explaining variance. Show which variable combinations matter most.
+- [ ] PCA analysis: add when 100+ runs exist with new scoring. One-hot encode categoricals, identify principal components explaining variance.
- [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)
-- [ ] Per-task analysis: when more tasks are added, compare how variables affect different tasks differently
### Eval
-- [ ] Wire functional eval (Playwright tests from gameplay bot) into the score more robustly
-- [ ] Cyclomatic complexity measurement (escomplex or typhonjs-escomplex)
+- [ ] Fix gameplay bot false positives: line_clear and score_changes need piecesSpawned > 0 check
+- [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
+- [ ] Add SonarQube to outcome_weights once validated
+- [ ] Cyclomatic complexity measurement (or use SonarQube cognitive complexity)
- [ ] Memory leak detection via Playwright heap snapshots
- [ ] Frame rate measurement during gameplay
- [ ] Dead code detection (knip)
-- [ ] Wall kick rotation testing (position piece against wall, try rotate)
-- [ ] I-piece detection in rotation test needs tuning (not reliably found in 60 attempts)
+- [ ] Wall kick rotation testing
+- [ ] I-piece detection in rotation test needs tuning
### Dashboard
-- [ ] Sortable columns on grid table
-- [ ] Show context file content on run detail page when context_file=provided
-- [ ] Add human_language=unspecified option
-- [ ] Run detail: show gameplay bot test results (16 individual tests with pass/fail)
-- [ ] Run detail: show game screenshots from bot at key moments
+- [ ] Show SonarQube details on run page (bugs, smells, ratings)
+- [ ] Show gameplay bot test results on cell detail page
- [ ] Inline Tetris artifact previews in grid table (thumbnails)
-- [ ] Re-eval button in UI (trigger reeval.py from dashboard)
+- [ ] Re-eval button in UI
### Harness
- [ ] OAuth token refresh: verify the refresh endpoint works reliably
-- [ ] Auto-commit and push results after sweep completes
- [ ] Support non-Anthropic models via LiteLLM/Ollama
- [ ] Add more tasks beyond Tetris
+- [ ] CodeScene integration when MCP token available ($9/mo waitlist)
### Data
-- [ ] Complete sonnet sweep (many timed out)
-- [ ] Run opus sweep
-- [ ] Interaction hunt on top variables (model x language x context_file x prompt_style)
+- [ ] Complete sonnet and opus sweeps (some timed out or budget-killed)
+- [ ] Interaction hunt on top variables
+- [ ] Full re-eval with new outcome scoring
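
Illustrative note on the Scoring section in the diff above: the headline outcome score weights the gameplay bot and quality at 50% each, and the TODO list observes that three binary quality checks can only yield 0/33/67/100%. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, the real weights are defined in `tasks/tetris/scoring.yaml`, and treating the gameplay score as the fraction of the 16 Playwright tests passed is an assumption, not something the commit states.

```python
# Hypothetical sketch of the outcome weighting described in CLAUDE.md.
# The real definition lives in tasks/tetris/scoring.yaml; names here are illustrative.

OUTCOME_WEIGHTS = {"gameplay": 0.5, "quality": 0.5}  # 50% / 50% per the Scoring section


def quality_score(lint_ok: bool, typecheck_ok: bool, bundle_ok: bool) -> float:
    """Three binary checks, so only four distinct scores are possible."""
    checks = (lint_ok, typecheck_ok, bundle_ok)
    return 100.0 * sum(checks) / len(checks)


def gameplay_score(tests_passed: int, total_tests: int = 16) -> float:
    """ASSUMPTION: score is the share of Playwright tests that passed."""
    if not 0 <= tests_passed <= total_tests:
        raise ValueError("tests_passed must be between 0 and total_tests")
    return 100.0 * tests_passed / total_tests


def outcome_score(gameplay: float, quality: float) -> float:
    """Weighted sum of the two outcome components, on a 0-100 scale."""
    return (
        OUTCOME_WEIGHTS["gameplay"] * gameplay
        + OUTCOME_WEIGHTS["quality"] * quality
    )


if __name__ == "__main__":
    # Enumerate the coarse quality levels: 0, 1, 2, or 3 checks passing.
    levels = sorted(
        {round(quality_score(a, b, c))
         for a in (False, True) for b in (False, True) for c in (False, True)}
    )
    print("quality levels:", levels)
    print("example outcome:", outcome_score(gameplay_score(12), quality_score(True, True, False)))
```

Under these assumptions, a run that passes 12 of 16 gameplay tests and 2 of 3 quality checks lands a little above 70. The four-level quality scale is what the "Quality scoring too coarse" TODO item refers to: with equal weights, quality alone moves the headline score in steps of roughly 16.7 points.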