commit 9ff262b655d701800bebbf03baa933fd4627cc2e
parent 612cb54f096ac951c4dca312dc65acac5bfa2430
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Mon, 6 Apr 2026 21:08:40 +0200
Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat:
 CLAUDE.md | 67 +++++++++++++++++++++++++++++++++++++++++++------------------------
1 file changed, 43 insertions(+), 24 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,47 +8,61 @@ An open benchmark for comparing agentic coding loop configurations. Define the v
The grid is a cartesian product of configuration variables. You define the axes and their possible values; the harness explodes them into the full experiment matrix.
-16 axes (see `grid.yaml` for full definition):
-- **Model**: Haiku, Sonnet, Opus (or other providers via LiteLLM)
+23 axes (see `grid.yaml` for full definition):
+- **Model**: Haiku, Sonnet, Opus (Anthropic) / glm-4.5-air, glm-4.7, glm-5.1 (Z.AI)
+- **Provider**: anthropic, zai (controls API endpoint)
- **Effort**: high, max (extended thinking)
- **Prompt style**: simple, detailed
- **Programming language**: TypeScript, JavaScript, unspecified
- **Human language**: English, Spanish
- **Individual tools**: Read, Write, Edit, Glob, Grep (each on/off)
-- **Tooling**: Playwright on/off, linter (eslint) on/off
+- **Tooling**: Playwright (off/available/instructed), linter (eslint) on/off
- **Context**: rules file provided or not
-- **Sub-agents**: enabled/disabled
- **Web search**: enabled/disabled
- **Budget**: low ($2.00), high ($10.00)
+- **Strategy**: none, plan_first, iterate, creative_validate, use_subagents, delegate, review, compete (deferred), split_work
+- **Tests provided**: none, a_few (4 tests), many (16 tests)
+- **Design guidance**: none, vague, specific
+- **Architecture**: none, separation, best_practices
+- **Error checking**: none, self_verify
+- **Context noise**: clean, wikipedia_25/50/75, lorem_25/50/75
+- **Renderer**: none, canvas, svg, dom, webgl
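
A minimal sketch of how the axes above could be exploded into an experiment matrix, assuming each axis is a name mapped to a list of values. The `AXES` dict and the `explode` helper are hypothetical; the real axis definitions live in `grid.yaml` and the harness may build the matrix differently.

```python
from itertools import product

# Hypothetical three-axis subset; the full 23-axis definition lives in grid.yaml.
AXES = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "prompt_style": ["simple", "detailed"],
}


def explode(axes):
    """Return one config dict per cell of the cartesian product of the axes."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]


cells = explode(AXES)
print(len(cells))  # 3 * 2 * 2 = 12
```

The cell count is the product of the axis sizes, so the full grid grows quickly as axes are added. That is presumably why the harness also offers DOE designs instead of always running every cell.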
## Test Harness
- Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
+- `--provider` flag required (anthropic or zai) to prevent cross-provider contamination
+- `--model` accepts actual model names (glm-4.5-air, not haiku) for non-anthropic providers
+- `-n N` / `--max-runs N` limits total runs
+- `--commit-every N` analyzes and pushes results every N completed runs
- OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
+- Z.AI auth via `ZAI_API_KEY` env var, sets `ANTHROPIC_BASE_URL` per-subprocess (isolated)
- Auth keepalive: background process pings claude every 5 min to refresh token
- DOE designs: main_effects, plackett_burman, interaction_hunt
+- Modular prompt builder: PROMPT_SNIPPETS dict with EN+ES translations per axis
- Re-eval existing runs: `python3 harness/reeval.py -j 4`
- Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
- Auto-extracts workspace artifacts for dashboard iframe preview
- Auto-commits and pushes results after sweep completes
- Invalid run detection: auto-deletes runs with null cost, 1 turn, timeout, invalid API key, or short transcript on resume
+- Serve process cleanup: Popen with start_new_session + os.killpg prevents orphaned HTTP servers
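
The serve process cleanup bullet can be sketched as follows, assuming a POSIX host. `start_server` and `stop_server` are hypothetical names, and the timeout and SIGTERM-then-SIGKILL escalation are illustrative choices rather than the harness's confirmed behavior.

```python
import os
import signal
import subprocess


def start_server(cmd):
    """Launch cmd in its own session so it leads a fresh process group."""
    # start_new_session=True calls setsid() in the child (POSIX only),
    # so the child's process group id equals its pid.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_server(proc, timeout=5.0):
    """Terminate the whole process group, escalating to SIGKILL if needed."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # group is already gone
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
```

Signalling the group rather than the single pid also reaches any children the server spawned, which is what keeps HTTP servers from being orphaned.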
## Scoring (Input/Output/Outcome Framework)
All evaluation is deterministic code. No LLM grading.
**Inputs** (experiment variables - the grid axes):
-- Model, effort, language, tools, prompt style, etc.
+- Model, provider, effort, language, tools, prompt style, strategy, etc.
**Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
-- **Gameplay bot** (50%): auto-calibrating Tetris player, 16 Playwright tests via continuous 150ms grid polling
-- **Quality** (50%): lint, type check, bundle size
+- **Gameplay bot** (50%): two-phase (mechanics + play-to-win), 60 pieces / 45s, 60ms polling, 16 tests
+- **SonarQube** (50%): cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings
**Outputs** (tracked and displayed, but not in headline score):
+- **Build quality**: lint, type check, bundle size
- **Structural**: entry point exists, build succeeds
- **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
- **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing
-- **SonarQube**: cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings (requires SonarQube at localhost:9000)
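
As an illustration of the 50/50 outcome weighting, here is a minimal sketch that assumes each component is already normalized to a 0-100 score. The key names and the `headline_score` helper are hypothetical; the actual weights and schema are defined in `tasks/tetris/scoring.yaml`.

```python
# Hypothetical weights mirroring the 50% / 50% split described above.
OUTCOME_WEIGHTS = {"gameplay_bot": 0.5, "sonarqube": 0.5}


def headline_score(component_scores, weights=OUTCOME_WEIGHTS):
    """Weighted sum of outcome components, each assumed to be on a 0-100 scale."""
    missing = set(weights) - set(component_scores)
    if missing:
        raise ValueError(f"missing outcome components: {sorted(missing)}")
    return sum(weights[name] * component_scores[name] for name in weights)


print(headline_score({"gameplay_bot": 80, "sonarqube": 60}))  # 70.0
```

Output metrics such as build quality are deliberately absent from the weights, matching the rule that outputs are tracked but do not affect the headline score.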
## Eval Pipeline (per run, in order)
@@ -56,31 +70,36 @@ All evaluation is deterministic code. No LLM grading.
2. `quality.sh` - ESLint, typecheck, bundle size
3. `code-analysis.py` - 14 code quality metrics
4. `transcript-analysis.py` - agent behavior from conversation log
-5. `gameplay-bot/` - 16 Playwright tests via continuous grid scanning (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
-6. `sonarqube-scan.py` - automated code quality scan
+5. `gameplay-bot/` - two-phase: mechanics test then play-to-win (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
+6. `sonarqube-scan.py` - automated code quality scan (requires SonarQube at localhost:9000)
## Dashboard
Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.
Pages:
-- **Grid** (`/`): per-task summary, bar charts with error bars, filterable cell/run table with sorting
-- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels), variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
+- **Grid** (`/`): per-task summary, box plots for score distribution, filterable cell/run table with sorting
+- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels) with model toggles, variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
- **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
-- **Run detail** (`/run/{id}`): outcome/output score separation, config pills, 6 detail cards, transcript viewer, artifact iframe
-- **Cell detail** (`/cell/{id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
+- **Run detail** (`/run/{id}` or `/r/{short_id}`): outcome/output score separation, all config pills, SonarQube detail card, 6 detail cards, transcript viewer, artifact iframe, link to cell
+- **Cell detail** (`/cell/{id}` or `/c/{short_id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
- **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations
+Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages.
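
A minimal sketch of deriving such a short ID, assuming it is the first 8 hex characters of the SHA-256 digest of the full run or cell ID. What exactly gets hashed, and the `short_id` name itself, are assumptions.

```python
import hashlib


def short_id(full_id: str, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 of full_id."""
    return hashlib.sha256(full_id.encode("utf-8")).hexdigest()[:length]
```

Eight hex characters give 32 bits, so collisions are possible in principle. A build step that generates the redirect pages could check for duplicates and fail loudly.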
+
## Tech
- Harness: Python orchestrator, bash eval scripts
-- Auth: OAuth token from ~/.claude/.credentials.json, keepalive background process
+- Auth: OAuth token from ~/.claude/.credentials.json (Anthropic), ZAI_API_KEY env var (Z.AI)
+- Z.AI gateway: api.z.ai/api/anthropic, Anthropic-compatible API, accepts GLM model names directly
- Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
- Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
- Deploy: Forgejo CI to research subdomain (blue/green)
- Results committed to repo for dashboard build
- SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
+- Test fixtures: tasks/tetris/fixtures/tests-few/ and tests-full/ (Playwright tests given to agent)
+- Noise files: tasks/tetris/noise/ (wikipedia/lorem at calibrated sizes, generated by generate.py)
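
The per-subprocess isolation of `ANTHROPIC_BASE_URL` might look like the sketch below. `build_env` is a hypothetical helper, and the `https://` scheme is an assumption (the notes above only give the host and path). How `ZAI_API_KEY` is forwarded to the CLI is not specified here, so the sketch only checks that it is set.

```python
import os

# Host and path come from the notes above; the https:// scheme is assumed.
ZAI_BASE_URL = "https://api.z.ai/api/anthropic"


def build_env(provider, base_env=None):
    """Return a copy of the environment tailored to one provider."""
    env = dict(os.environ if base_env is None else base_env)
    if provider == "zai":
        if not env.get("ZAI_API_KEY"):
            raise RuntimeError("ZAI_API_KEY must be set for the zai provider")
        env["ANTHROPIC_BASE_URL"] = ZAI_BASE_URL
    elif provider == "anthropic":
        # Drop any inherited override so Anthropic runs use the default endpoint.
        env.pop("ANTHROPIC_BASE_URL", None)
    else:
        raise ValueError(f"unknown provider: {provider!r}")
    return env
```

Passing the result as `env=` to `subprocess.Popen` keeps the override local to that child, so parallel runs against different providers do not see each other's endpoint.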
## Conventions
@@ -88,6 +107,8 @@ Pages:
- Start conservative with resource-intensive settings
- Never use emdashes
- Pre-push hook verifies dashboard build
+- Provider must be explicit (--provider flag required)
+- GLM models use real names (glm-4.5-air), never mapped to haiku/sonnet/opus
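
The two provider conventions above could be enforced at argument-parsing time. This is a sketch using `argparse`. The flag names come from the Test Harness section, but `build_parser`, `validate`, and the model sets are illustrative and may not match how `harness/run.py` is structured.

```python
import argparse

# Model names as listed for each provider in the grid section.
ANTHROPIC_MODELS = {"haiku", "sonnet", "opus"}
ZAI_MODELS = {"glm-4.5-air", "glm-4.7", "glm-5.1"}


def build_parser():
    parser = argparse.ArgumentParser(prog="run.py")
    # No default: the provider must always be stated explicitly.
    parser.add_argument("--provider", required=True, choices=["anthropic", "zai"])
    parser.add_argument("--model")
    parser.add_argument("-j", dest="jobs", type=int, default=1)
    parser.add_argument("-n", "--max-runs", type=int, default=None)
    parser.add_argument("--commit-every", type=int, default=None)
    return parser


def validate(args):
    """Reject model names that do not belong to the chosen provider."""
    allowed = ANTHROPIC_MODELS if args.provider == "anthropic" else ZAI_MODELS
    if args.model is not None and args.model not in allowed:
        raise SystemExit(
            f"model {args.model!r} is not valid for provider {args.provider!r}"
        )
    return args
```

Rejecting `--provider zai --model haiku` outright, rather than silently mapping it to a GLM model, is what keeps results attributable to the model that actually ran.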
## TODO
@@ -96,29 +117,27 @@ Pages:
- [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)
### Eval
-- [ ] Fix gameplay bot false positives: line_clear and score_changes need piecesSpawned > 0 check
- [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
-- [ ] Add SonarQube to outcome_weights once validated
-- [ ] Cyclomatic complexity measurement (or use SonarQube cognitive complexity)
- [ ] Memory leak detection via Playwright heap snapshots
- [ ] Frame rate measurement during gameplay
- [ ] Dead code detection (knip)
- [ ] Wall kick rotation testing
-- [ ] I-piece detection in rotation test needs tuning
### Dashboard
-- [ ] Show SonarQube details on run page (bugs, smells, ratings)
+- [ ] Cell detail page shows 0%/0s for cells with no runs (should show "no data")
- [ ] Show gameplay bot test results on cell detail page
- [ ] Inline Tetris artifact previews in grid table (thumbnails)
- [ ] Re-eval button in UI
+- [ ] Update methodology page with new axes, provider system, GLM models
### Harness
- [ ] OAuth token refresh: verify the refresh endpoint works reliably
-- [ ] Support non-Anthropic models via LiteLLM/Ollama
+- [ ] Compete strategy implementation (two parallel invocations, pick best)
- [ ] Add more tasks beyond Tetris
-- [ ] CodeScene integration when MCP token available ($9/mo waitlist)
+- [ ] irr_code noise type (user will provide Claude session transcript)
+- [ ] poison noise type (user will provide Java coding session)
### Data
-- [ ] Complete sonnet and opus sweeps (some timed out or budget-killed)
+- [ ] Complete Z.AI sweeps (glm-4.7, glm-5.1 main_effects)
- [ ] Interaction hunt on top variables
-- [ ] Full re-eval with new outcome scoring
+- [ ] Test new axes (strategy, tests_provided, design_guidance, etc.) in runs