Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit 9ff262b655d701800bebbf03baa933fd4627cc2e
parent 612cb54f096ac951c4dca312dc65acac5bfa2430
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Mon,  6 Apr 2026 21:08:40 +0200

Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md  | 67 +++++++++++++++++++++++++++++++++++++++++++------------------------

1 file changed, 43 insertions(+), 24 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,47 +8,61 @@ An open benchmark for comparing agentic coding loop configurations. Define the v
 
 The grid is a cartesian product of configuration variables. You define the axes and their possible values; the harness explodes them into the full experiment matrix.
 
-16 axes (see `grid.yaml` for full definition):
-- **Model**: Haiku, Sonnet, Opus (or other providers via LiteLLM)
+23 axes (see `grid.yaml` for full definition):
+- **Model**: Haiku, Sonnet, Opus (Anthropic) / glm-4.5-air, glm-4.7, glm-5.1 (Z.AI)
+- **Provider**: anthropic, zai (controls API endpoint)
 - **Effort**: high, max (extended thinking)
 - **Prompt style**: simple, detailed
 - **Programming language**: TypeScript, JavaScript, unspecified
 - **Human language**: English, Spanish
 - **Individual tools**: Read, Write, Edit, Glob, Grep (each on/off)
-- **Tooling**: Playwright on/off, linter (eslint) on/off
+- **Tooling**: Playwright (off/available/instructed), linter (eslint) on/off
 - **Context**: rules file provided or not
-- **Sub-agents**: enabled/disabled
 - **Web search**: enabled/disabled
 - **Budget**: low ($2.00), high ($10.00)
+- **Strategy**: none, plan_first, iterate, creative_validate, use_subagents, delegate, review, compete (deferred), split_work
+- **Tests provided**: none, a_few (4 tests), many (16 tests)
+- **Design guidance**: none, vague, specific
+- **Architecture**: none, separation, best_practices
+- **Error checking**: none, self_verify
+- **Context noise**: clean, wikipedia_25/50/75, lorem_25/50/75
+- **Renderer**: none, canvas, svg, dom, webgl
 
 ## Test Harness
 
 - Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
+- `--provider` flag required (anthropic or zai) to prevent cross-provider contamination
+- `--model` accepts actual model names (glm-4.5-air, not haiku) for non-anthropic providers
+- `-n N` / `--max-runs N` limits total runs
+- `--commit-every N` analyzes and pushes results every N completed runs
 - OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
+- Z.AI auth via `ZAI_API_KEY` env var, sets `ANTHROPIC_BASE_URL` per-subprocess (isolated)
 - Auth keepalive: background process pings claude every 5 min to refresh token
 - DOE designs: main_effects, plackett_burman, interaction_hunt
+- Modular prompt builder: PROMPT_SNIPPETS dict with EN+ES translations per axis
 - Re-eval existing runs: `python3 harness/reeval.py -j 4`
 - Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
 - Auto-extracts workspace artifacts for dashboard iframe preview
 - Auto-commits and pushes results after sweep completes
 - Invalid run detection: auto-deletes runs with null cost, 1 turn, timeout, invalid API key, or short transcript on resume
+- Serve process cleanup: Popen with start_new_session + os.killpg prevents orphaned HTTP servers
 
 ## Scoring (Input/Output/Outcome Framework)
 
 All evaluation is deterministic code. No LLM grading.
 
 **Inputs** (experiment variables - the grid axes):
-- Model, effort, language, tools, prompt style, etc.
+- Model, provider, effort, language, tools, prompt style, strategy, etc.
 
 **Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
-- **Gameplay bot** (50%): auto-calibrating Tetris player, 16 Playwright tests via continuous 150ms grid polling
-- **Quality** (50%): lint, type check, bundle size
+- **Gameplay bot** (50%): two-phase (mechanics + play-to-win), 60 pieces / 45s, 60ms polling, 16 tests
+- **SonarQube** (50%): cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings
 
 **Outputs** (tracked and displayed, but not in headline score):
+- **Build quality**: lint, type check, bundle size
 - **Structural**: entry point exists, build succeeds
 - **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
 - **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing
-- **SonarQube**: cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings (requires SonarQube at localhost:9000)
 
 ## Eval Pipeline (per run, in order)
 
@@ -56,31 +70,36 @@ All evaluation is deterministic code. No LLM grading.
 2. `quality.sh` - ESLint, typecheck, bundle size
 3. `code-analysis.py` - 14 code quality metrics
 4. `transcript-analysis.py` - agent behavior from conversation log
-5. `gameplay-bot/` - 16 Playwright tests via continuous grid scanning (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
-6. `sonarqube-scan.py` - automated code quality scan
+5. `gameplay-bot/` - two-phase: mechanics test then play-to-win (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
+6. `sonarqube-scan.py` - automated code quality scan (requires SonarQube at localhost:9000)
 
 ## Dashboard
 
 Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.
 
 Pages:
-- **Grid** (`/`): per-task summary, bar charts with error bars, filterable cell/run table with sorting
-- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels), variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
+- **Grid** (`/`): per-task summary, box plots for score distribution, filterable cell/run table with sorting
+- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels) with model toggles, variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
 - **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
 - **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
-- **Run detail** (`/run/{id}`): outcome/output score separation, config pills, 6 detail cards, transcript viewer, artifact iframe
-- **Cell detail** (`/cell/{id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
+- **Run detail** (`/run/{id}` or `/r/{short_id}`): outcome/output score separation, all config pills, SonarQube detail card, 6 detail cards, transcript viewer, artifact iframe, link to cell
+- **Cell detail** (`/cell/{id}` or `/c/{short_id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
 - **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations
 
+Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages.
+
 ## Tech
 
 - Harness: Python orchestrator, bash eval scripts
-- Auth: OAuth token from ~/.claude/.credentials.json, keepalive background process
+- Auth: OAuth token from ~/.claude/.credentials.json (Anthropic), ZAI_API_KEY env var (Z.AI)
+- Z.AI gateway: api.z.ai/api/anthropic, Anthropic-compatible API, accepts GLM model names directly
 - Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
 - Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
 - Deploy: Forgejo CI to research subdomain (blue/green)
 - Results committed to repo for dashboard build
 - SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
+- Test fixtures: tasks/tetris/fixtures/tests-few/ and tests-full/ (Playwright tests given to agent)
+- Noise files: tasks/tetris/noise/ (wikipedia/lorem at calibrated sizes, generated by generate.py)
 
 ## Conventions
 
@@ -88,6 +107,8 @@ Pages:
 - Start conservative with resource-intensive settings
 - Never use emdashes
 - Pre-push hook verifies dashboard build
+- Provider must be explicit (--provider flag required)
+- GLM models use real names (glm-4.5-air), never mapped to haiku/sonnet/opus
 
 ## TODO
 
@@ -96,29 +117,27 @@ Pages:
 - [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)
 
 ### Eval
-- [ ] Fix gameplay bot false positives: line_clear and score_changes need piecesSpawned > 0 check
 - [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
-- [ ] Add SonarQube to outcome_weights once validated
-- [ ] Cyclomatic complexity measurement (or use SonarQube cognitive complexity)
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
 - [ ] Wall kick rotation testing
-- [ ] I-piece detection in rotation test needs tuning
 
 ### Dashboard
-- [ ] Show SonarQube details on run page (bugs, smells, ratings)
+- [ ] Cell detail page shows 0%/0s for cells with no runs (should show "no data")
 - [ ] Show gameplay bot test results on cell detail page
 - [ ] Inline Tetris artifact previews in grid table (thumbnails)
 - [ ] Re-eval button in UI
+- [ ] Update methodology page with new axes, provider system, GLM models
 
 ### Harness
 - [ ] OAuth token refresh: verify the refresh endpoint works reliably
-- [ ] Support non-Anthropic models via LiteLLM/Ollama
+- [ ] Compete strategy implementation (two parallel invocations, pick best)
 - [ ] Add more tasks beyond Tetris
-- [ ] CodeScene integration when MCP token available ($9/mo waitlist)
+- [ ] irr_code noise type (user will provide Claude session transcript)
+- [ ] poison noise type (user will provide Java coding session)
 
 ### Data
-- [ ] Complete sonnet and opus sweeps (some timed out or budget-killed)
+- [ ] Complete Z.AI sweeps (glm-4.7, glm-5.1 main_effects)
 - [ ] Interaction hunt on top variables
-- [ ] Full re-eval with new outcome scoring
+- [ ] Test new axes (strategy, tests_provided, design_guidance, etc.) in runs
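
The Grid section in this diff describes the experiment matrix as a cartesian product of axis values. A minimal sketch of that expansion follows. It assumes a plain dict of axes; the function name `explode_grid` and the three-axis subset are hypothetical illustrations, not the harness's actual code or the real `grid.yaml` schema.

```python
import itertools


def explode_grid(axes):
    """Return one config dict per combination of axis values (cartesian product)."""
    names = list(axes)
    value_lists = [axes[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


# Hypothetical subset of the axes; the value names come from the CLAUDE.md list.
axes = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "budget": ["low", "high"],
}

cells = explode_grid(axes)
print(len(cells))  # 3 * 2 * 2 = 12
print(cells[0])    # {'model': 'haiku', 'effort': 'high', 'budget': 'low'}
```

The cell count is the product of the axis sizes, so each added axis multiplies the matrix. That growth is presumably why the harness also lists DOE designs (main_effects, plackett_burman, interaction_hunt) that run only a subset of cells.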

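The "Serve process cleanup" bullet added in this diff names `Popen` with `start_new_session` plus `os.killpg` as the way to avoid orphaned HTTP servers. The sketch below shows that pattern on POSIX systems. The helper names `start_in_own_group` and `stop_group` are hypothetical, and a sleeping Python child stands in for the real serve command, which the diff does not show.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(cmd):
    """Start cmd as the leader of a new session, and therefore a new process group."""
    # start_new_session=True makes the child call setsid() before exec,
    # so its process group ID equals its own PID.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, timeout=5.0):
    """Signal the whole process group so children of the server die with it."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # group already gone
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate if the group ignored SIGTERM.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
    return proc.returncode


# Stand-in for a long-running server process (assumption for illustration).
proc = start_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
code = stop_group(proc)
print(code)  # negative signal number when the child was killed by a signal
```

Signalling the group rather than the single PID matters when the serve command is a wrapper (for example a shell or package runner) that forks the real server: killing only the wrapper would leave the server running.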
	git clone https://git.shiptheloop.com/loop-benchmarking.git