loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 9ff262b655d701800bebbf03baa933fd4627cc2e
parent 612cb54f096ac951c4dca312dc65acac5bfa2430
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Mon,  6 Apr 2026 21:08:40 +0200

Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md | 67 +++++++++++++++++++++++++++++++++++++++++++------------------------
1 file changed, 43 insertions(+), 24 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,47 +8,61 @@ An open benchmark for comparing agentic coding loop configurations. Define the v

 The grid is a cartesian product of configuration variables. You define the axes and their possible values; the harness explodes them into the full experiment matrix.

-16 axes (see `grid.yaml` for full definition):
-- **Model**: Haiku, Sonnet, Opus (or other providers via LiteLLM)
+23 axes (see `grid.yaml` for full definition):
+- **Model**: Haiku, Sonnet, Opus (Anthropic) / glm-4.5-air, glm-4.7, glm-5.1 (Z.AI)
+- **Provider**: anthropic, zai (controls API endpoint)
 - **Effort**: high, max (extended thinking)
 - **Prompt style**: simple, detailed
 - **Programming language**: TypeScript, JavaScript, unspecified
 - **Human language**: English, Spanish
 - **Individual tools**: Read, Write, Edit, Glob, Grep (each on/off)
-- **Tooling**: Playwright on/off, linter (eslint) on/off
+- **Tooling**: Playwright (off/available/instructed), linter (eslint) on/off
 - **Context**: rules file provided or not
-- **Sub-agents**: enabled/disabled
 - **Web search**: enabled/disabled
 - **Budget**: low ($2.00), high ($10.00)
+- **Strategy**: none, plan_first, iterate, creative_validate, use_subagents, delegate, review, compete (deferred), split_work
+- **Tests provided**: none, a_few (4 tests), many (16 tests)
+- **Design guidance**: none, vague, specific
+- **Architecture**: none, separation, best_practices
+- **Error checking**: none, self_verify
+- **Context noise**: clean, wikipedia_25/50/75, lorem_25/50/75
+- **Renderer**: none, canvas, svg, dom, webgl

 ## Test Harness

 - Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
+- `--provider` flag required (anthropic or zai) to prevent cross-provider contamination
+- `--model` accepts actual model names (glm-4.5-air, not haiku) for non-anthropic providers
+- `-n N` / `--max-runs N` limits total runs
+- `--commit-every N` analyzes and pushes results every N completed runs
 - OAuth auth via `--bare` + `apiKeyHelper` script (reads from ~/.claude/.credentials.json)
+- Z.AI auth via `ZAI_API_KEY` env var, sets `ANTHROPIC_BASE_URL` per-subprocess (isolated)
 - Auth keepalive: background process pings claude every 5 min to refresh token
 - DOE designs: main_effects, plackett_burman, interaction_hunt
+- Modular prompt builder: PROMPT_SNIPPETS dict with EN+ES translations per axis
 - Re-eval existing runs: `python3 harness/reeval.py -j 4`
 - Full pipeline: `python3 harness/clean-and-reeval.py -j 4`
 - Auto-extracts workspace artifacts for dashboard iframe preview
 - Auto-commits and pushes results after sweep completes
 - Invalid run detection: auto-deletes runs with null cost, 1 turn, timeout, invalid API key, or short transcript on resume
+- Serve process cleanup: Popen with start_new_session + os.killpg prevents orphaned HTTP servers

 ## Scoring (Input/Output/Outcome Framework)

 All evaluation is deterministic code. No LLM grading.

 **Inputs** (experiment variables - the grid axes):
-- Model, effort, language, tools, prompt style, etc.
+- Model, provider, effort, language, tools, prompt style, strategy, etc.

 **Outcomes** (the headline score, defined in `tasks/tetris/scoring.yaml`):
-- **Gameplay bot** (50%): auto-calibrating Tetris player, 16 Playwright tests via continuous 150ms grid polling
-- **Quality** (50%): lint, type check, bundle size
+- **Gameplay bot** (50%): two-phase (mechanics + play-to-win), 60 pieces / 45s, 60ms polling, 16 tests
+- **SonarQube** (50%): cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings

 **Outputs** (tracked and displayed, but not in headline score):
+- **Build quality**: lint, type check, bundle size
 - **Structural**: entry point exists, build succeeds
 - **Code analysis**: file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation, magic numbers, comments ratio
 - **Transcript analysis**: agent efficiency, wasted turns (docs, ASCII art, server starts), productivity ratio, self-testing
-- **SonarQube**: cognitive complexity, bugs, vulnerabilities, code smells, maintainability/reliability/security ratings (requires SonarQube at localhost:9000)

 ## Eval Pipeline (per run, in order)

@@ -56,31 +70,36 @@ All evaluation is deterministic code. No LLM grading.
 2. `quality.sh` - ESLint, typecheck, bundle size
 3. `code-analysis.py` - 14 code quality metrics
 4. `transcript-analysis.py` - agent behavior from conversation log
-5. `gameplay-bot/` - 16 Playwright tests via continuous grid scanning (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
-6. `sonarqube-scan.py` - automated code quality scan
+5. `gameplay-bot/` - two-phase: mechanics test then play-to-win (adapted from MIT-licensed LeeYiyuan/tetrisai and mikhail-vlasenko/Tetris-AI)
+6. `sonarqube-scan.py` - automated code quality scan (requires SonarQube at localhost:9000)

 ## Dashboard

 Static Astro site with React islands. SMUI design system (JetBrains Mono, Nord palette). Light/dark theme toggle.

 Pages:
-- **Grid** (`/`): per-task summary, bar charts with error bars, filterable cell/run table with sorting
-- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels), variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
+- **Grid** (`/`): per-task summary, box plots for score distribution, filterable cell/run table with sorting
+- **Insights** (`/insights`): surprise cards, convex hull scatter plots (4 density levels) with model toggles, variability analysis (box plots, reliability ranking, ANOVA decomposition), tornado chart with variance bands, interaction heatmap
 - **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
 - **Compare** (`/compare`): cell-based aggregate stats with score/cost ranges per axis value
-- **Run detail** (`/run/{id}`): outcome/output score separation, config pills, 6 detail cards, transcript viewer, artifact iframe
-- **Cell detail** (`/cell/{id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
+- **Run detail** (`/run/{id}` or `/r/{short_id}`): outcome/output score separation, all config pills, SonarQube detail card, 6 detail cards, transcript viewer, artifact iframe, link to cell
+- **Cell detail** (`/cell/{id}` or `/c/{short_id}`): run comparison table, artifact gallery, variance stats, agent behavior comparison
 - **Methodology** (`/methodology`): scoring framework, DOE design, gameplay bot phases, known limitations

+Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages.
+
 ## Tech

 - Harness: Python orchestrator, bash eval scripts
-- Auth: OAuth token from ~/.claude/.credentials.json, keepalive background process
+- Auth: OAuth token from ~/.claude/.credentials.json (Anthropic), ZAI_API_KEY env var (Z.AI)
+- Z.AI gateway: api.z.ai/api/anthropic, Anthropic-compatible API, accepts GLM model names directly
 - Dashboard: Astro + React + recharts. Types split: types.ts (client-safe) vs data.ts (server-only)
 - Artifacts: stored at project root `artifacts/` (not in dashboard/public/ to avoid 13GB build). Deploy rsyncs separately.
 - Deploy: Forgejo CI to research subdomain (blue/green)
 - Results committed to repo for dashboard build
 - SonarQube Community Edition at localhost:9000 (Java 17, standalone JAR)
+- Test fixtures: tasks/tetris/fixtures/tests-few/ and tests-full/ (Playwright tests given to agent)
+- Noise files: tasks/tetris/noise/ (wikipedia/lorem at calibrated sizes, generated by generate.py)

 ## Conventions

@@ -88,6 +107,8 @@ Pages:
 - Start conservative with resource-intensive settings
 - Never use emdashes
 - Pre-push hook verifies dashboard build
+- Provider must be explicit (--provider flag required)
+- GLM models use real names (glm-4.5-air), never mapped to haiku/sonnet/opus

 ## TODO

@@ -96,29 +117,27 @@ Pages:
 - [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)

 ### Eval
-- [ ] Fix gameplay bot false positives: line_clear and score_changes need piecesSpawned > 0 check
 - [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
-- [ ] Add SonarQube to outcome_weights once validated
-- [ ] Cyclomatic complexity measurement (or use SonarQube cognitive complexity)
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
 - [ ] Wall kick rotation testing
-- [ ] I-piece detection in rotation test needs tuning

 ### Dashboard
-- [ ] Show SonarQube details on run page (bugs, smells, ratings)
+- [ ] Cell detail page shows 0%/0s for cells with no runs (should show "no data")
 - [ ] Show gameplay bot test results on cell detail page
 - [ ] Inline Tetris artifact previews in grid table (thumbnails)
 - [ ] Re-eval button in UI
+- [ ] Update methodology page with new axes, provider system, GLM models

 ### Harness
 - [ ] OAuth token refresh: verify the refresh endpoint works reliably
-- [ ] Support non-Anthropic models via LiteLLM/Ollama
+- [ ] Compete strategy implementation (two parallel invocations, pick best)
 - [ ] Add more tasks beyond Tetris
-- [ ] CodeScene integration when MCP token available ($9/mo waitlist)
+- [ ] irr_code noise type (user will provide Claude session transcript)
+- [ ] poison noise type (user will provide Java coding session)

 ### Data
-- [ ] Complete sonnet and opus sweeps (some timed out or budget-killed)
+- [ ] Complete Z.AI sweeps (glm-4.7, glm-5.1 main_effects)
 - [ ] Interaction hunt on top variables
-- [ ] Full re-eval with new outcome scoring
+- [ ] Test new axes (strategy, tests_provided, design_guidance, etc.) in runs
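
Illustration (not part of the commit): the CLAUDE.md text in the diff describes the grid as a cartesian product that the harness "explodes" into an experiment matrix. A minimal sketch of that expansion follows. It assumes a hypothetical four-axis subset; the `AXES` dict and the `expand_grid` helper are made-up names for illustration, and the real axis definitions live in `grid.yaml`, which this diff does not show.

```python
from itertools import product

# Hypothetical subset of axes for illustration only; the real
# definition lives in grid.yaml and is not shown in this commit.
AXES = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "prompt_style": ["simple", "detailed"],
    "budget": ["low", "high"],
}


def expand_grid(axes):
    """Return one config dict per combination of axis values."""
    names = list(axes)
    value_lists = [axes[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


cells = expand_grid(AXES)
print(len(cells))  # 3 * 2 * 2 * 2 = 24
```

The cell count is the product of the axis sizes, so a full 23-axis product grows very quickly. That is consistent with the harness listing DOE designs (main_effects, plackett_burman, interaction_hunt) rather than relying only on exhaustive sweeps, though the diff does not state that motivation explicitly.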

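Illustration (not part of the commit): the "Serve process cleanup" bullet names `Popen` with `start_new_session` plus `os.killpg`. A minimal POSIX-only sketch of that pattern follows. The `start_server` and `stop_server` helpers are hypothetical names, and the harness's actual serve command and timeout values are not shown in the diff.

```python
import os
import signal
import subprocess
import sys


def start_server(cmd):
    """Start cmd as the leader of a new session, and so of a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_server(proc, timeout=5.0):
    """Signal the whole process group so no child outlives the run."""
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate if the group ignored SIGTERM.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        return proc.wait()
```

For example, `stop_server(start_server([sys.executable, "-m", "http.server", "8000"]))` starts and then tears down a stand-in HTTP server. Because the child leads its own process group, any processes it spawns are signalled together, whereas `proc.terminate()` would signal only the direct child.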