loop-benchmarking, branch HEAD

Drop aborted glm-4.5-air run 0c19668a

2026-04-16T14:47:08Z

commit bbd676244fcbed8e4b2767bc3325769489b5da48 parent dd1bad2d5568fcda6b6b52af5c89baae4e8991ad Author: Brian Graham Date: Thu, 16 Apr 2026 16:47:08 +0200 Drop aborted glm-4.5-air run 0c19668a No claude_output.json, empty stderr, 40-line transcript, unset exit_code/wall_time -- same failure pattern as prior Z.AI purges. Rebuild PCA against the resulting 510-run dataset. Co-Authored-By: Claude Opus 4.6 (1M context)

Analyze and push 511 runs

2026-04-16T14:46:44Z

commit dd1bad2d5568fcda6b6b52af5c89baae4e8991ad parent 0f44859d3eac85c7dc11272eacc4e57fda2b14c0 Author: Brian Graham Date: Thu, 16 Apr 2026 16:46:44 +0200 Analyze and push 511 runs

Project runs across all dashboard pages

2026-04-16T14:45:12Z

commit 0f44859d3eac85c7dc11272eacc4e57fda2b14c0 parent 3b81eb9246542dee665795aeb510ae2ced79f03b Author: Brian Graham Date: Thu, 16 Apr 2026 16:45:12 +0200

Project runs across all dashboard pages

Apply the projectRunForIndex pattern from index.astro to insights, explore, compare, pca, and surprises pages. All four active pages (insights, explore, compare, surprises) need only summary fields already covered by projectRunForIndex -- no new projectors required. pca.astro passes pre-computed JSON, not runs, so no change needed.

Before (raw / gzipped):
  insights:    34.0 MB / ~3.1 MB
  explore:     50.8 MB / ~5.1 MB
  compare:     8.5 MB / ~800 KB
  surprises:   8.4 MB / ~800 KB
  dist/ total: 344 MB

After (raw / gzipped):
  insights:    6.0 MB / 222 KB
  explore:     8.8 MB / 318 KB
  compare:     1.5 MB / 57 KB
  surprises:   1.4 MB / 55 KB
  dist/ total: 263 MB

Co-Authored-By: Claude Opus 4.6 (1M context)

Project runs before serializing into index-page islands

2026-04-16T14:32:06Z

commit 3b81eb9246542dee665795aeb510ae2ced79f03b parent 0af972817d114910874b95bc4ec84298b7511e40 Author: Brian Graham Date: Thu, 16 Apr 2026 16:32:06 +0200 Project runs before serializing into index-page islands index.astro passes the full Run[] to 4 client:load islands (StatisticalPowerCard, Charts, TopBottomConfigs, Grid). Astro serializes each island's props independently, so the full eval_results payload (gameplay bot report with per-test details, SonarQube details, code_analysis, transcript_analysis) was embedded four times, once per island -- ~34 MB of HTML on a 510-run dataset. Add projectRunForIndex() in data.ts that returns a Run-shaped object containing only the fields these islands and analysis.groupIntoCells actually read (score summaries, functional.pass, cost, num_turns). Call it once in index.astro and pass the slim array to all four islands. dist/index.html: 34 MB -> 5.9 MB raw, 3.1 MB -> 217 KB gzipped. Co-Authored-By: Claude Opus 4.6 (1M context)
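
The projection idea in this commit can be sketched as follows. This is an illustration in Python rather than the TypeScript of data.ts, and the field layout (`score`, `functional.pass`, `cost`, `num_turns` as dictionary keys) and both function names are assumptions made for the example, not the repository's actual schema.

```python
import json


def project_run_for_index(run):
    """Return a slim copy of a run with only the summary fields kept.

    Hypothetical field names; the real Run type may differ.
    """
    ev = run.get("eval_results", {})
    return {
        "run_id": run.get("run_id"),
        "cost": run.get("cost"),
        "num_turns": run.get("num_turns"),
        "eval_results": {
            "score": ev.get("score"),
            "functional": {"pass": ev.get("functional", {}).get("pass")},
        },
    }


def embedded_bytes(runs, islands=4):
    """Estimate page weight when each island serializes its own copy of the props."""
    return islands * len(json.dumps(runs))
```

Projecting once and passing the same slim list to every island keeps the per-island copy small, which is where the saving comes from when the props are serialized independently.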

Rebuild PCA from post-reeval 510-run dataset

2026-04-16T13:53:49Z

commit 0af972817d114910874b95bc4ec84298b7511e40 parent 46364ff78312c3c0d5d647e2f6d59c0c40345cec Author: Brian Graham Date: Thu, 16 Apr 2026 15:53:49 +0200 Rebuild PCA from post-reeval 510-run dataset Co-Authored-By: Claude Opus 4.6 (1M context)

Analyze and push 512 runs

2026-04-16T13:53:40Z

commit 46364ff78312c3c0d5d647e2f6d59c0c40345cec parent 03f7652cb15c203683d9239f08dc22efbb51b1b5 Author: Brian Graham Date: Thu, 16 Apr 2026 15:53:40 +0200 Analyze and push 512 runs

Full reeval on GPU machine: V2 bot + SonarQube

2026-04-16T13:50:27Z

commit 03f7652cb15c203683d9239f08dc22efbb51b1b5 parent b499a01fb7df37b81f26449bd66bfd4cf68de116 Author: Brian Graham Date: Thu, 16 Apr 2026 15:50:27 +0200 Full reeval on GPU machine: V2 bot + SonarQube All 510 runs re-evaluated at -j 20. SonarQube Community 9.9.8 started locally for the scan; sonarqube-scan.py already updated from sonar.token to sonar.login for version compat. Co-Authored-By: Claude Opus 4.6 (1M context)

900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79%

2026-04-16T11:08:21Z

commit b499a01fb7df37b81f26449bd66bfd4cf68de116 parent fd8274318dc475fe75d10c9b588c4af38d451c91 Author: Brian Graham Date: Thu, 16 Apr 2026 13:08:21 +0200

900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79%

Three changes to the gameplay bot pipeline:
- Raise harness/run.py bot subprocess timeout from 300s to 900s.
- Raise playwright.config.ts and index.ts test timeout from 360s to 900s.
- Driver-level inactivity watchdog: readGrid()/wait() throw InactivityAbortError when 120s pass without a successful grid read (armed only after the game confirms started). bot.ts wraps Phases 3-8 in a guard that catches the abort and still writes a partial report.

Fix calibration run_id mappings: commit 711df365 retagged 176 anthropic runs with prov=anth and bumped colliding run_num, leaving 13 of 17 calibration JSONs pointing at stale paths. Remap from the retag commit's rename list.

Results on 17 calibration runs (j=20 on RTX 4070 / Ryzen 9):
- Bot-vs-human agreement: 90/189 (48%) -> 181/228 (79%)
- grid_detected: 9/17 -> 16/17
- renderer=unknown: 8/17 -> 1/17 (the one remaining is c1013100 which legitimately fails to load per human label)
- Every anthropic run that previously got killed mid-calibration now finishes with proper detection and scores in the 0.67-1.00 range.
- Wall-clock at j=20: 6m31s; at j=5: 16m48s; identical grid/renderer output across both, no inactivity aborts in either.

Co-Authored-By: Claude Opus 4.6 (1M context)
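
A minimal sketch of the inactivity watchdog described above, written in Python for illustration (the driver itself is TypeScript). Only `InactivityAbortError` and the 120s limit come from the commit message; the class name, method names, the injectable clock, and `run_phases` are assumptions.

```python
import time


class InactivityAbortError(RuntimeError):
    """Raised when no successful grid read has happened for too long."""


class InactivityWatchdog:
    def __init__(self, limit_s=120.0, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock
        self.last_ok = None  # None means "not armed yet"

    def arm(self):
        """Start watching; call once the game is confirmed started."""
        self.last_ok = self.clock()

    def note_success(self):
        """Record a successful grid read (ignored until armed)."""
        if self.last_ok is not None:
            self.last_ok = self.clock()

    def check(self):
        """Raise if armed and the limit has passed since the last success."""
        if self.last_ok is not None and self.clock() - self.last_ok > self.limit_s:
            raise InactivityAbortError(f"no grid read for {self.limit_s:.0f}s")


def run_phases(phases, watchdog, report):
    """Run phases in order; on abort, keep whatever was collected so far."""
    try:
        for name, phase in phases:
            watchdog.check()
            report[name] = phase()
    except InactivityAbortError as exc:
        report["aborted"] = str(exc)
    return report
```

Because the watchdog does nothing until `arm()` is called, a slow start screen cannot trigger an abort; only a stall after the game is confirmed running does.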

Add human labels for 3 more calibration runs

2026-04-16T10:10:23Z

commit fd8274318dc475fe75d10c9b588c4af38d451c91 parent 711df365354d81be00d01bce2428e7c283e0ec2b Author: Brian Graham Date: Thu, 16 Apr 2026 12:10:23 +0200 Add human labels for 3 more calibration runs bbb70053 (haiku-4.5 DOM) flagged as very laggy -- playable but the lag hurts playability. c1013100 (gemma-4-26b) fails to load. e047cf3a (haiku-4.5) plays fully. Co-Authored-By: Claude Opus 4.6 (1M context)

Retag 176 pre-provider anthropic runs with prov=anth in cell_id

2026-04-16T09:58:46Z

commit 711df365354d81be00d01bce2428e7c283e0ec2b parent f801efc9b7f7880049fdeeeed53d55f0ecae5ecc Author: Brian Graham Date: Thu, 16 Apr 2026 11:58:46 +0200

Retag 176 pre-provider anthropic runs with prov=anth in cell_id

These runs were created before the 'provider' axis was introduced. The earlier legacy migration added provider='anthropic' to each meta.json but didn't regenerate cell_ids to include the prov=anth segment, leaving them invisible to the current main_effects coverage check even though the run data itself was intact.

This pass rebuilds each cell_id with the current AXIS_ABBREV/VALUE_ABBREV logic, renames run and artifact directories, updates meta.json's cell_id and run_id, and rewrites results/index.jsonl. Collisions where a post-provider run already occupied the target slot were resolved by bumping run_num (Option C: kept both as additional replicates).

Impact:
- haiku-4.5: 73 retagged
- sonnet-4.6: 52 retagged
- opus-4.6: 51 retagged
- 194 total anthropic runs preserved, none deleted

Co-Authored-By: Claude Opus 4.6 (1M context)

Add human trial labels for 4 calibration runs

2026-04-16T07:54:00Z

commit f801efc9b7f7880049fdeeeed53d55f0ecae5ecc parent 2fae566a4db8aae4b68eb9a4ff587a6a8e4245a2 Author: Brian Graham Date: Thu, 16 Apr 2026 09:54:00 +0200 Add human trial labels for 4 calibration runs Covers qwen-3.6-plus, haiku-4.5, opus-4.6, glm-5.1 (each with strat=usub or strat=none). All four are reported playable by the human tester but the bot currently scores them near zero because renderer detection fails (renderer=unknown, grid_detected=false), so aggregate bot-vs-human agreement drops to 47.6%. Co-Authored-By: Claude Opus 4.6 (1M context)

Preserve gameplay bot report on timeout

2026-04-16T07:06:51Z

commit 2fae566a4db8aae4b68eb9a4ff587a6a8e4245a2 parent 07408995fff0b06d234c4a19fdff6a2dae5b5028 Author: Brian Graham Date: Thu, 16 Apr 2026 09:06:51 +0200 Preserve gameplay bot report on timeout When the bot subprocess hit its 300s timeout, the report file written during the run was discarded, so eval_results.json.gameplay_bot had no test data and the /calibrate dashboard couldn't display results. On timeout, read the report if it exists and set timed_out=true. Re-eval 17 calibration runs with the fix: all now carry populated report.tests in eval_results.json. Co-Authored-By: Claude Opus 4.6 (1M context)
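
The timeout handling described above might look roughly like this. It is a sketch under assumptions: `run_bot`, its parameters, and the report layout are hypothetical, and only the behaviour (read the report if it exists, set `timed_out`) comes from the commit message.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_bot(cmd, report_path, timeout_s):
    """Run the bot and return its report, keeping partial data on timeout."""
    timed_out = False
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        timed_out = True

    report = {}
    path = Path(report_path)
    if path.exists():
        try:
            report = json.loads(path.read_text())
        except json.JSONDecodeError:
            report = {}  # file was only partially written
    report["timed_out"] = timed_out
    return report
```

The key point is that the report is read after the `try` block in both the normal and the timeout path, so a run that is killed late still contributes whatever tests it had already written.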

Remove 39 invalid glm-4.7 runs and add new sweep results

2026-04-16T05:59:34Z

commit 07408995fff0b06d234c4a19fdff6a2dae5b5028 parent baa0f5098d7e7bd291146968e751bbfe5ee255e4 Author: Brian Graham Date: Thu, 16 Apr 2026 07:59:34 +0200 Remove 39 invalid glm-4.7 runs and add new sweep results Purged zero-turn 429s from glm-4.7 sweep (Z.AI rate-limited the model hard during a ~7.5h window). Also includes the successful glm-4.7 runs from the same sweep and fresh glm-5.1 runs. glm-5.1: 123 clean runs, 0 bad glm-4.7: 55 clean runs retained, 39 bad removed Co-Authored-By: Claude Opus 4.6 (1M context)

Add 18 new runs (458 total)

2026-04-15T14:27:17Z

commit baa0f5098d7e7bd291146968e751bbfe5ee255e4 parent 4b971780246afdc6e97d3ed1d6c2aa6dcdeaa931 Author: Brian Graham Date: Wed, 15 Apr 2026 16:27:17 +0200 Add 18 new runs (458 total) Profile: main_effects Completed: 18 | Skipped: 109 | Failed: 20

Re-eval 17 calibration runs; fix reeval.py artifact cleanup

2026-04-15T13:41:27Z

commit 4b971780246afdc6e97d3ed1d6c2aa6dcdeaa931 parent e82be6aca0708fad30ff11975bd16e5be13f53ff Author: Brian Graham Date: Wed, 15 Apr 2026 15:41:27 +0200 Re-eval 17 calibration runs; fix reeval.py artifact cleanup Previous artifact-path fix broke the cleanup safety check: rmtree guard still matched dashboard/ when artifact_dir moved to artifacts/, so successful reevals wiped the agent-generated game code. Update the guard to match the new artifacts/ path. Bot vs human agreement on 17 calibration runs: 72/116 (62.1%). DOM-detected games agree at ~91%. Co-Authored-By: Claude Opus 4.6 (1M context)

Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs

2026-04-15T11:47:11Z

commit e82be6aca0708fad30ff11975bd16e5be13f53ff parent 6678831b7fac8cd35467d4539afc9ce70d68d388 Author: Brian Graham Date: Wed, 15 Apr 2026 13:47:11 +0200

Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs

Three fixes in one pass:

1. get_axes() used to silently fall back to the full top-level grid when given an unknown profile name. With 23 axes this expands to ~40B cartesian combinations, and the process OOMed the host (7.6GB+ before swap-stormed into D-state). Now it raises ValueError listing the known profiles.

2. compute_cells() accumulated every cell in a list before returning. Even with lazy itertools.product, building the intermediate list defeats it. Converted to a generator yielding one cell at a time. Streaming the 'full' profile now peaks at ~12MB RSS instead of unbounded growth. The only in-repo consumer (harness/run.py) already materializes via a list comprehension, so the change is transparent there.

3. compute_grid.py now recognizes the DOE design names (main_effects, plackett_burman, interaction_hunt) and dispatches to experiment_design.py. Previously 'compute_grid.py grid.yaml main_effects' triggered the silent fallback (bug #1) because main_effects is a design, not a profile. Now it produces the expected one-at-a-time sweep.

Unknown names now print the full list of valid profiles and designs instead of silently misbehaving.

Co-Authored-By: Claude Opus 4.6 (1M context)
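
Fixes 1 and 2 can be illustrated with a short sketch. The function names `get_axes` and `compute_cells` come from the commit message, but their signatures, the `GRID` layout, and the example profile are assumptions made for this illustration.

```python
import itertools
from typing import Dict, Iterator, List

# Hypothetical grid layout: profile name -> axis name -> candidate values.
GRID: Dict[str, Dict[str, List[str]]] = {
    "smoke": {"model": ["haiku-4.5", "glm-5.1"], "strat": ["none", "usub"]},
}


def get_axes(grid, profile):
    """Return the axes for a profile, failing loudly on unknown names."""
    if profile not in grid:
        known = ", ".join(sorted(grid))
        raise ValueError(f"unknown profile {profile!r}; known profiles: {known}")
    return grid[profile]


def compute_cells(grid, profile) -> Iterator[Dict[str, str]]:
    """Yield one cell at a time instead of building the full list."""
    axes = get_axes(grid, profile)
    names = list(axes)
    for combo in itertools.product(*(axes[name] for name in names)):
        yield dict(zip(names, combo))
```

Since `compute_cells` is a generator function, the `ValueError` surfaces when the caller first iterates, not when the function is called. A consumer that wraps the call in a list comprehension sees the error immediately either way.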

Remove 20 more zero-turn 429 runs from glm-5.1 sweep

2026-04-15T09:37:13Z

commit 6678831b7fac8cd35467d4539afc9ce70d68d388 parent d4faa51819846758ef1376070eaf8d72ebbfa48b Author: Brian Graham Date: Wed, 15 Apr 2026 11:37:13 +0200 Remove 20 more zero-turn 429 runs from glm-5.1 sweep Same Z.AI rate-limit pattern as previous purges: first request of each run gets 429'd, Claude CLI retries 10x, run dies ~200s in with num_turns=1 and no work product. Purged to keep the benchmark clean. Co-Authored-By: Claude Opus 4.6 (1M context)

Add 20 new runs (460 total)

2026-04-15T08:03:00Z

commit d4faa51819846758ef1376070eaf8d72ebbfa48b parent 1d3667655369838306205f50bcc8210fe54f3b6f Author: Brian Graham Date: Wed, 15 Apr 2026 10:03:00 +0200 Add 20 new runs (460 total) Profile: main_effects Completed: 20 | Skipped: 109 | Failed: 18

Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn)

2026-04-15T05:03:15Z

commit 1d3667655369838306205f50bcc8210fe54f3b6f parent 610f73ccd69efee8791e3090594c1ad45dabc63b Author: Brian Graham Date: Wed, 15 Apr 2026 07:03:15 +0200 Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn) 14 runs from the recent sweep hit Z.AI rate limits on the first turn despite correct auth, 2 older runs got cut off partway through by 429s, and 4 stale pre-fix runs logged 1 turn with non-zero cost. None have usable outcome data. Co-Authored-By: Claude Opus 4.6 (1M context)

Add 14 new runs (460 total)

2026-04-15T02:30:23Z

commit 610f73ccd69efee8791e3090594c1ad45dabc63b parent 238f1a535996225cdf2c4e054730278cfae58f0f Author: Brian Graham Date: Wed, 15 Apr 2026 04:30:23 +0200 Add 14 new runs (460 total) Profile: main_effects Completed: 14 | Skipped: 115 | Failed: 18

Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers

2026-04-14T20:42:18Z

commit 238f1a535996225cdf2c4e054730278cfae58f0f parent 7418a1208c757383b4559e0bc32b40526ac75cb7 Author: Brian Graham Date: Tue, 14 Apr 2026 22:42:18 +0200 Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers apiKeyHelper in --settings returned an Anthropic OAuth token and overrode the ANTHROPIC_AUTH_TOKEN env var, so every zai (and openrouter) request authenticated with the wrong credential. Z.AI responded with 429 on the first turn, Claude CLI retried 10x, and the run died after ~200s with zero useful work. Now apiKeyHelper is only set when provider has no base_url override, so env-var auth flows through for zai/openrouter. Also commits ~30 new glm-5.1 runs from the main_effects sweep that completed cleanly after the fix, minus 5 purged invalid runs (429/aborted/zero-turn) captured before the fix landed. Co-Authored-By: Claude Opus 4.6 (1M context)
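
The rule "only set apiKeyHelper when the provider has no base_url override" can be sketched as below. The provider dictionary shape, both function names, the helper command string, and the use of the ANTHROPIC_BASE_URL variable are assumptions for illustration; only apiKeyHelper and ANTHROPIC_AUTH_TOKEN are named in the commit message.

```python
def build_cli_settings(provider, helper_cmd="print-anthropic-token"):
    """Build the settings dict passed via --settings.

    apiKeyHelper is included only for the default endpoint, so it can
    never override env-var auth for a provider with its own base_url.
    """
    settings = {}
    if not provider.get("base_url"):
        settings["apiKeyHelper"] = helper_cmd  # hypothetical helper command
    return settings


def build_env(provider):
    """Build the extra environment variables for a non-default provider."""
    env = {}
    if provider.get("base_url"):
        env["ANTHROPIC_BASE_URL"] = provider["base_url"]  # assumed variable name
        env["ANTHROPIC_AUTH_TOKEN"] = provider["auth_token"]
    return env
```

With this split, exactly one credential source is active per run: the helper for the default endpoint, or the environment token for providers such as zai and openrouter.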

Add 1 new runs (393 total)

2026-04-14T11:08:09Z

commit 7418a1208c757383b4559e0bc32b40526ac75cb7 parent c79224fc666a6101927d5f64238f4331d9939ea5 Author: Brian Graham Date: Tue, 14 Apr 2026 13:08:09 +0200 Add 1 new runs (393 total) Profile: main_effects Completed: 1 | Skipped: 0 | Failed: 0

Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken)

2026-04-14T07:32:07Z

commit c79224fc666a6101927d5f64238f4331d9939ea5 parent 92464fd70b670fba6b71a3d3ad995cee4c286044 Author: Brian Graham Date: Tue, 14 Apr 2026 09:32:07 +0200 Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken)

Add 68 new runs (459 total)

2026-04-14T04:07:33Z

commit 92464fd70b670fba6b71a3d3ad995cee4c286044 parent 0b3c14da72aae531800f474ecadf3d960ff431c8 Author: Brian Graham Date: Tue, 14 Apr 2026 06:07:33 +0200 Add 68 new runs (459 total) Profile: main_effects Completed: 68 | Skipped: 61 | Failed: 18

Checkpoint: 60 runs (453 total)

2026-04-14T03:54:07Z

commit 0b3c14da72aae531800f474ecadf3d960ff431c8 parent c3e360e804cee1c489ce684800cb272cdfc4071c Author: Brian Graham Date: Tue, 14 Apr 2026 05:54:07 +0200 Checkpoint: 60 runs (453 total)

Checkpoint: 30 runs (423 total)

2026-04-14T03:03:53Z

commit c3e360e804cee1c489ce684800cb272cdfc4071c parent e52e85189d28eaa9be6271e17b363b5321c9ba16 Author: Brian Graham Date: Tue, 14 Apr 2026 05:03:53 +0200 Checkpoint: 30 runs (423 total)

Remove 68 zero-cost GLM-5.1 runs (auth failures)

2026-04-13T21:11:31Z

commit e52e85189d28eaa9be6271e17b363b5321c9ba16 parent 19603805e8fb248250f450fc3acd810b42dcc389 Author: Brian Graham Date: Mon, 13 Apr 2026 23:11:31 +0200 Remove 68 zero-cost GLM-5.1 runs (auth failures) Z.AI API key was expired/invalid during these runs, resulting in 0 turns and 0 cost. All 68 were glm-5.1 model. Co-Authored-By: Claude Opus 4.6 (1M context)

Add 24 new runs (459 total)

2026-04-13T20:56:53Z

commit 19603805e8fb248250f450fc3acd810b42dcc389 parent 3f102921e8bdb6a5e5f6f0ac940fd6052d67bde4 Author: Brian Graham Date: Mon, 13 Apr 2026 22:56:53 +0200 Add 24 new runs (459 total) Profile: main_effects Completed: 24 | Skipped: 105 | Failed: 18

Checkpoint: 20 runs (459 total)

2026-04-13T20:53:20Z

commit 3f102921e8bdb6a5e5f6f0ac940fd6052d67bde4 parent 23f965ae24a10947be7b2d35deb1af122bfd49c3 Author: Brian Graham Date: Mon, 13 Apr 2026 22:53:20 +0200 Checkpoint: 20 runs (459 total)

Checkpoint: 10 runs (449 total)

2026-04-13T20:46:20Z

commit 23f965ae24a10947be7b2d35deb1af122bfd49c3 parent 15bbcc86f3fbbb02c35a2ab5ee724272dcc49aeb Author: Brian Graham Date: Mon, 13 Apr 2026 22:46:20 +0200 Checkpoint: 10 runs (449 total)

Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia

2026-04-13T20:34:29Z

commit 15bbcc86f3fbbb02c35a2ab5ee724272dcc49aeb parent 7e8573576791e92cf46d189047bfac223df15bfc Author: Brian Graham Date: Mon, 13 Apr 2026 22:34:29 +0200 Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia The existing 25/50/75% noise files (195-587KB) exceed some API input limits (Z.AI rejects them with 0 turns). Added smaller sizes: - 1KB, 10KB, 50KB, 100KB for both lorem and wikipedia Co-Authored-By: Claude Opus 4.6 (1M context)

Add 44 new runs (435 total)

2026-04-13T20:24:08Z

commit 7e8573576791e92cf46d189047bfac223df15bfc parent 5c6e636b97dc9f81a5ff432fc9b730442d5ec8ef Author: Brian Graham Date: Mon, 13 Apr 2026 22:24:08 +0200 Add 44 new runs (435 total) Profile: main_effects Completed: 44 | Skipped: 61 | Failed: 18

Checkpoint: 40 runs (433 total)

2026-04-13T20:17:09Z

commit 5c6e636b97dc9f81a5ff432fc9b730442d5ec8ef parent e6f94efd9e1797f169e38db6fe6e566c1b9cec37 Author: Brian Graham Date: Mon, 13 Apr 2026 22:17:09 +0200 Checkpoint: 40 runs (433 total)

Restore game artifacts deleted by GPU machine commit

2026-04-13T15:14:54Z

commit e6f94efd9e1797f169e38db6fe6e566c1b9cec37 parent 742064a538f4dfd67a0c2f63f3a48023a6a8fde4 Author: Brian Graham Date: Mon, 13 Apr 2026 17:14:54 +0200 Restore game artifacts deleted by GPU machine commit 8daeef92 accidentally deleted game HTML/JS/CSS files from artifacts/ when the GPU machine committed without having the full artifacts. Restored from d07dba79. Co-Authored-By: Claude Opus 4.6 (1M context)

CI: exclude artifacts/ from rsync --delete

2026-04-13T14:13:21Z

commit 742064a538f4dfd67a0c2f63f3a48023a6a8fde4 parent 8972d44e8b39923d82288a7c34ce62ed50b4261a Author: Brian Graham Date: Mon, 13 Apr 2026 16:13:21 +0200 CI: exclude artifacts/ from rsync --delete The --delete flag on dashboard rsync was wiping artifacts/ on every deploy. If the artifact rsync then failed (no artifacts on CI runner), the game previews were gone. Now --exclude='artifacts/' preserves previously deployed games. Co-Authored-By: Claude Opus 4.6 (1M context)

Add all game artifacts, fix CI artifact rsync

2026-04-13T13:44:43Z

commit 8972d44e8b39923d82288a7c34ce62ed50b4261a parent 8daeef92ba6951d8f87f6476d03c4fafe25a57d2 Author: Brian Graham Date: Mon, 13 Apr 2026 15:44:43 +0200 Add all game artifacts, fix CI artifact rsync Committed all game artifacts (HTML, JS, CSS) from workspace extracts. node_modules excluded by .gitignore. CI deploy.yml: skip artifact rsync gracefully if directory missing on the CI runner (artifacts are large and not always available). Co-Authored-By: Claude Opus 4.6 (1M context)

Re-eval all 390 runs with V2 bot on GPU machine

2026-04-13T13:28:24Z

commit 8daeef92ba6951d8f87f6476d03c4fafe25a57d2 parent d07dba794c0abd20688958f4185daf3447786621 Author: Brian Graham Date: Mon, 13 Apr 2026 15:28:24 +0200 Re-eval all 390 runs with V2 bot on GPU machine GPU canvas readback works (getImageData returns real pixels), unlocking 148 canvas games that previously failed on non-GPU machine. Fix reeval.py artifact path (was dashboard/public/artifacts/, now artifacts/). Clean up SonarQube .scannerwork temp files from artifacts. Co-Authored-By: Claude Opus 4.6 (1M context)

Context update for GPU machine testing

2026-04-13T11:58:14Z

commit d07dba794c0abd20688958f4185daf3447786621 parent 499b8e496ada5d677434f56c376d41db2517b3a3 Author: Brian Graham Date: Mon, 13 Apr 2026 13:58:14 +0200 Context update for GPU machine testing Co-Authored-By: Claude Opus 4.6 (1M context)

Update eval results: 123 runs re-evaluated with V2 bot

2026-04-12T16:46:57Z

commit 499b8e496ada5d677434f56c376d41db2517b3a3 parent a8683609ce323889201069673c68c03616555eb6 Author: Brian Graham Date: Sun, 12 Apr 2026 18:46:57 +0200 Update eval results: 123 runs re-evaluated with V2 bot V2 mean: 43% (up from V1's 20%). Distribution shifted from 0-10% cluster to 60-70% for games the bot can now play. Co-Authored-By: Claude Opus 4.6 (1M context)

Add 7 new games to calibration page

2026-04-12T16:23:03Z

commit a8683609ce323889201069673c68c03616555eb6 parent da239183820ed8946e7f489fa3abfe7ff2513ba6 Author: Brian Graham Date: Sun, 12 Apr 2026 18:23:03 +0200 Add 7 new games to calibration page opus, qwen, glm-5.1, haiku, gemma-4-26b across various score ranges (0-44%). Human tests unanswered, ready for testing. Co-Authored-By: Claude Opus 4.6 (1M context)

Analyze and push 391 runs

2026-04-12T15:56:13Z

commit da239183820ed8946e7f489fa3abfe7ff2513ba6 parent 821022cb2060158a630e184a67c95678e8dca7c7 Author: Brian Graham Date: Sun, 12 Apr 2026 17:56:13 +0200 Analyze and push 391 runs

Switch production eval to V2 gameplay bot

2026-04-12T15:43:33Z

commit 821022cb2060158a630e184a67c95678e8dca7c7 parent 00055378a50253cc949795147e20b64ed2a2767f Author: Brian Graham Date: Sun, 12 Apr 2026 17:43:33 +0200 Switch production eval to V2 gameplay bot Harness now uses gameplay-bot-v2 (two-tier architecture) when available, falls back to V1 if not. V2 has 95% agreement with human calibration (vs V1's 58%). Expect breakage on canvas games without GPU (getImageData returns zeros). DOM games should work well. Co-Authored-By: Claude Opus 4.6 (1M context)

Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear, 9805c24a has game over overlay

2026-04-12T15:38:24Z

commit 00055378a50253cc949795147e20b64ed2a2767f parent 42321c004e708af0b099db58e9cbedf54e03e145 Author: Brian Graham Date: Sun, 12 Apr 2026 17:38:24 +0200 Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear, 9805c24a has game over overlay cbbff570: CW rotation (Up) works, CCW (Z) is broken. Updated rotate=true. e2e04e75: Bot was right, score increases by 100 on line clear. Updated score_increases_on_clear=true. 9805c24a: Game over shows overlay with GAME OVER text + Play Again button. Updated game_over_display=true. V2 agreement now 95% (102/107). 5 remaining disagreements: 3 from trail rendering bug (4949d521), 1 game_over_display detection (9805c24a), 1 all_pieces_rotate edge case (cbbff570). Co-Authored-By: Claude Opus 4.6 (1M context)

V2: landmarks-based game_loads, updated calibration test names

2026-04-12T06:31:49Z

commit 42321c004e708af0b099db58e9cbedf54e03e145 parent 13710ed75e85f22717f6463b1b3bc0b822a0a66b Author: Brian Graham Date: Sun, 12 Apr 2026 08:31:49 +0200 V2: landmarks-based game_loads, updated calibration test names game_loads now checks for structural landmarks (canvas, DOM grid, tetris-ratio elements, cell containers) instead of failing on console errors. Console errors are informational, not a pass/fail gate. 8fe72fce: game_loads now PASS (was FAIL from benign startup TypeError), score 100% (20/20 scorable). Updated all calibration JSON files: score_changes renamed to score_increases_on_clear + score_element_visible. Co-Authored-By: Claude Opus 4.6 (1M context)

V2: partial landmarks work (agent hit limit)

2026-04-11T07:59:29Z

commit 13710ed75e85f22717f6463b1b3bc0b822a0a66b parent d1b5c77738368fcf645c325b912383d5c69f22ed Author: Brian Graham Date: Sat, 11 Apr 2026 09:59:29 +0200 V2: partial landmarks work (agent hit limit)

V2: stricter rotation test requires distinct rotation states

2026-04-11T07:12:39Z

commit d1b5c77738368fcf645c325b912383d5c69f22ed parent 669aa68861617a446e5ae56aea0a462995d183f0 Author: Brian Graham Date: Sat, 11 Apr 2026 09:12:39 +0200

V2: stricter rotation test requires distinct rotation states

Previous rotate test passed if pressing rotate caused ANY grid change. Games with broken rotation (only 1 of 4 states works) would pass.

New test:
- Press rotate 4 times, wait 100ms between each
- Record normalized active piece shape after each press
- rotate test: passes if 2+ distinct shapes (baseline + rotation)
- all_pieces_rotate: passes if 2+ J/L/T pieces reach 3+ distinct shapes
- Skips if fewer than 2 J/L/T piece types seen

Uses position-invariant shape keys (normalized to top-left origin) so auto-drop during the test doesn't confuse the comparison.

Tracking:
- session.distinctRotationShapes: max observed in Phase 3 probe
- session.rotationShapesByPiece: Map>
- playGame accepts rotationTrack param for gameplay-phase probing

Results:
- 9805c24a (broken): rotate now FAIL, all_pieces_rotate FAIL (was PASS/PASS)
- cbbff570 (flaky): rotate FAIL (was PASS)
- 4c7db3b9 (working): 100% score (up from 94%)
- 1d08ee76: 95% (up from 89%)
- 8fe72fce: 94% (unchanged)

Co-Authored-By: Claude Opus 4.6 (1M context)
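
A position-invariant shape key of the kind described above can be sketched in a few lines. The function names and the (row, col) cell representation are assumptions; the pass thresholds (2+ distinct shapes for rotate, 2+ pieces with 3+ shapes for all_pieces_rotate) come from the commit message.

```python
def shape_key(cells):
    """Normalize (row, col) cells to a top-left origin; return a hashable key."""
    min_row = min(r for r, _ in cells)
    min_col = min(c for _, c in cells)
    return tuple(sorted((r - min_row, c - min_col) for r, c in cells))


def distinct_shapes(observations):
    """Count distinct normalized shapes across a list of observed cell sets."""
    return len({shape_key(cells) for cells in observations})


def rotate_passes(observations):
    """rotate test: baseline plus at least one different shape."""
    return distinct_shapes(observations) >= 2


def all_pieces_rotate_passes(shapes_by_piece):
    """all_pieces_rotate: at least two pieces each reach 3+ distinct shapes."""
    good = [p for p, obs in shapes_by_piece.items() if distinct_shapes(obs) >= 3]
    return len(good) >= 2
```

Because the key subtracts the minimum row and column, a piece that has only fallen a few rows between samples produces the same key, so gravity alone can never make the rotate test pass.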

V2: game_over_display test passes on overlay OR restart presence

2026-04-11T05:30:25Z

commit 669aa68861617a446e5ae56aea0a462995d183f0 parent 71f0c4b7931a2d58ae370587a6a0f62ba352b716 Author: Brian Graham Date: Sat, 11 Apr 2026 07:30:25 +0200 V2: game_over_display test passes on overlay OR restart presence Previous check required BOTH a modal and a restart button. Now accepts either signal because different games show game-over UI differently (some have full modal, some just show a restart button overlay). Co-Authored-By: Claude Opus 4.6 (1M context)

V2: language-agnostic game over detection, capture in Phase 6

2026-04-11T05:28:13Z

commit 71f0c4b7931a2d58ae370587a6a0f62ba352b716 parent e0a13b62466491d470aedf8a0eb251d5075f2951 Author: Brian Graham Date: Sat, 11 Apr 2026 07:28:13 +0200

V2: language-agnostic game over detection, capture in Phase 6

Two fixes:

1. Game over display detection happens in Phase 6 (when game over is actually triggered) and is stored on session. Phase 8 no longer needs to re-trigger game over. Added gameOverText and gameOverRestartAvailable to GameSession.

2. detectGameOverText() and detectRestartOption() are now language-agnostic. Instead of matching text patterns, they detect structural modals: position fixed/absolute elements covering >15% of viewport with visible background/content and z-index. Restart detection finds clickable elements inside the detected modal.

Also updated CLAUDE.md with explicit "driver MUST NOT hard-code language strings" convention.

Known follow-ups: readLevel, detectNextPiecePreview, detectControls, and score element search still use text matching. These need language-agnostic replacements.

Co-Authored-By: Claude Opus 4.6 (1M context)
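
The structural modal check in fix 2 can be sketched over plain element records. The dictionary fields (`position`, `visible`, `width`, `height`, `children`, `clickable`) and both function names are hypothetical stand-ins for whatever the driver reads from the DOM; the 15% viewport threshold is from the commit message.

```python
def is_modal_candidate(el, viewport_w, viewport_h, min_cover=0.15):
    """True if the element looks like a modal without reading any text."""
    if el["position"] not in ("fixed", "absolute"):
        return False
    if not el["visible"]:
        return False
    cover = (el["width"] * el["height"]) / (viewport_w * viewport_h)
    return cover > min_cover


def restart_candidates(modal):
    """Clickable descendants of a detected modal (one level deep in this sketch)."""
    return [child for child in modal.get("children", []) if child.get("clickable")]
```

Nothing in the check depends on the words shown, so a Spanish or Japanese game-over screen is detected the same way as an English one.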

Calibration cbbff570: rotation is flaky (human was wrong)

2026-04-11T05:25:49Z

commit e0a13b62466491d470aedf8a0eb251d5075f2951 parent 4f28472171324e5ca141cd341697a345ac9438fc Author: Brian Graham Date: Sat, 11 Apr 2026 07:25:49 +0200 Calibration cbbff570: rotation is flaky (human was wrong) Re-tested rotation. At best it rotates once per piece, sometimes stalls the game or causes blocks to vanish. CCW rotation also broken. Previous human test was incorrect. Game over can be triggered by spamming space and shows proper modal with "Play Again" button. Co-Authored-By: Claude Opus 4.6 (1M context)

Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes

2026-04-10T19:24:30Z

commit 4f28472171324e5ca141cd341697a345ac9438fc parent f2f3ae07a56e601bfb819d07f034eedeabc6a9c8 Author: Brian Graham Date: Fri, 10 Apr 2026 21:24:30 +0200 Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes Headline score is 50% gameplay bot + 50% SonarQube. The lint/typecheck/ bundle "code quality" metrics are tracked as output metrics, not part of the headline score. Also removed all prose emdashes per project convention. Co-Authored-By: Claude Opus 4.6 (1M context)

V2: fix AI player so it actually plays Tetris

2026-04-10T19:20:12Z

commit f2f3ae07a56e601bfb819d07f034eedeabc6a9c8 parent 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62 Author: Brian Graham Date: Fri, 10 Apr 2026 21:20:12 +0200

V2: fix AI player so it actually plays Tetris

Four bugs in the LeeYiyuan port were preventing the bot from playing:

Bug 1 (primary): settledGrid pollution
After hard-drop, the bot read the grid and stored it as 'settled' but the next piece had already spawned. detectActivePieceCells then returned null because the new piece was 'baked into settled'. Bot froze after move 1 (telemetry: pieces_spawned=1, pieces_locked=20).
Fix: stripActivePiece(boardBeforePlacement) before placement, wait 350ms after drop, detect new piece against the saved board, strip it to get clean settledGrid for next iteration.

Bug 2: pre-rotation column math wrong
currentCol was captured before rotation, but rotation shifts the piece's leftmost column, so column moves were off by 1-2 cells.
Fix: Adopt LeeYiyuan's slam-left strategy. After computing placement: rotate N times, press Left 10 times (slam to wall), press Right placement.column times, hard drop. No live position tracking needed.

Bug 3: J[3] had negative column offset
PIECES.J[3] = [[0,0],[1,0],[2,0],[2,-1]] -- minCol=-1 broke the column math for J piece in rotation state 3.
Fix: Normalized to [[0,1],[1,1],[2,1],[2,0]] with minCol=0. All 28 rotation states now have minRow=0 and minCol=0.

Bug 4: First-frame piece detection sloppy
detectActivePieceCells fallback scanned all cells in top 6 rows when settledGrid was null, picking up UI chrome.
Fix: BFS for connected component of 3-5 cells in top 4 rows, pick the one closest to spawn center column 4.5.

Results: bot now actually plays Tetris.
- Test on 4c7db3b9: 94% score (was 89%)
- Gameplay phase cleared 10 lines (was 0)
- Competitive play cleared 2 lines, scored 200 (was 0)

Co-Authored-By: Claude Opus 4.6 (1M context)
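
The fixes for Bug 2 and Bug 3 can be checked with a small sketch. The J[3] offsets are taken from the commit message; the function names and the key names ("ArrowUp" for rotate, "Space" for hard drop) are assumptions, since actual controls vary per game.

```python
def normalize(cells):
    """Shift [row, col] offsets so that minRow == 0 and minCol == 0."""
    min_row = min(r for r, _ in cells)
    min_col = min(c for _, c in cells)
    return [[r - min_row, c - min_col] for r, c in cells]


def placement_keys(rotations, column, rotate_key="ArrowUp",
                   hard_drop_key="Space", slam=10):
    """Slam-left sequence: rotate, push to the left wall, step right, drop.

    Key names are assumed defaults for illustration only.
    """
    return (
        [rotate_key] * rotations
        + ["ArrowLeft"] * slam
        + ["ArrowRight"] * column
        + [hard_drop_key]
    )
```

Slamming left first means the target column is counted from the wall, so it no longer matters how rotation shifted the piece's leftmost cell, provided every rotation state is stored with a zero minimum column.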

Update methodology page with current bot architecture

2026-04-10T19:11:59Z

commit 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62 parent 7df3ddd793a69cba93ded966d634045f4810a5fc Author: Brian Graham Date: Fri, 10 Apr 2026 21:11:59 +0200

Update methodology page with current bot architecture

Major rewrite of the bot section to reflect the actual implementation:
- 8 conditional phases (was 4)
- 25 tests across mechanics, lifecycle, gameplay, game state, competitive
- Two-tier architecture (Driver + Bot separation)
- Discovery infrastructure: language-agnostic start, interactivity check, control discovery, calibration cache
- All 9 competitive play bug detection tests listed
- 60ms polling (was 150ms)
- Updated limitations: GPU requirement, trail bugs, game over masking, hidden elements, etc.
- Pierre Dellacherie attribution

Co-Authored-By: Claude Opus 4.6 (1M context)

Correct attribution: Pierre Dellacherie's 4-heuristic Tetris AI

2026-04-10T19:06:34Z

commit 7df3ddd793a69cba93ded966d634045f4810a5fc parent 2644610c24ac12d9ef707571aec1e31a934389a8 Author: Brian Graham Date: Fri, 10 Apr 2026 21:06:34 +0200 Correct attribution: Pierre Dellacherie's 4-heuristic Tetris AI The algorithm is from Pierre Dellacherie (2003), not LeeYiyuan. Weights are from Colin Fahey's GA optimization. LeeYiyuan/tetrisai is the reference implementation we adapted code from. Updated methodology page, SPEC.md, player.ts, bot.ts, CLAUDE.md. Co-Authored-By: Claude Opus 4.6 (1M context)

V2: control discovery system

2026-04-10T18:04:37Z

commit 2644610c24ac12d9ef707571aec1e31a934389a8 parent 8dc9ec566791cf32913b7ea8f3ba37a789ef0b86 Author: Brian Graham Date: Fri, 10 Apr 2026 20:04:37 +0200

V2: control discovery system

Bot no longer assumes default Tetris controls. Driver now probes each candidate key to discover what it actually does:
- ArrowDown might be hard drop instead of soft drop
- Space might be pause instead of hard drop
- Some games have no soft drop at all (skip move_down test as N/A)

Discovery is reload-safe (clears game state between probes), classifies based on grid delta (movement direction, distance, shape change), budget capped at 50s.

New types: GameAction, ControlMapping, ControlMap with confidence levels
New driver methods: discoverControls(), getControl()
Bot updated: move_down and soft_drop_distinct skip when soft_drop not found
Report includes control_discovery field showing what each key does

Results:
- 1d08ee76 (control swap): 67% -> 83-89%
- 4c7db3b9 (working game): 86% -> 100%
- 8fe72fce (held): 95%

Co-Authored-By: Claude Opus 4.6 (1M context)

V2 fix: handle absolute-positioned active piece overlays

2026-04-10T16:58:30Z

commit 8dc9ec566791cf32913b7ea8f3ba37a789ef0b86 parent 14d5747dc2f92d77eaa7655145f2fe71bcde4d0a Author: Brian Graham Date: Fri, 10 Apr 2026 18:58:30 +0200 V2 fix: handle absolute-positioned active piece overlays

Game 8fe72fce uses absolute-positioned div overlays for the falling piece, separate from the 200 grid cells. The grid reader was missing the active piece because it only read the first 200 children.

Fix:
- Added refreshGridDetection() in driver: re-detects grid without full re-calibration, called by verifyGameStarted() after start clicks
- readDomGrid() now reads overlay children (>200 children) and computes which grid cell each absolute-positioned overlay falls into
- Widened child-count ranges from 180-220 to 180-230 to accommodate overlays
- Added screenshotGridArea() and captureGridDomFingerprint() as fallback signals for verifyGameStarted when grid-based detection misses

Results: 8fe72fce 0% -> 95% (matches human's 20/20). Overall V2 vs human: 82% -> 86% agreement.

Co-Authored-By: Claude Opus 4.6 (1M context)
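Mapping an absolute-positioned overlay onto a grid cell comes down to a little geometry. The sketch below is not the readDomGrid() implementation; `Rect`, `Cell`, and `overlayToCell` are hypothetical names, and the 10-column by 20-row layout is an assumption inferred from the 200 grid cells mentioned above.

```typescript
// Hypothetical rectangle in page pixels, e.g. taken from getBoundingClientRect().
interface Rect {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface Cell {
  row: number;
  col: number;
}

// Map the centre of an overlay onto a grid of rows x cols equal cells.
// Returns null when the overlay's centre lies outside the grid bounds.
function overlayToCell(grid: Rect, overlay: Rect, rows = 20, cols = 10): Cell | null {
  const cx = overlay.left + overlay.width / 2;
  const cy = overlay.top + overlay.height / 2;
  const col = Math.floor(((cx - grid.left) / grid.width) * cols);
  const row = Math.floor(((cy - grid.top) / grid.height) * rows);
  if (col < 0 || col >= cols || row < 0 || row >= rows) return null;
  return { row, col };
}

// A 300x600 grid has 30px cells; an overlay at (90, 60) sits in row 2, column 3.
const board: Rect = { left: 0, top: 0, width: 300, height: 600 };
const inside = overlayToCell(board, { left: 90, top: 60, width: 30, height: 30 });
const outside = overlayToCell(board, { left: -50, top: 60, width: 30, height: 30 });
```

Using the overlay's centre rather than its top-left corner makes the mapping tolerant of one or two pixels of border or margin offset.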

Add gemma426b run artifacts and results

2026-04-10T12:38:22Z

commit 14d5747dc2f92d77eaa7655145f2fe71bcde4d0a parent 3bde26d36a17e8b79525bbe582d3ab13b8d8387b Author: Brian Graham Date: Fri, 10 Apr 2026 14:38:22 +0200 Add gemma426b run artifacts and results Co-Authored-By: Claude Opus 4.6 (1M context)

V2 bot: caching, bot/driver bridge, fixed CCW rotation test

2026-04-10T12:36:48Z

commit 3bde26d36a17e8b79525bbe582d3ab13b8d8387b parent d162c5ba603ac08e3db2a7fe0919dd0494c4f14d Author: Brian Graham Date: Fri, 10 Apr 2026 14:36:48 +0200 V2 bot: caching, bot/driver bridge, fixed CCW rotation test

Three improvements merged:
1. Calibration caching (driver.ts): caches start mechanism, controls, grid bounds across reloads. Detects drift, flags conflicts. Eliminates timeouts from repeated full calibration.
2. Bot/driver bridge (bot.ts, driver.ts, types.ts): bot verifies game actually started before driver commits to a mechanism. Checks grid populated, movement responsive, no game-over text. discoverStartCandidates, tryStartMechanism, confirmStartMechanism, rejectStartMechanism methods.
3. CCW rotation test (bot.ts): fixed broken sequential test that was tautologically true. Now reloads page between Z and X tests, compares rotation states from same baseline.

Results vs human calibration (9 games):
- V1: 56/97 = 58% agreement
- V2: 80/98 = 82% agreement

Major wins: e2e04e75 (Spanish 18% -> 85%, perfect agreement), 4949d521 (trail bug 18% -> 67%), cbbff570 (18% -> 67%), 9805c24a (80% -> 95%), 7a348b81 (correctly finds working start button).

Known regression: 8fe72fce went 44% -> 0% because bridge's strict verification rejects start mechanisms when benign startup console errors occur. Needs follow-up: distinguish pre-start errors from fatal errors.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add gameplay-bot-v2: two-tier architecture (Driver + Bot)

2026-04-09T19:14:39Z

commit d162c5ba603ac08e3db2a7fe0919dd0494c4f14d parent 3012989bb80dca8980569effc60dd0bd59e283c3 Author: Brian Graham Date: Thu, 9 Apr 2026 21:14:39 +0200 Add gameplay-bot-v2: two-tier architecture (Driver + Bot)

New implementation with clean separation:
- driver.ts (1710 lines): TetrisDriver class, all Playwright interaction
- bot.ts (1690 lines): game logic, 25 tests, zero Playwright imports
- types.ts (233 lines): TetrisDriver interface with 17 methods

Improvements over v1:
- Buttons before keyboard in start detection
- 300ms post-click initialization wait
- False start rejection (immediate game-over check)
- Grid re-calibration after start
- playable_30s gates on errors_during_play only
- Interactivity verification via screenshot + DOM state

Co-Authored-By: Claude Opus 4.6 (1M context)

Update calibration: 9805c24a (broken rotation, bad randomizer), cbbff570 (mostly works, spurious line clear, weird preview)

2026-04-09T18:22:15Z

commit 3012989bb80dca8980569effc60dd0bd59e283c3 parent 58c112b941608fed3c655c638c0e4c17daa5bb19 Author: Brian Graham Date: Thu, 9 Apr 2026 20:22:15 +0200 Update calibration: 9805c24a (broken rotation, bad randomizer), cbbff570 (mostly works, spurious line clear, weird preview) Co-Authored-By: Claude Opus 4.6 (1M context)

Add test #25 rendering_clean, update calibration data

2026-04-09T18:15:12Z

commit 58c112b941608fed3c655c638c0e4c17daa5bb19 parent 67bd49c6e259f78aade3caeae40c3418dedf8071 Author: Brian Graham Date: Thu, 9 Apr 2026 20:15:12 +0200 Add test #25 rendering_clean, update calibration data New test detects rendering trail bugs where falling pieces leave old positions still colored. Checks filled cell growth vs pieces placed during competitive play (threshold: 8x = trail bug). Updated calibration: 1d08ee76 (broken rotation, no soft drop), 4949d521 (trail rendering bug, lines never clear). Co-Authored-By: Claude Opus 4.6 (1M context)
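The trail-bug check compares how much the board filled up with how many pieces were placed. The sketch below assumes one reading of "threshold: 8x", namely that filled-cell growth of at least 8 cells per placed piece indicates a trail bug (a tetromino adds only 4 cells). The commit does not spell out the exact formula, so `PlaySample`, `TRAIL_THRESHOLD`, and `looksLikeTrailBug` are hypothetical.

```typescript
// Hypothetical inputs; field names are illustrative, not the bot's real types.
interface PlaySample {
  filledBefore: number; // filled cells when competitive play began
  filledAfter: number;  // filled cells when it ended
  piecesPlaced: number; // pieces locked in between
}

// Assumed reading of "threshold: 8x": growth of at least 8 cells per piece.
const TRAIL_THRESHOLD = 8;

function looksLikeTrailBug(sample: PlaySample): boolean {
  if (sample.piecesPlaced <= 0) return false; // nothing to compare against
  const growth = sample.filledAfter - sample.filledBefore;
  return growth >= TRAIL_THRESHOLD * sample.piecesPlaced;
}

// 10 pieces adding 36 cells is plausible play (40 cells minus a few cleared).
const healthy = looksLikeTrailBug({ filledBefore: 0, filledAfter: 36, piecesPlaced: 10 });
// 10 pieces adding 120 cells means old positions are probably staying colored.
const trailing = looksLikeTrailBug({ filledBefore: 0, filledAfter: 120, piecesPlaced: 10 });
```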

Add two-tier architecture refactor spec for gameplay bot

2026-04-09T10:56:29Z

commit 67bd49c6e259f78aade3caeae40c3418dedf8071 parent 7fbe88ce2a1febb0954305d10f4e1878570e0f14 Author: Brian Graham Date: Thu, 9 Apr 2026 12:56:29 +0200 Add two-tier architecture refactor spec for gameplay bot Driver (webpage abstraction) + Bot (game logic) separation. 17-method TetrisDriver interface, 4-commit incremental migration plan, ~2740 lines (down from 3500). Bot never imports Playwright. Co-Authored-By: Claude Opus 4.6 (1M context)

Verify game interactivity via DOM + screenshot after start detection

2026-04-09T09:48:33Z

commit 7fbe88ce2a1febb0954305d10f4e1878570e0f14 parent 53c719fefdf6f437deb0b34eb1b8dbff56d06643 Author: Brian Graham Date: Thu, 9 Apr 2026 11:48:33 +0200 Verify game interactivity via DOM + screenshot after start detection

Start detection now requires the game to respond to gameplay inputs (ArrowLeft/Right/Down) before confirming a mechanism worked. Checks both screenshot changes AND DOM state changes (class names, styles on grid children).

This catches:
- False starts from Space (visual change but not interactive)
- Games that rebuild DOM via innerHTML (screenshot identical but DOM differs)

Spanish game e2e04e75 now correctly starts via "Iniciar Juego" button (was falsely starting via Space). Score went from 18% to 75%.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add grid re-sampling after game start detection

2026-04-09T09:20:13Z

commit 53c719fefdf6f437deb0b34eb1b8dbff56d06643 parent 7ec3ff4435d0c822bb73dc7b4689cc4908ca9883 Author: Brian Graham Date: Thu, 9 Apr 2026 11:20:13 +0200 Add grid re-sampling after game start detection If initial calibration finds no grid (renderer: unknown), re-calibrate after the game starts. Games that create DOM cells dynamically via JS (innerHTML on render loop) have 0 children at page load but 200 after starting. Records grid_detected_at: "initial" vs "after_start" in report. Known issue: e2e04e75 game has startBtn button but bot falsely detects start via Space (something changed but game didn't actually start). Needs stronger start verification in next session. Co-Authored-By: Claude Opus 4.6 (1M context)

Add all 10 DOM games to calibration page

2026-04-09T09:10:57Z

commit 7ec3ff4435d0c822bb73dc7b4689cc4908ca9883 parent 4bffa2cd4b2213f424528e1cc00cecce1fcd1a8e Author: Brian Graham Date: Thu, 9 Apr 2026 11:10:57 +0200 Add all 10 DOM games to calibration page 5 new entries (human tests unanswered, ready for testing). 5 existing entries already have human test data. Co-Authored-By: Claude Opus 4.6 (1M context)

Update gameplay bot results for 10 DOM games with new start detection

2026-04-09T07:18:31Z

commit 4bffa2cd4b2213f424528e1cc00cecce1fcd1a8e parent 1d5cce537fb6c78ac946ca42dca010215a97e6fd Author: Brian Graham Date: Thu, 9 Apr 2026 09:18:31 +0200 Update gameplay bot results for 10 DOM games with new start detection All 10 DOM games now start successfully (10/10 game_starts: PASS). Language-agnostic detection works: space, auto, and button click all detected. Scores range 18-85%. The 18% scores are grid reader failures (canvas rendering despite DOM classification), not start detection issues. Co-Authored-By: Claude Opus 4.6 (1M context)

Language-agnostic start detection for gameplay bot

2026-04-09T07:03:26Z

commit 1d5cce537fb6c78ac946ca42dca010215a97e6fd parent 4ce8d09103c723f23b4f1d266fe3aef143995996 Author: Brian Graham Date: Thu, 9 Apr 2026 09:03:26 +0200 Language-agnostic start detection for gameplay bot

Rewrote start mechanism detection to be fully language-agnostic:
- No text string matching (removed btn/start/play selectors)
- Detects clickable elements by cursor:pointer, background color, size
- Sorts candidates by prominence (size, center proximity, contrast)
- Keyboard triggers (Enter, Space) tried before DOM buttons
- All wait times reduced from 300-500ms to 100ms
- Overlay detection purely structural (position, z-index, viewport %)

Tested: Spanish DOM game now starts correctly (game_starts: PASS via space). Spanish DOM game 2 scores 89% (up from 80%). Canvas games still blocked by GPU pixel readback issue.

Co-Authored-By: Claude Opus 4.6 (1M context)

Update calibration: 93e8feea starts into game over, e2e04e75 no scoring

2026-04-09T06:07:32Z

commit 4ce8d09103c723f23b4f1d266fe3aef143995996 parent b19aa539899396ddc9373afcaa71621210b6e113 Author: Brian Graham Date: Thu, 9 Apr 2026 08:07:32 +0200 Update calibration: 93e8feea starts into game over, e2e04e75 no scoring 93e8feea: Game loads into immediate game over state with overlay that never dismisses. Bot false positive (says game_starts: PASS). e2e04e75: Spanish game, all mechanics work but score never changes (real game bug). Bot false negative (18% but most mechanics work). Co-Authored-By: Claude Opus 4.6 (1M context)

Calibration: copy button instead of JSON block, update human results

2026-04-09T06:04:42Z

commit b19aa539899396ddc9373afcaa71621210b6e113 parent 9fab5af2106b43a76d5ec6827903ccd666eb3945 Author: Brian Graham Date: Thu, 9 Apr 2026 08:04:42 +0200 Calibration: copy button instead of JSON block, update human results Replace inline JSON pre block with a clean "Copy JSON" button. Updated 4c7db3b9 (Spanish, all mechanics work) and 8fe72fce (English, 19 human passes including multi-line clear, score scaling, CCW rotation). Co-Authored-By: Claude Opus 4.6 (1M context)

Fix calibration UI: connect Human Testing toggle to all cards

2026-04-09T05:56:41Z

commit 9fab5af2106b43a76d5ec6827903ccd666eb3945 parent cce938f0ee50da98113502ccbc5e5d066efc5137 Author: Brian Graham Date: Thu, 9 Apr 2026 07:56:41 +0200 Fix calibration UI: connect Human Testing toggle to all cards The editing state from the parent "Human Testing" button now flows down to each CalibrationCard, showing tri-state test controls, editable notes, and copyable JSON export on all cards simultaneously. Co-Authored-By: Claude Opus 4.6 (1M context)

Interactive calibration UI with human testing mode

2026-04-09T05:52:24Z

commit cce938f0ee50da98113502ccbc5e5d066efc5137 parent d748de6f4a388178c427cd55d11db0149a9d0d5b Author: Brian Graham Date: Thu, 9 Apr 2026 07:52:24 +0200 Interactive calibration UI with human testing mode

Calibrate page now uses a React island with:
- "Human Testing" toggle button reveals clickable tri-state controls (yes/no/unanswered) for each test per game
- Short code + game link in title for easy click-and-play
- Editable notes field
- Copyable JSON export per card for pasting results back
- Aggregate agree/disagree stats at top
- Bot results referenced from eval_results.json (not duplicated)

Co-Authored-By: Claude Opus 4.6 (1M context)

Add bot calibration page with human vs bot comparison

2026-04-09T05:41:41Z

commit d748de6f4a388178c427cd55d11db0149a9d0d5b parent dcef6a4928511792f670d74ad63b8e1b9a7bde45 Author: Brian Graham Date: Thu, 9 Apr 2026 07:41:41 +0200 Add bot calibration page with human vs bot comparison

Hidden /calibrate page showing hand-picked games with human test results side-by-side with bot results. Data is JSON-powered (one file per game in calibration/), references canonical bot results from eval_results.json.

Initial 5 entries from manual testing of DOM-rendered games:
- 2 games match well (80-85% bot vs human "playable")
- 1 false negative (bot 18%, human says playable -- likely GPU issue)
- 1 overlay bug correctly identified by human, bot confused
- 1 genuinely broken game (both agree: won't start)

Co-Authored-By: Claude Opus 4.6 (1M context)

Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play

2026-04-09T05:23:38Z

commit dcef6a4928511792f670d74ad63b8e1b9a7bde45 parent f978492f1169d00686406170f46df2c1f5f783ca Author: Brian Graham Date: Thu, 9 Apr 2026 07:23:38 +0200 Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play

Major rewrite implementing the full SPEC.md design:

Phase 1: Page load
Phase 2: Start detection with falling piece detector (10 screenshots at 100ms, pixel cluster tracking for downward movement), overlay detection, cascading trigger sequence (auto/enter/space/button/canvas)
Phase 3: Mechanics (movement, rotation, hard drop) -- conditional on P2
Phase 4: Piece lifecycle (lock, spawn, multiple) -- conditional on P3
Phase 5: Gameplay (60 pieces/45s, integrated score tracking) -- conditional on P4
Phase 6: Game over (stack to top via grid reader) -- conditional on P4
Phase 7: Endurance (30s play) -- conditional on P5
Phase 8: Competitive play (60s, 8 bug-detection tests) -- conditional on P5

New tests 17-24: multi_line_clear, score_scaling, level_progression, speed_progression, next_piece_preview, game_over_display, counter_clockwise_rotation, soft_drop_distinct

Score = passed / (total - skipped). Skipped tests don't penalize.

Added SurveyData, CompetitivePlayResult types. Page survey function in calibrate.ts. 5-minute timeout for competitive play phase.

Co-Authored-By: Claude Opus 4.6 (1M context)
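The scoring rule stated in this commit, Score = passed / (total - skipped), can be written out as a short function. This is a minimal sketch rather than the bot's code: the `TestResult` shape and `computeScore` name are hypothetical, and returning `null` when every test is skipped is an assumption, since the commit does not say how that case is handled.

```typescript
// Hypothetical result shape; the real report types are not shown in this log.
type TestStatus = "pass" | "fail" | "skip";

interface TestResult {
  name: string;
  status: TestStatus;
}

// Score = passed / (total - skipped). Skipped tests leave the denominator,
// so they neither help nor hurt. Returns null when nothing was counted.
function computeScore(results: TestResult[]): number | null {
  const skipped = results.filter((r) => r.status === "skip").length;
  const passed = results.filter((r) => r.status === "pass").length;
  const counted = results.length - skipped;
  return counted === 0 ? null : passed / counted;
}

// Two passes, one failure, one skip: 2 / (4 - 1).
const exampleResults: TestResult[] = [
  { name: "game_starts", status: "pass" },
  { name: "multi_line_clear", status: "pass" },
  { name: "score_scaling", status: "fail" },
  { name: "soft_drop_distinct", status: "skip" },
];
const exampleScore = computeScore(exampleResults);
```

Under this rule a game with no soft drop is scored out of the tests that could actually run, which is why a skipped `soft_drop_distinct` does not lower the percentage.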

Add comprehensive gameplay bot spec (24 tests, 8 phases)

2026-04-08T21:40:24Z

commit f978492f1169d00686406170f46df2c1f5f783ca parent dcf0b2b68809e147c61745c349e067cdc700b022 Author: Brian Graham Date: Wed, 8 Apr 2026 23:40:24 +0200 Add comprehensive gameplay bot spec (24 tests, 8 phases)

Master spec consolidating all design decisions for the gameplay bot rewrite:
- Conditional phase execution (each depends on previous succeeding)
- Falling piece detector (10 screenshots at 100ms, pixel cluster tracking)
- Start detection cascade: auto -> overlay -> buttons -> keyboard -> canvas
- Competitive play phase (60s, bug detection for multi-line clear, score scaling, level/speed progression, rotation, soft drop)
- 24 total tests (16 basic + 8 competitive play bug checks)
- Skipped tests don't penalize score: passed / (total - skipped)
- GPU requirement documented for canvas pixel readback

Co-Authored-By: Claude Opus 4.6 (1M context)
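Conditional phase execution can be illustrated with a small runner that skips any phase whose prerequisite did not pass. The `Phase` interface, `PhaseOutcome` type, and `runPhases` function are hypothetical names for this sketch; the spec's actual structure is not shown in this log.

```typescript
type PhaseOutcome = "passed" | "failed" | "skipped";

// Hypothetical phase description: run() returns true when the phase succeeded.
interface Phase {
  id: number;
  dependsOn: number | null; // id of the phase that must pass first, if any
  run: () => boolean;
}

// Run phases in order. A phase is skipped, not failed, when its prerequisite
// did not pass, so a broken start does not count against later tests.
function runPhases(phases: Phase[]): Record<number, PhaseOutcome> {
  const outcomes: Record<number, PhaseOutcome> = {};
  for (const phase of phases) {
    if (phase.dependsOn !== null && outcomes[phase.dependsOn] !== "passed") {
      outcomes[phase.id] = "skipped";
      continue;
    }
    outcomes[phase.id] = phase.run() ? "passed" : "failed";
  }
  return outcomes;
}

// Phase 2 fails, so phases 3 and 4 are skipped rather than run.
const phaseOutcomes = runPhases([
  { id: 1, dependsOn: null, run: () => true },
  { id: 2, dependsOn: 1, run: () => false },
  { id: 3, dependsOn: 2, run: () => true },
  { id: 4, dependsOn: 3, run: () => true },
]);
```

Recording "skipped" separately from "failed" is what allows the score to exclude tests that never had a chance to run.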

Checkpoint: 40 runs (438 total)

2026-04-08T18:54:35Z

commit dcf0b2b68809e147c61745c349e067cdc700b022 parent 5df80751cd068bed2ff25120efc40e25ea15b4d8 Author: Brian Graham Date: Wed, 8 Apr 2026 20:54:35 +0200 Checkpoint: 40 runs (438 total)

Checkpoint: 35 runs (433 total)

2026-04-08T18:37:51Z

commit 5df80751cd068bed2ff25120efc40e25ea15b4d8 parent a683c185d8f6ef2a499b586db5302236627e7013 Author: Brian Graham Date: Wed, 8 Apr 2026 20:37:51 +0200 Checkpoint: 35 runs (433 total)

Checkpoint: 30 runs (428 total)

2026-04-08T18:20:43Z

commit a683c185d8f6ef2a499b586db5302236627e7013 parent 664a80e943fc00e4a5c6e147ebeaec3051c6def5 Author: Brian Graham Date: Wed, 8 Apr 2026 20:20:43 +0200 Checkpoint: 30 runs (428 total)

Checkpoint: 25 runs (423 total)

2026-04-08T18:04:16Z

commit 664a80e943fc00e4a5c6e147ebeaec3051c6def5 parent ef35ac28812a8669213b243d85d99716d7caf2cd Author: Brian Graham Date: Wed, 8 Apr 2026 20:04:16 +0200 Checkpoint: 25 runs (423 total)

Checkpoint: 20 runs (418 total)

2026-04-08T17:47:23Z

commit ef35ac28812a8669213b243d85d99716d7caf2cd parent 8199b922d2c75cb7dbe9ca98bf806ca4ce70a962 Author: Brian Graham Date: Wed, 8 Apr 2026 19:47:23 +0200 Checkpoint: 20 runs (418 total)

Checkpoint: 15 runs (413 total)

2026-04-08T17:30:46Z

commit 8199b922d2c75cb7dbe9ca98bf806ca4ce70a962 parent 413bf6cd88b73f770d308ed9eade42d2cfc570be Author: Brian Graham Date: Wed, 8 Apr 2026 19:30:46 +0200 Checkpoint: 15 runs (413 total)

Checkpoint: 10 runs (408 total)

2026-04-08T16:09:55Z

commit 413bf6cd88b73f770d308ed9eade42d2cfc570be parent 077892dc11b20dbd6266eaad70102215e24f8108 Author: Brian Graham Date: Wed, 8 Apr 2026 18:09:55 +0200 Checkpoint: 10 runs (408 total)

Checkpoint: 5 runs (403 total)

2026-04-08T11:48:57Z

commit 077892dc11b20dbd6266eaad70102215e24f8108 parent 0fb5a7736cd2f9318f185a4e0a0f2d5b3e73f97d Author: Brian Graham Date: Wed, 8 Apr 2026 13:48:57 +0200 Checkpoint: 5 runs (403 total)

Fix page load: use waitUntil commit, try root URL first

2026-04-08T11:23:55Z

commit 0fb5a7736cd2f9318f185a4e0a0f2d5b3e73f97d parent 43fb9fa4943a04511bffd96ccd1ba7e925d1ef15 Author: Brian Graham Date: Wed, 8 Apr 2026 13:23:55 +0200 Fix page load: use waitUntil commit, try root URL first Games with blocking JS never fire domcontentloaded. waitUntil: commit just waits for first bytes. Try root / before /index.html (serve SPA mode redirect). Co-Authored-By: Claude Opus 4.6 (1M context)
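The load strategy in this commit can be sketched against Playwright's `page.goto`, which accepts `waitUntil: "commit"`. To keep the example self-contained it uses a minimal `PageLike` interface instead of importing Playwright; `PageLike` and `loadGame` are hypothetical names, and the fall-through to the next candidate on error is an assumption about how "try root first" is handled.

```typescript
// Minimal structural stand-in for the part of Playwright's Page used here.
interface PageLike {
  goto(url: string, options: { waitUntil: "commit" }): Promise<unknown>;
}

// Try the root URL first, then /index.html. "commit" resolves once the
// response has started arriving, so blocking scripts that prevent
// domcontentloaded from firing do not stall navigation.
// Assumes baseUrl has no trailing slash.
async function loadGame(page: PageLike, baseUrl: string): Promise<string> {
  const candidates = [baseUrl + "/", baseUrl + "/index.html"];
  let lastError: unknown = null;
  for (const url of candidates) {
    try {
      await page.goto(url, { waitUntil: "commit" });
      return url; // first candidate that navigates wins
    } catch (err) {
      lastError = err; // remember the failure and try the next candidate
    }
  }
  throw lastError;
}
```

With a real Playwright `Page`, the call would look like `await loadGame(page, "http://localhost:3000")`, where the port is only an example.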

Rewrite start detection: 5-phase, language-agnostic, visual change

2026-04-08T07:59:16Z

commit 43fb9fa4943a04511bffd96ccd1ba7e925d1ef15 parent 69173d2750e5cab2d6a94d1c152116be336341c2 Author: Brian Graham Date: Wed, 8 Apr 2026 09:59:16 +0200 Rewrite start detection: 5-phase, language-agnostic, visual change

Phase 1: auto-start (10 frames at 100ms, no input)
Phase 2: DOM buttons by visual prominence (no text matching)
Phase 3: canvas click grid (center, upper, lower, 3x3)
Phase 4: keyboard triggers with combos
Phase 5: retry all phases

detectVisualChange: Level 1 (any change) + Level 2 (gameplay pattern)
30-second total budget. Stateful button recording.

Co-Authored-By: Claude Opus 4.6 (1M context)

Fix large prompt handling: use wrapper script instead of bash -c

2026-04-08T07:21:29Z

commit 69173d2750e5cab2d6a94d1c152116be336341c2 parent 77e7e9a06140edc233811dccc1cb005b0402575d Author: Brian Graham Date: Wed, 8 Apr 2026 09:21:29 +0200 Fix large prompt handling: use wrapper script instead of bash -c bash -c with $(cat) broke on quotes in settings JSON. Now writes a shell wrapper script that reads the prompt from file. Co-Authored-By: Claude Opus 4.6 (1M context)

Checkpoint: 30 runs (414 total)

2026-04-08T06:52:25Z

commit 77e7e9a06140edc233811dccc1cb005b0402575d parent e9c7251cd07c133098a32de1b00898bc7ea79d3f Author: Brian Graham Date: Wed, 8 Apr 2026 08:52:25 +0200 Checkpoint: 30 runs (414 total)

Add 95% CI bands, statistical power card, tornado CI whiskers

2026-04-08T05:58:36Z

commit e9c7251cd07c133098a32de1b00898bc7ea79d3f parent 4c5457fbc3c2f5ff52de70289b518e2f956800f4 Author: Brian Graham Date: Wed, 8 Apr 2026 07:58:36 +0200 Add 95% CI bands, statistical power card, tornado CI whiskers

- Box plot: CI band overlay with mean dot, tooltip shows CI range
- Statistical Power card: avg CI width, detectable effect, color status
- Tornado: CI whiskers on effect bars, non-significant dimmed with "n.s."
- confidenceInterval() function with t-distribution for small samples

Co-Authored-By: Claude Opus 4.6 (1M context)
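A 95% confidence interval based on the t-distribution can be sketched as follows. This is not the repository's confidenceInterval() function: the name `confidenceInterval95`, the return shape, and the use of a lookup table with a 1.96 fallback above 30 degrees of freedom are assumptions made for illustration.

```typescript
// Two-sided 95% critical values of Student's t for 1..30 degrees of freedom.
const T_CRIT_95: number[] = [
  12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
  2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
  2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042,
];

interface Interval {
  mean: number;
  lower: number;
  upper: number;
}

// Mean plus or minus t * standard error. Returns null for fewer than two
// samples, where the sample variance is undefined.
function confidenceInterval95(samples: number[]): Interval | null {
  const n = samples.length;
  if (n < 2) return null;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const sumSq = samples.reduce((a, x) => a + (x - mean) * (x - mean), 0);
  const stdErr = Math.sqrt(sumSq / (n - 1) / n);
  // Above 30 degrees of freedom, fall back to the normal value 1.96
  // (a simplifying approximation in this sketch).
  const t = T_CRIT_95[n - 2] ?? 1.96;
  return { mean, lower: mean - t * stdErr, upper: mean + t * stdErr };
}

// Eight scores with mean 5; the interval uses t = 2.365 (7 degrees of freedom).
const scoreInterval = confidenceInterval95([2, 4, 4, 4, 5, 5, 7, 9]);
```

For small cells the t critical value is noticeably larger than 1.96, so the band widens to reflect the extra uncertainty in the estimated variance.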

Checkpoint: 15 runs (399 total)

2026-04-08T05:45:51Z

commit 4c5457fbc3c2f5ff52de70289b518e2f956800f4 parent 150e14e6b771b6276fa497d524bfb868bd1cb9d0 Author: Brian Graham Date: Wed, 8 Apr 2026 07:45:51 +0200 Checkpoint: 15 runs (399 total)

Switch qwen-3.6-plus from free to paid endpoint

2026-04-08T05:32:40Z

commit 150e14e6b771b6276fa497d524bfb868bd1cb9d0 parent 625d14b3b226e882d25a00909c6d47ab82d0080b Author: Brian Graham Date: Wed, 8 Apr 2026 07:32:40 +0200 Switch qwen-3.6-plus from free to paid endpoint

Fix argument list too long for noise cells

2026-04-08T05:17:24Z

commit 625d14b3b226e882d25a00909c6d47ab82d0080b parent e59ff443edb659c9d21d3fef8d708bd29176f827 Author: Brian Graham Date: Wed, 8 Apr 2026 07:17:24 +0200 Fix argument list too long for noise cells Large prompts (>100KB from context noise) exceeded OS arg limit. Now writes prompt to temp file and uses bash -c with cat for large prompts. Also deleted 20 gemma runs with 403 auth errors. Co-Authored-By: Claude Opus 4.6 (1M context)

Add minimax-m2.7 and kimi-k2.5 via OpenRouter

2026-04-08T05:09:29Z

commit e59ff443edb659c9d21d3fef8d708bd29176f827 parent 7a1efd6efd6f2649b557db27eb8af49fd4795d6e Author: Brian Graham Date: Wed, 8 Apr 2026 07:09:29 +0200 Add minimax-m2.7 and kimi-k2.5 via OpenRouter Co-Authored-By: Claude Opus 4.6 (1M context)

Checkpoint: 30 runs (453 total)

2026-04-08T05:07:45Z

commit 7a1efd6efd6f2649b557db27eb8af49fd4795d6e parent 2c88a48235c8cd6f7a422e556594f7c8dbd3ffbe Author: Brian Graham Date: Wed, 8 Apr 2026 07:07:45 +0200 Checkpoint: 30 runs (453 total)

Checkpoint: 20 runs (433 total)

2026-04-08T05:06:28Z

commit 2c88a48235c8cd6f7a422e556594f7c8dbd3ffbe parent c7095255ad2f40ee04f9d0e3824a1813ec8603fa Author: Brian Graham Date: Wed, 8 Apr 2026 07:06:28 +0200 Checkpoint: 20 runs (433 total)

Checkpoint: 10 runs (433 total)

2026-04-08T05:05:18Z

commit c7095255ad2f40ee04f9d0e3824a1813ec8603fa parent 858f9ae354cd621f1969c4ad6c188e41a297451c Author: Brian Graham Date: Wed, 8 Apr 2026 07:05:18 +0200 Checkpoint: 10 runs (433 total)

Analyze and push 393 runs

2026-04-08T04:59:26Z

commit 858f9ae354cd621f1969c4ad6c188e41a297451c parent ce57e6ee85e544459288916a5b5c147fb83db69d Author: Brian Graham Date: Wed, 8 Apr 2026 06:59:26 +0200 Analyze and push 393 runs

Checkpoint: 10 runs (396 total)

2026-04-07T22:11:37Z

commit ce57e6ee85e544459288916a5b5c147fb83db69d parent 1e69260f07274dca928db476f7e78700b94c472a Author: Brian Graham Date: Wed, 8 Apr 2026 00:11:37 +0200 Checkpoint: 10 runs (396 total)

Add 21 new runs (394 total)

2026-04-07T21:30:26Z

commit 1e69260f07274dca928db476f7e78700b94c472a parent a5c7df1a95000d3e881f09d29479b93d4169e27b Author: Brian Graham Date: Tue, 7 Apr 2026 23:30:26 +0200 Add 21 new runs (394 total) Profile: main_effects Completed: 21 | Skipped: 13 | Failed: 7

Checkpoint: 20 runs (393 total)

2026-04-07T21:12:31Z

commit a5c7df1a95000d3e881f09d29479b93d4169e27b parent 91f15a4a0a9b76a012f13ca069eec988584d5f99 Author: Brian Graham Date: Tue, 7 Apr 2026 23:12:31 +0200 Checkpoint: 20 runs (393 total)

Add 33 new runs (373 total)

2026-04-07T20:03:59Z

commit 91f15a4a0a9b76a012f13ca069eec988584d5f99 parent f078feba3d37071568e8638b9514b41e90605d84 Author: Brian Graham Date: Tue, 7 Apr 2026 22:03:59 +0200 Add 33 new runs (373 total) Profile: main_effects Completed: 33 | Skipped: 1 | Failed: 6

Checkpoint: 30 runs (373 total)

2026-04-07T20:03:09Z

commit f078feba3d37071568e8638b9514b41e90605d84 parent 08782c8c43bf0df2b7e5287bab5753f9bd7cee24 Author: Brian Graham Date: Tue, 7 Apr 2026 22:03:09 +0200 Checkpoint: 30 runs (373 total)

Checkpoint: 20 runs (365 total)

2026-04-07T20:02:08Z

commit 08782c8c43bf0df2b7e5287bab5753f9bd7cee24 parent d42c3a6342b91bcd6a46a99a86a26d7039dbb749 Author: Brian Graham Date: Tue, 7 Apr 2026 22:02:08 +0200 Checkpoint: 20 runs (365 total)