| 2026-04-16 14:47 | Drop aborted glm-4.5-air run 0c19668a | Brian Graham | 1 | +6012 | -6012 |
| 2026-04-16 14:46 | Analyze and push 511 runs | Brian Graham | 3 | +0 | -78 |
| 2026-04-16 14:45 | Project runs across all dashboard pages | Brian Graham | 4 | +31 | -16 |
| 2026-04-16 14:32 | Project runs before serializing into index-page islands | Brian Graham | 2 | +57 | -5 |
| 2026-04-16 13:53 | Rebuild PCA from post-reeval 510-run dataset | Brian Graham | 1 | +7825 | -5657 |
| 2026-04-16 13:53 | Analyze and push 512 runs | Brian Graham | 11 | +2581 | -2835 |
| 2026-04-16 13:50 | Full reeval on GPU machine: V2 bot + SonarQube | Brian Graham | 3287 | +337966 | -104023 |
| 2026-04-16 11:08 | 900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79% | Brian Graham | 70 | +6834 | -2759 |
| 2026-04-16 10:10 | Add human labels for 3 more calibration runs | Brian Graham | 3 | +43 | -45 |
| 2026-04-16 09:58 | Retag 176 pre-provider anthropic runs with prov=anth in cell_id | Brian Graham | 4852 | +511972 | -512706 |
| 2026-04-16 07:54 | Add human trial labels for 4 calibration runs | Brian Graham | 4 | +87 | -88 |
| 2026-04-16 07:06 | Preserve gameplay bot report on timeout | Brian Graham | 74 | +8580 | -3235 |
| 2026-04-16 05:59 | Remove 39 invalid glm-4.7 runs and add new sweep results | Brian Graham | 1221 | +265103 | -9227 |
| 2026-04-15 14:27 | Add 18 new runs (458 total) | Brian Graham | 390 | +102913 | -1169 |
| 2026-04-15 13:41 | Re-eval 17 calibration runs; fix reeval.py artifact cleanup | Brian Graham | 19 | +1825 | -1219 |
| 2026-04-15 11:47 | Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs | Brian Graham | 1 | +51 | -28 |
| 2026-04-15 09:37 | Remove 20 more zero-turn 429 runs from glm-5.1 sweep | Brian Graham | 221 | +0 | -61425 |
| 2026-04-15 08:03 | Add 20 new runs (460 total) | Brian Graham | 227 | +62260 | -835 |
| 2026-04-15 05:03 | Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn) | Brian Graham | 274 | +0 | -76552 |
| 2026-04-15 02:30 | Add 14 new runs (460 total) | Brian Graham | 157 | +41738 | -1372 |
| 2026-04-14 20:42 | Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers | Brian Graham | 1154 | +281373 | -211 |
| 2026-04-14 11:08 | Add 1 new run (393 total) | Brian Graham | 95 | +35444 | -3764 |
| 2026-04-14 07:32 | Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken) | Brian Graham | 748 | +0 | -208878 |
| 2026-04-14 04:07 | Add 68 new runs (459 total) | Brian Graham | 97 | +25606 | -1122 |
| 2026-04-14 03:54 | Checkpoint: 60 runs (453 total) | Brian Graham | 342 | +94062 | -1451 |
| 2026-04-14 03:03 | Checkpoint: 30 runs (423 total) | Brian Graham | 345 | +93985 | -2139 |
| 2026-04-13 21:11 | Remove 68 zero-cost GLM-5.1 runs (auth failures) | Brian Graham | 748 | +0 | -208822 |
| 2026-04-13 20:56 | Add 24 new runs (459 total) | Brian Graham | 48 | +12972 | -827 |
| 2026-04-13 20:53 | Checkpoint: 20 runs (459 total) | Brian Graham | 128 | +32174 | -1344 |
| 2026-04-13 20:46 | Checkpoint: 10 runs (449 total) | Brian Graham | 130 | +32454 | -1264 |
| 2026-04-13 20:34 | Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia | Brian Graham | 10 | +1289 | -1 |
| 2026-04-13 20:24 | Add 44 new runs (435 total) | Brian Graham | 50 | +13089 | -891 |
| 2026-04-13 20:17 | Checkpoint: 40 runs (433 total) | Brian Graham | 455 | +125268 | -2546 |
| 2026-04-13 15:14 | Restore game artifacts deleted by GPU machine commit | Brian Graham | 5463 | +1557109 | -6829 |
| 2026-04-13 14:13 | CI: exclude artifacts/ from rsync --delete | Brian Graham | 1 | +3 | -2 |
| 2026-04-13 13:44 | Add all game artifacts, fix CI artifact rsync | Brian Graham | 745 | +539948 | -232 |
| 2026-04-13 13:28 | Re-eval all 390 runs with V2 bot on GPU machine | Brian Graham | 6095 | +39175 | -1926721 |
| 2026-04-13 11:58 | Context update for GPU machine testing | Brian Graham | 3 | +211 | -67 |
| 2026-04-12 16:46 | Update eval results: 123 runs re-evaluated with V2 bot | Brian Graham | 226 | +24551 | -10583 |
| 2026-04-12 16:23 | Add 7 new games to calibration page | Brian Graham | 7 | +252 | -0 |
| 2026-04-12 15:56 | Analyze and push 391 runs | Brian Graham | 848 | +15206 | -184333 |
| 2026-04-12 15:43 | Switch production eval to V2 gameplay bot | Brian Graham | 1 | +6 | -2 |
| 2026-04-12 15:38 | Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear, 9805c24a has game over overlay | Brian Graham | 3 | +14 | -11 |
| 2026-04-12 06:31 | V2: landmarks-based game_loads, updated calibration test names | Brian Graham | 13 | +248 | -32 |
| 2026-04-11 07:59 | V2: partial landmarks work (agent hit limit) | Brian Graham | 1 | +28 | -0 |
| 2026-04-11 07:12 | V2: stricter rotation test requires distinct rotation states | Brian Graham | 2 | +243 | -38 |
| 2026-04-11 05:30 | V2: game_over_display test passes on overlay OR restart presence | Brian Graham | 1 | +18 | -12 |
| 2026-04-11 05:28 | V2: language-agnostic game over detection, capture in Phase 6 | Brian Graham | 4 | +134 | -29 |
| 2026-04-11 05:25 | Calibration cbbff570: rotation is flaky (human was wrong) | Brian Graham | 1 | +5 | -5 |
| 2026-04-10 19:24 | Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes | Brian Graham | 1 | +36 | -32 |
| 2026-04-10 19:20 | V2: fix AI player so it actually plays Tetris | Brian Graham | 2 | +149 | -101 |
| 2026-04-10 19:11 | Update methodology page with current bot architecture | Brian Graham | 1 | +200 | -48 |
| 2026-04-10 19:06 | Correct attribution: Pierre Dellacherie's 4-heuristic Tetris AI | Brian Graham | 5 | +17 | -13 |
| 2026-04-10 18:04 | V2: control discovery system | Brian Graham | 5 | +844 | -2 |
| 2026-04-10 16:58 | V2 fix: handle absolute-positioned active piece overlays | Brian Graham | 3 | +305 | -10 |
| 2026-04-10 12:38 | Add gemma426b run artifacts and results | Brian Graham | 28 | +5952 | -0 |
| 2026-04-10 12:36 | V2 bot: caching, bot/driver bridge, fixed CCW rotation test | Brian Graham | 4 | +1219 | -39 |
| 2026-04-09 19:14 | Add gameplay-bot-v2: two-tier architecture (Driver + Bot) | Brian Graham | 5 | +3887 | -0 |
| 2026-04-09 18:22 | Update calibration: 9805c24a (broken rotation, bad randomizer), cbbff570 (mostly works, spurious line clear, weird preview) | Brian Graham | 2 | +44 | -46 |
| 2026-04-09 18:15 | Add test #25 rendering_clean, update calibration data | Brian Graham | 4 | +95 | -39 |
| 2026-04-09 10:56 | Add two-tier architecture refactor spec for gameplay bot | Brian Graham | 1 | +877 | -0 |
| 2026-04-09 09:48 | Verify game interactivity via DOM + screenshot after start detection | Brian Graham | 1 | +109 | -12 |
| 2026-04-09 09:20 | Add grid re-sampling after game start detection | Brian Graham | 2 | +16 | -0 |
| 2026-04-09 09:10 | Add all 10 DOM games to calibration page | Brian Graham | 5 | +170 | -0 |
| 2026-04-09 07:18 | Update gameplay bot results for 10 DOM games with new start detection | Brian Graham | 12 | +2211 | -391 |
| 2026-04-09 07:03 | Language-agnostic start detection for gameplay bot | Brian Graham | 1 | +136 | -120 |
| 2026-04-09 06:07 | Update calibration: 93e8feea starts into game over, e2e04e75 no scoring | Brian Graham | 2 | +5 | -5 |
| 2026-04-09 06:04 | Calibration: copy button instead of JSON block, update human results | Brian Graham | 3 | +37 | -23 |
| 2026-04-09 05:56 | Fix calibration UI: connect Human Testing toggle to all cards | Brian Graham | 1 | +4 | -3 |
| 2026-04-09 05:52 | Interactive calibration UI with human testing mode | Brian Graham | 2 | +286 | -126 |
| 2026-04-09 05:41 | Add bot calibration page with human vs bot comparison | Brian Graham | 6 | +332 | -0 |
| 2026-04-09 05:23 | Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play | Brian Graham | 4 | +1016 | -168 |
| 2026-04-08 21:40 | Add comprehensive gameplay bot spec (24 tests, 8 phases) | Brian Graham | 1 | +467 | -0 |
| 2026-04-08 18:54 | Checkpoint: 40 runs (438 total) | Brian Graham | 76 | +16950 | -1475 |
| 2026-04-08 18:37 | Checkpoint: 35 runs (433 total) | Brian Graham | 76 | +17010 | -1535 |
| 2026-04-08 18:20 | Checkpoint: 30 runs (428 total) | Brian Graham | 76 | +16890 | -1415 |
| 2026-04-08 18:04 | Checkpoint: 25 runs (423 total) | Brian Graham | 76 | +16945 | -1470 |
| 2026-04-08 17:47 | Checkpoint: 20 runs (418 total) | Brian Graham | 76 | +16924 | -1449 |
| 2026-04-08 17:30 | Checkpoint: 15 runs (413 total) | Brian Graham | 80 | +17541 | -1422 |
| 2026-04-08 16:09 | Checkpoint: 10 runs (408 total) | Brian Graham | 115 | +23334 | -1423 |
| 2026-04-08 11:48 | Checkpoint: 5 runs (403 total) | Brian Graham | 515 | +87714 | -3861 |
| 2026-04-08 11:23 | Fix page load: use waitUntil commit, try root URL first | Brian Graham | 1 | +11 | -7 |
| 2026-04-08 07:59 | Rewrite start detection: 5-phase, language-agnostic, visual change | Brian Graham | 3 | +499 | -292 |
| 2026-04-08 07:21 | Fix large prompt handling: use wrapper script instead of bash -c | Brian Graham | 1 | +8 | -5 |
| 2026-04-08 06:52 | Checkpoint: 30 runs (414 total) | Brian Graham | 475 | +89396 | -2605 |
| 2026-04-08 05:58 | Add 95% CI bands, statistical power card, tornado CI whiskers | Brian Graham | 6 | +402 | -7 |
| 2026-04-08 05:45 | Checkpoint: 15 runs (399 total) | Brian Graham | 6105 | +42330 | -2992694 |
| 2026-04-08 05:32 | Switch qwen-3.6-plus from free to paid endpoint | Brian Graham | 1 | +1 | -1 |
| 2026-04-08 05:17 | Fix argument list too long for noise cells | Brian Graham | 1 | +20 | -4 |
| 2026-04-08 05:09 | Add minimax-m2.7 and kimi-k2.5 via OpenRouter | Brian Graham | 3 | +14 | -1 |
| 2026-04-08 05:07 | Checkpoint: 30 runs (453 total) | Brian Graham | 2604 | +1472941 | -1658 |
| 2026-04-08 05:06 | Checkpoint: 20 runs (433 total) | Brian Graham | 91 | +3745 | -619 |
| 2026-04-08 05:05 | Checkpoint: 10 runs (433 total) | Brian Graham | 2700 | +1409001 | -1710 |
| 2026-04-08 04:59 | Analyze and push 393 runs | Brian Graham | 428 | +51903 | -6438 |
| 2026-04-07 22:11 | Checkpoint: 10 runs (396 total) | Brian Graham | 82 | +18103 | -1314 |
| 2026-04-07 21:30 | Add 21 new runs (394 total) | Brian Graham | 52 | +14651 | -810 |
| 2026-04-07 21:12 | Checkpoint: 20 runs (393 total) | Brian Graham | 537 | +102099 | -1940 |
| 2026-04-07 20:03 | Add 33 new runs (373 total) | Brian Graham | 2 | +34 | -0 |
| 2026-04-07 20:03 | Checkpoint: 30 runs (373 total) | Brian Graham | 168 | +39907 | -1680 |
| 2026-04-07 20:02 | Checkpoint: 20 runs (365 total) | Brian Graham | 184 | +38555 | -1869 |
| 2026-04-07 20:00 | Checkpoint: 10 runs (353 total) | Brian Graham | 560 | +43424 | -78060 |
| 2026-04-07 19:45 | Checkpoint: 20 runs (376 total) | Brian Graham | 162 | +35934 | -1819 |
| 2026-04-07 19:44 | Checkpoint: 10 runs (365 total) | Brian Graham | 192 | +36162 | -1710 |
| 2026-04-07 19:42 | Checkpoint: 20 runs (343 total) | Brian Graham | 184 | +41424 | -1774 |
| 2026-04-07 19:39 | Add gemma-4-26b model via OpenRouter | Brian Graham | 3 | +8 | -1 |
| 2026-04-07 19:39 | Checkpoint: 10 runs (332 total) | Brian Graham | 299 | +74610 | -1749 |
| 2026-04-07 18:57 | Fix falling piece detector: faster polling, longer settle time | Brian Graham | 1 | +6 | -6 |
| 2026-04-07 18:34 | Analyze and push 316 runs | Brian Graham | 777 | +22642 | -62215 |
| 2026-04-07 17:46 | Add 28 new runs (337 total) | Brian Graham | 115 | +25816 | -924 |
| 2026-04-07 17:32 | Checkpoint: 20 runs (324 total) | Brian Graham | 182 | +35941 | -9143 |
| 2026-04-07 17:21 | Add OpenRouter provider with Qwen 3.6 Plus via litellm proxy | Brian Graham | 4 | +24 | -3 |
| 2026-04-07 17:07 | Version model names: haiku-4.5, sonnet-4.6, opus-4.6 | Brian Graham | 60 | +18045 | -24 |
| 2026-04-07 16:44 | Add 30 new runs (322 total) | Brian Graham | 178 | +45158 | -782 |
| 2026-04-07 16:43 | Checkpoint: 20 runs (310 total) | Brian Graham | 5565 | +693787 | -498747 |
| 2026-04-07 16:25 | PCA: 10 components, taller scree bars, remove Variance Explained | Brian Graham | 3 | +3688 | -1188 |
| 2026-04-07 16:17 | Spread PCA dots wider (2.5x), shrink spheres | Brian Graham | 1 | +4 | -4 |
| 2026-04-07 16:11 | Install three.js deps in dashboard dir (fixes CI build) | Brian Graham | 2 | +687 | -2 |
| 2026-04-07 16:07 | Add --runs-per-cell docs, sweep workflow, clean 4 bad runs | Brian Graham | 67 | +4023 | -5 |
| 2026-04-07 16:06 | Add n= confidence to Insights page | Brian Graham | 3 | +49 | -11 |
| 2026-04-07 16:04 | Add --runs-per-cell flag to override runs_per_cell from grid.yaml | Brian Graham | 1 | +5 | -1 |
| 2026-04-07 16:00 | Add n= confidence indicators to Grid page | Brian Graham | 3 | +44 | -12 |
| 2026-04-07 15:50 | Add scree plot to PCA page | Brian Graham | 3 | +1195 | -1011 |
| 2026-04-07 15:44 | 3D PCA scatter plot with react-three-fiber | Brian Graham | 2 | +310 | -233 |
| 2026-04-07 15:42 | Self-host JetBrains Mono fonts, remove Google Fonts CDN | Brian Graham | 4 | +21 | -4 |
| 2026-04-07 15:37 | Replace task chart with Top/Bottom 10 configs on grid page | Brian Graham | 3 | +376 | -260 |
| 2026-04-07 15:29 | Add model filter to Insights page (tornado, heatmap) | Brian Graham | 28 | +1200 | -227 |
| 2026-04-07 15:28 | PCA analysis page, remove violin dots | Brian Graham | 5 | +4008 | -13 |
| 2026-04-07 15:26 | Surprises tab, model selector, shared color palette integration | Brian Graham | 8 | +618 | -50 |
| 2026-04-07 15:18 | Checkpoint: 40 runs (266 total) | Brian Graham | 391 | +107184 | -1310 |
| 2026-04-07 15:14 | Add variability violin chart to Compare page | Brian Graham | 2 | +403 | -0 |
| 2026-04-07 15:13 | Shared color palette for 10 models across all charts | Brian Graham | 5 | +91 | -51 |
| 2026-04-07 14:27 | Checkpoint: 20 runs (246 total) | Brian Graham | 384 | +104408 | -1341 |
| 2026-04-07 13:08 | Analyze and push 222 runs | Brian Graham | 1 | +222 | -222 |
| 2026-04-07 13:08 | Re-eval 222 runs (10 glm-4.5-air, 26 glm-4.7, 9 glm-5.1, 74 haiku, 51 opus, 52 sonnet) | Brian Graham | 412 | +11416 | -4401 |
| 2026-04-07 11:56 | Increase gameplay bot timeout to 300s (was 180s) | Brian Graham | 2 | +3 | -3 |
| 2026-04-07 11:04 | Analyze and push 222 runs | Brian Graham | 1 | +222 | -222 |
| 2026-04-07 11:04 | Re-eval 222 runs (10 glm-4.5-air, 26 glm-4.7, 9 glm-5.1, 74 haiku, 51 opus, 52 sonnet) | Brian Graham | 1137 | +16288 | -174627 |
| 2026-04-07 10:34 | Stop deleting turns=1 and timeout runs as invalid | Brian Graham | 2 | +32 | -48 |
| 2026-04-07 10:23 | Analyze and push 253 runs | Brian Graham | 127 | +19233 | -1114 |
| 2026-04-07 10:03 | Checkpoint: 40 runs (250 total) | Brian Graham | 210 | +54373 | -1220 |
| 2026-04-07 08:52 | Checkpoint: 30 runs (240 total) | Brian Graham | 49 | +10428 | -1050 |
| 2026-04-07 08:39 | Discard runs with 0 turns before eval/commit | Brian Graham | 1 | +13 | -0 |
| 2026-04-07 08:34 | Analyze and push 236 runs | Brian Graham | 200 | +35695 | -9482 |
| 2026-04-07 07:33 | Checkpoint: 20 runs (233 total) | Brian Graham | 163 | +40023 | -1179 |
| 2026-04-07 06:54 | Analyze and push 225 runs | Brian Graham | 95 | +8875 | -1248 |
| 2026-04-07 06:35 | Checkpoint: 10 runs (223 total) | Brian Graham | 151 | +35332 | -1170 |
| 2026-04-07 05:46 | Exclude dist/build from sonarqube scans | Brian Graham | 1 | +1 | -1 |
| 2026-04-07 05:39 | Analyze and push 216 runs | Brian Graham | 100 | +16009 | -1019 |
| 2026-04-07 05:37 | Rewrite bot start detection: falling piece detector, conditional phases | Brian Graham | 2 | +584 | -250 |
| 2026-04-07 05:27 | Add spec for gameplay bot rewrite (falling piece detection) | Brian Graham | 1 | +66 | -0 |
| 2026-04-07 05:20 | Add rich UI widget for api_retry rate limit events in transcript | Brian Graham | 1 | +27 | -0 |
| 2026-04-07 05:12 | Analyze and push 211 runs | Brian Graham | 226 | +1328 | -71109 |
| 2026-04-07 05:07 | Add limitation: UI bugs masking working gameplay logic | Brian Graham | 1 | +1 | -0 |
| 2026-04-07 05:01 | Document bot false positives and unbuildable game limitation | Brian Graham | 1 | +3 | -0 |
| 2026-04-07 04:48 | Analyze and push 222 runs | Brian Graham | 177 | +0 | -6516 |
| 2026-04-07 04:42 | Analyze and push 278 runs | Brian Graham | 31 | +1036 | -7374 |
| 2026-04-07 04:41 | Add analyze-and-push.py for quick analysis without re-eval | Brian Graham | 1 | +121 | -0 |
| 2026-04-07 04:40 | Remove 192 rate-limited zai runs, update analysis (103 zai + 177 anthropic) | Brian Graham | 2210 | +1810 | -613122 |
| 2026-04-07 02:31 | Add 99 new runs (472 total) | Brian Graham | 71 | +19085 | -747 |
| 2026-04-07 01:50 | Checkpoint: 95 runs (470 total) | Brian Graham | 82 | +21377 | -1137 |
| 2026-04-07 01:35 | Checkpoint: 90 runs (465 total) | Brian Graham | 120 | +17523 | -1416 |
| 2026-04-07 01:28 | Checkpoint: 85 runs (442 total) | Brian Graham | 70 | +16767 | -1354 |
| 2026-04-07 01:18 | Checkpoint: 80 runs (437 total) | Brian Graham | 69 | +16723 | -1329 |
| 2026-04-07 01:11 | Checkpoint: 75 runs (432 total) | Brian Graham | 69 | +16727 | -1322 |
| 2026-04-07 01:01 | Checkpoint: 70 runs (427 total) | Brian Graham | 69 | +16601 | -1206 |
| 2026-04-07 00:54 | Checkpoint: 65 runs (422 total) | Brian Graham | 69 | +16660 | -1253 |
| 2026-04-07 00:44 | Checkpoint: 60 runs (417 total) | Brian Graham | 69 | +16626 | -1233 |
| 2026-04-07 00:37 | Checkpoint: 55 runs (412 total) | Brian Graham | 69 | +16715 | -1310 |
| 2026-04-07 00:27 | Checkpoint: 50 runs (407 total) | Brian Graham | 69 | +16731 | -1336 |
| 2026-04-07 00:20 | Checkpoint: 45 runs (402 total) | Brian Graham | 69 | +16852 | -1315 |
| 2026-04-07 00:10 | Checkpoint: 40 runs (397 total) | Brian Graham | 69 | +17020 | -1363 |
| 2026-04-07 00:02 | Checkpoint: 35 runs (392 total) | Brian Graham | 69 | +16715 | -1313 |
| 2026-04-06 23:53 | Checkpoint: 30 runs (387 total) | Brian Graham | 69 | +16735 | -1337 |
| 2026-04-06 23:46 | Checkpoint: 25 runs (382 total) | Brian Graham | 69 | +16777 | -1374 |
| 2026-04-06 23:36 | Checkpoint: 20 runs (377 total) | Brian Graham | 69 | +16696 | -1300 |
| 2026-04-06 23:29 | Checkpoint: 15 runs (372 total) | Brian Graham | 69 | +16388 | -1182 |
| 2026-04-06 23:19 | Checkpoint: 10 runs (367 total) | Brian Graham | 59 | +13543 | -1256 |
| 2026-04-06 23:12 | Checkpoint: 5 runs (363 total) | Brian Graham | 61 | +10762 | -1383 |
| 2026-04-06 23:02 | Add 99 new runs (358 total) | Brian Graham | 105 | +28512 | -791 |
| 2026-04-06 22:48 | Checkpoint: 90 runs (351 total) | Brian Graham | 178 | +33023 | -1153 |
| 2026-04-06 22:31 | Checkpoint: 80 runs (323 total) | Brian Graham | 124 | +32192 | -1042 |
| 2026-04-06 22:14 | Checkpoint: 70 runs (313 total) | Brian Graham | 124 | +31979 | -1029 |
| 2026-04-06 21:57 | Checkpoint: 60 runs (303 total) | Brian Graham | 279 | +71033 | -1046 |
| 2026-04-06 20:44 | Checkpoint: 50 runs (293 total) | Brian Graham | 138 | +36145 | -1177 |
| 2026-04-06 20:23 | Checkpoint: 40 runs (283 total) | Brian Graham | 124 | +32245 | -1182 |
| 2026-04-06 20:06 | Checkpoint: 30 runs (273 total) | Brian Graham | 124 | +31988 | -1189 |
| 2026-04-06 19:49 | Checkpoint: 20 runs (263 total) | Brian Graham | 124 | +31852 | -1251 |
| 2026-04-06 19:32 | Checkpoint: 10 runs (253 total) | Brian Graham | 167 | +25364 | -5600 |
| 2026-04-06 19:11 | Document gameplay bot known limitations (wall kicks, lock delay, etc.) | Brian Graham | 1 | +1 | -0 |
| 2026-04-06 19:10 | Force all x-axis labels to show on box plot (interval={0}) | Brian Graham | 1 | +3 | -2 |
| 2026-04-06 19:08 | Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands | Brian Graham | 1 | +43 | -24 |
| 2026-04-06 18:47 | Checkpoint: 40 runs (244 total) | Brian Graham | 412 | +111457 | -2861 |
| 2026-04-06 18:38 | Add new axes to run config box, link run to cell | Brian Graham | 1 | +11 | -1 |
| 2026-04-06 18:36 | Add 3 new runs (225 total) | Brian Graham | 113 | +15814 | -2215 |
| 2026-04-06 18:35 | Checkpoint: 20 runs (224 total) | Brian Graham | 439 | +97726 | -1336 |
| 2026-04-06 18:31 | Put (n=) count on separate line below model name in box plot | Brian Graham | 1 | +12 | -4 |
| 2026-04-06 18:26 | Remove scatter dots from Score Distribution box plot | Brian Graham | 1 | +2 | -17 |
| 2026-04-06 18:20 | Fix main_effects for provider filtering | Brian Graham | 2 | +4 | -0 |
| 2026-04-06 18:18 | Box plots for grid charts, model toggles for scatter hulls | Brian Graham | 2 | +353 | -115 |
| 2026-04-06 18:11 | Fix auto-commit using old artifacts path | Brian Graham | 1 | +2 | -2 |
| 2026-04-06 18:10 | Add 6 Z.AI GLM runs (glm-4.5-air, glm-4.7, glm-5.1) | Brian Graham | 137 | +32337 | -758 |
| 2026-04-06 18:03 | Add --commit-every N flag for periodic analyze+push | Brian Graham | 1 | +34 | -0 |
| 2026-04-06 17:36 | Add -n/--max-runs flag to limit total runs | Brian Graham | 1 | +9 | -1 |
| 2026-04-06 17:26 | Use real GLM model names directly, drop model_map | Brian Graham | 4 | +25 | -52 |
| 2026-04-06 17:13 | Add zai-smoke profile, fix provider in profiles | Brian Graham | 1 | +32 | -4 |
| 2026-04-06 17:09 | Accept actual model names with --model for non-anthropic providers | Brian Graham | 1 | +9 | -1 |
| 2026-04-06 17:08 | Use actual_model in cell_ids and dashboard display | Brian Graham | 5 | +42 | -10 |
| 2026-04-06 17:07 | Re-eval 177 runs (74 haiku, 51 opus, 52 sonnet) | Brian Graham | 433 | +28858 | -4765 |
| 2026-04-06 17:01 | Require --provider flag for run.py | Brian Graham | 1 | +19 | -0 |
| 2026-04-06 16:54 | Add provider axis for Z.AI (GLM) model support | Brian Graham | 14 | +68 | -7 |
| 2026-04-06 15:11 | Add short_id, short_cell_id, claude_version to analysis skip keys | Brian Graham | 2 | +5 | -0 |
| 2026-04-06 14:18 | Add short URL IDs, test fixtures, and context noise files | Brian Graham | 194 | +15979 | -704 |
| 2026-04-06 13:57 | Grid expansion: 7 new axes, migrate all run IDs to abbreviated format | Brian Graham | 4611 | +456579 | -454954 |
| 2026-04-06 13:54 | Fix serve process leak in gameplay bot eval | Brian Graham | 1 | +41 | -23 |
| 2026-04-06 13:13 | Re-eval 173 runs (71 haiku, 51 opus, 51 sonnet) | Brian Graham | 561 | +83595 | -5426 |
| 2026-04-06 12:02 | Fix cell_id length, add SonarQube details, rebuild gameplay bot | Brian Graham | 6 | +158 | -145 |
| 2026-04-06 10:42 | Re-eval 159 runs (57 haiku, 51 opus, 51 sonnet) | Brian Graham | 307 | +27096 | -6732 |
| 2026-04-06 09:44 | Re-eval 159 runs (57 haiku, 51 opus, 51 sonnet) | Brian Graham | 199 | +7489 | -17294 |
| 2026-04-06 08:44 | Add sonarqube metric to analysis pipeline, fix metric labels | Brian Graham | 3 | +18 | -4 |
| 2026-04-06 08:42 | Outcome = gameplay + SonarQube, not gameplay + lint/typecheck | Brian Graham | 2 | +11 | -13 |
| 2026-04-06 08:36 | Update CLAUDE.md with complete project state | Brian Graham | 1 | +53 | -34 |
| 2026-04-06 08:32 | Flexible axes on scatter plots and efficiency frontier | Brian Graham | 2 | +211 | -45 |
| 2026-04-06 08:30 | Add methodology page explaining scoring and experiment design | Brian Graham | 1 | +477 | -0 |
| 2026-04-06 08:30 | Add clean-and-reeval command | Brian Graham | 1 | +216 | -0 |
| 2026-04-06 08:29 | Restructure scoring: outcome vs output, flexible scatter, methodology nav | Brian Graham | 7 | +166 | -60 |
| 2026-04-06 08:07 | Fix artifacts: dist/ was globally gitignored, breaking compiled games | Brian Graham | 270 | +56036 | -1 |
| 2026-04-06 07:37 | Prevent off-grid reading and false positive piece detection | Brian Graham | 2 | +42 | -2 |
| 2026-04-06 07:26 | Wire SonarQube into eval pipeline | Brian Graham | 1 | +16 | -0 |
| 2026-04-06 07:25 | Add SonarQube integration for code quality analysis | Brian Graham | 1 | +185 | -0 |
| 2026-04-06 06:30 | Rewrite gameplay bot with continuous scanning and no false positives | Brian Graham | 6 | +1366 | -796 |
| 2026-04-06 04:36 | Scatter plot: 4 density levels instead of 2 | Brian Graham | 1 | +43 | -34 |
| 2026-04-06 04:31 | Fix empty scatter plots: add hidden Scatter to seed axis scales | Brian Graham | 1 | +9 | -0 |
| 2026-04-06 04:31 | Fix quality scoring, add budget/timeout indicators | Brian Graham | 3 | +31 | -3 |
| 2026-04-06 04:27 | Sort model bar chart: haiku, sonnet, opus | Brian Graham | 1 | +6 | -1 |
| 2026-04-06 04:23 | Move artifacts out of Astro public/, fix 13GB node_modules bloat | Brian Graham | 1886 | +227458 | -275543 |
| 2026-04-06 04:10 | Add directional indicators to correlation matrix | Brian Graham | 1 | +10 | -10 |
| 2026-04-06 03:55 | Re-eval all 159 runs with fixed scoring and improved bot calibration | Brian Graham | 269 | +7357 | -11022 |
| 2026-04-05 22:20 | Add 2 new runs (67 total) | Brian Graham | 126 | +2301 | -4819 |
| 2026-04-05 22:04 | Add 25 new runs (159 total) | Brian Graham | 439 | +102980 | -3713 |
| 2026-04-05 21:40 | Fix score calculation: remove double-counting, normalize weights | Brian Graham | 2 | +9 | -11 |
| 2026-04-05 21:34 | Improve gameplay bot calibration with fallbacks and DOM grid detection | Brian Graham | 3 | +385 | -50 |
| 2026-04-05 21:31 | Add 6 new runs (73 total) | Brian Graham | 210 | +1193 | -65163 |
| 2026-04-05 21:28 | Show detailed score breakdowns on run page | Brian Graham | 1 | +127 | -0 |
| 2026-04-05 19:58 | Add 35 new runs (159 total) | Brian Graham | 362 | +110987 | -917 |
| 2026-04-05 19:42 | Add 4 new runs (124 total) | Brian Graham | 56 | +17212 | -668 |
| 2026-04-05 19:22 | Convert all charts to cell-based: every visualization now shows cells not runs | Brian Graham | 7 | +263 | -169 |
| 2026-04-05 19:03 | Clean 9 incomplete runs (no HTML output), re-run analysis | Brian Graham | 63 | +1014 | -12871 |
| 2026-04-05 06:55 | Add 49 new runs (113 total) | Brian Graham | 225 | +59081 | -11558 |
| 2026-04-05 06:51 | Fix duplicate coefficientOfVariation declaration | Brian Graham | 1 | +0 | -2 |
| 2026-04-05 06:48 | Fix model order: haiku, sonnet, opus | Brian Graham | 1 | +1 | -1 |
| 2026-04-05 06:47 | Grid: per-task summary with cells/runs/score/cost. Cell: variance stats. Box plots: model order fix. | Brian Graham | 3 | +122 | -32 |
| 2026-04-05 06:36 | Add 32 new runs (113 total) | Brian Graham | 409 | +135579 | -979 |
| 2026-04-05 06:10 | Add cell detail page with run comparison and artifact gallery | Brian Graham | 3 | +612 | -1 |
| 2026-04-05 06:03 | Add variability analysis to insights page | Brian Graham | 2 | +724 | -1 |
| 2026-04-05 05:50 | Cell-based analytics across all dashboard views | Brian Graham | 4 | +443 | -105 |
| 2026-04-05 05:39 | Grid table: grouped view with score/cost ranges per config cell | Brian Graham | 1 | +165 | -43 |
| 2026-04-05 05:32 | Surprise cards now clickable with run details and outlier detection | Brian Graham | 1 | +198 | -63 |
| 2026-04-05 04:59 | Clean 31 bad runs, fix analysis metrics, re-run analysis | Brian Graham | 261 | +1772 | -81184 |
| 2026-04-04 22:21 | Add 49 new runs (113 total) | Brian Graham | 459 | +148224 | -2769 |
| 2026-04-04 21:19 | Add 6 new runs (73 total) | Brian Graham | 84 | +26852 | -523 |
| 2026-04-04 20:57 | Add 2 new runs (67 total) | Brian Graham | 210 | +12371 | -2988 |
| 2026-04-04 20:12 | Fix inflated scores for empty/broken games | Brian Graham | 3 | +66 | -45 |
| 2026-04-04 20:05 | Raise budget to $2/$10, delete 25 budget-killed sonnet/opus runs | Brian Graham | 278 | +8 | -89204 |
| 2026-04-04 09:42 | Progress. | Brian Graham | 154 | +36516 | -56 |
| 2026-04-04 09:07 | Add 5 new runs (72 total) | Brian Graham | 67 | +11296 | -2299 |
| 2026-04-04 08:47 | Document pipeline flags and workflow in README | Brian Graham | 1 | +42 | -13 |
| 2026-04-04 08:46 | Fix pipeline: reeval only when explicitly requested, auto-analyze on new runs | Brian Graham | 1 | +4 | -4 |
| 2026-04-04 08:46 | Add --reeval, --analyze, --full-pipeline flags to harness | Brian Graham | 1 | +49 | -1 |
| 2026-04-04 08:45 | Auto-commit and push results after sweep completes | Brian Graham | 1 | +32 | -0 |
| 2026-04-04 08:39 | Add new haiku and sonnet runs (72 total, 0 bad) | Brian Graham | 277 | +149635 | -3085 |
| 2026-04-04 08:25 | Fix grid table column order: Pass between Score and Cost | Brian Graham | 1 | +16 | -1 |
| 2026-04-04 08:25 | Auto-refresh OAuth token during sweeps | Brian Graham | 2 | +35 | -0 |
| 2026-04-04 08:23 | Add sortable grid columns, show context file on run page, update TODO | Brian Graham | 4 | +142 | -55 |
| 2026-04-04 08:04 | Fix BumpChart empty state, add HeatmapMatrix title | Brian Graham | 2 | +50 | -5 |
| 2026-04-04 08:02 | Handle language=unspecified in workspace setup and eval | Brian Graham | 2 | +5 | -2 |
| 2026-04-04 07:58 | Restyle bar charts to match SMUI | Brian Graham | 1 | +85 | -25 |
| 2026-04-04 07:57 | Add missing tool axis labels on compare page | Brian Graham | 1 | +5 | -0 |
| 2026-04-04 07:53 | Fix process is not defined error, split types for client safety | Brian Graham | 16 | +135 | -123 |
| 2026-04-04 07:32 | Align theme with SMUI, add light/dark mode toggle | Brian Graham | 2 | +392 | -76 |
| 2026-04-04 07:27 | Add tool axes to RunMeta type and AXIS_NAMES | Brian Graham | 1 | +15 | -0 |
| 2026-04-04 07:26 | Add Explore page with 6 interactive visualizations | Brian Graham | 9 | +2309 | -0 |
| 2026-04-04 07:11 | Add n= to chart labels, per-dimension metric selection | Brian Graham | 3 | +14 | -4 |
| 2026-04-04 07:04 | Add scatter plots and surprise detector to insights page | Brian Graham | 3 | +304 | -4 |
| 2026-04-04 06:59 | Re-evaluate all 67 runs with new eval pipeline | Brian Graham | 134 | +21588 | -156 |
| 2026-04-04 06:33 | Adopt Ship the Loop design system | Brian Graham | 2 | +174 | -41 |
| 2026-04-04 06:26 | Add re-eval command, show all eval dimensions in run detail UI | Brian Graham | 2 | +227 | -16 |
| 2026-04-04 06:21 | Comprehensive code quality analysis (Python rewrite) | Brian Graham | 2 | +379 | -5 |
| 2026-04-04 06:17 | Add HTML validation, duplication detection, accessibility, page load time | Brian Graham | 3 | +121 | -1 |
| 2026-04-04 06:16 | Fix score detection and rotation piece identification | Brian Graham | 1 | +53 | -21 |
| 2026-04-04 06:15 | Fix bad run detection, wire gameplay bot, fix compare page, improve rotation test | Brian Graham | 4 | +221 | -37 |
| 2026-04-04 06:06 | Add per-piece-type rotation test | Brian Graham | 1 | +171 | -0 |
| 2026-04-04 05:52 | Add gameplay bot, language=unspecified option, bump Playwright timeout | Brian Graham | 12 | +2244 | -1 |
| 2026-04-04 05:46 | Add code analysis and transcript analysis to eval pipeline | Brian Graham | 6 | +588 | -15 |
| 2026-04-04 05:29 | Increase timeout to 1200s (20 min) for larger models | Brian Graham | 1 | +1 | -1 |
| 2026-04-04 05:06 | Clean 26 more bad runs (timeouts + null cost), 67 good remain | Brian Graham | 183 | +0 | -28001 |
| 2026-04-04 04:49 | 93 good runs: 54 haiku, 36 sonnet, 3 opus | Brian Graham | 679 | +123602 | -8 |
| 2026-04-03 20:32 | Clean 51 failed runs, 38 good runs remain (32 haiku, 5 sonnet, 3 opus) | Brian Graham | 403 | +84607 | -0 |
| 2026-04-03 19:42 | Delete 47 failed runs (expired OAuth token), add token auto-refresh | Brian Graham | 335 | +79 | -50890 |
| 2026-04-03 19:39 | Auto-extract artifacts, add --model flag for sweep baseline | Brian Graham | 212 | +78722 | -4 |
| 2026-04-03 19:32 | Remove pre-tool-axes runs, add 60 main_effects sweep results | Brian Graham | 327 | +6018 | -6010 |
| 2026-04-03 18:32 | Add all-on and all-off anchor profiles | Brian Graham | 1 | +42 | -0 |
| 2026-04-03 18:30 | Add parallel execution to harness (-j flag) | Brian Graham | 1 | +155 | -87 |
| 2026-04-03 18:27 | Record full run config in transcript | Brian Graham | 2 | +32 | -1 |
| 2026-04-03 18:25 | Inject original prompts into existing transcript files | Brian Graham | 2 | +16 | -0 |
| 2026-04-03 18:25 | Fix inner iframe height in artifact preview | Brian Graham | 1 | +1 | -1 |
| 2026-04-03 18:19 | Remove bookmarks-api and data-pipeline tasks | Brian Graham | 45 | +0 | -2668 |
| 2026-04-03 18:17 | Link to source files on Forgejo from run detail page | Brian Graham | 1 | +16 | -0 |
| 2026-04-03 18:16 | Label exit code metric to clarify it's a process exit code | Brian Graham | 1 | +3 | -0 |
| 2026-04-03 18:15 | Add standalone link for artifact previews | Brian Graham | 1 | +5 | -2 |
| 2026-04-03 18:14 | Fix UTF-8 encoding in artifact iframe | Brian Graham | 1 | +3 | -3 |
| 2026-04-03 18:13 | Include prompt and context in transcript | Brian Graham | 2 | +51 | -1 |
| 2026-04-03 18:09 | Redesign run detail page, rich transcript viewer, tetris iframe preview | Brian Graham | 21 | +6382 | -337 |
| 2026-04-03 17:38 | Add claude_version to existing run metadata retroactively | Brian Graham | 6 | +15 | -12 |
| 2026-04-03 17:36 | UI improvements: readable run IDs, run detail layout, config pills | Brian Graham | 5 | +343 | -84 |
| 2026-04-03 17:25 | Fix results path resolution for Astro build | Brian Graham | 1 | +4 | -4 |
| 2026-04-03 17:19 | Add git commit to footer, document metrics and Pareto frontier | Brian Graham | 2 | +31 | -0 |
| 2026-04-03 17:15 | Add smoke run results to repo for dashboard | Brian Graham | 31 | +1126 | -2 |
| 2026-04-03 17:12 | Fix harness bugs, add DOE experiment design, insights dashboard | Brian Graham | 25 | +1936 | -90 |
| 2026-04-03 15:09 | Add benchmark harness, tasks, eval suites, and dashboard | Brian Graham | 62 | +11237 | -60 |
| 2026-04-03 12:49 | Bootstrap loop benchmarking project | Brian Graham | 2 | +91 | -0 |