Log - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

Date	Commit message	Author	Files	+	-
2026-04-16 14:47	Drop aborted glm-4.5-air run 0c19668a	Brian Graham	1	+6012	-6012
2026-04-16 14:46	Analyze and push 511 runs	Brian Graham	3	+0	-78
2026-04-16 14:45	Project runs across all dashboard pages	Brian Graham	4	+31	-16
2026-04-16 14:32	Project runs before serializing into index-page islands	Brian Graham	2	+57	-5
2026-04-16 13:53	Rebuild PCA from post-reeval 510-run dataset	Brian Graham	1	+7825	-5657
2026-04-16 13:53	Analyze and push 512 runs	Brian Graham	11	+2581	-2835
2026-04-16 13:50	Full reeval on GPU machine: V2 bot + SonarQube	Brian Graham	3287	+337966	-104023
2026-04-16 11:08	900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79%	Brian Graham	70	+6834	-2759
2026-04-16 10:10	Add human labels for 3 more calibration runs	Brian Graham	3	+43	-45
2026-04-16 09:58	Retag 176 pre-provider anthropic runs with prov=anth in cell_id	Brian Graham	4852	+511972	-512706
2026-04-16 07:54	Add human trial labels for 4 calibration runs	Brian Graham	4	+87	-88
2026-04-16 07:06	Preserve gameplay bot report on timeout	Brian Graham	74	+8580	-3235
2026-04-16 05:59	Remove 39 invalid glm-4.7 runs and add new sweep results	Brian Graham	1221	+265103	-9227
2026-04-15 14:27	Add 18 new runs (458 total)	Brian Graham	390	+102913	-1169
2026-04-15 13:41	Re-eval 17 calibration runs; fix reeval.py artifact cleanup	Brian Graham	19	+1825	-1219
2026-04-15 11:47	Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs	Brian Graham	1	+51	-28
2026-04-15 09:37	Remove 20 more zero-turn 429 runs from glm-5.1 sweep	Brian Graham	221	+0	-61425
2026-04-15 08:03	Add 20 new runs (460 total)	Brian Graham	227	+62260	-835
2026-04-15 05:03	Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn)	Brian Graham	274	+0	-76552
2026-04-15 02:30	Add 14 new runs (460 total)	Brian Graham	157	+41738	-1372
2026-04-14 20:42	Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers	Brian Graham	1154	+281373	-211
2026-04-14 11:08	Add 1 new runs (393 total)	Brian Graham	95	+35444	-3764
2026-04-14 07:32	Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken)	Brian Graham	748	+0	-208878
2026-04-14 04:07	Add 68 new runs (459 total)	Brian Graham	97	+25606	-1122
2026-04-14 03:54	Checkpoint: 60 runs (453 total)	Brian Graham	342	+94062	-1451
2026-04-14 03:03	Checkpoint: 30 runs (423 total)	Brian Graham	345	+93985	-2139
2026-04-13 21:11	Remove 68 zero-cost GLM-5.1 runs (auth failures)	Brian Graham	748	+0	-208822
2026-04-13 20:56	Add 24 new runs (459 total)	Brian Graham	48	+12972	-827
2026-04-13 20:53	Checkpoint: 20 runs (459 total)	Brian Graham	128	+32174	-1344
2026-04-13 20:46	Checkpoint: 10 runs (449 total)	Brian Graham	130	+32454	-1264
2026-04-13 20:34	Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia	Brian Graham	10	+1289	-1
2026-04-13 20:24	Add 44 new runs (435 total)	Brian Graham	50	+13089	-891
2026-04-13 20:17	Checkpoint: 40 runs (433 total)	Brian Graham	455	+125268	-2546
2026-04-13 15:14	Restore game artifacts deleted by GPU machine commit	Brian Graham	5463	+1557109	-6829
2026-04-13 14:13	CI: exclude artifacts/ from rsync --delete	Brian Graham	1	+3	-2
2026-04-13 13:44	Add all game artifacts, fix CI artifact rsync	Brian Graham	745	+539948	-232
2026-04-13 13:28	Re-eval all 390 runs with V2 bot on GPU machine	Brian Graham	6095	+39175	-1926721
2026-04-13 11:58	Context update for GPU machine testing	Brian Graham	3	+211	-67
2026-04-12 16:46	Update eval results: 123 runs re-evaluated with V2 bot	Brian Graham	226	+24551	-10583
2026-04-12 16:23	Add 7 new games to calibration page	Brian Graham	7	+252	-0
2026-04-12 15:56	Analyze and push 391 runs	Brian Graham	848	+15206	-184333
2026-04-12 15:43	Switch production eval to V2 gameplay bot	Brian Graham	1	+6	-2
2026-04-12 15:38	Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear, 9805c24a has game over overlay	Brian Graham	3	+14	-11
2026-04-12 06:31	V2: landmarks-based game_loads, updated calibration test names	Brian Graham	13	+248	-32
2026-04-11 07:59	V2: partial landmarks work (agent hit limit)	Brian Graham	1	+28	-0
2026-04-11 07:12	V2: stricter rotation test requires distinct rotation states	Brian Graham	2	+243	-38
2026-04-11 05:30	V2: game_over_display test passes on overlay OR restart presence	Brian Graham	1	+18	-12
2026-04-11 05:28	V2: language-agnostic game over detection, capture in Phase 6	Brian Graham	4	+134	-29
2026-04-11 05:25	Calibration cbbff570: rotation is flaky (human was wrong)	Brian Graham	1	+5	-5
2026-04-10 19:24	Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes	Brian Graham	1	+36	-32
2026-04-10 19:20	V2: fix AI player so it actually plays Tetris	Brian Graham	2	+149	-101
2026-04-10 19:11	Update methodology page with current bot architecture	Brian Graham	1	+200	-48
2026-04-10 19:06	Correct attribution: Pierre Dellacherie's 4-heuristic Tetris AI	Brian Graham	5	+17	-13
2026-04-10 18:04	V2: control discovery system	Brian Graham	5	+844	-2
2026-04-10 16:58	V2 fix: handle absolute-positioned active piece overlays	Brian Graham	3	+305	-10
2026-04-10 12:38	Add gemma426b run artifacts and results	Brian Graham	28	+5952	-0
2026-04-10 12:36	V2 bot: caching, bot/driver bridge, fixed CCW rotation test	Brian Graham	4	+1219	-39
2026-04-09 19:14	Add gameplay-bot-v2: two-tier architecture (Driver + Bot)	Brian Graham	5	+3887	-0
2026-04-09 18:22	Update calibration: 9805c24a (broken rotation, bad randomizer), cbbff570 (mostly works, spurious line clear, weird preview)	Brian Graham	2	+44	-46
2026-04-09 18:15	Add test #25 rendering_clean, update calibration data	Brian Graham	4	+95	-39
2026-04-09 10:56	Add two-tier architecture refactor spec for gameplay bot	Brian Graham	1	+877	-0
2026-04-09 09:48	Verify game interactivity via DOM + screenshot after start detection	Brian Graham	1	+109	-12
2026-04-09 09:20	Add grid re-sampling after game start detection	Brian Graham	2	+16	-0
2026-04-09 09:10	Add all 10 DOM games to calibration page	Brian Graham	5	+170	-0
2026-04-09 07:18	Update gameplay bot results for 10 DOM games with new start detection	Brian Graham	12	+2211	-391
2026-04-09 07:03	Language-agnostic start detection for gameplay bot	Brian Graham	1	+136	-120
2026-04-09 06:07	Update calibration: 93e8feea starts into game over, e2e04e75 no scoring	Brian Graham	2	+5	-5
2026-04-09 06:04	Calibration: copy button instead of JSON block, update human results	Brian Graham	3	+37	-23
2026-04-09 05:56	Fix calibration UI: connect Human Testing toggle to all cards	Brian Graham	1	+4	-3
2026-04-09 05:52	Interactive calibration UI with human testing mode	Brian Graham	2	+286	-126
2026-04-09 05:41	Add bot calibration page with human vs bot comparison	Brian Graham	6	+332	-0
2026-04-09 05:23	Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play	Brian Graham	4	+1016	-168
2026-04-08 21:40	Add comprehensive gameplay bot spec (24 tests, 8 phases)	Brian Graham	1	+467	-0
2026-04-08 18:54	Checkpoint: 40 runs (438 total)	Brian Graham	76	+16950	-1475
2026-04-08 18:37	Checkpoint: 35 runs (433 total)	Brian Graham	76	+17010	-1535
2026-04-08 18:20	Checkpoint: 30 runs (428 total)	Brian Graham	76	+16890	-1415
2026-04-08 18:04	Checkpoint: 25 runs (423 total)	Brian Graham	76	+16945	-1470
2026-04-08 17:47	Checkpoint: 20 runs (418 total)	Brian Graham	76	+16924	-1449
2026-04-08 17:30	Checkpoint: 15 runs (413 total)	Brian Graham	80	+17541	-1422
2026-04-08 16:09	Checkpoint: 10 runs (408 total)	Brian Graham	115	+23334	-1423
2026-04-08 11:48	Checkpoint: 5 runs (403 total)	Brian Graham	515	+87714	-3861
2026-04-08 11:23	Fix page load: use waitUntil commit, try root URL first	Brian Graham	1	+11	-7
2026-04-08 07:59	Rewrite start detection: 5-phase, language-agnostic, visual change	Brian Graham	3	+499	-292
2026-04-08 07:21	Fix large prompt handling: use wrapper script instead of bash -c	Brian Graham	1	+8	-5
2026-04-08 06:52	Checkpoint: 30 runs (414 total)	Brian Graham	475	+89396	-2605
2026-04-08 05:58	Add 95% CI bands, statistical power card, tornado CI whiskers	Brian Graham	6	+402	-7
2026-04-08 05:45	Checkpoint: 15 runs (399 total)	Brian Graham	6105	+42330	-2992694
2026-04-08 05:32	Switch qwen-3.6-plus from free to paid endpoint	Brian Graham	1	+1	-1
2026-04-08 05:17	Fix argument list too long for noise cells	Brian Graham	1	+20	-4
2026-04-08 05:09	Add minimax-m2.7 and kimi-k2.5 via OpenRouter	Brian Graham	3	+14	-1
2026-04-08 05:07	Checkpoint: 30 runs (453 total)	Brian Graham	2604	+1472941	-1658
2026-04-08 05:06	Checkpoint: 20 runs (433 total)	Brian Graham	91	+3745	-619
2026-04-08 05:05	Checkpoint: 10 runs (433 total)	Brian Graham	2700	+1409001	-1710
2026-04-08 04:59	Analyze and push 393 runs	Brian Graham	428	+51903	-6438
2026-04-07 22:11	Checkpoint: 10 runs (396 total)	Brian Graham	82	+18103	-1314
2026-04-07 21:30	Add 21 new runs (394 total)	Brian Graham	52	+14651	-810
2026-04-07 21:12	Checkpoint: 20 runs (393 total)	Brian Graham	537	+102099	-1940
2026-04-07 20:03	Add 33 new runs (373 total)	Brian Graham	2	+34	-0
2026-04-07 20:03	Checkpoint: 30 runs (373 total)	Brian Graham	168	+39907	-1680
2026-04-07 20:02	Checkpoint: 20 runs (365 total)	Brian Graham	184	+38555	-1869
2026-04-07 20:00	Checkpoint: 10 runs (353 total)	Brian Graham	560	+43424	-78060
2026-04-07 19:45	Checkpoint: 20 runs (376 total)	Brian Graham	162	+35934	-1819
2026-04-07 19:44	Checkpoint: 10 runs (365 total)	Brian Graham	192	+36162	-1710
2026-04-07 19:42	Checkpoint: 20 runs (343 total)	Brian Graham	184	+41424	-1774
2026-04-07 19:39	Add gemma-4-26b model via OpenRouter	Brian Graham	3	+8	-1
2026-04-07 19:39	Checkpoint: 10 runs (332 total)	Brian Graham	299	+74610	-1749
2026-04-07 18:57	Fix falling piece detector: faster polling, longer settle time	Brian Graham	1	+6	-6
2026-04-07 18:34	Analyze and push 316 runs	Brian Graham	777	+22642	-62215
2026-04-07 17:46	Add 28 new runs (337 total)	Brian Graham	115	+25816	-924
2026-04-07 17:32	Checkpoint: 20 runs (324 total)	Brian Graham	182	+35941	-9143
2026-04-07 17:21	Add OpenRouter provider with Qwen 3.6 Plus via litellm proxy	Brian Graham	4	+24	-3
2026-04-07 17:07	Version model names: haiku-4.5, sonnet-4.6, opus-4.6	Brian Graham	60	+18045	-24
2026-04-07 16:44	Add 30 new runs (322 total)	Brian Graham	178	+45158	-782
2026-04-07 16:43	Checkpoint: 20 runs (310 total)	Brian Graham	5565	+693787	-498747
2026-04-07 16:25	PCA: 10 components, taller scree bars, remove Variance Explained	Brian Graham	3	+3688	-1188
2026-04-07 16:17	Spread PCA dots wider (2.5x), shrink spheres	Brian Graham	1	+4	-4
2026-04-07 16:11	Install three.js deps in dashboard dir (fixes CI build)	Brian Graham	2	+687	-2
2026-04-07 16:07	Add --runs-per-cell docs, sweep workflow, clean 4 bad runs	Brian Graham	67	+4023	-5
2026-04-07 16:06	Add n= confidence to Insights page	Brian Graham	3	+49	-11
2026-04-07 16:04	Add --runs-per-cell flag to override runs_per_cell from grid.yaml	Brian Graham	1	+5	-1
2026-04-07 16:00	Add n= confidence indicators to Grid page	Brian Graham	3	+44	-12
2026-04-07 15:50	Add scree plot to PCA page	Brian Graham	3	+1195	-1011
2026-04-07 15:44	3D PCA scatter plot with react-three-fiber	Brian Graham	2	+310	-233
2026-04-07 15:42	Self-host JetBrains Mono fonts, remove Google Fonts CDN	Brian Graham	4	+21	-4
2026-04-07 15:37	Replace task chart with Top/Bottom 10 configs on grid page	Brian Graham	3	+376	-260
2026-04-07 15:29	Add model filter to Insights page (tornado, heatmap)	Brian Graham	28	+1200	-227
2026-04-07 15:28	PCA analysis page, remove violin dots	Brian Graham	5	+4008	-13
2026-04-07 15:26	Surprises tab, model selector, shared color palette integration	Brian Graham	8	+618	-50
2026-04-07 15:18	Checkpoint: 40 runs (266 total)	Brian Graham	391	+107184	-1310
2026-04-07 15:14	Add variability violin chart to Compare page	Brian Graham	2	+403	-0
2026-04-07 15:13	Shared color palette for 10 models across all charts	Brian Graham	5	+91	-51
2026-04-07 14:27	Checkpoint: 20 runs (246 total)	Brian Graham	384	+104408	-1341
2026-04-07 13:08	Analyze and push 222 runs	Brian Graham	1	+222	-222
2026-04-07 13:08	Re-eval 222 runs (10 glm-4.5-air, 26 glm-4.7, 9 glm-5.1, 74 haiku, 51 opus, 52 sonnet)	Brian Graham	412	+11416	-4401
2026-04-07 11:56	Increase gameplay bot timeout to 300s (was 180s)	Brian Graham	2	+3	-3
2026-04-07 11:04	Analyze and push 222 runs	Brian Graham	1	+222	-222
2026-04-07 11:04	Re-eval 222 runs (10 glm-4.5-air, 26 glm-4.7, 9 glm-5.1, 74 haiku, 51 opus, 52 sonnet)	Brian Graham	1137	+16288	-174627
2026-04-07 10:34	Stop deleting turns=1 and timeout runs as invalid	Brian Graham	2	+32	-48
2026-04-07 10:23	Analyze and push 253 runs	Brian Graham	127	+19233	-1114
2026-04-07 10:03	Checkpoint: 40 runs (250 total)	Brian Graham	210	+54373	-1220
2026-04-07 08:52	Checkpoint: 30 runs (240 total)	Brian Graham	49	+10428	-1050
2026-04-07 08:39	Discard runs with 0 turns before eval/commit	Brian Graham	1	+13	-0
2026-04-07 08:34	Analyze and push 236 runs	Brian Graham	200	+35695	-9482
2026-04-07 07:33	Checkpoint: 20 runs (233 total)	Brian Graham	163	+40023	-1179
2026-04-07 06:54	Analyze and push 225 runs	Brian Graham	95	+8875	-1248
2026-04-07 06:35	Checkpoint: 10 runs (223 total)	Brian Graham	151	+35332	-1170
2026-04-07 05:46	Exclude dist/build from sonarqube scans	Brian Graham	1	+1	-1
2026-04-07 05:39	Analyze and push 216 runs	Brian Graham	100	+16009	-1019
2026-04-07 05:37	Rewrite bot start detection: falling piece detector, conditional phases	Brian Graham	2	+584	-250
2026-04-07 05:27	Add spec for gameplay bot rewrite (falling piece detection)	Brian Graham	1	+66	-0
2026-04-07 05:20	Add rich UI widget for api_retry rate limit events in transcript	Brian Graham	1	+27	-0
2026-04-07 05:12	Analyze and push 211 runs	Brian Graham	226	+1328	-71109
2026-04-07 05:07	Add limitation: UI bugs masking working gameplay logic	Brian Graham	1	+1	-0
2026-04-07 05:01	Document bot false positives and unbuildable game limitation	Brian Graham	1	+3	-0
2026-04-07 04:48	Analyze and push 222 runs	Brian Graham	177	+0	-6516
2026-04-07 04:42	Analyze and push 278 runs	Brian Graham	31	+1036	-7374
2026-04-07 04:41	Add analyze-and-push.py for quick analysis without re-eval	Brian Graham	1	+121	-0
2026-04-07 04:40	Remove 192 rate-limited zai runs, update analysis (103 zai + 177 anthropic)	Brian Graham	2210	+1810	-613122
2026-04-07 02:31	Add 99 new runs (472 total)	Brian Graham	71	+19085	-747
2026-04-07 01:50	Checkpoint: 95 runs (470 total)	Brian Graham	82	+21377	-1137
2026-04-07 01:35	Checkpoint: 90 runs (465 total)	Brian Graham	120	+17523	-1416
2026-04-07 01:28	Checkpoint: 85 runs (442 total)	Brian Graham	70	+16767	-1354
2026-04-07 01:18	Checkpoint: 80 runs (437 total)	Brian Graham	69	+16723	-1329
2026-04-07 01:11	Checkpoint: 75 runs (432 total)	Brian Graham	69	+16727	-1322
2026-04-07 01:01	Checkpoint: 70 runs (427 total)	Brian Graham	69	+16601	-1206
2026-04-07 00:54	Checkpoint: 65 runs (422 total)	Brian Graham	69	+16660	-1253
2026-04-07 00:44	Checkpoint: 60 runs (417 total)	Brian Graham	69	+16626	-1233
2026-04-07 00:37	Checkpoint: 55 runs (412 total)	Brian Graham	69	+16715	-1310
2026-04-07 00:27	Checkpoint: 50 runs (407 total)	Brian Graham	69	+16731	-1336
2026-04-07 00:20	Checkpoint: 45 runs (402 total)	Brian Graham	69	+16852	-1315
2026-04-07 00:10	Checkpoint: 40 runs (397 total)	Brian Graham	69	+17020	-1363
2026-04-07 00:02	Checkpoint: 35 runs (392 total)	Brian Graham	69	+16715	-1313
2026-04-06 23:53	Checkpoint: 30 runs (387 total)	Brian Graham	69	+16735	-1337
2026-04-06 23:46	Checkpoint: 25 runs (382 total)	Brian Graham	69	+16777	-1374
2026-04-06 23:36	Checkpoint: 20 runs (377 total)	Brian Graham	69	+16696	-1300
2026-04-06 23:29	Checkpoint: 15 runs (372 total)	Brian Graham	69	+16388	-1182
2026-04-06 23:19	Checkpoint: 10 runs (367 total)	Brian Graham	59	+13543	-1256
2026-04-06 23:12	Checkpoint: 5 runs (363 total)	Brian Graham	61	+10762	-1383
2026-04-06 23:02	Add 99 new runs (358 total)	Brian Graham	105	+28512	-791
2026-04-06 22:48	Checkpoint: 90 runs (351 total)	Brian Graham	178	+33023	-1153
2026-04-06 22:31	Checkpoint: 80 runs (323 total)	Brian Graham	124	+32192	-1042
2026-04-06 22:14	Checkpoint: 70 runs (313 total)	Brian Graham	124	+31979	-1029
2026-04-06 21:57	Checkpoint: 60 runs (303 total)	Brian Graham	279	+71033	-1046
2026-04-06 20:44	Checkpoint: 50 runs (293 total)	Brian Graham	138	+36145	-1177
2026-04-06 20:23	Checkpoint: 40 runs (283 total)	Brian Graham	124	+32245	-1182
2026-04-06 20:06	Checkpoint: 30 runs (273 total)	Brian Graham	124	+31988	-1189
2026-04-06 19:49	Checkpoint: 20 runs (263 total)	Brian Graham	124	+31852	-1251
2026-04-06 19:32	Checkpoint: 10 runs (253 total)	Brian Graham	167	+25364	-5600
2026-04-06 19:11	Document gameplay bot known limitations (wall kicks, lock delay, etc.)	Brian Graham	1	+1	-0
2026-04-06 19:10	Force all x-axis labels to show on box plot (interval={0})	Brian Graham	1	+3	-2
2026-04-06 19:08	Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands	Brian Graham	1	+43	-24
2026-04-06 18:47	Checkpoint: 40 runs (244 total)	Brian Graham	412	+111457	-2861
2026-04-06 18:38	Add new axes to run config box, link run to cell	Brian Graham	1	+11	-1
2026-04-06 18:36	Add 3 new runs (225 total)	Brian Graham	113	+15814	-2215
2026-04-06 18:35	Checkpoint: 20 runs (224 total)	Brian Graham	439	+97726	-1336
2026-04-06 18:31	Put (n=) count on separate line below model name in box plot	Brian Graham	1	+12	-4
2026-04-06 18:26	Remove scatter dots from Score Distribution box plot	Brian Graham	1	+2	-17
2026-04-06 18:20	Fix main_effects for provider filtering	Brian Graham	2	+4	-0
2026-04-06 18:18	Box plots for grid charts, model toggles for scatter hulls	Brian Graham	2	+353	-115
2026-04-06 18:11	Fix auto-commit using old artifacts path	Brian Graham	1	+2	-2
2026-04-06 18:10	Add 6 Z.AI GLM runs (glm-4.5-air, glm-4.7, glm-5.1)	Brian Graham	137	+32337	-758
2026-04-06 18:03	Add --commit-every N flag for periodic analyze+push	Brian Graham	1	+34	-0
2026-04-06 17:36	Add -n/--max-runs flag to limit total runs	Brian Graham	1	+9	-1
2026-04-06 17:26	Use real GLM model names directly, drop model_map	Brian Graham	4	+25	-52
2026-04-06 17:13	Add zai-smoke profile, fix provider in profiles	Brian Graham	1	+32	-4
2026-04-06 17:09	Accept actual model names with --model for non-anthropic providers	Brian Graham	1	+9	-1
2026-04-06 17:08	Use actual_model in cell_ids and dashboard display	Brian Graham	5	+42	-10
2026-04-06 17:07	Re-eval 177 runs (74 haiku, 51 opus, 52 sonnet)	Brian Graham	433	+28858	-4765
2026-04-06 17:01	Require --provider flag for run.py	Brian Graham	1	+19	-0
2026-04-06 16:54	Add provider axis for Z.AI (GLM) model support	Brian Graham	14	+68	-7
2026-04-06 15:11	Add short_id, short_cell_id, claude_version to analysis skip keys	Brian Graham	2	+5	-0
2026-04-06 14:18	Add short URL IDs, test fixtures, and context noise files	Brian Graham	194	+15979	-704
2026-04-06 13:57	Grid expansion: 7 new axes, migrate all run IDs to abbreviated format	Brian Graham	4611	+456579	-454954
2026-04-06 13:54	Fix serve process leak in gameplay bot eval	Brian Graham	1	+41	-23
2026-04-06 13:13	Re-eval 173 runs (71 haiku, 51 opus, 51 sonnet)	Brian Graham	561	+83595	-5426
2026-04-06 12:02	Fix cell_id length, add SonarQube details, rebuild gameplay bot	Brian Graham	6	+158	-145
2026-04-06 10:42	Re-eval 159 runs (57 haiku, 51 opus, 51 sonnet)	Brian Graham	307	+27096	-6732
2026-04-06 09:44	Re-eval 159 runs (57 haiku, 51 opus, 51 sonnet)	Brian Graham	199	+7489	-17294
2026-04-06 08:44	Add sonarqube metric to analysis pipeline, fix metric labels	Brian Graham	3	+18	-4
2026-04-06 08:42	Outcome = gameplay + SonarQube, not gameplay + lint/typecheck	Brian Graham	2	+11	-13
2026-04-06 08:36	Update CLAUDE.md with complete project state	Brian Graham	1	+53	-34
2026-04-06 08:32	Flexible axes on scatter plots and efficiency frontier	Brian Graham	2	+211	-45
2026-04-06 08:30	Add methodology page explaining scoring and experiment design	Brian Graham	1	+477	-0
2026-04-06 08:30	Add clean-and-reeval command	Brian Graham	1	+216	-0
2026-04-06 08:29	Restructure scoring: outcome vs output, flexible scatter, methodology nav	Brian Graham	7	+166	-60
2026-04-06 08:07	Fix artifacts: dist/ was globally gitignored, breaking compiled games	Brian Graham	270	+56036	-1
2026-04-06 07:37	Prevent off-grid reading and false positive piece detection	Brian Graham	2	+42	-2
2026-04-06 07:26	Wire SonarQube into eval pipeline	Brian Graham	1	+16	-0
2026-04-06 07:25	Add SonarQube integration for code quality analysis	Brian Graham	1	+185	-0
2026-04-06 06:30	Rewrite gameplay bot with continuous scanning and no false positives	Brian Graham	6	+1366	-796
2026-04-06 04:36	Scatter plot: 4 density levels instead of 2	Brian Graham	1	+43	-34
2026-04-06 04:31	Fix empty scatter plots: add hidden Scatter to seed axis scales	Brian Graham	1	+9	-0
2026-04-06 04:31	Fix quality scoring, add budget/timeout indicators	Brian Graham	3	+31	-3
2026-04-06 04:27	Sort model bar chart: haiku, sonnet, opus	Brian Graham	1	+6	-1
2026-04-06 04:23	Move artifacts out of Astro public/, fix 13GB node_modules bloat	Brian Graham	1886	+227458	-275543
2026-04-06 04:10	Add directional indicators to correlation matrix	Brian Graham	1	+10	-10
2026-04-06 03:55	Re-eval all 159 runs with fixed scoring and improved bot calibration	Brian Graham	269	+7357	-11022
2026-04-05 22:20	Add 2 new runs (67 total)	Brian Graham	126	+2301	-4819
2026-04-05 22:04	Add 25 new runs (159 total)	Brian Graham	439	+102980	-3713
2026-04-05 21:40	Fix score calculation: remove double-counting, normalize weights	Brian Graham	2	+9	-11
2026-04-05 21:34	Improve gameplay bot calibration with fallbacks and DOM grid detection	Brian Graham	3	+385	-50
2026-04-05 21:31	Add 6 new runs (73 total)	Brian Graham	210	+1193	-65163
2026-04-05 21:28	Show detailed score breakdowns on run page	Brian Graham	1	+127	-0
2026-04-05 19:58	Add 35 new runs (159 total)	Brian Graham	362	+110987	-917
2026-04-05 19:42	Add 4 new runs (124 total)	Brian Graham	56	+17212	-668
2026-04-05 19:22	Convert all charts to cell-based: every visualization now shows cells not runs	Brian Graham	7	+263	-169
2026-04-05 19:03	Clean 9 incomplete runs (no HTML output), re-run analysis	Brian Graham	63	+1014	-12871
2026-04-05 06:55	Add 49 new runs (113 total)	Brian Graham	225	+59081	-11558
2026-04-05 06:51	Fix duplicate coefficientOfVariation declaration	Brian Graham	1	+0	-2
2026-04-05 06:48	Fix model order: haiku, sonnet, opus	Brian Graham	1	+1	-1
2026-04-05 06:47	Grid: per-task summary with cells/runs/score/cost. Cell: variance stats. Box plots: model order fix.	Brian Graham	3	+122	-32
2026-04-05 06:36	Add 32 new runs (113 total)	Brian Graham	409	+135579	-979
2026-04-05 06:10	Add cell detail page with run comparison and artifact gallery	Brian Graham	3	+612	-1
2026-04-05 06:03	Add variability analysis to insights page	Brian Graham	2	+724	-1
2026-04-05 05:50	Cell-based analytics across all dashboard views	Brian Graham	4	+443	-105
2026-04-05 05:39	Grid table: grouped view with score/cost ranges per config cell	Brian Graham	1	+165	-43
2026-04-05 05:32	Surprise cards now clickable with run details and outlier detection	Brian Graham	1	+198	-63
2026-04-05 04:59	Clean 31 bad runs, fix analysis metrics, re-run analysis	Brian Graham	261	+1772	-81184
2026-04-04 22:21	Add 49 new runs (113 total)	Brian Graham	459	+148224	-2769
2026-04-04 21:19	Add 6 new runs (73 total)	Brian Graham	84	+26852	-523
2026-04-04 20:57	Add 2 new runs (67 total)	Brian Graham	210	+12371	-2988
2026-04-04 20:12	Fix inflated scores for empty/broken games	Brian Graham	3	+66	-45
2026-04-04 20:05	Raise budget to $2/$10, delete 25 budget-killed sonnet/opus runs	Brian Graham	278	+8	-89204
2026-04-04 09:42	Progress.	Brian Graham	154	+36516	-56
2026-04-04 09:07	Add 5 new runs (72 total)	Brian Graham	67	+11296	-2299
2026-04-04 08:47	Document pipeline flags and workflow in README	Brian Graham	1	+42	-13
2026-04-04 08:46	Fix pipeline: reeval only when explicitly requested, auto-analyze on new runs	Brian Graham	1	+4	-4
2026-04-04 08:46	Add --reeval, --analyze, --full-pipeline flags to harness	Brian Graham	1	+49	-1
2026-04-04 08:45	Auto-commit and push results after sweep completes	Brian Graham	1	+32	-0
2026-04-04 08:39	Add new haiku and sonnet runs (72 total, 0 bad)	Brian Graham	277	+149635	-3085
2026-04-04 08:25	Fix grid table column order: Pass between Score and Cost	Brian Graham	1	+16	-1
2026-04-04 08:25	Auto-refresh OAuth token during sweeps	Brian Graham	2	+35	-0
2026-04-04 08:23	Add sortable grid columns, show context file on run page, update TODO	Brian Graham	4	+142	-55
2026-04-04 08:04	Fix BumpChart empty state, add HeatmapMatrix title	Brian Graham	2	+50	-5
2026-04-04 08:02	Handle language=unspecified in workspace setup and eval	Brian Graham	2	+5	-2
2026-04-04 07:58	Restyle bar charts to match SMUI	Brian Graham	1	+85	-25
2026-04-04 07:57	Add missing tool axis labels on compare page	Brian Graham	1	+5	-0
2026-04-04 07:53	Fix process is not defined error, split types for client safety	Brian Graham	16	+135	-123
2026-04-04 07:32	Align theme with SMUI, add light/dark mode toggle	Brian Graham	2	+392	-76
2026-04-04 07:27	Add tool axes to RunMeta type and AXIS_NAMES	Brian Graham	1	+15	-0
2026-04-04 07:26	Add Explore page with 6 interactive visualizations	Brian Graham	9	+2309	-0
2026-04-04 07:11	Add n= to chart labels, per-dimension metric selection	Brian Graham	3	+14	-4
2026-04-04 07:04	Add scatter plots and surprise detector to insights page	Brian Graham	3	+304	-4
2026-04-04 06:59	Re-evaluate all 67 runs with new eval pipeline	Brian Graham	134	+21588	-156
2026-04-04 06:33	Adopt Ship the Loop design system	Brian Graham	2	+174	-41
2026-04-04 06:26	Add re-eval command, show all eval dimensions in run detail UI	Brian Graham	2	+227	-16
2026-04-04 06:21	Comprehensive code quality analysis (Python rewrite)	Brian Graham	2	+379	-5
2026-04-04 06:17	Add HTML validation, duplication detection, accessibility, page load time	Brian Graham	3	+121	-1
2026-04-04 06:16	Fix score detection and rotation piece identification	Brian Graham	1	+53	-21
2026-04-04 06:15	Fix bad run detection, wire gameplay bot, fix compare page, improve rotation test	Brian Graham	4	+221	-37
2026-04-04 06:06	Add per-piece-type rotation test	Brian Graham	1	+171	-0
2026-04-04 05:52	Add gameplay bot, language=unspecified option, bump Playwright timeout	Brian Graham	12	+2244	-1
2026-04-04 05:46	Add code analysis and transcript analysis to eval pipeline	Brian Graham	6	+588	-15
2026-04-04 05:29	Increase timeout to 1200s (20 min) for larger models	Brian Graham	1	+1	-1
2026-04-04 05:06	Clean 26 more bad runs (timeouts + null cost), 67 good remain	Brian Graham	183	+0	-28001
2026-04-04 04:49	93 good runs: 54 haiku, 36 sonnet, 3 opus	Brian Graham	679	+123602	-8
2026-04-03 20:32	Clean 51 failed runs, 38 good runs remain (32 haiku, 5 sonnet, 3 opus)	Brian Graham	403	+84607	-0
2026-04-03 19:42	Delete 47 failed runs (expired OAuth token), add token auto-refresh	Brian Graham	335	+79	-50890
2026-04-03 19:39	Auto-extract artifacts, add --model flag for sweep baseline	Brian Graham	212	+78722	-4
2026-04-03 19:32	Remove pre-tool-axes runs, add 60 main_effects sweep results	Brian Graham	327	+6018	-6010
2026-04-03 18:32	Add all-on and all-off anchor profiles	Brian Graham	1	+42	-0
2026-04-03 18:30	Add parallel execution to harness (-j flag)	Brian Graham	1	+155	-87
2026-04-03 18:27	Record full run config in transcript	Brian Graham	2	+32	-1
2026-04-03 18:25	Inject original prompts into existing transcript files	Brian Graham	2	+16	-0
2026-04-03 18:25	Fix inner iframe height in artifact preview	Brian Graham	1	+1	-1
2026-04-03 18:19	Remove bookmarks-api and data-pipeline tasks	Brian Graham	45	+0	-2668
2026-04-03 18:17	Link to source files on Forgejo from run detail page	Brian Graham	1	+16	-0
2026-04-03 18:16	Label exit code metric to clarify it's a process exit code	Brian Graham	1	+3	-0
2026-04-03 18:15	Add standalone link for artifact previews	Brian Graham	1	+5	-2
2026-04-03 18:14	Fix UTF-8 encoding in artifact iframe	Brian Graham	1	+3	-3
2026-04-03 18:13	Include prompt and context in transcript	Brian Graham	2	+51	-1
2026-04-03 18:09	Redesign run detail page, rich transcript viewer, tetris iframe preview	Brian Graham	21	+6382	-337
2026-04-03 17:38	Add claude_version to existing run metadata retroactively	Brian Graham	6	+15	-12
2026-04-03 17:36	UI improvements: readable run IDs, run detail layout, config pills	Brian Graham	5	+343	-84
2026-04-03 17:25	Fix results path resolution for Astro build	Brian Graham	1	+4	-4
2026-04-03 17:19	Add git commit to footer, document metrics and Pareto frontier	Brian Graham	2	+31	-0
2026-04-03 17:15	Add smoke run results to repo for dashboard	Brian Graham	31	+1126	-2
2026-04-03 17:12	Fix harness bugs, add DOE experiment design, insights dashboard	Brian Graham	25	+1936	-90
2026-04-03 15:09	Add benchmark harness, tasks, eval suites, and dashboard	Brian Graham	62	+11237	-60
2026-04-03 12:49	Bootstrap loop benchmarking project	Brian Graham	2	+91	-0

	git clone https://git.shiptheloop.com/loop-benchmarking.git