loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

Date | Commit message | Author | Files | + | -
2026-04-16 14:47 | Drop aborted glm-4.5-air run 0c19668a | Brian Graham | 1 | +6012 | -6012
2026-04-16 14:46 | Analyze and push 511 runs | Brian Graham | 3 | +0 | -78
2026-04-16 14:45 | Project runs across all dashboard pages | Brian Graham | 4 | +31 | -16
2026-04-16 14:32 | Project runs before serializing into index-page islands | Brian Graham | 2 | +57 | -5
2026-04-16 13:53 | Rebuild PCA from post-reeval 510-run dataset | Brian Graham | 1 | +7825 | -5657
2026-04-16 13:53 | Analyze and push 512 runs | Brian Graham | 11 | +2581 | -2835
2026-04-16 13:50 | Full reeval on GPU machine: V2 bot + SonarQube | Brian Graham | 3287 | +337966 | -104023
2026-04-16 11:08 | 900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79% | Brian Graham | 70 | +6834 | -2759
2026-04-16 10:10 | Add human labels for 3 more calibration runs | Brian Graham | 3 | +43 | -45
2026-04-16 09:58 | Retag 176 pre-provider anthropic runs with prov=anth in cell_id | Brian Graham | 4852 | +511972 | -512706
2026-04-16 07:54 | Add human trial labels for 4 calibration runs | Brian Graham | 4 | +87 | -88
2026-04-16 07:06 | Preserve gameplay bot report on timeout | Brian Graham | 74 | +8580 | -3235
2026-04-16 05:59 | Remove 39 invalid glm-4.7 runs and add new sweep results | Brian Graham | 1221 | +265103 | -9227
2026-04-15 14:27 | Add 18 new runs (458 total) | Brian Graham | 390 | +102913 | -1169
2026-04-15 13:41 | Re-eval 17 calibration runs; fix reeval.py artifact cleanup | Brian Graham | 19 | +1825 | -1219
2026-04-15 11:47 | Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs | Brian Graham | 1 | +51 | -28
2026-04-15 09:37 | Remove 20 more zero-turn 429 runs from glm-5.1 sweep | Brian Graham | 221 | +0 | -61425
2026-04-15 08:03 | Add 20 new runs (460 total) | Brian Graham | 227 | +62260 | -835
2026-04-15 05:03 | Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn) | Brian Graham | 274 | +0 | -76552
2026-04-15 02:30 | Add 14 new runs (460 total) | Brian Graham | 157 | +41738 | -1372
2026-04-14 20:42 | Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers | Brian Graham | 1154 | +281373 | -211
2026-04-14 11:08 | Add 1 new runs (393 total) | Brian Graham | 95 | +35444 | -3764
2026-04-14 07:32 | Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken) | Brian Graham | 748 | +0 | -208878
2026-04-14 04:07 | Add 68 new runs (459 total) | Brian Graham | 97 | +25606 | -1122
2026-04-14 03:54 | Checkpoint: 60 runs (453 total) | Brian Graham | 342 | +94062 | -1451
2026-04-14 03:03 | Checkpoint: 30 runs (423 total) | Brian Graham | 345 | +93985 | -2139
2026-04-13 21:11 | Remove 68 zero-cost GLM-5.1 runs (auth failures) | Brian Graham | 748 | +0 | -208822
2026-04-13 20:56 | Add 24 new runs (459 total) | Brian Graham | 48 | +12972 | -827
2026-04-13 20:53 | Checkpoint: 20 runs (459 total) | Brian Graham | 128 | +32174 | -1344
2026-04-13 20:46 | Checkpoint: 10 runs (449 total) | Brian Graham | 130 | +32454 | -1264
2026-04-13 20:34 | Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia | Brian Graham | 10 | +1289 | -1
2026-04-13 20:24 | Add 44 new runs (435 total) | Brian Graham | 50 | +13089 | -891
2026-04-13 20:17 | Checkpoint: 40 runs (433 total) | Brian Graham | 455 | +125268 | -2546
2026-04-13 15:14 | Restore game artifacts deleted by GPU machine commit | Brian Graham | 5463 | +1557109 | -6829
2026-04-13 14:13 | CI: exclude artifacts/ from rsync --delete | Brian Graham | 1 | +3 | -2
2026-04-13 13:44 | Add all game artifacts, fix CI artifact rsync | Brian Graham | 745 | +539948 | -232
2026-04-13 13:28 | Re-eval all 390 runs with V2 bot on GPU machine | Brian Graham | 6095 | +39175 | -1926721
2026-04-13 11:58 | Context update for GPU machine testing | Brian Graham | 3 | +211 | -67
2026-04-12 16:46 | Update eval results: 123 runs re-evaluated with V2 bot | Brian Graham | 226 | +24551 | -10583
2026-04-12 16:23 | Add 7 new games to calibration page | Brian Graham | 7 | +252 | -0
2026-04-12 15:56 | Analyze and push 391 runs | Brian Graham | 848 | +15206 | -184333
2026-04-12 15:43 | Switch production eval to V2 gameplay bot | Brian Graham | 1 | +6 | -2
2026-04-12 15:38 | Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear, 9805c24a has game over overlay | Brian Graham | 3 | +14 | -11
2026-04-12 06:31 | V2: landmarks-based game_loads, updated calibration test names | Brian Graham | 13 | +248 | -32
2026-04-11 07:59 | V2: partial landmarks work (agent hit limit) | Brian Graham | 1 | +28 | -0
2026-04-11 07:12 | V2: stricter rotation test requires distinct rotation states | Brian Graham | 2 | +243 | -38
2026-04-11 05:30 | V2: game_over_display test passes on overlay OR restart presence | Brian Graham | 1 | +18 | -12
2026-04-11 05:28 | V2: language-agnostic game over detection, capture in Phase 6 | Brian Graham | 4 | +134 | -29
2026-04-11 05:25 | Calibration cbbff570: rotation is flaky (human was wrong) | Brian Graham | 1 | +5 | -5
2026-04-10 19:24 | Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes | Brian Graham | 1 | +36 | -32
2026-04-10 19:20 | V2: fix AI player so it actually plays Tetris | Brian Graham | 2 | +149 | -101
2026-04-10 19:11 | Update methodology page with current bot architecture | Brian Graham | 1 | +200 | -48
2026-04-10 19:06 | Correct attribution: Pierre Dellacherie's 4-heuristic Tetris AI | Brian Graham | 5 | +17 | -13
2026-04-10 18:04 | V2: control discovery system | Brian Graham | 5 | +844 | -2
2026-04-10 16:58 | V2 fix: handle absolute-positioned active piece overlays | Brian Graham | 3 | +305 | -10
2026-04-10 12:38 | Add gemma426b run artifacts and results | Brian Graham | 28 | +5952 | -0
2026-04-10 12:36 | V2 bot: caching, bot/driver bridge, fixed CCW rotation test | Brian Graham | 4 | +1219 | -39
2026-04-09 19:14 | Add gameplay-bot-v2: two-tier architecture (Driver + Bot) | Brian Graham | 5 | +3887 | -0
2026-04-09 18:22 | Update calibration: 9805c24a (broken rotation, bad randomizer), cbbff570 (mostly works, spurious line clear, weird preview) | Brian Graham | 2 | +44 | -46
2026-04-09 18:15 | Add test #25 rendering_clean, update calibration data | Brian Graham | 4 | +95 | -39
2026-04-09 10:56 | Add two-tier architecture refactor spec for gameplay bot | Brian Graham | 1 | +877 | -0
2026-04-09 09:48 | Verify game interactivity via DOM + screenshot after start detection | Brian Graham | 1 | +109 | -12
2026-04-09 09:20 | Add grid re-sampling after game start detection | Brian Graham | 2 | +16 | -0
2026-04-09 09:10 | Add all 10 DOM games to calibration page | Brian Graham | 5 | +170 | -0
2026-04-09 07:18 | Update gameplay bot results for 10 DOM games with new start detection | Brian Graham | 12 | +2211 | -391
2026-04-09 07:03 | Language-agnostic start detection for gameplay bot | Brian Graham | 1 | +136 | -120
2026-04-09 06:07 | Update calibration: 93e8feea starts into game over, e2e04e75 no scoring | Brian Graham | 2 | +5 | -5
2026-04-09 06:04 | Calibration: copy button instead of JSON block, update human results | Brian Graham | 3 | +37 | -23
2026-04-09 05:56 | Fix calibration UI: connect Human Testing toggle to all cards | Brian Graham | 1 | +4 | -3
2026-04-09 05:52 | Interactive calibration UI with human testing mode | Brian Graham | 2 | +286 | -126
2026-04-09 05:41 | Add bot calibration page with human vs bot comparison | Brian Graham | 6 | +332 | -0
2026-04-09 05:23 | Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play | Brian Graham | 4 | +1016 | -168
2026-04-08 21:40 | Add comprehensive gameplay bot spec (24 tests, 8 phases) | Brian Graham | 1 | +467 | -0
2026-04-08 18:54 | Checkpoint: 40 runs (438 total) | Brian Graham | 76 | +16950 | -1475
2026-04-08 18:37 | Checkpoint: 35 runs (433 total) | Brian Graham | 76 | +17010 | -1535
2026-04-08 18:20 | Checkpoint: 30 runs (428 total) | Brian Graham | 76 | +16890 | -1415
2026-04-08 18:04 | Checkpoint: 25 runs (423 total) | Brian Graham | 76 | +16945 | -1470
2026-04-08 17:47 | Checkpoint: 20 runs (418 total) | Brian Graham | 76 | +16924 | -1449
2026-04-08 17:30 | Checkpoint: 15 runs (413 total) | Brian Graham | 80 | +17541 | -1422
2026-04-08 16:09 | Checkpoint: 10 runs (408 total) | Brian Graham | 115 | +23334 | -1423
2026-04-08 11:48 | Checkpoint: 5 runs (403 total) | Brian Graham | 515 | +87714 | -3861
2026-04-08 11:23 | Fix page load: use waitUntil commit, try root URL first | Brian Graham | 1 | +11 | -7
2026-04-08 07:59 | Rewrite start detection: 5-phase, language-agnostic, visual change | Brian Graham | 3 | +499 | -292
2026-04-08 07:21 | Fix large prompt handling: use wrapper script instead of bash -c | Brian Graham | 1 | +8 | -5
2026-04-08 06:52 | Checkpoint: 30 runs (414 total) | Brian Graham | 475 | +89396 | -2605
2026-04-08 05:58 | Add 95% CI bands, statistical power card, tornado CI whiskers | Brian Graham | 6 | +402 | -7
2026-04-08 05:45 | Checkpoint: 15 runs (399 total) | Brian Graham | 6105 | +42330 | -2992694
2026-04-08 05:32 | Switch qwen-3.6-plus from free to paid endpoint | Brian Graham | 1 | +1 | -1
2026-04-08 05:17 | Fix argument list too long for noise cells | Brian Graham | 1 | +20 | -4
2026-04-08 05:09 | Add minimax-m2.7 and kimi-k2.5 via OpenRouter | Brian Graham | 3 | +14 | -1
2026-04-08 05:07 | Checkpoint: 30 runs (453 total) | Brian Graham | 2604 | +1472941 | -1658
2026-04-08 05:06 | Checkpoint: 20 runs (433 total) | Brian Graham | 91 | +3745 | -619
2026-04-08 05:05 | Checkpoint: 10 runs (433 total) | Brian Graham | 2700 | +1409001 | -1710
2026-04-08 04:59 | Analyze and push 393 runs | Brian Graham | 428 | +51903 | -6438
2026-04-07 22:11 | Checkpoint: 10 runs (396 total) | Brian Graham | 82 | +18103 | -1314
2026-04-07 21:30 | Add 21 new runs (394 total) | Brian Graham | 52 | +14651 | -810
2026-04-07 21:12 | Checkpoint: 20 runs (393 total) | Brian Graham | 537 | +102099 | -1940
2026-04-07 20:03 | Add 33 new runs (373 total) | Brian Graham | 2 | +34 | -0
2026-04-07 20:03 | Checkpoint: 30 runs (373 total) | Brian Graham | 168 | +39907 | -1680
2026-04-07 20:02 | Checkpoint: 20 runs (365 total) | Brian Graham | 184 | +38555 | -1869
2026-04-07 20:00 | Checkpoint: 10 runs (353 total) | Brian Graham | 560 | +43424 | -78060
2026-04-07 19:45 | Checkpoint: 20 runs (376 total) | Brian Graham | 162 | +35934 | -1819
2026-04-07 19:44 | Checkpoint: 10 runs (365 total) | Brian Graham | 192 | +36162 | -1710
2026-04-07 19:42 | Checkpoint: 20 runs (343 total) | Brian Graham | 184 | +41424 | -1774
2026-04-07 19:39 | Add gemma-4-26b model via OpenRouter | Brian Graham | 3 | +8 | -1
2026-04-07 19:39 | Checkpoint: 10 runs (332 total) | Brian Graham | 299 | +74610 | -1749
2026-04-07 18:57 | Fix falling piece detector: faster polling, longer settle time | Brian Graham | 1 | +6 | -6
2026-04-07 18:34 | Analyze and push 316 runs | Brian Graham | 777 | +22642 | -62215
2026-04-07 17:46 | Add 28 new runs (337 total) | Brian Graham | 115 | +25816 | -924
2026-04-07 17:32 | Checkpoint: 20 runs (324 total) | Brian Graham | 182 | +35941 | -9143
2026-04-07 17:21 | Add OpenRouter provider with Qwen 3.6 Plus via litellm proxy | Brian Graham | 4 | +24 | -3
2026-04-07 17:07 | Version model names: haiku-4.5, sonnet-4.6, opus-4.6 | Brian Graham | 60 | +18045 | -24
2026-04-07 16:44 | Add 30 new runs (322 total) | Brian Graham | 178 | +45158 | -782
2026-04-07 16:43 | Checkpoint: 20 runs (310 total) | Brian Graham | 5565 | +693787 | -498747
2026-04-07 16:25 | PCA: 10 components, taller scree bars, remove Variance Explained | Brian Graham | 3 | +3688 | -1188
2026-04-07 16:17 | Spread PCA dots wider (2.5x), shrink spheres | Brian Graham | 1 | +4 | -4
2026-04-07 16:11 | Install three.js deps in dashboard dir (fixes CI build) | Brian Graham | 2 | +687 | -2
2026-04-07 16:07 | Add --runs-per-cell docs, sweep workflow, clean 4 bad runs | Brian Graham | 67 | +4023 | -5
2026-04-07 16:06 | Add n= confidence to Insights page | Brian Graham | 3 | +49 | -11
2026-04-07 16:04 | Add --runs-per-cell flag to override runs_per_cell from grid.yaml | Brian Graham | 1 | +5 | -1
2026-04-07 16:00 | Add n= confidence indicators to Grid page | Brian Graham | 3 | +44 | -12
2026-04-07 15:50 | Add scree plot to PCA page | Brian Graham | 3 | +1195 | -1011
2026-04-07 15:44 | 3D PCA scatter plot with react-three-fiber | Brian Graham | 2 | +310 | -233
2026-04-07 15:42 | Self-host JetBrains Mono fonts, remove Google Fonts CDN | Brian Graham | 4 | +21 | -4
2026-04-07 15:37 | Replace task chart with Top/Bottom 10 configs on grid page | Brian Graham | 3 | +376 | -260
2026-04-07 15:29 | Add model filter to Insights page (tornado, heatmap) | Brian Graham | 28 | +1200 | -227
2026-04-07 15:28 | PCA analysis page, remove violin dots | Brian Graham | 5 | +4008 | -13
2026-04-07 15:26 | Surprises tab, model selector, shared color palette integration | Brian Graham | 8 | +618 | -50
2026-04-07 15:18 | Checkpoint: 40 runs (266 total) | Brian Graham | 391 | +107184 | -1310
2026-04-07 15:14 | Add variability violin chart to Compare page | Brian Graham | 2 | +403 | -0
2026-04-07 15:13 | Shared color palette for 10 models across all charts | Brian Graham | 5 | +91 | -51
2026-04-07 14:27 | Checkpoint: 20 runs (246 total) | Brian Graham | 384 | +104408 | -1341
2026-04-07 13:08 | Analyze and push 222 runs | Brian Graham | 1 | +222 | -222
2026-04-07 13:08 | Re-eval 222 runs (10 glm-4.5-air, 26 glm-4.7, 9 glm-5.1, 74 haiku, 51 opus, 52 sonnet) | Brian Graham | 412 | +11416 | -4401
2026-04-07 11:56 | Increase gameplay bot timeout to 300s (was 180s) | Brian Graham | 2 | +3 | -3
2026-04-07 11:04 | Analyze and push 222 runs | Brian Graham | 1 | +222 | -222
2026-04-07 11:04 | Re-eval 222 runs (10 glm-4.5-air, 26 glm-4.7, 9 glm-5.1, 74 haiku, 51 opus, 52 sonnet) | Brian Graham | 1137 | +16288 | -174627
2026-04-07 10:34 | Stop deleting turns=1 and timeout runs as invalid | Brian Graham | 2 | +32 | -48
2026-04-07 10:23 | Analyze and push 253 runs | Brian Graham | 127 | +19233 | -1114
2026-04-07 10:03 | Checkpoint: 40 runs (250 total) | Brian Graham | 210 | +54373 | -1220
2026-04-07 08:52 | Checkpoint: 30 runs (240 total) | Brian Graham | 49 | +10428 | -1050
2026-04-07 08:39 | Discard runs with 0 turns before eval/commit | Brian Graham | 1 | +13 | -0
2026-04-07 08:34 | Analyze and push 236 runs | Brian Graham | 200 | +35695 | -9482
2026-04-07 07:33 | Checkpoint: 20 runs (233 total) | Brian Graham | 163 | +40023 | -1179
2026-04-07 06:54 | Analyze and push 225 runs | Brian Graham | 95 | +8875 | -1248
2026-04-07 06:35 | Checkpoint: 10 runs (223 total) | Brian Graham | 151 | +35332 | -1170
2026-04-07 05:46 | Exclude dist/build from sonarqube scans | Brian Graham | 1 | +1 | -1
2026-04-07 05:39 | Analyze and push 216 runs | Brian Graham | 100 | +16009 | -1019
2026-04-07 05:37 | Rewrite bot start detection: falling piece detector, conditional phases | Brian Graham | 2 | +584 | -250
2026-04-07 05:27 | Add spec for gameplay bot rewrite (falling piece detection) | Brian Graham | 1 | +66 | -0
2026-04-07 05:20 | Add rich UI widget for api_retry rate limit events in transcript | Brian Graham | 1 | +27 | -0
2026-04-07 05:12 | Analyze and push 211 runs | Brian Graham | 226 | +1328 | -71109
2026-04-07 05:07 | Add limitation: UI bugs masking working gameplay logic | Brian Graham | 1 | +1 | -0
2026-04-07 05:01 | Document bot false positives and unbuildable game limitation | Brian Graham | 1 | +3 | -0
2026-04-07 04:48 | Analyze and push 222 runs | Brian Graham | 177 | +0 | -6516
2026-04-07 04:42 | Analyze and push 278 runs | Brian Graham | 31 | +1036 | -7374
2026-04-07 04:41 | Add analyze-and-push.py for quick analysis without re-eval | Brian Graham | 1 | +121 | -0
2026-04-07 04:40 | Remove 192 rate-limited zai runs, update analysis (103 zai + 177 anthropic) | Brian Graham | 2210 | +1810 | -613122
2026-04-07 02:31 | Add 99 new runs (472 total) | Brian Graham | 71 | +19085 | -747
2026-04-07 01:50 | Checkpoint: 95 runs (470 total) | Brian Graham | 82 | +21377 | -1137
2026-04-07 01:35 | Checkpoint: 90 runs (465 total) | Brian Graham | 120 | +17523 | -1416
2026-04-07 01:28 | Checkpoint: 85 runs (442 total) | Brian Graham | 70 | +16767 | -1354
2026-04-07 01:18 | Checkpoint: 80 runs (437 total) | Brian Graham | 69 | +16723 | -1329
2026-04-07 01:11 | Checkpoint: 75 runs (432 total) | Brian Graham | 69 | +16727 | -1322
2026-04-07 01:01 | Checkpoint: 70 runs (427 total) | Brian Graham | 69 | +16601 | -1206
2026-04-07 00:54 | Checkpoint: 65 runs (422 total) | Brian Graham | 69 | +16660 | -1253
2026-04-07 00:44 | Checkpoint: 60 runs (417 total) | Brian Graham | 69 | +16626 | -1233
2026-04-07 00:37 | Checkpoint: 55 runs (412 total) | Brian Graham | 69 | +16715 | -1310
2026-04-07 00:27 | Checkpoint: 50 runs (407 total) | Brian Graham | 69 | +16731 | -1336
2026-04-07 00:20 | Checkpoint: 45 runs (402 total) | Brian Graham | 69 | +16852 | -1315
2026-04-07 00:10 | Checkpoint: 40 runs (397 total) | Brian Graham | 69 | +17020 | -1363
2026-04-07 00:02 | Checkpoint: 35 runs (392 total) | Brian Graham | 69 | +16715 | -1313
2026-04-06 23:53 | Checkpoint: 30 runs (387 total) | Brian Graham | 69 | +16735 | -1337
2026-04-06 23:46 | Checkpoint: 25 runs (382 total) | Brian Graham | 69 | +16777 | -1374
2026-04-06 23:36 | Checkpoint: 20 runs (377 total) | Brian Graham | 69 | +16696 | -1300
2026-04-06 23:29 | Checkpoint: 15 runs (372 total) | Brian Graham | 69 | +16388 | -1182
2026-04-06 23:19 | Checkpoint: 10 runs (367 total) | Brian Graham | 59 | +13543 | -1256
2026-04-06 23:12 | Checkpoint: 5 runs (363 total) | Brian Graham | 61 | +10762 | -1383
2026-04-06 23:02 | Add 99 new runs (358 total) | Brian Graham | 105 | +28512 | -791
2026-04-06 22:48 | Checkpoint: 90 runs (351 total) | Brian Graham | 178 | +33023 | -1153
2026-04-06 22:31 | Checkpoint: 80 runs (323 total) | Brian Graham | 124 | +32192 | -1042
2026-04-06 22:14 | Checkpoint: 70 runs (313 total) | Brian Graham | 124 | +31979 | -1029
2026-04-06 21:57 | Checkpoint: 60 runs (303 total) | Brian Graham | 279 | +71033 | -1046
2026-04-06 20:44 | Checkpoint: 50 runs (293 total) | Brian Graham | 138 | +36145 | -1177
2026-04-06 20:23 | Checkpoint: 40 runs (283 total) | Brian Graham | 124 | +32245 | -1182
2026-04-06 20:06 | Checkpoint: 30 runs (273 total) | Brian Graham | 124 | +31988 | -1189
2026-04-06 19:49 | Checkpoint: 20 runs (263 total) | Brian Graham | 124 | +31852 | -1251
2026-04-06 19:32 | Checkpoint: 10 runs (253 total) | Brian Graham | 167 | +25364 | -5600
2026-04-06 19:11 | Document gameplay bot known limitations (wall kicks, lock delay, etc.) | Brian Graham | 1 | +1 | -0
2026-04-06 19:10 | Force all x-axis labels to show on box plot (interval={0}) | Brian Graham | 1 | +3 | -2
2026-04-06 19:08 | Update CLAUDE.md for 23-axis grid, Z.AI provider, new commands | Brian Graham | 1 | +43 | -24
2026-04-06 18:47 | Checkpoint: 40 runs (244 total) | Brian Graham | 412 | +111457 | -2861
2026-04-06 18:38 | Add new axes to run config box, link run to cell | Brian Graham | 1 | +11 | -1
2026-04-06 18:36 | Add 3 new runs (225 total) | Brian Graham | 113 | +15814 | -2215
2026-04-06 18:35 | Checkpoint: 20 runs (224 total) | Brian Graham | 439 | +97726 | -1336
2026-04-06 18:31 | Put (n=) count on separate line below model name in box plot | Brian Graham | 1 | +12 | -4
2026-04-06 18:26 | Remove scatter dots from Score Distribution box plot | Brian Graham | 1 | +2 | -17
2026-04-06 18:20 | Fix main_effects for provider filtering | Brian Graham | 2 | +4 | -0
2026-04-06 18:18 | Box plots for grid charts, model toggles for scatter hulls | Brian Graham | 2 | +353 | -115
2026-04-06 18:11 | Fix auto-commit using old artifacts path | Brian Graham | 1 | +2 | -2
2026-04-06 18:10 | Add 6 Z.AI GLM runs (glm-4.5-air, glm-4.7, glm-5.1) | Brian Graham | 137 | +32337 | -758
2026-04-06 18:03 | Add --commit-every N flag for periodic analyze+push | Brian Graham | 1 | +34 | -0
2026-04-06 17:36 | Add -n/--max-runs flag to limit total runs | Brian Graham | 1 | +9 | -1
2026-04-06 17:26 | Use real GLM model names directly, drop model_map | Brian Graham | 4 | +25 | -52
2026-04-06 17:13 | Add zai-smoke profile, fix provider in profiles | Brian Graham | 1 | +32 | -4
2026-04-06 17:09 | Accept actual model names with --model for non-anthropic providers | Brian Graham | 1 | +9 | -1
2026-04-06 17:08 | Use actual_model in cell_ids and dashboard display | Brian Graham | 5 | +42 | -10
2026-04-06 17:07 | Re-eval 177 runs (74 haiku, 51 opus, 52 sonnet) | Brian Graham | 433 | +28858 | -4765
2026-04-06 17:01 | Require --provider flag for run.py | Brian Graham | 1 | +19 | -0
2026-04-06 16:54 | Add provider axis for Z.AI (GLM) model support | Brian Graham | 14 | +68 | -7
2026-04-06 15:11 | Add short_id, short_cell_id, claude_version to analysis skip keys | Brian Graham | 2 | +5 | -0
2026-04-06 14:18 | Add short URL IDs, test fixtures, and context noise files | Brian Graham | 194 | +15979 | -704
2026-04-06 13:57 | Grid expansion: 7 new axes, migrate all run IDs to abbreviated format | Brian Graham | 4611 | +456579 | -454954
2026-04-06 13:54 | Fix serve process leak in gameplay bot eval | Brian Graham | 1 | +41 | -23
2026-04-06 13:13 | Re-eval 173 runs (71 haiku, 51 opus, 51 sonnet) | Brian Graham | 561 | +83595 | -5426
2026-04-06 12:02 | Fix cell_id length, add SonarQube details, rebuild gameplay bot | Brian Graham | 6 | +158 | -145
2026-04-06 10:42 | Re-eval 159 runs (57 haiku, 51 opus, 51 sonnet) | Brian Graham | 307 | +27096 | -6732
2026-04-06 09:44 | Re-eval 159 runs (57 haiku, 51 opus, 51 sonnet) | Brian Graham | 199 | +7489 | -17294
2026-04-06 08:44 | Add sonarqube metric to analysis pipeline, fix metric labels | Brian Graham | 3 | +18 | -4
2026-04-06 08:42 | Outcome = gameplay + SonarQube, not gameplay + lint/typecheck | Brian Graham | 2 | +11 | -13
2026-04-06 08:36 | Update CLAUDE.md with complete project state | Brian Graham | 1 | +53 | -34
2026-04-06 08:32 | Flexible axes on scatter plots and efficiency frontier | Brian Graham | 2 | +211 | -45
2026-04-06 08:30 | Add methodology page explaining scoring and experiment design | Brian Graham | 1 | +477 | -0
2026-04-06 08:30 | Add clean-and-reeval command | Brian Graham | 1 | +216 | -0
2026-04-06 08:29 | Restructure scoring: outcome vs output, flexible scatter, methodology nav | Brian Graham | 7 | +166 | -60
2026-04-06 08:07 | Fix artifacts: dist/ was globally gitignored, breaking compiled games | Brian Graham | 270 | +56036 | -1
2026-04-06 07:37 | Prevent off-grid reading and false positive piece detection | Brian Graham | 2 | +42 | -2
2026-04-06 07:26 | Wire SonarQube into eval pipeline | Brian Graham | 1 | +16 | -0
2026-04-06 07:25 | Add SonarQube integration for code quality analysis | Brian Graham | 1 | +185 | -0
2026-04-06 06:30 | Rewrite gameplay bot with continuous scanning and no false positives | Brian Graham | 6 | +1366 | -796
2026-04-06 04:36 | Scatter plot: 4 density levels instead of 2 | Brian Graham | 1 | +43 | -34
2026-04-06 04:31 | Fix empty scatter plots: add hidden Scatter to seed axis scales | Brian Graham | 1 | +9 | -0
2026-04-06 04:31 | Fix quality scoring, add budget/timeout indicators | Brian Graham | 3 | +31 | -3
2026-04-06 04:27 | Sort model bar chart: haiku, sonnet, opus | Brian Graham | 1 | +6 | -1
2026-04-06 04:23 | Move artifacts out of Astro public/, fix 13GB node_modules bloat | Brian Graham | 1886 | +227458 | -275543
2026-04-06 04:10 | Add directional indicators to correlation matrix | Brian Graham | 1 | +10 | -10
2026-04-06 03:55 | Re-eval all 159 runs with fixed scoring and improved bot calibration | Brian Graham | 269 | +7357 | -11022
2026-04-05 22:20 | Add 2 new runs (67 total) | Brian Graham | 126 | +2301 | -4819
2026-04-05 22:04 | Add 25 new runs (159 total) | Brian Graham | 439 | +102980 | -3713
2026-04-05 21:40 | Fix score calculation: remove double-counting, normalize weights | Brian Graham | 2 | +9 | -11
2026-04-05 21:34 | Improve gameplay bot calibration with fallbacks and DOM grid detection | Brian Graham | 3 | +385 | -50
2026-04-05 21:31 | Add 6 new runs (73 total) | Brian Graham | 210 | +1193 | -65163
2026-04-05 21:28 | Show detailed score breakdowns on run page | Brian Graham | 1 | +127 | -0
2026-04-05 19:58 | Add 35 new runs (159 total) | Brian Graham | 362 | +110987 | -917
2026-04-05 19:42 | Add 4 new runs (124 total) | Brian Graham | 56 | +17212 | -668
2026-04-05 19:22 | Convert all charts to cell-based: every visualization now shows cells not runs | Brian Graham | 7 | +263 | -169
2026-04-05 19:03 | Clean 9 incomplete runs (no HTML output), re-run analysis | Brian Graham | 63 | +1014 | -12871
2026-04-05 06:55 | Add 49 new runs (113 total) | Brian Graham | 225 | +59081 | -11558
2026-04-05 06:51 | Fix duplicate coefficientOfVariation declaration | Brian Graham | 1 | +0 | -2
2026-04-05 06:48 | Fix model order: haiku, sonnet, opus | Brian Graham | 1 | +1 | -1
2026-04-05 06:47 | Grid: per-task summary with cells/runs/score/cost. Cell: variance stats. Box plots: model order fix. | Brian Graham | 3 | +122 | -32
2026-04-05 06:36 | Add 32 new runs (113 total) | Brian Graham | 409 | +135579 | -979
2026-04-05 06:10 | Add cell detail page with run comparison and artifact gallery | Brian Graham | 3 | +612 | -1
2026-04-05 06:03 | Add variability analysis to insights page | Brian Graham | 2 | +724 | -1
2026-04-05 05:50 | Cell-based analytics across all dashboard views | Brian Graham | 4 | +443 | -105
2026-04-05 05:39 | Grid table: grouped view with score/cost ranges per config cell | Brian Graham | 1 | +165 | -43
2026-04-05 05:32 | Surprise cards now clickable with run details and outlier detection | Brian Graham | 1 | +198 | -63
2026-04-05 04:59 | Clean 31 bad runs, fix analysis metrics, re-run analysis | Brian Graham | 261 | +1772 | -81184
2026-04-04 22:21 | Add 49 new runs (113 total) | Brian Graham | 459 | +148224 | -2769
2026-04-04 21:19 | Add 6 new runs (73 total) | Brian Graham | 84 | +26852 | -523
2026-04-04 20:57 | Add 2 new runs (67 total) | Brian Graham | 210 | +12371 | -2988
2026-04-04 20:12 | Fix inflated scores for empty/broken games | Brian Graham | 3 | +66 | -45
2026-04-04 20:05 | Raise budget to $2/$10, delete 25 budget-killed sonnet/opus runs | Brian Graham | 278 | +8 | -89204
2026-04-04 09:42 | Progress. | Brian Graham | 154 | +36516 | -56
2026-04-04 09:07 | Add 5 new runs (72 total) | Brian Graham | 67 | +11296 | -2299
2026-04-04 08:47 | Document pipeline flags and workflow in README | Brian Graham | 1 | +42 | -13
2026-04-04 08:46 | Fix pipeline: reeval only when explicitly requested, auto-analyze on new runs | Brian Graham | 1 | +4 | -4
2026-04-04 08:46 | Add --reeval, --analyze, --full-pipeline flags to harness | Brian Graham | 1 | +49 | -1
2026-04-04 08:45 | Auto-commit and push results after sweep completes | Brian Graham | 1 | +32 | -0
2026-04-04 08:39 | Add new haiku and sonnet runs (72 total, 0 bad) | Brian Graham | 277 | +149635 | -3085
2026-04-04 08:25 | Fix grid table column order: Pass between Score and Cost | Brian Graham | 1 | +16 | -1
2026-04-04 08:25 | Auto-refresh OAuth token during sweeps | Brian Graham | 2 | +35 | -0
2026-04-04 08:23 | Add sortable grid columns, show context file on run page, update TODO | Brian Graham | 4 | +142 | -55
2026-04-04 08:04 | Fix BumpChart empty state, add HeatmapMatrix title | Brian Graham | 2 | +50 | -5
2026-04-04 08:02 | Handle language=unspecified in workspace setup and eval | Brian Graham | 2 | +5 | -2
2026-04-04 07:58 | Restyle bar charts to match SMUI | Brian Graham | 1 | +85 | -25
2026-04-04 07:57 | Add missing tool axis labels on compare page | Brian Graham | 1 | +5 | -0
2026-04-04 07:53 | Fix process is not defined error, split types for client safety | Brian Graham | 16 | +135 | -123
2026-04-04 07:32 | Align theme with SMUI, add light/dark mode toggle | Brian Graham | 2 | +392 | -76
2026-04-04 07:27 | Add tool axes to RunMeta type and AXIS_NAMES | Brian Graham | 1 | +15 | -0
2026-04-04 07:26 | Add Explore page with 6 interactive visualizations | Brian Graham | 9 | +2309 | -0
2026-04-04 07:11 | Add n= to chart labels, per-dimension metric selection | Brian Graham | 3 | +14 | -4
2026-04-04 07:04 | Add scatter plots and surprise detector to insights page | Brian Graham | 3 | +304 | -4
2026-04-04 06:59 | Re-evaluate all 67 runs with new eval pipeline | Brian Graham | 134 | +21588 | -156
2026-04-04 06:33 | Adopt Ship the Loop design system | Brian Graham | 2 | +174 | -41
2026-04-04 06:26 | Add re-eval command, show all eval dimensions in run detail UI | Brian Graham | 2 | +227 | -16
2026-04-04 06:21 | Comprehensive code quality analysis (Python rewrite) | Brian Graham | 2 | +379 | -5
2026-04-04 06:17 | Add HTML validation, duplication detection, accessibility, page load time | Brian Graham | 3 | +121 | -1
2026-04-04 06:16 | Fix score detection and rotation piece identification | Brian Graham | 1 | +53 | -21
2026-04-04 06:15 | Fix bad run detection, wire gameplay bot, fix compare page, improve rotation test | Brian Graham | 4 | +221 | -37
2026-04-04 06:06 | Add per-piece-type rotation test | Brian Graham | 1 | +171 | -0
2026-04-04 05:52 | Add gameplay bot, language=unspecified option, bump Playwright timeout | Brian Graham | 12 | +2244 | -1
2026-04-04 05:46 | Add code analysis and transcript analysis to eval pipeline | Brian Graham | 6 | +588 | -15
2026-04-04 05:29 | Increase timeout to 1200s (20 min) for larger models | Brian Graham | 1 | +1 | -1
2026-04-04 05:06 | Clean 26 more bad runs (timeouts + null cost), 67 good remain | Brian Graham | 183 | +0 | -28001
2026-04-04 04:49 | 93 good runs: 54 haiku, 36 sonnet, 3 opus | Brian Graham | 679 | +123602 | -8
2026-04-03 20:32 | Clean 51 failed runs, 38 good runs remain (32 haiku, 5 sonnet, 3 opus) | Brian Graham | 403 | +84607 | -0
2026-04-03 19:42 | Delete 47 failed runs (expired OAuth token), add token auto-refresh | Brian Graham | 335 | +79 | -50890
2026-04-03 19:39 | Auto-extract artifacts, add --model flag for sweep baseline | Brian Graham | 212 | +78722 | -4
2026-04-03 19:32 | Remove pre-tool-axes runs, add 60 main_effects sweep results | Brian Graham | 327 | +6018 | -6010
2026-04-03 18:32 | Add all-on and all-off anchor profiles | Brian Graham | 1 | +42 | -0
2026-04-03 18:30 | Add parallel execution to harness (-j flag) | Brian Graham | 1 | +155 | -87
2026-04-03 18:27 | Record full run config in transcript | Brian Graham | 2 | +32 | -1
2026-04-03 18:25 | Inject original prompts into existing transcript files | Brian Graham | 2 | +16 | -0
2026-04-03 18:25 | Fix inner iframe height in artifact preview | Brian Graham | 1 | +1 | -1
2026-04-03 18:19 | Remove bookmarks-api and data-pipeline tasks | Brian Graham | 45 | +0 | -2668
2026-04-03 18:17 | Link to source files on Forgejo from run detail page | Brian Graham | 1 | +16 | -0
2026-04-03 18:16 | Label exit code metric to clarify it's a process exit code | Brian Graham | 1 | +3 | -0
2026-04-03 18:15 | Add standalone link for artifact previews | Brian Graham | 1 | +5 | -2
2026-04-03 18:14 | Fix UTF-8 encoding in artifact iframe | Brian Graham | 1 | +3 | -3
2026-04-03 18:13 | Include prompt and context in transcript | Brian Graham | 2 | +51 | -1
2026-04-03 18:09 | Redesign run detail page, rich transcript viewer, tetris iframe preview | Brian Graham | 21 | +6382 | -337
2026-04-03 17:38 | Add claude_version to existing run metadata retroactively | Brian Graham | 6 | +15 | -12
2026-04-03 17:36 | UI improvements: readable run IDs, run detail layout, config pills | Brian Graham | 5 | +343 | -84
2026-04-03 17:25 | Fix results path resolution for Astro build | Brian Graham | 1 | +4 | -4
2026-04-03 17:19 | Add git commit to footer, document metrics and Pareto frontier | Brian Graham | 2 | +31 | -0
2026-04-03 17:15 | Add smoke run results to repo for dashboard | Brian Graham | 31 | +1126 | -2
2026-04-03 17:12 | Fix harness bugs, add DOE experiment design, insights dashboard | Brian Graham | 25 | +1936 | -90
2026-04-03 15:09 | Add benchmark harness, tasks, eval suites, and dashboard | Brian Graham | 62 | +11237 | -60
2026-04-03 12:49 | Bootstrap loop benchmarking project | Brian Graham | 2 | +91 | -0
