loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 4ce8d09103c723f23b4f1d266fe3aef143995996
parent b19aa539899396ddc9373afcaa71621210b6e113
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Thu,  9 Apr 2026 08:07:32 +0200

Update calibration: 93e8feea starts into game over, e2e04e75 no scoring

93e8feea: Game loads into immediate game over state with overlay that
never dismisses. Bot false positive (says game_starts: PASS).
e2e04e75: Spanish game, all mechanics work but score never changes
(real game bug). Bot false negative (18% but most mechanics work).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M tasks/tetris/eval/gameplay-bot/calibration/93e8feea.json | 2 +-
M tasks/tetris/eval/gameplay-bot/calibration/e2e04e75.json | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tasks/tetris/eval/gameplay-bot/calibration/93e8feea.json b/tasks/tetris/eval/gameplay-bot/calibration/93e8feea.json
@@ -2,7 +2,7 @@
   "run_id": "tetris_arch=none_ctx=provided_noise=clean_dsgn=none_eff=high_echk=none_hlang=es_lang=ts_lint=off_budget=low_model=haiku45_pw=avail_prompt=detailed_rndr=none_strat=usub_tst=none_tedit=on_tglob=on_tgrep=off_tread=off_twrite=off_web=off_run3",
   "short_id": "93e8feea",
   "label": "Spanish overlay bug",
-  "notes": "Spanish game with full-screen overlay. Has a start button but clicking never dismisses the overlay. Game is visible behind the transparency and the next block changes on click, but game is unplayable.",
+  "notes": "Game starts directly into 'game over' state with a new game button. Has a transparent full-screen overlay. Clicking the button changes the next block visible behind the overlay but never dismisses it. Unplayable due to this bug.",
   "human_tested_at": "2026-04-09",
   "human_tests": {
     "game_loads": true,
diff --git a/tasks/tetris/eval/gameplay-bot/calibration/e2e04e75.json b/tasks/tetris/eval/gameplay-bot/calibration/e2e04e75.json
@@ -2,7 +2,7 @@
   "run_id": "tetris_arch=none_ctx=none_noise=clean_dsgn=none_eff=high_echk=none_hlang=es_lang=ts_lint=on_budget=low_model=haiku45_pw=avail_prompt=simple_rndr=none_strat=usub_tst=none_tedit=on_tglob=on_tgrep=on_tread=on_twrite=on_web=on_run1",
   "short_id": "e2e04e75",
   "label": "Spanish basic play",
-  "notes": "Spanish game. Basic play works fine.",
+  "notes": "Spanish game. Basic play works fine. Score does not change during play.",
   "human_tested_at": "2026-04-09",
   "human_tests": {
     "game_loads": true,
@@ -12,14 +12,14 @@
     "move_right": true,
     "move_down": true,
     "rotate": true,
-    "hard_drop": null,
+    "hard_drop": true,
     "all_pieces_rotate": null,
     "piece_locks": true,
     "new_piece_spawns": true,
     "multiple_pieces": true,
     "line_clear": null,
-    "score_changes": null,
-    "game_over": null,
+    "score_changes": false,
+    "game_over": true,
     "playable_30s": true,
     "multi_line_clear": null,
     "score_scaling": null,
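
The calibration files in this commit record a tri-state human verdict per check (true, false, or null for "not tested"), and the commit message speaks of bot false positives and false negatives against those verdicts. A minimal sketch of how such a comparison could be done follows. The function name `classify`, the report keys, and the `bot_results` values are hypothetical and not taken from the repository; only the `human_tests` values mirror the e2e04e75 entry after this change.

```python
from typing import Optional


def classify(
    human: dict[str, Optional[bool]], bot: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare bot verdicts to human calibration verdicts, check by check."""
    report: dict[str, list[str]] = {
        "false_positive": [],  # bot says PASS, human says it fails
        "false_negative": [],  # bot says FAIL, human says it works
        "agree": [],
        "unverified": [],      # human verdict is null, or bot did not report
    }
    for check, human_verdict in human.items():
        if human_verdict is None or check not in bot:
            report["unverified"].append(check)
        elif bot[check] and not human_verdict:
            report["false_positive"].append(check)
        elif not bot[check] and human_verdict:
            report["false_negative"].append(check)
        else:
            report["agree"].append(check)
    return report


# Subset of human_tests from e2e04e75.json after this commit.
human_tests = {
    "hard_drop": True,
    "line_clear": None,
    "score_changes": False,
    "game_over": True,
}

# Made-up bot output, for illustration only (not real results).
bot_results = {
    "hard_drop": False,
    "line_clear": True,
    "score_changes": True,
    "game_over": True,
}

report = classify(human_tests, bot_results)
print(report)
```

Keeping null verdicts in a separate "unverified" bucket avoids counting untested checks as either agreement or disagreement.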
