loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit e7d3751a6c70ea21ce4ababeea6480153f01ddc8
parent 8f6bcecff9c0de70911f505506bc65fc3c588dc7
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Tue,  7 Apr 2026 07:07:49 +0200

Add limitation: UI bugs masking working gameplay logic

Games with CSS issues, broken start buttons, or overlays can score 0
even when the underlying game logic works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md | 1 +
1 file changed, 1 insertion(+), 0 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -122,6 +122,7 @@ Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages
 - [ ] Gameplay bot start detection checks canvas click before start buttons, causing false "started" on start screens. Reorder to check buttons first.
 - [ ] Gameplay bot false positives: piece_locks and game_over can pass on static start screens when grid reader misidentifies UI chrome as game state.
 - [ ] Some agents build working games that require a build step (Vite/webpack) but don't run the build, so the artifact is source code not a playable game. The eval scores 0 but the game "works" if you build it.
+- [ ] Games with minor UI bugs (CSS z-index, overflow, missing start button handler) can mask fully working gameplay logic. The bot scores 0 because it can't access the game, even though the code is correct. A "start game" button that doesn't work prevents testing all other mechanics.
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
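
One way to make the limitation added in this diff visible in results is to report "the bot never got into the game" separately from "the game logic failed". The sketch below is illustrative only and is not taken from the repository: `StartResult`, `try_start`, `classify`, and the outcome labels are hypothetical names, and each start strategy is modelled as a plain callable rather than a real Playwright call. It also tries the start button before the canvas click, following the ordering suggested in the checklist.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class StartResult:
    """Outcome of trying to get past the start screen (hypothetical type)."""
    started: bool
    method: Optional[str]  # name of the strategy that worked, None if all failed


def try_start(strategies: Sequence[Tuple[str, Callable[[], bool]]]) -> StartResult:
    """Run start strategies in the given order and stop at the first success.

    Each strategy is a (name, attempt) pair; attempt() returns True only if
    the game actually left the start screen.
    """
    for name, attempt in strategies:
        if attempt():
            return StartResult(started=True, method=name)
    return StartResult(started=False, method=None)


def classify(start: StartResult, mechanics_passed: int) -> str:
    """Separate 'game unreachable' from 'game reachable but mechanics failed'."""
    if not start.started:
        # The gameplay logic was never exercised, so it is untested
        # rather than known to be broken.
        return "ui_blocked"
    return "pass" if mechanics_passed > 0 else "logic_failed"


# Assumed ordering: start button first, canvas click only as a fallback.
# The lambdas stand in for real browser interactions.
result = try_start([
    ("start_button", lambda: False),  # e.g. the button handler is missing
    ("canvas_click", lambda: False),  # e.g. an overlay swallows the click
])
outcome = classify(result, mechanics_passed=0)
```

With both strategies failing, `outcome` is `"ui_blocked"` rather than a bare 0, which marks the run as a candidate for manual review instead of counting it as broken gameplay.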
