Add limitation: UI bugs masking working gameplay logic - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit e7d3751a6c70ea21ce4ababeea6480153f01ddc8
parent 8f6bcecff9c0de70911f505506bc65fc3c588dc7
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Tue,  7 Apr 2026 07:07:49 +0200

Add limitation: UI bugs masking working gameplay logic

Games with CSS issues, broken start buttons, or overlays can score 0
even when the underlying game logic works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md  | 1 +

1 file changed, 1 insertion(+), 0 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -122,6 +122,7 @@ Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages
 - [ ] Gameplay bot start detection checks canvas click before start buttons, causing false "started" on start screens. Reorder to check buttons first.
 - [ ] Gameplay bot false positives: piece_locks and game_over can pass on static start screens when grid reader misidentifies UI chrome as game state.
 - [ ] Some agents build working games that require a build step (Vite/webpack) but don't run the build, so the artifact is source code not a playable game. The eval scores 0 but the game "works" if you build it.
+- [ ] Games with minor UI bugs (CSS z-index, overflow, missing start button handler) can mask fully working gameplay logic. The bot scores 0 because it can't access the game, even though the code is correct. A "start game" button that doesn't work prevents testing all other mechanics.
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
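
The start-detection and UI-masking items above can be sketched as a small decision function. This is a minimal illustration, not the repository's bot: `Element`, `START_WORDS`, `pick_start_target`, and `classify` are hypothetical names, and the element snapshot stands in for whatever a browser driver would report. The sketch checks start buttons before the canvas, and it reports an unreachable start button as a UI problem rather than folding it into a gameplay score of 0.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Element:
    """Hypothetical snapshot of one page element as a bot might see it."""
    kind: str               # "button" or "canvas"
    label: str = ""
    visible: bool = True
    covered: bool = False   # True if an overlay or z-index bug blocks clicks


# Assumed start-button labels; the real bot's matching rules are not shown in the source.
START_WORDS = ("start", "play", "begin")


def _clickable(e: Element) -> bool:
    return e.visible and not e.covered


def _start_buttons(elements: Sequence[Element]) -> list:
    return [
        e for e in elements
        if e.kind == "button" and any(w in e.label.lower() for w in START_WORDS)
    ]


def pick_start_target(elements: Sequence[Element]) -> Optional[Element]:
    """Prefer a clickable start button; use the canvas only if no start button exists."""
    buttons = _start_buttons(elements)
    for b in buttons:
        if _clickable(b):
            return b
    if buttons:
        # A start button exists but cannot be reached: do not click the canvas,
        # because that is what produces a false "started" on a start screen.
        return None
    for e in elements:
        if e.kind == "canvas" and _clickable(e):
            return e
    return None


def classify(elements: Sequence[Element]) -> str:
    """Label the outcome so a UI blocker is distinguishable from broken gameplay."""
    target = pick_start_target(elements)
    if target is not None:
        return "click_button" if target.kind == "button" else "click_canvas"
    return "ui_blocked" if _start_buttons(elements) else "no_entry_point"
```

A result of `ui_blocked` could be recorded separately from the gameplay checks, so a game whose only defect is an inaccessible start button is not reported the same way as one whose logic fails.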

	git clone https://git.shiptheloop.com/loop-benchmarking.git