Document bot false positives and unbuildable game limitation - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit 8f6bcecff9c0de70911f505506bc65fc3c588dc7
parent b3077d446a1b2e14aa526b86af97732322f74c40
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Tue,  7 Apr 2026 07:01:05 +0200

Document bot false positives and unbuildable game limitation

- Start detection order bug: clicks canvas before checking start buttons
- piece_locks and game_over can false-positive on static start screens
- Some agents build working Vite/webpack projects but don't compile them

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md  | 3 +++

1 file changed, 3 insertions(+), 0 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -119,6 +119,9 @@ Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages
 ### Eval
 - [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
 - [ ] Gameplay bot does NOT test: wall kicks, lock delay (sliding at collision line), T-spins, hold piece, ghost piece, next piece preview, level/speed progression, DAS. Known limitation for methodology page.
+- [ ] Gameplay bot start detection checks canvas click before start buttons, causing false "started" on start screens. Reorder to check buttons first.
+- [ ] Gameplay bot false positives: piece_locks and game_over can pass on static start screens when grid reader misidentifies UI chrome as game state.
+- [ ] Some agents build working games that require a build step (Vite/webpack) but don't run the build, so the artifact is source code not a playable game. The eval scores 0 but the game "works" if you build it.
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
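
The start-detection reordering described in the first added TODO item can be sketched as below. This is a minimal illustration, not the bot's actual code: `choose_start_action`, the `START_LABEL` pattern, and the returned action strings are all hypothetical names, and the sketch assumes the bot can already list the visible button labels on the page.

```python
import re
from typing import Optional, Sequence

# Hypothetical label pattern; the real bot's selectors are not shown in this commit.
START_LABEL = re.compile(r"\b(start|play|new game|begin)\b", re.IGNORECASE)


def choose_start_action(button_labels: Sequence[str], has_canvas: bool) -> Optional[str]:
    """Pick how to start the game, preferring explicit start buttons over a canvas click."""
    # 1. Check buttons first, so a start screen is never mistaken for a running game.
    for label in button_labels:
        if START_LABEL.search(label):
            return f"click_button:{label}"
    # 2. Only fall back to clicking the canvas when no start button is visible.
    if has_canvas:
        return "click_canvas"
    # 3. Nothing to interact with: report that the game could not be started.
    return None
```

With this ordering, a page that shows both a canvas and a "Start Game" button yields a button click, and the canvas click is used only as a fallback.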

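For the unbuilt-artifact item, one possible pre-check is to detect a project that declares a build script but contains no built entry page. The sketch below is an assumption, not part of this commit: `needs_build` and `OUTPUT_DIRS` are hypothetical names, and the output directories `dist` and `build` are only common defaults that a given project may override.

```python
import json
from pathlib import Path

# Hypothetical output directory names; real projects may configure others.
OUTPUT_DIRS = ("dist", "build")


def needs_build(project_dir: str) -> bool:
    """Return True when the project declares a build script but has no built index.html."""
    root = Path(project_dir)
    manifest = root / "package.json"
    if not manifest.is_file():
        return False  # no npm manifest, so nothing suggests a build step
    scripts = json.loads(manifest.read_text(encoding="utf-8")).get("scripts", {})
    if "build" not in scripts:
        return False  # no declared build script
    # A build script exists; the artifact counts as built only if an entry page is present.
    return not any((root / d / "index.html").is_file() for d in OUTPUT_DIRS)
```

An eval harness could use such a check to label a run as "source only, not built" rather than scoring it as a broken game.
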
	git clone https://git.shiptheloop.com/loop-benchmarking.git