loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 8f6bcecff9c0de70911f505506bc65fc3c588dc7
parent b3077d446a1b2e14aa526b86af97732322f74c40
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Tue,  7 Apr 2026 07:01:05 +0200

Document bot false positives and unbuildable game limitation

- Start detection order bug: clicks canvas before checking start buttons
- piece_locks and game_over can false-positive on static start screens
- Some agents build working Vite/webpack projects but don't compile them

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M CLAUDE.md | 3 +++
1 file changed, 3 insertions(+), 0 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -119,6 +119,9 @@ Short URL IDs: 8-char SHA256 hash for `/r/` and `/c/` routes with redirect pages
 ### Eval
 - [ ] Quality scoring too coarse (binary pass/fail on 3 checks = 0/33/67/100%)
 - [ ] Gameplay bot does NOT test: wall kicks, lock delay (sliding at collision line), T-spins, hold piece, ghost piece, next piece preview, level/speed progression, DAS. Known limitation for methodology page.
+- [ ] Gameplay bot start detection checks canvas click before start buttons, causing false "started" on start screens. Reorder to check buttons first.
+- [ ] Gameplay bot false positives: piece_locks and game_over can pass on static start screens when grid reader misidentifies UI chrome as game state.
+- [ ] Some agents build working games that require a build step (Vite/webpack) but don't run the build, so the artifact is source code not a playable game. The eval scores 0 but the game "works" if you build it.
 - [ ] Memory leak detection via Playwright heap snapshots
 - [ ] Frame rate measurement during gameplay
 - [ ] Dead code detection (knip)
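
The first two added TODO items could be addressed along the lines of the sketch below. It is not the bot's actual code: `PageProbe`, `choose_start_action`, `grid_changed`, and `accept_signal` are hypothetical names, and the grid is assumed to be a list of rows of 0/1 cells. The button-first ordering comes from the TODO itself; requiring the grid to change between sampled frames before trusting `piece_locks` or `game_over` is an assumed mitigation that the commit does not prescribe.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Hypothetical grid representation: rows of cells, 0 = empty, 1 = filled.
Grid = List[List[int]]


@dataclass
class PageProbe:
    """Hypothetical snapshot of what the bot found on the page."""
    start_buttons: List[str]  # selectors of visible start-like buttons
    has_canvas: bool


def choose_start_action(probe: PageProbe) -> Optional[str]:
    """Prefer an explicit start button; fall back to a canvas click."""
    if probe.start_buttons:
        return f"click:{probe.start_buttons[0]}"
    if probe.has_canvas:
        return "click:canvas"
    return None


def grid_changed(frames: Sequence[Grid]) -> bool:
    """True only if at least two consecutive sampled grids differ."""
    return any(a != b for a, b in zip(frames, frames[1:]))


def accept_signal(signal_fired: bool, frames: Sequence[Grid]) -> bool:
    """Assumed guard: ignore piece_locks / game_over on a static screen."""
    return signal_fired and grid_changed(frames)


if __name__ == "__main__":
    # A start screen with both a button and a canvas: the button wins.
    print(choose_start_action(PageProbe(["#start"], True)))
    # UI chrome misread as blocks, but nothing moves between frames.
    static = [[0, 1], [0, 1]]
    print(accept_signal(True, [static, static, static]))
```

A guard like this would still miss a start screen with animated chrome, so it narrows the false-positive window rather than closing it.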
