loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit f978492f1169d00686406170f46df2c1f5f783ca
parent dcf0b2b68809e147c61745c349e067cdc700b022
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Wed,  8 Apr 2026 23:40:24 +0200

Add comprehensive gameplay bot spec (24 tests, 8 phases)

Master spec consolidating all design decisions for the gameplay bot rewrite:
- Conditional phase execution (each depends on previous succeeding)
- Falling piece detector (10 screenshots at 100ms, pixel cluster tracking)
- Start detection cascade: auto -> overlay -> buttons -> keyboard -> canvas
- Competitive play phase (60s, bug detection for multi-line clear, score
  scaling, level/speed progression, rotation, soft drop)
- 24 total tests (16 basic + 8 competitive play bug checks)
- Skipped tests don't penalize score: passed / (total - skipped)
- GPU requirement documented for canvas pixel readback

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
A tasks/tetris/eval/gameplay-bot/SPEC.md | 467 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 467 insertions(+), 0 deletions(-)

diff --git a/tasks/tetris/eval/gameplay-bot/SPEC.md b/tasks/tetris/eval/gameplay-bot/SPEC.md
@@ -0,0 +1,467 @@
+# Gameplay Bot Master Spec
+
+## Overview
+
+The gameplay bot is a Playwright-based automated tester that evaluates Tetris games
+built by AI agents. Every game is different: different DOM structures, different
+controls, different start mechanisms, different rendering approaches (canvas, DOM,
+SVG, WebGL). The bot must work with all of them.
+
+The bot does NOT use Claude or any LLM for grading. All evaluation is deterministic
+code. The bot plays the game using a known-good AI algorithm (4-heuristic genetic
+optimization from LeeYiyuan/tetrisai, MIT License) and records what happens.
+
+## Architecture
+
+Two main components:
+1. **Grid reader**: Reads the game state by sampling pixels (canvas) or DOM colors.
+   Produces a 10x20 boolean grid. Works regardless of rendering approach.
+2. **AI player**: Given a grid state and current piece type, computes the optimal
+   placement using aggregate height, lines cleared, holes, and bumpiness heuristics.
+
+The bot requires GPU access for reliable pixel readback from canvas. Without GPU,
+headless Chromium's canvas `getImageData()` returns all zeros. The host must expose
+`/dev/dri/card0` and `/dev/dri/renderD128` to the container.
+
+## Test Structure: Conditional Phases
+
+Tests are organized as conditional phases. Each phase depends on the previous one
+succeeding. If a phase fails, all downstream tests are marked "skipped: [phase] failed"
+instead of producing false positives.
+
+### Pre-test Survey (not scored, just data collection)
+
+Before any tests run, survey the page:
+- Is there a visible element with tetris grid proportions (~2:1 height:width)?
+- Is there a full-screen overlay (high z-index element covering >80% of viewport)?
+- Are there clickable elements (buttons, links, divs with click handlers)?
+- Is there a canvas element? Multiple canvases?
+- Is there a DOM-based grid (table, grid of divs)?
+- What text is visible? ("Press Enter", "Start", "Play", etc.)
+
+Store all survey data. This informs the start mechanism detection but is not a test.
+
+### Phase 1: Page Load (test 1)
+
+**Test: `game_loads`**
+- Navigate to the game URL
+- Wait 3 seconds for scripts to execute
+- Check for console errors
+- Pass: page loaded, no critical errors
+- Fail: page failed to load or has critical JS errors
+
+### Phase 2: Game Start Detection (tests 2-3)
+
+This phase determines if and how the game starts. It uses a cascading strategy:
+
+**Step 2a: Auto-start check**
+- Take 10 screenshots at 100ms intervals (1 second total)
+- Look for a colored cluster of pixels (~4 cells, roughly square-ish bounding box)
+  that moved downward between frames
+- This is the "falling piece detector"
+- If found: game auto-starts, no button needed
+- Store: `start_mechanism: "auto"`
+
+**Step 2b: Button/overlay detection (if 2a failed)**
+- Check for full-screen overlay (element covering >80% viewport with high z-index)
+- If overlay found:
+  - Try pressing Enter
+  - Wait 500ms, run falling piece detector (10 screenshots, 100ms each)
+  - If piece found: `start_mechanism: "enter"`, overlay was a start screen
+  - If not, try pressing Space, same check
+  - If not, try clicking the overlay center
+  - If not, try clicking any visible buttons in the overlay
+- If no overlay:
+  - Find all clickable elements (buttons, elements with onclick, role="button")
+  - Try clicking each one, run falling piece detector after each
+  - If piece found: `start_mechanism: "button"`, remember which element worked
+  - The start button might change state or disappear after clicking -- that's fine
+
+**Step 2c: Keyboard fallback (if 2b failed)**
+- Try pressing: Enter, Space, ArrowDown, Z, P, any key
+- After each, run falling piece detector
+- If piece found: store which key started the game
+
+**Step 2d: Canvas click (if 2c failed)**
+- Click the center of the canvas/grid element
+- Run falling piece detector
+- Some games render their start button on the canvas (no DOM element to find)
+
+**Falling piece detector algorithm:**
+- Take 10 screenshots at 100ms intervals
+- For each consecutive pair, diff the pixels
+- Look for a cluster of changed pixels that:
+  - Is roughly rectangular (bounding box aspect ratio between 1:1 and 4:1)
+  - Is in the upper portion of the game area
+  - Moved downward between frames (centroid Y increased)
+- A "cluster" = contiguous region of non-background-color pixels that appeared
+- Size: roughly 1-4 cells worth of pixels (each cell is typically 15-40px)
+- The piece may have rounded corners, glow effects, shadows -- look at bounding box
+- Must see movement in at least 2 frame pairs to confirm it's falling
+
+**Tests derived from Phase 2:**
+- `game_starts`: pass if falling piece detected by any mechanism
+- `auto_drop`: pass if piece falls on its own without any key input (only valid if
+  game auto-started or we confirmed start mechanism worked)
+
+### Phase 3: Mechanics Tests (tests 4-9)
+
+Only runs if Phase 2 succeeded (game started, piece detected).
+
+Reload the page. Start the game using the mechanism discovered in Phase 2.
+Wait for a piece to appear (falling piece detector).
+
+For each control test:
+1. Read the grid state (grid reader)
+2. Press the key
+3. Wait 60ms
+4. Read the grid state again
+5. Compare: did the relevant change happen?
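The compare step of the control tests can be sketched as follows. This is an illustration only, not code from the commit: `Grid`, `emptyGrid`, `boundingBox`, and `classifyChange` are hypothetical names, and the sketch assumes the falling piece is the only thing on the board (true right after a reload, before any piece has locked). Hard drop is not covered here because it needs the locked-cell check described in Phase 4.

```typescript
// Hypothetical sketch: classify what a key press did by comparing two grid reads.
// Assumes grid[row][col] with row 0 at the top, true = filled, and no locked cells.
type Grid = boolean[][];
type Change = "left" | "right" | "down" | "rotated" | "none";

interface Box { minRow: number; maxRow: number; minCol: number; maxCol: number; }

// Helper for examples: an empty 20-row x 10-column grid.
function emptyGrid(rows = 20, cols = 10): Grid {
  return Array.from({ length: rows }, () => new Array<boolean>(cols).fill(false));
}

// Bounding box of all filled cells, or null if nothing is filled.
function boundingBox(grid: Grid): Box | null {
  let minRow = Infinity, maxRow = -1, minCol = Infinity, maxCol = -1;
  grid.forEach((row, r) => row.forEach((filled, c) => {
    if (!filled) return;
    minRow = Math.min(minRow, r);
    maxRow = Math.max(maxRow, r);
    minCol = Math.min(minCol, c);
    maxCol = Math.max(maxCol, c);
  }));
  return maxRow < 0 ? null : { minRow, maxRow, minCol, maxCol };
}

function classifyChange(before: Grid, after: Grid): Change {
  const a = boundingBox(before);
  const b = boundingBox(after);
  if (a === null || b === null) return "none";
  const wA = a.maxCol - a.minCol, hA = a.maxRow - a.minRow;
  const wB = b.maxCol - b.minCol, hB = b.maxRow - b.minRow;
  // Rotation of a non-O piece swaps the bounding box dimensions.
  if (wA !== hA && wB === hA && hB === wA) return "rotated";
  // Horizontal moves are checked before vertical ones because gravity may
  // also move the piece down a row during the 60ms wait.
  if (b.minCol < a.minCol) return "left";
  if (b.minCol > a.minCol) return "right";
  if (b.minRow > a.minRow) return "down";
  return "none";
}

// Usage: a T piece at columns 4-6, then the same piece one column to the left.
const beforeRead = emptyGrid();
beforeRead[0][4] = beforeRead[0][5] = beforeRead[0][6] = true;
beforeRead[1][5] = true;
const afterRead = emptyGrid();
afterRead[0][3] = afterRead[0][4] = afterRead[0][5] = true;
afterRead[1][4] = true;
console.log(classifyChange(beforeRead, afterRead)); // "left"
```

Comparing bounding boxes rather than exact cells keeps the check tolerant of the grid reader missing a single cell, at the cost of not distinguishing the two rotation directions.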
+
+**Test: `move_left`** -- ArrowLeft, piece column decreased
+**Test: `move_right`** -- ArrowRight, piece column increased
+**Test: `move_down`** -- ArrowDown, piece row increased (soft drop)
+**Test: `rotate`** -- ArrowUp (or Z), piece shape changed (bounding box dimensions swapped for non-O pieces)
+**Test: `hard_drop`** -- Space, piece instantly at bottom (filled cells appear in bottom rows)
+**Test: `all_pieces_rotate`** -- Track piece types seen during play, confirm rotation works for non-O pieces
+
+If the grid reader cannot read the grid (no GPU, bad calibration), fall back to
+screenshot comparison for these tests. Mark as "(screenshot-verified)" not
+"(grid-verified)" in the detail string.
+
+### Phase 4: Piece Lifecycle Tests (tests 10-12)
+
+Only runs if Phase 3 mechanics worked.
+
+Continue from Phase 3 state or reload + start.
+
+**Test: `piece_locks`**
+- Hard drop a piece
+- Wait 300ms
+- Read grid: are there filled cells at the bottom that persist?
+- Must see cells that don't move for 2 consecutive reads 500ms apart
+
+**Test: `new_piece_spawns`**
+- After a piece locks, check top 4 rows of grid
+- A new piece should appear (filled cells in top rows)
+- Track: `piecesSpawned` counter
+
+**Test: `multiple_pieces`**
+- Play 10+ pieces (hard drop repeatedly)
+- Must detect at least 3 distinct piece placements
+- Track piece types seen (I, O, T, S, Z, J, L)
+
+### Phase 5: Gameplay Tests (tests 13-14)
+
+Only runs if Phase 4 piece lifecycle works.
+
+Reload the page. Start game. Play using the AI player for an extended session:
+- **60 pieces max, 45 seconds max**
+- **60ms polling** between grid reads
+- Read score element on every 5th poll cycle (integrated score tracking)
+
+**Test: `line_clear`**
+- During AI play, watch for complete rows (all cells filled)
+- After complete row detected, wait 200-500ms for clear animation
+- Read grid again: did the complete row disappear?
+- If AI play doesn't clear a line, try brute force: drop pieces at each column
+- If brute force fails, check if total filled cells decreased (indirect clear detection)
+- Pass: at least 1 line cleared (grid-verified)
+
+**Test: `score_changes`**
+- Read score element before play begins (record initial value)
+- During play, read score on every 5th poll cycle
+- After play, read final score
+- Pass: score increased from initial value
+- If no score element found, try scanning page text for changing numbers
+- Record: all score values observed, deltas between readings
+
+### Phase 6: Game Over Test (test 15)
+
+Only runs if Phase 5 gameplay works (pieces can be placed and lines can clear,
+or at minimum pieces can be placed).
+
+Reload the page. Start game.
+
+**Test: `game_over`**
+- Stack pieces to trigger game over: hard drop in the same column repeatedly
+  to build a tall tower
+- After each drop, check grid reader: are there filled cells in the top 2 rows?
+- Once top rows are filled, check:
+  1. Does the game stop accepting input? (press keys, check if grid changes)
+  2. Does "Game Over" or similar text appear in DOM?
+  3. Does the page become static? (2 screenshots 1s apart are identical)
+- Pass: game stopped after filling to top (at least 1 of the 3 checks)
+- Do NOT use screenshot comparison alone (false positives on static start screens)
+- Must have evidence that pieces WERE being placed before the game stopped
+
+### Phase 7: Endurance Test (test 16)
+
+Only runs if Phase 5 gameplay works.
+
+Reload the page. Start game.
+
+**Test: `playable_30s`**
+- Play using AI player for 30 seconds
+- Track: pieces placed, lines cleared, console errors, play errors
+- Pass: played for 30+ seconds without crashing, placed 5+ pieces, no critical errors
+
+### Phase 8: Competitive Play (not pass/fail, produces metrics + 8 additional tests)
+
+Only runs if Phase 5 gameplay works.
+
+Reload the page. Start game. Play competitively for 60 seconds using AI player.
+
+**Purpose**: Find bugs that the basic tests miss. A game might start, move pieces,
+and clear single lines, but fail on multi-line clears, score scaling, level
+progression, etc.
+
+**Data recorded:**
+```json
+{
+  "duration_seconds": 45,
+  "pieces_placed": 62,
+  "total_lines_cleared": 18,
+  "single_clears": 12,
+  "double_clears": 2,
+  "triple_clears": 1,
+  "tetris_clears": 0,
+  "max_combo": 3,
+  "score_readings": [0, 100, 200, 500, ...],
+  "score_final": 4200,
+  "score_increases": [100, 100, 300, ...],
+  "level_readings": [1, 1, 1, 2, 2, 3],
+  "level_final": 3,
+  "game_over_reached": true,
+  "game_over_text_found": "Game Over",
+  "restart_available": true,
+  "next_piece_visible": true,
+  "speed_increased": true,
+  "bugs_detected": ["score_does_not_scale_with_simultaneous_clears"]
+}
+```
+
+**8 additional tests (tests 17-24):**
+
+Each has three outcomes: pass (tested, works), fail (tested, broken), skip (no opportunity to test).
+
+**Test 17: `multi_line_clear`**
+- During play, detect when 2+ rows are complete simultaneously
+- Wait for clear animation (200-500ms)
+- Check: did all complete rows disappear?
+- Bug: `multi_line_clear_only_removes_one_row` -- only 1 row cleared when 2+ were complete
+- Skip: if no multi-line clear opportunity occurred during 60s play
+
+**Test 18: `score_scaling`**
+- Track score delta for each clear event
+- Compare: single clear delta vs multi-line clear delta
+- Bug: `score_does_not_scale_with_simultaneous_clears` -- multi-line gives same points as single
+- Standard Tetris scoring: single=100, double=300, triple=500, tetris=800 (x level)
+- Don't enforce exact formula, just check that multi > single
+- Skip: if no multi-line clear occurred
+
+**Test 19: `level_progression`**
+- Track lines cleared and level display throughout the session
+- After 10+ lines cleared, level should have increased from initial value
+- Bug: `level_does_not_increase`
+- Skip: if fewer than 10 lines cleared
+
+**Test 20: `speed_progression`**
+- Measure auto-drop interval at game start (time between automatic downward moves)
+- After level increases, measure again
+- Bug: `speed_does_not_increase` -- interval didn't decrease
+- Skip: if level didn't increase
+
+**Test 21: `next_piece_preview`**
+- Look for a secondary display area showing the next piece
+- Check: small canvas/div near the main grid with a single piece shape
+- Pass: found a next piece display
+- Fail: no next piece preview found
+
+**Test 22: `game_over_display`**
+- When game over is triggered (from Phase 6 or competitive play):
+- Check for "Game Over" or similar text
+- Check for restart button/prompt
+- Pass: both message and restart option present
+- Fail: missing either
+- Skip: game over not reached during competitive play
+
+**Test 23: `counter_clockwise_rotation`**
+- During play, occasionally press Z key instead of Up arrow
+- Compare piece shape before and after
+- Pass: Z rotates opposite direction from Up
+- Fail: Z does same as Up, or doesn't rotate
+- Skip: could not detect rotation direction
+
+**Test 24: `soft_drop_distinct`**
+- Press Down arrow: piece should move one row
+- Press Space: piece should drop to bottom
+- Compare: Down should NOT behave like Space
+- Bug: `soft_drop_acts_as_hard_drop`
+- Pass: Down moves 1 row, Space drops to bottom
+- Fail: Down drops to bottom like Space
+
+## Polling and Timing
+
+- **Grid polling**: 60ms between reads during play
+- **Post-lock wait**: 100ms after piece locks before reading settled state
+- **Falling piece detector**: 10 screenshots at 100ms intervals
+- **Post-trigger settling**: 500ms after pressing start key/button before detection
+- **Score reading**: every 5th poll cycle during play (every 300ms)
+- **Clear animation wait**: 200-500ms between detecting complete row and checking if it cleared
+
+## Grid Reader
+
+The grid reader samples pixels to determine cell state (filled/empty).
+
+**Canvas games**: Use `getImageData()` to sample 5 points per cell (center + corners).
+Requires GPU access for reliable readback.
+
+**DOM games**: Read `backgroundColor` or `style.background` of each cell element.
+
+**Grid detection**: Find the game grid by looking for a rectangular region with
+approximately 2:1 height:width ratio containing a 10-column x 20-row grid of
+uniformly-sized cells.
+
+**Validation**: Reject grids where >60% of cells are filled (likely reading UI chrome,
+not game state). Validate aspect ratio (height ~= 2 * width).
+
+## AI Player
+
+4-heuristic evaluation from LeeYiyuan/tetrisai (MIT License):
+- Aggregate height: sum of column heights (weight: -0.510066)
+- Lines cleared: number of complete rows (weight: 0.760666)
+- Holes: empty cells below filled cells (weight: -0.35663)
+- Bumpiness: sum of absolute height differences between adjacent columns (weight: -0.184483)
+
+For each possible (rotation, column) placement:
+1. Simulate piece drop
+2. Score the resulting board
+3. Pick the highest-scoring placement
+
+Execute placement: rotate N times, move to target column, hard drop.
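The "score the resulting board" step of the AI player could look roughly like the sketch below. The four weights are the ones listed in the spec. Everything else is an assumption for illustration: `Board`, `columnHeights`, and `scoreBoard` are hypothetical names, and all four heuristics are computed on the same board before any complete rows are removed, which may differ from how the referenced project orders those steps.

```typescript
// Hypothetical sketch of the 4-heuristic board score.
// board[row][col], row 0 = top, true = filled.
type Board = boolean[][];

// Weights as listed in the spec (attributed there to LeeYiyuan/tetrisai).
const WEIGHTS = {
  aggregateHeight: -0.510066,
  linesCleared: 0.760666,
  holes: -0.35663,
  bumpiness: -0.184483,
};

// Height of each column: distance from its topmost filled cell to the floor.
function columnHeights(board: Board): number[] {
  const rows = board.length;
  const cols = rows > 0 ? board[0].length : 0;
  const heights: number[] = [];
  for (let c = 0; c < cols; c++) {
    let top = rows; // stays at `rows` when the column is empty
    for (let r = 0; r < rows; r++) {
      if (board[r][c]) { top = r; break; }
    }
    heights.push(rows - top);
  }
  return heights;
}

function scoreBoard(board: Board): number {
  const rows = board.length;
  const heights = columnHeights(board);

  const aggregateHeight = heights.reduce((sum, h) => sum + h, 0);
  const linesCleared = board.filter((row) => row.every((cell) => cell)).length;

  // Holes: empty cells below the topmost filled cell of their column.
  let holes = 0;
  heights.forEach((h, c) => {
    for (let r = rows - h; r < rows; r++) {
      if (!board[r][c]) holes++;
    }
  });

  // Bumpiness: sum of absolute height differences between adjacent columns.
  let bumpiness = 0;
  for (let c = 0; c + 1 < heights.length; c++) {
    bumpiness += Math.abs(heights[c] - heights[c + 1]);
  }

  return (
    WEIGHTS.aggregateHeight * aggregateHeight +
    WEIGHTS.linesCleared * linesCleared +
    WEIGHTS.holes * holes +
    WEIGHTS.bumpiness * bumpiness
  );
}
```

A placement search would call `scoreBoard` once per simulated (rotation, column) drop and keep the maximum. Because three of the four weights are negative, most boards score below zero and the best placement is simply the least negative one.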
+
+## Report Structure
+
+```json
+{
+  "implementation": {
+    "renderer": "canvas|dom|svg|webgl|unknown",
+    "grid_detected": true,
+    "grid_bounds": { "x": 0, "y": 0, "width": 320, "height": 640 },
+    "controls": { "left": "ArrowLeft", "right": "ArrowRight", ... },
+    "start_mechanism": "auto|enter|space|button|click_canvas|unknown",
+    "score_element_found": true,
+    "grid_confidence": 0.95,
+    "survey": {
+      "has_overlay": false,
+      "has_canvas": true,
+      "has_dom_grid": false,
+      "visible_text": ["TETRIS", "Score: 0", "Level: 1"],
+      "clickable_elements": 2
+    }
+  },
+  "tests": [
+    { "name": "game_loads", "pass": true, "detail": "..." },
+    { "name": "game_starts", "pass": true, "detail": "started via enter" },
+    ...
+  ],
+  "summary": {
+    "total": 24,
+    "passed": 20,
+    "failed": 2,
+    "skipped": 2,
+    "score": 0.83
+  },
+  "gameplay": {
+    "pieces_placed": 45,
+    "lines_cleared": 12,
+    "max_score_observed": 4200,
+    "play_duration_seconds": 30,
+    "errors_during_play": 0
+  },
+  "competitive_play": {
+    "duration_seconds": 55,
+    "pieces_placed": 62,
+    "total_lines_cleared": 18,
+    "single_clears": 12,
+    "double_clears": 2,
+    "triple_clears": 1,
+    "tetris_clears": 0,
+    "score_readings": [0, 100, 200, ...],
+    "score_final": 4200,
+    "level_final": 3,
+    "bugs_detected": []
+  },
+  "session": {
+    "frames": 500,
+    "pieces_spawned": 45,
+    "pieces_locked": 44,
+    "lines_cleared": 12,
+    "piece_types_seen": ["I", "O", "T", "S", "Z", "J", "L"],
+    "grid_read_success_rate": 0.98
+  },
+  "performance": {
+    "load_time_ms": 150
+  },
+  "accessibility": {
+    "issues": ["canvas without aria-label"],
+    "issue_count": 1,
+    "pass": false
+  }
+}
+```
+
+## Score Calculation
+
+The bot score is: `passed / total` (excluding skipped tests from both numerator
+and denominator). So if 20/22 non-skipped tests pass, score = 0.91.
+
+Skipped tests don't penalize -- they indicate the bot couldn't test that feature
+because a prerequisite failed. The game may still be good; we just can't verify.
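The score rule above, `passed / (total - skipped)` as the commit message states it, can be written as a small function. This is a sketch, not code from the commit: `Summary` mirrors the field names of the report's `summary` object, `botScore` is a hypothetical name, and returning 0 when every test was skipped is an assumption the spec does not address.

```typescript
// Hypothetical sketch of the score rule: skipped tests are removed from the
// denominator so they neither help nor hurt.
interface Summary {
  total: number;
  passed: number;
  failed: number;
  skipped: number;
}

function botScore(summary: Summary): number {
  const attempted = summary.total - summary.skipped;
  // Assumption: with nothing attempted there is nothing to score, so return 0.
  return attempted > 0 ? summary.passed / attempted : 0;
}

// Usage: 24 tests, 2 skipped, 20 of the remaining 22 passed.
const example: Summary = { total: 24, passed: 20, failed: 2, skipped: 2 };
console.log(botScore(example).toFixed(2)); // "0.91"
```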
+
+## Files
+
+- `types.ts` -- TypeScript interfaces for all data structures
+- `calibrate.ts` -- Grid detection, control detection, start mechanism, survey
+- `grid-reader.ts` -- Pixel sampling, grid state reading, piece detection
+- `player.ts` -- AI player, placement execution, heuristic evaluation
+- `tests.ts` -- Phase execution, test derivation, falling piece detector
+- `index.ts` -- Playwright test entry point, HTTP server, report output
+
+## Known Limitations
+
+The bot does NOT test:
+- Wall kicks (piece sliding along collision line)
+- Lock delay (brief window to slide before piece locks)
+- T-spins
+- Hold piece functionality
+- Ghost piece (shadow showing where piece will land)
+- Next piece preview accuracy (only checks if it exists)
+- Level/speed progression exact values (only checks direction)
+- DAS (delayed auto-shift for held keys)
+- Piece randomizer fairness (bag system vs pure random)
+
+The bot CAN be fooled by:
+- Games that render pieces identically to UI chrome (rare)
+- Games with unusual grid sizes (not 10x20)
+- Games where the grid is not visible (3D Tetris, first-person Tetris)
+- Games requiring mouse input for gameplay (not just start)
+- Games with very fast initial drop speed (piece may lock before bot reads it)
+
+## GPU Requirement
+
+Without GPU access in the container, canvas `getImageData()` returns all zeros
+in headless Chromium. The bot falls back to DOM-based grid reading for DOM-rendered
+games, but canvas games will fail grid reading entirely.
+
+To enable GPU in Proxmox LXC:
+```
+lxc.cgroup2.devices.allow: c 226:* rwm
+lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
+lxc.hook.autodev: sh -c "chmod 666 ${LXC_ROOTFS_MOUNT}/dev/dri/card0 ${LXC_ROOTFS_MOUNT}/dev/dri/renderD128 2>/dev/null || true"
+```
+
+The bot should detect GPU availability at startup and log a warning if
+`/dev/dri/renderD128` is not accessible. Canvas-based tests will report
+"grid reader unavailable (no GPU)" instead of false results.
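The startup check described above might be sketched like this. It is an assumption-laden illustration rather than the bot's code: `canAccessDevice` and `gridReaderStatus` are hypothetical names, the "available" message is invented for symmetry, and a successful permission check on the device node only shows that the file is reachable, not that canvas readback actually works.

```typescript
import * as fs from "fs";

// Hypothetical sketch: check whether the render node named in the spec is
// readable and writable by the current process.
function canAccessDevice(path: string = "/dev/dri/renderD128"): boolean {
  try {
    fs.accessSync(path, fs.constants.R_OK | fs.constants.W_OK);
    return true;
  } catch {
    return false;
  }
}

// Detail string for canvas-based tests. The "no GPU" wording is from the spec;
// the positive wording is an assumption.
function gridReaderStatus(gpuOk: boolean): string {
  return gpuOk ? "grid reader available" : "grid reader unavailable (no GPU)";
}

const gpuOk = canAccessDevice();
if (!gpuOk) {
  console.warn("GPU device /dev/dri/renderD128 is not accessible; canvas grid reading is disabled");
}
console.log(gridReaderStatus(gpuOk));
```

Taking the path as a parameter keeps the check testable on machines without a GPU, since a test can point it at a path that is known not to exist.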
