commit f978492f1169d00686406170f46df2c1f5f783ca
parent dcf0b2b68809e147c61745c349e067cdc700b022
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Wed, 8 Apr 2026 23:40:24 +0200
Add comprehensive gameplay bot spec (24 tests, 8 phases)
Master spec consolidating all design decisions for the gameplay bot rewrite:
- Conditional phase execution (each depends on previous succeeding)
- Falling piece detector (10 screenshots at 100ms, pixel cluster tracking)
- Start detection cascade: auto -> overlay -> buttons -> keyboard -> canvas
- Competitive play phase (60s, bug detection for multi-line clear, score
scaling, level/speed progression, rotation, soft drop)
- 24 total tests (16 basic + 8 competitive play bug checks)
- Skipped tests don't penalize score: passed / (total - skipped)
- GPU requirement documented for canvas pixel readback
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat:
1 file changed, 467 insertions(+), 0 deletions(-)
diff --git a/tasks/tetris/eval/gameplay-bot/SPEC.md b/tasks/tetris/eval/gameplay-bot/SPEC.md
@@ -0,0 +1,467 @@
+# Gameplay Bot Master Spec
+
+## Overview
+
+The gameplay bot is a Playwright-based automated tester that evaluates Tetris games
+built by AI agents. Every game is different: different DOM structures, different
+controls, different start mechanisms, different rendering approaches (canvas, DOM,
+SVG, WebGL). The bot must work with all of them.
+
+The bot does NOT use Claude or any LLM for grading. All evaluation is deterministic
+code. The bot plays the game using a known-good AI algorithm (a 4-heuristic board
+evaluator with genetically optimized weights, from LeeYiyuan/tetrisai, MIT License)
+and records what happens.
+
+## Architecture
+
+Two main components:
+1. **Grid reader**: Reads the game state by sampling pixels (canvas) or DOM colors.
+ Produces a 10x20 boolean grid. Works regardless of rendering approach.
+2. **AI player**: Given a grid state and current piece type, computes the optimal
+ placement using aggregate height, lines cleared, holes, and bumpiness heuristics.
+
+The bot requires GPU access for reliable pixel readback from canvas. Without GPU,
+headless Chromium's canvas `getImageData()` returns all zeros. The host must expose
+`/dev/dri/card0` and `/dev/dri/renderD128` to the container.
+
+## Test Structure: Conditional Phases
+
+Tests are organized as conditional phases. Each phase depends on the previous one
+succeeding. If a phase fails, all downstream tests are marked "skipped: [phase] failed"
+instead of producing false positives.
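+
+A minimal sketch of the gating logic, assuming hypothetical `Phase` and `TestResult`
+shapes (illustrative names, not taken from `types.ts`). It models the phases as one
+linear chain and runs synchronously, both simplifications: in this spec Phases 6-8
+each depend on Phase 5, and the real phases would be async Playwright calls.
+
+```typescript
+// Hypothetical shapes for illustration; the real interfaces live in types.ts.
+type Outcome = "pass" | "fail" | "skip";
+
+interface TestResult {
+  name: string;
+  outcome: Outcome;
+  detail: string;
+}
+
+interface Phase {
+  name: string;
+  tests: string[]; // names of the tests this phase owns
+  // Returns the phase's results plus whether downstream phases may proceed.
+  run: () => { ok: boolean; results: TestResult[] };
+}
+
+function runPhases(phases: Phase[]): TestResult[] {
+  const results: TestResult[] = [];
+  let failedPhase: string | null = null;
+
+  for (const phase of phases) {
+    if (failedPhase !== null) {
+      // A prerequisite failed: record skips instead of running the tests.
+      for (const name of phase.tests) {
+        results.push({ name, outcome: "skip", detail: `skipped: ${failedPhase} failed` });
+      }
+      continue;
+    }
+    const { ok, results: phaseResults } = phase.run();
+    results.push(...phaseResults);
+    if (!ok) failedPhase = phase.name;
+  }
+  return results;
+}
+```
+
+Separating `ok` from the individual outcomes lets a phase report a failed test
+without necessarily blocking everything downstream.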
+
+### Pre-test Survey (not scored, just data collection)
+
+Before any tests run, survey the page:
+- Is there a visible element with tetris grid proportions (~2:1 height:width)?
+- Is there a full-screen overlay (high z-index element covering >80% of viewport)?
+- Are there clickable elements (buttons, links, divs with click handlers)?
+- Is there a canvas element? Multiple canvases?
+- Is there a DOM-based grid (table, grid of divs)?
+- What text is visible? ("Press Enter", "Start", "Play", etc.)
+
+Store all survey data. This informs the start mechanism detection but is not a test.
+
+### Phase 1: Page Load (test 1)
+
+**Test: `game_loads`**
+- Navigate to the game URL
+- Wait 3 seconds for scripts to execute
+- Check for console errors
+- Pass: page loaded, no critical errors
+- Fail: page failed to load or has critical JS errors
+
+### Phase 2: Game Start Detection (tests 2-3)
+
+This phase determines if and how the game starts. It uses a cascading strategy:
+
+**Step 2a: Auto-start check**
+- Take 10 screenshots at 100ms intervals (~1 second total)
+- Look for a colored cluster of pixels (up to ~4 cells, bounding box aspect ratio
+ between 1:1 and 4:1) that moved downward between frames
+- This is the "falling piece detector"
+- If found: game auto-starts, no button needed
+- Store: `start_mechanism: "auto"`
+
+**Step 2b: Button/overlay detection (if 2a failed)**
+- Check for full-screen overlay (element covering >80% viewport with high z-index)
+- If overlay found:
+ - Try pressing Enter
+ - Wait 500ms, run falling piece detector (10 screenshots, 100ms each)
+ - If piece found: `start_mechanism: "enter"`, overlay was a start screen
+ - If not, try pressing Space, same check
+ - If not, try clicking the overlay center
+ - If not, try clicking any visible buttons in the overlay
+- If no overlay:
+ - Find all clickable elements (buttons, elements with onclick, role="button")
+ - Try clicking each one, run falling piece detector after each
+ - If piece found: `start_mechanism: "button"`, remember which element worked
+ - The start button might change state or disappear after clicking -- that's fine
+
+**Step 2c: Keyboard fallback (if 2b failed)**
+- Try pressing: Enter, Space, ArrowDown, Z, P, any key
+- After each, run falling piece detector
+- If piece found: store which key started the game
+
+**Step 2d: Canvas click (if 2c failed)**
+- Click the center of the canvas/grid element
+- Run falling piece detector
+- Some games render their start button on the canvas (no DOM element to find)
+
+**Falling piece detector algorithm:**
+- Take 10 screenshots at 100ms intervals
+- For each consecutive pair, diff the pixels
+- Look for a cluster of changed pixels that:
+ - Is roughly rectangular (bounding box aspect ratio between 1:1 and 4:1)
+ - Is in the upper portion of the game area
+ - Moved downward between frames (centroid Y increased)
+- A "cluster" = contiguous region of non-background-color pixels that appeared
+- Size: roughly 1-4 cells worth of pixels (each cell is typically 15-40px)
+- The piece may have rounded corners, glow effects, shadows -- look at bounding box
+- Must see movement in at least 2 frame pairs to confirm it's falling
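+
+The shape and movement checks above can be sketched as follows. This assumes the
+candidate cluster has already been extracted from each screenshot as a list of pixel
+coordinates (the extraction itself and the "upper portion of the game area" check are
+omitted). `Point`, `describeCluster`, and `looksLikeFallingPiece` are hypothetical
+names used only for illustration.
+
+```typescript
+interface Point { x: number; y: number; }
+
+// Bounding-box aspect ratio (long side / short side) and centroid Y of one cluster.
+function describeCluster(cluster: Point[]): { aspect: number; centroidY: number } {
+  let minX = Infinity, maxX = -Infinity, minY = Infinity, maxY = -Infinity, sumY = 0;
+  for (const p of cluster) {
+    minX = Math.min(minX, p.x);
+    maxX = Math.max(maxX, p.x);
+    minY = Math.min(minY, p.y);
+    maxY = Math.max(maxY, p.y);
+    sumY += p.y;
+  }
+  const width = maxX - minX + 1;
+  const height = maxY - minY + 1;
+  return {
+    aspect: Math.max(width, height) / Math.min(width, height),
+    centroidY: sumY / cluster.length,
+  };
+}
+
+// frames[i] holds the candidate cluster's pixels in screenshot i (empty if none found).
+function looksLikeFallingPiece(frames: Point[][], minMovingPairs = 2): boolean {
+  let movingPairs = 0;
+  for (let i = 1; i < frames.length; i++) {
+    const prev = frames[i - 1];
+    const curr = frames[i];
+    if (prev.length === 0 || curr.length === 0) continue;
+    const a = describeCluster(prev);
+    const b = describeCluster(curr);
+    const shapeOk = a.aspect <= 4 && b.aspect <= 4; // between 1:1 and 4:1
+    // Screen Y grows downward, so a larger centroid Y means the cluster fell.
+    if (shapeOk && b.centroidY > a.centroidY) movingPairs++;
+  }
+  return movingPairs >= minMovingPairs;
+}
+```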
+
+**Tests derived from Phase 2:**
+- `game_starts`: pass if falling piece detected by any mechanism
+- `auto_drop`: pass if piece falls on its own without any key input (only valid if
+ game auto-started or we confirmed start mechanism worked)
+
+### Phase 3: Mechanics Tests (tests 4-9)
+
+Only runs if Phase 2 succeeded (game started, piece detected).
+
+Reload the page. Start the game using the mechanism discovered in Phase 2.
+Wait for a piece to appear (falling piece detector).
+
+For each control test:
+1. Read the grid state (grid reader)
+2. Press the key
+3. Wait 60ms
+4. Read the grid state again
+5. Compare: did the relevant change happen?
+
+**Test: `move_left`** -- ArrowLeft, piece column decreased
+**Test: `move_right`** -- ArrowRight, piece column increased
+**Test: `move_down`** -- ArrowDown, piece row increased (soft drop)
+**Test: `rotate`** -- ArrowUp (or Z), piece shape changed (bounding box dimensions swapped for non-O pieces)
+**Test: `hard_drop`** -- Space, piece instantly at bottom (filled cells appear in bottom rows)
+**Test: `all_pieces_rotate`** -- Track piece types seen during play, confirm rotation works for non-O pieces
+
+If the grid reader cannot read the grid (no GPU, bad calibration), fall back to
+screenshot comparison for these tests. Mark as "(screenshot-verified)" not
+"(grid-verified)" in the detail string.
+
+### Phase 4: Piece Lifecycle Tests (tests 10-12)
+
+Only runs if Phase 3 mechanics worked.
+
+Continue from Phase 3 state or reload + start.
+
+**Test: `piece_locks`**
+- Hard drop a piece
+- Wait 300ms
+- Read grid: are there filled cells at the bottom that persist?
+- Must see cells that don't move for 2 consecutive reads 500ms apart
+
+**Test: `new_piece_spawns`**
+- After a piece locks, check top 4 rows of grid
+- A new piece should appear (filled cells in top rows)
+- Track: `piecesSpawned` counter
+
+**Test: `multiple_pieces`**
+- Play 10+ pieces (hard drop repeatedly)
+- Must detect at least 3 distinct piece placements
+- Track piece types seen (I, O, T, S, Z, J, L)
+
+### Phase 5: Gameplay Tests (tests 13-14)
+
+Only runs if Phase 4 piece lifecycle works.
+
+Reload the page. Start game. Play using the AI player for an extended session:
+- **60 pieces max, 45 seconds max**
+- **60ms polling** between grid reads
+- Read score element on every 5th poll cycle (integrated score tracking)
+
+**Test: `line_clear`**
+- During AI play, watch for complete rows (all cells filled)
+- After complete row detected, wait 200-500ms for clear animation
+- Read grid again: did the complete row disappear?
+- If AI play doesn't clear a line, try brute force: drop pieces at each column
+- If brute force fails, check if total filled cells decreased (indirect clear detection)
+- Pass: at least 1 line cleared (grid-verified)
+
+**Test: `score_changes`**
+- Read score element before play begins (record initial value)
+- During play, read score on every 5th poll cycle
+- After play, read final score
+- Pass: score increased from initial value
+- If no score element found, try scanning page text for changing numbers
+- Record: all score values observed, deltas between readings
+
+### Phase 6: Game Over Test (test 15)
+
+Only runs if Phase 5 gameplay works (at minimum, pieces can be placed; ideally
+lines can also clear).
+
+Reload the page. Start game.
+
+**Test: `game_over`**
+- Stack pieces to trigger game over: hard drop in the same column repeatedly
+ to build a tall tower
+- After each drop, check grid reader: are there filled cells in the top 2 rows?
+- Once top rows are filled, check:
+ 1. Does the game stop accepting input? (press keys, check if grid changes)
+ 2. Does "Game Over" or similar text appear in DOM?
+ 3. Does the page become static? (2 screenshots 1s apart are identical)
+- Pass: game stopped after filling to top (at least 1 of the 3 checks, subject to the two constraints below)
+- Do NOT use screenshot comparison alone (false positives on static start screens)
+- Must have evidence that pieces WERE being placed before the game stopped
+
+### Phase 7: Endurance Test (test 16)
+
+Only runs if Phase 5 gameplay works.
+
+Reload the page. Start game.
+
+**Test: `playable_30s`**
+- Play using AI player for 30 seconds
+- Track: pieces placed, lines cleared, console errors, play errors
+- Pass: played for 30+ seconds without crashing, placed 5+ pieces, no critical errors
+
+### Phase 8: Competitive Play (phase itself is not pass/fail; produces metrics + 8 additional tests)
+
+Only runs if Phase 5 gameplay works.
+
+Reload the page. Start game. Play competitively for 60 seconds using AI player.
+
+**Purpose**: Find bugs that the basic tests miss. A game might start, move pieces,
+and clear single lines, but fail on multi-line clears, score scaling, level
+progression, etc.
+
+**Data recorded:**
+```json
+{
+ "duration_seconds": 45,
+ "pieces_placed": 62,
+ "total_lines_cleared": 18,
+ "single_clears": 12,
+ "double_clears": 2,
+ "triple_clears": 1,
+ "tetris_clears": 0,
+ "max_combo": 3,
+ "score_readings": [0, 100, 200, 500, ...],
+ "score_final": 4200,
+ "score_increases": [100, 100, 300, ...],
+ "level_readings": [1, 1, 1, 2, 2, 3],
+ "level_final": 3,
+ "game_over_reached": true,
+ "game_over_text_found": "Game Over",
+ "restart_available": true,
+ "next_piece_visible": true,
+ "speed_increased": true,
+ "bugs_detected": ["score_does_not_scale_with_simultaneous_clears"]
+}
+```
+
+**8 additional tests (tests 17-24):**
+
+Each has three outcomes: pass (tested, works), fail (tested, broken), skip (no opportunity to test).
+
+**Test 17: `multi_line_clear`**
+- During play, detect when 2+ rows are complete simultaneously
+- Wait for clear animation (200-500ms)
+- Check: did all complete rows disappear?
+- Bug: `multi_line_clear_only_removes_one_row` -- only 1 row cleared when 2+ were complete
+- Skip: if no multi-line clear opportunity occurred during 60s play
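+
+One way to express this check, as a sketch only: compare the grid read when the
+complete rows were detected with the grid read after the animation wait. `Grid`,
+`countCompleteRows`, and `checkMultiLineClear` are hypothetical names. The sketch
+assumes that any complete row surviving the wait counts as this bug; it does not
+distinguish "only one row removed" from "nothing removed".
+
+```typescript
+type Grid = boolean[][]; // rows x columns, true = filled
+
+function countCompleteRows(grid: Grid): number {
+  return grid.filter((row) => row.length > 0 && row.every((cell) => cell)).length;
+}
+
+// before: grid read when the complete rows were detected.
+// after: grid read once the clear animation wait has elapsed.
+function checkMultiLineClear(
+  before: Grid,
+  after: Grid,
+): { outcome: "pass" | "fail" | "skip"; bug?: string } {
+  // No opportunity to test unless 2+ rows were complete at the same time.
+  if (countCompleteRows(before) < 2) return { outcome: "skip" };
+  // A complete row that is still present means the clear was partial (or missing).
+  if (countCompleteRows(after) > 0) {
+    return { outcome: "fail", bug: "multi_line_clear_only_removes_one_row" };
+  }
+  return { outcome: "pass" };
+}
+```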
+
+**Test 18: `score_scaling`**
+- Track score delta for each clear event
+- Compare: single clear delta vs multi-line clear delta
+- Bug: `score_does_not_scale_with_simultaneous_clears` -- multi-line gives same points as single
+- Standard Tetris scoring: single=100, double=300, triple=500, tetris=800 (x level)
+- Don't enforce exact formula, just check that multi > single
+- Skip: if no multi-line clear occurred
+
+**Test 19: `level_progression`**
+- Track lines cleared and level display throughout the session
+- After 10+ lines cleared, level should have increased from initial value
+- Bug: `level_does_not_increase`
+- Skip: if fewer than 10 lines cleared
+
+**Test 20: `speed_progression`**
+- Measure auto-drop interval at game start (time between automatic downward moves)
+- After level increases, measure again
+- Bug: `speed_does_not_increase` -- interval didn't decrease
+- Skip: if level didn't increase
+
+**Test 21: `next_piece_preview`**
+- Look for a secondary display area showing the next piece
+- Check: small canvas/div near the main grid with a single piece shape
+- Pass: found a next piece display
+- Fail: no next piece preview found
+
+**Test 22: `game_over_display`**
+- When game over is triggered (from Phase 6 or competitive play):
+- Check for "Game Over" or similar text
+- Check for restart button/prompt
+- Pass: both message and restart option present
+- Fail: missing either
+- Skip: game over not reached in Phase 6 or during competitive play
+
+**Test 23: `counter_clockwise_rotation`**
+- During play, occasionally press Z key instead of Up arrow
+- Compare piece shape before and after
+- Pass: Z rotates opposite direction from Up
+- Fail: Z does same as Up, or doesn't rotate
+- Skip: could not detect rotation direction
+
+**Test 24: `soft_drop_distinct`**
+- Press Down arrow: piece should move one row
+- Press Space: piece should drop to bottom
+- Compare: Down should NOT behave like Space
+- Bug: `soft_drop_acts_as_hard_drop`
+- Pass: Down moves 1 row, Space drops to bottom
+- Fail: Down drops to bottom like Space
+
+## Polling and Timing
+
+- **Grid polling**: 60ms between reads during play
+- **Post-lock wait**: 100ms after piece locks before reading settled state
+- **Falling piece detector**: 10 screenshots at 100ms intervals
+- **Post-trigger settling**: 500ms after pressing start key/button before detection
+- **Score reading**: every 5th poll cycle during play (every 300ms)
+- **Clear animation wait**: 200-500ms between detecting complete row and checking if it cleared
+
+## Grid Reader
+
+The grid reader samples pixels to determine cell state (filled/empty).
+
+**Canvas games**: Use `getImageData()` to sample 5 points per cell (center + corners).
+Requires GPU access for reliable readback.
+
+**DOM games**: Read `backgroundColor` or `style.background` of each cell element.
+
+**Grid detection**: Find the game grid by looking for a rectangular region with
+approximately 2:1 height:width ratio containing a 10-column x 20-row grid of
+uniformly-sized cells.
+
+**Validation**: Reject grids where >60% of cells are filled (likely reading UI chrome,
+not game state). Validate aspect ratio (height ~= 2 * width).
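+
+A sketch of the validation rule, assuming a grid stored as 20 rows of 10 booleans.
+`isPlausibleGrid`, `Bounds`, and the `aspectTolerance` default of 0.2 are assumptions
+made for illustration; this spec only says the ratio should be approximately 2:1.
+
+```typescript
+type Grid = boolean[][]; // 20 rows x 10 columns, true = filled
+interface Bounds { width: number; height: number; }
+
+function isPlausibleGrid(grid: Grid, bounds: Bounds, aspectTolerance = 0.2): boolean {
+  // Shape check: exactly 20 rows of 10 cells.
+  if (grid.length !== 20 || grid.some((row) => row.length !== 10)) return false;
+
+  // Reject grids where more than 60% of cells are filled (likely UI chrome).
+  let filled = 0;
+  for (const row of grid) {
+    for (const cell of row) {
+      if (cell) filled++;
+    }
+  }
+  if (filled / 200 > 0.6) return false;
+
+  // Aspect ratio check: height should be roughly twice the width.
+  if (bounds.width <= 0) return false;
+  return Math.abs(bounds.height / bounds.width - 2) <= aspectTolerance;
+}
+```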
+
+## AI Player
+
+4-heuristic evaluation from LeeYiyuan/tetrisai (MIT License):
+- Aggregate height: sum of column heights (weight: -0.510066)
+- Lines cleared: number of complete rows (weight: 0.760666)
+- Holes: empty cells below filled cells (weight: -0.35663)
+- Bumpiness: sum of absolute height differences between adjacent columns (weight: -0.184483)
+
+For each possible (rotation, column) placement:
+1. Simulate piece drop
+2. Score the resulting board
+3. Pick the highest-scoring placement
+
+Execute placement: rotate N times, move to target column, hard drop.
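+
+The board-scoring step can be sketched as below, using the four weights listed above.
+This is an illustrative reimplementation, not code copied from LeeYiyuan/tetrisai;
+`Grid`, `columnHeights`, and `evaluateBoard` are hypothetical names, and the grid is
+assumed to be the board after the simulated drop but before complete rows are removed.
+
+```typescript
+type Grid = boolean[][]; // row 0 is the top row, true = filled
+
+const WEIGHTS = { height: -0.510066, lines: 0.760666, holes: -0.35663, bumpiness: -0.184483 };
+
+// Height of each column: distance from its topmost filled cell to the floor.
+function columnHeights(grid: Grid): number[] {
+  const rows = grid.length;
+  const cols = grid[0].length;
+  const heights: number[] = [];
+  for (let c = 0; c < cols; c++) {
+    let h = 0;
+    for (let r = 0; r < rows; r++) {
+      if (grid[r][c]) { h = rows - r; break; }
+    }
+    heights.push(h);
+  }
+  return heights;
+}
+
+function evaluateBoard(grid: Grid): number {
+  const rows = grid.length;
+  const cols = grid[0].length;
+  const heights = columnHeights(grid);
+
+  const aggregateHeight = heights.reduce((sum, h) => sum + h, 0);
+  const completeLines = grid.filter((row) => row.every((cell) => cell)).length;
+
+  // Holes: empty cells below the topmost filled cell of their column.
+  let holes = 0;
+  for (let c = 0; c < cols; c++) {
+    for (let r = rows - heights[c]; r < rows; r++) {
+      if (!grid[r][c]) holes++;
+    }
+  }
+
+  // Bumpiness: sum of absolute height differences between adjacent columns.
+  let bumpiness = 0;
+  for (let c = 1; c < cols; c++) bumpiness += Math.abs(heights[c] - heights[c - 1]);
+
+  return WEIGHTS.height * aggregateHeight + WEIGHTS.lines * completeLines +
+    WEIGHTS.holes * holes + WEIGHTS.bumpiness * bumpiness;
+}
+```
+
+The placement search would call `evaluateBoard` once per simulated (rotation, column)
+result and keep the maximum.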
+
+## Report Structure
+
+```json
+{
+ "implementation": {
+ "renderer": "canvas|dom|svg|webgl|unknown",
+ "grid_detected": true,
+ "grid_bounds": { "x": 0, "y": 0, "width": 320, "height": 640 },
+ "controls": { "left": "ArrowLeft", "right": "ArrowRight", ... },
+ "start_mechanism": "auto|enter|space|button|click_canvas|unknown",
+ "score_element_found": true,
+ "grid_confidence": 0.95,
+ "survey": {
+ "has_overlay": false,
+ "has_canvas": true,
+ "has_dom_grid": false,
+ "visible_text": ["TETRIS", "Score: 0", "Level: 1"],
+ "clickable_elements": 2
+ }
+ },
+ "tests": [
+ { "name": "game_loads", "pass": true, "detail": "..." },
+ { "name": "game_starts", "pass": true, "detail": "started via enter" },
+ ...
+ ],
+ "summary": {
+ "total": 24,
+ "passed": 20,
+ "failed": 2,
+ "skipped": 2,
+ "score": 0.91
+ },
+ "gameplay": {
+ "pieces_placed": 45,
+ "lines_cleared": 12,
+ "max_score_observed": 4200,
+ "play_duration_seconds": 30,
+ "errors_during_play": 0
+ },
+ "competitive_play": {
+ "duration_seconds": 55,
+ "pieces_placed": 62,
+ "total_lines_cleared": 18,
+ "single_clears": 12,
+ "double_clears": 2,
+ "triple_clears": 1,
+ "tetris_clears": 0,
+ "score_readings": [0, 100, 200, ...],
+ "score_final": 4200,
+ "level_final": 3,
+ "bugs_detected": []
+ },
+ "session": {
+ "frames": 500,
+ "pieces_spawned": 45,
+ "pieces_locked": 44,
+ "lines_cleared": 12,
+ "piece_types_seen": ["I", "O", "T", "S", "Z", "J", "L"],
+ "grid_read_success_rate": 0.98
+ },
+ "performance": {
+ "load_time_ms": 150
+ },
+ "accessibility": {
+ "issues": ["canvas without aria-label"],
+ "issue_count": 1,
+ "pass": false
+ }
+}
+```
+
+## Score Calculation
+
+The bot score is `passed / (total - skipped)`: skipped tests are left out of the
+denominator and never count as passes. So if 20 of 22 non-skipped tests pass,
+score = 20/22 = 0.91.
+
+Skipped tests don't penalize -- they indicate the bot couldn't test that feature,
+either because a prerequisite phase failed or because no opportunity arose during
+play. The game may still be good; we just can't verify.
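+
+A small sketch of the calculation, with hypothetical names (`Outcome`, `Summary`,
+`summarize`). Rounding to two decimals and returning 0 when every test was skipped
+are assumptions; this spec does not define either.
+
+```typescript
+type Outcome = "pass" | "fail" | "skip";
+
+interface Summary {
+  total: number;
+  passed: number;
+  failed: number;
+  skipped: number;
+  score: number;
+}
+
+function summarize(outcomes: Outcome[]): Summary {
+  const total = outcomes.length;
+  const passed = outcomes.filter((o) => o === "pass").length;
+  const skipped = outcomes.filter((o) => o === "skip").length;
+  const failed = total - passed - skipped;
+
+  // Skipped tests are removed from the denominator so they do not penalize.
+  const attempted = total - skipped;
+  // Assumption: round to two decimals; score is 0 when nothing could be tested.
+  const score = attempted > 0 ? Math.round((passed / attempted) * 100) / 100 : 0;
+
+  return { total, passed, failed, skipped, score };
+}
+```
+
+With 20 passes, 2 failures, and 2 skips this gives 20 / 22, which rounds to 0.91.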
+
+## Files
+
+- `types.ts` -- TypeScript interfaces for all data structures
+- `calibrate.ts` -- Grid detection, control detection, start mechanism, survey
+- `grid-reader.ts` -- Pixel sampling, grid state reading, piece detection
+- `player.ts` -- AI player, placement execution, heuristic evaluation
+- `tests.ts` -- Phase execution, test derivation, falling piece detector
+- `index.ts` -- Playwright test entry point, HTTP server, report output
+
+## Known Limitations
+
+The bot does NOT test:
+- Wall kicks (piece shifting position when a rotation would otherwise collide)
+- Lock delay (brief window to slide before piece locks)
+- T-spins
+- Hold piece functionality
+- Ghost piece (shadow showing where piece will land)
+- Next piece preview accuracy (only checks if it exists)
+- Level/speed progression exact values (only checks direction)
+- DAS (delayed auto-shift for held keys)
+- Piece randomizer fairness (bag system vs pure random)
+
+The bot CAN be fooled by:
+- Games that render pieces identically to UI chrome (rare)
+- Games with unusual grid sizes (not 10x20)
+- Games where the grid is not visible (3D Tetris, first-person Tetris)
+- Games requiring mouse input for gameplay (not just start)
+- Games with very fast initial drop speed (piece may lock before bot reads it)
+
+## GPU Requirement
+
+Without GPU access in the container, canvas `getImageData()` returns all zeros
+in headless Chromium. The bot falls back to DOM-based grid reading for DOM-rendered
+games, but canvas games will fail grid reading entirely.
+
+To enable GPU in Proxmox LXC:
+```
+lxc.cgroup2.devices.allow: c 226:* rwm
+lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
+lxc.hook.autodev: sh -c "chmod 666 ${LXC_ROOTFS_MOUNT}/dev/dri/card0 ${LXC_ROOTFS_MOUNT}/dev/dri/renderD128 2>/dev/null || true"
+```
+
+The bot should detect GPU availability at startup and log a warning if
+`/dev/dri/renderD128` is not accessible. Canvas-based tests will report
+"grid reader unavailable (no GPU)" instead of false results.