loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

SPEC.md (18082B)


# Gameplay Bot Master Spec

## Overview

The gameplay bot is a Playwright-based automated tester that evaluates Tetris games
built by AI agents. Every game is different: different DOM structures, different
controls, different start mechanisms, different rendering approaches (canvas, DOM,
SVG, WebGL). The bot must work with all of them.

The bot does NOT use Claude or any LLM for grading. All evaluation is deterministic
code. The bot plays the game using a known-good AI algorithm (4-heuristic board
evaluation with genetically optimized weights, reference implementation:
LeeYiyuan/tetrisai, MIT License) and records what happens.

## Architecture

Two main components:
1. **Grid reader**: Reads the game state by sampling pixels (canvas) or DOM colors.
   Produces a 10x20 boolean grid. Works regardless of rendering approach.
2. **AI player**: Given a grid state and current piece type, computes the best-scoring
   placement using aggregate height, lines cleared, holes, and bumpiness heuristics.

The bot requires GPU access for reliable pixel readback from canvas. Without GPU,
headless Chromium's canvas `getImageData()` returns all zeros. The host must expose
`/dev/dri/card0` and `/dev/dri/renderD128` to the container.

## Test Structure: Conditional Phases

Tests are organized as conditional phases. Each phase depends on an earlier phase
succeeding (Phases 2-5 form a chain; Phases 6-8 each depend on Phase 5). If a phase
fails, all downstream tests are marked "skipped: [phase] failed" instead of producing
false positives.
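
The gating rule can be sketched as follows. This is a minimal illustration, not the bot's actual code: the `Phase` and `TestResult` shapes and the `runPhases` name are hypothetical, a phase is assumed to fail when any of its tests fails, and the phases are simplified to a single linear chain.

```typescript
// Hypothetical shapes: the spec does not define these names.
type TestResult = { name: string; status: "pass" | "fail" | "skipped"; detail: string };

interface Phase {
  name: string;
  tests: string[];
  // Returns one pass/fail flag per entry in `tests`, in the same order.
  run: () => boolean[];
}

function runPhases(phases: Phase[]): TestResult[] {
  const results: TestResult[] = [];
  let failedPhase: string | null = null;

  for (const phase of phases) {
    if (failedPhase !== null) {
      // An upstream phase failed: do not run, mark every test as skipped.
      for (const name of phase.tests) {
        results.push({ name, status: "skipped", detail: `skipped: ${failedPhase} failed` });
      }
      continue;
    }
    const outcomes = phase.run();
    let allPassed = true;
    phase.tests.forEach((name, i) => {
      const ok = outcomes[i] === true;
      if (!ok) allPassed = false;
      results.push({ name, status: ok ? "pass" : "fail", detail: ok ? "ok" : "failed" });
    });
    // Assumption: any failing test fails the whole phase.
    if (!allPassed) failedPhase = phase.name;
  }
  return results;
}
```

Marking downstream tests as skipped, rather than failed, keeps them out of the score denominator described under Score Calculation.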

### Pre-test Survey (not scored, just data collection)

Before any tests run, survey the page:
- Is there a visible element with tetris grid proportions (~2:1 height:width)?
- Is there a full-screen overlay (high z-index element covering >80% of viewport)?
- Are there clickable elements (buttons, links, divs with click handlers)?
- Is there a canvas element? Multiple canvases?
- Is there a DOM-based grid (table, grid of divs)?
- What text is visible? ("Press Enter", "Start", "Play", etc.)

Store all survey data. This informs the start mechanism detection but is not a test.

### Phase 1: Page Load (test 1)

**Test: `game_loads`**
- Navigate to the game URL
- Wait 3 seconds for scripts to execute
- Check for console errors
- Pass: page loaded, no critical errors
- Fail: page failed to load or has critical JS errors

### Phase 2: Game Start Detection (tests 2-3)

This phase determines if and how the game starts. It uses a cascading strategy:

**Step 2a: Auto-start check**
- Take 10 screenshots at 100ms intervals (1 second total)
- Look for a colored cluster of pixels (~4 cells, roughly square-ish bounding box)
  that moved downward between frames
- This is the "falling piece detector"
- If found: game auto-starts, no button needed
- Store: `start_mechanism: "auto"`

**Step 2b: Button/overlay detection (if 2a failed)**
- Check for full-screen overlay (element covering >80% viewport with high z-index)
- If overlay found:
  - Try pressing Enter
  - Wait 500ms, run falling piece detector (10 screenshots, 100ms each)
  - If piece found: `start_mechanism: "enter"`, overlay was a start screen
  - If not, try pressing Space, same check
  - If not, try clicking the overlay center
  - If not, try clicking any visible buttons in the overlay
- If no overlay:
  - Find all clickable elements (buttons, elements with onclick, role="button")
  - Try clicking each one, run falling piece detector after each
  - If piece found: `start_mechanism: "button"`, remember which element worked
  - The start button might change state or disappear after clicking -- that's fine

**Step 2c: Keyboard fallback (if 2b failed)**
- Try pressing: Enter, Space, ArrowDown, Z, P, any key
- After each, run falling piece detector
- If piece found: store which key started the game

**Step 2d: Canvas click (if 2c failed)**
- Click the center of the canvas/grid element
- Run falling piece detector
- Some games render their start button on the canvas (no DOM element to find)
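
One way to structure the cascade in Steps 2a-2d is as an ordered list of attempts, each followed by a detector run. The sketch below is an assumption about structure only: `StartAttempt`, `detectStartMechanism`, and the injected `pieceIsFalling` and `settle` callbacks are hypothetical names. In the real bot the triggers would be Playwright key presses and clicks, and `settle` would be the 500ms post-trigger wait.

```typescript
type StartMechanism = "auto" | "enter" | "space" | "button" | "click_canvas" | "unknown";

// Hypothetical shape: one entry per thing the bot tries, in cascade order.
interface StartAttempt {
  mechanism: StartMechanism;
  // Performs the trigger (a no-op for the auto-start check).
  trigger: () => Promise<void>;
}

async function detectStartMechanism(
  attempts: StartAttempt[],
  pieceIsFalling: () => Promise<boolean>,
  settle: () => Promise<void> = async () => {},
): Promise<StartMechanism> {
  for (const attempt of attempts) {
    await attempt.trigger();
    await settle(); // give the game time to react before sampling frames
    if (await pieceIsFalling()) {
      return attempt.mechanism; // first mechanism that produces a falling piece wins
    }
  }
  return "unknown"; // nothing started the game
}
```

Because the first successful attempt returns immediately, later and more invasive triggers (clicking arbitrary elements, clicking the canvas) are only reached when the cheaper ones fail.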

**Falling piece detector algorithm:**
- Take 10 screenshots at 100ms intervals
- For each consecutive pair, diff the pixels
- Look for a cluster of changed pixels that:
  - Is roughly rectangular (bounding box aspect ratio between 1:1 and 4:1)
  - Is in the upper portion of the game area
  - Moved downward between frames (centroid Y increased)
- A "cluster" = contiguous region of non-background-color pixels that appeared
- Size: roughly 1-4 cells worth of pixels (each cell is typically 15-40px)
- The piece may have rounded corners, glow effects, shadows -- look at bounding box
- Must see movement in at least 2 frame pairs to confirm it's falling
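
The decision logic of the detector can be sketched on top of already-extracted cluster bounding boxes. Pixel diffing and cluster extraction are out of scope here. The `Cluster` shape and function names are hypothetical, and "upper portion" is assumed to mean the upper half of the game area.

```typescript
// Bounding box (in pixels) of the candidate cluster found in one frame.
interface Cluster { x: number; y: number; width: number; height: number }

function centroidY(c: Cluster): number {
  return c.y + c.height / 2;
}

// Shape and position filter from the list above.
function plausiblePiece(c: Cluster, gameTop: number, gameHeight: number): boolean {
  const longSide = Math.max(c.width, c.height);
  const shortSide = Math.min(c.width, c.height);
  if (shortSide <= 0) return false;
  const aspect = longSide / shortSide; // allowed range: 1:1 up to 4:1
  // Assumption: "upper portion" means the upper half of the game area.
  const inUpperPortion = centroidY(c) < gameTop + gameHeight / 2;
  return aspect <= 4 && inUpperPortion;
}

// `clusters[i]` is the candidate cluster in frame i, or null if none was found.
function isFalling(clusters: (Cluster | null)[], gameTop: number, gameHeight: number): boolean {
  let downwardPairs = 0;
  for (let i = 1; i < clusters.length; i++) {
    const prev = clusters[i - 1];
    const curr = clusters[i];
    if (!prev || !curr) continue;
    if (!plausiblePiece(prev, gameTop, gameHeight)) continue;
    if (!plausiblePiece(curr, gameTop, gameHeight)) continue;
    if (centroidY(curr) > centroidY(prev)) downwardPairs++;
  }
  return downwardPairs >= 2; // movement must be seen in at least 2 frame pairs
}
```

Requiring two downward pairs filters out one-off changes such as a blinking "Press Start" label, which alters pixels but does not move.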

**Tests derived from Phase 2:**
- `game_starts`: pass if falling piece detected by any mechanism
- `auto_drop`: pass if piece falls on its own without any key input (only valid if
  game auto-started or we confirmed start mechanism worked)

### Phase 3: Mechanics Tests (tests 4-9)

Only runs if Phase 2 succeeded (game started, piece detected).

Reload the page. Start the game using the mechanism discovered in Phase 2.
Wait for a piece to appear (falling piece detector).

For each control test:
1. Read the grid state (grid reader)
2. Press the key
3. Wait 60ms
4. Read the grid state again
5. Compare: did the relevant change happen?

- **Test: `move_left`** -- ArrowLeft, piece column decreased
- **Test: `move_right`** -- ArrowRight, piece column increased
- **Test: `move_down`** -- ArrowDown, piece row increased (soft drop)
- **Test: `rotate`** -- ArrowUp (or Z), piece shape changed (bounding box dimensions swapped for non-O pieces)
- **Test: `hard_drop`** -- Space, piece instantly at bottom (filled cells appear in bottom rows)
- **Test: `all_pieces_rotate`** -- Track piece types seen during play, confirm rotation works for non-O pieces
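
The comparison in step 5 can be sketched for the movement and rotation tests. This assumes the grid reader can already report which cells belong to the active piece. `Cell`, `boundingBox`, and `controlWorked` are hypothetical names, with rows counted from the top of the grid.

```typescript
type Cell = { row: number; col: number }; // row 0 is the top of the grid
type Control = "move_left" | "move_right" | "move_down" | "rotate";

function boundingBox(cells: Cell[]) {
  const rows = cells.map((c) => c.row);
  const cols = cells.map((c) => c.col);
  const minRow = Math.min(...rows);
  const minCol = Math.min(...cols);
  return {
    minRow,
    minCol,
    width: Math.max(...cols) - minCol + 1,
    height: Math.max(...rows) - minRow + 1,
  };
}

// `before` and `after` are the active piece's cells around the key press.
function controlWorked(control: Control, before: Cell[], after: Cell[]): boolean {
  if (before.length === 0 || after.length === 0) return false;
  const a = boundingBox(before);
  const b = boundingBox(after);
  if (control === "move_left") return b.minCol < a.minCol;
  if (control === "move_right") return b.minCol > a.minCol;
  if (control === "move_down") return b.minRow > a.minRow;
  // rotate: bounding box dimensions swap; a square box (O piece) cannot show this
  return a.width !== a.height && b.width === a.height && b.height === a.width;
}
```

Note that gravity can also move the piece down during the 60ms wait, so a `move_down` pass from this check alone does not prove the key press caused the move.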

### Phase 4: Piece Lifecycle Tests (tests 10-12)

Only runs if Phase 3 mechanics worked.

Continue from Phase 3 state or reload + start.

**Test: `piece_locks`**
- Hard drop a piece
- Wait 300ms
- Read grid: are there filled cells at the bottom that persist?
- Must see cells that don't move for 2 consecutive reads 500ms apart

**Test: `new_piece_spawns`**
- After a piece locks, check top 4 rows of grid
- A new piece should appear (filled cells in top rows)
- Track: `piecesSpawned` counter

**Test: `multiple_pieces`**
- Play 10+ pieces (hard drop repeatedly)
- Must detect at least 3 distinct piece placements
- Track piece types seen (I, O, T, S, Z, J, L)

### Phase 5: Gameplay Tests (tests 13-14)

Only runs if Phase 4 piece lifecycle works.

Reload the page. Start game. Play using the AI player for an extended session:
- **60 pieces max, 45 seconds max**
- **60ms polling** between grid reads
- Read score element on every 5th poll cycle (integrated score tracking)

**Test: `line_clear`**
- During AI play, watch for complete rows (all cells filled)
- After complete row detected, wait 200-500ms for clear animation
- Read grid again: did the complete row disappear?
- If AI play doesn't clear a line, try brute force: drop pieces at each column
- If brute force fails, check if total filled cells decreased (indirect clear detection)
- Pass: at least 1 line cleared (grid-verified)
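
The row checks behind `line_clear` can be sketched on the boolean grid. The function names are hypothetical. `lineClearVerified` covers the direct check (a complete row is no longer complete after the wait) and `indirectClear` covers the filled-cell fallback.

```typescript
type Grid = boolean[][]; // grid[row][col], true = filled

function isComplete(row: boolean[] | undefined): boolean {
  return row !== undefined && row.length > 0 && row.every((cell) => cell);
}

// Indices of rows in which every cell is filled.
function completeRows(grid: Grid): number[] {
  const rows: number[] = [];
  grid.forEach((row, r) => {
    if (isComplete(row)) rows.push(r);
  });
  return rows;
}

function filledCount(grid: Grid): number {
  return grid.reduce((sum, row) => sum + row.filter((cell) => cell).length, 0);
}

// Direct check: every row that was complete before is no longer complete after.
function lineClearVerified(before: Grid, after: Grid): boolean {
  const complete = completeRows(before);
  return complete.length > 0 && complete.every((r) => !isComplete(after[r]));
}

// Indirect check: the total number of filled cells went down.
function indirectClear(before: Grid, after: Grid): boolean {
  return filledCount(after) < filledCount(before);
}
```

The direct check compares by row index, so it would miss the rare case where the row that shifts down into a cleared position is itself complete. The indirect check still catches that case.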

**Test: `score_changes`**
- Read score element before play begins (record initial value)
- During play, read score on every 5th poll cycle
- After play, read final score
- Pass: score increased from initial value
- If no score element found, try scanning page text for changing numbers
- Record: all score values observed, deltas between readings

### Phase 6: Game Over Test (test 15)

Only runs if Phase 5 gameplay works (pieces can be placed and lines can clear,
or at minimum pieces can be placed).

Reload the page. Start game.

**Test: `game_over`**
- Stack pieces to trigger game over: hard drop in the same column repeatedly
  to build a tall tower
- After each drop, check grid reader: are there filled cells in the top 2 rows?
- Once top rows are filled, check:
  1. Does the game stop accepting input? (press keys, check if grid changes)
  2. Does "Game Over" or similar text appear in DOM?
  3. Does the page become static? (2 screenshots 1s apart are identical)
- Pass: game stopped after filling to top (at least 1 of the 3 checks, subject to
  the two rules below)
- Do NOT rely on screenshot comparison (check 3) alone: a static start screen would
  produce a false positive
- Must have evidence that pieces WERE being placed before the game stopped

### Phase 7: Endurance Test (test 16)

Only runs if Phase 5 gameplay works.

Reload the page. Start game.

**Test: `playable_30s`**
- Play using AI player for 30 seconds
- Track: pieces placed, lines cleared, console errors, play errors
- Pass: played for 30+ seconds without crashing, placed 5+ pieces, no critical errors
### Phase 8: Competitive Play (phase itself is not pass/fail; produces metrics + 8 additional tests)

Only runs if Phase 5 gameplay works.

Reload the page. Start game. Play competitively for 60 seconds using AI player.

**Purpose**: Find bugs that the basic tests miss. A game might start, move pieces,
and clear single lines, but fail on multi-line clears, score scaling, level
progression, etc.

**Data recorded:**
```json
{
  "duration_seconds": 45,
  "pieces_placed": 62,
  "total_lines_cleared": 18,
  "single_clears": 11,
  "double_clears": 2,
  "triple_clears": 1,
  "tetris_clears": 0,
  "max_combo": 3,
  "score_readings": [0, 100, 200, 500, ...],
  "score_final": 4200,
  "score_increases": [100, 100, 300, ...],
  "level_readings": [1, 1, 1, 2, 2, 3],
  "level_final": 3,
  "game_over_reached": true,
  "game_over_text_found": "Game Over",
  "restart_available": true,
  "next_piece_visible": true,
  "speed_increased": true,
  "bugs_detected": ["score_does_not_scale_with_simultaneous_clears"]
}
```

**8 additional tests (tests 17-24):**

Each has three outcomes: pass (tested, works), fail (tested, broken), skip (no opportunity to test).

**Test 17: `multi_line_clear`**
- During play, detect when 2+ rows are complete simultaneously
- Wait for clear animation (200-500ms)
- Check: did all complete rows disappear?
- Bug: `multi_line_clear_only_removes_one_row` -- only 1 row cleared when 2+ were complete
- Skip: if no multi-line clear opportunity occurred during 60s play

**Test 18: `score_scaling`**
- Track score delta for each clear event
- Compare: single clear delta vs multi-line clear delta
- Bug: `score_does_not_scale_with_simultaneous_clears` -- multi-line gives same points as single
- Standard Tetris scoring: single=100, double=300, triple=500, tetris=800 (x level)
- Don't enforce exact formula, just check that multi > single
- Skip: if no multi-line clear occurred
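
The `score_scaling` decision can be sketched as a three-way outcome over recorded clear events. `ClearEvent` and `scoreScaling` are hypothetical names. The sketch ignores the level multiplier, so it assumes the compared clears happened at the same level, and it also skips when there is no single clear to compare against, which is a further assumption.

```typescript
// One entry per detected clear: how many rows went at once and the score change.
interface ClearEvent { lines: number; scoreDelta: number }
type Outcome = "pass" | "fail" | "skip";

function scoreScaling(events: ClearEvent[]): Outcome {
  const singles = events.filter((e) => e.lines === 1).map((e) => e.scoreDelta);
  const multis = events.filter((e) => e.lines >= 2).map((e) => e.scoreDelta);
  if (multis.length === 0) return "skip"; // no multi-line clear occurred
  if (singles.length === 0) return "skip"; // assumption: nothing to compare against
  const bestSingle = Math.max(...singles);
  // No exact formula is enforced: every multi-line clear just has to beat a single.
  return multis.every((delta) => delta > bestSingle) ? "pass" : "fail";
}
```

With the standard values listed above, a double (300) against a single (100) passes, while a game that awards 100 for both fails and would be reported as `score_does_not_scale_with_simultaneous_clears`.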

**Test 19: `level_progression`**
- Track lines cleared and level display throughout the session
- After 10+ lines cleared, level should have increased from initial value
- Bug: `level_does_not_increase`
- Skip: if fewer than 10 lines cleared

**Test 20: `speed_progression`**
- Measure auto-drop interval at game start (time between automatic downward moves)
- After level increases, measure again
- Bug: `speed_does_not_increase` -- interval didn't decrease
- Skip: if level didn't increase

**Test 21: `next_piece_preview`**
- Look for a secondary display area showing the next piece
- Check: small canvas/div near the main grid with a single piece shape
- Pass: found a next piece display
- Fail: no next piece preview found

**Test 22: `game_over_display`**
- When game over is triggered (from Phase 6 or competitive play):
- Check for "Game Over" or similar text
- Check for restart button/prompt
- Pass: both message and restart option present
- Fail: missing either
- Skip: game over not reached in Phase 6 or competitive play

**Test 23: `counter_clockwise_rotation`**
- During play, occasionally press Z key instead of Up arrow
- Compare piece shape before and after
- Pass: Z rotates opposite direction from Up
- Fail: Z does same as Up, or doesn't rotate
- Skip: could not detect rotation direction

**Test 24: `soft_drop_distinct`**
- Press Down arrow: piece should move one row
- Press Space: piece should drop to bottom
- Compare: Down should NOT behave like Space
- Bug: `soft_drop_acts_as_hard_drop`
- Pass: Down moves 1 row, Space drops to bottom
- Fail: Down drops to bottom like Space
## Polling and Timing

- **Grid polling**: 60ms between reads during play
- **Post-lock wait**: 100ms after piece locks before reading settled state
- **Falling piece detector**: 10 screenshots at 100ms intervals
- **Post-trigger settling**: 500ms after pressing start key/button before detection
- **Score reading**: every 5th poll cycle during play (every 300ms)
- **Clear animation wait**: 200-500ms between detecting complete row and checking if it cleared
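
These values could be collected in one place so the phases share them. The sketch below copies the numbers from the list above. The `TIMING` object, its field names, and `shouldReadScore` are hypothetical.

```typescript
// Values from the list above; the names are illustrative only.
const TIMING = {
  gridPollMs: 60,
  postLockWaitMs: 100,
  detectorFrames: 10,
  detectorIntervalMs: 100,
  postTriggerSettleMs: 500,
  scoreReadEveryNPolls: 5,
  clearAnimationWaitMs: { min: 200, max: 500 },
} as const;

// True on every 5th poll cycle, which is when the score element is read.
function shouldReadScore(pollIndex: number): boolean {
  return pollIndex % TIMING.scoreReadEveryNPolls === 0;
}

// 5 poll cycles of 60ms each gives the 300ms score-read interval.
const scoreReadIntervalMs: number = TIMING.gridPollMs * TIMING.scoreReadEveryNPolls;
```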

## Grid Reader

The grid reader samples pixels to determine cell state (filled/empty).

**Canvas games**: Use `getImageData()` to sample 5 points per cell (center + corners).
Requires GPU access for reliable readback.

**DOM games**: Read `backgroundColor` or `style.background` of each cell element.

**Grid detection**: Find the game grid by looking for a rectangular region with
approximately 2:1 height:width ratio containing a 10-column x 20-row grid of
uniformly-sized cells.

**Validation**: Reject grids where >60% of cells are filled (likely reading UI chrome,
not game state). Validate aspect ratio (height ~= 2 * width).
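
The validation rules can be sketched as a single function that returns a rejection reason or `null`. `validateGrid` and `GridBounds` are hypothetical names, and the aspect-ratio tolerance of 0.3 is an assumed value, since the spec only says "approximately".

```typescript
interface GridBounds { width: number; height: number } // detected grid size in pixels

// Returns null when the grid looks valid, otherwise a short rejection reason.
function validateGrid(grid: boolean[][], bounds: GridBounds, aspectTolerance = 0.3): string | null {
  if (grid.length !== 20 || grid.some((row) => row.length !== 10)) {
    return "grid is not 10 columns x 20 rows";
  }
  const filled = grid.reduce((sum, row) => sum + row.filter((cell) => cell).length, 0);
  if (filled / 200 > 0.6) {
    return "more than 60% of cells filled (likely UI chrome)";
  }
  if (bounds.width <= 0 || bounds.height <= 0) {
    return "invalid bounds";
  }
  // Assumption: height/width may deviate from 2 by at most `aspectTolerance`.
  const aspect = bounds.height / bounds.width;
  if (Math.abs(aspect - 2) > aspectTolerance) {
    return "aspect ratio is not ~2:1";
  }
  return null;
}
```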

## AI Player

A 4-heuristic board evaluation with weights tuned by a genetic algorithm
(reference implementation: LeeYiyuan/tetrisai, MIT License):
- Aggregate height: sum of column heights (weight: -0.510066)
- Lines cleared: number of complete rows (weight: 0.760666)
- Holes: empty cells below filled cells (weight: -0.35663)
- Bumpiness: sum of absolute height differences between adjacent columns (weight: -0.184483)

For each possible (rotation, column) placement:
1. Simulate piece drop
2. Score the resulting board
3. Pick the highest-scoring placement

Execute placement: rotate N times, move to target column, hard drop.
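
The evaluation can be sketched directly from the four definitions and weights above. This is an illustration rather than the reference implementation: the function names and the `Candidate` shape are hypothetical, the piece-drop simulation is assumed to have already produced each candidate board, and the features are computed on that board as given.

```typescript
type Grid = boolean[][]; // grid[row][col], row 0 is the top, true = filled

const WEIGHTS = { aggregateHeight: -0.510066, completeLines: 0.760666, holes: -0.35663, bumpiness: -0.184483 };

// Height of each column: distance from its topmost filled cell to the floor.
function columnHeights(grid: Grid): number[] {
  const rows = grid.length;
  const cols = grid[0]?.length ?? 0;
  const heights: number[] = [];
  for (let c = 0; c < cols; c++) {
    let h = 0;
    for (let r = 0; r < rows; r++) {
      if (grid[r][c]) { h = rows - r; break; }
    }
    heights.push(h);
  }
  return heights;
}

// Empty cells that have at least one filled cell above them in the same column.
function countHoles(grid: Grid): number {
  const cols = grid[0]?.length ?? 0;
  let holes = 0;
  for (let c = 0; c < cols; c++) {
    let seenFilled = false;
    for (let r = 0; r < grid.length; r++) {
      if (grid[r][c]) seenFilled = true;
      else if (seenFilled) holes++;
    }
  }
  return holes;
}

function countCompleteLines(grid: Grid): number {
  return grid.filter((row) => row.length > 0 && row.every((cell) => cell)).length;
}

function bumpiness(heights: number[]): number {
  let total = 0;
  for (let c = 1; c < heights.length; c++) total += Math.abs(heights[c] - heights[c - 1]);
  return total;
}

function evaluateBoard(grid: Grid): number {
  const heights = columnHeights(grid);
  const aggregate = heights.reduce((sum, h) => sum + h, 0);
  return WEIGHTS.aggregateHeight * aggregate
    + WEIGHTS.completeLines * countCompleteLines(grid)
    + WEIGHTS.holes * countHoles(grid)
    + WEIGHTS.bumpiness * bumpiness(heights);
}

// Hypothetical shape: one simulated result per (rotation, column) placement.
interface Candidate { rotation: number; column: number; board: Grid }

function pickBest(candidates: Candidate[]): Candidate | null {
  let best: Candidate | null = null;
  let bestScore = -Infinity;
  for (const candidate of candidates) {
    const score = evaluateBoard(candidate.board);
    if (score > bestScore) { best = candidate; bestScore = score; }
  }
  return best;
}
```

Only the lines-cleared weight is positive, so the player is rewarded for completing rows and penalized for tall, holey, or uneven stacks.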

## Report Structure

```json
{
  "implementation": {
    "renderer": "canvas|dom|svg|webgl|unknown",
    "grid_detected": true,
    "grid_bounds": { "x": 0, "y": 0, "width": 320, "height": 640 },
    "controls": { "left": "ArrowLeft", "right": "ArrowRight", ... },
    "start_mechanism": "auto|enter|space|button|click_canvas|unknown",
    "score_element_found": true,
    "grid_confidence": 0.95,
    "survey": {
      "has_overlay": false,
      "has_canvas": true,
      "has_dom_grid": false,
      "visible_text": ["TETRIS", "Score: 0", "Level: 1"],
      "clickable_elements": 2
    }
  },
  "tests": [
    { "name": "game_loads", "pass": true, "detail": "..." },
    { "name": "game_starts", "pass": true, "detail": "started via enter" },
    ...
  ],
  "summary": {
    "total": 24,
    "passed": 20,
    "failed": 2,
    "skipped": 2,
    "score": 0.91
  },
  "gameplay": {
    "pieces_placed": 45,
    "lines_cleared": 12,
    "max_score_observed": 4200,
    "play_duration_seconds": 30,
    "errors_during_play": 0
  },
  "competitive_play": {
    "duration_seconds": 55,
    "pieces_placed": 62,
    "total_lines_cleared": 18,
    "single_clears": 11,
    "double_clears": 2,
    "triple_clears": 1,
    "tetris_clears": 0,
    "score_readings": [0, 100, 200, ...],
    "score_final": 4200,
    "level_final": 3,
    "bugs_detected": []
  },
  "session": {
    "frames": 500,
    "pieces_spawned": 45,
    "pieces_locked": 44,
    "lines_cleared": 12,
    "piece_types_seen": ["I", "O", "T", "S", "Z", "J", "L"],
    "grid_read_success_rate": 0.98
  },
  "performance": {
    "load_time_ms": 150
  },
  "accessibility": {
    "issues": ["canvas without aria-label"],
    "issue_count": 1,
    "pass": false
  }
}
```

## Score Calculation

The bot score is: `passed / (passed + failed)`. Skipped tests are excluded from the
denominator and never count as passed. So if 20 of 22 non-skipped tests pass,
score = 0.91.

Skipped tests don't penalize -- they indicate the bot couldn't test that feature
because a prerequisite failed or no opportunity arose during play. The game may
still be good; we just can't verify.
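
As a worked example of the formula, a minimal sketch follows. The `Status` type and `botScore` name are hypothetical, and returning 0 when every test was skipped is an assumption the spec does not cover.

```typescript
type Status = "pass" | "fail" | "skipped";

// Score = passed / (passed + failed); skipped tests are left out entirely.
function botScore(statuses: Status[]): number {
  const passed = statuses.filter((s) => s === "pass").length;
  const failed = statuses.filter((s) => s === "fail").length;
  const attempted = passed + failed;
  // Assumption: with nothing attempted, report 0 instead of dividing by zero.
  return attempted === 0 ? 0 : passed / attempted;
}
```

For 24 tests with 20 passed, 2 failed, and 2 skipped, this gives 20 / 22, which rounds to 0.91.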

## Files

- `types.ts` -- TypeScript interfaces for all data structures
- `calibrate.ts` -- Grid detection, control detection, start mechanism, survey
- `grid-reader.ts` -- Pixel sampling, grid state reading, piece detection
- `player.ts` -- AI player, placement execution, heuristic evaluation
- `tests.ts` -- Phase execution, test derivation, falling piece detector
- `index.ts` -- Playwright test entry point, HTTP server, report output

## Known Limitations

The bot does NOT test:
- Wall kicks (shifting a piece sideways or upward when a rotation would otherwise collide)
- Lock delay (brief window to slide before piece locks)
- T-spins
- Hold piece functionality
- Ghost piece (shadow showing where piece will land)
- Next piece preview accuracy (only checks if it exists)
- Level/speed progression exact values (only checks direction)
- DAS (delayed auto-shift for held keys)
- Piece randomizer fairness (bag system vs pure random)

The bot CAN be fooled by:
- Games that render pieces identically to UI chrome (rare)
- Games with unusual grid sizes (not 10x20)
- Games where the grid is not visible (3D Tetris, first-person Tetris)
- Games requiring mouse input for gameplay (not just start)
- Games with very fast initial drop speed (piece may lock before bot reads it)

## GPU Requirement

Without GPU access in the container, canvas `getImageData()` returns all zeros
in headless Chromium. The bot falls back to DOM-based grid reading for DOM-rendered
games, but canvas games will fail grid reading entirely.

To enable GPU in Proxmox LXC:
```
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.hook.autodev: sh -c "chmod 666 ${LXC_ROOTFS_MOUNT}/dev/dri/card0 ${LXC_ROOTFS_MOUNT}/dev/dri/renderD128 2>/dev/null || true"
```

The bot should detect GPU availability at startup and log a warning if
`/dev/dri/renderD128` is not accessible. Canvas-based tests will report
"grid reader unavailable (no GPU)" instead of false results.
