# Gameplay Bot Master Spec

## Overview

The gameplay bot is a Playwright-based automated tester that evaluates Tetris games
built by AI agents. Every game is different: different DOM structures, different
controls, different start mechanisms, different rendering approaches (canvas, DOM,
SVG, WebGL). The bot must work with all of them.

The bot does NOT use Claude or any LLM for grading. All evaluation is deterministic
code. The bot plays the game using a known-good AI algorithm (4-heuristic genetic
optimization, reference implementation: LeeYiyuan/tetrisai, MIT License) and records what happens.

## Architecture

Two main components:
1. **Grid reader**: Reads the game state by sampling pixels (canvas) or DOM colors.
   Produces a 10x20 boolean grid. Works regardless of rendering approach.
2. **AI player**: Given a grid state and current piece type, computes the optimal
   placement using aggregate height, lines cleared, holes, and bumpiness heuristics.

The bot requires GPU access for reliable pixel readback from canvas. Without GPU,
headless Chromium's canvas `getImageData()` returns all zeros. The host must expose
`/dev/dri/card0` and `/dev/dri/renderD128` to the container.

## Test Structure: Conditional Phases

Tests are organized as conditional phases. Each phase depends on the previous one
succeeding. If a phase fails, all downstream tests are marked "skipped: [phase] failed"
instead of producing false positives.

### Pre-test Survey (not scored, just data collection)

Before any tests run, survey the page:
- Is there a visible element with tetris grid proportions (~2:1 height:width)?
- Is there a full-screen overlay (high z-index element covering >80% of viewport)?
- Are there clickable elements (buttons, links, divs with click handlers)?
- Is there a canvas element? Multiple canvases?
- Is there a DOM-based grid (table, grid of divs)?
- What text is visible? ("Press Enter", "Start", "Play", etc.)

Store all survey data. This informs the start mechanism detection but is not a test.

### Phase 1: Page Load (test 1)

**Test: `game_loads`**
- Navigate to the game URL
- Wait 3 seconds for scripts to execute
- Check for console errors
- Pass: page loaded, no critical errors
- Fail: page failed to load or has critical JS errors

### Phase 2: Game Start Detection (tests 2-3)

This phase determines if and how the game starts. It uses a cascading strategy:

**Step 2a: Auto-start check**
- Take 10 screenshots at 100ms intervals (1 second total)
- Look for a colored cluster of pixels (~4 cells, roughly square-ish bounding box)
  that moved downward between frames
- This is the "falling piece detector"
- If found: game auto-starts, no button needed
- Store: `start_mechanism: "auto"`

**Step 2b: Button/overlay detection (if 2a failed)**
- Check for full-screen overlay (element covering >80% viewport with high z-index)
- If overlay found:
  - Try pressing Enter
  - Wait 500ms, run falling piece detector (10 screenshots, 100ms each)
  - If piece found: `start_mechanism: "enter"`, overlay was a start screen
  - If not, try pressing Space, same check
  - If not, try clicking the overlay center
  - If not, try clicking any visible buttons in the overlay
- If no overlay:
  - Find all clickable elements (buttons, elements with onclick, role="button")
  - Try clicking each one, run falling piece detector after each
  - If piece found: `start_mechanism: "button"`, remember which element worked
- The start button might change state or disappear after clicking -- that's fine

**Step 2c: Keyboard fallback (if 2b failed)**
- Try pressing: Enter, Space, ArrowDown, Z, P, any key
- After each, run falling piece detector
- If piece found: store which key started the game

**Step 2d: Canvas click (if 2c failed)**
- Click the center of the canvas/grid element
- Run falling piece detector
- Some games render their start button on the canvas (no DOM element to find)

**Falling piece detector algorithm:**
- Take 10 screenshots at 100ms intervals
- For each consecutive pair, diff the pixels
- Look for a cluster of changed pixels that:
  - Is roughly rectangular (bounding box aspect ratio between 1:1 and 4:1)
  - Is in the upper portion of the game area
  - Moved downward between frames (centroid Y increased)
- A "cluster" = contiguous region of non-background-color pixels that appeared
- Size: roughly 1-4 cells worth of pixels (each cell is typically 15-40px)
- The piece may have rounded corners, glow effects, shadows -- look at bounding box
- Must see movement in at least 2 frame pairs to confirm it's falling

**Tests derived from Phase 2:**
- `game_starts`: pass if falling piece detected by any mechanism
- `auto_drop`: pass if piece falls on its own without any key input (only valid if
  game auto-started or we confirmed start mechanism worked)

### Phase 3: Mechanics Tests (tests 4-9)

Only runs if Phase 2 succeeded (game started, piece detected).

Reload the page. Start the game using the mechanism discovered in Phase 2.
Wait for a piece to appear (falling piece detector).

For each control test:
1. Read the grid state (grid reader)
2. Press the key
3. Wait 60ms
4. Read the grid state again
5. Compare: did the relevant change happen?

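
The comparison in step 5 can be sketched as a pure function over two grid reads. This is a minimal sketch, not the bot's actual code: the names `Grid`, `Control`, `boundingBox`, and `controlWorked` are hypothetical, and it assumes the board holds only the active piece (a fresh game with nothing locked yet), so the bounding box of all filled cells is the piece's bounding box.

```typescript
// Assumption: grid[row][col], row 0 at the top, true = filled, and the board
// contains only the active piece (nothing has locked yet).
type Grid = boolean[][];
type Control = "move_left" | "move_right" | "move_down" | "rotate";

interface Box { minRow: number; maxRow: number; minCol: number; maxCol: number }

// Bounding box of all filled cells, or null for an empty grid.
function boundingBox(grid: Grid): Box | null {
  let box: Box | null = null;
  for (let r = 0; r < grid.length; r++) {
    for (let c = 0; c < grid[r].length; c++) {
      if (!grid[r][c]) continue;
      if (box === null) {
        box = { minRow: r, maxRow: r, minCol: c, maxCol: c };
      } else {
        box.minRow = Math.min(box.minRow, r);
        box.maxRow = Math.max(box.maxRow, r);
        box.minCol = Math.min(box.minCol, c);
        box.maxCol = Math.max(box.maxCol, c);
      }
    }
  }
  return box;
}

// Returns true if the grid change between the two reads matches the control.
function controlWorked(control: Control, before: Grid, after: Grid): boolean {
  const a = boundingBox(before);
  const b = boundingBox(after);
  if (a === null || b === null) return false; // no piece visible in one read
  switch (control) {
    case "move_left":
      return b.minCol < a.minCol; // piece column decreased
    case "move_right":
      return b.minCol > a.minCol; // piece column increased
    case "move_down":
      return b.minRow > a.minRow; // piece row increased
    case "rotate": {
      const wa = a.maxCol - a.minCol + 1;
      const ha = a.maxRow - a.minRow + 1;
      const wb = b.maxCol - b.minCol + 1;
      const hb = b.maxRow - b.minRow + 1;
      // Width and height swapped; a square box (O piece) cannot show this.
      return wa !== ha && wb === ha && hb === wa;
    }
  }
}
```

Because gravity can move the piece down by a row during the 60ms wait, the left and right checks look only at columns, and the rotate check looks only at box dimensions.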

- **Test: `move_left`** -- ArrowLeft, piece column decreased
- **Test: `move_right`** -- ArrowRight, piece column increased
- **Test: `move_down`** -- ArrowDown, piece row increased (soft drop)
- **Test: `rotate`** -- ArrowUp (or Z), piece shape changed (bounding box dimensions swapped for non-O pieces)
- **Test: `hard_drop`** -- Space, piece instantly at bottom (filled cells appear in bottom rows)
- **Test: `all_pieces_rotate`** -- Track piece types seen during play, confirm rotation works for non-O pieces

If the grid reader cannot read the grid (no GPU, bad calibration), fall back to
screenshot comparison for these tests. Mark as "(screenshot-verified)" not
"(grid-verified)" in the detail string.

### Phase 4: Piece Lifecycle Tests (tests 10-12)

Only runs if Phase 3 mechanics worked.

Continue from Phase 3 state or reload + start.

**Test: `piece_locks`**
- Hard drop a piece
- Wait 300ms
- Read grid: are there filled cells at the bottom that persist?
- Must see cells that don't move for 2 consecutive reads 500ms apart

**Test: `new_piece_spawns`**
- After a piece locks, check top 4 rows of grid
- A new piece should appear (filled cells in top rows)
- Track: `piecesSpawned` counter

**Test: `multiple_pieces`**
- Play 10+ pieces (hard drop repeatedly)
- Must detect at least 3 distinct piece placements
- Track piece types seen (I, O, T, S, Z, J, L)

### Phase 5: Gameplay Tests (tests 13-14)

Only runs if Phase 4 piece lifecycle works.

Reload the page. Start game. Play using the AI player for an extended session:
- **60 pieces max, 45 seconds max**
- **60ms polling** between grid reads
- Read score element on every 5th poll cycle (integrated score tracking)

**Test: `line_clear`**
- During AI play, watch for complete rows (all cells filled)
- After complete row detected, wait 200-500ms for clear animation
- Read grid again: did the complete row disappear?
- If AI play doesn't clear a line, try brute force: drop pieces at each column
- If brute force fails, check if total filled cells decreased (indirect clear detection)
- Pass: at least 1 line cleared (grid-verified)

**Test: `score_changes`**
- Read score element before play begins (record initial value)
- During play, read score on every 5th poll cycle
- After play, read final score
- Pass: score increased from initial value
- If no score element found, try scanning page text for changing numbers
- Record: all score values observed, deltas between readings

### Phase 6: Game Over Test (test 15)

Only runs if Phase 5 gameplay works (pieces can be placed and lines can clear,
or at minimum pieces can be placed).

Reload the page. Start game.

**Test: `game_over`**
- Stack pieces to trigger game over: hard drop in the same column repeatedly
  to build a tall tower
- After each drop, check grid reader: are there filled cells in the top 2 rows?
- Once top rows are filled, check:
  1. Does the game stop accepting input? (press keys, check if grid changes)
  2. Does "Game Over" or similar text appear in DOM?
  3. Does the page become static? (2 screenshots 1s apart are identical)
- Pass: game stopped after filling to top (at least 1 of the 3 checks)
- Do NOT use screenshot comparison alone (false positives on static start screens)
- Must have evidence that pieces WERE being placed before the game stopped

### Phase 7: Endurance Test (test 16)

Only runs if Phase 5 gameplay works.

Reload the page. Start game.

**Test: `playable_30s`**
- Play using AI player for 30 seconds
- Track: pieces placed, lines cleared, console errors, play errors
- Pass: played for 30+ seconds without crashing, placed 5+ pieces, no critical errors

### Phase 8: Competitive Play (not pass/fail, produces metrics + 8 additional tests)

Only runs if Phase 5 gameplay works.

Reload the page. Start game. Play competitively for 60 seconds using AI player.

**Purpose**: Find bugs that the basic tests miss. A game might start, move pieces,
and clear single lines, but fail on multi-line clears, score scaling, level
progression, etc.

**Data recorded:**
```json
{
  "duration_seconds": 45,
  "pieces_placed": 62,
  "total_lines_cleared": 18,
  "single_clears": 12,
  "double_clears": 2,
  "triple_clears": 1,
  "tetris_clears": 0,
  "max_combo": 3,
  "score_readings": [0, 100, 200, 500, ...],
  "score_final": 4200,
  "score_increases": [100, 100, 300, ...],
  "level_readings": [1, 1, 1, 2, 2, 3],
  "level_final": 3,
  "game_over_reached": true,
  "game_over_text_found": "Game Over",
  "restart_available": true,
  "next_piece_visible": true,
  "speed_increased": true,
  "bugs_detected": ["score_does_not_scale_with_simultaneous_clears"]
}
```

**8 additional tests (tests 17-24):**

Each has three outcomes: pass (tested, works), fail (tested, broken), skip (no opportunity to test).

**Test 17: `multi_line_clear`**
- During play, detect when 2+ rows are complete simultaneously
- Wait for clear animation (200-500ms)
- Check: did all complete rows disappear?
- Bug: `multi_line_clear_only_removes_one_row` -- only 1 row cleared when 2+ were complete
- Skip: if no multi-line clear opportunity occurred during 60s play

**Test 18: `score_scaling`**
- Track score delta for each clear event
- Compare: single clear delta vs multi-line clear delta
- Bug: `score_does_not_scale_with_simultaneous_clears` -- multi-line gives same points as single
- Standard Tetris scoring: single=100, double=300, triple=500, tetris=800 (x level)
- Don't enforce exact formula, just check that multi > single
- Skip: if no multi-line clear occurred

**Test 19: `level_progression`**
- Track lines cleared and level display throughout the session
- After 10+ lines cleared, level should have increased from initial value
- Bug: `level_does_not_increase`
- Skip: if fewer than 10 lines cleared

**Test 20: `speed_progression`**
- Measure auto-drop interval at game start (time between automatic downward moves)
- After level increases, measure again
- Bug: `speed_does_not_increase` -- interval didn't decrease
- Skip: if level didn't increase

**Test 21: `next_piece_preview`**
- Look for a secondary display area showing the next piece
- Check: small canvas/div near the main grid with a single piece shape
- Pass: found a next piece display
- Fail: no next piece preview found

**Test 22: `game_over_display`**
- When game over is triggered (from Phase 6 or competitive play):
  - Check for "Game Over" or similar text
  - Check for restart button/prompt
- Pass: both message and restart option present
- Fail: missing either
- Skip: game over not reached during competitive play

**Test 23: `counter_clockwise_rotation`**
- During play, occasionally press Z key instead of Up arrow
- Compare piece shape before and after
- Pass: Z rotates opposite direction from Up
- Fail: Z does same as Up, or doesn't rotate
- Skip: could not detect rotation direction

**Test 24: `soft_drop_distinct`**
- Press Down arrow: piece should move one row
- Press Space: piece should drop to bottom
- Compare: Down should NOT behave like Space
- Bug: `soft_drop_acts_as_hard_drop`
- Pass: Down moves 1 row, Space drops to bottom
- Fail: Down drops to bottom like Space

## Polling and Timing

- **Grid polling**: 60ms between reads during play
- **Post-lock wait**: 100ms after piece locks before reading settled state
- **Falling piece detector**: 10 screenshots at 100ms intervals
- **Post-trigger settling**: 500ms after pressing start key/button before detection
- **Score reading**: every 5th poll cycle during play (every 300ms)
- **Clear animation wait**: 200-500ms between detecting complete row and checking if it cleared

## Grid Reader

The grid reader samples pixels to determine cell state (filled/empty).

**Canvas games**: Use `getImageData()` to sample 5 points per cell (center + corners).
Requires GPU access for reliable readback.

**DOM games**: Read `backgroundColor` or `style.background` of each cell element.

**Grid detection**: Find the game grid by looking for a rectangular region with
approximately 2:1 height:width ratio containing a 10-column x 20-row grid of
uniformly-sized cells.

**Validation**: Reject grids where >60% of cells are filled (likely reading UI chrome,
not game state). Validate aspect ratio (height ~= 2 * width).

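
The validation rules above can be sketched as a single check run after each calibration. This is a minimal sketch under stated assumptions: `GridBounds`, `GridCheck`, and `validateGrid` are hypothetical names, and the `ratioTolerance` default of 0.2 is an illustrative choice, since the spec only says the height should be approximately twice the width.

```typescript
interface GridBounds { x: number; y: number; width: number; height: number }
interface GridCheck { ok: boolean; reason?: string }

// Hypothetical validator for a freshly read grid.
// ratioTolerance is an assumed value; the spec gives no exact tolerance.
function validateGrid(
  cells: boolean[][],
  bounds: GridBounds,
  ratioTolerance = 0.2,
): GridCheck {
  // The reader is expected to produce 20 rows of 10 columns.
  if (cells.length !== 20 || cells.some((row) => row.length !== 10)) {
    return { ok: false, reason: "expected 20 rows x 10 columns" };
  }

  // Reject grids where more than 60% of cells are filled.
  let filled = 0;
  for (const row of cells) {
    for (const cell of row) {
      if (cell) filled++;
    }
  }
  if (filled / 200 > 0.6) {
    return { ok: false, reason: "more than 60% of cells filled" };
  }

  // Height should be roughly twice the width. The negated comparison also
  // rejects NaN and Infinity from zero-sized bounds.
  const ratio = bounds.height / bounds.width;
  if (!(Math.abs(ratio - 2) <= ratioTolerance)) {
    return { ok: false, reason: "aspect ratio is not ~2:1" };
  }
  return { ok: true };
}
```

Returning a reason string alongside the verdict makes it straightforward to surface why calibration was rejected in a test's detail string.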

## AI Player

Four-heuristic board evaluation with weights tuned by a genetic algorithm, following the reference implementation LeeYiyuan/tetrisai (MIT License):
- Aggregate height: sum of column heights (weight: -0.510066)
- Lines cleared: number of complete rows (weight: 0.760666)
- Holes: empty cells below filled cells (weight: -0.35663)
- Bumpiness: sum of absolute height differences between adjacent columns (weight: -0.184483)

For each possible (rotation, column) placement:
1. Simulate piece drop
2. Score the resulting board
3. Pick the highest-scoring placement

Execute placement: rotate N times, move to target column, hard drop.

## Report Structure

```json
{
  "implementation": {
    "renderer": "canvas|dom|svg|webgl|unknown",
    "grid_detected": true,
    "grid_bounds": { "x": 0, "y": 0, "width": 320, "height": 640 },
    "controls": { "left": "ArrowLeft", "right": "ArrowRight", ... },
    "start_mechanism": "auto|enter|space|button|click_canvas|unknown",
    "score_element_found": true,
    "grid_confidence": 0.95,
    "survey": {
      "has_overlay": false,
      "has_canvas": true,
      "has_dom_grid": false,
      "visible_text": ["TETRIS", "Score: 0", "Level: 1"],
      "clickable_elements": 2
    }
  },
  "tests": [
    { "name": "game_loads", "pass": true, "detail": "..." },
    { "name": "game_starts", "pass": true, "detail": "started via enter" },
    ...
  ],
  "summary": {
    "total": 24,
    "passed": 20,
    "failed": 2,
    "skipped": 2,
    "score": 0.91
  },
  "gameplay": {
    "pieces_placed": 45,
    "lines_cleared": 12,
    "max_score_observed": 4200,
    "play_duration_seconds": 30,
    "errors_during_play": 0
  },
  "competitive_play": {
    "duration_seconds": 55,
    "pieces_placed": 62,
    "total_lines_cleared": 18,
    "single_clears": 12,
    "double_clears": 2,
    "triple_clears": 1,
    "tetris_clears": 0,
    "score_readings": [0, 100, 200, ...],
    "score_final": 4200,
    "level_final": 3,
    "bugs_detected": []
  },
  "session": {
    "frames": 500,
    "pieces_spawned": 45,
    "pieces_locked": 44,
    "lines_cleared": 12,
    "piece_types_seen": ["I", "O", "T", "S", "Z", "J", "L"],
    "grid_read_success_rate": 0.98
  },
  "performance": {
    "load_time_ms": 150
  },
  "accessibility": {
    "issues": ["canvas without aria-label"],
    "issue_count": 1,
    "pass": false
  }
}
```

## Score Calculation

The bot score is: `passed / total` (excluding skipped tests from both numerator
and denominator). So if 20/22 non-skipped tests pass, score = 0.91.

Skipped tests don't penalize -- they indicate the bot couldn't test that feature
because a prerequisite failed. The game may still be good; we just can't verify.
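
As a worked example of the formula above, here is a minimal sketch of the calculation. The `Outcome` type and `botScore` function are hypothetical names, and returning 0 when every test was skipped is an assumption, since the spec does not define that case.

```typescript
type Outcome = "pass" | "fail" | "skip";

// Score = passed / (passed + failed); skipped tests are left out entirely.
// Assumption: a run where nothing was testable scores 0.
function botScore(outcomes: Outcome[]): number {
  const passed = outcomes.filter((o) => o === "pass").length;
  const failed = outcomes.filter((o) => o === "fail").length;
  const counted = passed + failed;
  return counted === 0 ? 0 : passed / counted;
}

// Build a 24-test run: 20 passed, 2 failed, 2 skipped.
const example: Outcome[] = [];
for (let i = 0; i < 20; i++) example.push("pass");
example.push("fail", "fail", "skip", "skip");

console.log(botScore(example).toFixed(2)); // prints "0.91" (20 / 22)
```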

## Files

- `types.ts` -- TypeScript interfaces for all data structures
- `calibrate.ts` -- Grid detection, control detection, start mechanism, survey
- `grid-reader.ts` -- Pixel sampling, grid state reading, piece detection
- `player.ts` -- AI player, placement execution, heuristic evaluation
- `tests.ts` -- Phase execution, test derivation, falling piece detector
- `index.ts` -- Playwright test entry point, HTTP server, report output

## Known Limitations

The bot does NOT test:
- Wall kicks (nudging a piece to a nearby free position when a rotation would otherwise collide)
- Lock delay (brief window to slide before piece locks)
- T-spins
- Hold piece functionality
- Ghost piece (shadow showing where piece will land)
- Next piece preview accuracy (only checks if it exists)
- Level/speed progression exact values (only checks direction)
- DAS (delayed auto-shift for held keys)
- Piece randomizer fairness (bag system vs pure random)

The bot CAN be fooled by:
- Games that render pieces identically to UI chrome (rare)
- Games with unusual grid sizes (not 10x20)
- Games where the grid is not visible (3D Tetris, first-person Tetris)
- Games requiring mouse input for gameplay (not just start)
- Games with very fast initial drop speed (piece may lock before bot reads it)

## GPU Requirement

Without GPU access in the container, canvas `getImageData()` returns all zeros
in headless Chromium. The bot falls back to DOM-based grid reading for DOM-rendered
games, but canvas games will fail grid reading entirely.

To enable GPU in Proxmox LXC:
```
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.hook.autodev: sh -c "chmod 666 ${LXC_ROOTFS_MOUNT}/dev/dri/card0 ${LXC_ROOTFS_MOUNT}/dev/dri/renderD128 2>/dev/null || true"
```

The bot should detect GPU availability at startup and log a warning if
`/dev/dri/renderD128` is not accessible. Canvas-based tests will report
"grid reader unavailable (no GPU)" instead of false results.
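
The startup check could look something like the following. This is a minimal sketch, assuming the bot runs under Node.js and that read/write access to the render node is an adequate proxy for GPU availability; the function name `gpuWarning` and the exact warning text are hypothetical.

```typescript
import { accessSync, constants } from "node:fs";

// Returns a warning message when the render node is not accessible, or null
// when it is. Assumption: R_OK | W_OK on the device node is a good enough
// proxy for "Chromium can use the GPU".
function gpuWarning(renderNode = "/dev/dri/renderD128"): string | null {
  try {
    accessSync(renderNode, constants.R_OK | constants.W_OK);
    return null;
  } catch {
    return `WARNING: ${renderNode} is not accessible; grid reader unavailable (no GPU)`;
  }
}

// At startup: log once, then let canvas-based tests report the condition.
const warning = gpuWarning();
if (warning !== null) {
  console.warn(warning);
}
```

Returning the message rather than logging inside the function keeps the check easy to reuse when building the detail string for canvas-based tests.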