GAMEPLAY_BOT_SPEC.md - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

# Tetris Gameplay Bot Spec

## Purpose

A Playwright-based bot that can load any Tetris implementation, figure out how to interact with it, play the game, and report which game mechanics work and which don't. It must handle wildly different implementations: different DOM structures, canvas vs. DOM rendering, different control schemes, start buttons vs. auto-start, and so on.

## Architecture

Three phases: **Calibration**, **Play**, **Report**.

### Phase 1: Calibration

The bot loads the page and figures out how to interact with this specific implementation.

**1a. Start the game**

Try multiple start mechanisms in order, checking after each whether the game state changed:
1. Wait 3 seconds (some games auto-start)
2. Click the canvas or game container
3. Press Enter
4. Press Space
5. Look for a button with text matching `/start|play|begin|new game/i` and click it
6. Press any other key

After each attempt, take a screenshot and compare it to the previous one. If pixels changed, the game has started.
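
The start sequence can be sketched as a loop over candidate actions that stops at the first one that changes the screenshot. This is a minimal sketch under stated assumptions, not the bot's actual implementation: `findStartMechanism`, `playwrightStartSteps`, the step names, the 1-second click timeouts, clicking only a `<canvas>` (not a generic game container), and the choice of `KeyA` for "any key" are all hypothetical. A byte-for-byte screenshot comparison would also fire on an animated title screen, so a real bot may need a stricter check.

```js
// Hypothetical helper: run each start step in order and return the name of the
// first step after which the snapshot differs from the previous snapshot.
// `takeSnapshot` must resolve to a Node Buffer (page.screenshot() does).
async function findStartMechanism(steps, takeSnapshot) {
  let previous = await takeSnapshot();
  for (const step of steps) {
    try {
      await step.run();
    } catch {
      // A missing canvas or button is expected on some games; try the next step.
    }
    const current = await takeSnapshot();
    if (!current.equals(previous)) return step.name;
    previous = current;
  }
  return null; // nothing changed the screen
}

// Assumed wiring to a Playwright `page`, following the order of the list above.
function playwrightStartSteps(page) {
  const startButton = page.getByRole('button', { name: /start|play|begin|new game/i });
  return [
    { name: 'auto', run: () => page.waitForTimeout(3000) },
    { name: 'click', run: () => page.locator('canvas').first().click({ timeout: 1000 }) },
    { name: 'enter', run: () => page.keyboard.press('Enter') },
    { name: 'space', run: () => page.keyboard.press('Space') },
    { name: 'button', run: () => startButton.first().click({ timeout: 1000 }) },
    { name: 'anykey', run: () => page.keyboard.press('KeyA') },
  ];
}
```

Usage would look like `await findStartMechanism(playwrightStartSteps(page), () => page.screenshot())`; the returned name can feed the report's `start_mechanism` field.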

**1b. Locate the game grid**

The grid could be:
- A `<canvas>` element
- A grid of `<div>` or `<td>` elements
- An SVG

Detection strategy:
1. Check for a `<canvas>` element. If found, use the 2D context's `getImageData()` to read pixels.
2. If there is no canvas, look for a grid-like DOM structure (many sibling elements in a container with grid/flex layout, or a table).
3. As a last resort, take a screenshot and look for a rectangular region with a grid pattern.

Once found, determine:
- Grid pixel bounds (x, y, width, height)
- Cell size (width / 10, height / 20 for standard Tetris)
- One sample pixel per cell, used to build a 10x20 boolean matrix
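
The arithmetic behind the cell size and sample points is small enough to show directly. The sketch below assumes a standard 10-column by 20-row playfield and samples the center of each cell; `gridGeometry` is a hypothetical name, not part of the spec.

```js
// Hypothetical helper: derive cell size and per-cell sample points from the
// grid bounds. Sample points are the (floored) centers of each cell.
function gridGeometry(bounds, cols = 10, rows = 20) {
  const cellW = bounds.width / cols;
  const cellH = bounds.height / rows;
  const samplePoints = [];
  for (let row = 0; row < rows; row++) {
    const points = [];
    for (let col = 0; col < cols; col++) {
      points.push({
        x: Math.floor(bounds.x + col * cellW + cellW / 2),
        y: Math.floor(bounds.y + row * cellH + cellH / 2),
      });
    }
    samplePoints.push(points);
  }
  return { cellW, cellH, samplePoints };
}

const geometry = gridGeometry({ x: 0, y: 0, width: 300, height: 600 });
console.log(geometry.cellW, geometry.cellH); // 30 30
console.log(geometry.samplePoints[19][9]);   // { x: 285, y: 585 }
```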

**1c. Detect controls**

Default to standard controls: ArrowLeft, ArrowRight, ArrowDown, ArrowUp (rotate), Space (hard drop).

Verify by:
1. Reading the page text/HTML for control instructions (look for "arrow", "wasd", "z", "x", "space", "rotate", etc.)
2. Pressing ArrowLeft, taking a screenshot, and checking whether a piece moved. If not, try "a".
3. Pressing ArrowUp, taking a screenshot, and checking whether a piece rotated. If not, try "z" or "x".

Store the working key mappings.
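
The probing steps can be sketched as follows. This is an illustration only: `detectControls`, `DEFAULTS`, and `PROBES` are hypothetical names, and `press` and `readState` are callbacks the caller would supply (for example `page.keyboard.press` and a grid or screenshot reader). The sketch treats any state change as a response to the key, so a piece that auto-drops between the two reads would produce a false positive.

```js
// Defaults and fallbacks as listed in the spec. Only "left" and "rotate" have
// alternative keys to probe; the other actions keep their defaults.
const DEFAULTS = { left: 'ArrowLeft', right: 'ArrowRight', down: 'ArrowDown', rotate: 'ArrowUp', drop: 'Space' };
const PROBES = { left: ['ArrowLeft', 'a'], rotate: ['ArrowUp', 'z', 'x'] };

// Hypothetical helper: for each probed action, press candidates in order and
// keep the first key after which the observed state changes.
async function detectControls(press, readState) {
  const controls = { ...DEFAULTS };
  for (const [action, keys] of Object.entries(PROBES)) {
    for (const key of keys) {
      const before = JSON.stringify(await readState());
      await press(key);
      const after = JSON.stringify(await readState());
      if (before !== after) {
        controls[action] = key;
        break;
      }
    }
  }
  return controls; // unverified actions fall back to the defaults
}
```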

**1d. Locate score display**

Scan the page for elements containing the text "score" (case-insensitive), or for elements that contain only a number that changes during gameplay.
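
A rough sketch of that scan, working on element texts captured at two points in time. `findScoreCandidates` and `parseScore` are hypothetical helpers; how the texts are collected from the page is left to the caller.

```js
// Hypothetical helper: `before` and `after` hold the text contents of the same
// elements, in the same order, captured at two points during play.
// Returns the indices of elements that look like a score display.
function findScoreCandidates(before, after) {
  const numeric = /^\s*\d+\s*$/;
  const candidates = [];
  before.forEach((text, i) => {
    const later = after[i] ?? '';
    const labelled = /score/i.test(text);
    const numericAndChanged =
      numeric.test(text) && numeric.test(later) && text.trim() !== later.trim();
    if (labelled || numericAndChanged) candidates.push(i);
  });
  return candidates;
}

// Pull the first integer out of a score element's text, e.g. "Score: 400".
function parseScore(text) {
  const match = text.replace(/,/g, '').match(/\d+/);
  return match ? Number(match[0]) : null;
}
```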

### Phase 2: Play

A deterministic play session that exercises all game mechanics. The goal is not to play well but to test everything.

**2a. Test Suite (sequential, do not stop on failure)**

Each test captures before/after state and reports pass/fail independently.

| # | Test | Method | Pass condition |
|---|------|--------|----------------|
| 1 | Game loads | Page loads without console errors | No uncaught exceptions in first 3s |
| 2 | Game starts | Run calibration start sequence | Screenshot changes after start |
| 3 | Auto-drop | Wait 5s with no input after start | Grid state changes (piece fell) |
| 4 | Move left | Press left key | Grid state differs from before |
| 5 | Move right | Press right key | Grid state differs from before |
| 6 | Move down | Press down key | Grid state differs from before |
| 7 | Rotate | Press rotate key | Grid state differs, piece shape changed |
| 8 | Hard drop | Press hard drop key | Piece immediately at bottom, new piece appears |
| 9 | Piece locks | Wait for a piece to reach bottom via auto-drop (no input for ~15s) | Grid has filled cells at bottom that persist |
| 10 | New piece spawns | After piece locks, check top of grid | New piece appears at top |
| 11 | Multiple pieces | Play 10 pieces (hard drop each) | Grid accumulates filled cells |
| 12 | Line clear | Fill a complete row by strategic placement | At least one row disappears, cells above shift down |
| 13 | Score changes | Check score element before and after line clear | Score value increased |
| 14 | Game over | Stack pieces to the top rapidly | Game stops, some game-over indication |
| 15 | Playable for 30s | Play normally for 30 seconds | No crashes, console errors, or freezes |
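
One way to get "sequential, do not stop on failure" behavior is a runner that converts every exception and timeout into a failed result. The sketch below is an assumption about how this could be structured, not the bot's real runner: `runTests`, the `{ name, run, timeoutMs }` test shape, and the `{ pass, detail }` outcome shape are hypothetical. A timed-out test is only abandoned, not cancelled, so its work may continue in the background.

```js
// Hypothetical runner: executes tests one after another, never throws, and
// records a failure when a test throws, rejects, or exceeds its timeout.
async function runTests(tests, defaultTimeoutMs = 10000) {
  const results = [];
  for (const test of tests) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('timed out')), test.timeoutMs ?? defaultTimeoutMs);
    });
    try {
      // Wrapping in then() turns a synchronous throw into a rejection.
      const outcome = await Promise.race([Promise.resolve().then(() => test.run()), timeout]);
      results.push({ name: test.name, pass: Boolean(outcome.pass), detail: outcome.detail ?? '' });
    } catch (err) {
      results.push({ name: test.name, pass: false, detail: err instanceof Error ? err.message : String(err) });
    } finally {
      clearTimeout(timer);
    }
  }
  return results;
}
```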

**2b. Playing Strategy**

For tests that require actual gameplay (11, 12, 15), score each candidate board with a weighted sum of four heuristics:

```
score = -0.51 * aggregateHeight + 0.76 * completeLines - 0.36 * holes - 0.18 * bumpiness
```
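
The spec does not define the four terms, so the sketch below uses the definitions commonly paired with these weights, which is an assumption: aggregate height is the sum of column heights, complete lines is the number of full rows, holes are empty cells with a filled cell somewhere above them in the same column, and bumpiness is the sum of height differences between adjacent columns. `boardFeatures` and `scoreBoard` are hypothetical names.

```js
const WEIGHTS = { aggregateHeight: -0.51, completeLines: 0.76, holes: -0.36, bumpiness: -0.18 };

// grid[row][col] is true when filled; row 0 is the top of the playfield.
function boardFeatures(grid) {
  const rows = grid.length;
  const cols = grid[0].length;
  const heights = [];
  let holes = 0;
  for (let col = 0; col < cols; col++) {
    let top = rows; // row index of the highest filled cell; `rows` means empty column
    for (let row = 0; row < rows; row++) {
      if (grid[row][col]) {
        if (top === rows) top = row;
      } else if (top !== rows) {
        holes++; // empty cell below a filled one
      }
    }
    heights.push(rows - top);
  }
  const aggregateHeight = heights.reduce((a, b) => a + b, 0);
  let bumpiness = 0;
  for (let col = 1; col < cols; col++) bumpiness += Math.abs(heights[col] - heights[col - 1]);
  const completeLines = grid.filter((row) => row.every(Boolean)).length;
  return { aggregateHeight, completeLines, holes, bumpiness };
}

// Weighted sum from the formula above; higher is better.
function scoreBoard(grid) {
  const features = boardFeatures(grid);
  return Object.keys(WEIGHTS).reduce((sum, key) => sum + WEIGHTS[key] * features[key], 0);
}
```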

For each piece:
1. Read the current grid state (10x20 boolean matrix)
2. Read the current piece (detect it from the grid as the cells that are moving)
3. Try all (rotation, column) placements
4. Score each resulting board
5. Execute: rotate N times, move left/right, hard drop
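
Steps 3 and 4 amount to an exhaustive search over (rotation, column) pairs. The sketch below is one possible shape for it, with several assumptions: `dropPiece` and `bestPlacement` are hypothetical names, `grid` holds only settled cells (the moving piece has been removed), `rotations` is a caller-supplied list of shape matrices indexed by the number of rotate presses, `col` is the left edge of the shape, and `scoreBoard` is whatever board evaluation the caller passes in.

```js
// Drop `shape` (2D boolean matrix) straight down into `grid` with its left
// edge at `col`. Returns the resulting grid, or null if the piece cannot fit.
function dropPiece(grid, shape, col) {
  const rows = grid.length;
  const collides = (top) =>
    shape.some((line, r) =>
      line.some((cell, c) => cell && (top + r >= rows || grid[top + r][col + c])));
  if (col < 0 || col + shape[0].length > grid[0].length || collides(0)) return null;
  let top = 0;
  while (!collides(top + 1)) top++; // fall until the next row would collide
  const result = grid.map((line) => line.slice());
  shape.forEach((line, r) => line.forEach((cell, c) => {
    if (cell) result[top + r][col + c] = true;
  }));
  return result;
}

// Try every (rotation, column) pair and return the best-scoring placement,
// or null if no placement fits.
function bestPlacement(grid, rotations, scoreBoard) {
  let best = null;
  rotations.forEach((shape, rotation) => {
    for (let col = 0; col + shape[0].length <= grid[0].length; col++) {
      const board = dropPiece(grid, shape, col);
      if (!board) continue;
      const score = scoreBoard(board);
      if (!best || score > best.score) best = { rotation, col, score };
    }
  });
  return best;
}
```

Turning the chosen `col` into left/right key presses additionally requires knowing the piece's spawn column, which this sketch leaves out.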

If the bot can't read the grid reliably, fall back to blind inputs: cycle through left, right, rotate, down in a fixed pattern.

**2c. Grid Reading**

For canvas-based games:
```js
// Returns a 20x10 boolean matrix, or null if the canvas has no 2D context.
async function readGrid(page, bounds, cellW, cellH) {
  return page.evaluate(({ x, y, cellW, cellH }) => {
    const canvas = document.querySelector('canvas');
    // getContext('2d') returns null if the canvas already uses another context type (e.g. WebGL).
    const ctx = canvas && canvas.getContext('2d');
    if (!ctx) return null;
    const grid = [];
    for (let row = 0; row < 20; row++) {
      const rowData = [];
      for (let col = 0; col < 10; col++) {
        // Sample the center pixel of each cell, in canvas (not page) coordinates.
        const px = Math.floor(x + col * cellW + cellW / 2);
        const py = Math.floor(y + row * cellH + cellH / 2);
        const pixel = ctx.getImageData(px, py, 1, 1).data;
        // Consider a cell filled if it is brighter than the (assumed dark) background.
        const brightness = pixel[0] + pixel[1] + pixel[2];
        rowData.push(brightness > 100); // threshold
      }
      grid.push(rowData);
    }
    return grid;
  }, { x: bounds.x, y: bounds.y, cellW, cellH });
}
```

For DOM-based games:
```js
// Find cells by their grid position, check background color or class
```
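
The DOM case is left as a stub above. A minimal sketch of the background-color approach follows, with loud assumptions: `parseCssColor`, `colorsToGrid`, and `readDomGrid` are hypothetical names, `cellSelector` is a selector discovered during calibration that matches the cells in row-major order, and filled cells are assumed to differ from empty ones by computed `background-color` rather than by class.

```js
// Parse "rgb(r, g, b)" or "rgba(r, g, b, a)" as returned by getComputedStyle.
function parseCssColor(css) {
  const parts = (css.match(/[\d.]+/g) || []).map(Number);
  const [r = 0, g = 0, b = 0, a = 1] = parts;
  return { r, g, b, a };
}

// Build a boolean matrix from a flat, row-major list of cell colors.
// A cell counts as filled when it is not transparent and differs from `emptyColor`.
function colorsToGrid(colors, emptyColor, cols = 10) {
  const empty = parseCssColor(emptyColor);
  const grid = [];
  colors.forEach((css, i) => {
    const c = parseCssColor(css);
    const sameAsEmpty = c.r === empty.r && c.g === empty.g && c.b === empty.b;
    if (i % cols === 0) grid.push([]);
    grid[grid.length - 1].push(c.a > 0 && !sameAsEmpty);
  });
  return grid;
}

// Assumed Playwright wiring: read each cell's computed background color in the page.
async function readDomGrid(page, cellSelector, emptyColor) {
  const colors = await page.locator(cellSelector).evaluateAll(
    (cells) => cells.map((el) => getComputedStyle(el).backgroundColor));
  return colorsToGrid(colors, emptyColor);
}
```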

The background color threshold should be calibrated during Phase 1 by reading the empty grid.
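
One simple way to calibrate, sketched here as an assumption rather than the prescribed method, is to take the most frequent color among the sampled cells as the background and treat any cell far enough from it as filled. `calibrateBackground` and `isFilled` are hypothetical names, and the default tolerance of 60 is an arbitrary illustrative value.

```js
// Hypothetical calibration: the most frequent sampled color is the background.
// `pixels` is an array of [r, g, b] triples, one per sampled cell.
function calibrateBackground(pixels) {
  if (pixels.length === 0) throw new Error('no samples to calibrate from');
  const counts = new Map();
  for (const [r, g, b] of pixels) {
    const key = `${r},${g},${b}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let bestKey = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      bestKey = key;
      bestCount = count;
    }
  }
  return bestKey.split(',').map(Number);
}

// A cell is filled when the sum of absolute RGB differences from the
// background exceeds `tolerance`.
function isFilled(pixel, background, tolerance = 60) {
  let distance = 0;
  for (let i = 0; i < 3; i++) distance += Math.abs(pixel[i] - background[i]);
  return distance > tolerance;
}
```

Unlike a fixed brightness threshold, this also works for games with a light background, as long as most sampled cells are empty at calibration time.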

### Phase 3: Report

Output a JSON report:

```json
{
  "implementation": {
    "renderer": "canvas|dom|svg",
    "grid_detected": true,
    "grid_bounds": { "x": 0, "y": 0, "width": 300, "height": 600 },
    "controls": { "left": "ArrowLeft", "right": "ArrowRight", "rotate": "ArrowUp", "drop": "Space" },
    "start_mechanism": "button|auto|keypress",
    "score_element_found": true
  },
  "tests": [
    { "name": "game_loads", "pass": true, "detail": "no console errors" },
    { "name": "game_starts", "pass": true, "detail": "started via button click" },
    { "name": "auto_drop", "pass": false, "detail": "piece did not move in 5 seconds" },
    ...
  ],
  "summary": {
    "total": 15,
    "passed": 12,
    "failed": 3,
    "score": 0.80
  },
  "gameplay": {
    "pieces_placed": 47,
    "lines_cleared": 3,
    "max_score_observed": 400,
    "play_duration_seconds": 30,
    "errors_during_play": 0
  }
}
```
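
The `summary` object can be derived from the `tests` array. The sketch below assumes that `score` is the pass ratio rounded to two decimals, which is inferred from the example values (12 of 15 gives 0.80) and not stated by the spec; `summarize` is a hypothetical name.

```js
// Hypothetical helper: derive the "summary" object from the "tests" array.
function summarize(tests) {
  const total = tests.length;
  const passed = tests.filter((t) => t.pass).length;
  const score = total === 0 ? 0 : Math.round((passed / total) * 100) / 100;
  return { total, passed, failed: total - passed, score };
}
```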

## Error Handling

- NEVER crash on a single test failure. Each test is independent.
- If grid detection fails, skip grid-dependent tests but still test basic page load, console errors, and input response via screenshots.
- If a test times out (e.g., waiting for auto-drop), mark it as failed and move on.
- Capture all console errors throughout the session and include them in the report.
- If the game page itself fails to load, report all tests as failed with the error.
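
Two of these rules translate almost directly into code. The sketch below uses Playwright's `console` and `pageerror` page events; the helper names `collectPageErrors` and `failAll` and the shape of the collected entries are assumptions made for illustration.

```js
// Hypothetical helper: attach listeners once, before navigation, so errors are
// collected for the whole session. Returns the live array of collected errors.
function collectPageErrors(page) {
  const errors = [];
  page.on('console', (msg) => {
    if (msg.type() === 'error') errors.push({ source: 'console', text: msg.text() });
  });
  page.on('pageerror', (err) => {
    errors.push({ source: 'pageerror', text: err.message });
  });
  return errors;
}

// If the page never loads, report every test as failed with the same error.
function failAll(testNames, error) {
  return testNames.map((name) => ({
    name,
    pass: false,
    detail: `page failed to load: ${error.message}`,
  }));
}
```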

## File Structure

```
tasks/tetris/eval/
  gameplay-bot/
    index.ts          # Main entry point, orchestrates calibration + play + report
    calibrate.ts      # Phase 1: detect grid, controls, start mechanism
    grid-reader.ts    # Read grid state from canvas or DOM
    player.ts         # Phase 2: heuristic AI + move execution
    tests.ts          # Individual test implementations
    types.ts          # Shared types
  playwright.config.ts
```

## Dependencies

- `@playwright/test` (already in the project)
- No other dependencies. Pure Playwright + vanilla JS evaluation.

## Integration

The harness calls:
```bash
npx playwright test --config=tasks/tetris/eval/playwright.config.ts
```

The Playwright test:
1. Starts an HTTP server for the workspace (serving static files)
2. Runs the bot against the served game
3. Writes the JSON report to a specified output path
4. Exits with code 0 regardless of test results (the report contains pass/fail)

## Constraints

- Must work with canvas-based AND DOM-based Tetris implementations
- Must handle games that auto-start and games with start buttons
- Must handle different control schemes
- Must not depend on any specific DOM structure, class names, or IDs
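- Each test has a timeout (default 10 seconds per test, 30 seconds for the play test); tests that wait longer by design, such as the ~15-second auto-drop wait in test 9, need a correspondingly longer timeout
- Total bot runtime should be under 2 minutes per game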