Add two-tier architecture refactor spec for gameplay bot - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit 67bd49c6e259f78aade3caeae40c3418dedf8071
parent 7fbe88ce2a1febb0954305d10f4e1878570e0f14
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Thu,  9 Apr 2026 12:56:29 +0200

Add two-tier architecture refactor spec for gameplay bot

Driver (webpage abstraction) + Bot (game logic) separation.
17-method TetrisDriver interface, 4-commit incremental migration plan,
~2740 lines (down from 3500). Bot never imports Playwright.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
A tasks/tetris/eval/gameplay-bot/REFACTOR_SPEC.md  | 877 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 877 insertions(+), 0 deletions(-)
diff --git a/tasks/tetris/eval/gameplay-bot/REFACTOR_SPEC.md b/tasks/tetris/eval/gameplay-bot/REFACTOR_SPEC.md
@@ -0,0 +1,877 @@
+# Two-Tier Refactor Spec: Driver + Bot
+
+## Problem Statement
+
+The gameplay bot is ~3500 lines across 6 files, with two distinct concerns tangled
+together: understanding the webpage (finding grids, clicking buttons, reading pixels,
+sending keystrokes) and playing Tetris (phase orchestration, AI decisions, test
+derivation, bug detection). The boundary between them is blurred:
+
+- `calibrate.ts` handles grid detection, start mechanism detection, control detection,
+  overlay detection, interactivity verification, screenshot sampling, visual change
+  detection, and page surveying -- all in one file of more than 1300 lines.
+- `tests.ts` does phase orchestration, BUT ALSO calls `readGrid` directly during
+  mechanics tests, reads score elements, detects game over text, measures drop
+  intervals, detects next piece previews, and reads level displays.
+- `player.ts` calls both `readGrid` and `page.keyboard.press` directly, coupling
+  AI logic to the Playwright API.
+- `grid-reader.ts` is the cleanest module but still exports low-level grid analysis
+  utilities (bounding boxes, cell counts, piece identification) that the bot calls
+  directly instead of going through an abstraction.
+
+The result: any change to how the page is read ripples through all files. You cannot
+test the AI player without a live Playwright page. You cannot swap the grid reader
+without touching the test orchestrator.
+
+## Proposed Architecture
+
+```
+                 +------------------+
+                 |    index.ts      |  Entry point: HTTP server, Playwright test,
+                 |                  |  report output. Unchanged.
+                 +--------+---------+
+                          |
+                          v
+                 +------------------+
+                 |     bot.ts       |  Layer 2: "The Brain"
+                 |                  |  Phase orchestration, AI decisions, test
+                 |                  |  derivation, competitive play, bug detection.
+                 |                  |  Calls only the Driver interface.
+                 +--------+---------+
+                          |
+                          v
+                 +------------------+
+                 |    driver.ts     |  Layer 1: "The Eyes and Hands"
+                 |                  |  Abstracts the webpage. Exposes a clean API.
+                 |                  |  Handles grid reading, start detection,
+                 |                  |  control detection, keyboard input.
+                 +--------+---------+
+                          |
+                +---------+---------+
+                |                   |
+                v                   v
+        +-----------+       +------------+
+        | types.ts  |       | player.ts  |  Pure Tetris logic: AI heuristics,
+        |           |       |            |  board simulation, placement finding.
+        +-----------+       |            |  NO Playwright imports. NO page access.
+                            +------------+
+```
+
+### What goes where
+
+**driver.ts** -- "I can see and interact with this webpage"
+- Grid detection (finding the grid on the page)
+- Grid reading (10x20 boolean matrix from canvas/DOM/SVG)
+- Start mechanism detection (the 5-phase cascade)
+- Control detection (which keys the game responds to)
+- Score/level/lines reading
+- Keyboard input (move, rotate, drop)
+- Screenshot capture
+- Interactivity verification
+- Page surveying (pre-test data collection)
+- Background color sampling
+- Visual change detection
+- Next piece preview detection
+- Game over text detection
+- Re-calibration
+
+**bot.ts** -- "I know Tetris rules and test logic"
+- Phase orchestration (the 8 conditional phases)
+- Test derivation from session data (the 24 tests)
+- Score/timing/event tracking (GameSession bookkeeping)
+- Competitive play with bug detection
+- Line clear detection logic (watching grid state transitions)
+- Game over triggering strategy (stack pieces to fill grid)
+- Endurance testing
+- Report assembly (BotReport construction)
+
+**player.ts** -- "I know where to put pieces" (pure computation, no I/O)
+- 4-heuristic scoring (aggregate height, lines, holes, bumpiness)
+- Piece definitions (rotations, dimensions)
+- Board simulation (drop piece, clear lines)
+- Best placement finding
+- No `Page` import, no `readGrid` call, no `keyboard.press`
+
+**types.ts** -- unchanged, all interfaces stay
+
+**grid-reader.ts** -- absorbed into driver.ts (see migration plan)
+
+**index.ts** -- unchanged except it calls bot.ts instead of tests.ts
+
+---
+
+## Driver Interface
+
+```typescript
+import type { Page } from "@playwright/test";
+import type {
+  Grid,
+  GridBounds,
+  RendererType,
+  Controls,
+  StartMechanism,
+  SurveyData,
+  PieceType,
+} from "./types";
+
+// ---------------------------------------------------------------------------
+// Configuration returned by calibration, passed through subsequent calls.
+// Replaces CalibrationResult for internal use within the Driver.
+// ---------------------------------------------------------------------------
+
+export interface DriverCalibration {
+  renderer: RendererType;
+  gridDetected: boolean;
+  gridBounds: GridBounds | null;
+  cellWidth: number;
+  cellHeight: number;
+  controls: Controls;
+  startMechanism: StartMechanism;
+  scoreElementSelector: string | null;
+  levelElementSelector: string | null;
+  backgroundColor: [number, number, number] | null;
+  consoleErrors: string[];
+  gridConfidence: number;
+  startButton?: {
+    selector: string;
+    text: string;
+    disappeared: boolean;
+    position: { x: number; y: number };
+  };
+}
+
+// ---------------------------------------------------------------------------
+// Grid snapshot: the grid state plus derived information the bot needs.
+// ---------------------------------------------------------------------------
+
+export interface GridSnapshot {
+  /** The 10x20 boolean grid. null if reading failed. */
+  grid: Grid | null;
+  /** Total filled cells. 0 if grid is null. */
+  filledCount: number;
+  /** Filled cells in the bottom N rows. */
+  filledInBottom(rows: number): number;
+  /** Whether any cell in the top N rows is filled. */
+  hasFilledInTop(rows: number): boolean;
+  /** Number of fully complete rows. */
+  completeRows: number;
+  /** Active piece cells (diff against settled grid). null if undetectable. */
+  activePieceCells: [number, number][] | null;
+  /** Identified piece type from active piece cells. null if no active piece. */
+  activePieceType: PieceType | null;
+}
+
+// ---------------------------------------------------------------------------
+// The Driver interface. This is what the Bot sees.
+// ---------------------------------------------------------------------------
+
+export interface TetrisDriver {
+  // -- Lifecycle --
+
+  /**
+   * Navigate to the game URL, wait for load, begin console error collection.
+   * Returns `loaded: false` (with the reason in `detail`) if the page failed to load.
+   */
+  loadPage(url: string): Promise<{ loaded: boolean; detail: string; errorsOnLoad: number }>;
+
+  /**
+   * Survey the page structure before any interaction.
+   * Returns information about overlays, canvas elements, DOM grids, visible text.
+   */
+  surveyPage(): Promise<SurveyData>;
+
+  /**
+   * Run full calibration: grid detection, start mechanism detection,
+   * control detection, score element detection, grid confidence measurement.
+   * Includes re-calibration fallback if initial detection fails.
+   * Never throws.
+   */
+  calibrate(): Promise<DriverCalibration>;
+
+  /**
+   * Re-run calibration after the game state may have changed
+   * (e.g., after starting, grid might appear that wasn't there before).
+   * Keeps the current calibration if re-calibration finds nothing better.
+   */
+  recalibrate(): Promise<DriverCalibration>;
+
+  /**
+   * Get the current calibration. Throws if calibrate() hasn't been called.
+   */
+  getCalibration(): DriverCalibration;
+
+  // -- Grid Reading --
+
+  /**
+   * Read the current grid state. Returns a GridSnapshot with the raw grid
+   * and derived metrics. If settled grid is provided, active piece detection
+   * is diffed against it.
+   *
+   * Returns a snapshot with grid: null if reading fails.
+   */
+  readGrid(settledGrid?: Grid | null): Promise<GridSnapshot>;
+
+  /**
+   * Compare two grids. Returns true if they differ.
+   */
+  gridsAreDifferent(a: Grid | null, b: Grid | null): boolean;
+
+  // -- Input --
+
+  /**
+   * Press a game control key. Uses the controls detected during calibration.
+   */
+  pressKey(action: "left" | "right" | "down" | "rotate" | "drop"): Promise<void>;
+
+  /**
+   * Press an arbitrary key (for testing CCW rotation with 'z', etc.).
+   */
+  pressRawKey(key: string): Promise<void>;
+
+  /**
+   * Wait for a specified duration (milliseconds).
+   */
+  wait(ms: number): Promise<void>;
+
+  // -- Score/Level/Lines Reading --
+
+  /**
+   * Read the current score from the detected score element.
+   * Returns null if no score element was found or reading fails.
+   */
+  readScore(): Promise<number | null>;
+
+  /**
+   * Read the current level from the page.
+   * Returns null if no level display found or reading fails.
+   */
+  readLevel(): Promise<number | null>;
+
+  // -- Page State Queries --
+
+  /**
+   * Check if "Game Over" (or equivalent) text is visible on the page.
+   * Returns the matched text, or null if not found.
+   */
+  detectGameOverText(): Promise<string | null>;
+
+  /**
+   * Check if a restart button/prompt is visible.
+   */
+  detectRestartOption(): Promise<boolean>;
+
+  /**
+   * Check if a next piece preview display exists.
+   */
+  detectNextPiecePreview(): Promise<boolean>;
+
+  /**
+   * Get all console errors collected since loadPage() was called.
+   */
+  getConsoleErrors(): string[];
+
+  // -- Screenshots --
+
+  /**
+   * Take a screenshot. Returns raw PNG buffer.
+   */
+  screenshot(): Promise<Buffer>;
+
+  // -- Timing --
+
+  /**
+   * Measure the auto-drop interval (time between gravity-driven grid changes
+   * with no input). Returns average interval in ms, or 0 if unmeasurable.
+   */
+  measureDropInterval(): Promise<number>;
+}
+```
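The derived fields of `GridSnapshot` can be computed from the raw grid alone. The following is a minimal sketch, assuming `Grid` is a row-major `boolean[][]` with row 0 at the top (the spec does not show the `Grid` definition, so that layout is an assumption). `makeSnapshot` and `BasicGridSnapshot` are hypothetical names, and the active-piece fields are left out.

```typescript
// Assumption: Grid is row-major boolean[][], with row 0 at the top of the playfield.
type Grid = boolean[][];

// Subset of GridSnapshot without the active-piece fields.
interface BasicGridSnapshot {
  grid: Grid | null;
  filledCount: number;
  filledInBottom(rows: number): number;
  hasFilledInTop(rows: number): boolean;
  completeRows: number;
}

// Hypothetical helper: derive snapshot metrics from a raw grid read.
function makeSnapshot(grid: Grid | null): BasicGridSnapshot {
  const rows: Grid = grid ?? [];
  const countRow = (row: boolean[]): number => row.filter(Boolean).length;
  return {
    grid,
    filledCount: rows.reduce((sum, row) => sum + countRow(row), 0),
    // slice(-0) would return every row, so non-positive counts are handled first.
    filledInBottom: (n) =>
      n <= 0 ? 0 : rows.slice(-n).reduce((sum, row) => sum + countRow(row), 0),
    hasFilledInTop: (n) =>
      rows.slice(0, Math.max(0, n)).some((row) => row.some(Boolean)),
    completeRows: rows.filter((row) => row.length > 0 && row.every(Boolean)).length,
  };
}

// Example: a 3-row, 2-column grid whose bottom row is complete.
const demoSnapshot = makeSnapshot([
  [false, false],
  [true, false],
  [true, true],
]);
```

A failed read goes through the same helper: `makeSnapshot(null)` yields `grid: null` with all counts at zero, which matches the error-handling contract the spec describes.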
+
+### Method-to-Source Mapping
+
+Each Driver method maps to existing code as follows:
+
+| Driver Method | Current Source | Current Function(s) |
+|---|---|---|
+| `loadPage()` | tests.ts:277-303 | `loadAndCheckPage()`, `loadGamePage()` |
+| `surveyPage()` | calibrate.ts:1300-1393 | `surveyPage()` |
+| `calibrate()` | calibrate.ts:24-94 | `calibrate()`, `detectGrid()`, `detectStartMechanism()`, `detectControls()`, `detectScoreElement()`, `measureGridConfidence()` |
+| `recalibrate()` | tests.ts:152-163 | inline re-calibration after start |
+| `readGrid()` | grid-reader.ts:15-38, 46-118, 142-364 | `readGrid()`, `readCanvasGrid()`, `readDomGrid()`, plus `countFilled()`, `countFilledInBottomRows()`, `hasFilledInTopRows()`, `countCompleteRows()`, `detectActivePieceCells()`, `identifyPieceType()` |
+| `gridsAreDifferent()` | grid-reader.ts:400-410 | `gridsAreDifferent()` |
+| `pressKey()` | player.ts:251-277 | inline `page.keyboard.press()` calls using `cal.controls` |
+| `pressRawKey()` | tests.ts:841-842 | inline `page.keyboard.press("z")` |
+| `wait()` | everywhere | `page.waitForTimeout()` |
+| `readScore()` | tests.ts:490-497, 529-538, 743-749 | inline score element reading |
+| `readLevel()` | tests.ts:1597-1630 | `readLevelFromPage()` |
+| `detectGameOverText()` | tests.ts:929-940 | inline `page.evaluate()` for game over text |
+| `detectRestartOption()` | tests.ts:943-955 | inline `page.evaluate()` for restart buttons |
+| `detectNextPiecePreview()` | tests.ts:1669-1717 | `detectNextPiecePreview()` |
+| `getConsoleErrors()` | tests.ts:94-98 | `consoleErrors` array |
+| `screenshot()` | player.ts:370-371 | `page.screenshot()` |
+| `measureDropInterval()` | tests.ts:1636-1664 | `measureDropInterval()` |
+
+### How the Driver handles different renderers
+
+The Driver encapsulates renderer differences entirely. The Bot never knows or cares
+whether the game uses canvas, DOM, SVG, or WebGL.
+
+```
+readGrid() internally:
+  if renderer === "canvas" && gridBounds:
+    -> readCanvasGrid() via page.evaluate(getImageData)
+  if renderer === "dom":
+    -> readDomGrid() via page.evaluate(DOM traversal)
+  if renderer === "svg":
+    -> future: readSvgGrid()
+  fallback:
+    -> try canvas if bounds exist, then try DOM
+```
+
+The `GridSnapshot` returned to the Bot is always the same shape regardless of renderer.
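The cascade above could be expressed roughly as follows. This is a sketch under stated assumptions: the readers are synchronous stubs (the real ones would go through `page.evaluate` and be async), and `readWithFallback`, `ReaderSet`, and `RendererKind` are hypothetical names rather than identifiers from the codebase.

```typescript
type Grid = boolean[][];
type RendererKind = "canvas" | "dom" | "svg" | "unknown";
type GridReader = () => Grid | null;

interface ReaderSet {
  canvas: GridReader;
  dom: GridReader;
}

// Picks a reader based on the calibrated renderer, mirroring the pseudocode cascade.
function readWithFallback(
  renderer: RendererKind,
  hasBounds: boolean,
  readers: ReaderSet,
): Grid | null {
  if (renderer === "canvas" && hasBounds) return readers.canvas();
  if (renderer === "dom") return readers.dom();
  // "svg" has no reader yet, so it and any unknown renderer use the fallback:
  // try canvas only if bounds exist, then try DOM.
  return (hasBounds ? readers.canvas() : null) ?? readers.dom();
}
```

Because the function takes the readers as an argument, the dispatch logic can be exercised with stubs and no browser.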
+
+### Re-calibration
+
+The Driver maintains mutable internal state:
+
+```typescript
+class PlaywrightDriver implements TetrisDriver {
+  private page: Page;
+  private cal: DriverCalibration | null = null;
+  private consoleErrors: string[] = [];
+
+  constructor(page: Page) {
+    this.page = page;
+  }
+  // TetrisDriver methods omitted.
+}
+```
+
+`recalibrate()` re-runs grid detection and start detection, but preserves
+the existing calibration if the new one is worse (e.g., grid detection fails
+on re-calibration but worked initially). This handles:
+
+- Games where the grid appears only after clicking "Start"
+- Games where the grid is rebuilt on game restart (new DOM elements)
+- Games where the canvas resizes after initialization
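The spec does not define what makes a calibration "worse", so the sketch below assumes one plausible ranking: a detected grid beats no grid, and ties go to the higher `gridConfidence`. `keepBest`, `isBetter`, and `CalibrationSummary` are hypothetical names.

```typescript
// Only the two DriverCalibration fields this ranking looks at.
interface CalibrationSummary {
  gridDetected: boolean;
  gridConfidence: number;
}

// Assumed ranking: a detected grid beats none; otherwise higher confidence wins.
function isBetter(candidate: CalibrationSummary, current: CalibrationSummary): boolean {
  if (candidate.gridDetected !== current.gridDetected) return candidate.gridDetected;
  return candidate.gridConfidence > current.gridConfidence;
}

// Keeps the existing calibration unless the new one is strictly better.
function keepBest(
  current: CalibrationSummary | null,
  candidate: CalibrationSummary,
): CalibrationSummary {
  return current === null || isBetter(candidate, current) ? candidate : current;
}
```

With this rule, a re-calibration that loses the grid never overwrites one that found it.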
+
+### Error handling
+
+| Scenario | Driver behavior |
+|---|---|
+| Grid read returns null | `readGrid()` returns `GridSnapshot` with `grid: null`, `filledCount: 0` |
+| Grid read throws | Same as null -- caught internally, never thrown to Bot |
+| No score element found | `readScore()` returns `null` |
+| Score element disappeared | `readScore()` returns `null` (caught internally) |
+| Console error during play | Accumulated in `consoleErrors`, accessible via `getConsoleErrors()` |
+| Page navigation fails | `loadPage()` returns `{ loaded: false, detail: "..." }` |
+| Canvas getImageData all zeros (no GPU) | Grid validation rejects (>60% filled), returns null |
+| Calibration finds nothing | Returns calibration with `gridDetected: false`, `startMechanism: "unknown"` |
+
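+The Driver never throws during page operations; all such errors are represented in return
+values. The one exception is `getCalibration()`, which throws if `calibrate()` has not been called.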
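One way to get the "caught internally" behaviour in the table is a small wrapper around each page operation. This is a sketch only: `safely` and `readScoreOrNull` are hypothetical helpers, and the number-extraction regex is an assumption rather than the project's `extractScoreFromText()`.

```typescript
// Hypothetical helper: run an operation, returning a fallback instead of throwing.
async function safely<T>(operation: () => Promise<T>, fallback: T): Promise<T> {
  try {
    return await operation();
  } catch {
    return fallback;
  }
}

// Example: a score read that returns null when the element text cannot be read.
// `readText` stands in for whatever page call fetches the score element's text.
async function readScoreOrNull(readText: () => Promise<string>): Promise<number | null> {
  return safely<number | null>(async () => {
    const match = (await readText()).match(/\d+/);
    return match ? Number(match[0]) : null;
  }, null);
}
```

The Bot then only ever sees `number | null`, whether the element is missing, empty, or detached mid-read.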
+
+---
+
+## Bot Interface
+
+### How the Bot calls the Driver
+
+The Bot receives a `TetrisDriver` instance. It never imports `Page` or
+anything from Playwright. It never calls `page.evaluate()`, `page.keyboard`,
+or `page.screenshot()` directly.
+
+```typescript
+// bot.ts
+import type { TetrisDriver, DriverCalibration, GridSnapshot } from "./driver";
+import type {
+  TestResult,
+  GameplayStats,
+  GameSession,
+  CompetitivePlayResult,
+  SurveyData,
+  BotReport,
+  Grid,
+} from "./types";
+import { findBestPlacement } from "./player";
+
+export async function runAllTests(
+  driver: TetrisDriver,
+  serverUrl: string
+): Promise<{
+  testResults: TestResult[];
+  calibration: DriverCalibration;
+  gameplay: GameplayStats;
+  session: GameSession;
+  survey: SurveyData;
+  competitivePlay: CompetitivePlayResult | null;
+}> {
+  // Phase 1: Load
+  const loadResult = await driver.loadPage(serverUrl);
+  // ...
+
+  // Phase 2: Calibrate
+  const cal = await driver.calibrate();
+  // ...
+
+  // Phase 3-8: Use only driver.readGrid(), driver.pressKey(), etc.
+}
+```
+
+### Phase execution flow using Driver methods
+
+**Phase 1: Page Load**
+```
+driver.loadPage(url) -> { loaded, detail, errorsOnLoad }
+driver.wait(3000)
+```
+
+**Phase 2: Calibrate + Start**
+```
+survey = driver.surveyPage()
+cal = driver.calibrate()
+  // Internally: detectStartMechanism(), detectGrid(), etc.
+if cal.startMechanism === "unknown" || !cal.gridDetected:
+  cal = driver.recalibrate()
+```
+
+**Phase 3: Basic Mechanics**
+```
+// Auto-drop test
+snap0 = driver.readGrid()
+driver.wait(5000)
+snap1 = driver.readGrid()
+gridChanged = driver.gridsAreDifferent(snap0.grid, snap1.grid)
+
+// Movement tests
+for dir in [left, right, down]:
+  snapBefore = driver.readGrid()
+  driver.pressKey(dir)
+  driver.wait(300)
+  snapAfter = driver.readGrid()
+  // compare
+
+// Rotation test
+snapBefore = driver.readGrid()
+driver.pressKey("rotate")
+driver.wait(300)
+snapAfter = driver.readGrid()
+// compare bounding boxes of active piece cells
+
+// Hard drop test
+driver.pressKey("drop")
+driver.wait(500)
+snapAfter = driver.readGrid()
+// check bottom rows
+```
+
+**Phase 4: Piece Lifecycle**
+```
+// Already tested during Phase 3 mechanics
+// Piece locks: bottom cells persist across reads
+// New piece spawns: top rows have cells after drop
+// Multiple pieces: piecesLocked counter >= 3
+```
+
+**Phase 5: Gameplay**
+```
+driver.loadPage(url)
+cal = driver.calibrate()
+initialScore = driver.readScore()
+// Play loop (60 pieces / 45s):
+while pieces < 60 && elapsed < 45s:
+  snap = driver.readGrid(settledGrid)
+  if snap.activePieceCells:
+    placement = findBestPlacement(settledGrid, snap.activePieceType)
+    // Execute placement using driver.pressKey()
+    for i in 0..placement.rotations:
+      driver.pressKey("rotate")
+      driver.wait(50)
+    // Move to column
+    driver.pressKey("left" or "right") * N
+    driver.pressKey("drop")
+    driver.wait(100)
+    settledGrid = (await driver.readGrid()).grid
+    pieces += 1
+  driver.wait(60)
+finalScore = driver.readScore()
+```
+
+**Phase 6: Game Over**
+```
+driver.loadPage(url)
+driver.calibrate()
+// Hard drop 40 times, checking grid after every 5
+for i in 0..40:
+  driver.pressKey("drop")
+  driver.wait(150)
+  if i % 5 === 0:
+    snap = driver.readGrid()
+    if snap.hasFilledInTop(4):
+      driver.pressKey("drop")
+      driver.wait(300)
+      snap2 = driver.readGrid()
+      if !driver.gridsAreDifferent(snap.grid, snap2.grid):
+        // Game over detected
+gameOverText = driver.detectGameOverText()
+```
+
+**Phase 7: Endurance**
+```
+driver.loadPage(url)
+driver.calibrate()
+// Play for 30 seconds using same play loop as Phase 5
+```
+
+**Phase 8: Competitive Play**
+```
+driver.loadPage(url)
+driver.calibrate()
+initialDropInterval = driver.measureDropInterval()
+initialLevel = driver.readLevel()
+// Play for 60 seconds with detailed tracking
+// Every 5th poll: driver.readScore()
+// Every 10th poll: driver.readLevel()
+// Periodic: driver.pressRawKey("z") for CCW test
+// Periodic: soft drop test via driver.pressKey("down")
+finalDropInterval = driver.measureDropInterval()
+nextPieceVisible = driver.detectNextPiecePreview()
+gameOverText = driver.detectGameOverText()
+restartAvailable = driver.detectRestartOption()
+```
+
+### Test derivation
+
+`deriveTestResults()` stays in bot.ts. It receives the `GameSession` data
+that the Bot accumulated during the phases and produces a `TestResult[]` array of 24 entries.
+It does not need the Driver at all -- it operates on pure data.
+
+The function signature is unchanged apart from `cal`, which is now a `DriverCalibration`:
+
+```typescript
+function deriveTestResults(
+  session: GameSession,
+  cal: DriverCalibration,
+  loadResult: LoadResult,
+  consoleErrors: string[],
+  gameplay: GameplayStats,
+  phaseState: PhaseState,
+  competitivePlay: CompetitivePlayResult | null
+): TestResult[]
+```
+
+### Where the AI player logic lives
+
+`player.ts` becomes a pure computation module. It keeps:
+
+- `PIECES` definitions
+- `findBestPlacement()` (exported)
+- `findBestPlacementGeneric()`
+- `simulateDropPiece()`
+- `clearLines()`
+- `aggregateHeight()`, `countHoles()`, `bumpiness()`
+- `stripActivePiece()` (exported)
+- `Placement` interface (exported)
+
+It loses:
+
+- `playGame()` -- moves to bot.ts (it orchestrates grid reads + AI + key presses)
+- `hardDrop()` -- replaced by `driver.pressKey("drop")`
+- `playRandomMove()` -- moves to bot.ts
+- `playRandomForDuration()` -- moves to bot.ts
+- `tryFillRow()` -- moves to bot.ts
+- `stackToGameOver()` -- moves to bot.ts
+- `executePlacement()` -- moves to bot.ts (it calls driver.pressKey)
+- `countTotalFilled()` -- redundant with GridSnapshot.filledCount
+
+After refactor, `player.ts` has zero Playwright imports.
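Since `player.ts` becomes pure computation, its four heuristics can run on a plain grid. The sketch below reuses the function names listed above, but the bodies, the `Weights` shape, and `scoreBoard` are assumptions, not the project's code. It also assumes `Grid` is a row-major `boolean[][]` with row 0 at the top.

```typescript
type Grid = boolean[][]; // assumption: row 0 is the top row

// Height of each column, measured from the floor to its topmost filled cell.
function columnHeights(grid: Grid): number[] {
  const width = grid[0]?.length ?? 0;
  return Array.from({ length: width }, (_, col) => {
    const top = grid.findIndex((row) => row[col]);
    return top === -1 ? 0 : grid.length - top;
  });
}

// A hole is an empty cell with at least one filled cell above it in the same column.
function countHoles(grid: Grid): number {
  let holes = 0;
  const width = grid[0]?.length ?? 0;
  for (let col = 0; col < width; col++) {
    let seenBlock = false;
    for (const row of grid) {
      if (row[col]) seenBlock = true;
      else if (seenBlock) holes++;
    }
  }
  return holes;
}

// Sum of absolute height differences between adjacent columns.
function bumpiness(heights: number[]): number {
  let total = 0;
  for (let i = 1; i < heights.length; i++) total += Math.abs(heights[i] - heights[i - 1]);
  return total;
}

// Weights are caller-supplied placeholders; the spec does not state the real values.
interface Weights { height: number; lines: number; holes: number; bumpiness: number }

function scoreBoard(grid: Grid, weights: Weights): number {
  const heights = columnHeights(grid);
  const aggregateHeight = heights.reduce((a, b) => a + b, 0);
  const lines = grid.filter((row) => row.length > 0 && row.every(Boolean)).length;
  return (
    weights.height * aggregateHeight +
    weights.lines * lines +
    weights.holes * countHoles(grid) +
    weights.bumpiness * bumpiness(heights)
  );
}
```

None of these functions touch a `Page`, so they can be unit tested with hand-written grids.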
+
+---
+
+## Migration Plan
+
+### New files created
+
+| File | Purpose | Est. lines |
+|---|---|---|
+| `driver.ts` | TetrisDriver interface + PlaywrightDriver implementation | ~900 |
+| `bot.ts` | Phase orchestration, play loops, test derivation | ~1100 |
+
+### Files modified
+
+| File | Change | Est. lines |
+|---|---|---|
+| `player.ts` | Remove all Playwright-dependent functions, keep pure AI logic | ~350 -> ~250 |
+| `types.ts` | Add `DriverCalibration`, `GridSnapshot` interfaces (or keep in driver.ts). Minor additions. | ~205 -> ~220 |
+| `index.ts` | Change import from `tests.ts` to `bot.ts`, instantiate `PlaywrightDriver`, pass to `runAllTests`. | ~260 -> ~270 |
+
+### Files deleted
+
+| File | Reason |
+|---|---|
+| `calibrate.ts` | Absorbed into `driver.ts` |
+| `grid-reader.ts` | Absorbed into `driver.ts` |
+| `tests.ts` | Replaced by `bot.ts` |
+
+### What stays
+
+- `types.ts` -- interfaces stay the same, report format unchanged
+- `index.ts` -- HTTP server, Playwright test structure, report writing all stay
+- `SPEC.md` -- unchanged
+- `COMPETITIVE_PLAY_SPEC.md` -- unchanged
+- Report format (`BotReport`) -- identical JSON output
+
+### Incremental migration (4 phases)
+
+**Phase A: Create driver.ts with the interface + implementation (no callers yet)**
+
+1. Create `driver.ts` with `TetrisDriver` interface and `PlaywrightDriver` class.
+2. Move into it from `calibrate.ts`:
+   - `detectStartMechanism()` and its sub-functions (`tryKeyboardTriggers`, `tryDomButtons`, `tryCanvasClicks`)
+   - `detectGrid()`
+   - `detectControls()`
+   - `detectScoreElement()`
+   - `measureGridConfidence()`
+   - `surveyPage()`
+   - `sampleScreenshot()`
+   - `detectVisualChange()`
+   - `verifyInteractivity()`
+   - `clusterPoints()`
+   - `recalibrateWithRetry()`
+3. Move into it from `grid-reader.ts`:
+   - `readGrid()`, `readCanvasGrid()`, `readDomGrid()`
+   - `sampleBackgroundColor()`
+   - `validateGridBounds()`
+   - `gridsAreDifferent()`
+   - `countFilled()`, `countFilledInBottomRows()`, `hasFilledInTopRows()`
+   - `countCompleteRows()`, `isRowComplete()`
+   - `getColumnHeights()`
+   - `detectActivePieceCells()`, `identifyPieceType()`
+4. Move into it from `tests.ts`:
+   - `readLevelFromPage()`
+   - `measureDropInterval()`
+   - `detectNextPiecePreview()`
+   - `extractScoreFromText()` (internal helper)
+5. Wrap everything behind `PlaywrightDriver` methods.
+6. Export both the interface and the class.
+7. At this point, old code still works -- the functions above are copied rather than removed, so `calibrate.ts`, `grid-reader.ts`, and `tests.ts` are unchanged.
+
+**Commit A**: "Add driver.ts: TetrisDriver interface and PlaywrightDriver implementation"
+
+**Phase B: Create bot.ts (calls driver.ts, replaces tests.ts)**
+
+1. Create `bot.ts` with the new `runAllTests()` that accepts `TetrisDriver`.
+2. Move into it from `tests.ts`:
+   - `runAllTests()` (rewritten to call Driver instead of Playwright directly)
+   - `runBasicMechanicsPhase()`
+   - `runGameplayPhase()`
+   - `runGameOverPhase()`
+   - `runEndurancePhase()`
+   - `runCompetitivePlayPhase()`
+   - `deriveTestResults()`
+   - `ALL_TEST_NAMES`
+   - `emptyCalibration()` (adapted to return `DriverCalibration`)
+   - `loadAndCheckPage()` (replaced by `driver.loadPage()`)
+   - `boundingBox()` helper
+   - `countFilledInTopRows()` helper (local in tests.ts, replaced by GridSnapshot method)
+3. Move into it from `player.ts`:
+   - `playGame()` (rewritten to call Driver)
+   - `executePlacement()` (rewritten to call Driver)
+   - `playRandomMove()` (rewritten to call Driver)
+   - `playRandomForDuration()` (rewritten to call Driver)
+   - `tryFillRow()` (rewritten to call Driver)
+   - `stackToGameOver()` (rewritten to call Driver)
+4. bot.ts imports `findBestPlacement`, `stripActivePiece`, `Placement` from `player.ts`
+   and everything else from `driver.ts`.
+
+**Commit B**: "Add bot.ts: phase orchestration using TetrisDriver"
+
+**Phase C: Rewire index.ts, slim player.ts**
+
+1. Update `index.ts`:
+   - Import `PlaywrightDriver` from `./driver`
+   - Import `runAllTests` from `./bot` (not `./tests`)
+   - In the test body: `const driver = new PlaywrightDriver(page); const results = await runAllTests(driver, serverUrl);`
+2. Remove from `player.ts`:
+   - `playGame()`, `hardDrop()`, `executePlacement()`, `playRandomMove()`, `playRandomForDuration()`, `tryFillRow()`, `stackToGameOver()`
+   - `import type { Page }` and `import { readGrid, ... }` from grid-reader
+   - `countTotalFilled()` (redundant)
+3. `player.ts` now exports only:
+   - `findBestPlacement()` (accepts `Grid` and `PieceType`, returns `Placement | null`)
+   - `stripActivePiece()` (accepts `Grid` and cells, returns `Grid`)
+   - `Placement` interface
+
+**Commit C**: "Rewire index.ts to use bot.ts + driver.ts, slim player.ts"
+
+**Phase D: Delete old files**
+
+1. Delete `calibrate.ts`
+2. Delete `grid-reader.ts`
+3. Delete `tests.ts`
+4. Verify all imports resolve
+5. Run the full eval pipeline against a known artifact to confirm identical report output
+
+**Commit D**: "Remove old calibrate.ts, grid-reader.ts, tests.ts"
+
+### Backwards compatibility
+
+The report format (`BotReport`) does not change. The JSON output is byte-identical
+for the same game input. The summary score calculation is unchanged. The test names
+are unchanged. The competitive play data structure is unchanged.
+
+The only change is to the internal file structure, which is not externally visible. Nothing
+downstream (the scoring pipeline, the dashboard, the harness) needs to change.
+
+---
+
+## File Structure After Refactor
+
+```
+gameplay-bot/
+  types.ts          ~220 lines   Interfaces (unchanged)
+  driver.ts         ~900 lines   TetrisDriver interface + PlaywrightDriver class
+  player.ts         ~250 lines   Pure AI: heuristics, simulation, placement finding
+  bot.ts           ~1100 lines   Phases, play loops, test derivation, competitive play
+  index.ts          ~270 lines   Playwright test entry, HTTP server, report output
+  SPEC.md                        Unchanged
+  COMPETITIVE_PLAY_SPEC.md       Unchanged
+  REFACTOR_SPEC.md               This document
+```
+
+Total: ~2740 lines (down from ~3500 because of deduplication and removing
+redundant helpers that now live behind the Driver).
+
+### Import/dependency graph
+
+```
+index.ts
+  -> driver.ts (PlaywrightDriver constructor)
+  -> bot.ts (runAllTests)
+  -> types.ts (BotReport)
+
+bot.ts
+  -> driver.ts (TetrisDriver interface, DriverCalibration, GridSnapshot)
+  -> player.ts (findBestPlacement, stripActivePiece, Placement)
+  -> types.ts (all data interfaces)
+
+driver.ts
+  -> types.ts (Grid, GridBounds, RendererType, Controls, etc.)
+  -> @playwright/test (Page)
+
+player.ts
+  -> types.ts (Grid, PieceType)
+  (NO @playwright/test import)
+```
+
+Key constraint: `bot.ts` does NOT import `@playwright/test`. It depends on the
+`TetrisDriver` interface, not the implementation. This means the Bot can be tested
+with a mock driver that returns canned grid states -- no browser needed.
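A mock of that kind might look like the sketch below. It implements only a narrow slice of the Driver surface, and `CannedDriver`, `GridSource`, and `dropChangesGrid` are hypothetical names introduced for illustration. A full mock would implement all of `TetrisDriver`.

```typescript
type Grid = boolean[][];
type Action = "left" | "right" | "down" | "rotate" | "drop";

// Narrow slice of the Driver surface that this example needs.
interface GridSource {
  readGrid(): Promise<{ grid: Grid | null }>;
  pressKey(action: Action): Promise<void>;
}

// Hypothetical fake: replays canned grids in order and records key presses.
class CannedDriver implements GridSource {
  readonly pressed: Action[] = [];
  private readonly frames: (Grid | null)[];
  private reads = 0;

  constructor(frames: (Grid | null)[]) {
    this.frames = frames;
  }

  async readGrid(): Promise<{ grid: Grid | null }> {
    // Repeat the last frame once the canned sequence runs out.
    const index = Math.min(this.reads++, this.frames.length - 1);
    return { grid: this.frames[index] ?? null };
  }

  async pressKey(action: Action): Promise<void> {
    this.pressed.push(action);
  }
}

// Bot-side logic under test: does a hard drop change the grid?
async function dropChangesGrid(driver: GridSource): Promise<boolean> {
  const before = (await driver.readGrid()).grid;
  await driver.pressKey("drop");
  const after = (await driver.readGrid()).grid;
  return JSON.stringify(before) !== JSON.stringify(after);
}
```

Feeding the fake two identical frames simulates the frozen grid that Phase 6 treats as game over, with no browser involved.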
+
+---
+
+## Edge Cases
+
+### Games that need re-calibration mid-session
+
+**Scenario**: Grid appears only after clicking "Start". On page load, there is no
+canvas and no DOM grid -- just a splash screen.
+
+**Current behavior**: `calibrate()` runs on the splash screen, finds nothing.
+Then `tests.ts` tries start mechanisms, and after starting, re-runs `calibrate()`.
+
+**Driver behavior**: `calibrate()` includes start detection. If it starts the game
+but finds no grid, it waits and re-scans. `recalibrate()` is also available for the
+Bot to call explicitly after any phase reload.
+
+**Bot flow**:
+```
+cal = driver.calibrate()
+if cal.gridDetected === false && cal.startMechanism !== "unknown":
+  // Game started but grid not found yet -- wait and retry
+  driver.wait(500)
+  cal = driver.recalibrate()
+```
+
+### Games where the Driver cannot read the grid at all
+
+**Scenario**: Canvas game without GPU access. `getImageData()` returns all zeros.
+
+**Driver behavior**: `readGrid()` returns `GridSnapshot { grid: null }` every time.
+The Bot sees grid failures accumulate.
+
+**Bot flow**: Phase 3 (mechanics) detects that `gridReadSuccess === 0`. The Bot
+marks all grid-dependent tests as failed with detail "grid reader unavailable".
+It does NOT fall back to screenshot-only testing (per the "NO FALSE POSITIVES" rule).
+Competitive play is skipped.
+
+### Games that pause themselves
+
+**Scenario**: Player accidentally triggers a pause menu (Escape key, or a pause
+button that overlaps with the game area).
+
+**Driver behavior**: `readGrid()` may return null (if an overlay covers the grid)
+or return a static grid (same state on every read). The Driver does not know about
+pausing -- it just reports what it sees.
+
+**Bot flow**: The play loop in bot.ts already handles stale grids. If the grid
+hasn't changed for 8 seconds, it tries pressing the drop key (which may unpause).
+If grid reads start returning null, the Bot counts consecutive failures. After 10
+consecutive null reads, it falls back to random key presses for a brief period,
+then re-reads.
+
+The Bot could also try pressing Escape or P to dismiss a pause screen:
+```
+if consecutiveUnchanged > 80: // 80 polls * 60ms = ~5 seconds
+  driver.pressRawKey("Escape")
+  driver.wait(500)
+  driver.pressRawKey("p")
+  driver.wait(500)
+```
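The `consecutiveUnchanged` counter used above could be maintained by a small tracker like the one sketched here. `StaleGridTracker` is a hypothetical name, and comparing grids through `JSON.stringify` is an assumption made for brevity rather than the project's `gridsAreDifferent()`.

```typescript
type Grid = boolean[][];

// Counts how many consecutive polls returned the same grid state.
class StaleGridTracker {
  private last: string | null = null;
  private unchanged = 0;

  // Records one poll and returns the current run of unchanged reads.
  observe(grid: Grid | null): number {
    const key = JSON.stringify(grid);
    this.unchanged = key === this.last ? this.unchanged + 1 : 0;
    this.last = key;
    return this.unchanged;
  }
}
```

At a 60 ms poll interval, a return value above 80 corresponds to the roughly 5 seconds of inactivity used as the threshold above.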
+
+### Games with overlays that block gameplay
+
+**Scenario**: A modal overlay (tutorial, cookie consent, "enter your name" dialog)
+appears on top of the game, blocking input.
+
+**Driver behavior**: `surveyPage()` detects overlays (positioned elements covering
+>50% of viewport). The start mechanism detection already tries clicking overlays
+and pressing Escape to dismiss them.
+
+**Bot flow**: If the game started but mechanics tests show no response to input
+(movementsObserved === 0), the Bot can request a recalibrate, which may re-run
+start detection and dismiss a new overlay.
+
+### Games in different languages
+
+**Scenario**: The game UI is in Spanish, Japanese, or any non-English language.
+"Start", "Game Over", "Score" have different text.
+
+**Driver behavior**: Start mechanism detection is already fully language-agnostic
+(visual change detection + interactivity verification, no text matching). Score
+element detection falls back from labeled text ("Score: 0") to structural heuristics
+(leaf element containing a standalone number). Game over text detection checks
+multiple languages ("game over", "fin del juego", etc.) or falls back to
+grid-state-based detection (grid frozen after filling to top).
+
+**Bot flow**: The Bot does not do any text matching. It delegates all text-based
+detection to the Driver. Tests like `game_over` use `driver.detectGameOverText()`
+which is the Driver's responsibility. The Bot adds a grid-based game over check
+(frozen grid after stacking) as a secondary signal that doesn't depend on language.
+
+The `detectGameOverText()` method could be extended with more languages:
+```typescript
+// Inside driver.ts
+const gameOverPatterns = [
+  "game over", "gameover", "you lose", "try again",
+  "play again", "restart", "fin del juego", "juego terminado",
+  "ゲームオーバー", "游戏结束"
+];
+```
+
+But the primary game over detection in bot.ts (Phase 6) does not depend on text --
+it watches the grid freeze after filling to the top.
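A matcher over the pattern list above might look like the following sketch. The patterns are copied from that list; the case-insensitive substring strategy and the name `matchGameOverText` are assumptions. It returns the matched pattern rather than the original page text, which is a simplification of the "returns the matched text" contract.

```typescript
// Patterns copied from the list above.
const GAME_OVER_PATTERNS: string[] = [
  "game over", "gameover", "you lose", "try again",
  "play again", "restart", "fin del juego", "juego terminado",
  "ゲームオーバー", "游戏结束",
];

// Returns the first pattern found in the visible page text, or null if none match.
function matchGameOverText(visibleText: string): string | null {
  const haystack = visibleText.toLowerCase();
  return GAME_OVER_PATTERNS.find((pattern) => haystack.includes(pattern)) ?? null;
}
```

Keeping the list in one place inside the Driver means new languages can be added without touching bot.ts.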
+
+---
+
+## What This Spec Does NOT Cover
+
+- WebGL grid reading (not implemented yet, out of scope)
+- New tests beyond the existing 24
+- Changes to the report format or scoring
+- Dashboard changes
+- Harness changes
+- Performance optimization of grid reading
+- Testability improvements beyond the Driver/Bot split (e.g., mock Driver tests)
+
+These are natural follow-ups after the refactor lands, but they are separate work items.

	git clone https://git.shiptheloop.com/loop-benchmarking.git