commit 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62
parent 7df3ddd793a69cba93ded966d634045f4810a5fc
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Fri, 10 Apr 2026 21:11:59 +0200
Update methodology page with current bot architecture
Major rewrite of the bot section to reflect the actual implementation:
- 8 conditional phases (was 4)
- 25 tests across mechanics, lifecycle, gameplay, game state, competitive
- Two-tier architecture (Driver + Bot separation)
- Discovery infrastructure: language-agnostic start, interactivity check,
control discovery, calibration cache
- All 9 competitive play bug detection tests listed
- 60ms polling (was 150ms)
- Updated limitations: GPU requirement, trail bugs, game over masking,
hidden elements, etc.
- Pierre Dellacherie attribution
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat:
1 file changed, 200 insertions(+), 48 deletions(-)
diff --git a/dashboard/src/pages/methodology.astro b/dashboard/src/pages/methodology.astro
@@ -25,7 +25,7 @@ import Base from "../layouts/Base.astro";
human language, budget, and more. These are the grid axes. Each unique combination of values
is a "cell" in the experiment matrix.
</p>
- <p class="muted">16 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
+ <p class="muted">22 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
</div>
<div class="card">
<h3>Outputs</h3>
@@ -65,26 +65,23 @@ import Base from "../layouts/Base.astro";
<div class="card">
<h3>Gameplay Bot <span class="weight">50%</span></h3>
<p>
- 16 automated Playwright tests. The bot calibrates itself to each game -- it finds the
- grid, discovers controls, locates the start mechanism. Then it plays using a continuous
- 150ms polling loop that reads grid state directly from the canvas or DOM.
+ 25 automated Playwright tests across 8 conditional phases. The bot calibrates itself to
+ each game -- finds the grid, discovers controls, locates the start mechanism -- then
+ plays using a continuous 60ms polling loop that reads grid state directly from the
+ canvas or DOM.
</p>
<p>Tests cover:</p>
<ul>
- <li>Game loads and starts</li>
- <li>Auto-drop (gravity works)</li>
- <li>Movement (left, right, down)</li>
- <li>Rotation</li>
- <li>Hard drop</li>
- <li>Piece locking and new piece spawning</li>
- <li>Line clearing</li>
- <li>Score changes</li>
- <li>Game over detection</li>
- <li>30-second endurance (no crashes or freezes)</li>
+ <li><strong>Mechanics (1-9):</strong> game loads, game starts, auto-drop, movement (left/right/down), rotation, hard drop, all-piece-types rotation</li>
+ <li><strong>Piece lifecycle (10-12):</strong> piece locks, new piece spawns, multiple pieces placed</li>
+ <li><strong>Gameplay (13-14):</strong> line clear, score changes</li>
+ <li><strong>Game state (15-16):</strong> game over detection, 30-second endurance</li>
+ <li><strong>Competitive play bug detection (17-25):</strong> multi-line clear, score scaling, level progression, speed progression, next piece preview, game over display, counter-clockwise rotation, soft drop distinct from hard drop, rendering trail detection</li>
</ul>
<p class="muted">
- Every test is pure deterministic observation. The bot reads pixels or DOM state,
- presses keys, and checks if the game responded correctly.
+ Every test is deterministic observation. The bot reads pixels or DOM state, presses
+ keys, and checks whether the game responded correctly. Tests in later phases run only if
+ earlier phases succeeded, so cascading failures show up as skipped tests, not false failures.
</p>
</div>
<div class="card">
@@ -219,68 +216,198 @@ import Base from "../layouts/Base.astro";
<h2>How the gameplay bot works</h2>
<p>
The bot is a Playwright script that loads any Tetris implementation and figures out
- how to interact with it. It handles different renderers, control schemes, and start
- mechanisms without prior knowledge of the implementation.
+ how to interact with it. It handles different renderers (canvas, DOM, SVG, WebGL),
+ control schemes, languages, and start mechanisms without prior knowledge of the
+ implementation. Start and control discovery use no text matching -- both work purely
+ through observation.
+ </p>
+
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Two-tier architecture</h3>
+ <p>
+ The bot is split into two layers:
+ </p>
+ <div class="two-col">
+ <div class="card">
+ <h3>Driver</h3>
+ <p>
+ Abstracts the webpage. Handles grid detection, pixel sampling, button discovery,
+ keyboard input, screenshots, calibration caching. The driver knows about Playwright
+ and the DOM, but knows nothing about Tetris.
+ </p>
+ </div>
+ <div class="card">
+ <h3>Bot</h3>
+ <p>
+ Knows Tetris. Runs the 8 phases, derives the 25 test results, plays the game using
+ Pierre Dellacherie's heuristic. Never imports Playwright -- only talks to the driver
+ through a typed interface.
+ </p>
+ </div>
+ </div>
+
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Discovery infrastructure</h3>
+ <div class="three-col">
+ <div class="card">
+ <h3>Language-agnostic start detection</h3>
+ <p>
+ Buttons are found by structural properties (cursor:pointer, size, contrast) -- never
+ by text. Tries auto-start, then DOM buttons in order of prominence, then keyboard
+ triggers, then canvas clicks.
+ </p>
+ </div>
+ <div class="card">
+ <h3>Interactivity verification</h3>
+ <p>
+ After every start attempt, the bot verifies that the game actually responds to gameplay
+ inputs (ArrowLeft/ArrowRight) by checking screenshot and DOM state changes. This rejects
+ false positives such as Pause buttons, as well as games that go straight to game over.
+ </p>
+ </div>
+ <div class="card">
+ <h3>Control discovery</h3>
+ <p>
+ Each control is probed against a list of candidate keys. The bot classifies each key by
+ the grid delta it causes: did the piece move one column left? Teleport to the bottom
+ (hard drop)? Rotate? This catches games where ArrowDown is hard drop and Space is pause.
+ </p>
+ </div>
+ </div>
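The grid-delta classification described under control discovery can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: it assumes the active piece's cells have already been isolated before and after the key press, it ignores gravity ticks during the probe, and the names `Cell`, `Action`, and `classify` are hypothetical.

```typescript
// Hypothetical sketch: classify a key press from the active piece's cells
// before and after the press. Cells are [row, col], with row 0 at the top.
type Cell = [number, number];
type Action = "none" | "left" | "right" | "soft_drop" | "hard_drop" | "rotate" | "unknown";

// Sort row-major so that a pure translation keeps cells in the same order.
function sortCells(cells: Cell[]): Cell[] {
  return [...cells].sort((a, b) => a[0] - b[0] || a[1] - b[1]);
}

function classify(before: Cell[], after: Cell[]): Action {
  if (before.length === 0 || before.length !== after.length) return "unknown";
  const a = sortCells(before);
  const b = sortCells(after);
  const dRow = b[0][0] - a[0][0];
  const dCol = b[0][1] - a[0][1];
  // A translation moves every cell by the same (dRow, dCol).
  const isTranslation = a.every(
    (cell, i) => b[i][0] - cell[0] === dRow && b[i][1] - cell[1] === dCol
  );
  // Assumption: same cell count but a different shape means the piece rotated.
  if (!isTranslation) return "rotate";
  if (dRow === 0 && dCol === 0) return "none";
  if (dRow === 0 && dCol === -1) return "left";
  if (dRow === 0 && dCol === 1) return "right";
  if (dCol === 0 && dRow === 1) return "soft_drop";
  if (dCol === 0 && dRow > 1) return "hard_drop";
  return "unknown";
}
```

One consequence of this approach is that a rotation that leaves the shape unchanged (an O piece, for example) is indistinguishable from no response, so a real probe would likely need to retry on a different piece.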
+
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Eight conditional phases</h3>
+ <p>
+ Each phase only runs if the previous phase succeeded. Failed prerequisites mark
+ downstream tests as <code>skipped</code> rather than failed -- so the bot never produces
+ false positives or false negatives from cascading failures.
</p>
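The skip-on-failed-prerequisite rule can be sketched as a small runner. This is a hedged illustration only: the `Phase` shape, the `runPhases` name, and the rule that a phase "succeeds" when none of its tests fail are assumptions, not details taken from the bot's source.

```typescript
// Hypothetical sketch of conditional phases: once a phase fails, every test
// in later phases is recorded as "skipped" instead of being run.
type Outcome = "pass" | "fail" | "skipped";

interface Phase {
  name: string;
  tests: string[];
  // Runs the phase and returns an outcome for each of its tests.
  run: () => Record<string, Outcome>;
}

function runPhases(phases: Phase[]): Record<string, Outcome> {
  const results: Record<string, Outcome> = {};
  let prerequisitesMet = true;
  for (const phase of phases) {
    if (!prerequisitesMet) {
      // An earlier phase failed: do not run, just mark the tests as skipped.
      for (const test of phase.tests) results[test] = "skipped";
      continue;
    }
    const outcomes = phase.run();
    for (const test of phase.tests) {
      results[test] = test in outcomes ? outcomes[test] : "skipped";
    }
    // Assumption: any failed test means the phase did not succeed.
    if (phase.tests.some((test) => results[test] === "fail")) {
      prerequisitesMet = false;
    }
  }
  return results;
}
```

With this shape, a game that never starts yields one failure (`game_starts`) and a list of skipped tests, rather than a wall of failures that all trace back to the same cause.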
<div class="phases">
<div class="phase">
<div class="phase-header">
<span class="phase-number">1</span>
- <h3>Calibration</h3>
+ <h3>Page load</h3>
</div>
<p>
- Detect the game grid (canvas, DOM, or SVG). Find controls by trying arrow keys,
- WASD, and checking if pieces move. Locate the start mechanism (auto-start, button,
- keypress). Find the score display element.
+ Navigate to the game URL. Survey the page: is there a canvas, a DOM grid, an
+ overlay, clickable elements? Capture console errors. Test: <code>game_loads</code>.
</p>
</div>
<div class="phase">
<div class="phase-header">
<span class="phase-number">2</span>
- <h3>Observation</h3>
+ <h3>Start detection</h3>
</div>
<p>
- Continuous 150ms polling loop reads the grid state as a 10x20 boolean matrix.
- For canvas games, this means sampling one pixel per cell and checking against a
- calibrated background threshold. All state changes are recorded as events.
+ Discover candidates (auto-start, buttons, keyboard, canvas clicks). Try each, verify
+ with interactivity check, commit only when the game actually responds to gameplay
+ inputs. Tests: <code>game_starts</code>, <code>auto_drop</code>.
</p>
</div>
<div class="phase">
<div class="phase-header">
<span class="phase-number">3</span>
- <h3>AI play</h3>
+ <h3>Mechanics</h3>
+ </div>
+ <p>
+ Test each control: left, right, down, rotate, hard drop. Read the grid before and
+ after each key press to verify the expected change. Tests: <code>move_left</code>,
+ <code>move_right</code>, <code>move_down</code>, <code>rotate</code>,
+ <code>hard_drop</code>, <code>all_pieces_rotate</code>.
+ </p>
+ </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">4</span>
+ <h3>Piece lifecycle</h3>
+ </div>
+ <p>
+ Verify pieces lock at the bottom, new pieces spawn at the top, and the game produces
+ multiple distinct pieces over time. Tests: <code>piece_locks</code>,
+ <code>new_piece_spawns</code>, <code>multiple_pieces</code>.
+ </p>
+ </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">5</span>
+ <h3>Gameplay</h3>
</div>
<p>
- Pierre Dellacherie's 4-heuristic algorithm (2003) evaluates all possible placements for each piece:
+ Play 60 pieces or 45 seconds (whichever comes first) using the AI player. Track
+ score and lines cleared during play. Tests: <code>line_clear</code>,
+ <code>score_changes</code>.
</p>
<pre><code>score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness</code></pre>
<p class="muted">
- Weights from genetic algorithm optimization by Colin Fahey. Reference implementation:
- <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT license).
- The bot is a strong Tetris player -- the original algorithm can clear thousands of lines without losing.
- We use it to exercise game mechanics and trigger events like multi-line clears for bug detection.
+ Pierre Dellacherie's 4-heuristic algorithm (2003), with weights from Colin Fahey's
+ genetic algorithm optimization. Reference implementation:
+ <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT).
+ The original algorithm can clear thousands of lines without losing.
</p>
</div>
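The weighted sum shown in the Gameplay phase can be sketched as a board-evaluation function. This is a minimal sketch under assumptions: the page gives only the four weights, so the feature definitions used here (summed column heights, complete rows, covered empty cells, adjacent-column height differences) are the usual ones for this family of heuristics, and `Grid`, `columnHeights`, and `evaluate` are hypothetical names rather than the bot's actual code.

```typescript
// Hypothetical sketch: score a board with the four-feature weighted sum.
// Grid rows run top to bottom; true means a filled cell.
type Grid = boolean[][];

// Height of each column: distance from its topmost filled cell to the floor.
function columnHeights(grid: Grid): number[] {
  const rows = grid.length;
  const cols = rows > 0 ? grid[0].length : 0;
  const heights: number[] = [];
  for (let c = 0; c < cols; c++) {
    let h = 0;
    for (let r = 0; r < rows; r++) {
      if (grid[r][c]) {
        h = rows - r;
        break;
      }
    }
    heights.push(h);
  }
  return heights;
}

function evaluate(grid: Grid): number {
  const rows = grid.length;
  const heights = columnHeights(grid);
  // Assumed feature: sum of all column heights.
  const height = heights.reduce((sum, h) => sum + h, 0);
  // Assumed feature: number of completely filled rows.
  const lines = grid.filter((row) => row.length > 0 && row.every(Boolean)).length;
  // Assumed feature: empty cells with a filled cell somewhere above them.
  let holes = 0;
  heights.forEach((h, c) => {
    for (let r = rows - h; r < rows; r++) {
      if (!grid[r][c]) holes++;
    }
  });
  // Assumed feature: sum of height differences between neighboring columns.
  let bumpiness = 0;
  for (let c = 0; c + 1 < heights.length; c++) {
    bumpiness += Math.abs(heights[c] - heights[c + 1]);
  }
  return -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness;
}
```

A player built on this would evaluate the board resulting from each candidate placement (every column and rotation of the current piece) and pick the placement with the highest score.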
<div class="phase">
<div class="phase-header">
- <span class="phase-number">4</span>
- <h3>Test derivation</h3>
+ <span class="phase-number">6</span>
+ <h3>Game over</h3>
</div>
<p>
- 16 pass/fail results are derived from the recorded events. Each test is independent --
- a failure in one does not affect others. The bot never crashes on a single test failure.
+ Stack pieces to the top intentionally. Verify via grid reader (filled cells in top
+ rows), check if input stops working, look for game-over text in the DOM. Test:
+ <code>game_over</code>.
</p>
</div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">7</span>
+ <h3>Endurance</h3>
+ </div>
+ <p>
+ Play for 30 seconds with the AI player. Track console errors during gameplay (not
+ errors from page load). Test: <code>playable_30s</code>.
+ </p>
+ </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">8</span>
+ <h3>Competitive play (bug detection)</h3>
+ </div>
+ <p>
+ Play 60 seconds of competitive Tetris while watching for specific bugs. Each bug
+ check has three outcomes: pass (tested, works), fail (tested, broken), or skip (no
+ opportunity to test).
+ </p>
+ <ul>
+ <li><code>multi_line_clear</code> -- multiple complete rows clear simultaneously</li>
+ <li><code>score_scaling</code> -- score grows proportionally with multi-line clears</li>
+ <li><code>level_progression</code> -- level increases after 10+ lines cleared</li>
+ <li><code>speed_progression</code> -- drop speed increases with level</li>
+ <li><code>next_piece_preview</code> -- next piece display visible</li>
+ <li><code>game_over_display</code> -- game over message and restart option shown</li>
+ <li><code>counter_clockwise_rotation</code> -- Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
+ <li><code>soft_drop_distinct</code> -- Down arrow moves one row, distinct from hard drop</li>
+ <li><code>rendering_clean</code> -- pieces don't leave trails on the board</li>
+ </ul>
+ </div>
</div>
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Calibration cache</h3>
+ <p>
+ After the first successful calibration, the bot caches the start mechanism, controls, and
+ grid bounds. On every subsequent page reload (each phase that requires a fresh state),
+ the cached calibration is replayed instead of re-discovering everything. If the cache
+ fails to apply, the bot treats this as calibration drift, re-runs full discovery, and
+ flags the conflict in the report.
+ </p>
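The replay-or-rediscover behavior can be sketched as a small closure. This is an illustration under assumptions: `Calibration`, `createCalibrator`, and the `discover` and `apply` callbacks are hypothetical names, and the sketch is synchronous for brevity, whereas a Playwright-backed driver would be asynchronous.

```typescript
// Hypothetical sketch of the calibration cache: replay a cached calibration
// on reload, and fall back to full discovery (flagging drift) if it fails.
interface Calibration {
  startMechanism: string;
  controls: Record<string, string>;
  gridBounds: { x: number; y: number; width: number; height: number };
}

interface CalibrationResult {
  calibration: Calibration;
  drift: boolean; // true when the cache failed to apply and was rebuilt
}

function createCalibrator(
  discover: () => Calibration,
  apply: (cached: Calibration) => boolean
): () => CalibrationResult {
  let cached: Calibration | null = null;
  return function calibrate(): CalibrationResult {
    if (cached === null) {
      // First run: nothing to replay, so do full discovery.
      cached = discover();
      return { calibration: cached, drift: false };
    }
    if (apply(cached)) {
      // Cached calibration still works on the freshly loaded page.
      return { calibration: cached, drift: false };
    }
    // Cache no longer applies: rediscover and flag the conflict.
    cached = discover();
    return { calibration: cached, drift: true };
  };
}
```

The caller would invoke `calibrate()` after each page reload and copy the `drift` flag into the report.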
+
<div class="callout">
- No false positives: the grid reader must confirm state changes through pixel/DOM inspection.
- If grid detection fails entirely, the bot falls back to screenshot comparison and reports
- grid-dependent tests as <strong>INCONCLUSIVE</strong> rather than passed.
+ The bot uses both screenshot comparison AND DOM state inspection to verify game
+ responses. For canvas games this requires GPU access in headless Chromium. For DOM
+ games, comparing class names and inline styles works without a GPU.
</div>
</section>
@@ -289,20 +416,44 @@ import Base from "../layouts/Base.astro";
<h2>Known limitations</h2>
<ul class="limitations">
<li>
- <strong>Non-standard rendering.</strong> Games that draw to canvas without using standard
- 2D context methods (e.g., WebGL-only) may not have their grid detected.
+ <strong>Canvas games need GPU access.</strong> In headless Chromium without GPU,
+ <code>getImageData()</code> on the canvas 2D context returns all zeros, so the grid
+ reader can't see canvas content. DOM-rendered games work fine without a GPU.
+ </li>
+ <li>
+ <strong>Trail rendering bugs.</strong> Games that leave colored cells behind moving
+ pieces (rather than clearing them on each frame) confuse the grid reader. The bot
+ sees stale trail cells as "filled" and can't track active piece movement reliably.
+ </li>
+ <li>
+ <strong>Score detection.</strong> The bot looks for elements containing changing
+ numbers in known patterns. Games with unusual score displays (rendered to canvas,
+ scattered across multiple elements) may not have their score detected.
+ </li>
+ <li>
+ <strong>Wall kicks, T-spins, lock delay.</strong> The bot tests basic mechanics but
+ not advanced Tetris features. A game with broken wall kicks would still pass the
+ rotation test if rotation works in open space.
+ </li>
+ <li>
+ <strong>Bot skill caps detection.</strong> Some bug detection tests (multi_line_clear,
+ score_scaling, level_progression) require the bot to actually clear multiple lines
+ during play. If the AI player can't clear lines on a particular game, those tests
+ skip rather than fail.
</li>
<li>
- <strong>Score detection.</strong> The bot looks for elements containing "score" text or
- standalone numbers that change during play. Non-standard layouts can cause misdetection.
+ <strong>Game over masking.</strong> Games where the start screen looks like the game
+ itself (or where game-over state appears on load) can cause start detection issues.
+ The bot rejects "starts" that immediately game-over, but edge cases exist.
</li>
<li>
- <strong>Bot skill.</strong> The heuristic player tests mechanics, not mastery. It will
- miss edge cases that only appear at high speeds or with unusual piece sequences.
+ <strong>Hidden game elements.</strong> If a game has working logic but a CSS bug
+ hides the start button, the bot will report the game as broken because it can't get
+ past the start screen -- even though the underlying code is correct.
</li>
<li>
- <strong>SonarQube availability.</strong> SonarQube metrics only populate when the server
- is running locally. Runs evaluated without it will have empty SonarQube sections.
+ <strong>SonarQube availability.</strong> SonarQube metrics only populate when the
+ server is running locally. Runs evaluated without it will have empty SonarQube sections.
</li>
<li>
<strong>Single task.</strong> Currently only the Tetris task is evaluated. Results may
@@ -310,7 +461,8 @@ import Base from "../layouts/Base.astro";
</li>
<li>
<strong>Sample size.</strong> Statistical power depends on having enough runs. Small
- sweeps can produce noisy main effect estimates.
+ sweeps can produce noisy main effect estimates. The dashboard shows confidence
+ intervals everywhere they're computable.
</li>
</ul>
</section>