loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62
parent 7df3ddd793a69cba93ded966d634045f4810a5fc
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 10 Apr 2026 21:11:59 +0200

Update methodology page with current bot architecture

Major rewrite of the bot section to reflect the actual implementation:
- 8 conditional phases (was 4)
- 25 tests across mechanics, lifecycle, gameplay, game state, competitive
- Two-tier architecture (Driver + Bot separation)
- Discovery infrastructure: language-agnostic start, interactivity check,
  control discovery, calibration cache
- All 9 competitive play bug detection tests listed
- 60ms polling (was 150ms)
- Updated limitations: GPU requirement, trail bugs, game over masking,
  hidden elements, etc.
- Pierre Dellacherie attribution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M dashboard/src/pages/methodology.astro | 248 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 200 insertions(+), 48 deletions(-)

diff --git a/dashboard/src/pages/methodology.astro b/dashboard/src/pages/methodology.astro
@@ -25,7 +25,7 @@ import Base from "../layouts/Base.astro";
  human language, budget, and more. These are the grid axes. Each unique
  combination of values is a "cell" in the experiment matrix.
  </p>
- <p class="muted">16 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
+ <p class="muted">22 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
  </div>
  <div class="card">
  <h3>Outputs</h3>
@@ -65,26 +65,23 @@ import Base from "../layouts/Base.astro";
  <div class="card">
  <h3>Gameplay Bot <span class="weight">50%</span></h3>
  <p>
- 16 automated Playwright tests. The bot calibrates itself to each game -- it finds the
- grid, discovers controls, locates the start mechanism. Then it plays using a continuous
- 150ms polling loop that reads grid state directly from the canvas or DOM.
+ 25 automated Playwright tests across 8 conditional phases. The bot calibrates itself to
+ each game -- finds the grid, discovers controls, locates the start mechanism -- then
+ plays using a continuous 60ms polling loop that reads grid state directly from the
+ canvas or DOM.
  </p>
  <p>Tests cover:</p>
  <ul>
- <li>Game loads and starts</li>
- <li>Auto-drop (gravity works)</li>
- <li>Movement (left, right, down)</li>
- <li>Rotation</li>
- <li>Hard drop</li>
- <li>Piece locking and new piece spawning</li>
- <li>Line clearing</li>
- <li>Score changes</li>
- <li>Game over detection</li>
- <li>30-second endurance (no crashes or freezes)</li>
+ <li><strong>Mechanics (1-9):</strong> game loads, game starts, auto-drop, movement (left/right/down), rotation, hard drop, all-piece-types rotation</li>
+ <li><strong>Piece lifecycle (10-12):</strong> piece locks, new piece spawns, multiple pieces placed</li>
+ <li><strong>Gameplay (13-14):</strong> line clear, score changes</li>
+ <li><strong>Game state (15-16):</strong> game over detection, 30-second endurance</li>
+ <li><strong>Competitive play bug detection (17-25):</strong> multi-line clear, score scaling, level progression, speed progression, next piece preview, game over display, counter-clockwise rotation, soft drop distinct from hard drop, rendering trail detection</li>
  </ul>
  <p class="muted">
- Every test is pure deterministic observation. The bot reads pixels or DOM state,
- presses keys, and checks if the game responded correctly.
+ Every test is deterministic observation. The bot reads pixels or DOM state, presses
+ keys, and checks if the game responded correctly. Tests in later phases run only if
+ earlier phases succeeded -- no false positives from cascading failures.
  </p>
  </div>
  <div class="card">
@@ -219,68 +216,198 @@ import Base from "../layouts/Base.astro";
  <h2>How the gameplay bot works</h2>
  <p>
  The bot is a Playwright script that loads any Tetris implementation and figures out
- how to interact with it. It handles different renderers, control schemes, and start
- mechanisms without prior knowledge of the implementation.
+ how to interact with it. It handles different renderers (canvas, DOM, SVG, WebGL),
+ control schemes, languages, and start mechanisms without prior knowledge of the
+ implementation. No text matching of any kind -- everything is discovered through
+ observation.
+ </p>
+
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Two-tier architecture</h3>
+ <p>
+ The bot is split into two layers:
+ </p>
+ <div class="two-col">
+ <div class="card">
+ <h3>Driver</h3>
+ <p>
+ Abstracts the webpage. Handles grid detection, pixel sampling, button discovery,
+ keyboard input, screenshots, calibration caching. The driver knows about Playwright
+ and the DOM, but knows nothing about Tetris.
+ </p>
+ </div>
+ <div class="card">
+ <h3>Bot</h3>
+ <p>
+ Knows Tetris. Runs the 8 phases, derives the 25 test results, plays the game using
+ Pierre Dellacherie's heuristic. Never imports Playwright -- only talks to the driver
+ through a typed interface.
+ </p>
+ </div>
+ </div>
+
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Discovery infrastructure</h3>
+ <div class="three-col">
+ <div class="card">
+ <h3>Language-agnostic start detection</h3>
+ <p>
+ Buttons are found by structural properties (cursor:pointer, size, contrast) -- never
+ by text. Tries auto-start, then DOM buttons in order of prominence, then keyboard
+ triggers, then canvas clicks.
+ </p>
+ </div>
+ <div class="card">
+ <h3>Interactivity verification</h3>
+ <p>
+ After every start attempt, the bot verifies the game actually responds to gameplay
+ inputs (ArrowLeft/Right) by checking screenshot AND DOM state changes. Rejects false
+ positives like Pause buttons. Rejects games that immediately game-over.
+ </p>
+ </div>
+ <div class="card">
+ <h3>Control discovery</h3>
+ <p>
+ Each control key is probed against candidate lists. Classifies behavior by observing
+ grid deltas: did the piece move 1 column left? Teleport to bottom (hard drop)? Rotate?
+ Catches games where ArrowDown is hard drop and Space is pause.
+ </p>
+ </div>
+ </div>
+
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Eight conditional phases</h3>
+ <p>
+ Each phase only runs if the previous phase succeeded. Failed prerequisites mark
+ downstream tests as <code>skipped</code> rather than failed -- so the bot never produces
+ false positives or false negatives from cascading failures.
  </p>
  <div class="phases">
  <div class="phase">
  <div class="phase-header">
  <span class="phase-number">1</span>
- <h3>Calibration</h3>
+ <h3>Page load</h3>
  </div>
  <p>
- Detect the game grid (canvas, DOM, or SVG). Find controls by trying arrow keys,
- WASD, and checking if pieces move. Locate the start mechanism (auto-start, button,
- keypress). Find the score display element.
+ Navigate to the game URL. Survey the page: is there a canvas, a DOM grid, an
+ overlay, clickable elements? Capture console errors. Test: <code>game_loads</code>.
  </p>
  </div>
  <div class="phase">
  <div class="phase-header">
  <span class="phase-number">2</span>
- <h3>Observation</h3>
+ <h3>Start detection</h3>
  </div>
  <p>
- Continuous 150ms polling loop reads the grid state as a 10x20 boolean matrix.
- For canvas games, this means sampling one pixel per cell and checking against a
- calibrated background threshold. All state changes are recorded as events.
+ Discover candidates (auto-start, buttons, keyboard, canvas clicks). Try each, verify
+ with interactivity check, commit only when the game actually responds to gameplay
+ inputs. Tests: <code>game_starts</code>, <code>auto_drop</code>.
  </p>
  </div>
  <div class="phase">
  <div class="phase-header">
  <span class="phase-number">3</span>
- <h3>AI play</h3>
+ <h3>Mechanics</h3>
+ </div>
+ <p>
+ Test each control: left, right, down, rotate, hard drop. Read the grid before and
+ after each key press to verify the expected change. Tests: <code>move_left</code>,
+ <code>move_right</code>, <code>move_down</code>, <code>rotate</code>,
+ <code>hard_drop</code>, <code>all_pieces_rotate</code>.
+ </p>
+ </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">4</span>
+ <h3>Piece lifecycle</h3>
+ </div>
+ <p>
+ Verify pieces lock at the bottom, new pieces spawn at the top, and the game produces
+ multiple distinct pieces over time. Tests: <code>piece_locks</code>,
+ <code>new_piece_spawns</code>, <code>multiple_pieces</code>.
+ </p>
+ </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">5</span>
+ <h3>Gameplay</h3>
  </div>
  <p>
- Pierre Dellacherie's 4-heuristic algorithm (2003) evaluates all possible placements for each piece:
+ Play 60 pieces or 45 seconds (whichever comes first) using the AI player. Track
+ score and lines cleared during play. Tests: <code>line_clear</code>,
+ <code>score_changes</code>.
  </p>
  <pre><code>score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness</code></pre>
  <p class="muted">
- Weights from genetic algorithm optimization by Colin Fahey. Reference implementation:
- <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT license).
- The bot is a strong Tetris player -- the original algorithm can clear thousands of lines without losing.
- We use it to exercise game mechanics and trigger events like multi-line clears for bug detection.
+ Pierre Dellacherie's 4-heuristic algorithm (2003), with weights from Colin Fahey's
+ genetic algorithm optimization. Reference implementation:
+ <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT).
+ The original algorithm can clear thousands of lines without losing.
  </p>
  </div>
  <div class="phase">
  <div class="phase-header">
- <span class="phase-number">4</span>
- <h3>Test derivation</h3>
+ <span class="phase-number">6</span>
+ <h3>Game over</h3>
  </div>
  <p>
- 16 pass/fail results are derived from the recorded events. Each test is independent --
- a failure in one does not affect others. The bot never crashes on a single test failure.
+ Stack pieces to the top intentionally. Verify via grid reader (filled cells in top
+ rows), check if input stops working, look for game-over text in the DOM. Test:
+ <code>game_over</code>.
  </p>
  </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">7</span>
+ <h3>Endurance</h3>
+ </div>
+ <p>
+ Play for 30 seconds with the AI player. Track console errors during gameplay (not
+ errors from page load). Test: <code>playable_30s</code>.
+ </p>
+ </div>
+
+ <div class="phase">
+ <div class="phase-header">
+ <span class="phase-number">8</span>
+ <h3>Competitive play (bug detection)</h3>
+ </div>
+ <p>
+ Play 60 seconds of competitive Tetris while watching for specific bugs. Each bug
+ check has three outcomes: pass (tested, works), fail (tested, broken), or skip (no
+ opportunity to test).
+ </p>
+ <ul>
+ <li><code>multi_line_clear</code> -- multiple complete rows clear simultaneously</li>
+ <li><code>score_scaling</code> -- score grows proportionally with multi-line clears</li>
+ <li><code>level_progression</code> -- level increases after 10+ lines cleared</li>
+ <li><code>speed_progression</code> -- drop speed increases with level</li>
+ <li><code>next_piece_preview</code> -- next piece display visible</li>
+ <li><code>game_over_display</code> -- game over message and restart option shown</li>
+ <li><code>counter_clockwise_rotation</code> -- Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
+ <li><code>soft_drop_distinct</code> -- Down arrow moves one row, distinct from hard drop</li>
+ <li><code>rendering_clean</code> -- pieces don't leave trails on the board</li>
+ </ul>
+ </div>
  </div>
+ <h3 style="margin-top: 24px; margin-bottom: 12px;">Calibration cache</h3>
+ <p>
+ After the first successful calibration, the bot caches the start mechanism, controls, and
+ grid bounds. On every subsequent page reload (each phase that requires a fresh state),
+ the cached calibration is replayed instead of re-discovering everything. If the cache
+ fails to apply, the bot detects calibration drift and re-runs full discovery, flagging
+ the conflict in the report.
+ </p>
+
  <div class="callout">
- No false positives: the grid reader must confirm state changes through pixel/DOM inspection.
- If grid detection fails entirely, the bot falls back to screenshot comparison and reports
- grid-dependent tests as <strong>INCONCLUSIVE</strong> rather than passed.
+ The bot uses both screenshot comparison AND DOM state inspection to verify game
+ responses. For canvas games this requires GPU access in headless Chromium. For DOM
+ games, comparing class names and inline styles works without a GPU.
  </div>
  </section>
@@ -289,20 +416,44 @@ import Base from "../layouts/Base.astro";
  <h2>Known limitations</h2>
  <ul class="limitations">
  <li>
- <strong>Non-standard rendering.</strong> Games that draw to canvas without using standard
- 2D context methods (e.g., WebGL-only) may not have their grid detected.
+ <strong>Canvas games need GPU access.</strong> In headless Chromium without GPU,
+ <code>canvas.getImageData()</code> returns all zeros, so the grid reader can't see
+ canvas content. DOM-rendered games work fine without a GPU.
+ </li>
+ <li>
+ <strong>Trail rendering bugs.</strong> Games that leave colored cells behind moving
+ pieces (rather than clearing them on each frame) confuse the grid reader. The bot
+ sees stale trail cells as "filled" and can't track active piece movement reliably.
+ </li>
+ <li>
+ <strong>Score detection.</strong> The bot looks for elements containing changing
+ numbers in known patterns. Games with unusual score displays (rendered to canvas,
+ scattered across multiple elements) may not have their score detected.
+ </li>
+ <li>
+ <strong>Wall kicks, T-spins, lock delay.</strong> The bot tests basic mechanics but
+ not advanced Tetris features. A game with broken wall kicks would still pass the
+ rotation test if rotation works in open space.
+ </li>
+ <li>
+ <strong>Bot skill caps detection.</strong> Some bug detection tests (multi_line_clear,
+ score_scaling, level_progression) require the bot to actually clear multiple lines
+ during play. If the AI player can't clear lines on a particular game, those tests
+ skip rather than fail.
  </li>
  <li>
- <strong>Score detection.</strong> The bot looks for elements containing "score" text or
- standalone numbers that change during play. Non-standard layouts can cause misdetection.
+ <strong>Game over masking.</strong> Games where the start screen looks like the game
+ itself (or where game-over state appears on load) can cause start detection issues.
+ The bot rejects "starts" that immediately game-over, but edge cases exist.
  </li>
  <li>
- <strong>Bot skill.</strong> The heuristic player tests mechanics, not mastery. It will
- miss edge cases that only appear at high speeds or with unusual piece sequences.
+ <strong>Hidden game elements.</strong> If a game has working logic but a CSS bug
+ hides the start button, the bot will report the game as broken because it can't get
+ past the start screen -- even though the underlying code is correct.
  </li>
  <li>
- <strong>SonarQube availability.</strong> SonarQube metrics only populate when the server
- is running locally. Runs evaluated without it will have empty SonarQube sections.
+ <strong>SonarQube availability.</strong> SonarQube metrics only populate when the
+ server is running locally. Runs evaluated without it will have empty SonarQube sections.
  </li>
  <li>
  <strong>Single task.</strong> Currently only the Tetris task is evaluated. Results may
@@ -310,7 +461,8 @@ import Base from "../layouts/Base.astro";
  </li>
  <li>
  <strong>Sample size.</strong> Statistical power depends on having enough runs. Small
- sweeps can produce noisy main effect estimates.
+ sweeps can produce noisy main effect estimates. The dashboard shows confidence
+ intervals everywhere they're computable.
  </li>
  </ul>
  </section>
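
Note on the scoring line in the diff above: the page gives the formula
score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness
but does not define the four terms. The TypeScript sketch below shows one way such a score could be computed from a boolean grid. It is not taken from the repository. The names Grid, columnHeights and evaluateBoard are hypothetical, and the feature definitions (summed column heights, count of complete rows, empty cells under a column's top, summed height differences between neighbouring columns) are assumptions chosen to match the common reading of this heuristic.

```typescript
// Illustrative sketch only; none of these names come from the repository.
type Grid = boolean[][]; // rows listed top to bottom, true = filled cell

// Height of each column: rows from its topmost filled cell down to the floor.
function columnHeights(grid: Grid): number[] {
  const rows = grid.length;
  const cols = rows > 0 ? grid[0].length : 0;
  const heights: number[] = new Array(cols).fill(0);
  for (let c = 0; c < cols; c++) {
    for (let r = 0; r < rows; r++) {
      if (grid[r][c]) {
        heights[c] = rows - r;
        break;
      }
    }
  }
  return heights;
}

// Score a board with the weights quoted on the methodology page.
function evaluateBoard(grid: Grid): number {
  const rows = grid.length;
  const heights = columnHeights(grid);

  // Assumed definition: "height" is the sum of all column heights.
  const height = heights.reduce((sum, h) => sum + h, 0);

  // Assumed definition: "lines" is the number of completely filled rows.
  const lines = grid.filter((row) => row.length > 0 && row.every(Boolean)).length;

  // Assumed definition: a hole is an empty cell below the top of its column.
  let holes = 0;
  heights.forEach((h, c) => {
    for (let r = rows - h; r < rows; r++) {
      if (!grid[r][c]) holes++;
    }
  });

  // Assumed definition: bumpiness sums height differences of adjacent columns.
  let bumpiness = 0;
  for (let c = 0; c + 1 < heights.length; c++) {
    bumpiness += Math.abs(heights[c] - heights[c + 1]);
  }

  return -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness;
}
```

A player built on this would simulate every rotation and column for the current piece, call evaluateBoard on each resulting grid, and pick the placement with the highest score. The placement search itself is omitted here.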
