Update methodology page with current bot architecture - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62
parent 7df3ddd793a69cba93ded966d634045f4810a5fc
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 10 Apr 2026 21:11:59 +0200

Update methodology page with current bot architecture

Major rewrite of the bot section to reflect the actual implementation:
- 8 conditional phases (was 4)
- 25 tests across mechanics, lifecycle, gameplay, game state, competitive
- Two-tier architecture (Driver + Bot separation)
- Discovery infrastructure: language-agnostic start, interactivity check,
  control discovery, calibration cache
- All 9 competitive play bug detection tests listed
- 60ms polling (was 150ms)
- Updated limitations: GPU requirement, trail bugs, game over masking,
  hidden elements, etc.
- Pierre Dellacherie attribution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M dashboard/src/pages/methodology.astro  | 248 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------

1 file changed, 200 insertions(+), 48 deletions(-)
diff --git a/dashboard/src/pages/methodology.astro b/dashboard/src/pages/methodology.astro
@@ -25,7 +25,7 @@ import Base from "../layouts/Base.astro";
             human language, budget, and more. These are the grid axes. Each unique combination of values
             is a "cell" in the experiment matrix.
           </p>
-          <p class="muted">16 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
+          <p class="muted">22 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
         </div>
         <div class="card">
           <h3>Outputs</h3>
@@ -65,26 +65,23 @@ import Base from "../layouts/Base.astro";
         <div class="card">
           <h3>Gameplay Bot <span class="weight">50%</span></h3>
           <p>
-            16 automated Playwright tests. The bot calibrates itself to each game -- it finds the
-            grid, discovers controls, locates the start mechanism. Then it plays using a continuous
-            150ms polling loop that reads grid state directly from the canvas or DOM.
+            25 automated Playwright tests across 8 conditional phases. The bot calibrates itself to
+            each game -- finds the grid, discovers controls, locates the start mechanism -- then
+            plays using a continuous 60ms polling loop that reads grid state directly from the
+            canvas or DOM.
           </p>
           <p>Tests cover:</p>
           <ul>
-            <li>Game loads and starts</li>
-            <li>Auto-drop (gravity works)</li>
-            <li>Movement (left, right, down)</li>
-            <li>Rotation</li>
-            <li>Hard drop</li>
-            <li>Piece locking and new piece spawning</li>
-            <li>Line clearing</li>
-            <li>Score changes</li>
-            <li>Game over detection</li>
-            <li>30-second endurance (no crashes or freezes)</li>
+            <li><strong>Mechanics (1-9):</strong> game loads, game starts, auto-drop, movement (left/right/down), rotation, hard drop, all-piece-types rotation</li>
+            <li><strong>Piece lifecycle (10-12):</strong> piece locks, new piece spawns, multiple pieces placed</li>
+            <li><strong>Gameplay (13-14):</strong> line clear, score changes</li>
+            <li><strong>Game state (15-16):</strong> game over detection, 30-second endurance</li>
+            <li><strong>Competitive play bug detection (17-25):</strong> multi-line clear, score scaling, level progression, speed progression, next piece preview, game over display, counter-clockwise rotation, soft drop distinct from hard drop, rendering trail detection</li>
           </ul>
           <p class="muted">
-            Every test is pure deterministic observation. The bot reads pixels or DOM state,
-            presses keys, and checks if the game responded correctly.
+            Every test is deterministic observation. The bot reads pixels or DOM state, presses
+            keys, and checks whether the game responded correctly. Tests in later phases run only
+            if earlier phases succeeded; otherwise they are skipped, not failed.
           </p>
         </div>
         <div class="card">
@@ -219,68 +216,198 @@ import Base from "../layouts/Base.astro";
       <h2>How the gameplay bot works</h2>
       <p>
         The bot is a Playwright script that loads any Tetris implementation and figures out
-        how to interact with it. It handles different renderers, control schemes, and start
-        mechanisms without prior knowledge of the implementation.
+        how to interact with it. It handles different renderers (canvas, DOM, SVG, WebGL),
+        control schemes, languages, and start mechanisms without prior knowledge of the
+        implementation. Start mechanisms and controls are discovered through observation,
+        never by matching text.
+      </p>
+
+      <h3 style="margin-top: 24px; margin-bottom: 12px;">Two-tier architecture</h3>
+      <p>
+        The bot is split into two layers:
+      </p>
+      <div class="two-col">
+        <div class="card">
+          <h3>Driver</h3>
+          <p>
+            Abstracts the webpage. Handles grid detection, pixel sampling, button discovery,
+            keyboard input, screenshots, calibration caching. The driver knows about Playwright
+            and the DOM, but knows nothing about Tetris.
+          </p>
+        </div>
+        <div class="card">
+          <h3>Bot</h3>
+          <p>
+            Knows Tetris. Runs the 8 phases, derives the 25 test results, plays the game using
+            Pierre Dellacherie's heuristic. Never imports Playwright -- only talks to the driver
+            through a typed interface.
+          </p>
+        </div>
+      </div>
+
+      <h3 style="margin-top: 24px; margin-bottom: 12px;">Discovery infrastructure</h3>
+      <div class="three-col">
+        <div class="card">
+          <h3>Language-agnostic start detection</h3>
+          <p>
+            Buttons are found by structural properties (cursor:pointer, size, contrast) -- never
+            by text. Tries auto-start, then DOM buttons in order of prominence, then keyboard
+            triggers, then canvas clicks.
+          </p>
+        </div>
+        <div class="card">
+          <h3>Interactivity verification</h3>
+          <p>
+            After every start attempt, the bot verifies the game actually responds to gameplay
+            inputs (ArrowLeft/Right) by checking screenshot AND DOM state changes. Rejects false
+            positives like Pause buttons. Rejects games that immediately game-over.
+          </p>
+        </div>
+        <div class="card">
+          <h3>Control discovery</h3>
+          <p>
+            Each control is probed against a list of candidate keys. The bot classifies each key
+            by the grid delta it causes: one column left? Teleport to the bottom (hard drop)?
+            Rotate? This catches games where ArrowDown is hard drop and Space is pause.
+          </p>
+        </div>
+      </div>
+
+      <h3 style="margin-top: 24px; margin-bottom: 12px;">Eight conditional phases</h3>
+      <p>
+        Each phase runs only if the previous phase succeeded. Failed prerequisites mark
+        downstream tests as <code>skipped</code> rather than failed, so one early failure
+        cannot cascade into misleading pass or fail results downstream.
       </p>
 
       <div class="phases">
         <div class="phase">
           <div class="phase-header">
             <span class="phase-number">1</span>
-            <h3>Calibration</h3>
+            <h3>Page load</h3>
           </div>
           <p>
-            Detect the game grid (canvas, DOM, or SVG). Find controls by trying arrow keys,
-            WASD, and checking if pieces move. Locate the start mechanism (auto-start, button,
-            keypress). Find the score display element.
+            Navigate to the game URL. Survey the page: is there a canvas, a DOM grid, an
+            overlay, clickable elements? Capture console errors. Test: <code>game_loads</code>.
           </p>
         </div>
 
         <div class="phase">
           <div class="phase-header">
             <span class="phase-number">2</span>
-            <h3>Observation</h3>
+            <h3>Start detection</h3>
           </div>
           <p>
-            Continuous 150ms polling loop reads the grid state as a 10x20 boolean matrix.
-            For canvas games, this means sampling one pixel per cell and checking against a
-            calibrated background threshold. All state changes are recorded as events.
+            Discover candidates (auto-start, buttons, keyboard, canvas clicks). Try each, verify
+            with interactivity check, commit only when the game actually responds to gameplay
+            inputs. Tests: <code>game_starts</code>, <code>auto_drop</code>.
           </p>
         </div>
 
         <div class="phase">
           <div class="phase-header">
             <span class="phase-number">3</span>
-            <h3>AI play</h3>
+            <h3>Mechanics</h3>
+          </div>
+          <p>
+            Test each control: left, right, down, rotate, hard drop. Read the grid before and
+            after each key press to verify the expected change. Tests: <code>move_left</code>,
+            <code>move_right</code>, <code>move_down</code>, <code>rotate</code>,
+            <code>hard_drop</code>, <code>all_pieces_rotate</code>.
+          </p>
+        </div>
+
+        <div class="phase">
+          <div class="phase-header">
+            <span class="phase-number">4</span>
+            <h3>Piece lifecycle</h3>
+          </div>
+          <p>
+            Verify pieces lock at the bottom, new pieces spawn at the top, and the game produces
+            multiple distinct pieces over time. Tests: <code>piece_locks</code>,
+            <code>new_piece_spawns</code>, <code>multiple_pieces</code>.
+          </p>
+        </div>
+
+        <div class="phase">
+          <div class="phase-header">
+            <span class="phase-number">5</span>
+            <h3>Gameplay</h3>
           </div>
           <p>
-            Pierre Dellacherie's 4-heuristic algorithm (2003) evaluates all possible placements for each piece:
+            Play 60 pieces or 45 seconds (whichever comes first) using the AI player. Track
+            score and lines cleared during play. Tests: <code>line_clear</code>,
+            <code>score_changes</code>.
           </p>
           <pre><code>score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness</code></pre>
           <p class="muted">
-            Weights from genetic algorithm optimization by Colin Fahey. Reference implementation:
-            <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT license).
-            The bot is a strong Tetris player -- the original algorithm can clear thousands of lines without losing.
-            We use it to exercise game mechanics and trigger events like multi-line clears for bug detection.
+            Pierre Dellacherie's 4-heuristic algorithm (2003), with weights from Colin Fahey's
+            genetic algorithm optimization. Reference implementation:
+            <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT).
+            The original algorithm can clear thousands of lines without losing.
           </p>
         </div>
 
         <div class="phase">
           <div class="phase-header">
-            <span class="phase-number">4</span>
-            <h3>Test derivation</h3>
+            <span class="phase-number">6</span>
+            <h3>Game over</h3>
           </div>
           <p>
-            16 pass/fail results are derived from the recorded events. Each test is independent --
-            a failure in one does not affect others. The bot never crashes on a single test failure.
+            Stack pieces to the top intentionally. Verify via grid reader (filled cells in top
+            rows), check if input stops working, look for game-over text in the DOM. Test:
+            <code>game_over</code>.
           </p>
         </div>
+
+        <div class="phase">
+          <div class="phase-header">
+            <span class="phase-number">7</span>
+            <h3>Endurance</h3>
+          </div>
+          <p>
+            Play for 30 seconds with the AI player. Track console errors during gameplay (not
+            errors from page load). Test: <code>playable_30s</code>.
+          </p>
+        </div>
+
+        <div class="phase">
+          <div class="phase-header">
+            <span class="phase-number">8</span>
+            <h3>Competitive play (bug detection)</h3>
+          </div>
+          <p>
+            Play 60 seconds of competitive Tetris while watching for specific bugs. Each bug
+            check has three outcomes: pass (tested, works), fail (tested, broken), or skip (no
+            opportunity to test).
+          </p>
+          <ul>
+            <li><code>multi_line_clear</code> -- multiple complete rows clear simultaneously</li>
+            <li><code>score_scaling</code> -- score grows proportionally with multi-line clears</li>
+            <li><code>level_progression</code> -- level increases after 10+ lines cleared</li>
+            <li><code>speed_progression</code> -- drop speed increases with level</li>
+            <li><code>next_piece_preview</code> -- next piece display visible</li>
+            <li><code>game_over_display</code> -- game over message and restart option shown</li>
+            <li><code>counter_clockwise_rotation</code> -- Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
+            <li><code>soft_drop_distinct</code> -- Down arrow moves one row, distinct from hard drop</li>
+            <li><code>rendering_clean</code> -- pieces don't leave trails on the board</li>
+          </ul>
+        </div>
       </div>
 
+      <h3 style="margin-top: 24px; margin-bottom: 12px;">Calibration cache</h3>
+      <p>
+        After the first successful calibration, the bot caches the start mechanism, controls, and
+        grid bounds. On every subsequent page reload (each phase that requires a fresh state),
+        the cached calibration is replayed instead of re-discovering everything. If the cache
+        fails to apply, the bot detects calibration drift and re-runs full discovery, flagging
+        the conflict in the report.
+      </p>
+
       <div class="callout">
-        No false positives: the grid reader must confirm state changes through pixel/DOM inspection.
-        If grid detection fails entirely, the bot falls back to screenshot comparison and reports
-        grid-dependent tests as <strong>INCONCLUSIVE</strong> rather than passed.
+        The bot uses both screenshot comparison AND DOM state inspection to verify game
+        responses. For canvas games this requires GPU access in headless Chromium. For DOM
+        games, comparing class names and inline styles works without a GPU.
       </div>
     </section>
 
@@ -289,20 +416,44 @@ import Base from "../layouts/Base.astro";
       <h2>Known limitations</h2>
       <ul class="limitations">
         <li>
-          <strong>Non-standard rendering.</strong> Games that draw to canvas without using standard
-          2D context methods (e.g., WebGL-only) may not have their grid detected.
+          <strong>Canvas games need GPU access.</strong> In headless Chromium without GPU,
+          <code>getImageData()</code> on the canvas 2D context returns all zeros, so the grid
+          reader can't see canvas content. DOM-rendered games work fine without a GPU.
+        </li>
+        <li>
+          <strong>Trail rendering bugs.</strong> Games that leave colored cells behind moving
+          pieces (rather than clearing them on each frame) confuse the grid reader. The bot
+          sees stale trail cells as "filled" and can't track active piece movement reliably.
+        </li>
+        <li>
+          <strong>Score detection.</strong> The bot looks for elements containing changing
+          numbers in known patterns. Games with unusual score displays (rendered to canvas,
+          scattered across multiple elements) may not have their score detected.
+        </li>
+        <li>
+          <strong>Wall kicks, T-spins, lock delay.</strong> The bot tests basic mechanics but
+          not advanced Tetris features. A game with broken wall kicks would still pass the
+          rotation test if rotation works in open space.
+        </li>
+        <li>
+          <strong>Bot skill caps detection.</strong> Some bug detection tests (multi_line_clear,
+          score_scaling, level_progression) require the bot to actually clear multiple lines
+          during play. If the AI player can't clear lines on a particular game, those tests
+          skip rather than fail.
         </li>
         <li>
-          <strong>Score detection.</strong> The bot looks for elements containing "score" text or
-          standalone numbers that change during play. Non-standard layouts can cause misdetection.
+          <strong>Game over masking.</strong> Games where the start screen looks like the game
+          itself (or where game-over state appears on load) can cause start detection issues.
+          The bot rejects "starts" that immediately game-over, but edge cases exist.
         </li>
         <li>
-          <strong>Bot skill.</strong> The heuristic player tests mechanics, not mastery. It will
-          miss edge cases that only appear at high speeds or with unusual piece sequences.
+          <strong>Hidden game elements.</strong> If a game has working logic but a CSS bug
+          hides the start button, the bot will report the game as broken because it can't get
+          past the start screen -- even though the underlying code is correct.
         </li>
         <li>
-          <strong>SonarQube availability.</strong> SonarQube metrics only populate when the server
-          is running locally. Runs evaluated without it will have empty SonarQube sections.
+          <strong>SonarQube availability.</strong> SonarQube metrics only populate when the
+          server is running locally. Runs evaluated without it will have empty SonarQube sections.
         </li>
         <li>
           <strong>Single task.</strong> Currently only the Tetris task is evaluated. Results may
@@ -310,7 +461,8 @@ import Base from "../layouts/Base.astro";
         </li>
         <li>
           <strong>Sample size.</strong> Statistical power depends on having enough runs. Small
-          sweeps can produce noisy main effect estimates.
+          sweeps can produce noisy main effect estimates. The dashboard shows confidence
+          intervals everywhere they're computable.
         </li>
       </ul>
     </section>
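
Note on the phase 5 formula: the placement score quoted in the diff (`score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness`) could be computed roughly as below. This is a minimal sketch, not the bot's code. The `Grid` type, the function names, and the exact feature definitions (aggregate column height, full rows counted before clearing, covered empty cells as holes, summed adjacent height differences as bumpiness) are assumptions for illustration; only the four weights come from the diff.

```typescript
// Hypothetical types and names; only the four weights are taken from the diff.
type Grid = boolean[][]; // rows listed top to bottom, true = filled cell

interface Features {
  height: number;    // sum of all column heights (assumed meaning of "height")
  lines: number;     // number of completely filled rows
  holes: number;     // empty cells with a filled cell somewhere above them
  bumpiness: number; // sum of absolute height differences between neighbours
}

// Height of each column, measured from the floor to its topmost filled cell.
function columnHeights(grid: Grid): number[] {
  const rows = grid.length;
  const cols = rows > 0 ? grid[0].length : 0;
  const heights: number[] = [];
  for (let c = 0; c < cols; c++) {
    let h = 0;
    for (let r = 0; r < rows; r++) {
      if (grid[r][c]) {
        h = rows - r;
        break;
      }
    }
    heights.push(h);
  }
  return heights;
}

function extractFeatures(grid: Grid): Features {
  const rows = grid.length;
  const heights = columnHeights(grid);
  const height = heights.reduce((sum, h) => sum + h, 0);
  const lines = grid.filter((row) => row.length > 0 && row.every(Boolean)).length;

  let holes = 0;
  heights.forEach((h, c) => {
    // Scan from the column's topmost filled cell down to the floor.
    for (let r = rows - h; r < rows; r++) {
      if (!grid[r][c]) holes++;
    }
  });

  let bumpiness = 0;
  for (let c = 0; c + 1 < heights.length; c++) {
    bumpiness += Math.abs(heights[c] - heights[c + 1]);
  }
  return { height, lines, holes, bumpiness };
}

// Score a grid as it would look after a candidate placement; higher is better.
function scorePlacement(grid: Grid): number {
  const f = extractFeatures(grid);
  return -0.51 * f.height + 0.76 * f.lines - 0.36 * f.holes - 0.18 * f.bumpiness;
}
```

A player built on this would simulate each candidate column and rotation for the current piece, score the resulting grid, and play the highest-scoring placement.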

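Note on the conditional phases: the rule described in the diff (a phase runs only if the previous one succeeded, and tests behind a failed prerequisite are marked `skipped`) could look roughly like this. It is a sketch under stated assumptions: the `Phase` shape and `runPhases` are hypothetical, phases are synchronous here for brevity, and a phase is treated as succeeded only when all of its tests pass, which may be stricter than what the bot actually does.

```typescript
type Outcome = "pass" | "fail" | "skipped";

// Hypothetical phase shape: a name, the tests it owns, and a runner that
// reports true/false per test name.
interface Phase {
  name: string;
  tests: string[];
  run: () => Record<string, boolean>;
}

function runPhases(phases: Phase[]): Record<string, Outcome> {
  const results: Record<string, Outcome> = {};
  let prerequisitesMet = true;

  for (const phase of phases) {
    if (!prerequisitesMet) {
      // An earlier phase failed: do not run this phase at all, and record
      // its tests as skipped instead of failed.
      for (const test of phase.tests) results[test] = "skipped";
      continue;
    }
    const observed = phase.run();
    for (const test of phase.tests) {
      results[test] = observed[test] === true ? "pass" : "fail";
    }
    // Assumption: a phase succeeds only if every one of its tests passed.
    prerequisitesMet = phase.tests.every((test) => results[test] === "pass");
  }
  return results;
}
```

With a passing page-load phase and a failing start phase, every test from phase 3 onward is reported as `skipped` and the runners for those phases are never called.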
	git clone https://git.shiptheloop.com/loop-benchmarking.git