methodology.astro - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

methodology.astro (23441B)
      1 ---
      2 import Base from "../layouts/Base.astro";
      3 ---
      4 
      5 <Base title="Methodology">
      6   <h1 style="margin-bottom: 8px;">Methodology</h1>
      7   <p style="color: var(--text-muted); margin-bottom: 32px; font-size: 0.875rem;">
      8     How the benchmark works, what it measures, and why.
      9   </p>
     10 
     11   <div class="methodology">
     12 
     13     <!-- Framework -->
     14     <section class="method-section">
     15       <h2>Framework</h2>
     16       <p>
     17         The benchmark separates three concepts: what goes in, what comes out, and whether it works.
     18       </p>
     19 
     20       <div class="three-col">
     21         <div class="card">
     22           <h3>Inputs</h3>
     23           <p>
     24             The experiment variables: model, effort level, tools, prompt style, programming language,
     25             human language, budget, and more. These are the grid axes. Each unique combination of values
     26             is a "cell" in the experiment matrix.
     27           </p>
     28           <p class="muted">22 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
     29         </div>
     30         <div class="card">
     31           <h3>Outputs</h3>
     32           <p>
     33             Measures of <em>how</em> the code was built. Code quality, structural integrity, agent
     34             efficiency, complexity metrics. These tell you about the process and the codebase, not
     35             whether the end product works.
     36           </p>
     37           <p class="muted">Tracked and displayed, but not in the headline score.</p>
     38         </div>
     39         <div class="card">
     40           <h3>Outcomes</h3>
     41           <p>
     42             Measures of <em>what</em> was delivered. Does the game load? Do the controls work?
     43             Does it clear lines? Can you play for 30 seconds without crashing? These tell you
     44             whether the agent succeeded at the task.
     45           </p>
     46           <p class="muted">This is the headline score.</p>
     47         </div>
     48       </div>
     49 
     50       <div class="callout">
     51         The headline score is <strong>50% gameplay + 50% code quality</strong> (outcomes only).
     52         Output metrics are tracked separately so you can see the full picture without them
     53         distorting the primary question: did it work?
     54       </div>
     55     </section>
     56 
     57     <!-- Scoring -->
     58     <section class="method-section">
     59       <h2>How scoring works</h2>
     60       <p>
     61         All evaluation is deterministic code. No LLM grading. The agent never sees the test suite.
     62       </p>
     63 
     64       <div class="two-col">
     65         <div class="card">
     66           <h3>Gameplay Bot <span class="weight">50%</span></h3>
     67           <p>
     68             25 automated Playwright tests across 8 conditional phases. The bot calibrates itself to
     69             each game (finds the grid, discovers controls, locates the start mechanism), then
     70             plays using a continuous 60ms polling loop that reads grid state directly from the
     71             canvas or DOM.
     72           </p>
     73           <p>Tests cover:</p>
     74           <ul>
     75             <li><strong>Mechanics (1-9):</strong> game loads, game starts, auto-drop, movement (left/right/down), rotation, hard drop, all-piece-types rotation</li>
     76             <li><strong>Piece lifecycle (10-12):</strong> piece locks, new piece spawns, multiple pieces placed</li>
     77             <li><strong>Gameplay (13-14):</strong> line clear, score changes</li>
     78             <li><strong>Game state (15-16):</strong> game over detection, 30-second endurance</li>
     79             <li><strong>Competitive play bug detection (17-25):</strong> multi-line clear, score scaling, level progression, speed progression, next piece preview, game over display, counter-clockwise rotation, soft drop distinct from hard drop, rendering trail detection</li>
     80           </ul>
     81           <p class="muted">
     82             Every test is deterministic observation. The bot reads pixels or DOM state, presses
     83             keys, and checks if the game responded correctly. Tests in later phases run only if
     84             earlier phases succeeded, so cascading failures never produce false positives.
     85           </p>
     86         </div>
     87         <div class="card">
     88           <h3>SonarQube <span class="weight">50%</span></h3>
     89           <p>
     90             Automated code quality scan via SonarQube Community Edition:
     91           </p>
     92           <ul>
     93             <li><strong>Cognitive complexity</strong> normalized per file</li>
     94             <li><strong>Bugs and vulnerabilities</strong> count</li>
     95             <li><strong>Code smells</strong> count</li>
     96             <li><strong>Maintainability rating</strong> A through E</li>
     97             <li><strong>Reliability rating</strong> A through E</li>
     98             <li><strong>Security rating</strong> A through E</li>
     99           </ul>
    100           <p class="muted">
    101             SonarQube weighs the same in the headline score as the gameplay bot. A game that
    102             works perfectly but is full of cognitive complexity, bugs, or smells will score
    103             lower than a clean implementation that works.
    104           </p>
    105         </div>
    106       </div>
    107     </section>
    108 
    109     <!-- Output metrics -->
    110     <section class="method-section">
    111       <h2>Output metrics</h2>
    112       <p>
    113         These are tracked and displayed on each run's detail page, but they do not affect
    114         the headline score. They provide context for understanding how the agent worked.
    115       </p>
    116 
    117       <div class="two-col">
    118         <div class="card">
    119           <h3>Structural</h3>
    120           <ul>
    121             <li>Entry point exists (index.html or equivalent)</li>
    122             <li>Build succeeds</li>
    123             <li>TypeScript compiles without errors</li>
    124           </ul>
    125         </div>
    126 
    127         <div class="card">
    128           <h3>Code Analysis</h3>
    129           <ul>
    130             <li>Function length (max lines per function)</li>
    131             <li>Nesting depth (max indent level)</li>
    132             <li>Naming consistency (camelCase vs mixed)</li>
    133             <li>Separation of concerns (single-file vs modular)</li>
    134             <li>Code duplication</li>
    135             <li>HTML validation</li>
    136             <li>Magic numbers</li>
    137           </ul>
    138         </div>
    139 
    140         <div class="card">
    141           <h3>Code Quality</h3>
    142           <ul>
    143             <li><strong>ESLint</strong> errors and warnings</li>
    144             <li><strong>TypeScript</strong> compilation success</li>
    145             <li><strong>Bundle size</strong> after build</li>
    146           </ul>
    147           <p class="muted">
    148             Build hygiene metrics. Tracked separately from the SonarQube quality score because
    149             they measure tooling output, not code quality.
    150           </p>
    151         </div>
    152 
    153         <div class="card">
    154           <h3>Transcript Analysis</h3>
    155           <ul>
    156             <li>Tool call breakdown (which tools the agent used)</li>
    157             <li>Wasted turns (reading docs, generating ASCII art, starting dev servers)</li>
    158             <li>Productivity ratio (useful actions / total actions)</li>
    159             <li>Self-testing (did the agent test its own code?)</li>
    160           </ul>
    161           <p class="muted">Measures agent efficiency, not code quality.</p>
    162         </div>
    163       </div>
    164     </section>
    165 
    166     <!-- DOE -->
    167     <section class="method-section">
    168       <h2>Experiment design</h2>
    169       <p>
    170         The full grid has 16 axes. A naive full factorial would produce over 200,000 cells.
    171         Instead, we use statistical designs from Design of Experiments (DOE) to sample
    172         the space efficiently.
    173       </p>
    174 
    175       <div class="designs">
    176         <div class="card">
    177           <h3>Main effects sweep</h3>
    178           <p>
    179             Vary one axis at a time from a baseline configuration, holding everything else constant.
    180             This identifies which variables matter most for a given metric (score, cost, time).
    181           </p>
    182           <p class="muted">
    183             ~18 cells. Efficient but cannot detect interactions between variables.
    184           </p>
    185         </div>
    186 
    187         <div class="card">
    188           <h3>Plackett-Burman</h3>
    189           <p>
    190             A screening design for binary factors. Tests many on/off variables simultaneously
    191             using a mathematically constructed matrix that minimizes the number of runs needed
    192             to estimate main effects.
    193           </p>
    194           <p class="muted">
    195             Scales logarithmically with number of factors. Good for tool toggles.
    196           </p>
    197         </div>
    198 
    199         <div class="card">
    200           <h3>Interaction hunt</h3>
    201           <p>
    202             Full factorial on a small subset of axes (typically 2-4). Used after main effects
    203             screening identifies the top variables, to find interactions between them.
    204           </p>
    205           <p class="muted">
    206             Example: does the effect of prompt_style depend on model? Only a factorial can tell you.
    207           </p>
    208         </div>
    209       </div>
    210 
    211       <div class="callout">
    212         Each cell runs <strong>3 times</strong> by default. Repeat trials let us measure variance
    213         and distinguish real effects from noise. A variable that shifts the mean by 5 points is only
    214         meaningful if the run-to-run variance within a cell is smaller than 5 points.
    215       </div>
    216     </section>
    217 
    218     <!-- Gameplay bot -->
    219     <section class="method-section">
    220       <h2>How the gameplay bot works</h2>
    221       <p>
    222         The bot is a Playwright script that loads any Tetris implementation and figures out
    223         how to interact with it. It handles different renderers (canvas, DOM, SVG, WebGL),
    224         control schemes, languages, and start mechanisms without prior knowledge of the
    225         implementation. No text matching of any kind. Everything is discovered through
    226         observation.
    227       </p>
    228 
    229       <h3 style="margin-top: 24px; margin-bottom: 12px;">Two-tier architecture</h3>
    230       <p>
    231         The bot is split into two layers:
    232       </p>
    233       <div class="two-col">
    234         <div class="card">
    235           <h3>Driver</h3>
    236           <p>
    237             Abstracts the webpage. Handles grid detection, pixel sampling, button discovery,
    238             keyboard input, screenshots, calibration caching. The driver knows about Playwright
    239             and the DOM, but knows nothing about Tetris.
    240           </p>
    241         </div>
    242         <div class="card">
    243           <h3>Bot</h3>
    244           <p>
    245             Knows Tetris. Runs the 8 phases, derives the 25 test results, plays the game using
    246             Pierre Dellacherie's heuristic. Never imports Playwright. Only talks to the driver
    247             through a typed interface.
    248           </p>
    249         </div>
    250       </div>
    251 
    252       <h3 style="margin-top: 24px; margin-bottom: 12px;">Discovery infrastructure</h3>
    253       <div class="three-col">
    254         <div class="card">
    255           <h3>Language-agnostic start detection</h3>
    256           <p>
    257             Buttons are found by structural properties (cursor:pointer, size, contrast), never
    258             by text. Tries auto-start, then DOM buttons in order of prominence, then keyboard
    259             triggers, then canvas clicks.
    260           </p>
    261         </div>
    262         <div class="card">
    263           <h3>Interactivity verification</h3>
    264           <p>
    265             After every start attempt, the bot verifies the game actually responds to gameplay
    266             inputs (ArrowLeft/Right) by checking screenshot AND DOM state changes. Rejects false
    267             positives like Pause buttons. Rejects games that immediately game-over.
    268           </p>
    269         </div>
    270         <div class="card">
    271           <h3>Control discovery</h3>
    272           <p>
    273             Each control key is probed against candidate lists. Classifies behavior by observing
    274             grid deltas: did the piece move 1 column left? Teleport to bottom (hard drop)? Rotate?
    275             Catches games where ArrowDown is hard drop and Space is pause.
    276           </p>
    277         </div>
    278       </div>
    279 
    280       <h3 style="margin-top: 24px; margin-bottom: 12px;">Eight conditional phases</h3>
    281       <p>
    282         Each phase only runs if the previous phase succeeded. Failed prerequisites mark
    283         downstream tests as <code>skipped</code> rather than failed, so the bot never produces
    284         false positives or false negatives from cascading failures.
    285       </p>
    286 
    287       <div class="phases">
    288         <div class="phase">
    289           <div class="phase-header">
    290             <span class="phase-number">1</span>
    291             <h3>Page load</h3>
    292           </div>
    293           <p>
    294             Navigate to the game URL. Survey the page: is there a canvas, a DOM grid, an
    295             overlay, clickable elements? Capture console errors. Test: <code>game_loads</code>.
    296           </p>
    297         </div>
    298 
    299         <div class="phase">
    300           <div class="phase-header">
    301             <span class="phase-number">2</span>
    302             <h3>Start detection</h3>
    303           </div>
    304           <p>
    305             Discover candidates (auto-start, buttons, keyboard, canvas clicks). Try each, verify
    306             with interactivity check, commit only when the game actually responds to gameplay
    307             inputs. Tests: <code>game_starts</code>, <code>auto_drop</code>.
    308           </p>
    309         </div>
    310 
    311         <div class="phase">
    312           <div class="phase-header">
    313             <span class="phase-number">3</span>
    314             <h3>Mechanics</h3>
    315           </div>
    316           <p>
    317             Test each control: left, right, down, rotate, hard drop. Read the grid before and
    318             after each key press to verify the expected change. Tests: <code>move_left</code>,
    319             <code>move_right</code>, <code>move_down</code>, <code>rotate</code>,
    320             <code>hard_drop</code>, <code>all_pieces_rotate</code>.
    321           </p>
    322         </div>
    323 
    324         <div class="phase">
    325           <div class="phase-header">
    326             <span class="phase-number">4</span>
    327             <h3>Piece lifecycle</h3>
    328           </div>
    329           <p>
    330             Verify pieces lock at the bottom, new pieces spawn at the top, and the game produces
    331             multiple distinct pieces over time. Tests: <code>piece_locks</code>,
    332             <code>new_piece_spawns</code>, <code>multiple_pieces</code>.
    333           </p>
    334         </div>
    335 
    336         <div class="phase">
    337           <div class="phase-header">
    338             <span class="phase-number">5</span>
    339             <h3>Gameplay</h3>
    340           </div>
    341           <p>
    342             Play 60 pieces or 45 seconds (whichever comes first) using the AI player. Track
    343             score and lines cleared during play. Tests: <code>line_clear</code>,
    344             <code>score_changes</code>.
    345           </p>
    346           <pre><code>score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness</code></pre>
    347           <p class="muted">
    348             Pierre Dellacherie's 4-heuristic algorithm (2003), with weights from Colin Fahey's
    349             genetic algorithm optimization. Reference implementation:
    350             <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT).
    351             The original algorithm can clear thousands of lines without losing.
    352           </p>
    353         </div>
    354 
    355         <div class="phase">
    356           <div class="phase-header">
    357             <span class="phase-number">6</span>
    358             <h3>Game over</h3>
    359           </div>
    360           <p>
    361             Stack pieces to the top intentionally. Verify via grid reader (filled cells in top
    362             rows), check if input stops working, look for game-over text in the DOM. Test:
    363             <code>game_over</code>.
    364           </p>
    365         </div>
    366 
    367         <div class="phase">
    368           <div class="phase-header">
    369             <span class="phase-number">7</span>
    370             <h3>Endurance</h3>
    371           </div>
    372           <p>
    373             Play for 30 seconds with the AI player. Track console errors during gameplay (not
    374             errors from page load). Test: <code>playable_30s</code>.
    375           </p>
    376         </div>
    377 
    378         <div class="phase">
    379           <div class="phase-header">
    380             <span class="phase-number">8</span>
    381             <h3>Competitive play (bug detection)</h3>
    382           </div>
    383           <p>
    384             Play 60 seconds of competitive Tetris while watching for specific bugs. Each bug
    385             check has three outcomes: pass (tested, works), fail (tested, broken), or skip (no
    386             opportunity to test).
    387           </p>
    388           <ul>
    389             <li><code>multi_line_clear</code>: multiple complete rows clear simultaneously</li>
    390             <li><code>score_scaling</code>: score grows proportionally with multi-line clears</li>
    391             <li><code>level_progression</code>: level increases after 10+ lines cleared</li>
    392             <li><code>speed_progression</code>: drop speed increases with level</li>
    393             <li><code>next_piece_preview</code>: next piece display visible</li>
    394             <li><code>game_over_display</code>: game over message and restart option shown</li>
    395             <li><code>counter_clockwise_rotation</code>: Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
    396             <li><code>soft_drop_distinct</code>: Down arrow moves one row, distinct from hard drop</li>
    397             <li><code>rendering_clean</code>: pieces don't leave trails on the board</li>
    398           </ul>
    399         </div>
    400       </div>
    401 
    402       <h3 style="margin-top: 24px; margin-bottom: 12px;">Calibration cache</h3>
    403       <p>
    404         After the first successful calibration, the bot caches the start mechanism, controls, and
    405         grid bounds. On every subsequent page reload (each phase that requires a fresh state),
    406         the cached calibration is replayed instead of re-discovering everything. If the cache
    407         fails to apply, the bot detects calibration drift and re-runs full discovery, flagging
    408         the conflict in the report.
    409       </p>
    410 
    411       <div class="callout">
    412         The bot uses both screenshot comparison AND DOM state inspection to verify game
    413         responses. For canvas games this requires GPU access in headless Chromium. For DOM
    414         games, comparing class names and inline styles works without a GPU.
    415       </div>
    416     </section>
    417 
    418     <!-- Limitations -->
    419     <section class="method-section">
    420       <h2>Known limitations</h2>
    421       <ul class="limitations">
    422         <li>
    423           <strong>Canvas games need GPU access.</strong> In headless Chromium without GPU,
    424           <code>canvas.getImageData()</code> returns all zeros, so the grid reader can't see
    425           canvas content. DOM-rendered games work fine without a GPU.
    426         </li>
    427         <li>
    428           <strong>Trail rendering bugs.</strong> Games that leave colored cells behind moving
    429           pieces (rather than clearing them on each frame) confuse the grid reader. The bot
    430           sees stale trail cells as "filled" and can't track active piece movement reliably.
    431         </li>
    432         <li>
    433           <strong>Score detection.</strong> The bot looks for elements containing changing
    434           numbers in known patterns. Games with unusual score displays (rendered to canvas,
    435           scattered across multiple elements) may not have their score detected.
    436         </li>
    437         <li>
    438           <strong>Wall kicks, T-spins, lock delay.</strong> The bot tests basic mechanics but
    439           not advanced Tetris features. A game with broken wall kicks would still pass the
    440           rotation test if rotation works in open space.
    441         </li>
    442         <li>
    443           <strong>Bot skill caps detection.</strong> Some bug detection tests (multi_line_clear,
    444           score_scaling, level_progression) require the bot to actually clear multiple lines
    445           during play. If the AI player can't clear lines on a particular game, those tests
    446           skip rather than fail.
    447         </li>
    448         <li>
    449           <strong>Game over masking.</strong> Games where the start screen looks like the game
    450           itself (or where game-over state appears on load) can cause start detection issues.
    451           The bot rejects "starts" that immediately game-over, but edge cases exist.
    452         </li>
    453         <li>
    454           <strong>Hidden game elements.</strong> If a game has working logic but a CSS bug
    455           hides the start button, the bot will report the game as broken because it can't get
    456           past the start screen, even though the underlying code is correct.
    457         </li>
    458         <li>
    459           <strong>SonarQube availability.</strong> SonarQube metrics only populate when the
    460           server is running locally. Runs evaluated without it will have empty SonarQube sections.
    461         </li>
    462         <li>
    463           <strong>Single task.</strong> Currently only the Tetris task is evaluated. Results may
    464           not generalize to other task types (APIs, data pipelines, etc.).
    465         </li>
    466         <li>
    467           <strong>Sample size.</strong> Statistical power depends on having enough runs. Small
    468           sweeps can produce noisy main effect estimates. The dashboard shows confidence
    469           intervals everywhere they're computable.
    470         </li>
    471       </ul>
    472     </section>
    473 
    474   </div>
    475 </Base>
    476 
    477 <style>
    478   .methodology {
    479     max-width: 960px;
    480   }
    481 
    482   .method-section {
    483     margin-bottom: 48px;
    484   }
    485 
    486   .method-section h2 {
    487     margin-bottom: 12px;
    488     color: hsl(var(--primary));
    489   }
    490 
    491   .method-section > p {
    492     margin-bottom: 16px;
    493     line-height: 1.7;
    494   }
    495 
    496   .three-col {
    497     display: grid;
    498     grid-template-columns: repeat(3, 1fr);
    499     gap: 16px;
    500     margin-bottom: 16px;
    501   }
    502 
    503   .two-col {
    504     display: grid;
    505     grid-template-columns: repeat(2, 1fr);
    506     gap: 16px;
    507     margin-bottom: 16px;
    508   }
    509 
    510   .designs {
    511     display: grid;
    512     grid-template-columns: repeat(3, 1fr);
    513     gap: 16px;
    514     margin-bottom: 16px;
    515   }
    516 
    517   .card h3 {
    518     margin-bottom: 8px;
    519     color: hsl(var(--foreground));
    520   }
    521 
    522   .card p {
    523     margin-bottom: 8px;
    524     line-height: 1.6;
    525   }
    526 
    527   .card ul {
    528     margin: 8px 0;
    529     padding-left: 20px;
    530   }
    531 
    532   .card li {
    533     margin-bottom: 4px;
    534     line-height: 1.5;
    535   }
    536 
    537   .weight {
    538     color: hsl(var(--muted-foreground));
    539     font-size: var(--text-label);
    540     font-weight: 400;
    541   }
    542 
    543   .muted {
    544     color: hsl(var(--muted-foreground));
    545     font-size: 0.85rem;
    546   }
    547 
    548   .callout {
    549     border-left: 3px solid hsl(var(--primary));
    550     padding: 12px 16px;
    551     background: hsl(var(--primary) / 0.04);
    552     margin-bottom: 16px;
    553     line-height: 1.6;
    554   }
    555 
    556   .phases {
    557     display: flex;
    558     flex-direction: column;
    559     gap: 16px;
    560     margin-bottom: 16px;
    561   }
    562 
    563   .phase {
    564     background: hsl(var(--card));
    565     border: 1px solid hsl(var(--border));
    566     padding: 16px 20px;
    567   }
    568 
    569   .phase-header {
    570     display: flex;
    571     align-items: center;
    572     gap: 12px;
    573     margin-bottom: 8px;
    574   }
    575 
    576   .phase-number {
    577     display: inline-flex;
    578     align-items: center;
    579     justify-content: center;
    580     width: 28px;
    581     height: 28px;
    582     border: 1px solid hsl(var(--primary));
    583     color: hsl(var(--primary));
    584     font-weight: 600;
    585     font-size: var(--text-ui);
    586     flex-shrink: 0;
    587   }
    588 
    589   .phase-header h3 {
    590     margin: 0;
    591     color: hsl(var(--foreground));
    592   }
    593 
    594   .phase p {
    595     margin-bottom: 8px;
    596     line-height: 1.6;
    597   }
    598 
    599   .phase pre {
    600     background: hsl(var(--smui-surface-2));
    601     border: 1px solid hsl(var(--border));
    602     padding: 10px 14px;
    603     overflow-x: auto;
    604     margin-bottom: 8px;
    605   }
    606 
    607   .phase code {
    608     font-family: var(--font-mono);
    609     font-size: var(--text-ui);
    610     color: hsl(var(--smui-green));
    611   }
    612 
    613   .limitations {
    614     list-style: none;
    615     padding: 0;
    616   }
    617 
    618   .limitations li {
    619     padding: 10px 0;
    620     border-bottom: 1px solid hsl(var(--border));
    621     line-height: 1.6;
    622   }
    623 
    624   .limitations li:last-child {
    625     border-bottom: none;
    626   }
    627 
    628   @media (max-width: 768px) {
    629     .three-col,
    630     .two-col,
    631     .designs {
    632       grid-template-columns: 1fr;
    633     }
    634   }
    635 </style>
	loop-benchmarking Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
	git clone https://git.shiptheloop.com/loop-benchmarking.git
	Log \| Files \| Refs \| README