methodology.astro (23441B)
---
import Base from "../layouts/Base.astro";
---

<Base title="Methodology">
  <h1 style="margin-bottom: 8px;">Methodology</h1>
  <p style="color: var(--text-muted); margin-bottom: 32px; font-size: 0.875rem;">
    How the benchmark works, what it measures, and why.
  </p>

  <div class="methodology">

    <!-- Framework -->
    <section class="method-section">
      <h2>Framework</h2>
      <p>
        The benchmark separates three concepts: what goes in, what comes out, and whether it works.
      </p>

      <div class="three-col">
        <div class="card">
          <h3>Inputs</h3>
          <p>
            The experiment variables: model, effort level, tools, prompt style, programming language,
            human language, budget, and more. These are the grid axes. Each unique combination of values
            is a "cell" in the experiment matrix.
          </p>
          <p class="muted">22 axes. See the <a href="/compare">Compare</a> page for the full list.</p>
        </div>
        <div class="card">
          <h3>Outputs</h3>
          <p>
            Measures of <em>how</em> the code was built. Code quality, structural integrity, agent
            efficiency, complexity metrics. These tell you about the process and the codebase, not
            whether the end product works.
          </p>
          <p class="muted">Tracked and displayed, but not in the headline score.</p>
        </div>
        <div class="card">
          <h3>Outcomes</h3>
          <p>
            Measures of <em>what</em> was delivered. Does the game load? Do the controls work?
            Does it clear lines? Can you play for 30 seconds without crashing? These tell you
            whether the agent succeeded at the task.
          </p>
          <p class="muted">This is the headline score.</p>
        </div>
      </div>

      <div class="callout">
        The headline score is <strong>50% gameplay + 50% code quality</strong>, where code
        quality means the SonarQube scan. The remaining output metrics are tracked separately
        so you can see the full picture without them distorting the primary question: did it work?
      </div>
    </section>

    <!-- Scoring -->
    <section class="method-section">
      <h2>How scoring works</h2>
      <p>
        All evaluation is deterministic code. No LLM grading. The agent never sees the test suite.
      </p>

      <div class="two-col">
        <div class="card">
          <h3>Gameplay Bot <span class="weight">50%</span></h3>
          <p>
            25 automated Playwright tests across 8 conditional phases. The bot calibrates itself to
            each game (finds the grid, discovers controls, locates the start mechanism), then
            plays using a continuous 60 ms polling loop that reads grid state directly from the
            canvas or DOM.
          </p>
          <p>Tests cover:</p>
          <ul>
            <li><strong>Mechanics (1-9):</strong> game loads, game starts, auto-drop, movement (left/right/down), rotation, hard drop, all-piece-types rotation</li>
            <li><strong>Piece lifecycle (10-12):</strong> piece locks, new piece spawns, multiple pieces placed</li>
            <li><strong>Gameplay (13-14):</strong> line clear, score changes</li>
            <li><strong>Game state (15-16):</strong> game over detection, 30-second endurance</li>
            <li><strong>Competitive play bug detection (17-25):</strong> multi-line clear, score scaling, level progression, speed progression, next piece preview, game over display, counter-clockwise rotation, soft drop distinct from hard drop, rendering trail detection</li>
          </ul>
          <p class="muted">
            Every test is a deterministic observation. The bot reads pixels or DOM state, presses
            keys, and checks whether the game responded correctly. Tests in later phases run only if
            earlier phases succeeded, so a cascading failure is never reported as a misleading pass or fail.
          </p>
        </div>
        <div class="card">
          <h3>SonarQube <span class="weight">50%</span></h3>
          <p>
            Automated code quality scan via SonarQube Community Edition:
          </p>
          <ul>
            <li><strong>Cognitive complexity</strong> normalized per file</li>
            <li><strong>Bugs and vulnerabilities</strong> count</li>
            <li><strong>Code smells</strong> count</li>
            <li><strong>Maintainability rating</strong> A through E</li>
            <li><strong>Reliability rating</strong> A through E</li>
            <li><strong>Security rating</strong> A through E</li>
          </ul>
          <p class="muted">
            SonarQube carries the same weight in the headline score as the gameplay bot. A game that
            works perfectly but has high cognitive complexity, many bugs, or many code smells will
            score lower than a clean implementation that works.
          </p>
        </div>
      </div>
    </section>

    <!-- Output metrics -->
    <section class="method-section">
      <h2>Output metrics</h2>
      <p>
        These are tracked and displayed on each run's detail page, but they do not affect
        the headline score. They provide context for understanding how the agent worked.
      </p>

      <div class="two-col">
        <div class="card">
          <h3>Structural</h3>
          <ul>
            <li>Entry point exists (index.html or equivalent)</li>
            <li>Build succeeds</li>
            <li>TypeScript compiles without errors</li>
          </ul>
        </div>

        <div class="card">
          <h3>Code Analysis</h3>
          <ul>
            <li>Function length (max lines per function)</li>
            <li>Nesting depth (max indent level)</li>
            <li>Naming consistency (camelCase vs mixed)</li>
            <li>Separation of concerns (single-file vs modular)</li>
            <li>Code duplication</li>
            <li>HTML validation</li>
            <li>Magic numbers</li>
          </ul>
        </div>

        <div class="card">
          <h3>Code Quality</h3>
          <ul>
            <li><strong>ESLint</strong> errors and warnings</li>
            <li><strong>TypeScript</strong> compilation success</li>
            <li><strong>Bundle size</strong> after build</li>
          </ul>
          <p class="muted">
            Build hygiene metrics. Tracked separately from the SonarQube quality score because
            they measure tooling output, not code quality.
          </p>
        </div>

        <div class="card">
          <h3>Transcript Analysis</h3>
          <ul>
            <li>Tool call breakdown (which tools the agent used)</li>
            <li>Wasted turns (reading docs, generating ASCII art, starting dev servers)</li>
            <li>Productivity ratio (useful actions / total actions)</li>
            <li>Self-testing (did the agent test its own code?)</li>
          </ul>
          <p class="muted">Measures agent efficiency, not code quality.</p>
        </div>
      </div>
    </section>

    <!-- DOE -->
    <section class="method-section">
      <h2>Experiment design</h2>
      <p>
        The full grid has 16 axes. A naive full factorial would produce over 200,000 cells.
        Instead, we use statistical designs from Design of Experiments (DOE) to sample
        the space efficiently.
      </p>

      <div class="designs">
        <div class="card">
          <h3>Main effects sweep</h3>
          <p>
            Vary one axis at a time from a baseline configuration, holding everything else constant.
            This identifies which variables matter most for a given metric (score, cost, time).
          </p>
          <p class="muted">
            ~18 cells. Efficient but cannot detect interactions between variables.
          </p>
        </div>

        <div class="card">
          <h3>Plackett-Burman</h3>
          <p>
            A screening design for binary factors. Tests many on/off variables simultaneously
            using a mathematically constructed matrix that minimizes the number of runs needed
            to estimate main effects.
          </p>
          <p class="muted">
            Run count grows linearly with the number of factors: N runs (N a multiple of 4)
            can screen up to N - 1 factors. Good for tool toggles.
          </p>
        </div>

        <div class="card">
          <h3>Interaction hunt</h3>
          <p>
            Full factorial on a small subset of axes (typically 2-4). Used after main effects
            screening identifies the top variables, to find interactions between them.
          </p>
          <p class="muted">
            Example: does the effect of prompt_style depend on model? Only a factorial can tell you.
          </p>
        </div>
      </div>

      <div class="callout">
        Each cell runs <strong>3 times</strong> by default. Repeat trials let us measure variance
        and distinguish real effects from noise. A variable that shifts the mean by 5 points is only
        meaningful if the run-to-run spread within a cell is clearly smaller than that shift.
      </div>
    </section>

    <!-- Gameplay bot -->
    <section class="method-section">
      <h2>How the gameplay bot works</h2>
      <p>
        The bot is a Playwright script that loads any Tetris implementation and figures out
        how to interact with it. It handles different renderers (canvas, DOM, SVG, WebGL),
        control schemes, languages, and start mechanisms without prior knowledge of the
        implementation. Discovery uses no text matching. Everything is found through
        observation.
      </p>

      <h3 style="margin-top: 24px; margin-bottom: 12px;">Two-tier architecture</h3>
      <p>
        The bot is split into two layers:
      </p>
      <div class="two-col">
        <div class="card">
          <h3>Driver</h3>
          <p>
            Abstracts the webpage. Handles grid detection, pixel sampling, button discovery,
            keyboard input, screenshots, and calibration caching. The driver knows about Playwright
            and the DOM, but knows nothing about Tetris.
          </p>
        </div>
        <div class="card">
          <h3>Bot</h3>
          <p>
            Knows Tetris. Runs the 8 phases, derives the 25 test results, and plays the game using
            the board-evaluation heuristic described under phase 5. Never imports Playwright. Only
            talks to the driver through a typed interface.
          </p>
        </div>
      </div>

      <h3 style="margin-top: 24px; margin-bottom: 12px;">Discovery infrastructure</h3>
      <div class="three-col">
        <div class="card">
          <h3>Language-agnostic start detection</h3>
          <p>
            Buttons are found by structural properties (cursor:pointer, size, contrast), never
            by text. The bot tries auto-start, then DOM buttons in order of prominence, then
            keyboard triggers, then canvas clicks.
          </p>
        </div>
        <div class="card">
          <h3>Interactivity verification</h3>
          <p>
            After every start attempt, the bot verifies that the game actually responds to gameplay
            inputs (ArrowLeft/Right) by checking for both screenshot and DOM state changes. It
            rejects false positives like Pause buttons, and games that immediately reach game over.
          </p>
        </div>
        <div class="card">
          <h3>Control discovery</h3>
          <p>
            Each control is probed against a list of candidate keys. The bot classifies behavior by
            observing grid deltas: did the piece move 1 column left? Teleport to bottom (hard drop)?
            Rotate? This catches games where ArrowDown is hard drop and Space is pause.
          </p>
        </div>
      </div>

      <h3 style="margin-top: 24px; margin-bottom: 12px;">Eight conditional phases</h3>
      <p>
        Each phase only runs if the previous phase succeeded. Failed prerequisites mark
        downstream tests as <code>skipped</code> rather than failed, so the bot never produces
        false positives or false negatives from cascading failures.
      </p>

      <div class="phases">
        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">1</span>
            <h3>Page load</h3>
          </div>
          <p>
            Navigate to the game URL. Survey the page: is there a canvas, a DOM grid, an
            overlay, clickable elements? Capture console errors. Test: <code>game_loads</code>.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">2</span>
            <h3>Start detection</h3>
          </div>
          <p>
            Discover candidates (auto-start, buttons, keyboard, canvas clicks). Try each, verify
            with interactivity check, commit only when the game actually responds to gameplay
            inputs. Tests: <code>game_starts</code>, <code>auto_drop</code>.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">3</span>
            <h3>Mechanics</h3>
          </div>
          <p>
            Test each control: left, right, down, rotate, hard drop. Read the grid before and
            after each key press to verify the expected change. Tests: <code>move_left</code>,
            <code>move_right</code>, <code>move_down</code>, <code>rotate</code>,
            <code>hard_drop</code>, <code>all_pieces_rotate</code>.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">4</span>
            <h3>Piece lifecycle</h3>
          </div>
          <p>
            Verify pieces lock at the bottom, new pieces spawn at the top, and the game produces
            multiple distinct pieces over time. Tests: <code>piece_locks</code>,
            <code>new_piece_spawns</code>, <code>multiple_pieces</code>.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">5</span>
            <h3>Gameplay</h3>
          </div>
          <p>
            Play 60 pieces or 45 seconds (whichever comes first) using the AI player. Track
            score and lines cleared during play. Tests: <code>line_clear</code>,
            <code>score_changes</code>.
          </p>
          <pre><code>score = -0.51 * height + 0.76 * lines - 0.36 * holes - 0.18 * bumpiness</code></pre>
          <p class="muted">
            A four-feature linear heuristic (aggregate height, completed lines, holes, bumpiness)
            with weights tuned by a genetic algorithm. The formula and weights follow the reference
            implementation <a href="https://github.com/LeeYiyuan/tetrisai" target="_blank" rel="noopener">LeeYiyuan/tetrisai</a> (MIT).
            It is in the same family as Pierre Dellacherie's hand-tuned evaluator (2003), documented
            by Colin Fahey. The original implementation can clear thousands of lines without losing.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">6</span>
            <h3>Game over</h3>
          </div>
          <p>
            Stack pieces to the top intentionally. Verify via grid reader (filled cells in top
            rows), check if input stops working, look for game-over text in the DOM. Test:
            <code>game_over</code>.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">7</span>
            <h3>Endurance</h3>
          </div>
          <p>
            Play for 30 seconds with the AI player. Track console errors during gameplay (not
            errors from page load). Test: <code>playable_30s</code>.
          </p>
        </div>

        <div class="phase">
          <div class="phase-header">
            <span class="phase-number">8</span>
            <h3>Competitive play (bug detection)</h3>
          </div>
          <p>
            Play 60 seconds of competitive Tetris while watching for specific bugs. Each bug
            check has three outcomes: pass (tested, works), fail (tested, broken), or skip (no
            opportunity to test).
          </p>
          <ul>
            <li><code>multi_line_clear</code>: multiple complete rows clear simultaneously</li>
            <li><code>score_scaling</code>: score grows proportionally with multi-line clears</li>
            <li><code>level_progression</code>: level increases after 10+ lines cleared</li>
            <li><code>speed_progression</code>: drop speed increases with level</li>
            <li><code>next_piece_preview</code>: next piece display visible</li>
            <li><code>game_over_display</code>: game over message and restart option shown</li>
            <li><code>counter_clockwise_rotation</code>: Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
            <li><code>soft_drop_distinct</code>: Down arrow moves one row, distinct from hard drop</li>
            <li><code>rendering_clean</code>: pieces don't leave trails on the board</li>
          </ul>
        </div>
      </div>

      <h3 style="margin-top: 24px; margin-bottom: 12px;">Calibration cache</h3>
      <p>
        After the first successful calibration, the bot caches the start mechanism, controls, and
        grid bounds. On every subsequent page reload (each phase that requires a fresh state),
        the cached calibration is replayed instead of re-discovering everything. If the cache
        fails to apply, the bot detects calibration drift and re-runs full discovery, flagging
        the conflict in the report.
      </p>

      <div class="callout">
        The bot uses both screenshot comparison and DOM state inspection to verify game
        responses. For canvas games this requires GPU access in headless Chromium. For DOM
        games, comparing class names and inline styles works without a GPU.
      </div>
    </section>

    <!-- Limitations -->
    <section class="method-section">
      <h2>Known limitations</h2>
      <ul class="limitations">
        <li>
          <strong>Canvas games need GPU access.</strong> In headless Chromium without GPU,
          <code>getImageData()</code> on the canvas 2D context returns all zeros, so the grid
          reader can't see canvas content.
          DOM-rendered games work fine without a GPU.
        </li>
        <li>
          <strong>Trail rendering bugs.</strong> Games that leave colored cells behind moving
          pieces (rather than clearing them on each frame) confuse the grid reader. The bot
          sees stale trail cells as "filled" and can't track active piece movement reliably.
        </li>
        <li>
          <strong>Score detection.</strong> The bot looks for elements containing changing
          numbers in known patterns. Games with unusual score displays (rendered to canvas,
          scattered across multiple elements) may not have their score detected.
        </li>
        <li>
          <strong>Wall kicks, T-spins, lock delay.</strong> The bot tests basic mechanics but
          not advanced Tetris features. A game with broken wall kicks would still pass the
          rotation test if rotation works in open space.
        </li>
        <li>
          <strong>Bot skill caps detection.</strong> Some bug detection tests (multi_line_clear,
          score_scaling, level_progression) require the bot to actually clear multiple lines
          during play. If the AI player can't clear lines on a particular game, those tests
          skip rather than fail.
        </li>
        <li>
          <strong>Game over masking.</strong> Games where the start screen looks like the game
          itself (or where game-over state appears on load) can cause start detection issues.
          The bot rejects "starts" that immediately end in game over, but edge cases exist.
        </li>
        <li>
          <strong>Hidden game elements.</strong> If a game has working logic but a CSS bug
          hides the start button, the bot will report the game as broken because it can't get
          past the start screen, even though the underlying code is correct.
        </li>
        <li>
          <strong>SonarQube availability.</strong> SonarQube metrics only populate when the
          server is running locally. Runs evaluated without it will have empty SonarQube sections.
        </li>
        <li>
          <strong>Single task.</strong> Currently only the Tetris task is evaluated.
          Results may not generalize to other task types (APIs, data pipelines, etc.).
        </li>
        <li>
          <strong>Sample size.</strong> Statistical power depends on having enough runs. Small
          sweeps can produce noisy main effect estimates. The dashboard shows confidence
          intervals everywhere they're computable.
        </li>
      </ul>
    </section>

  </div>
</Base>

<style>
  .methodology {
    max-width: 960px;
  }

  .method-section {
    margin-bottom: 48px;
  }

  .method-section h2 {
    margin-bottom: 12px;
    color: hsl(var(--primary));
  }

  .method-section > p {
    margin-bottom: 16px;
    line-height: 1.7;
  }

  .three-col {
    display: grid;
    grid-template-columns: repeat(3, 1fr);
    gap: 16px;
    margin-bottom: 16px;
  }

  .two-col {
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 16px;
    margin-bottom: 16px;
  }

  .designs {
    display: grid;
    grid-template-columns: repeat(3, 1fr);
    gap: 16px;
    margin-bottom: 16px;
  }

  .card h3 {
    margin-bottom: 8px;
    color: hsl(var(--foreground));
  }

  .card p {
    margin-bottom: 8px;
    line-height: 1.6;
  }

  .card ul {
    margin: 8px 0;
    padding-left: 20px;
  }

  .card li {
    margin-bottom: 4px;
    line-height: 1.5;
  }

  .weight {
    color: hsl(var(--muted-foreground));
    font-size: var(--text-label);
    font-weight: 400;
  }

  .muted {
    color: hsl(var(--muted-foreground));
    font-size: 0.85rem;
  }

  .callout {
    border-left: 3px solid hsl(var(--primary));
    padding: 12px 16px;
    background: hsl(var(--primary) / 0.04);
    margin-bottom: 16px;
    line-height: 1.6;
  }

  .phases {
    display: flex;
    flex-direction: column;
    gap: 16px;
    margin-bottom: 16px;
  }

  .phase {
    background: hsl(var(--card));
    border: 1px solid hsl(var(--border));
    padding: 16px 20px;
  }

  .phase-header {
    display: flex;
    align-items: center;
    gap: 12px;
    margin-bottom: 8px;
  }

  .phase-number {
    display: inline-flex;
    align-items: center;
    justify-content: center;
    width: 28px;
    height: 28px;
    border: 1px solid hsl(var(--primary));
    color: hsl(var(--primary));
    font-weight: 600;
    font-size: var(--text-ui);
    flex-shrink: 0;
  }

  .phase-header h3 {
    margin: 0;
    color: hsl(var(--foreground));
  }

  .phase p {
    margin-bottom: 8px;
    line-height: 1.6;
  }

  .phase pre {
    background: hsl(var(--smui-surface-2));
    border: 1px solid hsl(var(--border));
    padding: 10px 14px;
    overflow-x: auto;
    margin-bottom: 8px;
  }

  .phase code {
    font-family: var(--font-mono);
    font-size: var(--text-ui);
    color: hsl(var(--smui-green));
  }

  .limitations {
    list-style: none;
    padding: 0;
  }

  .limitations li {
    padding: 10px 0;
    border-bottom: 1px solid hsl(var(--border));
    line-height: 1.6;
  }

  .limitations li:last-child {
    border-bottom: none;
  }

  @media (max-width: 768px) {
    .three-col,
    .two-col,
    .designs {
      grid-template-columns: 1fr;
    }
  }
</style>