loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 4f28472171324e5ca141cd341697a345ac9438fc
parent f2f3ae07a56e601bfb819d07f034eedeabc6a9c8
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 10 Apr 2026 21:24:30 +0200

Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes

Headline score is 50% gameplay bot + 50% SonarQube. The lint/typecheck/
bundle "code quality" metrics are tracked as output metrics, not part of
the headline score.

Also removed all prose emdashes per project convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
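
The 50/50 headline weighting described in the commit message can be sketched as follows. This is a minimal illustration, not the project's scoring code: the names `ComponentScores` and `headlineScore` are hypothetical, and it assumes both components have already been normalized to the range [0, 1]. How the SonarQube metrics are normalized is not shown in this commit.

```typescript
// Sketch only: assumes each component is already normalized to [0, 1].
// The weights come from the commit message (50% gameplay bot, 50% SonarQube).
const GAMEPLAY_WEIGHT = 0.5;
const SONARQUBE_WEIGHT = 0.5;

// Hypothetical input shape; not taken from the repository.
interface ComponentScores {
  gameplay: number;
  sonarqube: number;
}

// Rejects values outside [0, 1] so a bad normalization step fails loudly.
function assertUnitInterval(name: string, value: number): void {
  if (!Number.isFinite(value) || value < 0 || value > 1) {
    throw new RangeError(`${name} must be within [0, 1], got ${value}`);
  }
}

// Combines the two components into the headline score.
function headlineScore(scores: ComponentScores): number {
  assertUnitInterval("gameplay", scores.gameplay);
  assertUnitInterval("sonarqube", scores.sonarqube);
  return GAMEPLAY_WEIGHT * scores.gameplay + SONARQUBE_WEIGHT * scores.sonarqube;
}
```

Under these assumptions, a game that passes every gameplay test but earns only half of the SonarQube component gets `headlineScore({ gameplay: 1, sonarqube: 0.5 })`, which is `0.75`. Lint, typecheck, and bundle metrics do not enter the calculation.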

Diffstat:
M dashboard/src/pages/methodology.astro | 68 ++++++++++++++++++++++++++++++++++++--------------------------------
1 file changed, 36 insertions(+), 32 deletions(-)

diff --git a/dashboard/src/pages/methodology.astro b/dashboard/src/pages/methodology.astro
@@ -21,7 +21,7 @@ import Base from "../layouts/Base.astro";
  <div class="card">
  <h3>Inputs</h3>
  <p>
- The experiment variables -- model, effort level, tools, prompt style, programming language,
+ The experiment variables: model, effort level, tools, prompt style, programming language,
  human language, budget, and more. These are the grid axes. Each unique combination of
  values is a "cell" in the experiment matrix.
  </p>
@@ -66,7 +66,7 @@ import Base from "../layouts/Base.astro";
  <h3>Gameplay Bot <span class="weight">50%</span></h3>
  <p>
  25 automated Playwright tests across 8 conditional phases. The bot calibrates itself to
- each game -- finds the grid, discovers controls, locates the start mechanism -- then
+ each game (finds the grid, discovers controls, locates the start mechanism), then
  plays using a continuous 60ms polling loop that reads grid state directly from the
  canvas or DOM.
  </p>
@@ -81,22 +81,26 @@ import Base from "../layouts/Base.astro";
  <p class="muted">
  Every test is deterministic observation. The bot reads pixels or DOM state, presses keys,
  and checks if the game responded correctly. Tests in later phases run only if
- earlier phases succeeded -- no false positives from cascading failures.
+ earlier phases succeeded, so cascading failures never produce false positives.
  </p>
  </div>
  <div class="card">
- <h3>Code Quality <span class="weight">50%</span></h3>
+ <h3>SonarQube <span class="weight">50%</span></h3>
  <p>
- Automated quality checks run against the agent's output:
+ Automated code quality scan via SonarQube Community Edition:
  </p>
  <ul>
- <li><strong>ESLint</strong> -- errors and warnings counted</li>
- <li><strong>TypeScript</strong> -- compilation success/failure</li>
- <li><strong>Bundle size</strong> -- measured after build</li>
+ <li><strong>Cognitive complexity</strong> normalized per file</li>
+ <li><strong>Bugs and vulnerabilities</strong> count</li>
+ <li><strong>Code smells</strong> count</li>
+ <li><strong>Maintainability rating</strong> A through E</li>
+ <li><strong>Reliability rating</strong> A through E</li>
+ <li><strong>Security rating</strong> A through E</li>
  </ul>
  <p class="muted">
- Quality scoring rewards clean, buildable code. A game that works perfectly but has
- 200 lint errors will score lower than one with clean code.
+ SonarQube weighs the same in the headline score as the gameplay bot. A game that
+ works perfectly but is full of cognitive complexity, bugs, or smells will score
+ lower than a clean implementation that works.
  </p>
  </div>
  </div>
@@ -134,16 +138,16 @@ import Base from "../layouts/Base.astro";
  </div>
  <div class="card">
- <h3>SonarQube</h3>
+ <h3>Code Quality</h3>
  <ul>
- <li>Cognitive complexity</li>
- <li>Bugs and vulnerabilities</li>
- <li>Code smells</li>
- <li>Maintainability rating</li>
- <li>Reliability rating</li>
- <li>Security rating</li>
+ <li><strong>ESLint</strong> errors and warnings</li>
+ <li><strong>TypeScript</strong> compilation success</li>
+ <li><strong>Bundle size</strong> after build</li>
  </ul>
- <p class="muted">Only runs when the SonarQube server is available locally.</p>
+ <p class="muted">
+ Build hygiene metrics. Tracked separately from the SonarQube quality score because
+ they measure tooling output, not code quality.
+ </p>
  </div>
  <div class="card">
@@ -218,7 +222,7 @@ import Base from "../layouts/Base.astro";
  The bot is a Playwright script that loads any Tetris implementation and figures out
  how to interact with it. It handles different renderers (canvas, DOM, SVG, WebGL),
  control schemes, languages, and start mechanisms without prior knowledge of the
- implementation. No text matching of any kind -- everything is discovered through
+ implementation. No text matching of any kind. Everything is discovered through
  observation.
  </p>
@@ -239,7 +243,7 @@ import Base from "../layouts/Base.astro";
  <h3>Bot</h3>
  <p>
  Knows Tetris. Runs the 8 phases, derives the 25 test results, plays the game using
- Pierre Dellacherie's heuristic. Never imports Playwright -- only talks to the driver
+ Pierre Dellacherie's heuristic. Never imports Playwright. Only talks to the driver
  through a typed interface.
  </p>
  </div>
@@ -250,7 +254,7 @@ import Base from "../layouts/Base.astro";
  <div class="card">
  <h3>Language-agnostic start detection</h3>
  <p>
- Buttons are found by structural properties (cursor:pointer, size, contrast) -- never
+ Buttons are found by structural properties (cursor:pointer, size, contrast), never
  by text. Tries auto-start, then DOM buttons in order of prominence, then keyboard
  triggers, then canvas clicks.
  </p>
@@ -276,7 +280,7 @@ import Base from "../layouts/Base.astro";
  <h3 style="margin-top: 24px; margin-bottom: 12px;">Eight conditional phases</h3>
  <p>
  Each phase only runs if the previous phase succeeded. Failed prerequisites mark
- downstream tests as <code>skipped</code> rather than failed -- so the bot never produces
+ downstream tests as <code>skipped</code> rather than failed, so the bot never produces
  false positives or false negatives from cascading failures.
  </p>
@@ -382,15 +386,15 @@ import Base from "../layouts/Base.astro";
  opportunity to test).
  </p>
  <ul>
- <li><code>multi_line_clear</code> -- multiple complete rows clear simultaneously</li>
- <li><code>score_scaling</code> -- score grows proportionally with multi-line clears</li>
- <li><code>level_progression</code> -- level increases after 10+ lines cleared</li>
- <li><code>speed_progression</code> -- drop speed increases with level</li>
- <li><code>next_piece_preview</code> -- next piece display visible</li>
- <li><code>game_over_display</code> -- game over message and restart option shown</li>
- <li><code>counter_clockwise_rotation</code> -- Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
- <li><code>soft_drop_distinct</code> -- Down arrow moves one row, distinct from hard drop</li>
- <li><code>rendering_clean</code> -- pieces don't leave trails on the board</li>
+ <li><code>multi_line_clear</code>: multiple complete rows clear simultaneously</li>
+ <li><code>score_scaling</code>: score grows proportionally with multi-line clears</li>
+ <li><code>level_progression</code>: level increases after 10+ lines cleared</li>
+ <li><code>speed_progression</code>: drop speed increases with level</li>
+ <li><code>next_piece_preview</code>: next piece display visible</li>
+ <li><code>game_over_display</code>: game over message and restart option shown</li>
+ <li><code>counter_clockwise_rotation</code>: Z key rotates opposite to Up arrow (verified by reload-and-compare)</li>
+ <li><code>soft_drop_distinct</code>: Down arrow moves one row, distinct from hard drop</li>
+ <li><code>rendering_clean</code>: pieces don't leave trails on the board</li>
  </ul>
  </div>
  </div>
@@ -449,7 +453,7 @@ import Base from "../layouts/Base.astro";
  <li>
  <strong>Hidden game elements.</strong> If a game has working logic but a CSS bug
  hides the start button, the bot will report the game as broken because it can't get
- past the start screen -- even though the underlying code is correct.
+ past the start screen, even though the underlying code is correct.
  </li>
  <li>
  <strong>SonarQube availability.</strong> SonarQube metrics only populate when the
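
The conditional-phase rule described in the diff (later phases run only if earlier ones succeeded, and downstream tests are marked `skipped` rather than failed) could look roughly like the sketch below. It is an illustration under stated assumptions, not the bot's actual code: `Phase`, `PhaseTest`, and `runPhases` are hypothetical names, and "a phase succeeded" is assumed here to mean that every test in it passed.

```typescript
type TestStatus = "passed" | "failed" | "skipped";

// Hypothetical shapes; the real bot's types are not shown in this commit.
interface PhaseTest {
  id: string;
  check: () => boolean; // returns true when the observation matches expectations
}

interface Phase {
  name: string;
  tests: PhaseTest[];
}

// Runs phases in order. Once a phase fails, every test in later phases is
// recorded as "skipped" instead of being executed and reported as "failed".
function runPhases(phases: Phase[]): Record<string, TestStatus> {
  const results: Record<string, TestStatus> = {};
  let prerequisitesMet = true;

  for (const phase of phases) {
    let phaseSucceeded = true;
    for (const test of phase.tests) {
      if (!prerequisitesMet) {
        results[test.id] = "skipped";
        continue;
      }
      const ok = test.check();
      results[test.id] = ok ? "passed" : "failed";
      // Assumption: any failing test means the phase did not succeed.
      if (!ok) phaseSucceeded = false;
    }
    if (!phaseSucceeded) prerequisitesMet = false;
  }
  return results;
}
```

Keeping `skipped` distinct from `failed` means a single early failure, such as an unreachable start screen, shows up as one failed test plus a set of untested ones, rather than a cascade of failures that say nothing about the later features.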
