loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 1faf94e22b8b5d1181ae2c8cc880dc972fd3f07a
parent 9dd8921a1378f43deeb46eb036fdaf73a3f7f75f
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri,  3 Apr 2026 19:19:58 +0200

Add git commit to footer, document metrics and Pareto frontier

- Footer shows short git hash so deployed version is verifiable
- README documents metric switching (score, cost, turns, wall_time, pass_rate)
- README documents future Pareto frontier analysis for multi-objective comparison

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M README.md                        | 22 ++++++++++++++++++++++
M dashboard/src/layouts/Base.astro |  9 +++++++++
2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/README.md b/README.md
@@ -122,3 +122,25 @@ Instead of running the full 204,800-cell grid, use statistical designs:
 - **Interaction hunt**: Full factorial on a small subset of axes to find interactions.
 
 The dashboard's Insights page visualizes main effects as tornado charts and interactions as heatmaps.
+
+## Metrics
+
+All analyses can target different metrics. Switch between them in the dashboard or via CLI:
+
+```bash
+# Which variables most affect quality?
+python3 harness/lib/experiment_design.py analyze results main_effects score
+
+# Which variables most affect cost?
+python3 harness/lib/experiment_design.py analyze results main_effects cost
+
+# Which variables most affect speed?
+python3 harness/lib/experiment_design.py analyze results main_effects wall_time
+
+# Which variables most affect iteration count?
+python3 harness/lib/experiment_design.py analyze results main_effects turns
+```
+
+Available metrics: `score`, `cost`, `turns`, `wall_time`, `pass_rate`.
+
+These metrics often conflict. A config that maximizes score may also maximize cost. A future addition is Pareto frontier analysis to identify configurations that are not dominated on any metric (e.g., "highest score at each cost level"). This would let you answer questions like "what's the cheapest config that still passes?" or "is Opus worth 5x the cost of Haiku for this task?"
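
The Pareto frontier analysis mentioned in the README change is described as future work, so the repository does not yet define how it would be computed. As a minimal sketch of the idea, assuming only two metrics (`score`, higher is better, and `cost`, lower is better), a config is on the frontier when no other config is at least as good on both metrics and strictly better on one. The `ConfigResult` type, the function names, and the sample numbers below are hypothetical and are not taken from the harness or from any benchmark run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigResult:
    """Hypothetical per-config summary; not a type from the harness."""
    name: str
    score: float  # higher is better
    cost: float   # lower is better


def dominates(a: ConfigResult, b: ConfigResult) -> bool:
    """True if `a` is at least as good as `b` on both metrics and strictly better on one."""
    at_least_as_good = a.score >= b.score and a.cost <= b.cost
    strictly_better = a.score > b.score or a.cost < b.cost
    return at_least_as_good and strictly_better


def pareto_frontier(results: list[ConfigResult]) -> list[ConfigResult]:
    """Return the non-dominated configs, sorted from cheapest to most expensive."""
    frontier = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(frontier, key=lambda r: r.cost)


# Illustrative values only, made up for this sketch.
results = [
    ConfigResult("config-a", score=0.90, cost=5.00),
    ConfigResult("config-b", score=0.80, cost=1.00),
    ConfigResult("config-c", score=0.70, cost=2.00),  # dominated by config-b
    ConfigResult("config-d", score=0.90, cost=6.00),  # dominated by config-a
]

frontier = pareto_frontier(results)
print([r.name for r in frontier])  # ['config-b', 'config-a']
```

Reading the frontier from cheapest to most expensive gives the "highest score at each cost level" view: each step up in cost is only listed if it buys a higher score. The pairwise comparison is quadratic in the number of configs, which is likely fine for a sampled design but would need a sort-based approach for a very large grid.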
diff --git a/dashboard/src/layouts/Base.astro b/dashboard/src/layouts/Base.astro
@@ -1,11 +1,19 @@
 ---
 import "../styles/global.css";
+import { execSync } from "node:child_process";
 
 interface Props {
   title: string;
 }
 
 const { title } = Astro.props;
+
+let gitCommit = "unknown";
+try {
+  gitCommit = execSync("git rev-parse --short HEAD", { encoding: "utf-8" }).trim();
+} catch {
+  // Not in a git repo during build
+}
 ---
 
 <!doctype html>
@@ -34,6 +42,7 @@ const { title } = Astro.props;
     <footer style="border-top: 1px solid var(--border); padding: 24px 0; margin-top: 48px;">
       <div class="container" style="text-align: center; color: var(--text-muted); font-size: 0.75rem;">
         Loop Benchmarking - Open agentic loop benchmark data
+        <span style="margin-left: 12px; font-family: var(--font-mono);">{gitCommit}</span>
       </div>
     </footer>
   </body>
