Bootstrap loop benchmarking project - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit d4cbd4a130aa26069b412af8c331d5e797bd6959
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri,  3 Apr 2026 14:49:42 +0200

Bootstrap loop benchmarking project

Rich CLAUDE.md with full context for future session: what to benchmark,
task design principles, measurement criteria, connection to Ship the Loop
video content. Placeholder only, no code yet.

Diffstat:
A CLAUDE.md  | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A README.md  | 5 +++++

2 files changed, 91 insertions(+), 0 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,86 @@
+# Loop Benchmarking
+
+## What This Is
+
+An open benchmark for comparing different agentic coding loop configurations on identical tasks. Run the same task with different setups (model, tools, verification layers) and publish comparable results. The "Ship the Loop Benchmark."
+
+## Why It Exists
+
+- Referenced from the Ship the Loop YouTube video "AI Made Experienced Developers 19% Slower"
+- The jagged performance frontier concept: AI productivity isn't a single number, it depends on loop architecture
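+- The chain-doubling insight: at 90% per-step accuracy a chain runs ~10 steps on average before its first error; at 95%, ~20 steps. Halving the per-step error rate doubles the chain length, so verification layers have outsized impact.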
+- Nobody has a rigorous, reproducible benchmark for comparing agentic LOOP configurations (not just models)
+
+## What We're Testing
+
+Variables to control:
+- **Model**: Opus vs Sonnet vs Haiku (or equivalent tiers from other providers)
+- **Loop tools**: with/without Playwright, with/without test runner, with/without type checker
+- **Language**: TypeScript vs JavaScript (the type checker catches some defects before the code runs)
+- **Context**: with/without CLAUDE.md rules files, with/without existing codebase context
+- **MCP servers**: with/without various tool access
+
+Constant: the TASK. Same task definition given to every configuration.
+
+## Task Design Principles
+
+- Tasks should span the jagged frontier: some should be agent-friendly (clear success criteria, well-defined), others should be hard (ambiguous requirements, complex interdependencies)
+- Each task needs: a clear spec, automated success criteria (tests that determine pass/fail), a complexity rating
+- Tasks should be self-contained (no external API dependencies that could change)
+- Suggested starter tasks:
+  - **Tetris** - agent-friendly, clear rules, visual verification
+  - **Form with validation** - medium, needs browser testing
+  - **REST API with auth + rate limiting + caching** - medium-hard, multiple concerns
+  - **Data pipeline with error handling** - hard, complex state management
+  - **Something with subtle correctness requirements** - hard, tests are hard to write too
+
+## Measurement
+
+- Pass/fail on automated tests (binary)
+- Number of loop iterations to reach pass
+- Total tokens consumed
+- Wall-clock time
+- Code quality metrics (optional: lint score, type coverage, test coverage)
+- Human judgment score on a 1-5 scale for "would you ship this?"
+
+## Output
+
+- Results published as a micro-site (likely static, could be at benchmarks.shiptheloop.com or similar)
+- Interactive comparison: pick two configs, see side-by-side results
+- Updated periodically as new models/tools release
+- Raw data available for download
+
+## Tech Stack (suggested)
+
+- TypeScript for the harness
+- Each benchmark run is isolated (clean directory, fresh agent session)
+- Results stored as YAML/JSON
+- Static site generator for the results micro-site
+- Could use Claude Code's Agent SDK for programmatic runs
+
+## Connection to Video Content
+
+- The 60-second demo in the productivity truth video is a teaser for this
+- The setup-showdown video (bp-setup-showdown) is the full walkthrough
+- Results referenced in future videos as evidence
+
+## Brian's Context
+
+- Brian runs Ship the Loop (YouTube) and Building Better Teams (consulting)
+- He has a research corpus of 1,200+ AI papers at ~/projects/ai-research-survey/
+- The video production pipeline is at ~/video-studio/
+- Primary stack: TypeScript. Uses Forgejo, not GitHub.
+- Start conservative with resource-intensive settings. Scale up, not down.
+
+## Infrastructure
+
+- Source control: Forgejo (NOT GitHub)
+- No CI/CD yet - will be set up when the harness is built
+- No external dependencies at bootstrap stage
+
+## Conventions
+
+- Never use emdashes
+- Keep results data in YAML/JSON (human-readable, diffable)
+- Each benchmark run gets a unique ID (timestamp + config hash)
+- Raw results are never overwritten, only appended
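
The run ID convention in CLAUDE.md (timestamp + config hash) could be sketched roughly as follows. This is not part of the commit, which contains no code. The `LoopConfig` shape, the function names, and the choice of 32-bit FNV-1a are all assumptions made for illustration; a real harness might prefer SHA-256 or another hash.

```typescript
// Hypothetical config shape. Field names are illustrative, not from the repo.
interface LoopConfig {
  model: string;
  language: "typescript" | "javascript";
  tools: string[];
  rulesFile: boolean;
}

// Serialize with sorted keys and a sorted tool list, so that two
// equivalent configs always produce the same string.
function canonicalize(config: LoopConfig): string {
  const normalized = { ...config, tools: [...config.tools].sort() };
  const keys = Object.keys(normalized).sort();
  return JSON.stringify(normalized, keys);
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
// Chosen here only because it needs no imports (an assumption, see above).
function fnv1a32(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  const hex = (hash >>> 0).toString(16);
  return ("00000000" + hex).slice(-8);
}

// Run ID = compact UTC timestamp + "-" + config hash.
function makeRunId(config: LoopConfig, now: Date = new Date()): string {
  const stamp = now.toISOString().replace(/[-:.]/g, "");
  return `${stamp}-${fnv1a32(canonicalize(config))}`;
}
```

Passing the clock in as a parameter keeps the function deterministic under test. Because the timestamp leads the ID, results stored under it sort chronologically, which fits the append-only rule for raw results.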
diff --git a/README.md b/README.md
@@ -0,0 +1,5 @@
+# Loop Benchmarking
+
+Agentic loop configuration benchmark for Ship the Loop.
+
+Status: bootstrapping. See CLAUDE.md for full context.
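
The chain-doubling figures in CLAUDE.md (90% gives ~10 steps, 95% gives ~20) match a simple model in which each step succeeds independently with the same probability. That model is an assumption, since the commit does not say how the numbers were derived. Under it, the expected number of steps up to and including the first failure is 1 / (1 - p). The function names below are hypothetical.

```typescript
// Expected number of steps up to and including the first failure,
// assuming every step succeeds independently with probability p
// (the mean of a geometric distribution).
function expectedChainLength(p: number): number {
  if (!(p >= 0 && p < 1)) {
    throw new RangeError("p must be in [0, 1)");
  }
  return 1 / (1 - p);
}

// Probability that a chain of n steps finishes with no failure at all,
// under the same independence assumption.
function chainSuccessProbability(p: number, n: number): number {
  if (!(p >= 0 && p <= 1) || n < 0) {
    throw new RangeError("p must be in [0, 1] and n must be >= 0");
  }
  return Math.pow(p, n);
}
```

Under this model a 10-step task completes cleanly about 35% of the time at 90% per-step accuracy and about 60% of the time at 95%, which is one way to read the claim that verification layers have outsized impact.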

	git clone https://git.shiptheloop.com/loop-benchmarking.git