Fix harness bugs, add DOE experiment design, insights dashboard - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit f188f40361a8a1dc7600e2c625ff045d29da3d2b
parent 147931383ebcb9584b54dd141a05bac520b2c3b5
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri,  3 Apr 2026 19:12:22 +0200

Fix harness bugs, add DOE experiment design, insights dashboard

Harness fixes:
- Replace --dangerously-skip-permissions with --permission-mode dontAsk
- Add --verbose (required for stream-json output)
- Use OAuth token extraction for --bare mode (get-oauth-token.sh)
- Rewrite orchestrator in Python (run.py) to avoid bash subshell issues
- Fix eval scripts: replace bc with awk, handle empty/invalid JSON

New features:
- 5 individual tool axes (tool_read/write/edit/glob/grep) replace base_tools
- DOE experiment design module (main effects sweep, Plackett-Burman
  screening, interaction hunt) for efficient grid exploration
- Analysis functions to compute effect sizes and interactions from results
- Insights dashboard page with tornado charts and interaction heatmaps
- Metric switcher (score, cost, turns, wall time)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
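
Reviewer sketch (not part of the commit): the "main effects sweep" named above can be illustrated as a one-factor-at-a-time plan. This is a minimal sketch under stated assumptions, not the code in `experiment_design.py`: the function name `main_effects_plan`, the choice of each axis's first value as the baseline, and the three example axes are all hypothetical.

```python
def main_effects_plan(axes, baseline=None):
    """Return the baseline cell plus one cell per non-baseline value of each axis."""
    if baseline is None:
        # Assumption: the first listed value of every axis is the baseline.
        baseline = {name: values[0] for name, values in axes.items()}
    cells = [dict(baseline)]
    for name, values in axes.items():
        for value in values:
            if value != baseline[name]:
                cell = dict(baseline)
                cell[name] = value  # change exactly one axis
                cells.append(cell)
    return cells


# Hypothetical three-axis grid, smaller than the real grid.yaml.
axes = {
    "model": ["haiku", "sonnet", "opus"],
    "effort": ["high", "max"],
    "tool_write": ["on", "off"],
}
plan = main_effects_plan(axes)
full_factorial = 3 * 2 * 2
print(f"{len(plan)} sweep cells vs {full_factorial} full-factorial cells")
```

The sweep grows with the sum of the axis sizes rather than their product, which is why it works as a cheap first screen. It cannot detect interactions, since no cell changes two axes at once.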

Diffstat:
M .gitignore  | 1 +
M CLAUDE.md  | 2 +-
M README.md  | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
A dashboard/src/components/Heatmap.tsx  | 164 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A dashboard/src/components/Insights.tsx  | 109 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A dashboard/src/components/TornadoChart.tsx  | 168 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M dashboard/src/layouts/Base.astro  | 1 +
A dashboard/src/lib/analysis.ts  | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A dashboard/src/pages/insights.astro  | 17 +++++++++++++++++
M grid.yaml  | 23 +++++++++++++++++++++--
M harness/lib/compute_grid.py  | 1 -
M harness/lib/evaluate.sh  | 41 ++++++++++++++++++++++++-----------------
A harness/lib/experiment_design.py  | 582 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A harness/lib/get-oauth-token.sh  | 13 +++++++++++++
M harness/lib/invoke.sh  | 24 +++++++++++++++++++++---
A harness/run.py  | 424 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M harness/run.sh  | 100 ++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
A results/index.jsonl  | 6 ++++++
M tasks/bookmarks-api/eval/quality.sh  | 2 +-
M tasks/bookmarks-api/eval/structural.sh  | 2 +-
M tasks/data-pipeline/eval/quality.sh  | 2 +-
M tasks/data-pipeline/eval/structural.sh  | 2 +-
M tasks/data-pipeline/eval/tests/functional.sh  | 4 ++--
M tasks/tetris/eval/quality.sh  | 33 ++++++++++++++++++---------------
M tasks/tetris/eval/structural.sh  | 2 +-

25 files changed, 1936 insertions(+), 90 deletions(-)
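
Reviewer sketch (not part of the commit): the interaction analysis added in `dashboard/src/lib/analysis.ts` scores a pair of axes by the largest deviation of any cell mean from an additive model (row mean + column mean - grand mean). The Python below mirrors that idea on a small table of cell means. The function name `interaction_strength` and the example numbers are hypothetical and chosen only for illustration.

```python
def interaction_strength(table):
    """table[a][b] is the mean metric for cell (a, b).

    Returns the largest |actual - (row mean + column mean - grand mean)|.
    """
    cells = [(a, b, m) for a, row in table.items() for b, m in row.items()]
    grand = sum(m for _, _, m in cells) / len(cells)
    row_mean = {a: sum(row.values()) / len(row) for a, row in table.items()}
    columns = {}
    for _, b, m in cells:
        columns.setdefault(b, []).append(m)
    col_mean = {b: sum(v) / len(v) for b, v in columns.items()}
    return max(abs(m - (row_mean[a] + col_mean[b] - grand)) for a, b, m in cells)


# Purely additive: switching the tool off costs 0.2 regardless of model.
additive = {"haiku": {"on": 0.5, "off": 0.3}, "opus": {"on": 0.8, "off": 0.6}}
# Interacting: the tool matters for one model and not at all for the other.
interacting = {"haiku": {"on": 0.6, "off": 0.2}, "opus": {"on": 0.8, "off": 0.8}}

print(interaction_strength(additive))     # close to zero, up to float rounding
print(interaction_strength(interacting))  # clearly non-zero
```

A value near zero means the two axes combine additively. A larger value means the effect of one axis depends on the setting of the other, which is the case the heatmap is meant to surface.
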
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,4 @@ dist/
 .astro/
 results/runs/
 *.tar.gz
+__pycache__/
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -61,7 +61,7 @@ Static site showing:
 
 ## Tech
 
-- **Harness runner**: Bash script that orchestrates experiment runs. Claude Code cannot invoke itself, so the harness must be external. The script reads YAML definitions, computes the grid, and invokes `claude` CLI for each cell.
+- **Harness runner**: Python orchestrator (`harness/run.py`; the earlier Bash runner `harness/run.sh` is kept). Claude Code cannot invoke itself, so the harness must be external. The orchestrator reads YAML definitions, computes the grid, and invokes the `claude` CLI for each cell. Uses `--permission-mode dontAsk` with `--allowedTools` for non-interactive runs (not `--dangerously-skip-permissions`, which is blocked as root).
 - **Model support**: Primarily Anthropic models (Haiku, Sonnet, Opus). Non-Anthropic models possible via LiteLLM proxy in front of Ollama or similar, but expect reduced feature support (extended thinking, tool use may not work). This is valid benchmark data.
 - Results stored as YAML/JSON (append-only, never overwritten)
 - Each run gets a unique ID
diff --git a/README.md b/README.md
@@ -1,5 +1,124 @@
 # Loop Benchmarking
 
-Agentic loop configuration benchmark for Ship the Loop.
+An open benchmark for comparing agentic coding loop configurations. Same task, different setups, all data public.
 
-Status: bootstrapping. See CLAUDE.md for full context.
+## What this does
+
+Define the variables that make up a coding loop (model, tools, prompt style, etc.), and the system generates every combination of their values, minus configured exclusions. Each combination is run against a set of tasks in a clean-room environment with deterministic evaluation. No LLM grading.
+
+## Quick start
+
+### Prerequisites
+
+- Node.js 22+
+- Python 3.12+ with PyYAML
+- Claude Code CLI (authenticated via `claude login`)
+
+### Running experiments
+
+```bash
+# 1. Screen: which variables matter? (~53 cells, vary one axis at a time)
+python3 harness/run.py grid.yaml main_effects
+
+# 2. Analyze: rank variables by effect size
+python3 harness/lib/experiment_design.py analyze results main_effects score
+
+# 3. Deep dive: full factorial on the top variables that matter
+python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write"
+
+# 4. Check for interactions between variables
+python3 harness/lib/experiment_design.py analyze results interactions model effort score
+```
+
+### Other run modes
+
+```bash
+# Profile-based (predefined subsets of the grid)
+python3 harness/run.py grid.yaml smoke          # 6 cells, 1 run each
+python3 harness/run.py grid.yaml core           # 30 cells, 3 runs each
+python3 harness/run.py grid.yaml full           # 204,800 cells (don't)
+
+# Plackett-Burman screening (efficient multi-factor screening)
+python3 harness/run.py grid.yaml plackett_burman
+```
+
+### Building the dashboard
+
+```bash
+cd dashboard
+npm install
+npm run build        # Static site in dashboard/dist/
+npm run dev          # Dev server for local preview
+```
+
+## Project structure
+
+```
+grid.yaml                    # Experiment grid: axes, values, exclusions, profiles
+harness/
+  run.py                     # Main orchestrator (Python)
+  lib/
+    compute_grid.py          # Cartesian product + exclusions
+    experiment_design.py     # DOE plans + analysis (main effects, PB, interactions)
+    get-oauth-token.sh       # Extracts OAuth token for --bare mode
+    invoke.sh                # Claude CLI invocation (bash, used by run.sh)
+    evaluate.sh              # Evaluation dispatch (bash, used by run.sh)
+    workspace.sh             # Workspace creation (bash, used by run.sh)
+tasks/
+  tetris/                    # Agent-friendly: build a game
+  bookmarks-api/             # Medium: REST API with auth
+  data-pipeline/             # Hard: CSV processing with edge cases
+  Each task has:
+    prompts/                 # simple/detailed x en/es
+    eval/                    # Deterministic test suites the agent never sees
+    context.md               # Rules file (used when context_file=provided)
+    scoring.yaml             # Category weights
+results/
+  runs/{run_id}/             # One directory per experiment run
+    meta.json                # Config, timing, exit code
+    transcript.jsonl         # Full conversation (every tool call and response)
+    claude_output.json       # Summary metrics (cost, turns, tokens)
+    eval_results.json        # Structural, functional, quality scores
+    workspace.tar.gz         # Archived agent output
+dashboard/                   # Astro + React static site
+  Grid overview, insights (tornado charts, heatmaps), run detail with transcript viewer
+```
+
+## Configuration dimensions (16 axes)
+
+| Axis | Values |
+|---|---|
+| model | haiku, sonnet, opus |
+| effort | high, max (extended thinking) |
+| prompt_style | simple, detailed |
+| language | typescript, javascript |
+| human_language | en, es |
+| tool_read | on, off |
+| tool_write | on, off |
+| tool_edit | on, off |
+| tool_glob | on, off |
+| tool_grep | on, off |
+| linter | on, off |
+| playwright | on, off |
+| context_file | none, provided |
+| sub_agents | on, off |
+| web_search | on, off |
+| max_budget | low ($0.50), high ($5.00) |
+
+## Evaluation
+
+All scoring is deterministic code. The agent never sees the test suite.
+
+- **Structural**: Does it build? Do expected files exist?
+- **Functional**: Pre-written test suites (Playwright, vitest, golden file diff)
+- **Quality**: Lint, type check, accessibility, security, performance
+
+## Experiment design
+
+Instead of running the full 204,800-cell grid, use statistical designs:
+
+- **Main effects sweep**: Vary one axis at a time from a baseline. Identifies which variables matter.
+- **Plackett-Burman**: Screening design that tests many binary factors efficiently.
+- **Interaction hunt**: Full factorial on a small subset of axes to find interactions.
+
+The dashboard's Insights page visualizes main effects as tornado charts and interactions as heatmaps.
diff --git a/dashboard/src/components/Heatmap.tsx b/dashboard/src/components/Heatmap.tsx
@@ -0,0 +1,164 @@
+import type { InteractionResult } from "../lib/analysis";
+
+interface HeatmapProps {
+  data: InteractionResult;
+  metric: string;
+}
+
+export default function Heatmap({ data, metric }: HeatmapProps) {
+  const { axisA, axisB, table } = data;
+
+  const aValues = Object.keys(table).sort();
+  const bValues = Array.from(
+    new Set(aValues.flatMap((a) => Object.keys(table[a])))
+  ).sort();
+
+  if (aValues.length === 0 || bValues.length === 0) {
+    return (
+      <div
+        className="card"
+        style={{
+          textAlign: "center",
+          padding: "40px",
+          color: "var(--text-muted)",
+        }}
+      >
+        Not enough data for this interaction.
+      </div>
+    );
+  }
+
+  // Find min/max for color scale
+  const allMeans = aValues.flatMap((a) =>
+    bValues.filter((b) => table[a]?.[b]).map((b) => table[a][b].mean)
+  );
+  const minVal = Math.min(...allMeans);
+  const maxVal = Math.max(...allMeans);
+  const range = maxVal - minVal || 1;
+
+  function cellColor(value: number): string {
+    const ratio = (value - minVal) / range;
+    if (ratio > 0.66)
+      return `rgba(34, 197, 94, ${0.3 + ratio * 0.5})`;
+    if (ratio > 0.33)
+      return `rgba(234, 179, 8, ${0.3 + ratio * 0.4})`;
+    return `rgba(239, 68, 68, ${0.3 + (1 - ratio) * 0.4})`;
+  }
+
+  return (
+    <div className="card">
+      <h3 style={{ marginBottom: "4px" }}>
+        {axisA} x {axisB}
+      </h3>
+      <p
+        style={{
+          color: "var(--text-muted)",
+          fontSize: "0.75rem",
+          marginBottom: "16px",
+        }}
+      >
+        Mean {metric} for each combination. Interaction strength:{" "}
+        <span
+          style={{
+            fontFamily: "var(--font-mono)",
+            color:
+              data.maxInteraction > 0.05
+                ? "var(--yellow)"
+                : "var(--text-muted)",
+          }}
+        >
+          {metric === "score" ? `${(data.maxInteraction * 100).toFixed(1)}%` : data.maxInteraction.toFixed(2)}
+        </span>
+      </p>
+
+      <div style={{ overflowX: "auto" }}>
+        <table style={{ borderCollapse: "collapse" }}>
+          <thead>
+            <tr>
+              <th
+                style={{
+                  padding: "8px 12px",
+                  fontSize: "0.7rem",
+                  textAlign: "center",
+                }}
+              >
+                {axisA} \ {axisB}
+              </th>
+              {bValues.map((b) => (
+                <th
+                  key={b}
+                  style={{
+                    padding: "8px 12px",
+                    fontSize: "0.75rem",
+                    textAlign: "center",
+                    fontFamily: "var(--font-mono)",
+                  }}
+                >
+                  {b}
+                </th>
+              ))}
+            </tr>
+          </thead>
+          <tbody>
+            {aValues.map((a) => (
+              <tr key={a}>
+                <td
+                  style={{
+                    padding: "8px 12px",
+                    fontSize: "0.75rem",
+                    fontFamily: "var(--font-mono)",
+                    fontWeight: 600,
+                  }}
+                >
+                  {a}
+                </td>
+                {bValues.map((b) => {
+                  const cell = table[a]?.[b];
+                  if (!cell) {
+                    return (
+                      <td
+                        key={b}
+                        style={{
+                          padding: "8px 12px",
+                          textAlign: "center",
+                          color: "var(--text-muted)",
+                        }}
+                      >
+                        -
+                      </td>
+                    );
+                  }
+                  return (
+                    <td
+                      key={b}
+                      style={{
+                        padding: "8px 12px",
+                        textAlign: "center",
+                        background: cellColor(cell.mean),
+                        fontFamily: "var(--font-mono)",
+                        fontSize: "0.8rem",
+                        fontWeight: 600,
+                        borderRadius: "2px",
+                      }}
+                    >
+                      {metric === "score" ? `${(cell.mean * 100).toFixed(0)}%` : cell.mean.toFixed(2)}
+                      <div
+                        style={{
+                          fontSize: "0.6rem",
+                          fontWeight: 400,
+                          color: "var(--text-muted)",
+                        }}
+                      >
+                        n={cell.n}
+                      </div>
+                    </td>
+                  );
+                })}
+              </tr>
+            ))}
+          </tbody>
+        </table>
+      </div>
+    </div>
+  );
+}
diff --git a/dashboard/src/components/Insights.tsx b/dashboard/src/components/Insights.tsx
@@ -0,0 +1,109 @@
+import { useState, useMemo } from "react";
+import type { Run } from "../lib/data";
+import { computeMainEffects, computeInteraction } from "../lib/analysis";
+import TornadoChart from "./TornadoChart";
+import Heatmap from "./Heatmap";
+
+interface InsightsProps {
+  runs: Run[];
+}
+
+const METRICS = [
+  { key: "score", label: "Score" },
+  { key: "cost", label: "Cost" },
+  { key: "turns", label: "Turns" },
+  { key: "wall_time", label: "Wall Time" },
+];
+
+export default function Insights({ runs }: InsightsProps) {
+  const [metric, setMetric] = useState("score");
+  const [axisA, setAxisA] = useState("");
+  const [axisB, setAxisB] = useState("");
+
+  const effects = useMemo(
+    () => computeMainEffects(runs, metric),
+    [runs, metric]
+  );
+
+  // Offer the six highest-impact axes in the dropdowns; the first two are the default pair.
+  const topAxes = useMemo(() => effects.slice(0, 6).map((e) => e.axis), [effects]);
+
+  const interaction = useMemo(() => {
+    const a = axisA || topAxes[0] || "";
+    // Default B to the first axis that differs from A, so choosing topAxes[1] as A still yields a pair.
+    const b = axisB || topAxes.find((x) => x !== a) || "";
+    if (!a || !b || a === b) return null;
+    return computeInteraction(runs, a, b, metric);
+  }, [runs, axisA, axisB, metric, topAxes]);
+
+  return (
+    <div style={{ display: "flex", flexDirection: "column", gap: "24px" }}>
+      {/* Metric selector */}
+      <div style={{ display: "flex", gap: "8px", alignItems: "center" }}>
+        <span style={{ fontSize: "0.8rem", color: "var(--text-muted)" }}>
+          Metric:
+        </span>
+        {METRICS.map((m) => (
+          <button
+            key={m.key}
+            onClick={() => setMetric(m.key)}
+            style={{
+              padding: "4px 12px",
+              borderRadius: "4px",
+              border:
+                metric === m.key
+                  ? "1px solid var(--accent)"
+                  : "1px solid var(--border)",
+              background:
+                metric === m.key ? "rgba(99, 102, 241, 0.15)" : "transparent",
+              color: metric === m.key ? "var(--accent)" : "var(--text-muted)",
+              cursor: "pointer",
+              fontSize: "0.8rem",
+            }}
+          >
+            {m.label}
+          </button>
+        ))}
+      </div>
+
+      {/* Tornado chart */}
+      <TornadoChart effects={effects} metric={metric} />
+
+      {/* Interaction explorer */}
+      <div className="card">
+        <h3 style={{ marginBottom: "12px" }}>Interaction Explorer</h3>
+        <div style={{ display: "flex", gap: "12px", marginBottom: "16px" }}>
+          <div className="filter-group">
+            <label>Axis A</label>
+            <select
+              value={axisA || topAxes[0] || ""}
+              onChange={(e) => setAxisA(e.target.value)}
+            >
+              {topAxes.map((a) => (
+                <option key={a} value={a}>
+                  {a}
+                </option>
+              ))}
+            </select>
+          </div>
+          <div className="filter-group">
+            <label>Axis B</label>
+            <select
+              value={axisB || topAxes[1] || ""}
+              onChange={(e) => setAxisB(e.target.value)}
+            >
+              {topAxes
+                .filter((a) => a !== (axisA || topAxes[0]))
+                .map((a) => (
+                  <option key={a} value={a}>
+                    {a}
+                  </option>
+                ))}
+            </select>
+          </div>
+        </div>
+
+        {interaction && <Heatmap data={interaction} metric={metric} />}
+      </div>
+    </div>
+  );
+}
diff --git a/dashboard/src/components/TornadoChart.tsx b/dashboard/src/components/TornadoChart.tsx
@@ -0,0 +1,168 @@
+import type { AxisEffect } from "../lib/analysis";
+
+interface TornadoChartProps {
+  effects: AxisEffect[];
+  metric: string;
+}
+
+const AXIS_LABELS: Record<string, string> = {
+  model: "Model",
+  effort: "Effort",
+  prompt_style: "Prompt Style",
+  language: "Language",
+  human_language: "Human Language",
+  tool_read: "Read Tool",
+  tool_write: "Write Tool",
+  tool_edit: "Edit Tool",
+  tool_glob: "Glob Tool",
+  tool_grep: "Grep Tool",
+  linter: "Linter",
+  playwright: "Playwright",
+  context_file: "Context File",
+  sub_agents: "Sub-agents",
+  web_search: "Web Search",
+  max_budget: "Budget",
+};
+
+export default function TornadoChart({ effects, metric }: TornadoChartProps) {
+  if (effects.length === 0) {
+    return (
+      <div
+        className="card"
+        style={{
+          textAlign: "center",
+          padding: "40px",
+          color: "var(--text-muted)",
+        }}
+      >
+        Not enough data to compute effects. Run more experiments with varying
+        configurations.
+      </div>
+    );
+  }
+
+  const maxSpread = Math.max(...effects.map((e) => e.spread));
+  const scale = maxSpread > 0 ? 200 / maxSpread : 1; // max bar width = 200px
+
+  return (
+    <div className="card">
+      <h3 style={{ marginBottom: "4px" }}>Variable Impact on {metric}</h3>
+      <p
+        style={{
+          color: "var(--text-muted)",
+          fontSize: "0.75rem",
+          marginBottom: "16px",
+        }}
+      >
+        Sorted by effect size. Wider bars = bigger impact on outcomes.
+      </p>
+
+      {effects.map((effect) => (
+        <div
+          key={effect.axis}
+          style={{
+            display: "flex",
+            alignItems: "center",
+            marginBottom: "12px",
+            gap: "12px",
+          }}
+        >
+          {/* Label */}
+          <div
+            style={{
+              width: "120px",
+              textAlign: "right",
+              fontSize: "0.8rem",
+              flexShrink: 0,
+            }}
+          >
+            {AXIS_LABELS[effect.axis] || effect.axis}
+          </div>
+
+          {/* Bars */}
+          <div
+            style={{
+              flex: 1,
+              display: "flex",
+              flexDirection: "column",
+              gap: "2px",
+            }}
+          >
+            {effect.values.map((entry) => {
+              const width = Math.abs(entry.effect) * scale;
+              const isPositive = entry.effect >= 0;
+              return (
+                <div
+                  key={entry.value}
+                  style={{
+                    display: "flex",
+                    alignItems: "center",
+                    gap: "8px",
+                  }}
+                >
+                  <div
+                    style={{
+                      width: "50px",
+                      textAlign: "right",
+                      fontSize: "0.7rem",
+                      fontFamily: "var(--font-mono)",
+                      color: "var(--text-muted)",
+                      flexShrink: 0,
+                    }}
+                  >
+                    {entry.value}
+                  </div>
+                  <div
+                    style={{
+                      height: "16px",
+                      width: `${Math.max(width, 2)}px`,
+                      background: isPositive
+                        ? "var(--green)"
+                        : "var(--red)",
+                      borderRadius: "2px",
+                      opacity: 0.8,
+                    }}
+                  />
+                  <div
+                    style={{
+                      fontSize: "0.7rem",
+                      fontFamily: "var(--font-mono)",
+                      color: isPositive
+                        ? "var(--green)"
+                        : "var(--red)",
+                    }}
+                  >
+                    {entry.effect >= 0 ? "+" : ""}
+                    {metric === "score" ? `${(entry.effect * 100).toFixed(1)}%` : entry.effect.toFixed(2)}
+                  </div>
+                  <div
+                    style={{
+                      fontSize: "0.65rem",
+                      color: "var(--text-muted)",
+                    }}
+                  >
+                    (n={entry.n})
+                  </div>
+                </div>
+              );
+            })}
+          </div>
+
+          {/* Spread */}
+          <div
+            style={{
+              width: "60px",
+              textAlign: "right",
+              fontSize: "0.75rem",
+              fontFamily: "var(--font-mono)",
+              color: "var(--accent)",
+              flexShrink: 0,
+            }}
+          >
+            {metric === "score" ? `${(effect.spread * 100).toFixed(1)}%` : effect.spread.toFixed(2)}
+          </div>
+        </div>
+      ))}
+    </div>
+  );
+}
diff --git a/dashboard/src/layouts/Base.astro b/dashboard/src/layouts/Base.astro
@@ -23,6 +23,7 @@ const { title } = Astro.props;
         </a>
         <nav style="display: flex; gap: 16px; font-size: 0.875rem;">
           <a href="/">Grid</a>
+          <a href="/insights">Insights</a>
           <a href="/compare">Compare</a>
         </nav>
       </div>
diff --git a/dashboard/src/lib/analysis.ts b/dashboard/src/lib/analysis.ts
@@ -0,0 +1,180 @@
+import type { Run } from "./data";
+
+export interface EffectEntry {
+  value: string;
+  mean: number;
+  effect: number;
+  n: number;
+}
+
+export interface AxisEffect {
+  axis: string;
+  spread: number;
+  values: EffectEntry[];
+}
+
+export interface InteractionCell {
+  mean: number;
+  n: number;
+}
+
+export interface InteractionResult {
+  axisA: string;
+  axisB: string;
+  table: Record<string, Record<string, InteractionCell>>;
+  maxInteraction: number;
+}
+
+const SKIP_KEYS = new Set([
+  "task",
+  "cell_id",
+  "run_id",
+  "run_number",
+  "runs_per_cell",
+  "max_budget_usd",
+  "timeout_seconds",
+  "base_tools",
+  "started_at",
+  "completed_at",
+  "wall_time_seconds",
+  "exit_code",
+]);
+
+type MetricExtractor = (run: Run) => number | null;
+
+const METRICS: Record<string, MetricExtractor> = {
+  score: (r) => r.eval_results?.score ?? null,
+  cost: (r) => r.claude_output?.total_cost_usd ?? null,
+  turns: (r) => r.claude_output?.num_turns ?? null,
+  wall_time: (r) => r.meta.wall_time_seconds ?? null,
+};
+
+export function computeMainEffects(
+  runs: Run[],
+  metric: string = "score"
+): AxisEffect[] {
+  const extract = METRICS[metric];
+  if (!extract) return [];
+
+  const scored: Array<{ meta: Run["meta"]; value: number }> = [];
+  for (const run of runs) {
+    const val = extract(run);
+    if (val !== null) scored.push({ meta: run.meta, value: val });
+  }
+  if (scored.length === 0) return [];
+
+  const grandMean = scored.reduce((s, r) => s + r.value, 0) / scored.length;
+
+  // Find axis keys from meta
+  const axisKeys = Object.keys(scored[0].meta).filter(
+    (k) => !SKIP_KEYS.has(k)
+  );
+
+  const effects: AxisEffect[] = [];
+
+  for (const axis of axisKeys) {
+    const groups: Record<string, number[]> = {};
+    for (const { meta, value } of scored) {
+      const key = String((meta as Record<string, unknown>)[axis] ?? "unknown");
+      (groups[key] ??= []).push(value);
+    }
+
+    if (Object.keys(groups).length < 2) continue;
+
+    const values: EffectEntry[] = [];
+    for (const [val, vals] of Object.entries(groups)) {
+      const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
+      values.push({
+        value: val,
+        mean: Math.round(mean * 10000) / 10000,
+        effect: Math.round((mean - grandMean) * 10000) / 10000,
+        n: vals.length,
+      });
+    }
+
+    const means = values.map((v) => v.mean);
+    const spread = Math.max(...means) - Math.min(...means);
+
+    effects.push({
+      axis,
+      spread: Math.round(spread * 10000) / 10000,
+      values: values.sort((a, b) => b.effect - a.effect),
+    });
+  }
+
+  return effects.sort((a, b) => b.spread - a.spread);
+}
+
+export function computeInteraction(
+  runs: Run[],
+  axisA: string,
+  axisB: string,
+  metric: string = "score"
+): InteractionResult {
+  const extract = METRICS[metric];
+  if (!extract)
+    return { axisA, axisB, table: {}, maxInteraction: 0 };
+
+  const groups: Record<string, Record<string, number[]>> = {};
+
+  for (const run of runs) {
+    const val = extract(run);
+    if (val === null) continue;
+    const a = String((run.meta as Record<string, unknown>)[axisA] ?? "?");
+    const b = String((run.meta as Record<string, unknown>)[axisB] ?? "?");
+    ((groups[a] ??= {})[b] ??= []).push(val);
+  }
+
+  const table: Record<string, Record<string, InteractionCell>> = {};
+  const allVals: number[] = [];
+
+  for (const [a, bGroups] of Object.entries(groups)) {
+    table[a] = {};
+    for (const [b, vals] of Object.entries(bGroups)) {
+      const mean = vals.reduce((s, v) => s + v, 0) / vals.length;
+      table[a][b] = { mean: Math.round(mean * 10000) / 10000, n: vals.length };
+      allVals.push(mean);
+    }
+  }
+
+  const grandMean =
+    allVals.length > 0
+      ? allVals.reduce((a, b) => a + b, 0) / allVals.length
+      : 0;
+
+  // Row and column means
+  const aMeans: Record<string, number> = {};
+  const bMeans: Record<string, number> = {};
+  const bKeys = new Set<string>();
+
+  for (const [a, bGroups] of Object.entries(table)) {
+    const vals = Object.values(bGroups).map((c) => c.mean);
+    aMeans[a] = vals.reduce((s, v) => s + v, 0) / vals.length;
+    for (const b of Object.keys(bGroups)) bKeys.add(b);
+  }
+
+  for (const b of bKeys) {
+    const vals: number[] = [];
+    for (const a of Object.keys(table)) {
+      if (table[a][b]) vals.push(table[a][b].mean);
+    }
+    bMeans[b] = vals.length > 0 ? vals.reduce((s, v) => s + v, 0) / vals.length : grandMean;
+  }
+
+  // Max interaction = max deviation from additive model
+  let maxInteraction = 0;
+  for (const a of Object.keys(table)) {
+    for (const b of Object.keys(table[a])) {
+      const expected = aMeans[a] + bMeans[b] - grandMean;
+      const actual = table[a][b].mean;
+      maxInteraction = Math.max(maxInteraction, Math.abs(actual - expected));
+    }
+  }
+
+  return {
+    axisA,
+    axisB,
+    table,
+    maxInteraction: Math.round(maxInteraction * 10000) / 10000,
+  };
+}
diff --git a/dashboard/src/pages/insights.astro b/dashboard/src/pages/insights.astro
@@ -0,0 +1,17 @@
+---
+import Base from "../layouts/Base.astro";
+import { loadAllRuns } from "../lib/data";
+import Insights from "../components/Insights";
+
+const runs = loadAllRuns();
+---
+
+<Base title="Insights">
+  <h1 style="margin-bottom: 8px;">Insights</h1>
+  <p style="color: var(--text-muted); margin-bottom: 24px; font-size: 0.875rem;">
+    Which variables actually move the needle? Tornado charts show main effects,
+    heatmaps reveal interactions.
+  </p>
+
+  <Insights client:load runs={runs} />
+</Base>
diff --git a/grid.yaml b/grid.yaml
@@ -3,7 +3,6 @@ version: 1
 defaults:
   runs_per_cell: 3
   timeout_seconds: 600
-  base_tools: "Bash,Read,Edit,Write,Glob,Grep"
   budget:
     low: 0.50
     high: 5.00
@@ -19,6 +18,16 @@ axes:
     values: [typescript, javascript]
   human_language:
     values: [en, es]
+  tool_read:
+    values: ["on", "off"]
+  tool_write:
+    values: ["on", "off"]
+  tool_edit:
+    values: ["on", "off"]
+  tool_glob:
+    values: ["on", "off"]
+  tool_grep:
+    values: ["on", "off"]
   linter:
     values: ["on", "off"]
   playwright:
@@ -54,11 +63,16 @@ profiles:
   smoke:
     description: "Quick validation -- minimal grid"
     axes:
-      model: [sonnet]
+      model: [haiku]
       effort: [high]
       prompt_style: [simple, detailed]
       language: [typescript]
       human_language: [en]
+      tool_read: ["on"]
+      tool_write: ["on"]
+      tool_edit: ["on"]
+      tool_glob: ["on"]
+      tool_grep: ["on"]
       linter: ["off"]
       playwright: ["off"]
       context_file: [none]
@@ -75,6 +89,11 @@ profiles:
       prompt_style: [simple, detailed]
       language: [typescript]
       human_language: [en]
+      tool_read: ["on"]
+      tool_write: ["on"]
+      tool_edit: ["on"]
+      tool_glob: ["on"]
+      tool_grep: ["on"]
       linter: ["off"]
       playwright: ["off"]
       context_file: [none]
diff --git a/harness/lib/compute_grid.py b/harness/lib/compute_grid.py
@@ -108,7 +108,6 @@ def compute_cells(grid, profile_name):
             cell["runs_per_cell"] = runs_per_cell
             cell["max_budget_usd"] = budget_usd
             cell["timeout_seconds"] = defaults["timeout_seconds"]
-            cell["base_tools"] = defaults["base_tools"]
 
             cells.append(cell)
 
diff --git a/harness/lib/evaluate.sh b/harness/lib/evaluate.sh
@@ -13,37 +13,44 @@ evaluate() {
 
   local eval_results='{"structural": null, "functional": null, "quality": null, "score": null}'
 
+  # Helper: safely merge JSON into eval_results
+  merge_result() {
+    local key="$1"
+    local output="$2"
+
+    if [[ -z "$output" ]]; then
+      eval_results=$(echo "$eval_results" | jq --arg k "$key" '.[$k] = {"pass": false, "error": "no output"}')
+      return
+    fi
+
+    if echo "$output" | jq . > /dev/null 2>&1; then
+      eval_results=$(echo "$eval_results" | jq --arg k "$key" --argjson v "$output" '.[$k] = $v')
+    else
+      # Truncate long non-JSON output to avoid jq issues
+      local truncated="${output:0:500}"
+      eval_results=$(echo "$eval_results" | jq --arg k "$key" --arg e "$truncated" '.[$k] = {"pass": false, "error": $e}')
+    fi
+  }
+
   # --- Structural checks ---
   if [[ -f "$task_dir/eval/structural.sh" ]]; then
     local structural_output
     structural_output=$(bash "$task_dir/eval/structural.sh" "$workspace" "$language" 2>&1) || true
-    if echo "$structural_output" | jq . > /dev/null 2>&1; then
-      eval_results=$(echo "$eval_results" | jq --argjson s "$structural_output" '.structural = $s')
-    else
-      eval_results=$(echo "$eval_results" | jq --arg s "$structural_output" '.structural = {"pass": false, "error": $s}')
-    fi
+    merge_result "structural" "$structural_output"
   fi
 
   # --- Functional tests ---
-  local functional_output='{}'
   if [[ -d "$task_dir/eval/tests" ]]; then
-    functional_output=$(run_functional_tests "$task_dir" "$workspace" "$language" "$run_dir") || true
-    if echo "$functional_output" | jq . > /dev/null 2>&1; then
-      eval_results=$(echo "$eval_results" | jq --argjson f "$functional_output" '.functional = $f')
-    else
-      eval_results=$(echo "$eval_results" | jq '.functional = {"pass": false, "error": "test runner failed"}')
-    fi
+    local functional_output
+    functional_output=$(run_functional_tests "$task_dir" "$workspace" "$language" "$run_dir" 2>&1) || true
+    merge_result "functional" "$functional_output"
   fi
 
   # --- Quality checks ---
   if [[ -f "$task_dir/eval/quality.sh" ]]; then
     local quality_output
     quality_output=$(bash "$task_dir/eval/quality.sh" "$workspace" "$language" 2>&1) || true
-    if echo "$quality_output" | jq . > /dev/null 2>&1; then
-      eval_results=$(echo "$eval_results" | jq --argjson q "$quality_output" '.quality = $q')
-    else
-      eval_results=$(echo "$eval_results" | jq --arg q "$quality_output" '.quality = {"pass": false, "error": $q}')
-    fi
+    merge_result "quality" "$quality_output"
   fi
 
   # --- Compute aggregate score ---
diff --git a/harness/lib/experiment_design.py b/harness/lib/experiment_design.py
@@ -0,0 +1,582 @@
+#!/usr/bin/env python3
+"""Experiment design and analysis for loop benchmarking.
+
+Generates efficient experiment plans instead of full factorial grids.
+Analyzes results to identify which variables have the biggest impact.
+
+Approaches:
+  1. Main effects sweep: vary one axis at a time from a baseline
+  2. Fractional factorial: Plackett-Burman screening for binary factors
+  3. Interaction hunt: full factorial on the top-k most impactful axes
+"""
+
+import json
+import math
+import sys
+from itertools import product
+from pathlib import Path
+
+import yaml
+
+
+def load_grid(path):
+    with open(path) as f:
+        return yaml.safe_load(f)
+
+
+def get_axes(grid, profile_name=None):
+    """Get axis definitions, with a profile's axis values overriding the defaults."""
+    top_axes = {name: spec["values"] for name, spec in grid["axes"].items()}
+    if profile_name and profile_name in grid.get("profiles", {}):
+        profile = grid["profiles"][profile_name]
+        if "axes" in profile:
+            axes = dict(top_axes)
+            for name, values in profile["axes"].items():
+                axes[name] = values
+            return axes
+    return top_axes
+
+
+# ---------------------------------------------------------------------------
+# 1. Main effects sweep
+# ---------------------------------------------------------------------------
+
+def main_effects_plan(grid, baseline=None, tasks=None):
+    """Generate a one-at-a-time sweep from a baseline.
+
+    For each axis, vary it through all its values while holding everything
+    else at baseline. This identifies main effects cheaply.
+
+    Returns a list of cell dicts.
+    """
+    axes = get_axes(grid)
+    tasks = tasks or grid["tasks"]
+    defaults = grid["defaults"]
+
+    # Pick baseline: first value of each axis unless overridden
+    if baseline is None:
+        baseline = {name: values[0] for name, values in axes.items()}
+
+    cells = []
+    seen = set()
+
+    for task in tasks:
+        # Apply task overrides to axes
+        task_axes = dict(axes)
+        overrides = grid.get("task_overrides", {}).get(task, {})
+        if "axes" in overrides:
+            for axis_name, spec in overrides["axes"].items():
+                task_axes[axis_name] = spec["values"]
+
+        # Baseline cell
+        base_cell = dict(baseline)
+        # Ensure baseline values are valid for this task
+        for name, values in task_axes.items():
+            if base_cell[name] not in values:
+                base_cell[name] = values[0]
+
+        base_key = _cell_key(task, base_cell)
+        if base_key not in seen:
+            seen.add(base_key)
+            cells.append(_build_cell(task, base_cell, defaults, grid))
+
+        # Vary each axis
+        for axis_name, values in task_axes.items():
+            for value in values:
+                if value == base_cell[axis_name]:
+                    continue
+                varied = dict(base_cell)
+                varied[axis_name] = value
+                key = _cell_key(task, varied)
+                if key not in seen:
+                    seen.add(key)
+                    cells.append(_build_cell(task, varied, defaults, grid))
+
+    return cells
+
+
+# ---------------------------------------------------------------------------
+# 2. Plackett-Burman screening
+# ---------------------------------------------------------------------------
+
+def _hadamard_matrix(n):
+    """Generate a Hadamard-like matrix for Plackett-Burman design.
+
+    n must be a multiple of 4. Returns an n x (n-1) matrix of +1/-1.
+    Uses tabulated first rows (sizes 4 to 24) rather than computing them.
+    """
+    # For simplicity, use the standard PB generators for common sizes
+    # These are the first rows; subsequent rows are cyclic shifts
+    generators = {
+        4: [1, 1, -1],
+        8: [1, 1, 1, -1, 1, -1, -1],
+        12: [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1],
+        16: [1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, -1],
+        20: [1, 1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, -1, -1],
+        24: [1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, -1, -1, -1],
+    }
+
+    if n not in generators:
+        # Fall back to nearest larger size
+        for size in sorted(generators.keys()):
+            if size >= n:
+                n = size
+                break
+        else:
+            n = max(generators.keys())
+
+    gen = generators[n]
+    k = len(gen)
+    matrix = []
+
+    for i in range(k):
+        row = gen[i:] + gen[:i]
+        matrix.append(row)
+
+    # Add a row of all -1
+    matrix.append([-1] * k)
+
+    return matrix
+
+
+def plackett_burman_plan(grid, tasks=None):
+    """Generate a Plackett-Burman screening design for binary factors.
+
+    Factors with more than 2 levels (e.g., model: haiku/sonnet/opus) are not
+    screened: the binary design is repeated at every combination of their levels.
+
+    Returns a list of cell dicts.
+    """
+    axes = get_axes(grid)
+    tasks = tasks or grid["tasks"]
+    defaults = grid["defaults"]
+
+    # Separate binary and multi-level factors
+    binary_axes = {}
+    multi_axes = {}
+    for name, values in axes.items():
+        if len(values) == 2:
+            binary_axes[name] = values
+        elif len(values) > 2:
+            multi_axes[name] = values
+
+    binary_names = sorted(binary_axes.keys())
+    n_factors = len(binary_names)
+
+    if n_factors == 0:
+        return main_effects_plan(grid, tasks=tasks)
+
+    # Find the smallest PB design that fits
+    n_runs = n_factors + 1
+    # Round up to multiple of 4
+    n_runs = math.ceil(n_runs / 4) * 4
+
+    matrix = _hadamard_matrix(n_runs)
+
+    cells = []
+    seen = set()
+
+    # For multi-level factors, fix at each level and run the PB design
+    if multi_axes:
+        multi_names = sorted(multi_axes.keys())
+        multi_combos = list(product(*[multi_axes[n] for n in multi_names]))
+    else:
+        multi_names = []
+        multi_combos = [()]
+
+    for multi_combo in multi_combos:
+        multi_fixed = dict(zip(multi_names, multi_combo))
+
+        for row in matrix:
+            cell = dict(multi_fixed)
+            for i, name in enumerate(binary_names):
+                if i < len(row):
+                    idx = 0 if row[i] == -1 else 1
+                else:
+                    idx = 0
+                cell[name] = binary_axes[name][idx]
+
+            for task in tasks:
+                # Apply task overrides
+                task_axes = dict(axes)
+                overrides = grid.get("task_overrides", {}).get(task, {})
+                if "axes" in overrides:
+                    for axis_name, spec in overrides["axes"].items():
+                        task_axes[axis_name] = spec["values"]
+
+                # Validate on a per-task copy so fixes don't leak across tasks
+                task_cell = dict(cell)
+                valid = True
+                for name, values in task_axes.items():
+                    if task_cell.get(name) not in values:
+                        if len(values) == 1:
+                            task_cell[name] = values[0]
+                        else:
+                            valid = False
+                            break
+                # Check exclusions
+                if valid and not _is_excluded(task_cell, grid):
+                    key = _cell_key(task, task_cell)
+                    if key not in seen:
+                        seen.add(key)
+                        cells.append(_build_cell(task, task_cell, defaults, grid))
+
+    return cells
+
+
+# ---------------------------------------------------------------------------
+# 3. Interaction hunt
+# ---------------------------------------------------------------------------
+
+def interaction_hunt_plan(grid, top_axes, tasks=None):
+    """Full factorial on a subset of axes, baseline for the rest.
+
+    Args:
+        top_axes: list of axis names to fully explore (e.g., ["model", "effort", "linter"])
+        tasks: which tasks to include
+    """
+    axes = get_axes(grid)
+    tasks = tasks or grid["tasks"]
+    defaults = grid["defaults"]
+
+    # Baseline for non-explored axes
+    baseline = {name: values[0] for name, values in axes.items()}
+
+    # Full factorial on top_axes
+    explore_names = sorted(top_axes)
+    explore_values = [axes[n] for n in explore_names]
+
+    cells = []
+    seen = set()
+
+    for combo in product(*explore_values):
+        cell = dict(baseline)
+        for name, value in zip(explore_names, combo):
+            cell[name] = value
+
+        for task in tasks:
+            task_axes = dict(axes)
+            overrides = grid.get("task_overrides", {}).get(task, {})
+            if "axes" in overrides:
+                for axis_name, spec in overrides["axes"].items():
+                    task_axes[axis_name] = spec["values"]
+
+            # Adjust for task constraints on a copy so changes don't leak across tasks
+            task_cell = dict(cell)
+            for name, values in task_axes.items():
+                if task_cell.get(name) not in values:
+                    task_cell[name] = values[0]
+            if not _is_excluded(task_cell, grid):
+                key = _cell_key(task, task_cell)
+                if key not in seen:
+                    seen.add(key)
+                    cells.append(_build_cell(task, task_cell, defaults, grid))
+
+    return cells
+
+
+# ---------------------------------------------------------------------------
+# Analysis: compute effects from results
+# ---------------------------------------------------------------------------
+
+def analyze_main_effects(results_dir, metric="score"):
+    """Compute the main effect of each axis on a given metric.
+
+    Reads all completed runs, groups by axis values, computes mean metric
+    for each group, and reports its effect (difference from the grand mean).
+
+    Returns a dict: {axis_name: {"values": {value: {"mean", "effect", "n"}},
+    "spread": max_mean - min_mean}, ...}, sorted by spread, largest first.
+    """
+    runs = _load_results(results_dir)
+    if not runs:
+        return {}
+
+    # Extract metric values
+    scored_runs = []
+    for run in runs:
+        val = _extract_metric(run, metric)
+        if val is not None:
+            scored_runs.append((run["meta"], val))
+
+    if not scored_runs:
+        return {}
+
+    grand_mean = sum(v for _, v in scored_runs) / len(scored_runs)
+
+    # Identify axes from the first run's meta
+    meta_keys = set(scored_runs[0][0].keys())
+    skip_keys = {
+        "task", "cell_id", "run_id", "run_number", "runs_per_cell",
+        "max_budget_usd", "timeout_seconds", "base_tools",
+        "started_at", "completed_at", "wall_time_seconds", "exit_code",
+    }
+    axis_names = sorted(meta_keys - skip_keys)
+
+    effects = {}
+    for axis in axis_names:
+        groups = {}
+        for meta, val in scored_runs:
+            key = str(meta.get(axis, "unknown"))
+            groups.setdefault(key, []).append(val)
+
+        if len(groups) < 2:
+            continue
+
+        axis_effects = {}
+        for value, vals in sorted(groups.items()):
+            group_mean = sum(vals) / len(vals)
+            effect = group_mean - grand_mean
+            axis_effects[value] = {
+                "mean": round(group_mean, 4),
+                "effect": round(effect, 4),
+                "n": len(vals),
+            }
+
+        # Effect magnitude = max spread between any two values
+        means = [v["mean"] for v in axis_effects.values()]
+        spread = max(means) - min(means) if means else 0
+
+        effects[axis] = {
+            "values": axis_effects,
+            "spread": round(spread, 4),
+        }
+
+    # Sort by spread (biggest effects first)
+    effects = dict(sorted(effects.items(), key=lambda x: -x[1]["spread"]))
+    return effects
+
+
+def analyze_interactions(results_dir, axis_a, axis_b, metric="score"):
+    """Compute the interaction effect between two axes.
+
+    Returns a 2D table of mean metric values for each (a_value, b_value) combo,
+    plus the interaction effect size.
+    """
+    runs = _load_results(results_dir)
+    if not runs:
+        return {}
+
+    groups = {}
+    for run in runs:
+        val = _extract_metric(run, metric)
+        if val is None:
+            continue
+        a_val = str(run["meta"].get(axis_a, "?"))
+        b_val = str(run["meta"].get(axis_b, "?"))
+        key = (a_val, b_val)
+        groups.setdefault(key, []).append(val)
+
+    if not groups:
+        return {}
+
+    table = {}
+    for (a_val, b_val), vals in sorted(groups.items()):
+        table.setdefault(a_val, {})[b_val] = {
+            "mean": round(sum(vals) / len(vals), 4),
+            "n": len(vals),
+        }
+
+    # Compute interaction: does the effect of axis_a change depending on axis_b?
+    a_values = sorted(table.keys())
+    b_values = sorted(set(b for row in table.values() for b in row.keys()))
+
+    # Interaction = deviation from additive model
+    # Grand mean: unweighted mean of the cell means
+    cell_means = [cell["mean"] for row in table.values() for cell in row.values()]
+    grand_mean = sum(cell_means) / len(cell_means)
+
+    a_means = {}
+    for a in a_values:
+        vals = [table[a][b]["mean"] for b in b_values if b in table.get(a, {})]
+        a_means[a] = sum(vals) / len(vals) if vals else grand_mean
+
+    b_means = {}
+    for b in b_values:
+        vals = [table[a][b]["mean"] for a in a_values if b in table.get(a, {})]
+        b_means[b] = sum(vals) / len(vals) if vals else grand_mean
+
+    # Interaction effects
+    interactions = {}
+    max_interaction = 0
+    for a in a_values:
+        for b in b_values:
+            if b in table.get(a, {}):
+                expected = a_means[a] + b_means[b] - grand_mean
+                actual = table[a][b]["mean"]
+                interaction = round(actual - expected, 4)
+                interactions[(a, b)] = interaction
+                max_interaction = max(max_interaction, abs(interaction))
+
+    return {
+        "table": table,
+        "grand_mean": round(grand_mean, 4),
+        "a_means": {k: round(v, 4) for k, v in a_means.items()},
+        "b_means": {k: round(v, 4) for k, v in b_means.items()},
+        "interactions": {f"{a},{b}": v for (a, b), v in interactions.items()},
+        "max_interaction": round(max_interaction, 4),
+    }
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _cell_key(task, cell):
+    axis_names = sorted(k for k in cell.keys() if k not in (
+        "task", "cell_id", "runs_per_cell", "max_budget_usd",
+        "timeout_seconds", "base_tools",
+    ))
+    parts = [task] + [f"{k}={cell[k]}" for k in axis_names]
+    return "_".join(parts)
+
+
+def _is_excluded(cell, grid):
+    for exclusion in grid.get("exclusions", []):
+        match = True
+        for key, value in exclusion["when"].items():
+            if cell.get(key) != value:
+                match = False
+                break
+        if match:
+            return True
+    return False
+
+
+def _build_cell(task, cell, defaults, grid):
+    axis_names = sorted(cell.keys())
+    cell_id_parts = [task] + [f"{k}={cell[k]}" for k in axis_names]
+
+    result = dict(cell)
+    result["task"] = task
+    result["cell_id"] = "_".join(cell_id_parts)
+    result["runs_per_cell"] = defaults.get("runs_per_cell", 3)
+    result["timeout_seconds"] = defaults.get("timeout_seconds", 600)
+
+    budget_key = cell.get("max_budget", "low")
+    result["max_budget_usd"] = defaults.get("budget", {}).get(budget_key, 0.50)
+
+    return result
+
+
+def _load_results(results_dir):
+    """Load all completed runs from the results directory."""
+    results_dir = Path(results_dir)
+    runs_dir = results_dir / "runs"
+    if not runs_dir.exists():
+        return []
+
+    runs = []
+    for run_dir in runs_dir.iterdir():
+        if not run_dir.is_dir():
+            continue
+        meta_path = run_dir / "meta.json"
+        eval_path = run_dir / "eval_results.json"
+        claude_path = run_dir / "claude_output.json"
+
+        if not meta_path.exists() or not eval_path.exists():
+            continue
+
+        try:
+            meta = json.loads(meta_path.read_text())
+            eval_results = json.loads(eval_path.read_text())
+            claude_output = {}
+            if claude_path.exists():
+                claude_output = json.loads(claude_path.read_text())
+
+            runs.append({
+                "meta": meta,
+                "eval": eval_results,
+                "claude": claude_output,
+            })
+        except (json.JSONDecodeError, OSError):
+            continue
+
+    return runs
+
+
+def _extract_metric(run, metric):
+    """Extract a numeric metric from a run."""
+    if metric == "score":
+        val = run["eval"].get("score")
+        return val if isinstance(val, (int, float)) else None
+    elif metric == "cost":
+        return run["claude"].get("total_cost_usd")
+    elif metric == "turns":
+        return run["claude"].get("num_turns")
+    elif metric == "wall_time":
+        return run["meta"].get("wall_time_seconds")
+    elif metric == "pass_rate":
+        func = run["eval"].get("functional", {})
+        if isinstance(func, dict) and "pass" in func:
+            return 1.0 if func["pass"] else 0.0
+        return None
+    return None
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+def main():
+    if len(sys.argv) < 3:
+        print("Usage:")
+        print("  experiment_design.py plan <grid_file> <design> [args...]")
+        print("    designs: main_effects, plackett_burman, interaction_hunt")
+        print("  experiment_design.py analyze <results_dir> <analysis> [args...]")
+        print("    analyses: main_effects, interactions")
+        sys.exit(1)
+
+    command = sys.argv[1]
+
+    if command == "plan":
+        grid_file = sys.argv[2]
+        design = sys.argv[3] if len(sys.argv) > 3 else "main_effects"
+        grid = load_grid(grid_file)
+
+        if design == "main_effects":
+            cells = main_effects_plan(grid)
+        elif design == "plackett_burman":
+            cells = plackett_burman_plan(grid)
+        elif design == "interaction_hunt":
+            top_axes = sys.argv[4].split(",") if len(sys.argv) > 4 else []
+            if not top_axes:
+                print("ERROR: interaction_hunt requires comma-separated axis names", file=sys.stderr)
+                sys.exit(1)
+            cells = interaction_hunt_plan(grid, top_axes)
+        else:
+            print(f"Unknown design: {design}", file=sys.stderr)
+            sys.exit(1)
+
+        print(f"# {design}: {len(cells)} cells", file=sys.stderr)
+        for cell in cells:
+            print(json.dumps(cell))
+
+    elif command == "analyze":
+        results_dir = sys.argv[2]
+        analysis = sys.argv[3] if len(sys.argv) > 3 else "main_effects"
+
+        if analysis == "main_effects":
+            metric = sys.argv[4] if len(sys.argv) > 4 else "score"
+            effects = analyze_main_effects(results_dir, metric)
+            print(json.dumps(effects, indent=2))
+        elif analysis == "interactions":
+            if len(sys.argv) < 6:
+                print("ERROR: interactions requires two axis names", file=sys.stderr)
+                sys.exit(1)
+            axis_a = sys.argv[4]
+            axis_b = sys.argv[5]
+            metric = sys.argv[6] if len(sys.argv) > 6 else "score"
+            result = analyze_interactions(results_dir, axis_a, axis_b, metric)
+            print(json.dumps(result, indent=2))
+        else:
+            print(f"Unknown analysis: {analysis}", file=sys.stderr)
+            sys.exit(1)
+
+    else:
+        print(f"Unknown command: {command}", file=sys.stderr)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/harness/lib/get-oauth-token.sh b/harness/lib/get-oauth-token.sh
@@ -0,0 +1,13 @@
+#!/usr/bin/env bash
+# Extract OAuth token from Claude Code credentials for use with --bare mode.
+# This lets the harness use your Claude plan while maintaining full isolation.
+
+CREDS_FILE="${CLAUDE_CONFIG_DIR:-$HOME/.claude}/.credentials.json"
+
+if [[ ! -f "$CREDS_FILE" ]]; then
+  echo "ERROR: No credentials file found at $CREDS_FILE" >&2
+  exit 1
+fi
+
+# Extract the OAuth access token
+jq -r '.claudeAiOauth.accessToken // empty' "$CREDS_FILE"
diff --git a/harness/lib/invoke.sh b/harness/lib/invoke.sh
@@ -45,8 +45,19 @@ Use TypeScript."
 Use JavaScript (no TypeScript)."
   fi
 
-  # Build tool list
-  local tools="$base_tools"
+  # Build tool list from individual axes (Bash always on)
+  local tools="Bash"
+  local tool_read tool_write tool_edit tool_glob tool_grep
+  tool_read=$(echo "$cell_json" | jq -r '.tool_read // "on"')
+  tool_write=$(echo "$cell_json" | jq -r '.tool_write // "on"')
+  tool_edit=$(echo "$cell_json" | jq -r '.tool_edit // "on"')
+  tool_glob=$(echo "$cell_json" | jq -r '.tool_glob // "on"')
+  tool_grep=$(echo "$cell_json" | jq -r '.tool_grep // "on"')
+  [[ "$tool_read" == "on" ]] && tools="$tools,Read"
+  [[ "$tool_write" == "on" ]] && tools="$tools,Write"
+  [[ "$tool_edit" == "on" ]] && tools="$tools,Edit"
+  [[ "$tool_glob" == "on" ]] && tools="$tools,Glob"
+  [[ "$tool_grep" == "on" ]] && tools="$tools,Grep"
   if [[ "$sub_agents" == "on" ]]; then
     tools="$tools,Agent"
   fi
@@ -55,15 +66,22 @@ Use JavaScript (no TypeScript)."
   fi
 
   # Build the claude command
+  # --bare for full isolation (no CLAUDE.md, hooks, MCP, memory).
+  # Auth via apiKeyHelper that reads OAuth token from ~/.claude/.credentials.json.
+  local auth_helper
+  auth_helper="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/get-oauth-token.sh"
+
   local cmd=(
     claude
     --bare
     -p "$prompt"
     --model "$model"
     --output-format stream-json
-    --dangerously-skip-permissions
+    --verbose
+    --permission-mode dontAsk
     --max-budget-usd "$budget"
     --allowedTools "$tools"
+    --settings "{\"apiKeyHelper\": \"$auth_helper\"}"
   )
 
   # Add effort level
diff --git a/harness/run.py b/harness/run.py
@@ -0,0 +1,424 @@
+#!/usr/bin/env python3
+"""Loop Benchmarking Harness - Main orchestrator.
+
+Computes the experiment grid, creates isolated workspaces, invokes claude,
+runs evaluation, and stores results.
+
+Usage:
+    python3 run.py [grid_file] [profile_or_design]
+
+    profile_or_design can be:
+      - A profile name from grid.yaml (e.g., smoke, core, full)
+      - A DOE design: main_effects, plackett_burman
+      - interaction_hunt:axis1,axis2,axis3
+"""
+
+import json
+import os
+import shutil
+import subprocess
+import sys
+import tarfile
+import tempfile
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+PROJECT_DIR = SCRIPT_DIR.parent
+sys.path.insert(0, str(SCRIPT_DIR / "lib"))
+
+from compute_grid import load_grid, compute_cells
+from experiment_design import (
+    main_effects_plan,
+    plackett_burman_plan,
+    interaction_hunt_plan,
+    analyze_main_effects,
+)
+
+
+def create_workspace(project_dir: Path, task: str, cell: dict) -> Path:
+    """Create an isolated temp directory with appropriate setup."""
+    workspace = Path(tempfile.mkdtemp(prefix="loop-bench-"))
+
+    language = cell.get("language", "typescript")
+    linter = cell.get("linter", "off")
+    playwright = cell.get("playwright", "off")
+
+    # npm init
+    subprocess.run(["npm", "init", "-y"], cwd=workspace, capture_output=True)
+
+    # TypeScript
+    if language == "typescript":
+        subprocess.run(
+            ["npm", "install", "--save-dev", "typescript", "@types/node"],
+            cwd=workspace, capture_output=True,
+        )
+
+    # Linter
+    if linter == "on":
+        subprocess.run(
+            ["npm", "install", "--save-dev", "eslint", "@eslint/js"],
+            cwd=workspace, capture_output=True,
+        )
+
+    # Playwright
+    if playwright == "on":
+        subprocess.run(
+            ["npm", "install", "--save-dev", "@playwright/test"],
+            cwd=workspace, capture_output=True,
+        )
+        subprocess.run(
+            ["npx", "playwright", "install", "chromium", "--with-deps"],
+            cwd=workspace, capture_output=True,
+        )
+
+    # Copy fixtures
+    fixtures_dir = project_dir / "tasks" / task / "fixtures"
+    if fixtures_dir.is_dir():
+        for item in fixtures_dir.iterdir():
+            dest = workspace / item.name
+            if item.is_dir():
+                shutil.copytree(item, dest)
+            else:
+                shutil.copy2(item, dest)
+
+    return workspace
+
+
+def build_prompt(project_dir: Path, cell: dict) -> str:
+    """Read the prompt file and append language instruction."""
+    task = cell["task"]
+    style = cell["prompt_style"]
+    lang_code = cell["human_language"]
+
+    prompt_file = project_dir / "tasks" / task / "prompts" / f"{style}.{lang_code}.md"
+    prompt = prompt_file.read_text()
+
+    language = cell.get("language", "typescript")
+    if language == "typescript":
+        prompt += "\n\nUse TypeScript."
+    elif language == "javascript":
+        prompt += "\n\nUse JavaScript (no TypeScript)."
+
+    return prompt
+
+
+def invoke_claude(cell: dict, workspace: Path, run_dir: Path, project_dir: Path) -> int:
+    """Invoke claude CLI and capture output."""
+    prompt = build_prompt(project_dir, cell)
+    model = cell["model"]
+    effort = cell.get("effort", "high")
+    budget = cell.get("max_budget_usd", 0.50)
+    timeout = cell.get("timeout_seconds", 600)
+    # Build tool list from individual tool axes
+    # Bash is always available - it's the agent's escape hatch
+    tools_list = ["Bash"]
+    if cell.get("tool_read", "on") == "on":
+        tools_list.append("Read")
+    if cell.get("tool_write", "on") == "on":
+        tools_list.append("Write")
+    if cell.get("tool_edit", "on") == "on":
+        tools_list.append("Edit")
+    if cell.get("tool_glob", "on") == "on":
+        tools_list.append("Glob")
+    if cell.get("tool_grep", "on") == "on":
+        tools_list.append("Grep")
+    if cell.get("sub_agents") == "on":
+        tools_list.append("Agent")
+    if cell.get("web_search") == "on":
+        tools_list.extend(["WebSearch", "WebFetch"])
+    tools = ",".join(tools_list)
+
+    # Auth helper for --bare mode
+    auth_helper = str(SCRIPT_DIR / "lib" / "get-oauth-token.sh")
+
+    cmd = [
+        "claude",
+        "--bare",
+        "-p", prompt,
+        "--model", model,
+        "--output-format", "stream-json",
+        "--verbose",
+        "--permission-mode", "dontAsk",
+        "--max-budget-usd", str(budget),
+        "--allowedTools", tools,
+        "--settings", json.dumps({"apiKeyHelper": auth_helper}),
+    ]
+
+    if effort:
+        cmd.extend(["--effort", effort])
+
+    # Context file
+    if cell.get("context_file") == "provided":
+        ctx_file = project_dir / "tasks" / cell["task"] / "context.md"
+        if ctx_file.exists():
+            cmd.extend(["--append-system-prompt", ctx_file.read_text()])
+
+    # Run claude
+    transcript_path = run_dir / "transcript.jsonl"
+    stderr_path = run_dir / "claude_stderr.log"
+
+    with open(transcript_path, "w") as transcript_f, open(stderr_path, "w") as stderr_f:
+        try:
+            result = subprocess.run(
+                cmd,
+                cwd=workspace,
+                stdout=transcript_f,
+                stderr=stderr_f,
+                timeout=timeout,
+            )
+            exit_code = result.returncode
+        except subprocess.TimeoutExpired:
+            exit_code = 124  # Same as timeout(1) convention
+
+    # Extract final result line
+    output_path = run_dir / "claude_output.json"
+    try:
+        # splitlines() returns [] for an empty transcript; "".split("\n") returns [""]
+        lines = transcript_path.read_text().strip().splitlines()
+        output_path.write_text(lines[-1] if lines else "{}")
+    except Exception:
+        output_path.write_text("{}")
+
+    return exit_code
+
+
+def run_eval_script(script: Path, workspace: Path, language: str) -> str:
+    """Run a bash eval script and return its stdout."""
+    try:
+        result = subprocess.run(
+            ["bash", str(script), str(workspace), language],
+            capture_output=True, text=True, timeout=120,
+        )
+        return result.stdout.strip()
+    except Exception as e:
+        return json.dumps({"pass": False, "error": str(e)})
+
+
+def safe_parse_json(text: str, fallback_key: str = "error") -> dict:
+    """Parse JSON, returning a failing result dict if parsing fails."""
+    if not text:
+        return {"pass": False, fallback_key: "no output"}
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        return {"pass": False, fallback_key: text[:500]}
+
+
+def evaluate(task_dir: Path, workspace: Path, cell: dict, run_dir: Path):
+    """Run all evaluation scripts and write eval_results.json."""
+    language = cell.get("language", "typescript")
+
+    results = {
+        "structural": None,
+        "functional": None,
+        "quality": None,
+        "score": None,
+    }
+
+    # Structural
+    structural_sh = task_dir / "eval" / "structural.sh"
+    if structural_sh.exists():
+        output = run_eval_script(structural_sh, workspace, language)
+        results["structural"] = safe_parse_json(output)
+
+    # Functional
+    tests_dir = task_dir / "eval" / "tests"
+    if tests_dir.is_dir():
+        # Check for different test types
+        if (tests_dir / "functional.sh").exists():
+            output = run_eval_script(tests_dir / "functional.sh", workspace, language)
+            results["functional"] = safe_parse_json(output)
+        elif (tests_dir / "functional.spec.ts").exists():
+            # Playwright tests need server setup; not wired yet, so record a failure
+            results["functional"] = {"pass": False, "error": "playwright eval not yet wired", "score": 0}
+        elif (tests_dir / "functional.test.ts").exists():
+            # vitest tests need server setup; not wired yet, so record a failure
+            results["functional"] = {"pass": False, "error": "vitest eval not yet wired", "score": 0}
+
+    # Quality
+    quality_sh = task_dir / "eval" / "quality.sh"
+    if quality_sh.exists():
+        output = run_eval_script(quality_sh, workspace, language)
+        results["quality"] = safe_parse_json(output)
+
+    # Compute weighted score
+    try:
+        scoring_file = task_dir / "scoring.yaml"
+        if scoring_file.exists():
+            import yaml
+            scoring = yaml.safe_load(scoring_file.read_text())
+            weights = scoring.get("weights", {})
+
+            score = 0.0
+            for category, weight in weights.items():
+                cat_data = results.get(category)
+                if cat_data and isinstance(cat_data.get("score"), (int, float)):
+                    score += cat_data["score"] * weight
+
+            results["score"] = round(score, 4)
+    except Exception:
+        pass
+
+    (run_dir / "eval_results.json").write_text(json.dumps(results, indent=2))
+
+
+def archive_workspace(workspace: Path, run_dir: Path):
+    """Archive and delete the workspace."""
+    archive_path = run_dir / "workspace.tar.gz"
+    try:
+        with tarfile.open(archive_path, "w:gz") as tar:
+            tar.add(workspace, arcname=workspace.name,
+                     filter=lambda t: None if "node_modules" in t.name else t)
+    except Exception:
+        # Archiving is best-effort: ignore failures and continue to cleanup
+        pass
+
+    try:
+        shutil.rmtree(workspace)
+    except Exception:
+        pass
+
+
+def main():
+    grid_file = sys.argv[1] if len(sys.argv) > 1 else str(PROJECT_DIR / "grid.yaml")
+    profile = sys.argv[2] if len(sys.argv) > 2 else "smoke"
+    results_dir = PROJECT_DIR / "results"
+    results_dir.mkdir(exist_ok=True)
+    (results_dir / "runs").mkdir(exist_ok=True)
+
+    # Preflight
+    if shutil.which("claude") is None:
+        print("ERROR: claude CLI not found in PATH.")
+        sys.exit(1)
+
+    print("=" * 40)
+    print("Loop Benchmarking Harness")
+    print("=" * 40)
+    print(f"Grid file:  {grid_file}")
+    print(f"Profile:    {profile}")
+    print(f"Results:    {results_dir}")
+    print("=" * 40)
+
+    grid = load_grid(grid_file)
+
+    # Determine cell generation strategy
+    if profile == "main_effects":
+        cells = main_effects_plan(grid)
+        print("Design:     main effects sweep")
+    elif profile == "plackett_burman":
+        cells = plackett_burman_plan(grid)
+        print("Design:     Plackett-Burman screening")
+    elif profile.startswith("interaction_hunt:"):
+        top_axes = profile.split(":", 1)[1].split(",")
+        cells = interaction_hunt_plan(grid, top_axes)
+        print(f"Design:     interaction hunt on {top_axes}")
+    else:
+        cells = compute_cells(grid, profile)
+        print(f"Design:     grid profile '{profile}'")
+
+    print(f"Grid cells: {len(cells)}")
+    print()
+
+    completed = 0
+    skipped = 0
+    failed = 0
+
+    for cell in cells:
+        task = cell["task"]
+        cell_id = cell["cell_id"]
+        runs_per_cell = cell.get("runs_per_cell", 3)
+        model = cell["model"]
+        prompt_style = cell["prompt_style"]
+
+        for run_num in range(1, runs_per_cell + 1):
+            run_id = f"{cell_id}_run{run_num}"
+            run_dir = results_dir / "runs" / run_id
+
+            # Resume support
+            if (run_dir / "eval_results.json").exists():
+                print(f"SKIP: {run_id}")
+                skipped += 1
+                continue
+
+            print("-" * 40)
+            print(f"RUN:  {run_id}")
+            print(f"Task: {task} | Model: {model} | Prompt: {prompt_style}")
+            print("-" * 40)
+
+            run_dir.mkdir(parents=True, exist_ok=True)
+
+            # Save meta
+            meta = {
+                **cell,
+                "run_id": run_id,
+                "run_number": run_num,
+                "started_at": datetime.now(timezone.utc).isoformat(),
+            }
+            (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
+
+            # Create workspace
+            print("  Creating workspace...")
+            try:
+                workspace = create_workspace(PROJECT_DIR, task, cell)
+                print(f"  Workspace: {workspace}")
+            except Exception as e:
+                print(f"  ERROR creating workspace: {e}")
+                failed += 1
+                continue
+
+            # Invoke claude
+            print(f"  Invoking claude (model={model})...")
+            start_time = time.time()
+            exit_code = invoke_claude(cell, workspace, run_dir, PROJECT_DIR)
+            wall_time = int(time.time() - start_time)
+
+            if exit_code == 0:
+                print("  Claude completed successfully")
+            else:
+                print(f"  Claude exited with error (exit code: {exit_code})")
+
+            # Update meta with timing
+            meta["wall_time_seconds"] = wall_time
+            meta["exit_code"] = exit_code
+            meta["completed_at"] = datetime.now(timezone.utc).isoformat()
+            (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
+
+            # Evaluate
+            print("  Running evaluation...")
+            task_dir = PROJECT_DIR / "tasks" / task
+            evaluate(task_dir, workspace, cell, run_dir)
+            print("  Evaluation complete")
+
+            # Append to index
+            index_entry = {
+                "run_id": run_id,
+                "task": task,
+                "model": model,
+                "cell_id": cell_id,
+                "completed_at": meta["completed_at"],
+            }
+            with open(results_dir / "index.jsonl", "a") as f:
+                f.write(json.dumps(index_entry) + "\n")
+
+            # Archive and cleanup
+            print("  Archiving workspace...")
+            archive_workspace(workspace, run_dir)
+
+            if (run_dir / "eval_results.json").exists():
+                completed += 1
+            else:
+                failed += 1
+
+            print(f"  Done. ({completed} completed, {skipped} skipped, {failed} failed)")
+            print()
+
+    print("=" * 40)
+    print("All runs complete.")
+    print(f"Completed: {completed} | Skipped: {skipped} | Failed: {failed}")
+    print("=" * 40)
+
+
+if __name__ == "__main__":
+    main()
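
The weighted-score step in `evaluate()` above can be read in isolation. The following is a minimal sketch of that step, not the harness code itself: the function name `weighted_score` and the weights and results shown are hypothetical, chosen only for illustration. It shows that a category with no numeric `score` contributes nothing to the total.

```python
def weighted_score(results: dict, weights: dict) -> float:
    """Sum score * weight over the categories that report a numeric score."""
    total = 0.0
    for category, weight in weights.items():
        cat_data = results.get(category)
        # Categories that are missing, None, or lack a numeric "score" add 0.
        if isinstance(cat_data, dict) and isinstance(cat_data.get("score"), (int, float)):
            total += cat_data["score"] * weight
    return round(total, 4)


# Hypothetical weights and results, for illustration only.
weights = {"structural": 0.2, "functional": 0.5, "quality": 0.3}
results = {
    "structural": {"pass": True, "score": 1.0},
    "functional": {"pass": False, "error": "no output"},  # no "score" key
    "quality": None,  # eval script absent
}

print(weighted_score(results, weights))  # prints 0.2
```

Because the weights are not renormalized, a category that fails to produce a score lowers the total in the same way a score of 0 would.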
diff --git a/harness/run.sh b/harness/run.sh
@@ -1,5 +1,7 @@
 #!/usr/bin/env bash
-set -euo pipefail
+set -uo pipefail
+# Note: no set -e. The main loop handles errors per-run so one failure
+# doesn't kill the entire harness. Critical setup errors still exit explicitly.
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
@@ -17,6 +19,12 @@ GRID_FILE="${1:-$PROJECT_DIR/grid.yaml}"
 PROFILE="${2:-smoke}"
 RESULTS_DIR="$PROJECT_DIR/results"
 
+# Preflight: verify the claude CLI is available
+if ! command -v claude > /dev/null 2>&1; then
+  echo "ERROR: claude CLI not found in PATH."
+  exit 1
+fi
+
 echo "========================================"
 echo "Loop Benchmarking Harness"
 echo "========================================"
@@ -36,7 +44,7 @@ completed=0
 skipped=0
 failed=0
 
-echo "$cells" | while IFS= read -r cell_json; do
+while IFS= read -r cell_json; do
   task=$(echo "$cell_json" | jq -r '.task')
   cell_id=$(echo "$cell_json" | jq -r '.cell_id')
   runs_per_cell=$(echo "$cell_json" | jq -r '.runs_per_cell')
@@ -58,53 +66,61 @@ echo "$cells" | while IFS= read -r cell_json; do
     echo "Task: $task | Model: $model | Prompt: $prompt_style"
     echo "----------------------------------------"
 
-    # Create run results directory
+    # Run everything in a subshell so cd's don't affect the main loop
+    (
+      # Create run results directory
+      run_dir="$RESULTS_DIR/runs/$run_id"
+      mkdir -p "$run_dir"
+
+      # Save cell config as meta.json
+      echo "$cell_json" | jq --arg run_id "$run_id" --argjson run_num "$run_num" \
+        '. + {run_id: $run_id, run_number: $run_num, started_at: (now | todate)}' \
+        > "$run_dir/meta.json"
+
+      # Create isolated workspace
+      echo "  Creating workspace..."
+      workspace=$(create_workspace "$PROJECT_DIR" "$task" "$cell_json")
+      echo "  Workspace: $workspace"
+
+      # Invoke claude
+      echo "  Invoking claude (model=$model)..."
+      if invoke_claude "$cell_json" "$workspace" "$run_dir" "$PROJECT_DIR"; then
+        echo "  Claude completed successfully"
+      else
+        echo "  Claude exited with error (exit code: $?)"
+      fi
+
+      # Run evaluation
+      echo "  Running evaluation..."
+      task_dir="$PROJECT_DIR/tasks/$task"
+      evaluate "$task_dir" "$workspace" "$cell_json" "$run_dir"
+      echo "  Evaluation complete"
+
+      # Append to run index
+      jq -c '{
+        run_id: .run_id,
+        task: .task,
+        model: .model,
+        cell_id: .cell_id,
+        completed_at: .completed_at
+      }' "$run_dir/meta.json" >> "$RESULTS_DIR/index.jsonl"
+
+      # Archive and cleanup workspace
+      echo "  Archiving workspace..."
+      cleanup_workspace "$workspace" "$run_dir"
+    ) || true
+
+    # Count results (outside subshell)
     run_dir="$RESULTS_DIR/runs/$run_id"
-    mkdir -p "$run_dir"
-
-    # Save cell config as meta.json
-    echo "$cell_json" | jq --arg run_id "$run_id" --argjson run_num "$run_num" \
-      '. + {run_id: $run_id, run_number: $run_num, started_at: (now | todate)}' \
-      > "$run_dir/meta.json"
-
-    # Create isolated workspace
-    echo "  Creating workspace..."
-    workspace=$(create_workspace "$PROJECT_DIR" "$task" "$cell_json")
-    echo "  Workspace: $workspace"
-
-    # Invoke claude
-    echo "  Invoking claude (model=$model)..."
-    if invoke_claude "$cell_json" "$workspace" "$run_dir" "$PROJECT_DIR"; then
-      echo "  Claude completed successfully"
+    if [[ -f "$run_dir/eval_results.json" ]]; then
+      completed=$((completed + 1))
     else
-      echo "  Claude exited with error (exit code: $?)"
       failed=$((failed + 1))
     fi
-
-    # Run evaluation
-    echo "  Running evaluation..."
-    task_dir="$PROJECT_DIR/tasks/$task"
-    evaluate "$task_dir" "$workspace" "$cell_json" "$run_dir"
-    echo "  Evaluation complete"
-
-    # Append to run index
-    jq -c '{
-      run_id: .run_id,
-      task: .task,
-      model: .model,
-      cell_id: .cell_id,
-      completed_at: .completed_at
-    }' "$run_dir/meta.json" >> "$RESULTS_DIR/index.jsonl"
-
-    # Archive and cleanup workspace
-    echo "  Archiving workspace..."
-    cleanup_workspace "$workspace" "$run_dir"
-
-    completed=$((completed + 1))
     echo "  Done. ($completed completed, $skipped skipped, $failed failed)"
     echo ""
   done
-done
+done <<< "$cells"
 
 echo "========================================"
 echo "All runs complete."
diff --git a/results/index.jsonl b/results/index.jsonl
@@ -0,0 +1,6 @@
+{"run_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1","task":"tetris","model":"haiku","cell_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:34:51Z"}
+{"run_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1","task":"tetris","model":"haiku","cell_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:39:20Z"}
+{"run_id":"bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1","task":"bookmarks-api","model":"haiku","cell_id":"bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:48:50Z"}
+{"run_id": "bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1", "task": "bookmarks-api", "model": "haiku", "cell_id": "bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:14:47.247624+00:00"}
+{"run_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1", "task": "data-pipeline", "model": "haiku", "cell_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:19:53.900594+00:00"}
+{"run_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1", "task": "data-pipeline", "model": "haiku", "cell_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:21:33.913511+00:00"}
diff --git a/tasks/bookmarks-api/eval/quality.sh b/tasks/bookmarks-api/eval/quality.sh
@@ -97,7 +97,7 @@ for key in lint typecheck security_passwords security_jwt_secret security_sql; d
 done
 
 if [[ $score_count -gt 0 ]]; then
-  score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
 else
   score="0"
 fi
diff --git a/tasks/bookmarks-api/eval/structural.sh b/tasks/bookmarks-api/eval/structural.sh
@@ -100,7 +100,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
 checks_json="[${checks_json%,}]"
 
 if [[ $total_count -gt 0 ]]; then
-  score=$(echo "scale=2; $pass_count / $total_count" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
   score="0"
 fi
diff --git a/tasks/data-pipeline/eval/quality.sh b/tasks/data-pipeline/eval/quality.sh
@@ -126,7 +126,7 @@ for key in lint typecheck no_float_currency handles_empty_input handles_malforme
 done
 
 if [[ $score_count -gt 0 ]]; then
-  score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
 else
   score="0"
 fi
diff --git a/tasks/data-pipeline/eval/structural.sh b/tasks/data-pipeline/eval/structural.sh
@@ -91,7 +91,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
 checks_json="[${checks_json%,}]"
 
 if [[ $total_count -gt 0 ]]; then
-  score=$(echo "scale=2; $pass_count / $total_count" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
   score="0"
 fi
diff --git a/tasks/data-pipeline/eval/tests/functional.sh b/tasks/data-pipeline/eval/tests/functional.sh
@@ -73,7 +73,7 @@ if ! echo "$actual_output" | jq . > /dev/null 2>&1; then
   add_check "valid_json" "false" "output is not valid JSON"
   checks_json=$(printf '%s,' "${checks[@]}")
   checks_json="[${checks_json%,}]"
-  echo "{\"pass\": false, \"checks\": $checks_json, \"score\": $(echo "scale=2; $pass_count / $total_count" | bc)}"
+  echo "{\"pass\": false, \"checks\": $checks_json, \"score\": $(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")}"
   exit 0
 fi
 
@@ -166,7 +166,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
 checks_json="[${checks_json%,}]"
 
 if [[ $total_count -gt 0 ]]; then
-  score=$(echo "scale=2; $pass_count / $total_count" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
   score="0"
 fi
diff --git a/tasks/tetris/eval/quality.sh b/tasks/tetris/eval/quality.sh
@@ -13,10 +13,8 @@ results='{}'
 # --- Lint check ---
 cd "$WORKSPACE"
 if command -v npx > /dev/null 2>&1; then
-  # Install eslint if not present
   npm install --save-dev eslint @eslint/js > /dev/null 2>&1
 
-  # Find source files
   if [[ "$LANGUAGE" == "typescript" ]]; then
     extensions="ts,tsx"
   else
@@ -29,12 +27,15 @@ if command -v npx > /dev/null 2>&1; then
   if echo "$lint_output" | jq . > /dev/null 2>&1; then
     errors=$(echo "$lint_output" | jq '[.[].errorCount] | add // 0')
     warnings=$(echo "$lint_output" | jq '[.[].warningCount] | add // 0')
-    lint_pass="true"
+    errors=${errors:-0}
+    warnings=${warnings:-0}
     if [[ "$errors" -gt 0 ]]; then
-      lint_pass="false"
+      results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" \
+        '. + {lint: {pass: false, errors: $e, warnings: $w}}')
+    else
+      results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" \
+        '. + {lint: {pass: true, errors: $e, warnings: $w}}')
     fi
-    results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" --argjson p "$lint_pass" \
-      '. + {lint: {pass: ($p == true), errors: $e, warnings: $w}}')
   else
     results=$(echo "$results" | jq '. + {lint: {pass: false, errors: -1, warnings: 0, error: "eslint failed to run"}}')
   fi
@@ -49,7 +50,8 @@ if [[ "$LANGUAGE" == "typescript" ]]; then
     if npx tsc --noEmit > /dev/null 2>&1; then
       results=$(echo "$results" | jq '. + {typecheck: {pass: true}}')
     else
-      type_errors=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || echo "0")
+      type_errors=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || true)
+      type_errors=${type_errors:-0}
       results=$(echo "$results" | jq --argjson e "$type_errors" '. + {typecheck: {pass: false, errors: $e}}')
     fi
   else
@@ -60,22 +62,23 @@ else
 fi
 
 # --- File size check ---
-# Find the main HTML file and measure total size
 total_size=0
 if [[ -d "$WORKSPACE/dist" ]]; then
   total_size=$(du -sb "$WORKSPACE/dist" 2>/dev/null | awk '{print $1}')
 elif [[ -f "$WORKSPACE/index.html" ]]; then
   total_size=$(du -sb "$WORKSPACE" --exclude=node_modules --exclude=.git 2>/dev/null | awk '{print $1}')
 fi
-size_pass="true"
-if [[ "$total_size" -gt 524288 ]]; then  # 512KB
-  size_pass="false"
+total_size=${total_size:-0}
+
+if [[ "$total_size" -gt 524288 ]]; then
+  results=$(echo "$results" | jq --argjson s "$total_size" \
+    '. + {performance: {bundle_size_bytes: $s, size_under_512kb: false}}')
+else
+  results=$(echo "$results" | jq --argjson s "$total_size" \
+    '. + {performance: {bundle_size_bytes: $s, size_under_512kb: true}}')
 fi
-results=$(echo "$results" | jq --argjson s "$total_size" --argjson p "$size_pass" \
-  '. + {performance: {bundle_size_bytes: $s, size_under_512kb: ($p == true)}}')
 
 # --- Compute aggregate quality score ---
-# Each check contributes equally
 score_sum=0
 score_count=0
 
@@ -88,7 +91,7 @@ for key in lint typecheck performance; do
 done
 
 if [[ $score_count -gt 0 ]]; then
-  score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
 else
   score="0"
 fi
diff --git a/tasks/tetris/eval/structural.sh b/tasks/tetris/eval/structural.sh
@@ -80,7 +80,7 @@ checks_json="[${checks_json%,}]"
 
 # Compute score
 if [[ $total_count -gt 0 ]]; then
-  score=$(echo "scale=2; $pass_count / $total_count" | bc)
+  score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
   score="0"
 fi

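The eval scripts above interpolate the computed score straight into a JSON string. `bc` prints fractions below 1 without a leading zero (for example `.50`), which is not a valid JSON number, whereas `printf "%.2f"` always emits one. The commit message does not state this as the reason for the switch to `awk`, so treat that link as an assumption. A minimal Python sketch of the difference, using a hypothetical helper `parse_score`:

```python
import json


def parse_score(raw: str):
    """Embed a raw score string in a JSON object, as the eval scripts do, and parse it.

    Returns the parsed number, or None when the result is not valid JSON.
    """
    try:
        return json.loads('{"score": %s}' % raw)["score"]
    except json.JSONDecodeError:
        return None


bc_style = ".50"               # what `echo "scale=2; 1/2" | bc` prints
awk_style = "%.2f" % (1 / 2)   # same format string as the awk printf, gives "0.50"

print(parse_score(bc_style))   # prints None
print(parse_score(awk_style))  # prints 0.5
print(parse_score(""))         # prints None (empty output is also invalid)
```

The two tools also differ in rounding: `scale=2` in `bc` truncates, while `%.2f` rounds, so scores can differ in the last digit.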