commit f188f40361a8a1dc7600e2c625ff045d29da3d2b
parent 147931383ebcb9584b54dd141a05bac520b2c3b5
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Fri, 3 Apr 2026 19:12:22 +0200
Fix harness bugs, add DOE experiment design, insights dashboard

Harness fixes:
- Replace --dangerously-skip-permissions with --permission-mode dontAsk
- Add --verbose (required for stream-json output)
- Use OAuth token extraction for --bare mode (get-oauth-token.sh)
- Rewrite orchestrator in Python (run.py) to avoid bash subshell issues
- Fix eval scripts: replace bc with awk, handle empty/invalid JSON

New features:
- 5 individual tool axes (tool_read/write/edit/glob/grep) replace base_tools
- DOE experiment design module (main effects sweep, Plackett-Burman
  screening, interaction hunt) for efficient grid exploration
- Analysis functions to compute effect sizes and interactions from results
- Insights dashboard page with tornado charts and interaction heatmaps
- Metric switcher (score, cost, turns, wall time)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat:
25 files changed, 1936 insertions(+), 90 deletions(-)
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,4 @@ dist/
.astro/
results/runs/
*.tar.gz
+__pycache__/
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -61,7 +61,7 @@ Static site showing:
## Tech
-- **Harness runner**: Bash script that orchestrates experiment runs. Claude Code cannot invoke itself, so the harness must be external. The script reads YAML definitions, computes the grid, and invokes `claude` CLI for each cell.
+- **Harness runner**: Python script (`harness/run.py`) that orchestrates experiment runs. Claude Code cannot invoke itself, so the harness must be external. The script reads YAML definitions, computes the grid, and invokes the `claude` CLI for each cell. It uses `--permission-mode dontAsk` with `--allowedTools` for non-interactive runs (not `--dangerously-skip-permissions`, which is blocked when running as root).
- **Model support**: Primarily Anthropic models (Haiku, Sonnet, Opus). Non-Anthropic models possible via LiteLLM proxy in front of Ollama or similar, but expect reduced feature support (extended thinking, tool use may not work). This is valid benchmark data.
- Results stored as YAML/JSON (append-only, never overwritten)
- Each run gets a unique ID
diff --git a/README.md b/README.md
@@ -1,5 +1,124 @@
# Loop Benchmarking
-Agentic loop configuration benchmark for Ship the Loop.
+An open benchmark for comparing agentic coding loop configurations. Same task, different setups, all data public.
-Status: bootstrapping. See CLAUDE.md for full context.
+## What this does
+
+Define the variables that make up a coding loop (model, tools, prompt style, etc.), and the system generates every combination of their values. Each combination is run against a set of tasks in a clean-room environment with deterministic evaluation. No LLM grading.
+
+## Quick start
+
+### Prerequisites
+
+- Node.js 22+
+- Python 3.12+ with PyYAML
+- Claude Code CLI (authenticated via `claude login`)
+
+### Running experiments
+
+```bash
+# 1. Screen: which variables matter? (~53 cells, vary one axis at a time)
+python3 harness/run.py grid.yaml main_effects
+
+# 2. Analyze: rank variables by effect size
+python3 harness/lib/experiment_design.py analyze results main_effects score
+
+# 3. Deep dive: full factorial on the top variables that matter
+python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write"
+
+# 4. Check for interactions between variables
+python3 harness/lib/experiment_design.py analyze results interactions model effort score
+```
+
+### Other run modes
+
+```bash
+# Profile-based (predefined subsets of the grid)
+python3 harness/run.py grid.yaml smoke # 6 cells, 1 run each
+python3 harness/run.py grid.yaml core # 30 cells, 3 runs each
+python3 harness/run.py grid.yaml full # 204,800 cells (don't)
+
+# Plackett-Burman screening (efficient multi-factor screening)
+python3 harness/run.py grid.yaml plackett_burman
+```
+
+### Building the dashboard
+
+```bash
+cd dashboard
+npm install
+npm run build # Static site in dashboard/dist/
+npm run dev # Dev server for local preview
+```
+
+## Project structure
+
+```
+grid.yaml # Experiment grid: axes, values, exclusions, profiles
+harness/
+ run.py # Main orchestrator (Python)
+ lib/
+ compute_grid.py # Cartesian product + exclusions
+ experiment_design.py # DOE plans + analysis (main effects, PB, interactions)
+ get-oauth-token.sh # Extracts OAuth token for --bare mode
+ invoke.sh # Claude CLI invocation (bash, used by run.sh)
+ evaluate.sh # Evaluation dispatch (bash, used by run.sh)
+ workspace.sh # Workspace creation (bash, used by run.sh)
+tasks/
+ tetris/ # Agent-friendly: build a game
+ bookmarks-api/ # Medium: REST API with auth
+ data-pipeline/ # Hard: CSV processing with edge cases
+ Each task has:
+ prompts/ # simple/detailed x en/es
+ eval/ # Deterministic test suites the agent never sees
+ context.md # Rules file (used when context_file=provided)
+ scoring.yaml # Category weights
+results/
+ runs/{run_id}/ # One directory per experiment run
+ meta.json # Config, timing, exit code
+ transcript.jsonl # Full conversation (every tool call and response)
+ claude_output.json # Summary metrics (cost, turns, tokens)
+ eval_results.json # Structural, functional, quality scores
+ workspace.tar.gz # Archived agent output
+dashboard/ # Astro + React static site
+ Grid overview, insights (tornado charts, heatmaps), run detail with transcript viewer
+```
+
+## Configuration dimensions (16 axes)
+
+| Axis | Values |
+|---|---|
+| model | haiku, sonnet, opus |
+| effort | high, max (extended thinking) |
+| prompt_style | simple, detailed |
+| language | typescript, javascript |
+| human_language | en, es |
+| tool_read | on, off |
+| tool_write | on, off |
+| tool_edit | on, off |
+| tool_glob | on, off |
+| tool_grep | on, off |
+| linter | on, off |
+| playwright | on, off |
+| context_file | none, provided |
+| sub_agents | on, off |
+| web_search | on, off |
+| max_budget | low ($0.50), high ($5.00) |
+
+## Evaluation
+
+All scoring is deterministic code. The agent never sees the test suite.
+
+- **Structural**: Does it build? Do expected files exist?
+- **Functional**: Pre-written test suites (Playwright, vitest, golden file diff)
+- **Quality**: Lint, type check, accessibility, security, performance
+
+## Experiment design
+
+Instead of running the full 204,800-cell grid, use statistical designs:
+
+- **Main effects sweep**: Vary one axis at a time from a baseline. Identifies which variables matter.
+- **Plackett-Burman**: Screening design that tests many binary factors efficiently.
+- **Interaction hunt**: Full factorial on a small subset of axes to find interactions.
+
+The dashboard's Insights page visualizes main effects as tornado charts and interactions as heatmaps.
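
A rough sketch of how the cell counts for the three designs follow from the axis table above. It is not code from this commit: the level counts are copied from the README table, the function names are made up for illustration, and task overrides and exclusions in `grid.yaml` are ignored, so the real harness may produce slightly different numbers.

```python
from math import ceil, prod

# Level counts per axis, copied from the README table (16 axes).
AXES = {
    "model": 3, "effort": 2, "prompt_style": 2, "language": 2,
    "human_language": 2, "tool_read": 2, "tool_write": 2, "tool_edit": 2,
    "tool_glob": 2, "tool_grep": 2, "linter": 2, "playwright": 2,
    "context_file": 2, "sub_agents": 2, "web_search": 2, "max_budget": 2,
}


def main_effects_cells(axes):
    """One baseline cell plus one cell per non-baseline level of each axis."""
    return 1 + sum(levels - 1 for levels in axes.values())


def interaction_hunt_cells(axes, chosen):
    """Full factorial over the chosen axes; every other axis stays at baseline."""
    return prod(axes[name] for name in chosen)


def plackett_burman_cells(axes):
    """PB runs for the binary axes, repeated for each multi-level combination."""
    n_binary = sum(1 for levels in axes.values() if levels == 2)
    multi = [levels for levels in axes.values() if levels > 2]
    runs = ceil((n_binary + 1) / 4) * 4  # smallest multiple of 4 that fits
    return runs * prod(multi)


print(main_effects_cells(AXES))                                         # 18
print(interaction_hunt_cells(AXES, ["model", "effort", "tool_write"]))  # 12
print(plackett_burman_cells(AXES))                                      # 48
```

These are per-task counts. With the three tasks listed in the project structure, the one-at-a-time sweep comes to 54 cells, close to the ~53 quoted in the quick start.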
diff --git a/dashboard/src/components/Heatmap.tsx b/dashboard/src/components/Heatmap.tsx
@@ -0,0 +1,164 @@
+import type { InteractionResult } from "../lib/analysis";
+
+interface HeatmapProps {
+ data: InteractionResult;
+ metric: string;
+}
+
+export default function Heatmap({ data, metric }: HeatmapProps) {
+ const { axisA, axisB, table } = data;
+
+ const aValues = Object.keys(table).sort();
+ const bValues = Array.from(
+ new Set(aValues.flatMap((a) => Object.keys(table[a])))
+ ).sort();
+
+ if (aValues.length === 0 || bValues.length === 0) {
+ return (
+ <div
+ className="card"
+ style={{
+ textAlign: "center",
+ padding: "40px",
+ color: "var(--text-muted)",
+ }}
+ >
+ Not enough data for this interaction.
+ </div>
+ );
+ }
+
+ // Find min/max for color scale
+ const allMeans = aValues.flatMap((a) =>
+ bValues.filter((b) => table[a]?.[b]).map((b) => table[a][b].mean)
+ );
+ const minVal = Math.min(...allMeans);
+ const maxVal = Math.max(...allMeans);
+ const range = maxVal - minVal || 1;
+
+ function cellColor(value: number): string {
+ const ratio = (value - minVal) / range;
+ if (ratio > 0.66)
+ return `rgba(34, 197, 94, ${0.3 + ratio * 0.5})`;
+ if (ratio > 0.33)
+ return `rgba(234, 179, 8, ${0.3 + ratio * 0.4})`;
+ return `rgba(239, 68, 68, ${0.3 + (1 - ratio) * 0.4})`;
+ }
+
+ return (
+ <div className="card">
+ <h3 style={{ marginBottom: "4px" }}>
+ {axisA} x {axisB}
+ </h3>
+ <p
+ style={{
+ color: "var(--text-muted)",
+ fontSize: "0.75rem",
+ marginBottom: "16px",
+ }}
+ >
+ Mean {metric} for each combination. Interaction strength:{" "}
+ <span
+ style={{
+ fontFamily: "var(--font-mono)",
+ color:
+ data.maxInteraction > 0.05
+ ? "var(--yellow)"
+ : "var(--text-muted)",
+ }}
+ >
+ {metric === "score" ? `${(data.maxInteraction * 100).toFixed(1)}%` : data.maxInteraction.toFixed(2)}
+ </span>
+ </p>
+
+ <div style={{ overflowX: "auto" }}>
+ <table style={{ borderCollapse: "collapse" }}>
+ <thead>
+ <tr>
+ <th
+ style={{
+ padding: "8px 12px",
+ fontSize: "0.7rem",
+ textAlign: "center",
+ }}
+ >
+ {axisA} \ {axisB}
+ </th>
+ {bValues.map((b) => (
+ <th
+ key={b}
+ style={{
+ padding: "8px 12px",
+ fontSize: "0.75rem",
+ textAlign: "center",
+ fontFamily: "var(--font-mono)",
+ }}
+ >
+ {b}
+ </th>
+ ))}
+ </tr>
+ </thead>
+ <tbody>
+ {aValues.map((a) => (
+ <tr key={a}>
+ <td
+ style={{
+ padding: "8px 12px",
+ fontSize: "0.75rem",
+ fontFamily: "var(--font-mono)",
+ fontWeight: 600,
+ }}
+ >
+ {a}
+ </td>
+ {bValues.map((b) => {
+ const cell = table[a]?.[b];
+ if (!cell) {
+ return (
+ <td
+ key={b}
+ style={{
+ padding: "8px 12px",
+ textAlign: "center",
+ color: "var(--text-muted)",
+ }}
+ >
+ -
+ </td>
+ );
+ }
+ return (
+ <td
+ key={b}
+ style={{
+ padding: "8px 12px",
+ textAlign: "center",
+ background: cellColor(cell.mean),
+ fontFamily: "var(--font-mono)",
+ fontSize: "0.8rem",
+ fontWeight: 600,
+ borderRadius: "2px",
+ }}
+ >
+ {metric === "score" ? `${(cell.mean * 100).toFixed(0)}%` : cell.mean.toFixed(2)}
+ <div
+ style={{
+ fontSize: "0.6rem",
+ fontWeight: 400,
+ color: "var(--text-muted)",
+ }}
+ >
+ n={cell.n}
+ </div>
+ </td>
+ );
+ })}
+ </tr>
+ ))}
+ </tbody>
+ </table>
+ </div>
+ </div>
+ );
+}
diff --git a/dashboard/src/components/Insights.tsx b/dashboard/src/components/Insights.tsx
@@ -0,0 +1,109 @@
+import { useState, useMemo } from "react";
+import type { Run } from "../lib/data";
+import { computeMainEffects, computeInteraction } from "../lib/analysis";
+import TornadoChart from "./TornadoChart";
+import Heatmap from "./Heatmap";
+
+interface InsightsProps {
+ runs: Run[];
+}
+
+const METRICS = [
+ { key: "score", label: "Score" },
+ { key: "cost", label: "Cost" },
+ { key: "turns", label: "Turns" },
+ { key: "wall_time", label: "Wall Time" },
+];
+
+export default function Insights({ runs }: InsightsProps) {
+ const [metric, setMetric] = useState("score");
+ const [axisA, setAxisA] = useState("");
+ const [axisB, setAxisB] = useState("");
+
+ const effects = useMemo(
+ () => computeMainEffects(runs, metric),
+ [runs, metric]
+ );
+
+ // Offer the six highest-impact axes; the top two are the default pair
+ const topAxes = useMemo(() => effects.slice(0, 6).map((e) => e.axis), [effects]);
+
+ const interaction = useMemo(() => {
+ const a = axisA || topAxes[0] || "";
+ const b = axisB || topAxes[1] || "";
+ if (!a || !b || a === b) return null;
+ return computeInteraction(runs, a, b, metric);
+ }, [runs, axisA, axisB, metric, topAxes]);
+
+ return (
+ <div style={{ display: "flex", flexDirection: "column", gap: "24px" }}>
+ {/* Metric selector */}
+ <div style={{ display: "flex", gap: "8px", alignItems: "center" }}>
+ <span style={{ fontSize: "0.8rem", color: "var(--text-muted)" }}>
+ Metric:
+ </span>
+ {METRICS.map((m) => (
+ <button
+ key={m.key}
+ onClick={() => setMetric(m.key)}
+ style={{
+ padding: "4px 12px",
+ borderRadius: "4px",
+ border:
+ metric === m.key
+ ? "1px solid var(--accent)"
+ : "1px solid var(--border)",
+ background:
+ metric === m.key ? "rgba(99, 102, 241, 0.15)" : "transparent",
+ color: metric === m.key ? "var(--accent)" : "var(--text-muted)",
+ cursor: "pointer",
+ fontSize: "0.8rem",
+ }}
+ >
+ {m.label}
+ </button>
+ ))}
+ </div>
+
+ {/* Tornado chart */}
+ <TornadoChart effects={effects} metric={metric} />
+
+ {/* Interaction explorer */}
+ <div className="card">
+ <h3 style={{ marginBottom: "12px" }}>Interaction Explorer</h3>
+ <div style={{ display: "flex", gap: "12px", marginBottom: "16px" }}>
+ <div className="filter-group">
+ <label>Axis A</label>
+ <select
+ value={axisA || topAxes[0] || ""}
+ onChange={(e) => setAxisA(e.target.value)}
+ >
+ {topAxes.map((a) => (
+ <option key={a} value={a}>
+ {a}
+ </option>
+ ))}
+ </select>
+ </div>
+ <div className="filter-group">
+ <label>Axis B</label>
+ <select
+ value={axisB || topAxes[1] || ""}
+ onChange={(e) => setAxisB(e.target.value)}
+ >
+ {topAxes
+ .filter((a) => a !== (axisA || topAxes[0]))
+ .map((a) => (
+ <option key={a} value={a}>
+ {a}
+ </option>
+ ))}
+ </select>
+ </div>
+ </div>
+
+ {interaction && <Heatmap data={interaction} metric={metric} />}
+ </div>
+ </div>
+ );
+}
diff --git a/dashboard/src/components/TornadoChart.tsx b/dashboard/src/components/TornadoChart.tsx
@@ -0,0 +1,168 @@
+import type { AxisEffect } from "../lib/analysis";
+
+interface TornadoChartProps {
+ effects: AxisEffect[];
+ metric: string;
+}
+
+const AXIS_LABELS: Record<string, string> = {
+ model: "Model",
+ effort: "Effort",
+ prompt_style: "Prompt Style",
+ language: "Language",
+ human_language: "Human Language",
+ tool_read: "Read Tool",
+ tool_write: "Write Tool",
+ tool_edit: "Edit Tool",
+ tool_glob: "Glob Tool",
+ tool_grep: "Grep Tool",
+ linter: "Linter",
+ playwright: "Playwright",
+ context_file: "Context File",
+ sub_agents: "Sub-agents",
+ web_search: "Web Search",
+ max_budget: "Budget",
+};
+
+export default function TornadoChart({ effects, metric }: TornadoChartProps) {
+ if (effects.length === 0) {
+ return (
+ <div
+ className="card"
+ style={{
+ textAlign: "center",
+ padding: "40px",
+ color: "var(--text-muted)",
+ }}
+ >
+ Not enough data to compute effects. Run more experiments with varying
+ configurations.
+ </div>
+ );
+ }
+
+ const maxSpread = Math.max(...effects.map((e) => e.spread));
+ const scale = maxSpread > 0 ? 200 / maxSpread : 1; // max bar width = 200px
+
+ return (
+ <div className="card">
+ <h3 style={{ marginBottom: "4px" }}>Variable Impact on {metric}</h3>
+ <p
+ style={{
+ color: "var(--text-muted)",
+ fontSize: "0.75rem",
+ marginBottom: "16px",
+ }}
+ >
+ Sorted by effect size. Wider bars = bigger impact on outcomes.
+ </p>
+
+ {effects.map((effect) => (
+ <div
+ key={effect.axis}
+ style={{
+ display: "flex",
+ alignItems: "center",
+ marginBottom: "12px",
+ gap: "12px",
+ }}
+ >
+ {/* Label */}
+ <div
+ style={{
+ width: "120px",
+ textAlign: "right",
+ fontSize: "0.8rem",
+ flexShrink: 0,
+ }}
+ >
+ {AXIS_LABELS[effect.axis] || effect.axis}
+ </div>
+
+ {/* Bars */}
+ <div
+ style={{
+ flex: 1,
+ display: "flex",
+ flexDirection: "column",
+ gap: "2px",
+ }}
+ >
+ {effect.values.map((entry) => {
+ const width = Math.abs(entry.effect) * scale;
+ const isPositive = entry.effect >= 0;
+ return (
+ <div
+ key={entry.value}
+ style={{
+ display: "flex",
+ alignItems: "center",
+ gap: "8px",
+ }}
+ >
+ <div
+ style={{
+ width: "50px",
+ textAlign: "right",
+ fontSize: "0.7rem",
+ fontFamily: "var(--font-mono)",
+ color: "var(--text-muted)",
+ flexShrink: 0,
+ }}
+ >
+ {entry.value}
+ </div>
+ <div
+ style={{
+ height: "16px",
+ width: `${Math.max(width, 2)}px`,
+ background: isPositive
+ ? "var(--green)"
+ : "var(--red)",
+ borderRadius: "2px",
+ opacity: 0.8,
+ }}
+ />
+ <div
+ style={{
+ fontSize: "0.7rem",
+ fontFamily: "var(--font-mono)",
+ color: isPositive
+ ? "var(--green)"
+ : "var(--red)",
+ }}
+ >
+ {entry.effect >= 0 ? "+" : ""}
+ {metric === "score" ? `${(entry.effect * 100).toFixed(1)}%` : entry.effect.toFixed(2)}
+ </div>
+ <div
+ style={{
+ fontSize: "0.65rem",
+ color: "var(--text-muted)",
+ }}
+ >
+ (n={entry.n})
+ </div>
+ </div>
+ );
+ })}
+ </div>
+
+ {/* Spread */}
+ <div
+ style={{
+ width: "60px",
+ textAlign: "right",
+ fontSize: "0.75rem",
+ fontFamily: "var(--font-mono)",
+ color: "var(--accent)",
+ flexShrink: 0,
+ }}
+ >
+ {metric === "score" ? `${(effect.spread * 100).toFixed(1)}%` : effect.spread.toFixed(2)}
+ </div>
+ </div>
+ ))}
+ </div>
+ );
+}
diff --git a/dashboard/src/layouts/Base.astro b/dashboard/src/layouts/Base.astro
@@ -23,6 +23,7 @@ const { title } = Astro.props;
</a>
<nav style="display: flex; gap: 16px; font-size: 0.875rem;">
<a href="/">Grid</a>
+ <a href="/insights">Insights</a>
<a href="/compare">Compare</a>
</nav>
</div>
diff --git a/dashboard/src/lib/analysis.ts b/dashboard/src/lib/analysis.ts
@@ -0,0 +1,180 @@
+import type { Run } from "./data";
+
+export interface EffectEntry {
+ value: string;
+ mean: number;
+ effect: number;
+ n: number;
+}
+
+export interface AxisEffect {
+ axis: string;
+ spread: number;
+ values: EffectEntry[];
+}
+
+export interface InteractionCell {
+ mean: number;
+ n: number;
+}
+
+export interface InteractionResult {
+ axisA: string;
+ axisB: string;
+ table: Record<string, Record<string, InteractionCell>>;
+ maxInteraction: number;
+}
+
+const SKIP_KEYS = new Set([
+ "task",
+ "cell_id",
+ "run_id",
+ "run_number",
+ "runs_per_cell",
+ "max_budget_usd",
+ "timeout_seconds",
+ "base_tools",
+ "started_at",
+ "completed_at",
+ "wall_time_seconds",
+ "exit_code",
+]);
+
+type MetricExtractor = (run: Run) => number | null;
+
+const METRICS: Record<string, MetricExtractor> = {
+ score: (r) => r.eval_results?.score ?? null,
+ cost: (r) => r.claude_output?.total_cost_usd ?? null,
+ turns: (r) => r.claude_output?.num_turns ?? null,
+ wall_time: (r) => r.meta.wall_time_seconds ?? null,
+};
+
+export function computeMainEffects(
+ runs: Run[],
+ metric: string = "score"
+): AxisEffect[] {
+ const extract = METRICS[metric];
+ if (!extract) return [];
+
+ const scored: Array<{ meta: Run["meta"]; value: number }> = [];
+ for (const run of runs) {
+ const val = extract(run);
+ if (val !== null) scored.push({ meta: run.meta, value: val });
+ }
+ if (scored.length === 0) return [];
+
+ const grandMean = scored.reduce((s, r) => s + r.value, 0) / scored.length;
+
+ // Find axis keys from meta
+ const axisKeys = Object.keys(scored[0].meta).filter(
+ (k) => !SKIP_KEYS.has(k)
+ );
+
+ const effects: AxisEffect[] = [];
+
+ for (const axis of axisKeys) {
+ const groups: Record<string, number[]> = {};
+ for (const { meta, value } of scored) {
+ const key = String((meta as Record<string, unknown>)[axis] ?? "unknown");
+ (groups[key] ??= []).push(value);
+ }
+
+ if (Object.keys(groups).length < 2) continue;
+
+ const values: EffectEntry[] = [];
+ for (const [val, vals] of Object.entries(groups)) {
+ const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
+ values.push({
+ value: val,
+ mean: Math.round(mean * 10000) / 10000,
+ effect: Math.round((mean - grandMean) * 10000) / 10000,
+ n: vals.length,
+ });
+ }
+
+ const means = values.map((v) => v.mean);
+ const spread = Math.max(...means) - Math.min(...means);
+
+ effects.push({
+ axis,
+ spread: Math.round(spread * 10000) / 10000,
+ values: values.sort((a, b) => b.effect - a.effect),
+ });
+ }
+
+ return effects.sort((a, b) => b.spread - a.spread);
+}
+
+export function computeInteraction(
+ runs: Run[],
+ axisA: string,
+ axisB: string,
+ metric: string = "score"
+): InteractionResult {
+ const extract = METRICS[metric];
+ if (!extract)
+ return { axisA, axisB, table: {}, maxInteraction: 0 };
+
+ const groups: Record<string, Record<string, number[]>> = {};
+
+ for (const run of runs) {
+ const val = extract(run);
+ if (val === null) continue;
+ const a = String((run.meta as Record<string, unknown>)[axisA] ?? "?");
+ const b = String((run.meta as Record<string, unknown>)[axisB] ?? "?");
+ ((groups[a] ??= {})[b] ??= []).push(val);
+ }
+
+ const table: Record<string, Record<string, InteractionCell>> = {};
+ const allVals: number[] = [];
+
+ for (const [a, bGroups] of Object.entries(groups)) {
+ table[a] = {};
+ for (const [b, vals] of Object.entries(bGroups)) {
+ const mean = vals.reduce((s, v) => s + v, 0) / vals.length;
+ table[a][b] = { mean: Math.round(mean * 10000) / 10000, n: vals.length };
+ allVals.push(mean);
+ }
+ }
+
+ const grandMean =
+ allVals.length > 0
+ ? allVals.reduce((a, b) => a + b, 0) / allVals.length
+ : 0;
+
+ // Row and column means
+ const aMeans: Record<string, number> = {};
+ const bMeans: Record<string, number> = {};
+ const bKeys = new Set<string>();
+
+ for (const [a, bGroups] of Object.entries(table)) {
+ const vals = Object.values(bGroups).map((c) => c.mean);
+ aMeans[a] = vals.reduce((s, v) => s + v, 0) / vals.length;
+ for (const b of Object.keys(bGroups)) bKeys.add(b);
+ }
+
+ for (const b of bKeys) {
+ const vals: number[] = [];
+ for (const a of Object.keys(table)) {
+ if (table[a][b]) vals.push(table[a][b].mean);
+ }
+ bMeans[b] = vals.length > 0 ? vals.reduce((s, v) => s + v, 0) / vals.length : grandMean;
+ }
+
+ // Max interaction = max deviation from additive model
+ let maxInteraction = 0;
+ for (const a of Object.keys(table)) {
+ for (const b of Object.keys(table[a])) {
+ const expected = aMeans[a] + bMeans[b] - grandMean;
+ const actual = table[a][b].mean;
+ maxInteraction = Math.max(maxInteraction, Math.abs(actual - expected));
+ }
+ }
+
+ return {
+ axisA,
+ axisB,
+ table,
+ maxInteraction: Math.round(maxInteraction * 10000) / 10000,
+ };
+}
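
The interaction measure in `computeInteraction` can be hard to read off the TypeScript, so here is a small Python sketch of the same idea: compare each cell mean with what an additive model (row mean + column mean - grand mean) would predict, and report the largest gap. The function name and the example tables are hypothetical, and the input is assumed to be a table of cell means rather than raw runs.

```python
def max_interaction(table):
    """Largest deviation of any cell mean from the additive prediction.

    table maps axis-A value -> {axis-B value -> cell mean}.
    """
    cells = [m for row in table.values() for m in row.values()]
    grand = sum(cells) / len(cells)

    # Row means (axis A) and column means (axis B), unweighted over cells.
    a_means = {a: sum(row.values()) / len(row) for a, row in table.items()}
    b_keys = {b for row in table.values() for b in row}
    b_means = {}
    for b in b_keys:
        col = [row[b] for row in table.values() if b in row]
        b_means[b] = sum(col) / len(col)

    worst = 0.0
    for a, row in table.items():
        for b, actual in row.items():
            expected = a_means[a] + b_means[b] - grand
            worst = max(worst, abs(actual - expected))
    return round(worst, 4)


# Purely additive: switching effort adds 0.1 regardless of model.
additive = {"haiku": {"high": 0.4, "max": 0.5}, "opus": {"high": 0.6, "max": 0.7}}
# Non-additive: effort helps opus far more than haiku.
crossed = {"haiku": {"high": 0.2, "max": 0.4}, "opus": {"high": 0.4, "max": 1.0}}

print(max_interaction(additive))  # 0.0
print(max_interaction(crossed))   # 0.1
```

A value near zero means the two axes act independently on the metric; a larger value means the effect of one axis depends on the level of the other.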
diff --git a/dashboard/src/pages/insights.astro b/dashboard/src/pages/insights.astro
@@ -0,0 +1,17 @@
+---
+import Base from "../layouts/Base.astro";
+import { loadAllRuns } from "../lib/data";
+import Insights from "../components/Insights";
+
+const runs = loadAllRuns();
+---
+
+<Base title="Insights">
+ <h1 style="margin-bottom: 8px;">Insights</h1>
+ <p style="color: var(--text-muted); margin-bottom: 24px; font-size: 0.875rem;">
+ Which variables actually move the needle? Tornado charts show main effects,
+ heatmaps reveal interactions.
+ </p>
+
+ <Insights client:load runs={runs} />
+</Base>
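
For readers who want to check the tornado chart numbers by hand, the main-effects calculation can be sketched in a few lines of Python. This is an illustration, not the dashboard code: runs are assumed to be flat dicts with one key per axis plus the metric, and the function name and sample data are invented.

```python
from collections import defaultdict


def main_effects(runs, axes, metric="score"):
    """Rank axes by the spread between their best and worst level means."""
    grand = sum(r[metric] for r in runs) / len(runs)
    effects = []
    for axis in axes:
        groups = defaultdict(list)
        for r in runs:
            groups[r[axis]].append(r[metric])
        if len(groups) < 2:
            continue  # an axis that never varies has no measurable effect
        means = {level: sum(v) / len(v) for level, v in groups.items()}
        effects.append({
            "axis": axis,
            "spread": max(means.values()) - min(means.values()),
            # Effect of each level = its mean minus the grand mean.
            "effects": {level: m - grand for level, m in means.items()},
        })
    return sorted(effects, key=lambda e: e["spread"], reverse=True)


# Hypothetical results: model matters a lot, linter a little.
runs = [
    {"model": "haiku", "linter": "off", "score": 0.4},
    {"model": "haiku", "linter": "on", "score": 0.5},
    {"model": "opus", "linter": "off", "score": 0.8},
    {"model": "opus", "linter": "on", "score": 0.9},
]

for e in main_effects(runs, ["model", "linter"]):
    print(e["axis"], round(e["spread"], 4))
```

In this toy data the `model` axis is listed first because its level means differ by 0.4, against 0.1 for `linter`.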
diff --git a/grid.yaml b/grid.yaml
@@ -3,7 +3,6 @@ version: 1
defaults:
runs_per_cell: 3
timeout_seconds: 600
- base_tools: "Bash,Read,Edit,Write,Glob,Grep"
budget:
low: 0.50
high: 5.00
@@ -19,6 +18,16 @@ axes:
values: [typescript, javascript]
human_language:
values: [en, es]
+ tool_read:
+ values: ["on", "off"]
+ tool_write:
+ values: ["on", "off"]
+ tool_edit:
+ values: ["on", "off"]
+ tool_glob:
+ values: ["on", "off"]
+ tool_grep:
+ values: ["on", "off"]
linter:
values: ["on", "off"]
playwright:
@@ -54,11 +63,16 @@ profiles:
smoke:
description: "Quick validation -- minimal grid"
axes:
- model: [sonnet]
+ model: [haiku]
effort: [high]
prompt_style: [simple, detailed]
language: [typescript]
human_language: [en]
+ tool_read: ["on"]
+ tool_write: ["on"]
+ tool_edit: ["on"]
+ tool_glob: ["on"]
+ tool_grep: ["on"]
linter: ["off"]
playwright: ["off"]
context_file: [none]
@@ -75,6 +89,11 @@ profiles:
prompt_style: [simple, detailed]
language: [typescript]
human_language: [en]
+ tool_read: ["on"]
+ tool_write: ["on"]
+ tool_edit: ["on"]
+ tool_glob: ["on"]
+ tool_grep: ["on"]
linter: ["off"]
playwright: ["off"]
context_file: [none]
diff --git a/harness/lib/compute_grid.py b/harness/lib/compute_grid.py
@@ -108,7 +108,6 @@ def compute_cells(grid, profile_name):
cell["runs_per_cell"] = runs_per_cell
cell["max_budget_usd"] = budget_usd
cell["timeout_seconds"] = defaults["timeout_seconds"]
- cell["base_tools"] = defaults["base_tools"]
cells.append(cell)
diff --git a/harness/lib/evaluate.sh b/harness/lib/evaluate.sh
@@ -13,37 +13,44 @@ evaluate() {
local eval_results='{"structural": null, "functional": null, "quality": null, "score": null}'
+ # Helper: safely merge JSON into eval_results
+ merge_result() {
+ local key="$1"
+ local output="$2"
+
+ if [[ -z "$output" ]]; then
+ eval_results=$(echo "$eval_results" | jq --arg k "$key" '.[$k] = {"pass": false, "error": "no output"}')
+ return
+ fi
+
+ if echo "$output" | jq . > /dev/null 2>&1; then
+ eval_results=$(echo "$eval_results" | jq --arg k "$key" --argjson v "$output" '.[$k] = $v')
+ else
+ # Truncate long non-JSON output to avoid jq issues
+ local truncated="${output:0:500}"
+ eval_results=$(echo "$eval_results" | jq --arg k "$key" --arg e "$truncated" '.[$k] = {"pass": false, "error": $e}')
+ fi
+ }
+
# --- Structural checks ---
if [[ -f "$task_dir/eval/structural.sh" ]]; then
local structural_output
structural_output=$(bash "$task_dir/eval/structural.sh" "$workspace" "$language" 2>&1) || true
- if echo "$structural_output" | jq . > /dev/null 2>&1; then
- eval_results=$(echo "$eval_results" | jq --argjson s "$structural_output" '.structural = $s')
- else
- eval_results=$(echo "$eval_results" | jq --arg s "$structural_output" '.structural = {"pass": false, "error": $s}')
- fi
+ merge_result "structural" "$structural_output"
fi
# --- Functional tests ---
- local functional_output='{}'
if [[ -d "$task_dir/eval/tests" ]]; then
- functional_output=$(run_functional_tests "$task_dir" "$workspace" "$language" "$run_dir") || true
- if echo "$functional_output" | jq . > /dev/null 2>&1; then
- eval_results=$(echo "$eval_results" | jq --argjson f "$functional_output" '.functional = $f')
- else
- eval_results=$(echo "$eval_results" | jq '.functional = {"pass": false, "error": "test runner failed"}')
- fi
+ local functional_output
+ functional_output=$(run_functional_tests "$task_dir" "$workspace" "$language" "$run_dir" 2>&1) || true
+ merge_result "functional" "$functional_output"
fi
# --- Quality checks ---
if [[ -f "$task_dir/eval/quality.sh" ]]; then
local quality_output
quality_output=$(bash "$task_dir/eval/quality.sh" "$workspace" "$language" 2>&1) || true
- if echo "$quality_output" | jq . > /dev/null 2>&1; then
- eval_results=$(echo "$eval_results" | jq --argjson q "$quality_output" '.quality = $q')
- else
- eval_results=$(echo "$eval_results" | jq --arg q "$quality_output" '.quality = {"pass": false, "error": $q}')
- fi
+ merge_result "quality" "$quality_output"
fi
# --- Compute aggregate score ---
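
The three cases that `merge_result` handles (empty output, valid JSON, anything else) can be illustrated with a Python sketch. It is an approximation of the bash helper, not a drop-in replacement: the function signature is invented here, and `json.loads` accepts exactly one JSON value, whereas `jq .` also accepts a stream of several.

```python
import json


def merge_result(eval_results, key, output, max_error_chars=500):
    """Return a copy of eval_results with the check output stored under key."""
    merged = dict(eval_results)
    if not output:
        # The check produced nothing at all: record a failure.
        merged[key] = {"pass": False, "error": "no output"}
        return merged
    try:
        merged[key] = json.loads(output)
    except json.JSONDecodeError:
        # Not JSON (for example a stack trace): keep a truncated copy as the error.
        merged[key] = {"pass": False, "error": output[:max_error_chars]}
    return merged


results = {"structural": None, "functional": None, "quality": None, "score": None}
results = merge_result(results, "structural", '{"pass": true}')
results = merge_result(results, "functional", "")
results = merge_result(results, "quality", "npm ERR! missing script: lint")
print(json.dumps(results, indent=2))
```

Funnelling every check through one helper means a crashing or silent eval script still yields a well-formed `eval_results.json` entry instead of aborting the run.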
diff --git a/harness/lib/experiment_design.py b/harness/lib/experiment_design.py
@@ -0,0 +1,582 @@
+#!/usr/bin/env python3
+"""Experiment design and analysis for loop benchmarking.
+
+Generates efficient experiment plans instead of full factorial grids.
+Analyzes results to identify which variables have the biggest impact.
+
+Approaches:
+ 1. Main effects sweep: vary one axis at a time from a baseline
+ 2. Fractional factorial: Plackett-Burman screening for binary factors
+ 3. Interaction hunt: full factorial on the top-k most impactful axes
+"""
+
+import json
+import math
+import sys
+from itertools import product
+from pathlib import Path
+
+import yaml
+
+
+def load_grid(path):
+ with open(path) as f:
+ return yaml.safe_load(f)
+
+
+def get_axes(grid, profile_name=None):
+ """Get axis definitions, optionally filtered by profile."""
+ top_axes = {name: spec["values"] for name, spec in grid["axes"].items()}
+ if profile_name and profile_name in grid.get("profiles", {}):
+ profile = grid["profiles"][profile_name]
+ if "axes" in profile:
+ axes = dict(top_axes)
+ for name, values in profile["axes"].items():
+ axes[name] = values
+ return axes
+ return top_axes
+
+
+# ---------------------------------------------------------------------------
+# 1. Main effects sweep
+# ---------------------------------------------------------------------------
+
+def main_effects_plan(grid, baseline=None, tasks=None):
+ """Generate a one-at-a-time sweep from a baseline.
+
+ For each axis, vary it through all its values while holding everything
+ else at baseline. This identifies main effects cheaply.
+
+ Returns a list of cell dicts.
+ """
+ axes = get_axes(grid)
+ tasks = tasks or grid["tasks"]
+ defaults = grid["defaults"]
+
+ # Pick baseline: first value of each axis unless overridden
+ if baseline is None:
+ baseline = {name: values[0] for name, values in axes.items()}
+
+ cells = []
+ seen = set()
+
+ for task in tasks:
+ # Apply task overrides to axes
+ task_axes = dict(axes)
+ overrides = grid.get("task_overrides", {}).get(task, {})
+ if "axes" in overrides:
+ for axis_name, spec in overrides["axes"].items():
+ task_axes[axis_name] = spec["values"]
+
+ # Baseline cell
+ base_cell = dict(baseline)
+ # Ensure baseline values are valid for this task
+ for name, values in task_axes.items():
+ if base_cell[name] not in values:
+ base_cell[name] = values[0]
+
+ base_key = _cell_key(task, base_cell)
+ if base_key not in seen:
+ seen.add(base_key)
+ cells.append(_build_cell(task, base_cell, defaults, grid))
+
+ # Vary each axis
+ for axis_name, values in task_axes.items():
+ for value in values:
+ if value == base_cell[axis_name]:
+ continue
+ varied = dict(base_cell)
+ varied[axis_name] = value
+ key = _cell_key(task, varied)
+ if key not in seen:
+ seen.add(key)
+ cells.append(_build_cell(task, varied, defaults, grid))
+
+ return cells
+
+
+# ---------------------------------------------------------------------------
+# 2. Plackett-Burman screening
+# ---------------------------------------------------------------------------
+
+def _hadamard_matrix(n):
+ """Build a Plackett-Burman design matrix of +1/-1 with n rows and n-1 columns.
+
+ Rows are cyclic shifts of a tabulated generator plus a final all -1 row.
+ n outside the table (4 to 24, step 4) uses the next larger size, capped at 24.
+ """
+ # For simplicity, use the standard PB generators for common sizes
+ # These are the first rows; subsequent rows are cyclic shifts
+ generators = {
+ 4: [1, 1, -1],
+ 8: [1, 1, 1, -1, 1, -1, -1],
+ 12: [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1],
+ 16: [1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, -1],
+ 20: [1, 1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, -1, -1],
+ 24: [1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, -1, -1, -1],
+ }
+
+ if n not in generators:
+ # Fall back to nearest larger size
+ for size in sorted(generators.keys()):
+ if size >= n:
+ n = size
+ break
+ else:
+ n = max(generators.keys())
+
+ gen = generators[n]
+ k = len(gen)
+ matrix = []
+
+ for i in range(k):
+ row = gen[i:] + gen[:i]
+ matrix.append(row)
+
+ # Add a row of all -1
+ matrix.append([-1] * k)
+
+ return matrix
+
+
+def plackett_burman_plan(grid, tasks=None):
+ """Generate a Plackett-Burman screening design for binary factors.
+
+ Factors with more than 2 levels (e.g., model: haiku/sonnet/opus) are not
+ part of the PB matrix; the design is repeated for each combination of their levels.
+
+ Returns a list of cell dicts.
+ """
+ axes = get_axes(grid)
+ tasks = tasks or grid["tasks"]
+ defaults = grid["defaults"]
+
+ # Separate binary and multi-level factors
+ binary_axes = {}
+ multi_axes = {}
+ for name, values in axes.items():
+ if len(values) == 2:
+ binary_axes[name] = values
+ elif len(values) > 2:
+ multi_axes[name] = values
+
+ binary_names = sorted(binary_axes.keys())
+ n_factors = len(binary_names)
+
+ if n_factors == 0:
+ return main_effects_plan(grid, tasks=tasks)
+
+ # Find the smallest PB design that fits
+ n_runs = n_factors + 1
+ # Round up to multiple of 4
+ n_runs = math.ceil(n_runs / 4) * 4
+
+ matrix = _hadamard_matrix(n_runs)
+
+ cells = []
+ seen = set()
+
+ # For multi-level factors, fix at each level and run the PB design
+ if multi_axes:
+ multi_names = sorted(multi_axes.keys())
+ multi_combos = list(product(*[multi_axes[n] for n in multi_names]))
+ else:
+ multi_names = []
+ multi_combos = [()]
+
+ for multi_combo in multi_combos:
+ multi_fixed = dict(zip(multi_names, multi_combo))
+
+ for row in matrix:
+ cell = dict(multi_fixed)
+ for i, name in enumerate(binary_names):
+ if i < len(row):
+ idx = 0 if row[i] == -1 else 1
+ else:
+ idx = 0
+ cell[name] = binary_axes[name][idx]
+
+ for task in tasks:
+ # Apply task overrides
+ task_axes = dict(axes)
+ overrides = grid.get("task_overrides", {}).get(task, {})
+ if "axes" in overrides:
+ for axis_name, spec in overrides["axes"].items():
+ task_axes[axis_name] = spec["values"]
+
+ # Ensure values are valid for this task
+ valid = True
+ for name, values in task_axes.items():
+ if cell.get(name) not in values:
+ if len(values) == 1:
+ cell[name] = values[0]
+ else:
+ valid = False
+ break
+
+ # Check exclusions
+ if valid and not _is_excluded(cell, grid):
+ key = _cell_key(task, cell)
+ if key not in seen:
+ seen.add(key)
+ cells.append(_build_cell(task, cell, defaults, grid))
+
+ return cells
+
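The run-count rule used by `plackett_burman_plan` (binary factors + 1, rounded up to a multiple of 4) can be checked with a small sketch. `_hadamard_matrix` is not shown in this part of the diff, so the example substitutes the Sylvester construction, which only covers power-of-two orders. `pb_run_count`, `sylvester_hadamard`, and `rows_orthogonal` are illustrative names, not part of the module.

```python
import math


def pb_run_count(n_factors: int) -> int:
    """Smallest multiple of 4 that is at least n_factors + 1."""
    return math.ceil((n_factors + 1) / 4) * 4


def sylvester_hadamard(n: int) -> list[list[int]]:
    """Hadamard matrix of order n by Sylvester doubling.

    Assumption: this stands in for the module's own _hadamard_matrix and
    only works when n is a power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("Sylvester construction needs a power-of-two order")
    h = [[1]]
    while len(h) < n:
        # Block layout [[H, H], [H, -H]] doubles the order each pass.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rows_orthogonal(matrix: list[list[int]]) -> bool:
    """True when every pair of distinct rows has a zero dot product."""
    for i, row_i in enumerate(matrix):
        for row_j in matrix[i + 1:]:
            if sum(a * b for a, b in zip(row_i, row_j)) != 0:
                return False
    return True


# Five binary factors need 8 runs; each non-constant column is balanced,
# so every factor is "on" in half of the runs and "off" in the other half.
runs = pb_run_count(5)
design = sylvester_hadamard(runs)
print(runs, rows_orthogonal(design))
```

Orders such as 12 or 20 are valid Plackett-Burman sizes but are not reachable by doubling, so a real implementation needs a different generator for them.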
+
+# ---------------------------------------------------------------------------
+# 3. Interaction hunt
+# ---------------------------------------------------------------------------
+
+def interaction_hunt_plan(grid, top_axes, tasks=None):
+ """Full factorial on a subset of axes, baseline for the rest.
+
+ Args:
+ top_axes: list of axis names to fully explore (e.g., ["model", "effort", "linter"])
+ tasks: which tasks to include
+ """
+ axes = get_axes(grid)
+ tasks = tasks or grid["tasks"]
+ defaults = grid["defaults"]
+
+ # Baseline for non-explored axes
+ baseline = {name: values[0] for name, values in axes.items()}
+
+ # Full factorial on top_axes
+ explore_names = sorted(top_axes)
+ explore_values = [axes[n] for n in explore_names]
+
+ cells = []
+ seen = set()
+
+ for combo in product(*explore_values):
+ cell = dict(baseline)
+ for name, value in zip(explore_names, combo):
+ cell[name] = value
+
+ for task in tasks:
+ task_axes = dict(axes)
+ overrides = grid.get("task_overrides", {}).get(task, {})
+ if "axes" in overrides:
+ for axis_name, spec in overrides["axes"].items():
+ task_axes[axis_name] = spec["values"]
+
+ task_cell = dict(cell)  # per-task copy so one task's constraints don't leak into the next
+ for name, values in task_axes.items():
+ if task_cell.get(name) not in values:
+ task_cell[name] = values[0]
+
+ if not _is_excluded(task_cell, grid):
+ key = _cell_key(task, task_cell)
+ if key not in seen:
+ seen.add(key)
+ cells.append(_build_cell(task, task_cell, defaults, grid))
+
+ return cells
+
+
+# ---------------------------------------------------------------------------
+# Analysis: compute effects from results
+# ---------------------------------------------------------------------------
+
+def analyze_main_effects(results_dir, metric="score"):
+ """Compute the main effect of each axis on a given metric.
+
+ Reads all completed runs, groups by axis values, computes mean metric
+ for each group, and returns the effect size (difference from grand mean).
+
+ Returns a dict: {axis_name: {"values": {value: {"mean", "effect", "n"}}, "spread": float}, ...}
+ sorted by spread (largest difference between group means first).
+ """
+ runs = _load_results(results_dir)
+ if not runs:
+ return {}
+
+ # Extract metric values
+ scored_runs = []
+ for run in runs:
+ val = _extract_metric(run, metric)
+ if val is not None:
+ scored_runs.append((run["meta"], val))
+
+ if not scored_runs:
+ return {}
+
+ grand_mean = sum(v for _, v in scored_runs) / len(scored_runs)
+
+ # Identify axes from the first run's meta
+ meta_keys = set(scored_runs[0][0].keys())
+ skip_keys = {
+ "task", "cell_id", "run_id", "run_number", "runs_per_cell",
+ "max_budget_usd", "timeout_seconds", "base_tools",
+ "started_at", "completed_at", "wall_time_seconds", "exit_code",
+ }
+ axis_names = sorted(meta_keys - skip_keys)
+
+ effects = {}
+ for axis in axis_names:
+ groups = {}
+ for meta, val in scored_runs:
+ key = str(meta.get(axis, "unknown"))
+ groups.setdefault(key, []).append(val)
+
+ if len(groups) < 2:
+ continue
+
+ axis_effects = {}
+ for value, vals in sorted(groups.items()):
+ group_mean = sum(vals) / len(vals)
+ effect = group_mean - grand_mean
+ axis_effects[value] = {
+ "mean": round(group_mean, 4),
+ "effect": round(effect, 4),
+ "n": len(vals),
+ }
+
+ # Effect magnitude = max spread between any two values
+ means = [v["mean"] for v in axis_effects.values()]
+ spread = max(means) - min(means) if means else 0
+
+ effects[axis] = {
+ "values": axis_effects,
+ "spread": round(spread, 4),
+ }
+
+ # Sort by spread (biggest effects first)
+ effects = dict(sorted(effects.items(), key=lambda x: -x[1]["spread"]))
+ return effects
+
+
+def analyze_interactions(results_dir, axis_a, axis_b, metric="score"):
+ """Compute the interaction effect between two axes.
+
+ Returns a 2D table of mean metric values for each (a_value, b_value) combo,
+ plus the interaction effect size.
+ """
+ runs = _load_results(results_dir)
+ if not runs:
+ return {}
+
+ groups = {}
+ for run in runs:
+ val = _extract_metric(run, metric)
+ if val is None:
+ continue
+ a_val = str(run["meta"].get(axis_a, "?"))
+ b_val = str(run["meta"].get(axis_b, "?"))
+ key = (a_val, b_val)
+ groups.setdefault(key, []).append(val)
+
+ if not groups:
+ return {}
+
+ table = {}
+ for (a_val, b_val), vals in sorted(groups.items()):
+ table.setdefault(a_val, {})[b_val] = {
+ "mean": round(sum(vals) / len(vals), 4),
+ "n": len(vals),
+ }
+
+ # Compute interaction: does the effect of axis_a change depending on axis_b?
+ a_values = sorted(table.keys())
+ b_values = sorted(set(b for row in table.values() for b in row.keys()))
+
+ # Interaction = deviation from additive model.
+ # Grand mean is the unweighted mean of the cell means (each observed combo counts once).
+ cell_means = [cell["mean"] for row in table.values() for cell in row.values()]
+ grand_mean = sum(cell_means) / len(cell_means)
+
+ a_means = {}
+ for a in a_values:
+ vals = [table[a][b]["mean"] for b in b_values if b in table.get(a, {})]
+ a_means[a] = sum(vals) / len(vals) if vals else grand_mean
+
+ b_means = {}
+ for b in b_values:
+ vals = [table[a][b]["mean"] for a in a_values if b in table.get(a, {})]
+ b_means[b] = sum(vals) / len(vals) if vals else grand_mean
+
+ # Interaction effects
+ interactions = {}
+ max_interaction = 0
+ for a in a_values:
+ for b in b_values:
+ if b in table.get(a, {}):
+ expected = a_means[a] + b_means[b] - grand_mean
+ actual = table[a][b]["mean"]
+ interaction = round(actual - expected, 4)
+ interactions[(a, b)] = interaction
+ max_interaction = max(max_interaction, abs(interaction))
+
+ return {
+ "table": table,
+ "grand_mean": round(grand_mean, 4),
+ "a_means": {k: round(v, 4) for k, v in a_means.items()},
+ "b_means": {k: round(v, 4) for k, v in b_means.items()},
+ "interactions": {f"{a},{b}": v for (a, b), v in interactions.items()},
+ "max_interaction": round(max_interaction, 4),
+ }
+
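The additive-model arithmetic in `analyze_interactions` is easier to follow on a tiny table. The sketch below is a simplified stand-in, assuming a complete table with every (a, b) combination present; `interaction_residuals` is an illustrative name and the scores are hypothetical, not measured results.

```python
def interaction_residuals(cell_means):
    """Return actual-minus-additive residuals for a table of cell means.

    cell_means maps (a_level, b_level) -> mean metric. Every combination is
    assumed to be present, which keeps the sketch short.
    """
    a_levels = sorted({a for a, _ in cell_means})
    b_levels = sorted({b for _, b in cell_means})

    # Unweighted mean over cells, matching the harness's grand mean.
    grand_mean = sum(cell_means.values()) / len(cell_means)
    a_means = {
        a: sum(cell_means[(a, b)] for b in b_levels) / len(b_levels)
        for a in a_levels
    }
    b_means = {
        b: sum(cell_means[(a, b)] for a in a_levels) / len(a_levels)
        for b in b_levels
    }

    residuals = {}
    for a in a_levels:
        for b in b_levels:
            # What the cell would score if the two axes did not interact.
            expected = a_means[a] + b_means[b] - grand_mean
            residuals[(a, b)] = round(cell_means[(a, b)] - expected, 4)
    return residuals


# Hypothetical scores: the linter helps sonnet more than it helps haiku.
scores = {
    ("haiku", "off"): 0.4,
    ("haiku", "on"): 0.5,
    ("sonnet", "off"): 0.6,
    ("sonnet", "on"): 0.9,
}
for key, value in sorted(interaction_residuals(scores).items()):
    print(key, value)
```

A table whose residuals are all zero is purely additive; the larger the absolute residual, the more the effect of one axis depends on the level of the other.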
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+def _cell_key(task, cell):
+ axis_names = sorted(k for k in cell.keys() if k not in (
+ "task", "cell_id", "runs_per_cell", "max_budget_usd",
+ "timeout_seconds", "base_tools",
+ ))
+ parts = [task] + [f"{k}={cell[k]}" for k in axis_names]
+ return "_".join(parts)
+
+
+def _is_excluded(cell, grid):
+ for exclusion in grid.get("exclusions", []):
+ match = True
+ for key, value in exclusion["when"].items():
+ if cell.get(key) != value:
+ match = False
+ break
+ if match:
+ return True
+ return False
+
+
+def _build_cell(task, cell, defaults, grid):
+ axis_names = sorted(cell.keys())
+ cell_id_parts = [task] + [f"{k}={cell[k]}" for k in axis_names]
+
+ result = dict(cell)
+ result["task"] = task
+ result["cell_id"] = "_".join(cell_id_parts)
+ result["runs_per_cell"] = defaults.get("runs_per_cell", 3)
+ result["timeout_seconds"] = defaults.get("timeout_seconds", 600)
+
+ budget_key = cell.get("max_budget", "low")
+ result["max_budget_usd"] = defaults.get("budget", {}).get(budget_key, 0.50)
+
+ return result
+
+
+def _load_results(results_dir):
+ """Load all completed runs from the results directory."""
+ results_dir = Path(results_dir)
+ runs_dir = results_dir / "runs"
+ if not runs_dir.exists():
+ return []
+
+ runs = []
+ for run_dir in runs_dir.iterdir():
+ if not run_dir.is_dir():
+ continue
+ meta_path = run_dir / "meta.json"
+ eval_path = run_dir / "eval_results.json"
+ claude_path = run_dir / "claude_output.json"
+
+ if not meta_path.exists() or not eval_path.exists():
+ continue
+
+ try:
+ meta = json.loads(meta_path.read_text())
+ eval_results = json.loads(eval_path.read_text())
+ claude_output = {}
+ if claude_path.exists():
+ claude_output = json.loads(claude_path.read_text())
+
+ runs.append({
+ "meta": meta,
+ "eval": eval_results,
+ "claude": claude_output,
+ })
+ except (json.JSONDecodeError, OSError):
+ continue
+
+ return runs
+
+
+def _extract_metric(run, metric):
+ """Extract a numeric metric from a run."""
+ if metric == "score":
+ val = run["eval"].get("score")
+ return val if isinstance(val, (int, float)) else None
+ elif metric == "cost":
+ return run["claude"].get("total_cost_usd")
+ elif metric == "turns":
+ return run["claude"].get("num_turns")
+ elif metric == "wall_time":
+ return run["meta"].get("wall_time_seconds")
+ elif metric == "pass_rate":
+ func = run["eval"].get("functional", {})
+ if isinstance(func, dict) and "pass" in func:
+ return 1.0 if func["pass"] else 0.0
+ return None
+ return None
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+def main():
+ if len(sys.argv) < 3:
+ print("Usage:")
+ print(" experiment_design.py plan <grid_file> <design> [args...]")
+ print(" designs: main_effects, plackett_burman, interaction_hunt")
+ print(" experiment_design.py analyze <results_dir> <analysis> [args...]")
+ print(" analyses: main_effects, interactions")
+ sys.exit(1)
+
+ command = sys.argv[1]
+
+ if command == "plan":
+ grid_file = sys.argv[2]
+ design = sys.argv[3] if len(sys.argv) > 3 else "main_effects"
+ grid = load_grid(grid_file)
+
+ if design == "main_effects":
+ cells = main_effects_plan(grid)
+ elif design == "plackett_burman":
+ cells = plackett_burman_plan(grid)
+ elif design == "interaction_hunt":
+ top_axes = sys.argv[4].split(",") if len(sys.argv) > 4 else []
+ if not top_axes:
+ print("ERROR: interaction_hunt requires comma-separated axis names", file=sys.stderr)
+ sys.exit(1)
+ cells = interaction_hunt_plan(grid, top_axes)
+ else:
+ print(f"Unknown design: {design}", file=sys.stderr)
+ sys.exit(1)
+
+ print(f"# {design}: {len(cells)} cells", file=sys.stderr)
+ for cell in cells:
+ print(json.dumps(cell))
+
+ elif command == "analyze":
+ results_dir = sys.argv[2]
+ analysis = sys.argv[3] if len(sys.argv) > 3 else "main_effects"
+
+ if analysis == "main_effects":
+ metric = sys.argv[4] if len(sys.argv) > 4 else "score"
+ effects = analyze_main_effects(results_dir, metric)
+ print(json.dumps(effects, indent=2))
+ elif analysis == "interactions":
+ if len(sys.argv) < 6:
+ print("ERROR: interactions requires two axis names", file=sys.stderr)
+ sys.exit(1)
+ axis_a = sys.argv[4]
+ axis_b = sys.argv[5]
+ metric = sys.argv[6] if len(sys.argv) > 6 else "score"
+ result = analyze_interactions(results_dir, axis_a, axis_b, metric)
+ print(json.dumps(result, indent=2))
+ else:
+ print(f"Unknown analysis: {analysis}", file=sys.stderr)
+ sys.exit(1)
+
+ else:
+ print(f"Unknown command: {command}", file=sys.stderr)
+ sys.exit(1)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/harness/lib/get-oauth-token.sh b/harness/lib/get-oauth-token.sh
@@ -0,0 +1,13 @@
+#!/usr/bin/env bash
+# Extract OAuth token from Claude Code credentials for use with --bare mode.
+# This lets the harness use your Claude plan while maintaining full isolation.
+
+CREDS_FILE="${CLAUDE_CONFIG_DIR:-$HOME/.claude}/.credentials.json"
+
+if [[ ! -f "$CREDS_FILE" ]]; then
+ echo "ERROR: No credentials file found at $CREDS_FILE" >&2
+ exit 1
+fi
+
+# Extract the OAuth access token; -e makes jq exit non-zero when the field is missing
+jq -e -r '.claudeAiOauth.accessToken // empty' "$CREDS_FILE"
diff --git a/harness/lib/invoke.sh b/harness/lib/invoke.sh
@@ -45,8 +45,19 @@ Use TypeScript."
Use JavaScript (no TypeScript)."
fi
- # Build tool list
- local tools="$base_tools"
+ # Build tool list from individual axes (Bash always on)
+ local tools="Bash"
+ local tool_read tool_write tool_edit tool_glob tool_grep
+ tool_read=$(echo "$cell_json" | jq -r '.tool_read // "on"')
+ tool_write=$(echo "$cell_json" | jq -r '.tool_write // "on"')
+ tool_edit=$(echo "$cell_json" | jq -r '.tool_edit // "on"')
+ tool_glob=$(echo "$cell_json" | jq -r '.tool_glob // "on"')
+ tool_grep=$(echo "$cell_json" | jq -r '.tool_grep // "on"')
+ [[ "$tool_read" == "on" ]] && tools="$tools,Read"
+ [[ "$tool_write" == "on" ]] && tools="$tools,Write"
+ [[ "$tool_edit" == "on" ]] && tools="$tools,Edit"
+ [[ "$tool_glob" == "on" ]] && tools="$tools,Glob"
+ [[ "$tool_grep" == "on" ]] && tools="$tools,Grep"
if [[ "$sub_agents" == "on" ]]; then
tools="$tools,Agent"
fi
@@ -55,15 +66,22 @@ Use JavaScript (no TypeScript)."
fi
# Build the claude command
+ # --bare for full isolation (no CLAUDE.md, hooks, MCP, memory).
+ # Auth via apiKeyHelper that reads OAuth token from ~/.claude/.credentials.json.
+ local auth_helper
+ auth_helper="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/get-oauth-token.sh"
+
local cmd=(
claude
--bare
-p "$prompt"
--model "$model"
--output-format stream-json
- --dangerously-skip-permissions
+ --verbose
+ --permission-mode dontAsk
--max-budget-usd "$budget"
--allowedTools "$tools"
+ --settings "{\"apiKeyHelper\": \"$auth_helper\"}"
)
# Add effort level
diff --git a/harness/run.py b/harness/run.py
@@ -0,0 +1,424 @@
+#!/usr/bin/env python3
+"""Loop Benchmarking Harness - Main orchestrator.
+
+Computes the experiment grid, creates isolated workspaces, invokes claude,
+runs evaluation, and stores results.
+
+Usage:
+ python3 run.py [grid_file] [profile_or_design]
+
+ profile_or_design can be:
+ - A profile name from grid.yaml (e.g., smoke, core, full)
+ - A DOE design: main_effects, plackett_burman
+ - interaction_hunt:axis1,axis2,axis3
+"""
+
+import json
+import os
+import shutil
+import subprocess
+import sys
+import tarfile
+import tempfile
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+PROJECT_DIR = SCRIPT_DIR.parent
+sys.path.insert(0, str(SCRIPT_DIR / "lib"))
+
+from compute_grid import load_grid, compute_cells
+from experiment_design import (
+ main_effects_plan,
+ plackett_burman_plan,
+ interaction_hunt_plan,
+ analyze_main_effects,
+)
+
+
+def create_workspace(project_dir: Path, task: str, cell: dict) -> Path:
+ """Create an isolated temp directory with appropriate setup."""
+ workspace = Path(tempfile.mkdtemp(prefix="loop-bench-"))
+
+ language = cell.get("language", "typescript")
+ linter = cell.get("linter", "off")
+ playwright = cell.get("playwright", "off")
+
+ # npm init
+ subprocess.run(["npm", "init", "-y"], cwd=workspace, capture_output=True)
+
+ # TypeScript
+ if language == "typescript":
+ subprocess.run(
+ ["npm", "install", "--save-dev", "typescript", "@types/node"],
+ cwd=workspace, capture_output=True,
+ )
+
+ # Linter
+ if linter == "on":
+ subprocess.run(
+ ["npm", "install", "--save-dev", "eslint", "@eslint/js"],
+ cwd=workspace, capture_output=True,
+ )
+
+ # Playwright
+ if playwright == "on":
+ subprocess.run(
+ ["npm", "install", "--save-dev", "@playwright/test"],
+ cwd=workspace, capture_output=True,
+ )
+ subprocess.run(
+ ["npx", "playwright", "install", "chromium", "--with-deps"],
+ cwd=workspace, capture_output=True,
+ )
+
+ # Copy fixtures
+ fixtures_dir = project_dir / "tasks" / task / "fixtures"
+ if fixtures_dir.is_dir():
+ for item in fixtures_dir.iterdir():
+ dest = workspace / item.name
+ if item.is_dir():
+ shutil.copytree(item, dest)
+ else:
+ shutil.copy2(item, dest)
+
+ return workspace
+
+
+def build_prompt(project_dir: Path, cell: dict) -> str:
+ """Read the prompt file and append language instruction."""
+ task = cell["task"]
+ style = cell["prompt_style"]
+ lang_code = cell["human_language"]
+
+ prompt_file = project_dir / "tasks" / task / "prompts" / f"{style}.{lang_code}.md"
+ prompt = prompt_file.read_text()
+
+ language = cell.get("language", "typescript")
+ if language == "typescript":
+ prompt += "\n\nUse TypeScript."
+ elif language == "javascript":
+ prompt += "\n\nUse JavaScript (no TypeScript)."
+
+ return prompt
+
+
+def invoke_claude(cell: dict, workspace: Path, run_dir: Path, project_dir: Path) -> int:
+ """Invoke claude CLI and capture output."""
+ prompt = build_prompt(project_dir, cell)
+ model = cell["model"]
+ effort = cell.get("effort", "high")
+ budget = cell.get("max_budget_usd", 0.50)
+ timeout = cell.get("timeout_seconds", 600)
+ # Build tool list from individual tool axes
+ # Bash is always available - it's the agent's escape hatch
+ tools_list = ["Bash"]
+ if cell.get("tool_read", "on") == "on":
+ tools_list.append("Read")
+ if cell.get("tool_write", "on") == "on":
+ tools_list.append("Write")
+ if cell.get("tool_edit", "on") == "on":
+ tools_list.append("Edit")
+ if cell.get("tool_glob", "on") == "on":
+ tools_list.append("Glob")
+ if cell.get("tool_grep", "on") == "on":
+ tools_list.append("Grep")
+ if cell.get("sub_agents") == "on":
+ tools_list.append("Agent")
+ if cell.get("web_search") == "on":
+ tools_list.extend(["WebSearch", "WebFetch"])
+ tools = ",".join(tools_list)
+
+ # Auth helper for --bare mode
+ auth_helper = str(SCRIPT_DIR / "lib" / "get-oauth-token.sh")
+
+ cmd = [
+ "claude",
+ "--bare",
+ "-p", prompt,
+ "--model", model,
+ "--output-format", "stream-json",
+ "--verbose",
+ "--permission-mode", "dontAsk",
+ "--max-budget-usd", str(budget),
+ "--allowedTools", tools,
+ "--settings", json.dumps({"apiKeyHelper": auth_helper}),
+ ]
+
+ if effort:
+ cmd.extend(["--effort", effort])
+
+ # Context file
+ if cell.get("context_file") == "provided":
+ ctx_file = project_dir / "tasks" / cell["task"] / "context.md"
+ if ctx_file.exists():
+ cmd.extend(["--append-system-prompt", ctx_file.read_text()])
+
+ # Run claude
+ transcript_path = run_dir / "transcript.jsonl"
+ stderr_path = run_dir / "claude_stderr.log"
+
+ with open(transcript_path, "w") as transcript_f, open(stderr_path, "w") as stderr_f:
+ try:
+ result = subprocess.run(
+ cmd,
+ cwd=workspace,
+ stdout=transcript_f,
+ stderr=stderr_f,
+ timeout=timeout,
+ )
+ exit_code = result.returncode
+ except subprocess.TimeoutExpired:
+ exit_code = 124 # Same as timeout(1) convention
+
+ # Extract final result line
+ output_path = run_dir / "claude_output.json"
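+ try:
+ # Skip blank lines so an empty transcript falls back to "{}" rather than ""
+ lines = [ln for ln in transcript_path.read_text().splitlines() if ln.strip()]
+ output_path.write_text(lines[-1] if lines else "{}")
+ except Exception:
+ output_path.write_text("{}")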
+
+ return exit_code
+
+
+def run_eval_script(script: Path, workspace: Path, language: str) -> str:
+ """Run a bash eval script and return its stdout."""
+ try:
+ result = subprocess.run(
+ ["bash", str(script), str(workspace), language],
+ capture_output=True, text=True, timeout=120,
+ )
+ return result.stdout.strip()
+ except Exception as e:
+ return json.dumps({"pass": False, "error": str(e)})
+
+
+def safe_parse_json(text: str, fallback_key: str = "error") -> dict:
+ """Parse JSON, returning an error dict if parsing fails."""
+ if not text:
+ return {"pass": False, "error": "no output"}
+ try:
+ return json.loads(text)
+ except json.JSONDecodeError:
+ return {"pass": False, "error": text[:500]}
+
+
+def evaluate(task_dir: Path, workspace: Path, cell: dict, run_dir: Path):
+ """Run all evaluation scripts and write eval_results.json."""
+ language = cell.get("language", "typescript")
+
+ results = {
+ "structural": None,
+ "functional": None,
+ "quality": None,
+ "score": None,
+ }
+
+ # Structural
+ structural_sh = task_dir / "eval" / "structural.sh"
+ if structural_sh.exists():
+ output = run_eval_script(structural_sh, workspace, language)
+ results["structural"] = safe_parse_json(output)
+
+ # Functional
+ tests_dir = task_dir / "eval" / "tests"
+ if tests_dir.is_dir():
+ # Check for different test types
+ if (tests_dir / "functional.sh").exists():
+ output = run_eval_script(tests_dir / "functional.sh", workspace, language)
+ results["functional"] = safe_parse_json(output)
+ elif (tests_dir / "functional.spec.ts").exists():
+ # Playwright tests are not wired up yet (they need a running server); record a failing result
+ results["functional"] = {"pass": False, "error": "playwright eval not yet wired", "score": 0}
+ elif (tests_dir / "functional.test.ts").exists():
+ # Vitest tests are not wired up yet (they need a running server); record a failing result
+ results["functional"] = {"pass": False, "error": "vitest eval not yet wired", "score": 0}
+
+ # Quality
+ quality_sh = task_dir / "eval" / "quality.sh"
+ if quality_sh.exists():
+ output = run_eval_script(quality_sh, workspace, language)
+ results["quality"] = safe_parse_json(output)
+
+ # Compute weighted score
+ try:
+ scoring_file = task_dir / "scoring.yaml"
+ if scoring_file.exists():
+ import yaml
+ scoring = yaml.safe_load(scoring_file.read_text())
+ weights = scoring.get("weights", {})
+
+ score = 0.0
+ for category, weight in weights.items():
+ cat_data = results.get(category)
+ if cat_data and isinstance(cat_data.get("score"), (int, float)):
+ score += cat_data["score"] * weight
+
+ results["score"] = round(score, 4)
+ except Exception:
+ pass
+
+ (run_dir / "eval_results.json").write_text(json.dumps(results, indent=2))
+
+
+def archive_workspace(workspace: Path, run_dir: Path):
+ """Archive and delete the workspace."""
+ archive_path = run_dir / "workspace.tar.gz"
+ try:
+ with tarfile.open(archive_path, "w:gz") as tar:
+ tar.add(workspace, arcname=workspace.name,
+ filter=lambda t: None if "node_modules" in t.name else t)
+ except Exception:
+ # If archiving fails, just note it
+ pass
+
+ try:
+ shutil.rmtree(workspace)
+ except Exception:
+ pass
+
+
+def main():
+ grid_file = sys.argv[1] if len(sys.argv) > 1 else str(PROJECT_DIR / "grid.yaml")
+ profile = sys.argv[2] if len(sys.argv) > 2 else "smoke"
+ results_dir = PROJECT_DIR / "results"
+ results_dir.mkdir(exist_ok=True)
+ (results_dir / "runs").mkdir(exist_ok=True)
+
+ # Preflight
+ if shutil.which("claude") is None:
+ print("ERROR: claude CLI not found in PATH.")
+ sys.exit(1)
+
+ print("=" * 40)
+ print("Loop Benchmarking Harness")
+ print("=" * 40)
+ print(f"Grid file: {grid_file}")
+ print(f"Profile: {profile}")
+ print(f"Results: {results_dir}")
+ print("=" * 40)
+
+ grid = load_grid(grid_file)
+
+ # Determine cell generation strategy
+ if profile == "main_effects":
+ cells = main_effects_plan(grid)
+ print("Design: main effects sweep")
+ elif profile == "plackett_burman":
+ cells = plackett_burman_plan(grid)
+ print("Design: Plackett-Burman screening")
+ elif profile.startswith("interaction_hunt:"):
+ top_axes = profile.split(":", 1)[1].split(",")
+ cells = interaction_hunt_plan(grid, top_axes)
+ print(f"Design: interaction hunt on {top_axes}")
+ else:
+ cells = compute_cells(grid, profile)
+ print(f"Profile: {profile}")
+
+ print(f"Grid cells: {len(cells)}")
+ print()
+
+ completed = 0
+ skipped = 0
+ failed = 0
+
+ for cell in cells:
+ task = cell["task"]
+ cell_id = cell["cell_id"]
+ runs_per_cell = cell.get("runs_per_cell", 3)
+ model = cell["model"]
+ prompt_style = cell["prompt_style"]
+
+ for run_num in range(1, runs_per_cell + 1):
+ run_id = f"{cell_id}_run{run_num}"
+ run_dir = results_dir / "runs" / run_id
+
+ # Resume support
+ if (run_dir / "eval_results.json").exists():
+ print(f"SKIP: {run_id}")
+ skipped += 1
+ continue
+
+ print("-" * 40)
+ print(f"RUN: {run_id}")
+ print(f"Task: {task} | Model: {model} | Prompt: {prompt_style}")
+ print("-" * 40)
+
+ run_dir.mkdir(parents=True, exist_ok=True)
+
+ # Save meta
+ meta = {
+ **cell,
+ "run_id": run_id,
+ "run_number": run_num,
+ "started_at": datetime.now(timezone.utc).isoformat(),
+ }
+ (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
+
+ # Create workspace
+ print(" Creating workspace...")
+ try:
+ workspace = create_workspace(PROJECT_DIR, task, cell)
+ print(f" Workspace: {workspace}")
+ except Exception as e:
+ print(f" ERROR creating workspace: {e}")
+ failed += 1
+ continue
+
+ # Invoke claude
+ print(f" Invoking claude (model={model})...")
+ start_time = time.time()
+ exit_code = invoke_claude(cell, workspace, run_dir, PROJECT_DIR)
+ wall_time = int(time.time() - start_time)
+
+ if exit_code == 0:
+ print(" Claude completed successfully")
+ else:
+ print(f" Claude exited with error (exit code: {exit_code})")
+
+ # Update meta with timing
+ meta["wall_time_seconds"] = wall_time
+ meta["exit_code"] = exit_code
+ meta["completed_at"] = datetime.now(timezone.utc).isoformat()
+ (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
+
+ # Evaluate
+ print(" Running evaluation...")
+ task_dir = PROJECT_DIR / "tasks" / task
+ evaluate(task_dir, workspace, cell, run_dir)
+ print(" Evaluation complete")
+
+ # Append to index
+ index_entry = {
+ "run_id": run_id,
+ "task": task,
+ "model": model,
+ "cell_id": cell_id,
+ "completed_at": meta["completed_at"],
+ }
+ with open(results_dir / "index.jsonl", "a") as f:
+ f.write(json.dumps(index_entry) + "\n")
+
+ # Archive and cleanup
+ print(" Archiving workspace...")
+ archive_workspace(workspace, run_dir)
+
+ if (run_dir / "eval_results.json").exists():
+ completed += 1
+ else:
+ failed += 1
+
+ print(f" Done. ({completed} completed, {skipped} skipped, {failed} failed)")
+ print()
+
+ print("=" * 40)
+ print("All runs complete.")
+ print(f"Completed: {completed} | Skipped: {skipped} | Failed: {failed}")
+ print("=" * 40)
+
+
+if __name__ == "__main__":
+ main()
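The weighted score computed in `evaluate()` can be illustrated in isolation. This is a minimal sketch, assuming hypothetical weights (the real ones live in each task's `scoring.yaml`, which is not shown here); `weighted_score` is an illustrative name, and it checks for a dict explicitly where the harness relies on truthiness.

```python
def weighted_score(results: dict, weights: dict) -> float:
    """Sum score * weight over categories that reported a numeric score.

    Categories with no result (None) or no numeric "score" contribute zero;
    the remaining weights are not renormalised.
    """
    score = 0.0
    for category, weight in weights.items():
        cat_data = results.get(category)
        if isinstance(cat_data, dict) and isinstance(cat_data.get("score"), (int, float)):
            score += cat_data["score"] * weight
    return round(score, 4)


# Hypothetical weights and results, for illustration only.
weights = {"structural": 0.2, "functional": 0.6, "quality": 0.2}
results = {
    "structural": {"pass": True, "score": 1.0},
    "functional": {"pass": False, "score": 0.5},
    "quality": None,  # quality.sh missing or not run
}
print(weighted_score(results, weights))
```

Because missing categories are counted as zero, a task without a quality script can never reach the maximum score under these weights, which is worth keeping in mind when comparing tasks.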
diff --git a/harness/run.sh b/harness/run.sh
@@ -1,5 +1,7 @@
#!/usr/bin/env bash
-set -euo pipefail
+set -uo pipefail
+# Note: no set -e. The main loop handles errors per-run so one failure
+# doesn't kill the entire harness. Critical setup errors still exit explicitly.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
@@ -17,6 +19,12 @@ GRID_FILE="${1:-$PROJECT_DIR/grid.yaml}"
PROFILE="${2:-smoke}"
RESULTS_DIR="$PROJECT_DIR/results"
+# Preflight: verify the claude CLI is available
+if ! command -v claude > /dev/null 2>&1; then
+ echo "ERROR: claude CLI not found in PATH." >&2
+ exit 1
+fi
+
echo "========================================"
echo "Loop Benchmarking Harness"
echo "========================================"
@@ -36,7 +44,7 @@ completed=0
skipped=0
failed=0
-echo "$cells" | while IFS= read -r cell_json; do
+while IFS= read -r cell_json; do
task=$(echo "$cell_json" | jq -r '.task')
cell_id=$(echo "$cell_json" | jq -r '.cell_id')
runs_per_cell=$(echo "$cell_json" | jq -r '.runs_per_cell')
@@ -58,53 +66,61 @@ echo "$cells" | while IFS= read -r cell_json; do
echo "Task: $task | Model: $model | Prompt: $prompt_style"
echo "----------------------------------------"
- # Create run results directory
+ # Run everything in a subshell so cd's don't affect the main loop
+ (
+ # Create run results directory
+ run_dir="$RESULTS_DIR/runs/$run_id"
+ mkdir -p "$run_dir"
+
+ # Save cell config as meta.json
+ echo "$cell_json" | jq --arg run_id "$run_id" --argjson run_num "$run_num" \
+ '. + {run_id: $run_id, run_number: $run_num, started_at: (now | todate)}' \
+ > "$run_dir/meta.json"
+
+ # Create isolated workspace
+ echo " Creating workspace..."
+ workspace=$(create_workspace "$PROJECT_DIR" "$task" "$cell_json")
+ echo " Workspace: $workspace"
+
+ # Invoke claude
+ echo " Invoking claude (model=$model)..."
+ if invoke_claude "$cell_json" "$workspace" "$run_dir" "$PROJECT_DIR"; then
+ echo " Claude completed successfully"
+ else
+ echo " Claude exited with error (exit code: $?)"
+ fi
+
+ # Run evaluation
+ echo " Running evaluation..."
+ task_dir="$PROJECT_DIR/tasks/$task"
+ evaluate "$task_dir" "$workspace" "$cell_json" "$run_dir"
+ echo " Evaluation complete"
+
+ # Append to run index
+ jq -c '{
+ run_id: .run_id,
+ task: .task,
+ model: .model,
+ cell_id: .cell_id,
+ completed_at: .completed_at
+ }' "$run_dir/meta.json" >> "$RESULTS_DIR/index.jsonl"
+
+ # Archive and cleanup workspace
+ echo " Archiving workspace..."
+ cleanup_workspace "$workspace" "$run_dir"
+ ) || true
+
+ # Count results (outside subshell)
run_dir="$RESULTS_DIR/runs/$run_id"
- mkdir -p "$run_dir"
-
- # Save cell config as meta.json
- echo "$cell_json" | jq --arg run_id "$run_id" --argjson run_num "$run_num" \
- '. + {run_id: $run_id, run_number: $run_num, started_at: (now | todate)}' \
- > "$run_dir/meta.json"
-
- # Create isolated workspace
- echo " Creating workspace..."
- workspace=$(create_workspace "$PROJECT_DIR" "$task" "$cell_json")
- echo " Workspace: $workspace"
-
- # Invoke claude
- echo " Invoking claude (model=$model)..."
- if invoke_claude "$cell_json" "$workspace" "$run_dir" "$PROJECT_DIR"; then
- echo " Claude completed successfully"
+ if [[ -f "$run_dir/eval_results.json" ]]; then
+ completed=$((completed + 1))
else
- echo " Claude exited with error (exit code: $?)"
failed=$((failed + 1))
fi
-
- # Run evaluation
- echo " Running evaluation..."
- task_dir="$PROJECT_DIR/tasks/$task"
- evaluate "$task_dir" "$workspace" "$cell_json" "$run_dir"
- echo " Evaluation complete"
-
- # Append to run index
- jq -c '{
- run_id: .run_id,
- task: .task,
- model: .model,
- cell_id: .cell_id,
- completed_at: .completed_at
- }' "$run_dir/meta.json" >> "$RESULTS_DIR/index.jsonl"
-
- # Archive and cleanup workspace
- echo " Archiving workspace..."
- cleanup_workspace "$workspace" "$run_dir"
-
- completed=$((completed + 1))
echo " Done. ($completed completed, $skipped skipped, $failed failed)"
echo ""
done
-done
+done <<< "$cells"
echo "========================================"
echo "All runs complete."
diff --git a/results/index.jsonl b/results/index.jsonl
@@ -0,0 +1,6 @@
+{"run_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1","task":"tetris","model":"haiku","cell_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:34:51Z"}
+{"run_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1","task":"tetris","model":"haiku","cell_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:39:20Z"}
+{"run_id":"bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1","task":"bookmarks-api","model":"haiku","cell_id":"bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:48:50Z"}
+{"run_id": "bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1", "task": "bookmarks-api", "model": "haiku", "cell_id": "bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:14:47.247624+00:00"}
+{"run_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1", "task": "data-pipeline", "model": "haiku", "cell_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:19:53.900594+00:00"}
+{"run_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1", "task": "data-pipeline", "model": "haiku", "cell_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:21:33.913511+00:00"}
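The index entries above carry `completed_at` in two shapes: a `Z` suffix in the first three entries, and the `+00:00` offset form with microseconds that `datetime.now(timezone.utc).isoformat()` in `run.py` produces. A reader that sorts or compares runs may want to normalise both. The sketch below is one way to do that; `parse_completed_at` and `load_index` are illustrative names, not part of the harness.

```python
import json
from datetime import datetime, timezone


def parse_completed_at(value: str) -> datetime:
    """Parse either timestamp shape into a timezone-aware datetime."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def load_index(lines):
    """Parse JSONL index lines and return entries sorted by completion time."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the index
        entry = json.loads(line)
        entry["completed_at"] = parse_completed_at(entry["completed_at"])
        entries.append(entry)
    return sorted(entries, key=lambda e: e["completed_at"])


# Shortened, hypothetical run_ids; the timestamps are copied from the index.
sample = [
    '{"run_id": "later", "completed_at": "2026-04-03T16:14:47.247624+00:00"}',
    '{"run_id": "earlier", "completed_at": "2026-04-03T15:34:51Z"}',
]
for entry in load_index(sample):
    print(entry["run_id"], entry["completed_at"].astimezone(timezone.utc).isoformat())
```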
diff --git a/tasks/bookmarks-api/eval/quality.sh b/tasks/bookmarks-api/eval/quality.sh
@@ -97,7 +97,7 @@ for key in lint typecheck security_passwords security_jwt_secret security_sql; d
done
if [[ $score_count -gt 0 ]]; then
- score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
else
score="0"
fi
diff --git a/tasks/bookmarks-api/eval/structural.sh b/tasks/bookmarks-api/eval/structural.sh
@@ -100,7 +100,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
checks_json="[${checks_json%,}]"
if [[ $total_count -gt 0 ]]; then
- score=$(echo "scale=2; $pass_count / $total_count" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
else
score="0"
fi
diff --git a/tasks/data-pipeline/eval/quality.sh b/tasks/data-pipeline/eval/quality.sh
@@ -126,7 +126,7 @@ for key in lint typecheck no_float_currency handles_empty_input handles_malforme
done
if [[ $score_count -gt 0 ]]; then
- score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
else
score="0"
fi
diff --git a/tasks/data-pipeline/eval/structural.sh b/tasks/data-pipeline/eval/structural.sh
@@ -91,7 +91,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
checks_json="[${checks_json%,}]"
if [[ $total_count -gt 0 ]]; then
- score=$(echo "scale=2; $pass_count / $total_count" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
else
score="0"
fi
diff --git a/tasks/data-pipeline/eval/tests/functional.sh b/tasks/data-pipeline/eval/tests/functional.sh
@@ -73,7 +73,7 @@ if ! echo "$actual_output" | jq . > /dev/null 2>&1; then
add_check "valid_json" "false" "output is not valid JSON"
checks_json=$(printf '%s,' "${checks[@]}")
checks_json="[${checks_json%,}]"
- echo "{\"pass\": false, \"checks\": $checks_json, \"score\": $(echo "scale=2; $pass_count / $total_count" | bc)}"
+ echo "{\"pass\": false, \"checks\": $checks_json, \"score\": $(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")}"
exit 0
fi
@@ -166,7 +166,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
checks_json="[${checks_json%,}]"
if [[ $total_count -gt 0 ]]; then
- score=$(echo "scale=2; $pass_count / $total_count" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
else
score="0"
fi
diff --git a/tasks/tetris/eval/quality.sh b/tasks/tetris/eval/quality.sh
@@ -13,10 +13,8 @@ results='{}'
# --- Lint check ---
cd "$WORKSPACE"
if command -v npx > /dev/null 2>&1; then
- # Install eslint if not present
npm install --save-dev eslint @eslint/js > /dev/null 2>&1
- # Find source files
if [[ "$LANGUAGE" == "typescript" ]]; then
extensions="ts,tsx"
else
@@ -29,12 +27,15 @@ if command -v npx > /dev/null 2>&1; then
if echo "$lint_output" | jq . > /dev/null 2>&1; then
errors=$(echo "$lint_output" | jq '[.[].errorCount] | add // 0')
warnings=$(echo "$lint_output" | jq '[.[].warningCount] | add // 0')
- lint_pass="true"
+ errors=${errors:-0}
+ warnings=${warnings:-0}
if [[ "$errors" -gt 0 ]]; then
- lint_pass="false"
+ results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" \
+ '. + {lint: {pass: false, errors: $e, warnings: $w}}')
+ else
+ results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" \
+ '. + {lint: {pass: true, errors: $e, warnings: $w}}')
fi
- results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" --argjson p "$lint_pass" \
- '. + {lint: {pass: ($p == true), errors: $e, warnings: $w}}')
else
results=$(echo "$results" | jq '. + {lint: {pass: false, errors: -1, warnings: 0, error: "eslint failed to run"}}')
fi
@@ -49,7 +50,8 @@ if [[ "$LANGUAGE" == "typescript" ]]; then
if npx tsc --noEmit > /dev/null 2>&1; then
results=$(echo "$results" | jq '. + {typecheck: {pass: true}}')
else
- type_errors=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || echo "0")
+ type_errors=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || true)
+ type_errors=${type_errors:-0}
results=$(echo "$results" | jq --argjson e "$type_errors" '. + {typecheck: {pass: false, errors: $e}}')
fi
else
@@ -60,22 +62,23 @@ else
fi
# --- File size check ---
-# Find the main HTML file and measure total size
total_size=0
if [[ -d "$WORKSPACE/dist" ]]; then
total_size=$(du -sb "$WORKSPACE/dist" 2>/dev/null | awk '{print $1}')
elif [[ -f "$WORKSPACE/index.html" ]]; then
total_size=$(du -sb "$WORKSPACE" --exclude=node_modules --exclude=.git 2>/dev/null | awk '{print $1}')
fi
-size_pass="true"
-if [[ "$total_size" -gt 524288 ]]; then # 512KB
- size_pass="false"
+total_size=${total_size:-0}
+
+if [[ "$total_size" -gt 524288 ]]; then
+ results=$(echo "$results" | jq --argjson s "$total_size" \
+ '. + {performance: {bundle_size_bytes: $s, size_under_512kb: false}}')
+else
+ results=$(echo "$results" | jq --argjson s "$total_size" \
+ '. + {performance: {bundle_size_bytes: $s, size_under_512kb: true}}')
fi
-results=$(echo "$results" | jq --argjson s "$total_size" --argjson p "$size_pass" \
- '. + {performance: {bundle_size_bytes: $s, size_under_512kb: ($p == true)}}')
# --- Compute aggregate quality score ---
-# Each check contributes equally
score_sum=0
score_count=0
@@ -88,7 +91,7 @@ for key in lint typecheck performance; do
done
if [[ $score_count -gt 0 ]]; then
- score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
else
score="0"
fi
diff --git a/tasks/tetris/eval/structural.sh b/tasks/tetris/eval/structural.sh
@@ -80,7 +80,7 @@ checks_json="[${checks_json%,}]"
# Compute score
if [[ $total_count -gt 0 ]]; then
- score=$(echo "scale=2; $pass_count / $total_count" | bc)
+ score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
else
score="0"
fi