loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit f188f40361a8a1dc7600e2c625ff045d29da3d2b
parent 147931383ebcb9584b54dd141a05bac520b2c3b5
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri,  3 Apr 2026 19:12:22 +0200

Fix harness bugs, add DOE experiment design, insights dashboard

Harness fixes:
- Replace --dangerously-skip-permissions with --permission-mode dontAsk
- Add --verbose (required for stream-json output)
- Use OAuth token extraction for --bare mode (get-oauth-token.sh)
- Rewrite orchestrator in Python (run.py) to avoid bash subshell issues
- Fix eval scripts: replace bc with awk, handle empty/invalid JSON
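
The empty/invalid JSON handling can be pictured with a short Python sketch. This is an illustration only: the harness itself does this in bash with jq (the `merge_result` helper in evaluate.sh), and the Python signature, the `max_error_len` parameter name, and the whitespace check are assumptions made for the example.

```python
import json


def merge_result(eval_results, key, output, max_error_len=500):
    """Return a copy of eval_results with one eval script's output merged in.

    Hypothetical Python rendering of the bash helper; not part of the repo.
    """
    merged = dict(eval_results)

    # Empty output (here also whitespace-only, slightly stricter than `-z`)
    # is recorded as a failure instead of being handed to a JSON parser.
    if not output.strip():
        merged[key] = {"pass": False, "error": "no output"}
        return merged

    try:
        # Valid JSON is stored as-is under the category key.
        merged[key] = json.loads(output)
    except json.JSONDecodeError:
        # Non-JSON output becomes a failure, truncated to keep it small.
        merged[key] = {"pass": False, "error": output[:max_error_len]}
    return merged
```

Routing all three categories (structural, functional, quality) through one helper means a crashing eval script yields a failed category rather than a corrupted results file.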

New features:
- 5 individual tool axes (tool_read/write/edit/glob/grep) replace base_tools
- DOE experiment design module (main effects sweep, Plackett-Burman
  screening, interaction hunt) for efficient grid exploration
- Analysis functions to compute effect sizes and interactions from results
- Insights dashboard page with tornado charts and interaction heatmaps
- Metric switcher (score, cost, turns, wall time)
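
The effect-size and interaction calculations can be sketched as follows. This is a minimal Python illustration modeled on the arithmetic in dashboard/src/lib/analysis.ts (effect = level mean minus grand mean, spread = max minus min of level means, interaction = largest deviation from an additive model). The flat run dicts and the function names `main_effects` and `max_interaction` are assumptions for the example, not the repo's API.

```python
from collections import defaultdict
from statistics import mean


def main_effects(runs, axis_names, metric="score"):
    """Rank axes by spread. Each run is a dict of axis values plus a metric."""
    grand_mean = mean([r[metric] for r in runs])
    ranked = []
    for axis in axis_names:
        groups = defaultdict(list)
        for r in runs:
            groups[r[axis]].append(r[metric])
        if len(groups) < 2:
            continue  # an axis that never varied has no measurable effect
        level_means = {lvl: mean(vals) for lvl, vals in groups.items()}
        ranked.append({
            "axis": axis,
            "spread": max(level_means.values()) - min(level_means.values()),
            "effects": {lvl: m - grand_mean for lvl, m in level_means.items()},
        })
    return sorted(ranked, key=lambda e: e["spread"], reverse=True)


def max_interaction(runs, axis_a, axis_b, metric="score"):
    """Largest |cell mean - (row mean + column mean - grand mean)|."""
    cells = defaultdict(list)
    for r in runs:
        cells[(r[axis_a], r[axis_b])].append(r[metric])
    cell_means = {k: mean(v) for k, v in cells.items()}
    grand = mean(list(cell_means.values()))

    rows, cols = defaultdict(list), defaultdict(list)
    for (a, b), m in cell_means.items():
        rows[a].append(m)
        cols[b].append(m)
    row_mean = {a: mean(v) for a, v in rows.items()}
    col_mean = {b: mean(v) for b, v in cols.items()}

    return max(
        abs(m - (row_mean[a] + col_mean[b] - grand))
        for (a, b), m in cell_means.items()
    )
```

A value near zero from `max_interaction` means the two axes combine additively; a larger value means the effect of one axis depends on the level of the other, which is what the heatmap is meant to surface.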

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
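
The Plackett-Burman screening named above builds its design by cyclically shifting a generator row and appending a row of all -1, as `_hadamard_matrix` in experiment_design.py does. The sketch below reproduces that construction for the 8-run generator and checks the property that makes the design useful: every pair of factor columns is orthogonal. The names `pb_design` and `is_orthogonal` are hypothetical helpers for this example.

```python
def pb_design(generator):
    """Build a Plackett-Burman design from its first row.

    Rows are left cyclic shifts of the generator, plus a final all -1 row,
    giving (k + 1) runs for k two-level factors.
    """
    k = len(generator)
    rows = [generator[i:] + generator[:i] for i in range(k)]
    rows.append([-1] * k)
    return rows


def is_orthogonal(design):
    """True if every pair of columns has a zero dot product."""
    k = len(design[0])
    return all(
        sum(row[i] * row[j] for row in design) == 0
        for i in range(k)
        for j in range(i + 1, k)
    )


# Standard 8-run generator, the same one listed in the diff.
PB8 = [1, 1, 1, -1, 1, -1, -1]
design = pb_design(PB8)

# Each factor is "on" in half of the runs and "off" in the other half.
column_sums = [sum(row[j] for row in design) for j in range(len(PB8))]
```

Orthogonal, balanced columns are why 8 runs can estimate 7 main effects independently, at the cost of confounding them with two-factor interactions; the interaction hunt exists to follow up on that.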

Diffstat:
M .gitignore | 1+
M CLAUDE.md | 2+-
M README.md | 123+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
A dashboard/src/components/Heatmap.tsx | 164+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A dashboard/src/components/Insights.tsx | 109+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A dashboard/src/components/TornadoChart.tsx | 168+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M dashboard/src/layouts/Base.astro | 1+
A dashboard/src/lib/analysis.ts | 180+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A dashboard/src/pages/insights.astro | 17+++++++++++++++++
M grid.yaml | 23+++++++++++++++++++++--
M harness/lib/compute_grid.py | 1-
M harness/lib/evaluate.sh | 41++++++++++++++++++++++++-----------------
A harness/lib/experiment_design.py | 582++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A harness/lib/get-oauth-token.sh | 13+++++++++++++
M harness/lib/invoke.sh | 24+++++++++++++++++++++---
A harness/run.py | 424+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M harness/run.sh | 100++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
A results/index.jsonl | 6++++++
M tasks/bookmarks-api/eval/quality.sh | 2+-
M tasks/bookmarks-api/eval/structural.sh | 2+-
M tasks/data-pipeline/eval/quality.sh | 2+-
M tasks/data-pipeline/eval/structural.sh | 2+-
M tasks/data-pipeline/eval/tests/functional.sh | 4++--
M tasks/tetris/eval/quality.sh | 33++++++++++++++++++---------------
M tasks/tetris/eval/structural.sh | 2+-
25 files changed, 1936 insertions(+), 90 deletions(-)

diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,4 @@ dist/
 .astro/
 results/runs/
 *.tar.gz
+__pycache__/
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -61,7 +61,7 @@ Static site showing:
 
 ## Tech
 
-- **Harness runner**: Bash script that orchestrates experiment runs. Claude Code cannot invoke itself, so the harness must be external. The script reads YAML definitions, computes the grid, and invokes `claude` CLI for each cell.
+- **Harness runner**: Bash script that orchestrates experiment runs. Claude Code cannot invoke itself, so the harness must be external. The script reads YAML definitions, computes the grid, and invokes `claude` CLI for each cell. Uses `--permission-mode dontAsk` with `--allowedTools` for non-interactive runs (not `--dangerously-skip-permissions`, which is blocked as root).
 - **Model support**: Primarily Anthropic models (Haiku, Sonnet, Opus). Non-Anthropic models possible via LiteLLM proxy in front of Ollama or similar, but expect reduced feature support (extended thinking, tool use may not work). This is valid benchmark data.
 - Results stored as YAML/JSON (append-only, never overwritten)
 - Each run gets a unique ID
diff --git a/README.md b/README.md
@@ -1,5 +1,124 @@
 # Loop Benchmarking
 
-Agentic loop configuration benchmark for Ship the Loop.
+An open benchmark for comparing agentic coding loop configurations. Same task, different setups, all data public.
 
-Status: bootstrapping. See CLAUDE.md for full context.
+## What this does
+
+Define the variables that make up a coding loop (model, tools, prompt style, etc.), and the system generates every permutation. Each is run against a set of tasks in a clean-room environment with deterministic evaluation. No LLM grading.
+
+## Quick start
+
+### Prerequisites
+
+- Node.js 22+
+- Python 3.12+ with PyYAML
+- Claude Code CLI (authenticated via `claude login`)
+
+### Running experiments
+
+```bash
+# 1. Screen: which variables matter? (~53 cells, vary one axis at a time)
+python3 harness/run.py grid.yaml main_effects
+
+# 2. Analyze: rank variables by effect size
+python3 harness/lib/experiment_design.py analyze results main_effects score
+
+# 3. Deep dive: full factorial on the top variables that matter
+python3 harness/run.py grid.yaml "interaction_hunt:model,effort,tool_write"
+
+# 4. Check for interactions between variables
+python3 harness/lib/experiment_design.py analyze results interactions model effort score
+```
+
+### Other run modes
+
+```bash
+# Profile-based (predefined subsets of the grid)
+python3 harness/run.py grid.yaml smoke  # 6 cells, 1 run each
+python3 harness/run.py grid.yaml core   # 30 cells, 3 runs each
+python3 harness/run.py grid.yaml full   # 204,800 cells (don't)
+
+# Plackett-Burman screening (efficient multi-factor screening)
+python3 harness/run.py grid.yaml plackett_burman
+```
+
+### Building the dashboard
+
+```bash
+cd dashboard
+npm install
+npm run build  # Static site in dashboard/dist/
+npm run dev    # Dev server for local preview
+```
+
+## Project structure
+
+```
+grid.yaml                    # Experiment grid: axes, values, exclusions, profiles
+harness/
+  run.py                     # Main orchestrator (Python)
+  lib/
+    compute_grid.py          # Cartesian product + exclusions
+    experiment_design.py     # DOE plans + analysis (main effects, PB, interactions)
+    get-oauth-token.sh       # Extracts OAuth token for --bare mode
+    invoke.sh                # Claude CLI invocation (bash, used by run.sh)
+    evaluate.sh              # Evaluation dispatch (bash, used by run.sh)
+    workspace.sh             # Workspace creation (bash, used by run.sh)
+tasks/
+  tetris/                    # Agent-friendly: build a game
+  bookmarks-api/             # Medium: REST API with auth
+  data-pipeline/             # Hard: CSV processing with edge cases
+  Each task has:
+    prompts/                 # simple/detailed x en/es
+    eval/                    # Deterministic test suites the agent never sees
+    context.md               # Rules file (used when context_file=provided)
+    scoring.yaml             # Category weights
+results/
+  runs/{run_id}/             # One directory per experiment run
+    meta.json                # Config, timing, exit code
+    transcript.jsonl         # Full conversation (every tool call and response)
+    claude_output.json       # Summary metrics (cost, turns, tokens)
+    eval_results.json        # Structural, functional, quality scores
+    workspace.tar.gz         # Archived agent output
+dashboard/                   # Astro + React static site
+  Grid overview, insights (tornado charts, heatmaps), run detail with transcript viewer
+```
+
+## Configuration dimensions (16 axes)
+
+| Axis | Values |
+|---|---|
+| model | haiku, sonnet, opus |
+| effort | high, max (extended thinking) |
+| prompt_style | simple, detailed |
+| language | typescript, javascript |
+| human_language | en, es |
+| tool_read | on, off |
+| tool_write | on, off |
+| tool_edit | on, off |
+| tool_glob | on, off |
+| tool_grep | on, off |
+| linter | on, off |
+| playwright | on, off |
+| context_file | none, provided |
+| sub_agents | on, off |
+| web_search | on, off |
+| max_budget | low ($0.50), high ($5.00) |
+
+## Evaluation
+
+All scoring is deterministic code. The agent never sees the test suite.
+
+- **Structural**: Does it build? Do expected files exist?
+- **Functional**: Pre-written test suites (Playwright, vitest, golden file diff)
+- **Quality**: Lint, type check, accessibility, security, performance
+
+## Experiment design
+
+Instead of running the full 204,800-cell grid, use statistical designs:
+
+- **Main effects sweep**: Vary one axis at a time from a baseline. Identifies which variables matter.
+- **Plackett-Burman**: Screening design that tests many binary factors efficiently.
+- **Interaction hunt**: Full factorial on a small subset of axes to find interactions.
+
+The dashboard's Insights page visualizes main effects as tornado charts and interactions as heatmaps.
diff --git a/dashboard/src/components/Heatmap.tsx b/dashboard/src/components/Heatmap.tsx @@ -0,0 +1,164 @@ +import type { InteractionResult } from "../lib/analysis"; + +interface HeatmapProps { + data: InteractionResult; + metric: string; +} + +export default function Heatmap({ data, metric }: HeatmapProps) { + const { axisA, axisB, table } = data; + + const aValues = Object.keys(table).sort(); + const bValues = Array.from( + new Set(aValues.flatMap((a) => Object.keys(table[a]))) + ).sort(); + + if (aValues.length === 0 || bValues.length === 0) { + return ( + <div + className="card" + style={{ + textAlign: "center", + padding: "40px", + color: "var(--text-muted)", + }} + > + Not enough data for this interaction. + </div> + ); + } + + // Find min/max for color scale + const allMeans = aValues.flatMap((a) => + bValues.filter((b) => table[a]?.[b]).map((b) => table[a][b].mean) + ); + const minVal = Math.min(...allMeans); + const maxVal = Math.max(...allMeans); + const range = maxVal - minVal || 1; + + function cellColor(value: number): string { + const ratio = (value - minVal) / range; + if (ratio > 0.66) + return `rgba(34, 197, 94, ${0.3 + ratio * 0.5})`; + if (ratio > 0.33) + return `rgba(234, 179, 8, ${0.3 + ratio * 0.4})`; + return `rgba(239, 68, 68, ${0.3 + (1 - ratio) * 0.4})`; + } + + return ( + <div className="card"> + <h3 style={{ marginBottom: "4px" }}> + {axisA} x {axisB} + </h3> + <p + style={{ + color: "var(--text-muted)", + fontSize: "0.75rem", + marginBottom: "16px", + }} + > + Mean {metric} for each combination. Interaction strength:{" "} + <span + style={{ + fontFamily: "var(--font-mono)", + color: + data.maxInteraction > 0.05 + ? 
"var(--yellow)" + : "var(--text-muted)", + }} + > + {(data.maxInteraction * 100).toFixed(1)}% + </span> + </p> + + <div style={{ overflowX: "auto" }}> + <table style={{ borderCollapse: "collapse" }}> + <thead> + <tr> + <th + style={{ + padding: "8px 12px", + fontSize: "0.7rem", + textAlign: "center", + }} + > + {axisA} \ {axisB} + </th> + {bValues.map((b) => ( + <th + key={b} + style={{ + padding: "8px 12px", + fontSize: "0.75rem", + textAlign: "center", + fontFamily: "var(--font-mono)", + }} + > + {b} + </th> + ))} + </tr> + </thead> + <tbody> + {aValues.map((a) => ( + <tr key={a}> + <td + style={{ + padding: "8px 12px", + fontSize: "0.75rem", + fontFamily: "var(--font-mono)", + fontWeight: 600, + }} + > + {a} + </td> + {bValues.map((b) => { + const cell = table[a]?.[b]; + if (!cell) { + return ( + <td + key={b} + style={{ + padding: "8px 12px", + textAlign: "center", + color: "var(--text-muted)", + }} + > + - + </td> + ); + } + return ( + <td + key={b} + style={{ + padding: "8px 12px", + textAlign: "center", + background: cellColor(cell.mean), + fontFamily: "var(--font-mono)", + fontSize: "0.8rem", + fontWeight: 600, + borderRadius: "2px", + }} + > + {(cell.mean * 100).toFixed(0)}% + <div + style={{ + fontSize: "0.6rem", + fontWeight: 400, + color: "var(--text-muted)", + }} + > + n={cell.n} + </div> + </td> + ); + })} + </tr> + ))} + </tbody> + </table> + </div> + </div> + ); +} diff --git a/dashboard/src/components/Insights.tsx b/dashboard/src/components/Insights.tsx @@ -0,0 +1,109 @@ +import { useState, useMemo } from "react"; +import type { Run } from "../lib/data"; +import { computeMainEffects, computeInteraction } from "../lib/analysis"; +import TornadoChart from "./TornadoChart"; +import Heatmap from "./Heatmap"; + +interface InsightsProps { + runs: Run[]; +} + +const METRICS = [ + { key: "score", label: "Score" }, + { key: "cost", label: "Cost" }, + { key: "turns", label: "Turns" }, + { key: "wall_time", label: "Wall Time" }, +]; + +export default function 
Insights({ runs }: InsightsProps) { + const [metric, setMetric] = useState("score"); + const [axisA, setAxisA] = useState(""); + const [axisB, setAxisB] = useState(""); + + const effects = useMemo( + () => computeMainEffects(runs, metric), + [runs, metric] + ); + + // Auto-pick top 2 axes for interaction if not selected + const topAxes = useMemo(() => effects.slice(0, 6).map((e) => e.axis), [effects]); + + const interaction = useMemo(() => { + const a = axisA || topAxes[0] || ""; + const b = axisB || topAxes[1] || ""; + if (!a || !b || a === b) return null; + return computeInteraction(runs, a, b, metric); + }, [runs, axisA, axisB, metric, topAxes]); + + return ( + <div style={{ display: "flex", flexDirection: "column", gap: "24px" }}> + {/* Metric selector */} + <div style={{ display: "flex", gap: "8px", alignItems: "center" }}> + <span style={{ fontSize: "0.8rem", color: "var(--text-muted)" }}> + Metric: + </span> + {METRICS.map((m) => ( + <button + key={m.key} + onClick={() => setMetric(m.key)} + style={{ + padding: "4px 12px", + borderRadius: "4px", + border: + metric === m.key + ? "1px solid var(--accent)" + : "1px solid var(--border)", + background: + metric === m.key ? "rgba(99, 102, 241, 0.15)" : "transparent", + color: metric === m.key ? 
"var(--accent)" : "var(--text-muted)", + cursor: "pointer", + fontSize: "0.8rem", + }} + > + {m.label} + </button> + ))} + </div> + + {/* Tornado chart */} + <TornadoChart effects={effects} metric={metric} /> + + {/* Interaction explorer */} + <div className="card"> + <h3 style={{ marginBottom: "12px" }}>Interaction Explorer</h3> + <div style={{ display: "flex", gap: "12px", marginBottom: "16px" }}> + <div className="filter-group"> + <label>Axis A</label> + <select + value={axisA || topAxes[0] || ""} + onChange={(e) => setAxisA(e.target.value)} + > + {topAxes.map((a) => ( + <option key={a} value={a}> + {a} + </option> + ))} + </select> + </div> + <div className="filter-group"> + <label>Axis B</label> + <select + value={axisB || topAxes[1] || ""} + onChange={(e) => setAxisB(e.target.value)} + > + {topAxes + .filter((a) => a !== (axisA || topAxes[0])) + .map((a) => ( + <option key={a} value={a}> + {a} + </option> + ))} + </select> + </div> + </div> + + {interaction && <Heatmap data={interaction} metric={metric} />} + </div> + </div> + ); +} diff --git a/dashboard/src/components/TornadoChart.tsx b/dashboard/src/components/TornadoChart.tsx @@ -0,0 +1,168 @@ +import type { AxisEffect } from "../lib/analysis"; + +interface TornadoChartProps { + effects: AxisEffect[]; + metric: string; +} + +const AXIS_LABELS: Record<string, string> = { + model: "Model", + effort: "Effort", + prompt_style: "Prompt Style", + language: "Language", + human_language: "Human Language", + tool_read: "Read Tool", + tool_write: "Write Tool", + tool_edit: "Edit Tool", + tool_glob: "Glob Tool", + tool_grep: "Grep Tool", + linter: "Linter", + playwright: "Playwright", + context_file: "Context File", + sub_agents: "Sub-agents", + web_search: "Web Search", + max_budget: "Budget", +}; + +export default function TornadoChart({ effects, metric }: TornadoChartProps) { + if (effects.length === 0) { + return ( + <div + className="card" + style={{ + textAlign: "center", + padding: "40px", + color: 
"var(--text-muted)", + }} + > + Not enough data to compute effects. Run more experiments with varying + configurations. + </div> + ); + } + + const maxSpread = Math.max(...effects.map((e) => e.spread)); + const scale = maxSpread > 0 ? 200 / maxSpread : 1; // max bar width = 200px + + return ( + <div className="card"> + <h3 style={{ marginBottom: "4px" }}>Variable Impact on {metric}</h3> + <p + style={{ + color: "var(--text-muted)", + fontSize: "0.75rem", + marginBottom: "16px", + }} + > + Sorted by effect size. Wider bars = bigger impact on outcomes. + </p> + + {effects.map((effect) => ( + <div + key={effect.axis} + style={{ + display: "flex", + alignItems: "center", + marginBottom: "12px", + gap: "12px", + }} + > + {/* Label */} + <div + style={{ + width: "120px", + textAlign: "right", + fontSize: "0.8rem", + flexShrink: 0, + }} + > + {AXIS_LABELS[effect.axis] || effect.axis} + </div> + + {/* Bars */} + <div + style={{ + flex: 1, + display: "flex", + flexDirection: "column", + gap: "2px", + }} + > + {effect.values.map((entry) => { + const width = Math.abs(entry.effect) * scale; + const isPositive = entry.effect >= 0; + return ( + <div + key={entry.value} + style={{ + display: "flex", + alignItems: "center", + gap: "8px", + }} + > + <div + style={{ + width: "50px", + textAlign: "right", + fontSize: "0.7rem", + fontFamily: "var(--font-mono)", + color: "var(--text-muted)", + flexShrink: 0, + }} + > + {entry.value} + </div> + <div + style={{ + height: "16px", + width: `${Math.max(width, 2)}px`, + background: isPositive + ? "var(--green)" + : "var(--red)", + borderRadius: "2px", + opacity: 0.8, + }} + /> + <div + style={{ + fontSize: "0.7rem", + fontFamily: "var(--font-mono)", + color: isPositive + ? "var(--green)" + : "var(--red)", + }} + > + {entry.effect >= 0 ? 
"+" : ""} + {(entry.effect * 100).toFixed(1)}% + </div> + <div + style={{ + fontSize: "0.65rem", + color: "var(--text-muted)", + }} + > + (n={entry.n}) + </div> + </div> + ); + })} + </div> + + {/* Spread */} + <div + style={{ + width: "60px", + textAlign: "right", + fontSize: "0.75rem", + fontFamily: "var(--font-mono)", + color: "var(--accent)", + flexShrink: 0, + }} + > + {(effect.spread * 100).toFixed(1)}% + </div> + </div> + ))} + </div> + ); +} diff --git a/dashboard/src/layouts/Base.astro b/dashboard/src/layouts/Base.astro @@ -23,6 +23,7 @@ const { title } = Astro.props; </a> <nav style="display: flex; gap: 16px; font-size: 0.875rem;"> <a href="/">Grid</a> + <a href="/insights">Insights</a> <a href="/compare">Compare</a> </nav> </div> diff --git a/dashboard/src/lib/analysis.ts b/dashboard/src/lib/analysis.ts @@ -0,0 +1,180 @@ +import type { Run, AxisName, AXIS_NAMES } from "./data"; + +export interface EffectEntry { + value: string; + mean: number; + effect: number; + n: number; +} + +export interface AxisEffect { + axis: string; + spread: number; + values: EffectEntry[]; +} + +export interface InteractionCell { + mean: number; + n: number; +} + +export interface InteractionResult { + axisA: string; + axisB: string; + table: Record<string, Record<string, InteractionCell>>; + maxInteraction: number; +} + +const SKIP_KEYS = new Set([ + "task", + "cell_id", + "run_id", + "run_number", + "runs_per_cell", + "max_budget_usd", + "timeout_seconds", + "base_tools", + "started_at", + "completed_at", + "wall_time_seconds", + "exit_code", +]); + +type MetricExtractor = (run: Run) => number | null; + +const METRICS: Record<string, MetricExtractor> = { + score: (r) => r.eval_results?.score ?? null, + cost: (r) => r.claude_output?.total_cost_usd ?? null, + turns: (r) => r.claude_output?.num_turns ?? null, + wall_time: (r) => r.meta.wall_time_seconds ?? 
null, +}; + +export function computeMainEffects( + runs: Run[], + metric: string = "score" +): AxisEffect[] { + const extract = METRICS[metric]; + if (!extract) return []; + + const scored: Array<{ meta: Run["meta"]; value: number }> = []; + for (const run of runs) { + const val = extract(run); + if (val !== null) scored.push({ meta: run.meta, value: val }); + } + if (scored.length === 0) return []; + + const grandMean = scored.reduce((s, r) => s + r.value, 0) / scored.length; + + // Find axis keys from meta + const axisKeys = Object.keys(scored[0].meta).filter( + (k) => !SKIP_KEYS.has(k) + ); + + const effects: AxisEffect[] = []; + + for (const axis of axisKeys) { + const groups: Record<string, number[]> = {}; + for (const { meta, value } of scored) { + const key = String((meta as Record<string, unknown>)[axis] ?? "unknown"); + (groups[key] ??= []).push(value); + } + + if (Object.keys(groups).length < 2) continue; + + const values: EffectEntry[] = []; + for (const [val, vals] of Object.entries(groups)) { + const mean = vals.reduce((a, b) => a + b, 0) / vals.length; + values.push({ + value: val, + mean: Math.round(mean * 10000) / 10000, + effect: Math.round((mean - grandMean) * 10000) / 10000, + n: vals.length, + }); + } + + const means = values.map((v) => v.mean); + const spread = Math.max(...means) - Math.min(...means); + + effects.push({ + axis, + spread: Math.round(spread * 10000) / 10000, + values: values.sort((a, b) => b.effect - a.effect), + }); + } + + return effects.sort((a, b) => b.spread - a.spread); +} + +export function computeInteraction( + runs: Run[], + axisA: string, + axisB: string, + metric: string = "score" +): InteractionResult { + const extract = METRICS[metric]; + if (!extract) + return { axisA, axisB, table: {}, maxInteraction: 0 }; + + const groups: Record<string, Record<string, number[]>> = {}; + + for (const run of runs) { + const val = extract(run); + if (val === null) continue; + const a = String((run.meta as Record<string, 
unknown>)[axisA] ?? "?"); + const b = String((run.meta as Record<string, unknown>)[axisB] ?? "?"); + ((groups[a] ??= {})[b] ??= []).push(val); + } + + const table: Record<string, Record<string, InteractionCell>> = {}; + const allVals: number[] = []; + + for (const [a, bGroups] of Object.entries(groups)) { + table[a] = {}; + for (const [b, vals] of Object.entries(bGroups)) { + const mean = vals.reduce((s, v) => s + v, 0) / vals.length; + table[a][b] = { mean: Math.round(mean * 10000) / 10000, n: vals.length }; + allVals.push(mean); + } + } + + const grandMean = + allVals.length > 0 + ? allVals.reduce((a, b) => a + b, 0) / allVals.length + : 0; + + // Row and column means + const aMeans: Record<string, number> = {}; + const bMeans: Record<string, number> = {}; + const bKeys = new Set<string>(); + + for (const [a, bGroups] of Object.entries(table)) { + const vals = Object.values(bGroups).map((c) => c.mean); + aMeans[a] = vals.reduce((s, v) => s + v, 0) / vals.length; + for (const b of Object.keys(bGroups)) bKeys.add(b); + } + + for (const b of bKeys) { + const vals: number[] = []; + for (const a of Object.keys(table)) { + if (table[a][b]) vals.push(table[a][b].mean); + } + bMeans[b] = vals.length > 0 ? 
vals.reduce((s, v) => s + v, 0) / vals.length : grandMean; + } + + // Max interaction = max deviation from additive model + let maxInteraction = 0; + for (const a of Object.keys(table)) { + for (const b of Object.keys(table[a])) { + const expected = aMeans[a] + bMeans[b] - grandMean; + const actual = table[a][b].mean; + maxInteraction = Math.max(maxInteraction, Math.abs(actual - expected)); + } + } + + return { + axisA, + axisB, + table, + maxInteraction: Math.round(maxInteraction * 10000) / 10000, + }; +} diff --git a/dashboard/src/pages/insights.astro b/dashboard/src/pages/insights.astro @@ -0,0 +1,17 @@ +--- +import Base from "../layouts/Base.astro"; +import { loadAllRuns } from "../lib/data"; +import Insights from "../components/Insights"; + +const runs = loadAllRuns(); +--- + +<Base title="Insights"> + <h1 style="margin-bottom: 8px;">Insights</h1> + <p style="color: var(--text-muted); margin-bottom: 24px; font-size: 0.875rem;"> + Which variables actually move the needle? Tornado charts show main effects, + heatmaps reveal interactions. 
+ </p> + + <Insights client:load runs={runs} /> +</Base> diff --git a/grid.yaml b/grid.yaml @@ -3,7 +3,6 @@ version: 1 defaults: runs_per_cell: 3 timeout_seconds: 600 - base_tools: "Bash,Read,Edit,Write,Glob,Grep" budget: low: 0.50 high: 5.00 @@ -19,6 +18,16 @@ axes: values: [typescript, javascript] human_language: values: [en, es] + tool_read: + values: ["on", "off"] + tool_write: + values: ["on", "off"] + tool_edit: + values: ["on", "off"] + tool_glob: + values: ["on", "off"] + tool_grep: + values: ["on", "off"] linter: values: ["on", "off"] playwright: @@ -54,11 +63,16 @@ profiles: smoke: description: "Quick validation -- minimal grid" axes: - model: [sonnet] + model: [haiku] effort: [high] prompt_style: [simple, detailed] language: [typescript] human_language: [en] + tool_read: ["on"] + tool_write: ["on"] + tool_edit: ["on"] + tool_glob: ["on"] + tool_grep: ["on"] linter: ["off"] playwright: ["off"] context_file: [none] @@ -75,6 +89,11 @@ profiles: prompt_style: [simple, detailed] language: [typescript] human_language: [en] + tool_read: ["on"] + tool_write: ["on"] + tool_edit: ["on"] + tool_glob: ["on"] + tool_grep: ["on"] linter: ["off"] playwright: ["off"] context_file: [none] diff --git a/harness/lib/compute_grid.py b/harness/lib/compute_grid.py @@ -108,7 +108,6 @@ def compute_cells(grid, profile_name): cell["runs_per_cell"] = runs_per_cell cell["max_budget_usd"] = budget_usd cell["timeout_seconds"] = defaults["timeout_seconds"] - cell["base_tools"] = defaults["base_tools"] cells.append(cell) diff --git a/harness/lib/evaluate.sh b/harness/lib/evaluate.sh @@ -13,37 +13,44 @@ evaluate() { local eval_results='{"structural": null, "functional": null, "quality": null, "score": null}' + # Helper: safely merge JSON into eval_results + merge_result() { + local key="$1" + local output="$2" + + if [[ -z "$output" ]]; then + eval_results=$(echo "$eval_results" | jq --arg k "$key" '.[$k] = {"pass": false, "error": "no output"}') + return + fi + + if echo "$output" | jq 
. > /dev/null 2>&1; then + eval_results=$(echo "$eval_results" | jq --arg k "$key" --argjson v "$output" '.[$k] = $v') + else + # Truncate long non-JSON output to avoid jq issues + local truncated="${output:0:500}" + eval_results=$(echo "$eval_results" | jq --arg k "$key" --arg e "$truncated" '.[$k] = {"pass": false, "error": $e}') + fi + } + # --- Structural checks --- if [[ -f "$task_dir/eval/structural.sh" ]]; then local structural_output structural_output=$(bash "$task_dir/eval/structural.sh" "$workspace" "$language" 2>&1) || true - if echo "$structural_output" | jq . > /dev/null 2>&1; then - eval_results=$(echo "$eval_results" | jq --argjson s "$structural_output" '.structural = $s') - else - eval_results=$(echo "$eval_results" | jq --arg s "$structural_output" '.structural = {"pass": false, "error": $s}') - fi + merge_result "structural" "$structural_output" fi # --- Functional tests --- - local functional_output='{}' if [[ -d "$task_dir/eval/tests" ]]; then - functional_output=$(run_functional_tests "$task_dir" "$workspace" "$language" "$run_dir") || true - if echo "$functional_output" | jq . > /dev/null 2>&1; then - eval_results=$(echo "$eval_results" | jq --argjson f "$functional_output" '.functional = $f') - else - eval_results=$(echo "$eval_results" | jq '.functional = {"pass": false, "error": "test runner failed"}') - fi + local functional_output + functional_output=$(run_functional_tests "$task_dir" "$workspace" "$language" "$run_dir" 2>&1) || true + merge_result "functional" "$functional_output" fi # --- Quality checks --- if [[ -f "$task_dir/eval/quality.sh" ]]; then local quality_output quality_output=$(bash "$task_dir/eval/quality.sh" "$workspace" "$language" 2>&1) || true - if echo "$quality_output" | jq . 
> /dev/null 2>&1; then - eval_results=$(echo "$eval_results" | jq --argjson q "$quality_output" '.quality = $q') - else - eval_results=$(echo "$eval_results" | jq --arg q "$quality_output" '.quality = {"pass": false, "error": $q}') - fi + merge_result "quality" "$quality_output" fi # --- Compute aggregate score --- diff --git a/harness/lib/experiment_design.py b/harness/lib/experiment_design.py @@ -0,0 +1,582 @@ +#!/usr/bin/env python3 +"""Experiment design and analysis for loop benchmarking. + +Generates efficient experiment plans instead of full factorial grids. +Analyzes results to identify which variables have the biggest impact. + +Approaches: + 1. Main effects sweep: vary one axis at a time from a baseline + 2. Fractional factorial: Plackett-Burman screening for binary factors + 3. Interaction hunt: full factorial on the top-k most impactful axes +""" + +import json +import math +import sys +from itertools import product +from pathlib import Path + +import yaml + + +def load_grid(path): + with open(path) as f: + return yaml.safe_load(f) + + +def get_axes(grid, profile_name=None): + """Get axis definitions, optionally filtered by profile.""" + top_axes = {name: spec["values"] for name, spec in grid["axes"].items()} + if profile_name and profile_name in grid.get("profiles", {}): + profile = grid["profiles"][profile_name] + if "axes" in profile: + axes = dict(top_axes) + for name, values in profile["axes"].items(): + axes[name] = values + return axes + return top_axes + + +# --------------------------------------------------------------------------- +# 1. Main effects sweep +# --------------------------------------------------------------------------- + +def main_effects_plan(grid, baseline=None, tasks=None): + """Generate a one-at-a-time sweep from a baseline. + + For each axis, vary it through all its values while holding everything + else at baseline. This identifies main effects cheaply. + + Returns a list of cell dicts. 
+ """ + axes = get_axes(grid) + tasks = tasks or grid["tasks"] + defaults = grid["defaults"] + + # Pick baseline: first value of each axis unless overridden + if baseline is None: + baseline = {name: values[0] for name, values in axes.items()} + + cells = [] + seen = set() + + for task in tasks: + # Apply task overrides to axes + task_axes = dict(axes) + overrides = grid.get("task_overrides", {}).get(task, {}) + if "axes" in overrides: + for axis_name, spec in overrides["axes"].items(): + task_axes[axis_name] = spec["values"] + + # Baseline cell + base_cell = dict(baseline) + # Ensure baseline values are valid for this task + for name, values in task_axes.items(): + if base_cell[name] not in values: + base_cell[name] = values[0] + + base_key = _cell_key(task, base_cell) + if base_key not in seen: + seen.add(base_key) + cells.append(_build_cell(task, base_cell, defaults, grid)) + + # Vary each axis + for axis_name, values in task_axes.items(): + for value in values: + if value == base_cell[axis_name]: + continue + varied = dict(base_cell) + varied[axis_name] = value + key = _cell_key(task, varied) + if key not in seen: + seen.add(key) + cells.append(_build_cell(task, varied, defaults, grid)) + + return cells + + +# --------------------------------------------------------------------------- +# 2. Plackett-Burman screening +# --------------------------------------------------------------------------- + +def _hadamard_matrix(n): + """Generate a Hadamard-like matrix for Plackett-Burman design. + + n must be a multiple of 4. Returns an n x (n-1) matrix of +1/-1. + Uses the Paley construction for prime n-1. 
+ """ + # For simplicity, use the standard PB generators for common sizes + # These are the first rows; subsequent rows are cyclic shifts + generators = { + 4: [1, 1, -1], + 8: [1, 1, 1, -1, 1, -1, -1], + 12: [1, 1, -1, 1, 1, 1, -1, -1, -1, 1, -1], + 16: [1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, -1, -1, -1], + 20: [1, 1, -1, 1, 1, -1, -1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, -1, -1], + 24: [1, 1, 1, 1, 1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, -1, 1, -1, 1, -1, -1, -1, -1], + } + + if n not in generators: + # Fall back to nearest larger size + for size in sorted(generators.keys()): + if size >= n: + n = size + break + else: + n = max(generators.keys()) + + gen = generators[n] + k = len(gen) + matrix = [] + + for i in range(k): + row = gen[i:] + gen[:i] + matrix.append(row) + + # Add a row of all -1 + matrix.append([-1] * k) + + return matrix + + +def plackett_burman_plan(grid, tasks=None): + """Generate a Plackett-Burman screening design for binary factors. + + For factors with more than 2 levels (e.g., model: haiku/sonnet/opus), + we create dummy binary variables or sweep them separately. + + Returns a list of cell dicts. 
+ """ + axes = get_axes(grid) + tasks = tasks or grid["tasks"] + defaults = grid["defaults"] + + # Separate binary and multi-level factors + binary_axes = {} + multi_axes = {} + for name, values in axes.items(): + if len(values) == 2: + binary_axes[name] = values + elif len(values) > 2: + multi_axes[name] = values + + binary_names = sorted(binary_axes.keys()) + n_factors = len(binary_names) + + if n_factors == 0: + return main_effects_plan(grid, tasks=tasks) + + # Find the smallest PB design that fits + n_runs = n_factors + 1 + # Round up to multiple of 4 + n_runs = math.ceil(n_runs / 4) * 4 + + matrix = _hadamard_matrix(n_runs) + + cells = [] + seen = set() + + # For multi-level factors, fix at each level and run the PB design + if multi_axes: + multi_names = sorted(multi_axes.keys()) + multi_combos = list(product(*[multi_axes[n] for n in multi_names])) + else: + multi_names = [] + multi_combos = [()] + + for multi_combo in multi_combos: + multi_fixed = dict(zip(multi_names, multi_combo)) + + for row in matrix: + cell = dict(multi_fixed) + for i, name in enumerate(binary_names): + if i < len(row): + idx = 0 if row[i] == -1 else 1 + else: + idx = 0 + cell[name] = binary_axes[name][idx] + + for task in tasks: + # Apply task overrides + task_axes = dict(axes) + overrides = grid.get("task_overrides", {}).get(task, {}) + if "axes" in overrides: + for axis_name, spec in overrides["axes"].items(): + task_axes[axis_name] = spec["values"] + + # Ensure values are valid for this task + valid = True + for name, values in task_axes.items(): + if cell.get(name) not in values: + if len(values) == 1: + cell[name] = values[0] + else: + valid = False + break + + # Check exclusions + if valid and not _is_excluded(cell, grid): + key = _cell_key(task, cell) + if key not in seen: + seen.add(key) + cells.append(_build_cell(task, cell, defaults, grid)) + + return cells + + +# --------------------------------------------------------------------------- +# 3. 
Interaction hunt +# --------------------------------------------------------------------------- + +def interaction_hunt_plan(grid, top_axes, tasks=None): + """Full factorial on a subset of axes, baseline for the rest. + + Args: + top_axes: list of axis names to fully explore (e.g., ["model", "effort", "linter"]) + tasks: which tasks to include + """ + axes = get_axes(grid) + tasks = tasks or grid["tasks"] + defaults = grid["defaults"] + + # Baseline for non-explored axes + baseline = {name: values[0] for name, values in axes.items()} + + # Full factorial on top_axes + explore_names = sorted(top_axes) + explore_values = [axes[n] for n in explore_names] + + cells = [] + seen = set() + + for combo in product(*explore_values): + cell = dict(baseline) + for name, value in zip(explore_names, combo): + cell[name] = value + + for task in tasks: + task_axes = dict(axes) + overrides = grid.get("task_overrides", {}).get(task, {}) + if "axes" in overrides: + for axis_name, spec in overrides["axes"].items(): + task_axes[axis_name] = spec["values"] + + # Adjust for task constraints + for name, values in task_axes.items(): + if cell.get(name) not in values: + cell[name] = values[0] + + if not _is_excluded(cell, grid): + key = _cell_key(task, cell) + if key not in seen: + seen.add(key) + cells.append(_build_cell(task, cell, defaults, grid)) + + return cells + + +# --------------------------------------------------------------------------- +# Analysis: compute effects from results +# --------------------------------------------------------------------------- + +def analyze_main_effects(results_dir, metric="score"): + """Compute the main effect of each axis on a given metric. + + Reads all completed runs, groups by axis values, computes mean metric + for each group, and returns the effect size (difference from grand mean). + + Returns a dict: {axis_name: {value: effect_size, ...}, ...} + sorted by absolute effect size. 
+ """ + runs = _load_results(results_dir) + if not runs: + return {} + + # Extract metric values + scored_runs = [] + for run in runs: + val = _extract_metric(run, metric) + if val is not None: + scored_runs.append((run["meta"], val)) + + if not scored_runs: + return {} + + grand_mean = sum(v for _, v in scored_runs) / len(scored_runs) + + # Identify axes from the first run's meta + meta_keys = set(scored_runs[0][0].keys()) + skip_keys = { + "task", "cell_id", "run_id", "run_number", "runs_per_cell", + "max_budget_usd", "timeout_seconds", "base_tools", + "started_at", "completed_at", "wall_time_seconds", "exit_code", + } + axis_names = sorted(meta_keys - skip_keys) + + effects = {} + for axis in axis_names: + groups = {} + for meta, val in scored_runs: + key = str(meta.get(axis, "unknown")) + groups.setdefault(key, []).append(val) + + if len(groups) < 2: + continue + + axis_effects = {} + for value, vals in sorted(groups.items()): + group_mean = sum(vals) / len(vals) + effect = group_mean - grand_mean + axis_effects[value] = { + "mean": round(group_mean, 4), + "effect": round(effect, 4), + "n": len(vals), + } + + # Effect magnitude = max spread between any two values + means = [v["mean"] for v in axis_effects.values()] + spread = max(means) - min(means) if means else 0 + + effects[axis] = { + "values": axis_effects, + "spread": round(spread, 4), + } + + # Sort by spread (biggest effects first) + effects = dict(sorted(effects.items(), key=lambda x: -x[1]["spread"])) + return effects + + +def analyze_interactions(results_dir, axis_a, axis_b, metric="score"): + """Compute the interaction effect between two axes. + + Returns a 2D table of mean metric values for each (a_value, b_value) combo, + plus the interaction effect size. 
+ """ + runs = _load_results(results_dir) + if not runs: + return {} + + groups = {} + for run in runs: + val = _extract_metric(run, metric) + if val is None: + continue + a_val = str(run["meta"].get(axis_a, "?")) + b_val = str(run["meta"].get(axis_b, "?")) + key = (a_val, b_val) + groups.setdefault(key, []).append(val) + + if not groups: + return {} + + table = {} + for (a_val, b_val), vals in sorted(groups.items()): + table.setdefault(a_val, {})[b_val] = { + "mean": round(sum(vals) / len(vals), 4), + "n": len(vals), + } + + # Compute interaction: does the effect of axis_a change depending on axis_b? + a_values = sorted(table.keys()) + b_values = sorted(set(b for row in table.values() for b in row.keys())) + + # Interaction = deviation from additive model + grand_mean = sum( + v for row in table.values() for cell in row.values() for v in [cell["mean"]] + ) / sum(1 for row in table.values() for _ in row.values()) + + a_means = {} + for a in a_values: + vals = [table[a][b]["mean"] for b in b_values if b in table.get(a, {})] + a_means[a] = sum(vals) / len(vals) if vals else grand_mean + + b_means = {} + for b in b_values: + vals = [table[a][b]["mean"] for a in a_values if b in table.get(a, {})] + b_means[b] = sum(vals) / len(vals) if vals else grand_mean + + # Interaction effects + interactions = {} + max_interaction = 0 + for a in a_values: + for b in b_values: + if b in table.get(a, {}): + expected = a_means[a] + b_means[b] - grand_mean + actual = table[a][b]["mean"] + interaction = round(actual - expected, 4) + interactions[(a, b)] = interaction + max_interaction = max(max_interaction, abs(interaction)) + + return { + "table": table, + "grand_mean": round(grand_mean, 4), + "a_means": {k: round(v, 4) for k, v in a_means.items()}, + "b_means": {k: round(v, 4) for k, v in b_means.items()}, + "interactions": {f"{a},{b}": v for (a, b), v in interactions.items()}, + "max_interaction": round(max_interaction, 4), + } + + +# 
--------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +def _cell_key(task, cell): + axis_names = sorted(k for k in cell.keys() if k not in ( + "task", "cell_id", "runs_per_cell", "max_budget_usd", + "timeout_seconds", "base_tools", + )) + parts = [task] + [f"{k}={cell[k]}" for k in axis_names] + return "_".join(parts) + + +def _is_excluded(cell, grid): + for exclusion in grid.get("exclusions", []): + match = True + for key, value in exclusion["when"].items(): + if cell.get(key) != value: + match = False + break + if match: + return True + return False + + +def _build_cell(task, cell, defaults, grid): + axis_names = sorted(cell.keys()) + cell_id_parts = [task] + [f"{k}={cell[k]}" for k in axis_names] + + result = dict(cell) + result["task"] = task + result["cell_id"] = "_".join(cell_id_parts) + result["runs_per_cell"] = defaults.get("runs_per_cell", 3) + result["timeout_seconds"] = defaults.get("timeout_seconds", 600) + + budget_key = cell.get("max_budget", "low") + result["max_budget_usd"] = defaults.get("budget", {}).get(budget_key, 0.50) + + return result + + +def _load_results(results_dir): + """Load all completed runs from the results directory.""" + results_dir = Path(results_dir) + runs_dir = results_dir / "runs" + if not runs_dir.exists(): + return [] + + runs = [] + for run_dir in runs_dir.iterdir(): + if not run_dir.is_dir(): + continue + meta_path = run_dir / "meta.json" + eval_path = run_dir / "eval_results.json" + claude_path = run_dir / "claude_output.json" + + if not meta_path.exists() or not eval_path.exists(): + continue + + try: + meta = json.loads(meta_path.read_text()) + eval_results = json.loads(eval_path.read_text()) + claude_output = {} + if claude_path.exists(): + claude_output = json.loads(claude_path.read_text()) + + runs.append({ + "meta": meta, + "eval": eval_results, + "claude": claude_output, + }) + except 
(json.JSONDecodeError, OSError): + continue + + return runs + + +def _extract_metric(run, metric): + """Extract a numeric metric from a run.""" + if metric == "score": + val = run["eval"].get("score") + return val if isinstance(val, (int, float)) else None + elif metric == "cost": + return run["claude"].get("total_cost_usd") + elif metric == "turns": + return run["claude"].get("num_turns") + elif metric == "wall_time": + return run["meta"].get("wall_time_seconds") + elif metric == "pass_rate": + func = run["eval"].get("functional", {}) + if isinstance(func, dict) and "pass" in func: + return 1.0 if func["pass"] else 0.0 + return None + return None + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + +def main(): + if len(sys.argv) < 3: + print("Usage:") + print(" experiment_design.py plan <grid_file> <design> [args...]") + print(" designs: main_effects, plackett_burman, interaction_hunt") + print(" experiment_design.py analyze <results_dir> <analysis> [args...]") + print(" analyses: main_effects, interactions") + sys.exit(1) + + command = sys.argv[1] + + if command == "plan": + grid_file = sys.argv[2] + design = sys.argv[3] if len(sys.argv) > 3 else "main_effects" + grid = load_grid(grid_file) + + if design == "main_effects": + cells = main_effects_plan(grid) + elif design == "plackett_burman": + cells = plackett_burman_plan(grid) + elif design == "interaction_hunt": + top_axes = sys.argv[4].split(",") if len(sys.argv) > 4 else [] + if not top_axes: + print("ERROR: interaction_hunt requires comma-separated axis names", file=sys.stderr) + sys.exit(1) + cells = interaction_hunt_plan(grid, top_axes) + else: + print(f"Unknown design: {design}", file=sys.stderr) + sys.exit(1) + + print(f"# {design}: {len(cells)} cells", file=sys.stderr) + for cell in cells: + print(json.dumps(cell)) + + elif command == "analyze": + results_dir = sys.argv[2] + analysis = 
sys.argv[3] if len(sys.argv) > 3 else "main_effects" + + if analysis == "main_effects": + metric = sys.argv[4] if len(sys.argv) > 4 else "score" + effects = analyze_main_effects(results_dir, metric) + print(json.dumps(effects, indent=2)) + elif analysis == "interactions": + if len(sys.argv) < 6: + print("ERROR: interactions requires two axis names", file=sys.stderr) + sys.exit(1) + axis_a = sys.argv[4] + axis_b = sys.argv[5] + metric = sys.argv[6] if len(sys.argv) > 6 else "score" + result = analyze_interactions(results_dir, axis_a, axis_b, metric) + print(json.dumps(result, indent=2)) + else: + print(f"Unknown analysis: {analysis}", file=sys.stderr) + sys.exit(1) + + else: + print(f"Unknown command: {command}", file=sys.stderr) + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/harness/lib/get-oauth-token.sh b/harness/lib/get-oauth-token.sh @@ -0,0 +1,13 @@ +#!/usr/bin/env bash +# Extract OAuth token from Claude Code credentials for use with --bare mode. +# This lets the harness use your Claude plan while maintaining full isolation. + +CREDS_FILE="${CLAUDE_CONFIG_DIR:-$HOME/.claude}/.credentials.json" + +if [[ ! -f "$CREDS_FILE" ]]; then + echo "ERROR: No credentials file found at $CREDS_FILE" >&2 + exit 1 +fi + +# Extract the OAuth access token +jq -r '.claudeAiOauth.accessToken // empty' "$CREDS_FILE" diff --git a/harness/lib/invoke.sh b/harness/lib/invoke.sh @@ -45,8 +45,19 @@ Use TypeScript." Use JavaScript (no TypeScript)." 
fi - # Build tool list - local tools="$base_tools" + # Build tool list from individual axes (Bash always on) + local tools="Bash" + local tool_read tool_write tool_edit tool_glob tool_grep + tool_read=$(echo "$cell_json" | jq -r '.tool_read // "on"') + tool_write=$(echo "$cell_json" | jq -r '.tool_write // "on"') + tool_edit=$(echo "$cell_json" | jq -r '.tool_edit // "on"') + tool_glob=$(echo "$cell_json" | jq -r '.tool_glob // "on"') + tool_grep=$(echo "$cell_json" | jq -r '.tool_grep // "on"') + [[ "$tool_read" == "on" ]] && tools="$tools,Read" + [[ "$tool_write" == "on" ]] && tools="$tools,Write" + [[ "$tool_edit" == "on" ]] && tools="$tools,Edit" + [[ "$tool_glob" == "on" ]] && tools="$tools,Glob" + [[ "$tool_grep" == "on" ]] && tools="$tools,Grep" if [[ "$sub_agents" == "on" ]]; then tools="$tools,Agent" fi @@ -55,15 +66,22 @@ Use JavaScript (no TypeScript)." fi # Build the claude command + # --bare for full isolation (no CLAUDE.md, hooks, MCP, memory). + # Auth via apiKeyHelper that reads OAuth token from ~/.claude/.credentials.json. + local auth_helper + auth_helper="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)/get-oauth-token.sh" + local cmd=( claude --bare -p "$prompt" --model "$model" --output-format stream-json - --dangerously-skip-permissions + --verbose + --permission-mode dontAsk --max-budget-usd "$budget" --allowedTools "$tools" + --settings "{\"apiKeyHelper\": \"$auth_helper\"}" ) # Add effort level diff --git a/harness/run.py b/harness/run.py @@ -0,0 +1,424 @@ +#!/usr/bin/env python3 +"""Loop Benchmarking Harness - Main orchestrator. + +Computes the experiment grid, creates isolated workspaces, invokes claude, +runs evaluation, and stores results. 
+ +Usage: + python3 run.py [grid_file] [profile_or_design] + + profile_or_design can be: + - A profile name from grid.yaml (e.g., smoke, core, full) + - A DOE design: main_effects, plackett_burman + - interaction_hunt:axis1,axis2,axis3 +""" + +import json +import os +import shutil +import subprocess +import sys +import tarfile +import tempfile +import time +from datetime import datetime, timezone +from pathlib import Path + +SCRIPT_DIR = Path(__file__).resolve().parent +PROJECT_DIR = SCRIPT_DIR.parent +sys.path.insert(0, str(SCRIPT_DIR / "lib")) + +from compute_grid import load_grid, compute_cells +from experiment_design import ( + main_effects_plan, + plackett_burman_plan, + interaction_hunt_plan, + analyze_main_effects, +) + + +def create_workspace(project_dir: Path, task: str, cell: dict) -> Path: + """Create an isolated temp directory with appropriate setup.""" + workspace = Path(tempfile.mkdtemp(prefix="loop-bench-")) + + language = cell.get("language", "typescript") + linter = cell.get("linter", "off") + playwright = cell.get("playwright", "off") + + # npm init + subprocess.run(["npm", "init", "-y"], cwd=workspace, capture_output=True) + + # TypeScript + if language == "typescript": + subprocess.run( + ["npm", "install", "--save-dev", "typescript", "@types/node"], + cwd=workspace, capture_output=True, + ) + + # Linter + if linter == "on": + subprocess.run( + ["npm", "install", "--save-dev", "eslint", "@eslint/js"], + cwd=workspace, capture_output=True, + ) + + # Playwright + if playwright == "on": + subprocess.run( + ["npm", "install", "--save-dev", "@playwright/test"], + cwd=workspace, capture_output=True, + ) + subprocess.run( + ["npx", "playwright", "install", "chromium", "--with-deps"], + cwd=workspace, capture_output=True, + ) + + # Copy fixtures + fixtures_dir = project_dir / "tasks" / task / "fixtures" + if fixtures_dir.is_dir(): + for item in fixtures_dir.iterdir(): + dest = workspace / item.name + if item.is_dir(): + shutil.copytree(item, dest) + 
else: + shutil.copy2(item, dest) + + return workspace + + +def build_prompt(project_dir: Path, cell: dict) -> str: + """Read the prompt file and append language instruction.""" + task = cell["task"] + style = cell["prompt_style"] + lang_code = cell["human_language"] + + prompt_file = project_dir / "tasks" / task / "prompts" / f"{style}.{lang_code}.md" + prompt = prompt_file.read_text() + + language = cell.get("language", "typescript") + if language == "typescript": + prompt += "\n\nUse TypeScript." + elif language == "javascript": + prompt += "\n\nUse JavaScript (no TypeScript)." + + return prompt + + +def invoke_claude(cell: dict, workspace: Path, run_dir: Path, project_dir: Path) -> int: + """Invoke claude CLI and capture output.""" + prompt = build_prompt(project_dir, cell) + model = cell["model"] + effort = cell.get("effort", "high") + budget = cell.get("max_budget_usd", 0.50) + timeout = cell.get("timeout_seconds", 600) + # Build tool list from individual tool axes + # Bash is always available - it's the agent's escape hatch + tools_list = ["Bash"] + if cell.get("tool_read", "on") == "on": + tools_list.append("Read") + if cell.get("tool_write", "on") == "on": + tools_list.append("Write") + if cell.get("tool_edit", "on") == "on": + tools_list.append("Edit") + if cell.get("tool_glob", "on") == "on": + tools_list.append("Glob") + if cell.get("tool_grep", "on") == "on": + tools_list.append("Grep") + if cell.get("sub_agents") == "on": + tools_list.append("Agent") + if cell.get("web_search") == "on": + tools_list.extend(["WebSearch", "WebFetch"]) + tools = ",".join(tools_list) + + # Auth helper for --bare mode + auth_helper = str(SCRIPT_DIR / "lib" / "get-oauth-token.sh") + + cmd = [ + "claude", + "--bare", + "-p", prompt, + "--model", model, + "--output-format", "stream-json", + "--verbose", + "--permission-mode", "dontAsk", + "--max-budget-usd", str(budget), + "--allowedTools", tools, + "--settings", json.dumps({"apiKeyHelper": auth_helper}), + ] + + if effort: + 
cmd.extend(["--effort", effort]) + + # Context file + if cell.get("context_file") == "provided": + ctx_file = project_dir / "tasks" / cell["task"] / "context.md" + if ctx_file.exists(): + cmd.extend(["--append-system-prompt", ctx_file.read_text()]) + + # Run claude + transcript_path = run_dir / "transcript.jsonl" + stderr_path = run_dir / "claude_stderr.log" + + with open(transcript_path, "w") as transcript_f, open(stderr_path, "w") as stderr_f: + try: + result = subprocess.run( + cmd, + cwd=workspace, + stdout=transcript_f, + stderr=stderr_f, + timeout=timeout, + ) + exit_code = result.returncode + except subprocess.TimeoutExpired: + exit_code = 124 # Same as timeout(1) convention + + # Extract final result line (nothing is written if the transcript is empty) + output_path = run_dir / "claude_output.json" + try: + lines = transcript_path.read_text().strip().splitlines() + if lines: + output_path.write_text(lines[-1]) + except Exception: + output_path.write_text("{}") + + return exit_code + + +def run_eval_script(script: Path, workspace: Path, language: str) -> str: + """Run a bash eval script and return its stdout.""" + try: + result = subprocess.run( + ["bash", str(script), str(workspace), language], + capture_output=True, text=True, timeout=120, + ) + return result.stdout.strip() + except Exception as e: + return json.dumps({"pass": False, "error": str(e)}) + + +def safe_parse_json(text: str, fallback_key: str = "error") -> dict: + """Parse JSON, returning an error dict (keyed by fallback_key) if parsing fails.""" + if not text: + return {"pass": False, fallback_key: "no output"} + try: + return json.loads(text) + except json.JSONDecodeError: + return {"pass": False, fallback_key: text[:500]} + + +def evaluate(task_dir: Path, workspace: Path, cell: dict, run_dir: Path): + """Run all evaluation scripts and write eval_results.json.""" + language = cell.get("language", "typescript") + + results = { + "structural": None, + "functional": None, + "quality": None, + "score": None, + } + + # Structural + structural_sh = task_dir / "eval" / 
"structural.sh" + if structural_sh.exists(): + output = run_eval_script(structural_sh, workspace, language) + results["structural"] = safe_parse_json(output) + + # Functional + tests_dir = task_dir / "eval" / "tests" + if tests_dir.is_dir(): + # Check for different test types + if (tests_dir / "functional.sh").exists(): + output = run_eval_script(tests_dir / "functional.sh", workspace, language) + results["functional"] = safe_parse_json(output) + elif (tests_dir / "functional.spec.ts").exists(): + # Playwright tests - would need server setup, skip for now + results["functional"] = {"pass": False, "error": "playwright eval not yet wired", "score": 0} + elif (tests_dir / "functional.test.ts").exists(): + # vitest tests - would need server setup, skip for now + results["functional"] = {"pass": False, "error": "vitest eval not yet wired", "score": 0} + + # Quality + quality_sh = task_dir / "eval" / "quality.sh" + if quality_sh.exists(): + output = run_eval_script(quality_sh, workspace, language) + results["quality"] = safe_parse_json(output) + + # Compute weighted score + try: + scoring_file = task_dir / "scoring.yaml" + if scoring_file.exists(): + import yaml + scoring = yaml.safe_load(scoring_file.read_text()) + weights = scoring.get("weights", {}) + + score = 0.0 + for category, weight in weights.items(): + cat_data = results.get(category) + if cat_data and isinstance(cat_data.get("score"), (int, float)): + score += cat_data["score"] * weight + + results["score"] = round(score, 4) + except Exception: + pass + + (run_dir / "eval_results.json").write_text(json.dumps(results, indent=2)) + + +def archive_workspace(workspace: Path, run_dir: Path): + """Archive and delete the workspace.""" + archive_path = run_dir / "workspace.tar.gz" + try: + with tarfile.open(archive_path, "w:gz") as tar: + tar.add(workspace, arcname=workspace.name, + filter=lambda t: None if "node_modules" in t.name else t) + except Exception: + # If archiving fails, just note it + pass + + try: + 
shutil.rmtree(workspace) + except Exception: + pass + + +def main(): + grid_file = sys.argv[1] if len(sys.argv) > 1 else str(PROJECT_DIR / "grid.yaml") + profile = sys.argv[2] if len(sys.argv) > 2 else "smoke" + results_dir = PROJECT_DIR / "results" + results_dir.mkdir(exist_ok=True) + (results_dir / "runs").mkdir(exist_ok=True) + + # Preflight + if shutil.which("claude") is None: + print("ERROR: claude CLI not found in PATH.") + sys.exit(1) + + print("=" * 40) + print("Loop Benchmarking Harness") + print("=" * 40) + print(f"Grid file: {grid_file}") + print(f"Profile: {profile}") + print(f"Results: {results_dir}") + print("=" * 40) + + grid = load_grid(grid_file) + + # Determine cell generation strategy + if profile == "main_effects": + cells = main_effects_plan(grid) + print(f"Design: main effects sweep") + elif profile == "plackett_burman": + cells = plackett_burman_plan(grid) + print(f"Design: Plackett-Burman screening") + elif profile.startswith("interaction_hunt:"): + top_axes = profile.split(":", 1)[1].split(",") + cells = interaction_hunt_plan(grid, top_axes) + print(f"Design: interaction hunt on {top_axes}") + else: + cells = compute_cells(grid, profile) + print(f"Profile: {profile}") + + print(f"Grid cells: {len(cells)}") + print() + + completed = 0 + skipped = 0 + failed = 0 + + for cell in cells: + task = cell["task"] + cell_id = cell["cell_id"] + runs_per_cell = cell.get("runs_per_cell", 3) + model = cell["model"] + prompt_style = cell["prompt_style"] + + for run_num in range(1, runs_per_cell + 1): + run_id = f"{cell_id}_run{run_num}" + run_dir = results_dir / "runs" / run_id + + # Resume support + if (run_dir / "eval_results.json").exists(): + print(f"SKIP: {run_id}") + skipped += 1 + continue + + print("-" * 40) + print(f"RUN: {run_id}") + print(f"Task: {task} | Model: {model} | Prompt: {prompt_style}") + print("-" * 40) + + run_dir.mkdir(parents=True, exist_ok=True) + + # Save meta + meta = { + **cell, + "run_id": run_id, + "run_number": run_num, + 
"started_at": datetime.now(timezone.utc).isoformat(), + } + (run_dir / "meta.json").write_text(json.dumps(meta, indent=2)) + + # Create workspace + print(" Creating workspace...") + try: + workspace = create_workspace(PROJECT_DIR, task, cell) + print(f" Workspace: {workspace}") + except Exception as e: + print(f" ERROR creating workspace: {e}") + failed += 1 + continue + + # Invoke claude + print(f" Invoking claude (model={model})...") + start_time = time.time() + exit_code = invoke_claude(cell, workspace, run_dir, PROJECT_DIR) + wall_time = int(time.time() - start_time) + + if exit_code == 0: + print(" Claude completed successfully") + else: + print(f" Claude exited with error (exit code: {exit_code})") + + # Update meta with timing + meta["wall_time_seconds"] = wall_time + meta["exit_code"] = exit_code + meta["completed_at"] = datetime.now(timezone.utc).isoformat() + (run_dir / "meta.json").write_text(json.dumps(meta, indent=2)) + + # Evaluate + print(" Running evaluation...") + task_dir = PROJECT_DIR / "tasks" / task + evaluate(task_dir, workspace, cell, run_dir) + print(" Evaluation complete") + + # Append to index + index_entry = { + "run_id": run_id, + "task": task, + "model": model, + "cell_id": cell_id, + "completed_at": meta["completed_at"], + } + with open(results_dir / "index.jsonl", "a") as f: + f.write(json.dumps(index_entry) + "\n") + + # Archive and cleanup + print(" Archiving workspace...") + archive_workspace(workspace, run_dir) + + if (run_dir / "eval_results.json").exists(): + completed += 1 + else: + failed += 1 + + print(f" Done. ({completed} completed, {skipped} skipped, {failed} failed)") + print() + + print("=" * 40) + print("All runs complete.") + print(f"Completed: {completed} | Skipped: {skipped} | Failed: {failed}") + print("=" * 40) + + +if __name__ == "__main__": + main() diff --git a/harness/run.sh b/harness/run.sh @@ -1,5 +1,7 @@ #!/usr/bin/env bash -set -euo pipefail +set -uo pipefail +# Note: no set -e. 
The main loop handles errors per-run so one failure +# doesn't kill the entire harness. Critical setup errors still exit explicitly. SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" PROJECT_DIR="$(dirname "$SCRIPT_DIR")" @@ -17,6 +19,12 @@ GRID_FILE="${1:-$PROJECT_DIR/grid.yaml}" PROFILE="${2:-smoke}" RESULTS_DIR="$PROJECT_DIR/results" +# Preflight: verify the claude CLI is on PATH (authentication is not checked here) +if ! command -v claude > /dev/null 2>&1; then + echo "ERROR: claude CLI not found in PATH." >&2 + exit 1 +fi + echo "========================================" echo "Loop Benchmarking Harness" echo "========================================" @@ -36,7 +44,7 @@ completed=0 skipped=0 failed=0 -echo "$cells" | while IFS= read -r cell_json; do +while IFS= read -r cell_json; do task=$(echo "$cell_json" | jq -r '.task') cell_id=$(echo "$cell_json" | jq -r '.cell_id') runs_per_cell=$(echo "$cell_json" | jq -r '.runs_per_cell') @@ -58,53 +66,61 @@ echo "$cells" | while IFS= read -r cell_json; do echo "Task: $task | Model: $model | Prompt: $prompt_style" echo "----------------------------------------" - # Create run results directory + # Run everything in a subshell so cd's don't affect the main loop + ( + # Create run results directory + run_dir="$RESULTS_DIR/runs/$run_id" + mkdir -p "$run_dir" + + # Save cell config as meta.json + echo "$cell_json" | jq --arg run_id "$run_id" --argjson run_num "$run_num" \ + '. + {run_id: $run_id, run_number: $run_num, started_at: (now | todate)}' \ + > "$run_dir/meta.json" + + # Create isolated workspace + echo " Creating workspace..." + workspace=$(create_workspace "$PROJECT_DIR" "$task" "$cell_json") + echo " Workspace: $workspace" + + # Invoke claude + echo " Invoking claude (model=$model)..." + if invoke_claude "$cell_json" "$workspace" "$run_dir" "$PROJECT_DIR"; then + echo " Claude completed successfully" + else + echo " Claude exited with error (exit code: $?)" + fi + + # Run evaluation + echo " Running evaluation..." 
+ task_dir="$PROJECT_DIR/tasks/$task" + evaluate "$task_dir" "$workspace" "$cell_json" "$run_dir" + echo " Evaluation complete" + + # Append to run index + jq -c '{ + run_id: .run_id, + task: .task, + model: .model, + cell_id: .cell_id, + completed_at: .completed_at + }' "$run_dir/meta.json" >> "$RESULTS_DIR/index.jsonl" + + # Archive and cleanup workspace + echo " Archiving workspace..." + cleanup_workspace "$workspace" "$run_dir" + ) || true + + # Count results (outside subshell) run_dir="$RESULTS_DIR/runs/$run_id" - mkdir -p "$run_dir" - - # Save cell config as meta.json - echo "$cell_json" | jq --arg run_id "$run_id" --argjson run_num "$run_num" \ - '. + {run_id: $run_id, run_number: $run_num, started_at: (now | todate)}' \ - > "$run_dir/meta.json" - - # Create isolated workspace - echo " Creating workspace..." - workspace=$(create_workspace "$PROJECT_DIR" "$task" "$cell_json") - echo " Workspace: $workspace" - - # Invoke claude - echo " Invoking claude (model=$model)..." - if invoke_claude "$cell_json" "$workspace" "$run_dir" "$PROJECT_DIR"; then - echo " Claude completed successfully" + if [[ -f "$run_dir/eval_results.json" ]]; then + completed=$((completed + 1)) else - echo " Claude exited with error (exit code: $?)" failed=$((failed + 1)) fi - - # Run evaluation - echo " Running evaluation..." - task_dir="$PROJECT_DIR/tasks/$task" - evaluate "$task_dir" "$workspace" "$cell_json" "$run_dir" - echo " Evaluation complete" - - # Append to run index - jq -c '{ - run_id: .run_id, - task: .task, - model: .model, - cell_id: .cell_id, - completed_at: .completed_at - }' "$run_dir/meta.json" >> "$RESULTS_DIR/index.jsonl" - - # Archive and cleanup workspace - echo " Archiving workspace..." - cleanup_workspace "$workspace" "$run_dir" - - completed=$((completed + 1)) echo " Done. ($completed completed, $skipped skipped, $failed failed)" echo "" done -done +done <<< "$cells" echo "========================================" echo "All runs complete." 
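Note on the run.sh loop change above: replacing `echo "$cells" | while ...` with `while ...; done <<< "$cells"` matters because Bash, by default, runs every part of a pipeline in a subshell, so counters incremented inside a piped `while` loop are discarded when the loop ends. The following is a minimal sketch of that difference, assuming Bash with the `lastpipe` option off (its default). The variable names and the three-line sample input are illustrative only and are not taken from the harness.

```shell
#!/usr/bin/env bash
# Illustrative input: three newline-separated "cells" (not real grid output).
cells=$'cell-a\ncell-b\ncell-c'

# Piped form: the while loop runs in a subshell, so the increments
# never reach the parent shell's copy of the variable.
piped=0
echo "$cells" | while IFS= read -r cell; do
  piped=$((piped + 1))
done

# Here-string form: the loop runs in the current shell, so the
# counter keeps its value after the loop finishes.
direct=0
while IFS= read -r cell; do
  direct=$((direct + 1))
done <<< "$cells"

echo "piped=$piped direct=$direct"
```

Under those assumptions the script prints `piped=0 direct=3`. This is probably the kind of "bash subshell issue" the commit message refers to: with the piped form, the completed/skipped/failed totals would read as zero once the loop exits.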
diff --git a/results/index.jsonl b/results/index.jsonl
@@ -0,0 +1,6 @@
+{"run_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1","task":"tetris","model":"haiku","cell_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:34:51Z"}
+{"run_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1","task":"tetris","model":"haiku","cell_id":"tetris_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:39:20Z"}
+{"run_id":"bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1","task":"bookmarks-api","model":"haiku","cell_id":"bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off","completed_at":"2026-04-03T15:48:50Z"}
+{"run_id": "bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1", "task": "bookmarks-api", "model": "haiku", "cell_id": "bookmarks-api_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:14:47.247624+00:00"}
+{"run_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off_run1", "task": "data-pipeline", "model": "haiku", "cell_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=simple_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:19:53.900594+00:00"}
+{"run_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off_run1", "task": "data-pipeline", "model": "haiku", "cell_id": "data-pipeline_context_file=none_effort=high_human_language=en_language=typescript_linter=off_max_budget=low_model=haiku_playwright=off_prompt_style=detailed_sub_agents=off_web_search=off", "completed_at": "2026-04-03T16:21:33.913511+00:00"}
diff --git a/tasks/bookmarks-api/eval/quality.sh b/tasks/bookmarks-api/eval/quality.sh
@@ -97,7 +97,7 @@ for key in lint typecheck security_passwords security_jwt_secret security_sql; d
 done

 if [[ $score_count -gt 0 ]]; then
-    score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
 else
     score="0"
 fi
diff --git a/tasks/bookmarks-api/eval/structural.sh b/tasks/bookmarks-api/eval/structural.sh
@@ -100,7 +100,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
 checks_json="[${checks_json%,}]"

 if [[ $total_count -gt 0 ]]; then
-    score=$(echo "scale=2; $pass_count / $total_count" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
     score="0"
 fi
diff --git a/tasks/data-pipeline/eval/quality.sh b/tasks/data-pipeline/eval/quality.sh
@@ -126,7 +126,7 @@ for key in lint typecheck no_float_currency handles_empty_input handles_malforme
 done

 if [[ $score_count -gt 0 ]]; then
-    score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
 else
     score="0"
 fi
diff --git a/tasks/data-pipeline/eval/structural.sh b/tasks/data-pipeline/eval/structural.sh
@@ -91,7 +91,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
 checks_json="[${checks_json%,}]"

 if [[ $total_count -gt 0 ]]; then
-    score=$(echo "scale=2; $pass_count / $total_count" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
     score="0"
 fi
diff --git a/tasks/data-pipeline/eval/tests/functional.sh b/tasks/data-pipeline/eval/tests/functional.sh
@@ -73,7 +73,7 @@ if ! echo "$actual_output" | jq . > /dev/null 2>&1; then
     add_check "valid_json" "false" "output is not valid JSON"
     checks_json=$(printf '%s,' "${checks[@]}")
     checks_json="[${checks_json%,}]"
-    echo "{\"pass\": false, \"checks\": $checks_json, \"score\": $(echo "scale=2; $pass_count / $total_count" | bc)}"
+    echo "{\"pass\": false, \"checks\": $checks_json, \"score\": $(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")}"
     exit 0
 fi

@@ -166,7 +166,7 @@ checks_json=$(printf '%s,' "${checks[@]}")
 checks_json="[${checks_json%,}]"

 if [[ $total_count -gt 0 ]]; then
-    score=$(echo "scale=2; $pass_count / $total_count" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
     score="0"
 fi
diff --git a/tasks/tetris/eval/quality.sh b/tasks/tetris/eval/quality.sh
@@ -13,10 +13,8 @@ results='{}'
 # --- Lint check ---
 cd "$WORKSPACE"
 if command -v npx > /dev/null 2>&1; then
-    # Install eslint if not present
     npm install --save-dev eslint @eslint/js > /dev/null 2>&1

-    # Find source files
     if [[ "$LANGUAGE" == "typescript" ]]; then
         extensions="ts,tsx"
     else
@@ -29,12 +27,15 @@ if command -v npx > /dev/null 2>&1; then
     if echo "$lint_output" | jq . > /dev/null 2>&1; then
         errors=$(echo "$lint_output" | jq '[.[].errorCount] | add // 0')
         warnings=$(echo "$lint_output" | jq '[.[].warningCount] | add // 0')
-        lint_pass="true"
+        errors=${errors:-0}
+        warnings=${warnings:-0}
         if [[ "$errors" -gt 0 ]]; then
-            lint_pass="false"
+            results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" \
+                '. + {lint: {pass: false, errors: $e, warnings: $w}}')
+        else
+            results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" \
+                '. + {lint: {pass: true, errors: $e, warnings: $w}}')
         fi
-        results=$(echo "$results" | jq --argjson e "$errors" --argjson w "$warnings" --argjson p "$lint_pass" \
-            '. + {lint: {pass: ($p == true), errors: $e, warnings: $w}}')
     else
         results=$(echo "$results" | jq '. + {lint: {pass: false, errors: -1, warnings: 0, error: "eslint failed to run"}}')
     fi
@@ -49,7 +50,8 @@ if [[ "$LANGUAGE" == "typescript" ]]; then
     if npx tsc --noEmit > /dev/null 2>&1; then
         results=$(echo "$results" | jq '. + {typecheck: {pass: true}}')
     else
-        type_errors=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || echo "0")
+        type_errors=$(npx tsc --noEmit 2>&1 | grep -c "error TS" || true)
+        type_errors=${type_errors:-0}
         results=$(echo "$results" | jq --argjson e "$type_errors" '. + {typecheck: {pass: false, errors: $e}}')
     fi
 else
@@ -60,22 +62,23 @@ else
 fi

 # --- File size check ---
-# Find the main HTML file and measure total size
 total_size=0
 if [[ -d "$WORKSPACE/dist" ]]; then
     total_size=$(du -sb "$WORKSPACE/dist" 2>/dev/null | awk '{print $1}')
 elif [[ -f "$WORKSPACE/index.html" ]]; then
     total_size=$(du -sb "$WORKSPACE" --exclude=node_modules --exclude=.git 2>/dev/null | awk '{print $1}')
 fi
-size_pass="true"
-if [[ "$total_size" -gt 524288 ]]; then # 512KB
-    size_pass="false"
+total_size=${total_size:-0}
+
+if [[ "$total_size" -gt 524288 ]]; then
+    results=$(echo "$results" | jq --argjson s "$total_size" \
+        '. + {performance: {bundle_size_bytes: $s, size_under_512kb: false}}')
+else
+    results=$(echo "$results" | jq --argjson s "$total_size" \
+        '. + {performance: {bundle_size_bytes: $s, size_under_512kb: true}}')
 fi
-results=$(echo "$results" | jq --argjson s "$total_size" --argjson p "$size_pass" \
-    '. + {performance: {bundle_size_bytes: $s, size_under_512kb: ($p == true)}}')

 # --- Compute aggregate quality score ---
-# Each check contributes equally
 score_sum=0
 score_count=0

@@ -88,7 +91,7 @@ for key in lint typecheck performance; do
 done

 if [[ $score_count -gt 0 ]]; then
-    score=$(echo "scale=2; $score_sum / ($score_count * 100)" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $score_sum / ($score_count * 100)}")
 else
     score="0"
 fi
diff --git a/tasks/tetris/eval/structural.sh b/tasks/tetris/eval/structural.sh
@@ -80,7 +80,7 @@ checks_json="[${checks_json%,}]"

 # Compute score
 if [[ $total_count -gt 0 ]]; then
-    score=$(echo "scale=2; $pass_count / $total_count" | bc)
+    score=$(awk "BEGIN {printf \"%.2f\", $pass_count / $total_count}")
 else
     score="0"
 fi
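Two patterns recur in the eval-script hunks: `bc` is replaced by an `awk` `printf`, and `grep -c ... || echo "0"` becomes `|| true` followed by a default. The sketch below isolates both. `ratio` and `count_matches` are hypothetical helper names, and the sketch passes values with `awk -v` where the diff interpolates them into the program string. One plausible motivation, not stated in the commit, is that `bc` with `scale=2` prints `.75` without a leading zero, which is not a valid JSON number.

```shell
#!/usr/bin/env bash
# ratio: hypothetical helper mirroring the awk one-liner used in the diff.
ratio() {
    local pass=$1 total=$2
    if [[ "$total" -gt 0 ]]; then
        awk -v p="$pass" -v t="$total" 'BEGIN { printf "%.2f", p / t }'
    else
        printf '0'
    fi
}

# count_matches: grep -c prints "0" AND exits with status 1 when nothing
# matches. `|| true` swallows the status without adding output.
count_matches() {
    local pattern=$1 text=$2 n
    n=$(printf '%s\n' "$text" | grep -c "$pattern" || true)
    echo "${n:-0}"
}

# The old form for comparison: the fallback echo runs after grep has
# already printed its own "0", so two lines are captured.
old_style=$(printf 'all good\n' | grep -c "error TS" || echo "0")

echo "ratio 3/4:  $(ratio 3 4)"
echo "ratio 0/0:  $(ratio 0 0)"
echo "no matches: $(count_matches 'error TS' 'all good')"
echo "old form captured $(printf '%s\n' "$old_style" | wc -l | tr -d ' ') lines"
```

A two-line value such as the one in `old_style` would not parse as JSON when passed to `jq --argjson`, which is consistent with the commit's note about handling empty or invalid JSON.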
