commit 296bd543e30783c961ea89001e7d1df52e986345
parent 088b9fe8603b0e9272c25c577e87cdb8f5c48846
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Sat, 4 Apr 2026 10:23:16 +0200
Add sortable grid columns, show context file on run page, update TODO
Grid table:
- All columns sortable (click header to sort, click again to reverse)
- Sort indicator arrows on active column
- Default sort: score descending
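
The sort behaviour described above can be sketched in isolation. This is a minimal illustration, not the component code: the flat `Row` shape and the `sortRows` and `nextSort` helpers are hypothetical names introduced here, while the comparator and the `-1` fallback for missing values follow `getSortValue` in the diff.

```typescript
// Hypothetical, simplified row shape; the real Run type in Grid.tsx is nested.
type Row = { id: string; model: string; score: number | null };
type SortKey = "model" | "score";

// Missing numeric values map to -1, mirroring getSortValue in the diff.
function sortValue(row: Row, key: SortKey): string | number {
  return key === "model" ? row.model : (row.score ?? -1);
}

// Returns a sorted copy; asc = false gives the default descending order.
function sortRows(rows: Row[], key: SortKey, asc: boolean): Row[] {
  return [...rows].sort((a, b) => {
    const va = sortValue(a, key);
    const vb = sortValue(b, key);
    const cmp = va < vb ? -1 : va > vb ? 1 : 0;
    return asc ? cmp : -cmp;
  });
}

// Clicking the active column flips direction; a new column starts descending.
function nextSort(
  current: { key: SortKey; asc: boolean },
  clicked: SortKey,
): { key: SortKey; asc: boolean } {
  return clicked === current.key
    ? { key: clicked, asc: !current.asc }
    : { key: clicked, asc: false };
}
```

With this fallback, runs that have no score sort last in the default descending order and first when the order is reversed.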
Run detail:
- When context_file=provided, shows the actual context.md content
in the configuration card (loaded at build time from task directory)
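
The build-time lookup can be sketched as a small pure function. The helper name `loadContextContent`, the injected `readFile` parameter, and the `tasksRoot` default are assumptions made for illustration; in the commit the same logic is written inline in `getStaticPaths` using `fs.readFileSync`.

```typescript
// Hypothetical helper; in the commit this logic lives inline in getStaticPaths.
type ReadFile = (filePath: string) => string;

function loadContextContent(
  contextFile: string,
  task: string,
  readFile: ReadFile,
  tasksRoot = "../tasks", // assumed relative location of the task directories
): string {
  // Only runs configured with context_file=provided have a file to show.
  if (contextFile !== "provided") return "";
  try {
    return readFile(`${tasksRoot}/${task}/context.md`);
  } catch {
    // Missing or unreadable file: return an empty string so the card section is hidden.
    return "";
  }
}
```

In a Node build step the reader could be `(p) => fs.readFileSync(p, "utf-8")`; injecting it keeps the sketch testable without touching the filesystem.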
CLAUDE.md updated with comprehensive TODO list covering analysis,
eval, dashboard, harness, and data collection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Diffstat:
4 files changed, 142 insertions(+), 55 deletions(-)
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,67 +8,96 @@ An open benchmark for comparing agentic coding loop configurations. Define the v
The grid is a cartesian product of configuration variables. You define the axes and their possible values; the harness explodes them into the full experiment matrix.
-Axes (see `grid.yaml` for full definition):
+16 axes (see `grid.yaml` for full definition):
- **Model**: Haiku, Sonnet, Opus (or other providers via LiteLLM)
- **Effort**: high, max (extended thinking)
-- **Prompt style**: simple ("build tetris"), detailed (full spec)
-- **Programming language**: TypeScript, JavaScript
+- **Prompt style**: simple, detailed
+- **Programming language**: TypeScript, JavaScript, unspecified
- **Human language**: English, Spanish
+- **Individual tools**: Read, Write, Edit, Glob, Grep (each on/off)
- **Tooling**: Playwright on/off, linter (eslint) on/off
- **Context**: rules file provided or not
- **Sub-agents**: enabled/disabled
- **Web search**: enabled/disabled
- **Budget**: low ($0.50), high ($5.00)
-Each cell = 1 task x 1 configuration permutation, run N times.
-
-## Experiment Definitions
-
-Experiments are defined in YAML. The YAML describes the axes and their values, not individual experiments. The harness computes the permutation set.
-
## Test Harness
-Each experiment run executes in a clean room:
-- Isolated directory
-- No context contamination from this project or other experiments
-- Fresh agent session
-- Repeatable from the YAML definition
+- Python orchestrator (`harness/run.py`) with parallel execution (`-j N`)
+- OAuth auth via `--bare` + `apiKeyHelper` script
+- DOE designs: main_effects, plackett_burman, interaction_hunt
+- Re-eval existing runs: `python3 harness/reeval.py -j 4`
+- Auto-extracts workspace artifacts for dashboard iframe preview
## Scoring
All evaluation is deterministic code. No LLM grading.
-**Automated (quantitative):**
-- Pass/fail on pre-written test suites (agent never sees these tests)
-- Structural checks (correct files exist, build succeeds)
-- Quality checks (lint, type check, accessibility, performance, security)
-- Wall-clock time, tokens consumed, cost, loop iterations, agents spawned
+Categories (weights in `tasks/tetris/scoring.yaml`):
+- **Functional** (25%): gameplay bot (16 Playwright tests)
+- **Quality** (20%): lint, type check, bundle size
+- **Code analysis** (15%): file count, function length, nesting depth, naming consistency, separation of concerns, duplication, HTML validation
+- **Structural** (10%): entry point exists, build succeeds
+- **Gameplay bot** (10%): auto-calibrating Tetris player that tests all mechanics
+- **Transcript analysis** (10%): agent efficiency, wasted turns, self-testing
-**Qualitative:**
-- Human judgment scores (e.g., "would you ship this?" 1-5)
+## Dashboard
-## Transcript Storage
+Static Astro site with React islands. Pages:
+- **Grid** (`/`): summary stats, bar charts, filterable run table
+- **Insights** (`/insights`): surprise cards, scatter plots, tornado chart, interaction heatmap
+- **Explore** (`/explore`): correlation matrix, efficiency frontier, bump chart, heatmap matrix, radar comparison, treemap
+- **Compare** (`/compare`): aggregate stats per axis value
+- **Run detail** (`/run/{id}`): metrics, config pills, 7 score bars, code analysis, agent behavior, transcript viewer, artifact iframe
-Each run saves the full conversation via `--output-format stream-json`. Every tool call, response, and reasoning step is captured for review in the dashboard.
-
-## Web Dashboard
-
-Static site showing:
-- Grid overview of all experiments
-- Aggregate insights and comparisons across any axis
-- Drill-down into individual experiment results
-- Slicing/filtering (e.g., "Opus vs Sonnet when Playwright is on")
+Light/dark theme toggle. SMUI design system (JetBrains Mono, Nord palette).
## Tech
-- **Harness runner**: Bash script that orchestrates experiment runs. Claude Code cannot invoke itself, so the harness must be external. The script reads YAML definitions, computes the grid, and invokes `claude` CLI for each cell. Uses `--permission-mode dontAsk` with `--allowedTools` for non-interactive runs (not `--dangerously-skip-permissions`, which is blocked as root).
-- **Model support**: Primarily Anthropic models (Haiku, Sonnet, Opus). Non-Anthropic models possible via LiteLLM proxy in front of Ollama or similar, but expect reduced feature support (extended thinking, tool use may not work). This is valid benchmark data.
-- Results stored as YAML/JSON (append-only, never overwritten)
-- Each run gets a unique ID
-- Static site for the dashboard
+- Harness: Python orchestrator, bash eval scripts
+- Auth: OAuth token from ~/.claude/.credentials.json, auto-refresh
+- Dashboard: Astro + React + recharts
+- Deploy: Forgejo CI to research subdomain (blue/green)
+- Results committed to repo for dashboard build
## Conventions
- Source control: Forgejo (not GitHub)
- Start conservative with resource-intensive settings
- Never use emdashes
+
+## TODO
+
+### Analysis
+- [ ] PCA analysis: add when 100+ runs exist. One-hot encode categoricals, identify principal components explaining variance. Show which variable combinations matter most.
+- [ ] Pareto frontier analysis: multi-objective optimization (score vs cost, score vs time)
+- [ ] Per-task analysis: when more tasks are added, compare how variables affect different tasks differently
+
+### Eval
+- [ ] Wire functional eval (Playwright tests from gameplay bot) into the score more robustly
+- [ ] Cyclomatic complexity measurement (escomplex or typhonjs-escomplex)
+- [ ] Memory leak detection via Playwright heap snapshots
+- [ ] Frame rate measurement during gameplay
+- [ ] Dead code detection (knip)
+- [ ] Wall kick rotation testing (position piece against wall, try rotate)
+- [ ] I-piece detection in rotation test needs tuning (not reliably found in 60 attempts)
+
+### Dashboard
+- [x] Sortable columns on grid table
+- [x] Show context file content on run detail page when context_file=provided
+- [ ] Add human_language=unspecified option
+- [ ] Run detail: show gameplay bot test results (16 individual tests with pass/fail)
+- [ ] Run detail: show game screenshots from bot at key moments
+- [ ] Inline Tetris artifact previews in grid table (thumbnails)
+- [ ] Re-eval button in UI (trigger reeval.py from dashboard)
+
+### Harness
+- [ ] OAuth token refresh: verify the refresh endpoint works reliably
+- [ ] Auto-commit and push results after sweep completes
+- [ ] Support non-Anthropic models via LiteLLM/Ollama
+- [ ] Add more tasks beyond Tetris
+
+### Data
+- [ ] Complete sonnet sweep (many timed out)
+- [ ] Run opus sweep
+- [ ] Interaction hunt on top variables (model x language x context_file x prompt_style)
diff --git a/dashboard/src/components/Grid.tsx b/dashboard/src/components/Grid.tsx
@@ -43,11 +43,38 @@ function formatTime(seconds: number | null | undefined): string {
return Math.floor(seconds / 60) + "m " + (seconds % 60) + "s";
}
+type SortKey = "task" | "model" | "effort" | "prompt" | "lang" | "score" | "cost" | "time" | "turns";
+
+function getSortValue(run: Run, key: SortKey): string | number {
+ switch (key) {
+ case "task": return run.meta.task;
+ case "model": return run.meta.model;
+ case "effort": return run.meta.effort;
+ case "prompt": return run.meta.prompt_style;
+ case "lang": return run.meta.language;
+ case "score": return run.eval_results?.score ?? -1;
+ case "cost": return run.claude_output?.total_cost_usd ?? -1;
+ case "time": return run.meta.wall_time_seconds ?? -1;
+ case "turns": return run.claude_output?.num_turns ?? -1;
+ }
+}
+
export default function Grid({ runs, axisValues, tasks }: GridProps) {
const [filters, setFilters] = useState<Record<string, string>>({});
+ const [sortKey, setSortKey] = useState<SortKey>("score");
+ const [sortAsc, setSortAsc] = useState(false);
+
+ const handleSort = (key: SortKey) => {
+ if (sortKey === key) {
+ setSortAsc((prev) => !prev);
+ } else {
+ setSortKey(key);
+ setSortAsc(false);
+ }
+ };
const filteredRuns = useMemo(() => {
- return runs.filter((run) => {
+ const filtered = runs.filter((run) => {
for (const [key, value] of Object.entries(filters)) {
if (key === "task") {
if (run.meta.task !== value) return false;
@@ -58,7 +85,14 @@ export default function Grid({ runs, axisValues, tasks }: GridProps) {
}
return true;
});
- }, [runs, filters]);
+
+ return filtered.sort((a, b) => {
+ const va = getSortValue(a, sortKey);
+ const vb = getSortValue(b, sortKey);
+ const cmp = va < vb ? -1 : va > vb ? 1 : 0;
+ return sortAsc ? cmp : -cmp;
+ });
+ }, [runs, filters, sortKey, sortAsc]);
return (
<div>
@@ -73,16 +107,22 @@ export default function Grid({ runs, axisValues, tasks }: GridProps) {
<thead>
<tr>
<th>Run ID</th>
- <th>Task</th>
- <th>Model</th>
- <th>Effort</th>
- <th>Prompt</th>
- <th>Lang</th>
- <th>Score</th>
+ {(["task", "model", "effort", "prompt", "lang", "score", "cost", "time", "turns"] as SortKey[]).map((key) => {
+ const labels: Record<SortKey, string> = {
+ task: "Task", model: "Model", effort: "Effort", prompt: "Prompt",
+ lang: "Lang", score: "Score", cost: "Cost", time: "Time", turns: "Turns",
+ };
+ return (
+ <th
+ key={key}
+ onClick={() => handleSort(key)}
+ style={{ cursor: "pointer", userSelect: "none" }}
+ >
+ {labels[key]} {sortKey === key ? (sortAsc ? "\u25B2" : "\u25BC") : ""}
+ </th>
+ );
+ })}
<th>Pass</th>
- <th>Cost</th>
- <th>Time</th>
- <th>Turns</th>
</tr>
</thead>
<tbody>
diff --git a/dashboard/src/components/RunDetail.tsx b/dashboard/src/components/RunDetail.tsx
@@ -16,6 +16,7 @@ interface RunDetailProps {
run: Run;
transcriptLines: string[];
axisValues: Record<AxisName, string[]>;
+ contextContent?: string;
}
const EXIT_CODES: Record<number, string> = {
@@ -129,7 +130,7 @@ function ScoreBar({ label, score }: { label: string; score: number | null | unde
);
}
-export default function RunDetail({ run, transcriptLines, axisValues }: RunDetailProps) {
+export default function RunDetail({ run, transcriptLines, axisValues, contextContent }: RunDetailProps) {
const { meta, eval_results, claude_output } = run;
// Check if this run has an artifact to preview (tetris games, web apps)
@@ -200,6 +201,12 @@ export default function RunDetail({ run, transcriptLines, axisValues }: RunDetai
if (!active) return null;
return <ConfigPills key={key} label={label} activeValue={active} allValues={all} />;
})}
+ {contextContent && (
+ <div style={{ marginTop: "10px", borderTop: "1px solid var(--border)", paddingTop: "8px" }}>
+ <div style={{ fontSize: "0.7rem", color: "var(--text-muted)", marginBottom: "4px", textTransform: "uppercase", letterSpacing: "0.5px" }}>Context file provided</div>
+ <pre style={{ fontSize: "0.65rem", color: "var(--text-muted)", whiteSpace: "pre-wrap", lineHeight: 1.5, maxHeight: "150px", overflow: "auto" }}>{contextContent}</pre>
+ </div>
+ )}
</div>
{/* Scores */}
diff --git a/dashboard/src/pages/run/[id].astro b/dashboard/src/pages/run/[id].astro
@@ -2,6 +2,8 @@
import Base from "../../layouts/Base.astro";
import { loadAllRuns, loadTranscript, getAxisValues } from "../../lib/data";
import RunDetail from "../../components/RunDetail";
+import fs from "node:fs";
+import path from "node:path";
const allRuns = loadAllRuns();
const axisValues = getAxisValues(allRuns);
@@ -9,16 +11,25 @@ const axisValues = getAxisValues(allRuns);
export function getStaticPaths() {
const runs = loadAllRuns();
const axisValues = getAxisValues(runs);
- return runs.map((run) => ({
- params: { id: run.meta.run_id },
- props: { run, axisValues },
- }));
+ return runs.map((run) => {
+ // Load context file if context_file=provided
+ let contextContent = "";
+ if (run.meta.context_file === "provided") {
+ try {
+ const ctxPath = path.resolve(process.cwd(), `../tasks/${run.meta.task}/context.md`);
+ contextContent = fs.readFileSync(ctxPath, "utf-8");
+ } catch { /* context.md missing or unreadable: leave contextContent empty */ }
+ }
+ return {
+ params: { id: run.meta.run_id },
+ props: { run, axisValues, contextContent },
+ };
+ });
}
-const { run, axisValues: av } = Astro.props;
+const { run, axisValues: av, contextContent } = Astro.props;
const transcriptLines = loadTranscript(run.meta.run_id);
-// Format run ID for display
const parts = run.meta.run_id.split("_run");
const runNum = parts.length > 1 ? `Run #${parts[parts.length - 1]}` : "";
---
@@ -35,7 +46,7 @@ const runNum = parts.length > 1 ? `Run #${parts[parts.length - 1]}` : "";
{run.meta.completed_at || "in progress"}
</p>
- <RunDetail client:load run={run} transcriptLines={transcriptLines} axisValues={av} />
+ <RunDetail client:load run={run} transcriptLines={transcriptLines} axisValues={av} contextContent={contextContent} />
</Base>
<style>