loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

commit 7cad71376c1153a15b0ef0e87176d2c5ee230578
parent 7489b45ee21ecf7850b2b3f445534eae15f26b10
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Tue,  7 Apr 2026 10:39:16 +0200

Discard runs with 0 turns before eval/commit

If Claude produced nothing (num_turns=0), delete the run directory
immediately instead of evaluating and persisting garbage data.
Resume logic will retry the cell next time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
M harness/run.py | 13 +++++++++++++
1 file changed, 13 insertions(+), 0 deletions(-)

diff --git a/harness/run.py b/harness/run.py
@@ -714,6 +714,19 @@ def run_single(
     meta["completed_at"] = datetime.now(timezone.utc).isoformat()
     (run_dir / "meta.json").write_text(json.dumps(meta, indent=2))
 
+    # Guard: if claude produced nothing (0 turns), discard the run
+    output_path = run_dir / "claude_output.json"
+    if output_path.exists():
+        try:
+            output = json.loads(output_path.read_text())
+            if (output.get("num_turns") or 0) == 0:
+                log(f"  DISCARD: {run_id} - 0 turns (no work done)")
+                shutil.rmtree(run_dir, ignore_errors=True)
+                shutil.rmtree(workspace, ignore_errors=True)
+                return "failed"
+        except Exception:
+            pass
+
     # Evaluate
     task_dir = project_dir / "tasks" / task
     evaluate(task_dir, workspace, cell, run_dir)
