Switch production eval to V2 gameplay bot - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit 821022cb2060158a630e184a67c95678e8dca7c7
parent 00055378a50253cc949795147e20b64ed2a2767f
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Sun, 12 Apr 2026 17:43:33 +0200

Switch production eval to V2 gameplay bot

The harness now uses gameplay-bot-v2 (two-tier architecture) when it is
available and falls back to V1 otherwise. V2 reaches 95% agreement with
human calibration, versus 58% for V1.

Expect breakage on canvas games when no GPU is available (getImageData
returns zeros). DOM games should work well.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
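
The canvas failure mode noted in the message (getImageData returning zeros) could in principle be detected before scoring. The sketch below is an illustration only: `is_blank_rgba` is a hypothetical helper, not part of the harness, and it assumes the bot has already handed the raw RGBA bytes of a canvas readback to Python.

```python
def is_blank_rgba(pixels: bytes) -> bool:
    """Return True when a non-empty RGBA buffer contains only zero bytes.

    Hypothetical helper: an all-zero buffer means every pixel is fully
    transparent black, which is what a failed canvas readback looks like.
    An empty buffer is reported as not blank because there is nothing to judge.
    """
    return len(pixels) > 0 and not any(pixels)
```

A run flagged this way could be reported as "canvas unreadable" rather than scored as a gameplay failure, though whether the harness should do so is a design choice this commit does not address.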

Diffstat:
M harness/run.py  | 8 ++++++--

1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/harness/run.py b/harness/run.py
@@ -442,7 +442,10 @@ def evaluate(task_dir: Path, workspace: Path, cell: dict, run_dir: Path):
             results["transcript_analysis"] = {"error": str(e), "score": 0}
 
     # Gameplay bot (Playwright-based interactive testing, e.g. Tetris)
-    gameplay_bot_entry = task_dir / "eval" / "gameplay-bot" / "index.ts"
+    # Use V2 bot (two-tier architecture) if available, fall back to V1
+    gameplay_bot_v2 = task_dir / "eval" / "gameplay-bot-v2" / "index.ts"
+    gameplay_bot_v1 = task_dir / "eval" / "gameplay-bot" / "index.ts"
+    gameplay_bot_entry = gameplay_bot_v2 if gameplay_bot_v2.exists() else gameplay_bot_v1
     if gameplay_bot_entry.exists():
         # Pre-check: is there an HTML file to test?
         html_files = list(workspace.rglob("*.html"))
@@ -455,7 +458,8 @@ def evaluate(task_dir: Path, workspace: Path, cell: dict, run_dir: Path):
             }
         else:
             report_path = run_dir / "gameplay-bot-report.json"
-            playwright_config = task_dir / "eval" / "playwright.config.ts"
+            bot_dir = gameplay_bot_entry.parent
+            playwright_config = bot_dir / "playwright.config.ts"
             try:
                 bot_env = os.environ.copy()
                 bot_env["WORKSPACE_PATH"] = str(workspace)
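
The selection logic in the two hunks above can be read as one lookup: prefer the V2 entry point, fall back to V1, and take the Playwright config from whichever bot directory was chosen. The following is a minimal sketch of that reading; `resolve_gameplay_bot` is a hypothetical name and is not a function in harness/run.py.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_gameplay_bot(task_dir: Path) -> Optional[Tuple[Path, Path]]:
    """Return (entry, playwright_config) for the preferred bot, or None.

    Hypothetical helper mirroring the diff: V2 is tried first, then V1,
    and the config path is derived from the chosen bot's own directory.
    """
    eval_dir = task_dir / "eval"
    for name in ("gameplay-bot-v2", "gameplay-bot"):
        entry = eval_dir / name / "index.ts"
        if entry.exists():
            return entry, entry.parent / "playwright.config.ts"
    return None
```

One consequence worth checking: under the V1 fallback the config is now looked up in eval/gameplay-bot/ rather than eval/, where the pre-change code looked. Whether V1 tasks ship a config at that location is not visible from this diff.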

	git clone https://git.shiptheloop.com/loop-benchmarking.git