Context update for GPU machine testing - loop-benchmarking - Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.

commit d07dba794c0abd20688958f4185daf3447786621
parent 499b8e496ada5d677434f56c376d41db2517b3a3
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Mon, 13 Apr 2026 13:58:14 +0200

Context update for GPU machine testing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
A tasks/tetris/eval/gameplay-bot/COMPETITIVE_PLAY_SPEC.md  | 210 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
D tasks/tetris/eval/gameplay-bot/NEXT_SESSION_SPEC.md  | 66 ------------------------------------------------------------------
M tasks/tetris/eval/playwright.config.ts  | 2 +-

3 files changed, 211 insertions(+), 67 deletions(-)
diff --git a/tasks/tetris/eval/gameplay-bot/COMPETITIVE_PLAY_SPEC.md b/tasks/tetris/eval/gameplay-bot/COMPETITIVE_PLAY_SPEC.md
@@ -0,0 +1,210 @@
+# Competitive Play Phase -- Bot Upgrade Spec
+
+## Overview
+
+Add a new Phase 7 after the existing 16 pass/fail tests. This phase plays the game competitively for 60 seconds, recording detailed gameplay data. It's not pass/fail -- it produces metrics that reveal bugs the basic tests miss.
+
+## When it runs
+
+- Only if Phase 4 (gameplay) succeeded: the bot can place pieces
+- After Phase 6 (endurance), on a fresh page reload
+- If Phase 4 failed, skip with empty competitive_play data
+
+## What it does
+
+1. Reload the page, calibrate, start the game
+2. Play using the AI player for 60 seconds (or until game over)
+3. Record everything that happens
+
+## Data recorded (added to the gameplay bot report as `competitive_play`)
+
+```json
+{
+  "duration_seconds": 45,
+  "pieces_placed": 62,
+  "total_lines_cleared": 18,
+  "single_clears": 12,
+  "double_clears": 2,
+  "triple_clears": 1,
+  "tetris_clears": 0,
+  "max_combo": 3,
+  "score_readings": [0, 100, 200, 500, 800, 1300, ...],
+  "score_final": 4200,
+  "score_increases": [100, 100, 300, 300, 500, ...],
+  "level_readings": [1, 1, 1, 2, 2, 3],
+  "level_final": 3,
+  "lines_display_readings": [0, 1, 2, 4, 5, 8, ...],
+  "game_over_reached": true,
+  "game_over_text_found": "Game Over",
+  "restart_available": true,
+  "next_piece_visible": true,
+  "speed_increased": true,
+  "bugs_detected": [
+    "multi_line_clear_only_removes_one_row",
+    "score_does_not_scale_with_simultaneous_clears",
+    "level_does_not_increase"
+  ]
+}
+```
+
+## Bug detection logic
+
+During play, the bot watches for specific anomalies:
+
+### 1. Multi-line clear bug
+- When the grid reader detects 2+ complete rows simultaneously, watch how many rows disappear
+- If only 1 row disappears when 2+ were complete, flag: `multi_line_clear_only_removes_one_row`
+
+### 2. Score scaling bug  
+- Track score before and after each line clear event
+- For single clears, record the score delta
+- For multi-line clears (2+ rows), check if the delta is larger than a single clear
+- If multi-line clear gives the same delta as a single, flag: `score_does_not_scale_with_simultaneous_clears`
+
+### 3. Level progression bug
+- Track score/lines and level readings throughout the session
+- If lines_cleared reaches 10+ but level stays at 1, flag: `level_does_not_increase`
+
+### 4. Speed progression bug
+- Measure time between auto-drops at the start vs after 10+ lines cleared
+- If the interval doesn't decrease, flag: `speed_does_not_increase`
+
+### 5. Next piece preview
+- Check for a "next piece" display area (look for a small canvas/div near the main grid showing a single piece)
+- Record: `next_piece_visible: true/false`
+
+### 6. Game over handling
+- When the grid fills to the top, check if:
+  - Game stops accepting input
+  - "Game Over" or similar text appears
+  - A restart button/prompt appears
+- Record each separately
+
+### 7. Counter-clockwise rotation
+- During play, occasionally press Z key instead of Up arrow
+- Check if the piece rotates the opposite direction
+- Record: `counter_clockwise_rotation_works: true/false`
+
+### 8. Soft drop vs hard drop
+- Verify Down arrow moves piece one row (soft drop) vs Space drops to bottom (hard drop)
+- If Down arrow drops to bottom same as Space, flag: `soft_drop_acts_as_hard_drop`
+
+## Implementation approach
+
+### New function: `runCompetitivePlayPhase()`
+
+```typescript
+async function runCompetitivePlayPhase(
+  page: Page,
+  cal: CalibrationResult,
+  session: GameSession,
+  gameplay: GameplayStats
+): Promise<CompetitivePlayResult> {
+  const result: CompetitivePlayResult = { ... };
+  
+  // Play using AI with integrated monitoring
+  const startTime = Date.now();
+  let lastScore = 0;
+  let lastLevel = 1;
+  let lastLines = 0;
+  
+  // Use playGame but with a callback on each piece placement
+  // that reads score, level, lines, and checks for anomalies
+  
+  // After each piece placement:
+  // 1. Read score element
+  // 2. Read grid for complete rows (before they clear)
+  // 3. Wait for clear animation
+  // 4. Read grid again (after clear)
+  // 5. Count how many rows actually disappeared
+  // 6. Compare to how many were complete
+  
+  // Every 10 pieces, try a Z-key rotation
+  // Every 5 pieces, check level display
+  
+  return result;
+}
+```
+
+### How to detect multi-line clears
+
+This is the critical measurement. Between piece placements:
+1. Read grid immediately after piece locks (before clear animation)
+2. Count complete rows (all cells filled)
+3. Wait 200-500ms for clear animation
+4. Read grid again
+5. Count how many rows actually disappeared
+6. If complete_rows > disappeared_rows, it's the multi-line bug
+
+The grid reader can count complete rows with `countCompleteRows()`, which already exists in `grid-reader.ts`.
+
+### Score monitoring
+
+During competitive play, read the score element on every piece placement (the gameplay phase already does this with integrated score tracking). Track every score delta. Group deltas by the number of lines cleared in that event. Check if deltas scale:
+- Single clear delta D
+- Double clear should be ~3x D
+- Triple should be ~5x D  
+- Tetris should be ~8x D
+
+Exact ratios depend on the game's scoring formula, but they should NOT all be equal.
+
+### Speed monitoring
+
+Record timestamps of auto-drops (piece moving down without input). At level 1, the interval should be ~800ms. After level increases, it should decrease. Compare average intervals: first 10 pieces vs last 10 pieces.
+
+## Integration with existing report
+
+Add `competitive_play` as a new field in the gameplay bot report, alongside `tests`, `implementation`, `gameplay`, `session`:
+
+```json
+{
+  "implementation": { ... },
+  "tests": [ ... ],
+  "summary": { ... },
+  "gameplay": { ... },
+  "session": { ... },
+  "competitive_play": { ... }  // NEW
+}
+```
+
+## Additional tests (new pass/fail tests added to the test suite)
+
+The 8 bug checks become additional tests (17-24) with three possible outcomes:
+- **pass**: we tested it and it works correctly
+- **fail**: we tested it and found a bug
+- **skip**: we didn't get an opportunity to test (e.g., no multi-line clear happened)
+
+New test names:
+- `multi_line_clear`: multiple complete rows clear simultaneously
+- `score_scaling`: score increases proportionally with multi-line clears
+- `level_progression`: level increases after clearing 10+ lines
+- `speed_progression`: drop speed increases with level
+- `next_piece_preview`: next piece display is visible
+- `game_over_display`: game over message and restart option shown
+- `counter_clockwise_rotation`: Z key rotates opposite to Up arrow
+- `soft_drop_distinct`: Down arrow moves one row, not same as hard drop
+
+These tests are appended to the existing 16. The total becomes up to 24 tests. Score scaling data is tracked regardless (score_readings, score_increases arrays) for analysis.
+
+## Dashboard display
+
+On the run detail page, add a "Competitive Play" detail card showing:
+- Duration, pieces placed, lines cleared breakdown
+- Score progression (small sparkline)
+- Bugs detected (red badges)
+- Next piece / game over / restart status
+
+## Files to modify
+
+1. `tests.ts` -- add Phase 7 (`runCompetitivePlayPhase`), import the `CompetitivePlayResult` type, call the phase from `runAllTests`, and include its data in the return value
+2. `types.ts` -- add `CompetitivePlayResult` interface
+3. `player.ts` -- may need a variant of `playGame` that calls back after each piece for monitoring (or just use the existing one and do monitoring externally via polling)
+4. `index.ts` -- include competitive_play data in the report output
+5. `dashboard/src/components/RunDetail.tsx` -- add competitive play detail card
+
+## What NOT to change
+
+- The 16 existing tests and their pass/fail logic
+- The gameplay bot score calculation, apart from updating it to include the new tests (out of 24 total)
+- The grid reader or calibrate modules
+- The AI player heuristics
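
The multi-line clear check described in COMPETITIVE_PLAY_SPEC.md ("How to detect multi-line clears") could be sketched roughly as follows. This is a minimal illustration, not the bot's code: the `Grid` type, `countFullRows`, and `checkMultiLineClear` are hypothetical names, and the spec's existing `countCompleteRows()` in grid-reader.ts is not used here because its signature is not shown in this commit. The sketch assumes a row that is still complete after the clear animation was not removed.

```typescript
// Hypothetical grid representation: one boolean per cell, true = filled.
type Grid = boolean[][];

// Count rows in which every cell is filled.
function countFullRows(grid: Grid): number {
  return grid.filter((row) => row.length > 0 && row.every(Boolean)).length;
}

interface ClearCheck {
  completeBefore: number; // complete rows right after the piece locked
  disappeared: number;    // rows that were actually removed
  bug: boolean;           // true => multi_line_clear_only_removes_one_row
}

// Compare the grid read right after the lock with the grid read after the
// clear animation. Rows that are still complete afterwards were not cleared.
function checkMultiLineClear(before: Grid, after: Grid): ClearCheck {
  const completeBefore = countFullRows(before);
  const stillComplete = countFullRows(after);
  const disappeared = Math.max(0, completeBefore - stillComplete);
  return {
    completeBefore,
    disappeared,
    bug: completeBefore >= 2 && disappeared < completeBefore,
  };
}
```

When fewer than two rows were complete, `bug` stays false, which corresponds to the "skip" outcome in the spec: there was no opportunity to observe a multi-line clear.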
diff --git a/tasks/tetris/eval/gameplay-bot/NEXT_SESSION_SPEC.md b/tasks/tetris/eval/gameplay-bot/NEXT_SESSION_SPEC.md
@@ -1,66 +0,0 @@
-# Gameplay Bot Rewrite Spec
-
-## Problem
-Bot has false positives because it thinks the game started when it didn't.
-Current start detection clicks canvas and checks if any pixel changed -- 
-this triggers on title screens, hover effects, animations.
-
-## New Start Detection
-
-Universal signal: **a piece is falling**. After each trigger attempt,
-run a falling piece detector instead of screenshot comparison.
-
-### Trigger sequence (try each, check for falling piece after each):
-1. Wait 3s (auto-start)
-2. Click canvas center
-3. Press Enter  
-4. Press Space
-5. Click body at various positions
-6. Press various keys (arrow down, Z, etc.)
-
-### Falling piece detector:
-- Take 3 screenshots ~1s apart
-- Find a rectangular cluster of colored pixels (~4 cells) that moved downward
-- "Roughly square-ish" -- tetromino bounding box is 2x2 to 4x1
-- May have rounded edges, glows, shadows -- look for the bounding box
-- Works for canvas, DOM, SVG, WebGL -- any rendering approach
-- If piece already at bottom, detect new piece spawning at top instead
-- Consider: games might render pieces as individual DOM divs, SVG rects, 
-  canvas fills, or WebGL quads
-
-### If no falling piece after all triggers:
-- Game did not start
-- All downstream tests: "skipped: game did not start"
-- Zero false positives
-
-## Conditional Phase Execution
-
-Each phase depends on the previous succeeding:
-
-1. **Load + calibrate**: always runs
-2. **Start detection**: try triggers, confirm falling piece
-3. **Mechanics test**: only if game started (piece detected)
-4. **Gameplay (play to win)**: only if mechanics worked
-5. **Game over**: only if pieces can be placed. Must stack pieces to top
-   and verify via grid reader (filled cells in top rows), NOT screenshot comparison
-6. **Endurance**: only if gameplay phase succeeded
-
-Failed prerequisites -> "skipped: [prerequisite] failed" on all downstream tests.
-No more false positives from static screens.
-
-## Game Over Fix
-
-Current: screenshot comparison (nothing changed = game over). 
-This false-positives on static start screens.
-
-New: 
-1. Actually place pieces (hard drop repeatedly)
-2. Verify via grid reader that filled cells reach top rows
-3. Then check if inputs stop having effect (piece doesn't spawn)
-4. Optionally look for "game over" text in DOM
-
-## Notes
-- Games might auto-start (no button needed)
-- Start buttons might be canvas-rendered (no DOM button to find)
-- Some games have splash screens with animations (pixel change != game start)
-- The key insight: a FALLING PIECE is the only universal signal that gameplay began
diff --git a/tasks/tetris/eval/playwright.config.ts b/tasks/tetris/eval/playwright.config.ts
@@ -3,7 +3,7 @@ import { defineConfig } from "@playwright/test";
 export default defineConfig({
   testDir: "./gameplay-bot",
   testMatch: "index.ts",
-  timeout: 240_000, // 4 minutes per test
+  timeout: 60_000, // 1 minute per individual test (each test sets its own)
   retries: 0,
   workers: 1, // sequential -- only one game at a time
   reporter: [["list"]],
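
The score scaling check from COMPETITIVE_PLAY_SPEC.md ("Score monitoring") could be expressed along these lines. It is a sketch under assumptions: `ClearEvent`, `averageDeltaByLines`, and `scoreScalingOutcome` are hypothetical names, and the only rule applied is the spec's requirement that multi-line deltas must not equal the single-clear delta. The approximate 3x/5x/8x ratios are deliberately not enforced, since the spec notes that exact ratios depend on the game's scoring formula.

```typescript
// One line-clear event: how many rows cleared and how much the score rose.
interface ClearEvent {
  linesCleared: number;
  scoreDelta: number;
}

type Outcome = "pass" | "fail" | "skip";

// Average score delta for each clear size (1 = single, 2 = double, ...).
function averageDeltaByLines(events: ClearEvent[]): Map<number, number> {
  const sums = new Map<number, { total: number; count: number }>();
  for (const e of events) {
    if (e.linesCleared < 1) continue; // not a clear event
    const s = sums.get(e.linesCleared) ?? { total: 0, count: 0 };
    s.total += e.scoreDelta;
    s.count += 1;
    sums.set(e.linesCleared, s);
  }
  const averages = new Map<number, number>();
  sums.forEach((s, lines) => averages.set(lines, s.total / s.count));
  return averages;
}

// pass: every multi-line average exceeds the single-clear average
// fail: some multi-line average is no larger than the single-clear average
// skip: no single clear or no multi-line clear was observed
function scoreScalingOutcome(events: ClearEvent[]): Outcome {
  const averages = averageDeltaByLines(events);
  const single = averages.get(1);
  const multi: number[] = [];
  averages.forEach((avg, lines) => {
    if (lines >= 2) multi.push(avg);
  });
  if (single === undefined || multi.length === 0) return "skip";
  return multi.every((avg) => avg > single) ? "pass" : "fail";
}
```

One caveat with this approach: if the game multiplies points by the current level, deltas recorded at different levels are not directly comparable, so a real implementation might compare only events observed at the same level.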
