loop-benchmarking

Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
git clone https://git.shiptheloop.com/loop-benchmarking.git

COMPETITIVE_PLAY_SPEC.md (7765B)


# Competitive Play Phase -- Bot Upgrade Spec

## Overview

Add a new Phase 7 after the existing 16 pass/fail tests. This phase plays the game competitively for up to 60 seconds, recording detailed gameplay data. The phase itself is not pass/fail -- it produces metrics that reveal bugs the basic tests miss.

## When it runs

- Only if Phase 4 (gameplay) succeeded, i.e. the bot can place pieces
- After Phase 6 (endurance), on a fresh page reload
- If Phase 4 failed, skip with empty competitive_play data

## What it does

1. Reload the page, calibrate, start the game
2. Play using the AI player for 60 seconds (or until game over)
3. Record everything that happens

## Data recorded (added to the gameplay bot report as `competitive_play`)

```json
{
  "duration_seconds": 45,
  "pieces_placed": 62,
  "total_lines_cleared": 19,
  "single_clears": 12,
  "double_clears": 2,
  "triple_clears": 1,
  "tetris_clears": 0,
  "max_combo": 3,
  "score_readings": [0, 100, 200, 500, 800, 1300, ...],
  "score_final": 4200,
  "score_increases": [100, 100, 300, 300, 500, ...],
  "level_readings": [1, 1, 1, 2, 2, 3],
  "level_final": 3,
  "lines_display_readings": [0, 1, 2, 4, 5, 8, ...],
  "game_over_reached": true,
  "game_over_text_found": "Game Over",
  "restart_available": true,
  "next_piece_visible": true,
  "speed_increased": true,
  "bugs_detected": [
    "multi_line_clear_only_removes_one_row",
    "score_does_not_scale_with_simultaneous_clears",
    "level_does_not_increase"
  ]
}
```

The values are illustrative. `duration_seconds` is below 60 here because the game ended early (`game_over_reached: true`). The `bugs_detected` entries show the flag format only; they are not meant to match the sample readings.
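A possible shape for the `CompetitivePlayResult` interface, derived field by field from the sample above. This is a sketch, not the actual contents of `types.ts`: the nullable `game_over_text_found`, the `emptyCompetitivePlayResult()` helper (one reading of "skip with empty competitive_play data") and the `linesFromClearCounts()` consistency check are all assumptions made for illustration.

```typescript
// Sketch of the competitive_play payload; field names follow the JSON sample.
interface CompetitivePlayResult {
  duration_seconds: number;
  pieces_placed: number;
  total_lines_cleared: number;
  single_clears: number;
  double_clears: number;
  triple_clears: number;
  tetris_clears: number;
  max_combo: number;
  score_readings: number[];
  score_final: number;
  score_increases: number[];
  level_readings: number[];
  level_final: number;
  lines_display_readings: number[];
  game_over_reached: boolean;
  game_over_text_found: string | null; // assumption: null when no text was found
  restart_available: boolean;
  next_piece_visible: boolean;
  speed_increased: boolean;
  bugs_detected: string[];
}

// Hypothetical helper: a zeroed result for the case where Phase 7 is skipped.
function emptyCompetitivePlayResult(): CompetitivePlayResult {
  return {
    duration_seconds: 0,
    pieces_placed: 0,
    total_lines_cleared: 0,
    single_clears: 0,
    double_clears: 0,
    triple_clears: 0,
    tetris_clears: 0,
    max_combo: 0,
    score_readings: [],
    score_final: 0,
    score_increases: [],
    level_readings: [],
    level_final: 0,
    lines_display_readings: [],
    game_over_reached: false,
    game_over_text_found: null,
    restart_available: false,
    next_piece_visible: false,
    speed_increased: false,
    bugs_detected: [],
  };
}

// Hypothetical consistency check: total lines implied by the per-size clear counts.
function linesFromClearCounts(r: CompetitivePlayResult): number {
  return r.single_clears + 2 * r.double_clears + 3 * r.triple_clears + 4 * r.tetris_clears;
}
```

Comparing `linesFromClearCounts(result)` against `total_lines_cleared` is a cheap way to catch bookkeeping mistakes in the bot itself before blaming the game.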

## Bug detection logic

During play, the bot watches for specific anomalies:

### 1. Multi-line clear bug
- When the grid reader detects 2+ complete rows simultaneously, watch how many rows disappear
- If only 1 row disappears when 2+ were complete, flag: `multi_line_clear_only_removes_one_row`

### 2. Score scaling bug
- Track score before and after each line clear event
- For single clears, record the score delta
- For multi-line clears (2+ rows), check if the delta is larger than a single clear
- If a multi-line clear gives the same delta as a single, flag: `score_does_not_scale_with_simultaneous_clears`

### 3. Level progression bug
- Track score/lines and level readings throughout the session
- If `total_lines_cleared` reaches 10+ but level stays at 1, flag: `level_does_not_increase`
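The level check reduces to a small pure function over the recorded readings. The sketch below assumes one level reading is taken alongside each lines-cleared reading, so the two arrays are index-aligned; the function name and the `Verdict` type are illustrative, not part of the existing bot.

```typescript
type Verdict = "pass" | "fail" | "skip";

// Assumption: linesReadings[i] and levelReadings[i] were sampled at the same moment.
function levelProgressionVerdict(linesReadings: number[], levelReadings: number[]): Verdict {
  let reachedThreshold = false;
  for (let i = 0; i < linesReadings.length; i++) {
    const lines = linesReadings[i];
    const level = levelReadings[i];
    if (lines === undefined || level === undefined) continue;
    if (lines >= 10) {
      reachedThreshold = true;
      // Any level above 1 once 10+ lines are cleared counts as progression.
      if (level > 1) return "pass";
    }
  }
  // Never reaching 10 lines means the check had no opportunity to run.
  return reachedThreshold ? "fail" : "skip";
}
```

A `"fail"` here corresponds to the `level_does_not_increase` flag; `"skip"` means the session ended before 10 lines were cleared.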

### 4. Speed progression bug
- Measure time between auto-drops at the start vs after 10+ lines cleared
- If the interval doesn't decrease, flag: `speed_does_not_increase`

### 5. Next piece preview
- Check for a "next piece" display area (look for a small canvas/div near the main grid showing a single piece)
- Record: `next_piece_visible: true/false`

### 6. Game over handling
- When the grid fills to the top, check if:
  - Game stops accepting input
  - "Game Over" or similar text appears
  - A restart button/prompt appears
- Record each separately

### 7. Counter-clockwise rotation
- During play, occasionally press the Z key instead of Up arrow
- Check if the piece rotates in the opposite direction
- Record: `counter_clockwise_rotation_works: true/false`

### 8. Soft drop vs hard drop
- Verify that Down arrow moves the piece one row (soft drop) while Space drops it to the bottom (hard drop)
- If Down arrow drops to the bottom the same as Space, flag: `soft_drop_acts_as_hard_drop`
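One way to phrase the soft-drop check as code, assuming the bot can read the active piece's row (0 = top) before and after a single Down press and knows the row where the piece would land. The parameter names, the three-row minimum distance (to leave room for one gravity tick during the measurement) and the treatment of "did not move at all" as a failure are assumptions of this sketch.

```typescript
type Verdict = "pass" | "fail" | "skip";

interface SoftDropCheck {
  verdict: Verdict;
  flag: string | null;
}

function checkSoftDrop(rowBefore: number, rowAfterDown: number, landingRow: number): SoftDropCheck {
  const distanceToLanding = landingRow - rowBefore;
  // Too close to the landing row: a one-row move and a full drop look the same.
  if (distanceToLanding < 3) return { verdict: "skip", flag: null };

  const rowsMoved = rowAfterDown - rowBefore;
  if (rowsMoved >= distanceToLanding) {
    return { verdict: "fail", flag: "soft_drop_acts_as_hard_drop" };
  }
  // Assumption: a Down press that moves nothing is also a failure, but unflagged.
  if (rowsMoved < 1) return { verdict: "fail", flag: null };
  return { verdict: "pass", flag: null };
}
```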

## Implementation approach

### New function: `runCompetitivePlayPhase()`

```typescript
async function runCompetitivePlayPhase(
  page: Page,
  cal: CalibrationResult,
  session: GameSession,
  gameplay: GameplayStats
): Promise<CompetitivePlayResult> {
  const result: CompetitivePlayResult = { ... };

  // Play using AI with integrated monitoring
  const startTime = Date.now();
  let lastScore = 0;
  let lastLevel = 1;
  let lastLines = 0;

  // Use playGame but with a callback on each piece placement
  // that reads score, level, lines, and checks for anomalies

  // After each piece placement:
  // 1. Read score element
  // 2. Read grid for complete rows (before they clear)
  // 3. Wait for clear animation
  // 4. Read grid again (after clear)
  // 5. Count how many rows actually disappeared
  // 6. Compare to how many were complete

  // Every 10 pieces, try a Z-key rotation
  // Every 5 pieces, check level display

  return result;
}
```
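The per-piece schedule described in the comments (score and grid on every piece, level every 5 pieces, Z-key rotation every 10) and the 60-second cutoff can be kept as two small pure helpers, which makes them easy to test without a browser. The names `MonitorAction`, `actionsAfterPiece` and `shouldKeepPlaying` are hypothetical and not part of the existing bot.

```typescript
type MonitorAction = "read_score" | "check_line_clear" | "check_level" | "try_z_rotation";

// piecesPlaced is the 1-based count of pieces placed so far in this phase.
function actionsAfterPiece(piecesPlaced: number): MonitorAction[] {
  const actions: MonitorAction[] = ["read_score", "check_line_clear"];
  if (piecesPlaced % 5 === 0) actions.push("check_level");
  if (piecesPlaced % 10 === 0) actions.push("try_z_rotation");
  return actions;
}

// Stop after 60 seconds or as soon as game over is detected, whichever comes first.
function shouldKeepPlaying(
  startTimeMs: number,
  nowMs: number,
  gameOver: boolean,
  limitMs: number = 60 * 1000
): boolean {
  return !gameOver && nowMs - startTimeMs < limitMs;
}
```

Inside the real loop, `nowMs` would come from `Date.now()` and each action would map to a page read or key press.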

### How to detect multi-line clears

This is the critical measurement. Between piece placements:
1. Read grid immediately after piece locks (before clear animation)
2. Count complete rows (all cells filled)
3. Wait 200-500ms for clear animation
4. Read grid again
5. Count how many rows actually disappeared
6. If complete_rows > disappeared_rows, it's the multi-line bug

The grid reader can count complete rows with `countCompleteRows()`, which already exists in grid-reader.ts.
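The comparison in steps 2, 5 and 6 can be sketched as a pure function over two grid snapshots. The real signature of `countCompleteRows()` is not shown in this spec, so the block uses a local stand-in, `countFullRows()`, over an assumed `boolean[][]` grid (rows top to bottom, `true` = filled). It counts a row as disappeared when it was complete before the clear and is no longer complete afterwards, and it only raises the multi-line flag when 2+ rows were complete.

```typescript
type Grid = boolean[][]; // assumption: rows top-to-bottom, true = filled cell

// Local stand-in for countCompleteRows() in grid-reader.ts (real signature unknown).
function countFullRows(grid: Grid): number {
  return grid.filter((row) => row.length > 0 && row.every((cell) => cell)).length;
}

interface ClearObservation {
  completeRows: number;
  disappearedRows: number;
  multiLineBug: boolean;
}

function observeClear(beforeClear: Grid, afterClear: Grid): ClearObservation {
  const completeRows = countFullRows(beforeClear);
  // Complete rows still present after the animation were not removed.
  const disappearedRows = Math.max(0, completeRows - countFullRows(afterClear));
  return {
    completeRows,
    disappearedRows,
    multiLineBug: completeRows >= 2 && disappearedRows < completeRows,
  };
}
```

When `multiLineBug` is true, the bot would add `multi_line_clear_only_removes_one_row` to `bugs_detected`.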

### Score monitoring

During competitive play, read the score element on every piece placement (the gameplay phase already does this with integrated score tracking). Track every score delta and group the deltas by the number of lines cleared in that event. Then check whether the deltas scale relative to the single-clear delta D:
- Single clear: D
- Double clear should be ~3x D
- Triple should be ~5x D
- Tetris should be ~8x D

Exact ratios depend on the game's scoring formula, but they should NOT all be equal.
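Grouping deltas by clear size and comparing averages might look like the sketch below. It deliberately checks only that every multi-line average exceeds the single-clear average, not the exact 3x/5x/8x ratios, since those depend on the game. It also assumes the deltas are comparable, which would not hold if the game multiplies scores by level and the events span several levels. `ClearEvent` and both function names are illustrative.

```typescript
type Verdict = "pass" | "fail" | "skip";

interface ClearEvent {
  lines: number; // rows cleared in this event (1 to 4)
  delta: number; // score change observed for this event
}

// Average score delta for events that cleared exactly `lines` rows, or null if none.
function averageDelta(events: ClearEvent[], lines: number): number | null {
  const deltas = events.filter((e) => e.lines === lines).map((e) => e.delta);
  if (deltas.length === 0) return null;
  return deltas.reduce((sum, d) => sum + d, 0) / deltas.length;
}

function scoreScalingVerdict(events: ClearEvent[]): Verdict {
  const single = averageDelta(events, 1);
  const multi = [2, 3, 4]
    .map((lines) => averageDelta(events, lines))
    .filter((avg): avg is number => avg !== null);
  // Both a single and at least one multi-line clear are needed to compare.
  if (single === null || multi.length === 0) return "skip";
  return multi.every((avg) => avg > single) ? "pass" : "fail";
}
```

A `"fail"` corresponds to the `score_does_not_scale_with_simultaneous_clears` flag.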

### Speed monitoring

Record timestamps of auto-drops (piece moving down without input). At level 1, the interval should be ~800ms. After level increases, it should decrease. Compare average intervals: first 10 pieces vs last 10 pieces.
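A sketch of the first-10 versus last-10 comparison. It assumes the caller has already converted auto-drop timestamps into one representative drop interval per piece (in milliseconds), and it uses a hypothetical 5% tolerance so that timing jitter alone does not count as a speed-up; neither detail comes from the spec.

```typescript
type Verdict = "pass" | "fail" | "skip";

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// intervalsMs: one auto-drop interval per piece, in play order (assumption).
function speedProgressionVerdict(
  intervalsMs: number[],
  windowSize: number = 10,
  tolerance: number = 0.05
): Verdict {
  // Need two non-overlapping windows to compare.
  if (intervalsMs.length < 2 * windowSize) return "skip";
  const early = mean(intervalsMs.slice(0, windowSize));
  const late = mean(intervalsMs.slice(-windowSize));
  // The late interval must be clearly shorter than the early one.
  return late < early * (1 - tolerance) ? "pass" : "fail";
}
```

A stricter variant could also return `"skip"` when the level never rose during the session, since no speed-up would be expected in that case.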

## Integration with existing report

Add `competitive_play` as a new field in the gameplay bot report, alongside `implementation`, `tests`, `summary`, `gameplay`, `session`:

```json
{
  "implementation": { ... },
  "tests": [ ... ],
  "summary": { ... },
  "gameplay": { ... },
  "session": { ... },
  "competitive_play": { ... }  // NEW
}
```

## Additional tests (new pass/fail tests added to the test suite)

The 8 bug checks become additional tests (17-24) with three possible outcomes:
- **pass**: we tested it and it works correctly
- **fail**: we tested it and found a bug
- **skip**: we didn't get an opportunity to test (e.g., no multi-line clear happened)
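The three outcomes map naturally onto two counters per check: how many times the bot had an opportunity to test something, and how many of those showed the bug. The sketch below illustrates that mapping; the `BugCheckTest` shape is hypothetical, since the actual entry format of the report's `tests` array is not shown in this spec.

```typescript
type Outcome = "pass" | "fail" | "skip";

// Hypothetical entry shape for tests 17-24.
interface BugCheckTest {
  name: string;
  outcome: Outcome;
}

// opportunities: times the check could be exercised; failures: times it showed the bug.
function toOutcome(opportunities: number, failures: number): Outcome {
  if (opportunities === 0) return "skip";
  return failures > 0 ? "fail" : "pass";
}

function countOutcomes(tests: BugCheckTest[]): { passed: number; failed: number; skipped: number } {
  const tally = { passed: 0, failed: 0, skipped: 0 };
  for (let i = 0; i < tests.length; i++) {
    const test = tests[i];
    if (test === undefined) continue;
    if (test.outcome === "pass") tally.passed += 1;
    else if (test.outcome === "fail") tally.failed += 1;
    else tally.skipped += 1;
  }
  return tally;
}
```

For example, a session with three double clears that all removed both rows would give `toOutcome(3, 0)`, i.e. `"pass"` for `multi_line_clear`, while a session with no multi-line clears gives `toOutcome(0, 0)`, i.e. `"skip"`.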

New test names:
- `multi_line_clear`: multiple complete rows clear simultaneously
- `score_scaling`: score increases proportionally with multi-line clears
- `level_progression`: level increases after clearing 10+ lines
- `speed_progression`: drop speed increases with level
- `next_piece_preview`: next piece display is visible
- `game_over_display`: game over message and restart option shown
- `counter_clockwise_rotation`: Z key rotates opposite to Up arrow
- `soft_drop_distinct`: Down arrow moves one row, not same as hard drop

These tests are appended to the existing 16. The total becomes up to 24 tests. Score scaling data is tracked regardless (`score_readings`, `score_increases` arrays) for analysis.

## Dashboard display

On the run detail page, add a "Competitive Play" detail card showing:
- Duration, pieces placed, lines cleared breakdown
- Score progression (small sparkline)
- Bugs detected (red badges)
- Next piece / game over / restart status

## Files to modify

1. `tests.ts` -- add Phase 7 (`runCompetitivePlayPhase`), call it from `runAllTests`, include data in return value
2. `types.ts` -- add `CompetitivePlayResult` interface (defined once here and used by `tests.ts`)
3. `player.ts` -- may need a variant of `playGame` that calls back after each piece for monitoring (or just use the existing one and do monitoring externally via polling)
4. `index.ts` -- include competitive_play data in the report output
5. `dashboard/src/components/RunDetail.tsx` -- add competitive play detail card

## What NOT to change

- The 16 existing tests and their pass/fail logic
- The grid reader or calibrate modules
- The AI player heuristics

The gameplay bot score calculation does change: it updates to include the new tests (out of 24 total).
