<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>loop-benchmarking, branch HEAD</title>
<subtitle>Controlled experiments across agentic coding configurations. Same task, one variable, what actually works.
</subtitle>
<entry>
<id>bbd676244fcbed8e4b2767bc3325769489b5da48</id>
<published>2026-04-16T14:47:08Z</published>
<updated>2026-04-16T14:47:08Z</updated>
<title>Drop aborted glm-4.5-air run 0c19668a</title>
<link rel="alternate" type="text/html" href="commit/bbd676244fcbed8e4b2767bc3325769489b5da48.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit bbd676244fcbed8e4b2767bc3325769489b5da48
parent dd1bad2d5568fcda6b6b52af5c89baae4e8991ad
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 16:47:08 +0200

Drop aborted glm-4.5-air run 0c19668a

No claude_output.json, empty stderr, 40-line transcript, unset
exit_code/wall_time -- same failure pattern as prior Z.AI purges.
Rebuild PCA against the resulting 510-run dataset.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>dd1bad2d5568fcda6b6b52af5c89baae4e8991ad</id>
<published>2026-04-16T14:46:44Z</published>
<updated>2026-04-16T14:46:44Z</updated>
<title>Analyze and push 511 runs</title>
<link rel="alternate" type="text/html" href="commit/dd1bad2d5568fcda6b6b52af5c89baae4e8991ad.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit dd1bad2d5568fcda6b6b52af5c89baae4e8991ad
parent 0f44859d3eac85c7dc11272eacc4e57fda2b14c0
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 16:46:44 +0200

Analyze and push 511 runs

</content>
</entry>
<entry>
<id>0f44859d3eac85c7dc11272eacc4e57fda2b14c0</id>
<published>2026-04-16T14:45:12Z</published>
<updated>2026-04-16T14:45:12Z</updated>
<title>Project runs across all dashboard pages</title>
<link rel="alternate" type="text/html" href="commit/0f44859d3eac85c7dc11272eacc4e57fda2b14c0.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 0f44859d3eac85c7dc11272eacc4e57fda2b14c0
parent 3b81eb9246542dee665795aeb510ae2ced79f03b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 16:45:12 +0200

Project runs across all dashboard pages

Apply the projectRunForIndex pattern from index.astro to the insights,
explore, compare, and surprises pages. These four active pages need only
summary fields already covered by projectRunForIndex, so no new
projectors are required. pca.astro passes pre-computed JSON, not runs,
so it needs no change.

Before (raw / gzipped):
  insights:  34.0 MB /  ~3.1 MB
  explore:   50.8 MB /  ~5.1 MB
  compare:    8.5 MB /  ~800 KB
  surprises:  8.4 MB /  ~800 KB
  dist/ total: 344 MB

After (raw / gzipped):
  insights:   6.0 MB / 222 KB
  explore:    8.8 MB / 318 KB
  compare:    1.5 MB /  57 KB
  surprises:  1.4 MB /  55 KB
  dist/ total: 263 MB

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3b81eb9246542dee665795aeb510ae2ced79f03b</id>
<published>2026-04-16T14:32:06Z</published>
<updated>2026-04-16T14:32:06Z</updated>
<title>Project runs before serializing into index-page islands</title>
<link rel="alternate" type="text/html" href="commit/3b81eb9246542dee665795aeb510ae2ced79f03b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3b81eb9246542dee665795aeb510ae2ced79f03b
parent 0af972817d114910874b95bc4ec84298b7511e40
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 16:32:06 +0200

Project runs before serializing into index-page islands

index.astro passes the full Run[] to 4 client:load islands
(StatisticalPowerCard, Charts, TopBottomConfigs, Grid). Astro serializes
each island&#39;s props independently, so the full eval_results payload
(gameplay bot report with per-test details, SonarQube details,
code_analysis, transcript_analysis) was embedded four times, once per
island -- ~34 MB of HTML on a 510-run dataset.

Add projectRunForIndex() in data.ts that returns a Run-shaped object
containing only the fields these islands and analysis.groupIntoCells
actually read (score summaries, functional.pass, cost, num_turns). Call
it once in index.astro and pass the slim array to all four islands.

dist/index.html: 34 MB -&gt; 5.9 MB raw, 3.1 MB -&gt; 217 KB gzipped.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>0af972817d114910874b95bc4ec84298b7511e40</id>
<published>2026-04-16T13:53:49Z</published>
<updated>2026-04-16T13:53:49Z</updated>
<title>Rebuild PCA from post-reeval 510-run dataset</title>
<link rel="alternate" type="text/html" href="commit/0af972817d114910874b95bc4ec84298b7511e40.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 0af972817d114910874b95bc4ec84298b7511e40
parent 46364ff78312c3c0d5d647e2f6d59c0c40345cec
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 15:53:49 +0200

Rebuild PCA from post-reeval 510-run dataset

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>46364ff78312c3c0d5d647e2f6d59c0c40345cec</id>
<published>2026-04-16T13:53:40Z</published>
<updated>2026-04-16T13:53:40Z</updated>
<title>Analyze and push 512 runs</title>
<link rel="alternate" type="text/html" href="commit/46364ff78312c3c0d5d647e2f6d59c0c40345cec.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 46364ff78312c3c0d5d647e2f6d59c0c40345cec
parent 03f7652cb15c203683d9239f08dc22efbb51b1b5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 15:53:40 +0200

Analyze and push 512 runs

</content>
</entry>
<entry>
<id>03f7652cb15c203683d9239f08dc22efbb51b1b5</id>
<published>2026-04-16T13:50:27Z</published>
<updated>2026-04-16T13:50:27Z</updated>
<title>Full reeval on GPU machine: V2 bot + SonarQube</title>
<link rel="alternate" type="text/html" href="commit/03f7652cb15c203683d9239f08dc22efbb51b1b5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 03f7652cb15c203683d9239f08dc22efbb51b1b5
parent b499a01fb7df37b81f26449bd66bfd4cf68de116
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 15:50:27 +0200

Full reeval on GPU machine: V2 bot + SonarQube

All 510 runs re-evaluated at -j 20. SonarQube Community 9.9.8 was started
locally for the scan; sonarqube-scan.py had already been switched from
sonar.token to sonar.login for compatibility with that version.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>b499a01fb7df37b81f26449bd66bfd4cf68de116</id>
<published>2026-04-16T11:08:21Z</published>
<updated>2026-04-16T11:08:21Z</updated>
<title>900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79%</title>
<link rel="alternate" type="text/html" href="commit/b499a01fb7df37b81f26449bd66bfd4cf68de116.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit b499a01fb7df37b81f26449bd66bfd4cf68de116
parent fd8274318dc475fe75d10c9b588c4af38d451c91
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 13:08:21 +0200

900s bot timeout + inactivity watchdog; aggregate agreement 48% to 79%

Three changes to the gameplay bot pipeline:
- Raise harness/run.py bot subprocess timeout from 300s to 900s.
- Raise playwright.config.ts and index.ts test timeout from 360s to 900s.
- Driver-level inactivity watchdog: readGrid()/wait() throw
  InactivityAbortError when 120s pass without a successful grid read
  (armed only after the game confirms started). bot.ts wraps Phases 3-8
  in a guard that catches the abort and still writes a partial report.
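
The watchdog in the third bullet can be pictured with a minimal sketch.
It is written in Python for illustration only, whereas the bot driver
itself is TypeScript. The class name InactivityWatchdog, its method
names, and the injectable clock are assumptions; only
InactivityAbortError, the 120s limit, and the arm-after-start rule come
from the description above.

```python
import time


class InactivityAbortError(Exception):
    """Raised when no successful grid read happened within the limit."""


class InactivityWatchdog:
    """Hypothetical helper; not the actual driver code."""

    def __init__(self, limit_s=120.0, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock
        self.last_ok = None  # None means the watchdog is not armed yet

    def arm(self):
        """Start counting once the game has confirmed it started."""
        self.last_ok = self.clock()

    def note_success(self):
        """Record a successful grid read (no-op until armed)."""
        if self.last_ok is not None:
            self.last_ok = self.clock()

    def check(self):
        """Raise if armed and limit_s has elapsed since the last success."""
        if self.last_ok is not None and self.clock() - self.last_ok >= self.limit_s:
            raise InactivityAbortError(f"no grid read for {self.limit_s}s")
```

A caller would invoke check() inside its read and wait loops and catch
InactivityAbortError at the phase boundary so that a partial report can
still be written.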

Fix calibration run_id mappings: commit 711df365 retagged 176 anthropic
runs with prov=anth and bumped colliding run_num, leaving 13 of 17
calibration JSONs pointing at stale paths. Remap from the retag commit&#39;s
rename list.

Results on 17 calibration runs (j=20 on RTX 4070 / Ryzen 9):
- Bot-vs-human agreement: 90/189 (48%) -&gt; 181/228 (79%)
- grid_detected: 9/17 -&gt; 16/17
- renderer=unknown: 8/17 -&gt; 1/17 (the one remaining is c1013100
  which legitimately fails to load per human label)
- Every anthropic run that previously got killed mid-calibration now
  finishes with proper detection and scores in the 0.67-1.00 range.
- Wall-clock at j=20: 6m31s; at j=5: 16m48s; identical grid/renderer
  output across both, no inactivity aborts in either.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>fd8274318dc475fe75d10c9b588c4af38d451c91</id>
<published>2026-04-16T10:10:23Z</published>
<updated>2026-04-16T10:10:35Z</updated>
<title>Add human labels for 3 more calibration runs</title>
<link rel="alternate" type="text/html" href="commit/fd8274318dc475fe75d10c9b588c4af38d451c91.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit fd8274318dc475fe75d10c9b588c4af38d451c91
parent 711df365354d81be00d01bce2428e7c283e0ec2b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 12:10:23 +0200

Add human labels for 3 more calibration runs

bbb70053 (haiku-4.5 DOM) flagged as very laggy: playable, but the lag
degrades the experience. c1013100 (gemma-4-26b) fails to load.
e047cf3a (haiku-4.5) plays fully.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>711df365354d81be00d01bce2428e7c283e0ec2b</id>
<published>2026-04-16T09:58:46Z</published>
<updated>2026-04-16T09:58:46Z</updated>
<title>Retag 176 pre-provider anthropic runs with prov=anth in cell_id</title>
<link rel="alternate" type="text/html" href="commit/711df365354d81be00d01bce2428e7c283e0ec2b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 711df365354d81be00d01bce2428e7c283e0ec2b
parent f801efc9b7f7880049fdeeeed53d55f0ecae5ecc
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 11:58:46 +0200

Retag 176 pre-provider anthropic runs with prov=anth in cell_id

These runs were created before the &#39;provider&#39; axis was introduced. The
earlier legacy migration added provider=&#39;anthropic&#39; to each meta.json
but didn&#39;t regenerate cell_ids to include the prov=anth segment, leaving
them invisible to the current main_effects coverage check even though
the run data itself was intact.

This pass rebuilds each cell_id with the current AXIS_ABBREV/VALUE_ABBREV
logic, renames run and artifact directories, updates meta.json&#39;s cell_id
and run_id, and rewrites results/index.jsonl. Collisions where a
post-provider run already occupied the target slot were resolved by
bumping run_num (Option C: kept both as additional replicates).
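
As a rough illustration of the rebuild-and-bump step, the sketch below
assumes a cell_id made of axis=value segments joined by underscores in
sorted axis order. The table contents, the separator, the ordering, and
the helper names build_cell_id and next_free_run_num are all
hypothetical; only the prov=anth segment and the run_num bump on
collision come from this message.

```python
# Hypothetical abbreviation tables; the real AXIS_ABBREV/VALUE_ABBREV differ.
AXIS_ABBREV = {"provider": "prov"}
VALUE_ABBREV = {"anthropic": "anth"}


def build_cell_id(axes):
    """Join abbreviated axis=value segments in sorted axis order (assumed)."""
    parts = []
    for axis in sorted(axes):
        value = axes[axis]
        short_axis = AXIS_ABBREV.get(axis, axis)
        short_value = VALUE_ABBREV.get(value, value)
        parts.append(f"{short_axis}={short_value}")
    return "_".join(parts)


def next_free_run_num(cell_id, run_num, taken):
    """Bump run_num until the (cell_id, run_num) slot is unoccupied."""
    while (cell_id, run_num) in taken:
        run_num += 1
    return run_num
```

Keeping both runs as replicates means the bump never overwrites an
existing slot; it only moves the retagged run to the next free number.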

Impact:
- haiku-4.5: 73 retagged
- sonnet-4.6: 52 retagged
- opus-4.6: 51 retagged
- 194 total anthropic runs preserved, none deleted

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>f801efc9b7f7880049fdeeeed53d55f0ecae5ecc</id>
<published>2026-04-16T07:54:00Z</published>
<updated>2026-04-16T07:54:00Z</updated>
<title>Add human trial labels for 4 calibration runs</title>
<link rel="alternate" type="text/html" href="commit/f801efc9b7f7880049fdeeeed53d55f0ecae5ecc.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit f801efc9b7f7880049fdeeeed53d55f0ecae5ecc
parent 2fae566a4db8aae4b68eb9a4ff587a6a8e4245a2
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 09:54:00 +0200

Add human trial labels for 4 calibration runs

Covers qwen-3.6-plus, haiku-4.5, opus-4.6, glm-5.1 (each with
strat=usub or strat=none). All four are reported playable by the
human tester but the bot currently scores them near zero because
renderer detection fails (renderer=unknown, grid_detected=false),
so aggregate bot-vs-human agreement drops to 47.6%.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>2fae566a4db8aae4b68eb9a4ff587a6a8e4245a2</id>
<published>2026-04-16T07:06:51Z</published>
<updated>2026-04-16T07:06:51Z</updated>
<title>Preserve gameplay bot report on timeout</title>
<link rel="alternate" type="text/html" href="commit/2fae566a4db8aae4b68eb9a4ff587a6a8e4245a2.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 2fae566a4db8aae4b68eb9a4ff587a6a8e4245a2
parent 07408995fff0b06d234c4a19fdff6a2dae5b5028
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 09:06:51 +0200

Preserve gameplay bot report on timeout

When the bot subprocess hit its 300s timeout, the report file written
during the run was discarded, so eval_results.json.gameplay_bot had
no test data and the /calibrate dashboard couldn&#39;t display results.
On timeout, read the report if it exists and set timed_out=true.
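
A minimal sketch of that timeout path, assuming the bot is launched via
subprocess.run and writes its report to a known JSON file. The function
name run_bot, its parameters, and the returned dict shape are
hypothetical and not taken from harness/run.py.

```python
import json
import subprocess


def run_bot(cmd, report_path, timeout_s):
    """Run the bot; on timeout, keep whatever report it already wrote.

    report_path is expected to be a pathlib.Path.
    """
    timed_out = False
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        timed_out = True
    report = None
    if report_path.exists():
        # Assumes the report on disk is complete JSON, not a partial write.
        report = json.loads(report_path.read_text())
    return {"report": report, "timed_out": timed_out}
```

The point of the design is that the report is read after the try block,
so the timeout branch and the normal branch share the same read.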

Re-eval 17 calibration runs with the fix: all now carry populated
report.tests in eval_results.json.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>07408995fff0b06d234c4a19fdff6a2dae5b5028</id>
<published>2026-04-16T05:59:34Z</published>
<updated>2026-04-16T05:59:34Z</updated>
<title>Remove 39 invalid glm-4.7 runs and add new sweep results</title>
<link rel="alternate" type="text/html" href="commit/07408995fff0b06d234c4a19fdff6a2dae5b5028.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 07408995fff0b06d234c4a19fdff6a2dae5b5028
parent baa0f5098d7e7bd291146968e751bbfe5ee255e4
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu, 16 Apr 2026 07:59:34 +0200

Remove 39 invalid glm-4.7 runs and add new sweep results

Purged zero-turn 429s from glm-4.7 sweep (Z.AI rate-limited the
model hard during a ~7.5h window). Also includes the successful
glm-4.7 runs from the same sweep and fresh glm-5.1 runs.

glm-5.1: 123 clean runs, 0 bad
glm-4.7: 55 clean runs retained, 39 bad removed

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>baa0f5098d7e7bd291146968e751bbfe5ee255e4</id>
<published>2026-04-15T14:27:17Z</published>
<updated>2026-04-15T14:53:13Z</updated>
<title>Add 18 new runs (458 total)</title>
<link rel="alternate" type="text/html" href="commit/baa0f5098d7e7bd291146968e751bbfe5ee255e4.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit baa0f5098d7e7bd291146968e751bbfe5ee255e4
parent 4b971780246afdc6e97d3ed1d6c2aa6dcdeaa931
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 16:27:17 +0200

Add 18 new runs (458 total)

Profile: main_effects
Completed: 18 | Skipped: 109 | Failed: 20

</content>
</entry>
<entry>
<id>4b971780246afdc6e97d3ed1d6c2aa6dcdeaa931</id>
<published>2026-04-15T13:41:27Z</published>
<updated>2026-04-15T13:41:27Z</updated>
<title>Re-eval 17 calibration runs; fix reeval.py artifact cleanup</title>
<link rel="alternate" type="text/html" href="commit/4b971780246afdc6e97d3ed1d6c2aa6dcdeaa931.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4b971780246afdc6e97d3ed1d6c2aa6dcdeaa931
parent e82be6aca0708fad30ff11975bd16e5be13f53ff
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 15:41:27 +0200

Re-eval 17 calibration runs; fix reeval.py artifact cleanup

Previous artifact-path fix broke the cleanup safety check:
rmtree guard still matched dashboard/ when artifact_dir moved to
artifacts/, so successful reevals wiped the agent-generated game code.
Update the guard to match the new artifacts/ path.

Bot vs human agreement on 17 calibration runs: 72/116 (62.1%).
DOM-detected games agree at ~91%.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>e82be6aca0708fad30ff11975bd16e5be13f53ff</id>
<published>2026-04-15T11:47:11Z</published>
<updated>2026-04-15T11:47:11Z</updated>
<title>Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs</title>
<link rel="alternate" type="text/html" href="commit/e82be6aca0708fad30ff11975bd16e5be13f53ff.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e82be6aca0708fad30ff11975bd16e5be13f53ff
parent 6678831b7fac8cd35467d4539afc9ce70d68d388
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 13:47:11 +0200

Fix compute_grid OOM: fail on unknown profile, stream via generator, dispatch DOE designs

Three fixes in one pass:

1. get_axes() used to silently fall back to the full top-level grid when
   given an unknown profile name. With 23 axes this expands to ~40B
   cartesian combinations, and the process OOMed the host (7.6GB+ before
   it swap-stormed into D-state). Now it raises ValueError listing the
   known profiles.

2. compute_cells() accumulated every cell in a list before returning.
   Although itertools.product is lazy, building the intermediate list
   defeats that laziness. Converted to a generator yielding one cell at
   a time. Streaming the &#39;full&#39; profile now peaks at ~12MB RSS instead
   of growing without bound. The only in-repo consumer (harness/run.py)
   already materializes via a list comprehension, so the change is
   transparent there.

3. compute_grid.py now recognizes the DOE design names (main_effects,
   plackett_burman, interaction_hunt) and dispatches to
   experiment_design.py. Previously &#39;compute_grid.py grid.yaml
   main_effects&#39; triggered the silent fallback (bug #1) because
   main_effects is a design, not a profile. Now it produces the
   expected one-at-a-time sweep.

Unknown names now print the full list of valid profiles and designs
instead of silently misbehaving.
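
The first two fixes can be sketched as follows. This is a simplified
illustration, not the code in compute_grid.py: the PROFILES table, the
function signatures, and the cell representation as a dict are
assumptions.

```python
import itertools

# Hypothetical grid config: profile name mapped to its axes.
PROFILES = {
    "smoke": {"model": ["haiku-4.5", "opus-4.6"], "strat": ["none", "usub"]},
}


def get_axes(profiles, name):
    """Return the axes for a known profile; fail loudly on unknown names."""
    if name not in profiles:
        known = ", ".join(sorted(profiles))
        raise ValueError(f"unknown profile {name!r}; known profiles: {known}")
    return profiles[name]


def compute_cells(axes):
    """Yield one cell at a time instead of accumulating a list."""
    names = list(axes)
    for values in itertools.product(*(axes[n] for n in names)):
        yield dict(zip(names, values))
```

A consumer that needs every cell can still wrap the call in list(),
while a consumer that only counts or samples cells never holds more
than one in memory.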

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>6678831b7fac8cd35467d4539afc9ce70d68d388</id>
<published>2026-04-15T09:37:13Z</published>
<updated>2026-04-15T09:37:13Z</updated>
<title>Remove 20 more zero-turn 429 runs from glm-5.1 sweep</title>
<link rel="alternate" type="text/html" href="commit/6678831b7fac8cd35467d4539afc9ce70d68d388.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 6678831b7fac8cd35467d4539afc9ce70d68d388
parent d4faa51819846758ef1376070eaf8d72ebbfa48b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 11:37:13 +0200

Remove 20 more zero-turn 429 runs from glm-5.1 sweep

Same Z.AI rate-limit pattern as previous purges: first request
of each run gets 429&#39;d, Claude CLI retries 10x, run dies ~200s
in with num_turns=1 and no work product. Purged to keep the
benchmark clean.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>d4faa51819846758ef1376070eaf8d72ebbfa48b</id>
<published>2026-04-15T08:03:00Z</published>
<updated>2026-04-15T08:03:00Z</updated>
<title>Add 20 new runs (460 total)</title>
<link rel="alternate" type="text/html" href="commit/d4faa51819846758ef1376070eaf8d72ebbfa48b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d4faa51819846758ef1376070eaf8d72ebbfa48b
parent 1d3667655369838306205f50bcc8210fe54f3b6f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 10:03:00 +0200

Add 20 new runs (460 total)

Profile: main_effects
Completed: 20 | Skipped: 109 | Failed: 18

</content>
</entry>
<entry>
<id>1d3667655369838306205f50bcc8210fe54f3b6f</id>
<published>2026-04-15T05:03:15Z</published>
<updated>2026-04-15T05:03:15Z</updated>
<title>Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn)</title>
<link rel="alternate" type="text/html" href="commit/1d3667655369838306205f50bcc8210fe54f3b6f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1d3667655369838306205f50bcc8210fe54f3b6f
parent 610f73ccd69efee8791e3090594c1ad45dabc63b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 07:03:15 +0200

Remove 20 invalid glm-5.1 runs (429 / aborted / zero-turn)

14 runs from the recent sweep hit Z.AI rate limits on the first
turn despite correct auth, 2 older runs got cut off partway
through by 429s, and 4 stale pre-fix runs logged 1 turn with
non-zero cost. None have usable outcome data.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>610f73ccd69efee8791e3090594c1ad45dabc63b</id>
<published>2026-04-15T02:30:23Z</published>
<updated>2026-04-15T02:30:23Z</updated>
<title>Add 14 new runs (460 total)</title>
<link rel="alternate" type="text/html" href="commit/610f73ccd69efee8791e3090594c1ad45dabc63b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 610f73ccd69efee8791e3090594c1ad45dabc63b
parent 238f1a535996225cdf2c4e054730278cfae58f0f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 15 Apr 2026 04:30:23 +0200

Add 14 new runs (460 total)

Profile: main_effects
Completed: 14 | Skipped: 115 | Failed: 18

</content>
</entry>
<entry>
<id>238f1a535996225cdf2c4e054730278cfae58f0f</id>
<published>2026-04-14T20:42:18Z</published>
<updated>2026-04-14T20:42:18Z</updated>
<title>Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers</title>
<link rel="alternate" type="text/html" href="commit/238f1a535996225cdf2c4e054730278cfae58f0f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 238f1a535996225cdf2c4e054730278cfae58f0f
parent 7418a1208c757383b4559e0bc32b40526ac75cb7
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 22:42:18 +0200

Fix Z.AI auth: skip apiKeyHelper for non-anthropic providers

apiKeyHelper in --settings returned an Anthropic OAuth token and
overrode the ANTHROPIC_AUTH_TOKEN env var, so every zai (and
openrouter) request authenticated with the wrong credential. Z.AI
responded with 429 on the first turn, Claude CLI retried 10x, and
the run died after ~200s with zero useful work. Now apiKeyHelper
is only set when provider has no base_url override, so env-var
auth flows through for zai/openrouter.
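
The condition can be sketched as below. The function name
build_settings and the shape of the provider dict are hypothetical;
only the apiKeyHelper key and the base_url test come from the
description above.

```python
def build_settings(provider, api_key_helper):
    """Return CLI settings; set apiKeyHelper only without a base_url override."""
    settings = {}
    if not provider.get("base_url"):
        # No override: the helper supplies the credential.
        settings["apiKeyHelper"] = api_key_helper
    # With an override, the key is omitted so env-var auth is used instead.
    return settings
```

Omitting the key entirely, rather than setting it to an empty value, is
what lets the ANTHROPIC_AUTH_TOKEN environment variable take effect for
providers such as zai and openrouter.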

Also commits ~30 new glm-5.1 runs from the main_effects sweep
that completed cleanly after the fix, minus 5 purged invalid
runs (429/aborted/zero-turn) captured before the fix landed.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7418a1208c757383b4559e0bc32b40526ac75cb7</id>
<published>2026-04-14T11:08:09Z</published>
<updated>2026-04-14T11:08:09Z</updated>
<title>Add 1 new runs (393 total)</title>
<link rel="alternate" type="text/html" href="commit/7418a1208c757383b4559e0bc32b40526ac75cb7.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7418a1208c757383b4559e0bc32b40526ac75cb7
parent c79224fc666a6101927d5f64238f4331d9939ea5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 13:08:09 +0200

Add 1 new runs (393 total)

Profile: main_effects
Completed: 1 | Skipped: 0 | Failed: 0

</content>
</entry>
<entry>
<id>c79224fc666a6101927d5f64238f4331d9939ea5</id>
<published>2026-04-14T07:32:07Z</published>
<updated>2026-04-14T07:32:07Z</updated>
<title>Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken)</title>
<link rel="alternate" type="text/html" href="commit/c79224fc666a6101927d5f64238f4331d9939ea5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit c79224fc666a6101927d5f64238f4331d9939ea5
parent 92464fd70b670fba6b71a3d3ad995cee4c286044
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 09:32:07 +0200

Remove 68 more zero-cost GLM-5.1 runs (Z.AI auth still broken)

</content>
</entry>
<entry>
<id>92464fd70b670fba6b71a3d3ad995cee4c286044</id>
<published>2026-04-14T04:07:33Z</published>
<updated>2026-04-14T04:07:33Z</updated>
<title>Add 68 new runs (459 total)</title>
<link rel="alternate" type="text/html" href="commit/92464fd70b670fba6b71a3d3ad995cee4c286044.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 92464fd70b670fba6b71a3d3ad995cee4c286044
parent 0b3c14da72aae531800f474ecadf3d960ff431c8
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 06:07:33 +0200

Add 68 new runs (459 total)

Profile: main_effects
Completed: 68 | Skipped: 61 | Failed: 18

</content>
</entry>
<entry>
<id>0b3c14da72aae531800f474ecadf3d960ff431c8</id>
<published>2026-04-14T03:54:07Z</published>
<updated>2026-04-14T03:54:07Z</updated>
<title>Checkpoint: 60 runs (453 total)</title>
<link rel="alternate" type="text/html" href="commit/0b3c14da72aae531800f474ecadf3d960ff431c8.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 0b3c14da72aae531800f474ecadf3d960ff431c8
parent c3e360e804cee1c489ce684800cb272cdfc4071c
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 05:54:07 +0200

Checkpoint: 60 runs (453 total)

</content>
</entry>
<entry>
<id>c3e360e804cee1c489ce684800cb272cdfc4071c</id>
<published>2026-04-14T03:03:53Z</published>
<updated>2026-04-14T03:03:53Z</updated>
<title>Checkpoint: 30 runs (423 total)</title>
<link rel="alternate" type="text/html" href="commit/c3e360e804cee1c489ce684800cb272cdfc4071c.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit c3e360e804cee1c489ce684800cb272cdfc4071c
parent e52e85189d28eaa9be6271e17b363b5321c9ba16
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 05:03:53 +0200

Checkpoint: 30 runs (423 total)

</content>
</entry>
<entry>
<id>e52e85189d28eaa9be6271e17b363b5321c9ba16</id>
<published>2026-04-13T21:11:31Z</published>
<updated>2026-04-13T21:11:31Z</updated>
<title>Remove 68 zero-cost GLM-5.1 runs (auth failures)</title>
<link rel="alternate" type="text/html" href="commit/e52e85189d28eaa9be6271e17b363b5321c9ba16.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e52e85189d28eaa9be6271e17b363b5321c9ba16
parent 19603805e8fb248250f450fc3acd810b42dcc389
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 23:11:31 +0200

Remove 68 zero-cost GLM-5.1 runs (auth failures)

Z.AI API key was expired/invalid during these runs, resulting in
0 turns and 0 cost. All 68 were glm-5.1 model.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>19603805e8fb248250f450fc3acd810b42dcc389</id>
<published>2026-04-13T20:56:53Z</published>
<updated>2026-04-13T20:56:53Z</updated>
<title>Add 24 new runs (459 total)</title>
<link rel="alternate" type="text/html" href="commit/19603805e8fb248250f450fc3acd810b42dcc389.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 19603805e8fb248250f450fc3acd810b42dcc389
parent 3f102921e8bdb6a5e5f6f0ac940fd6052d67bde4
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 22:56:53 +0200

Add 24 new runs (459 total)

Profile: main_effects
Completed: 24 | Skipped: 105 | Failed: 18

</content>
</entry>
<entry>
<id>3f102921e8bdb6a5e5f6f0ac940fd6052d67bde4</id>
<published>2026-04-13T20:53:20Z</published>
<updated>2026-04-13T20:53:20Z</updated>
<title>Checkpoint: 20 runs (459 total)</title>
<link rel="alternate" type="text/html" href="commit/3f102921e8bdb6a5e5f6f0ac940fd6052d67bde4.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3f102921e8bdb6a5e5f6f0ac940fd6052d67bde4
parent 23f965ae24a10947be7b2d35deb1af122bfd49c3
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 22:53:20 +0200

Checkpoint: 20 runs (459 total)

</content>
</entry>
<entry>
<id>23f965ae24a10947be7b2d35deb1af122bfd49c3</id>
<published>2026-04-13T20:46:20Z</published>
<updated>2026-04-13T20:46:20Z</updated>
<title>Checkpoint: 10 runs (449 total)</title>
<link rel="alternate" type="text/html" href="commit/23f965ae24a10947be7b2d35deb1af122bfd49c3.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 23f965ae24a10947be7b2d35deb1af122bfd49c3
parent 15bbcc86f3fbbb02c35a2ab5ee724272dcc49aeb
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 22:46:20 +0200

Checkpoint: 10 runs (449 total)

</content>
</entry>
<entry>
<id>15bbcc86f3fbbb02c35a2ab5ee724272dcc49aeb</id>
<published>2026-04-13T20:34:29Z</published>
<updated>2026-04-13T20:34:29Z</updated>
<title>Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia</title>
<link rel="alternate" type="text/html" href="commit/15bbcc86f3fbbb02c35a2ab5ee724272dcc49aeb.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 15bbcc86f3fbbb02c35a2ab5ee724272dcc49aeb
parent 7e8573576791e92cf46d189047bfac223df15bfc
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 22:34:29 +0200

Add smaller noise files: 1k, 10k, 50k, 100k for both lorem and wikipedia

The existing 25/50/75% noise files (195-587KB) exceed some API input
limits (Z.AI rejects them with 0 turns). Added smaller sizes:
- 1KB, 10KB, 50KB, 100KB for both lorem and wikipedia

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7e8573576791e92cf46d189047bfac223df15bfc</id>
<published>2026-04-13T20:24:08Z</published>
<updated>2026-04-13T20:24:08Z</updated>
<title>Add 44 new runs (435 total)</title>
<link rel="alternate" type="text/html" href="commit/7e8573576791e92cf46d189047bfac223df15bfc.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7e8573576791e92cf46d189047bfac223df15bfc
parent 5c6e636b97dc9f81a5ff432fc9b730442d5ec8ef
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 22:24:08 +0200

Add 44 new runs (435 total)

Profile: main_effects
Completed: 44 | Skipped: 61 | Failed: 18

</content>
</entry>
<entry>
<id>5c6e636b97dc9f81a5ff432fc9b730442d5ec8ef</id>
<published>2026-04-13T20:17:09Z</published>
<updated>2026-04-13T20:17:09Z</updated>
<title>Checkpoint: 40 runs (433 total)</title>
<link rel="alternate" type="text/html" href="commit/5c6e636b97dc9f81a5ff432fc9b730442d5ec8ef.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 5c6e636b97dc9f81a5ff432fc9b730442d5ec8ef
parent e6f94efd9e1797f169e38db6fe6e566c1b9cec37
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 22:17:09 +0200

Checkpoint: 40 runs (433 total)

</content>
</entry>
<entry>
<id>e6f94efd9e1797f169e38db6fe6e566c1b9cec37</id>
<published>2026-04-13T15:14:54Z</published>
<updated>2026-04-13T15:14:54Z</updated>
<title>Restore game artifacts deleted by GPU machine commit</title>
<link rel="alternate" type="text/html" href="commit/e6f94efd9e1797f169e38db6fe6e566c1b9cec37.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e6f94efd9e1797f169e38db6fe6e566c1b9cec37
parent 742064a538f4dfd67a0c2f63f3a48023a6a8fde4
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 17:14:54 +0200

Restore game artifacts deleted by GPU machine commit

8daeef92 accidentally deleted game HTML/JS/CSS files from artifacts/
when the GPU machine committed without having the full artifacts.
Restored from d07dba79.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>742064a538f4dfd67a0c2f63f3a48023a6a8fde4</id>
<published>2026-04-13T14:13:21Z</published>
<updated>2026-04-13T14:13:21Z</updated>
<title>CI: exclude artifacts/ from rsync --delete</title>
<link rel="alternate" type="text/html" href="commit/742064a538f4dfd67a0c2f63f3a48023a6a8fde4.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 742064a538f4dfd67a0c2f63f3a48023a6a8fde4
parent 8972d44e8b39923d82288a7c34ce62ed50b4261a
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 16:13:21 +0200

CI: exclude artifacts/ from rsync --delete

The --delete flag on dashboard rsync was wiping artifacts/ on every
deploy. If the artifact rsync then failed (no artifacts on CI runner),
the game previews were gone. Now --exclude=&#39;artifacts/&#39; preserves
previously deployed games.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>8972d44e8b39923d82288a7c34ce62ed50b4261a</id>
<published>2026-04-13T13:44:43Z</published>
<updated>2026-04-13T13:44:43Z</updated>
<title>Add all game artifacts, fix CI artifact rsync</title>
<link rel="alternate" type="text/html" href="commit/8972d44e8b39923d82288a7c34ce62ed50b4261a.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 8972d44e8b39923d82288a7c34ce62ed50b4261a
parent 8daeef92ba6951d8f87f6476d03c4fafe25a57d2
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 15:44:43 +0200

Add all game artifacts, fix CI artifact rsync

Committed all game artifacts (HTML, JS, CSS) from workspace extracts.
node_modules excluded by .gitignore.

CI deploy.yml: skip artifact rsync gracefully if directory missing
on the CI runner (artifacts are large and not always available).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>8daeef92ba6951d8f87f6476d03c4fafe25a57d2</id>
<published>2026-04-13T13:28:24Z</published>
<updated>2026-04-13T13:28:41Z</updated>
<title>Re-eval all 390 runs with V2 bot on GPU machine</title>
<link rel="alternate" type="text/html" href="commit/8daeef92ba6951d8f87f6476d03c4fafe25a57d2.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 8daeef92ba6951d8f87f6476d03c4fafe25a57d2
parent d07dba794c0abd20688958f4185daf3447786621
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 15:28:24 +0200

Re-eval all 390 runs with V2 bot on GPU machine

GPU canvas readback works (getImageData returns real pixels),
unlocking 148 canvas games that previously failed on the non-GPU machine.
Fix reeval.py artifact path (was dashboard/public/artifacts/, now artifacts/).
Clean up SonarQube .scannerwork temp files from artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>d07dba794c0abd20688958f4185daf3447786621</id>
<published>2026-04-13T11:58:14Z</published>
<updated>2026-04-13T11:58:14Z</updated>
<title>Context update for GPU machine testing</title>
<link rel="alternate" type="text/html" href="commit/d07dba794c0abd20688958f4185daf3447786621.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d07dba794c0abd20688958f4185daf3447786621
parent 499b8e496ada5d677434f56c376d41db2517b3a3
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 13:58:14 +0200

Context update for GPU machine testing

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>499b8e496ada5d677434f56c376d41db2517b3a3</id>
<published>2026-04-12T16:46:57Z</published>
<updated>2026-04-12T16:46:57Z</updated>
<title>Update eval results: 123 runs re-evaluated with V2 bot</title>
<link rel="alternate" type="text/html" href="commit/499b8e496ada5d677434f56c376d41db2517b3a3.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 499b8e496ada5d677434f56c376d41db2517b3a3
parent a8683609ce323889201069673c68c03616555eb6
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 18:46:57 +0200

Update eval results: 123 runs re-evaluated with V2 bot

V2 mean: 43% (up from V1&#39;s 20%). The distribution shifted from the
0-10% cluster to the 60-70% range for games the bot can now play.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>a8683609ce323889201069673c68c03616555eb6</id>
<published>2026-04-12T16:23:03Z</published>
<updated>2026-04-12T16:23:03Z</updated>
<title>Add 7 new games to calibration page</title>
<link rel="alternate" type="text/html" href="commit/a8683609ce323889201069673c68c03616555eb6.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a8683609ce323889201069673c68c03616555eb6
parent da239183820ed8946e7f489fa3abfe7ff2513ba6
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 18:23:03 +0200

Add 7 new games to calibration page

opus, qwen, glm-5.1, haiku, gemma-4-26b across various score ranges
(0-44%). Human tests unanswered, ready for testing.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>da239183820ed8946e7f489fa3abfe7ff2513ba6</id>
<published>2026-04-12T15:56:13Z</published>
<updated>2026-04-12T15:56:13Z</updated>
<title>Analyze and push 391 runs</title>
<link rel="alternate" type="text/html" href="commit/da239183820ed8946e7f489fa3abfe7ff2513ba6.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit da239183820ed8946e7f489fa3abfe7ff2513ba6
parent 821022cb2060158a630e184a67c95678e8dca7c7
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 17:56:13 +0200

Analyze and push 391 runs

</content>
</entry>
<entry>
<id>821022cb2060158a630e184a67c95678e8dca7c7</id>
<published>2026-04-12T15:43:33Z</published>
<updated>2026-04-12T15:43:33Z</updated>
<title>Switch production eval to V2 gameplay bot</title>
<link rel="alternate" type="text/html" href="commit/821022cb2060158a630e184a67c95678e8dca7c7.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 821022cb2060158a630e184a67c95678e8dca7c7
parent 00055378a50253cc949795147e20b64ed2a2767f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 17:43:33 +0200

Switch production eval to V2 gameplay bot

Harness now uses gameplay-bot-v2 (two-tier architecture) when available,
falls back to V1 if not. V2 has 95% agreement with human calibration
(vs V1&#39;s 58%).

Expect breakage on canvas games without GPU (getImageData returns zeros).
DOM games should work well.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>00055378a50253cc949795147e20b64ed2a2767f</id>
<published>2026-04-12T15:38:24Z</published>
<updated>2026-04-12T15:38:24Z</updated>
<title>Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear, 9805c24a has game over overlay</title>
<link rel="alternate" type="text/html" href="commit/00055378a50253cc949795147e20b64ed2a2767f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 00055378a50253cc949795147e20b64ed2a2767f
parent 42321c004e708af0b099db58e9cbedf54e03e145
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 17:38:24 +0200

Update calibration: cbbff570 CW rotation works, e2e04e75 scores on clear,
9805c24a has game over overlay

cbbff570: CW rotation (Up) works, CCW (Z) is broken. Updated rotate=true.
e2e04e75: Bot was right, score increases by 100 on line clear. Updated
score_increases_on_clear=true.
9805c24a: Game over shows overlay with GAME OVER text + Play Again button.
Updated game_over_display=true.

V2 agreement now 95% (102/107). 5 remaining disagreements: 3 from trail
rendering bug (4949d521), 1 game_over_display detection (9805c24a), 1
all_pieces_rotate edge case (cbbff570).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>42321c004e708af0b099db58e9cbedf54e03e145</id>
<published>2026-04-12T06:31:49Z</published>
<updated>2026-04-12T06:31:49Z</updated>
<title>V2: landmarks-based game_loads, updated calibration test names</title>
<link rel="alternate" type="text/html" href="commit/42321c004e708af0b099db58e9cbedf54e03e145.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 42321c004e708af0b099db58e9cbedf54e03e145
parent 13710ed75e85f22717f6463b1b3bc0b822a0a66b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 08:31:49 +0200

V2: landmarks-based game_loads, updated calibration test names

game_loads now checks for structural landmarks (canvas, DOM grid,
tetris-ratio elements, cell containers) instead of failing on console
errors. Console errors are informational, not a pass/fail gate.

8fe72fce: game_loads now PASS (was FAIL from benign startup TypeError),
score 100% (20/20 scorable).

Updated all calibration JSON files: score_changes renamed to
score_increases_on_clear + score_element_visible.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>13710ed75e85f22717f6463b1b3bc0b822a0a66b</id>
<published>2026-04-11T07:59:29Z</published>
<updated>2026-04-11T07:59:29Z</updated>
<title>V2: partial landmarks work (agent hit limit)</title>
<link rel="alternate" type="text/html" href="commit/13710ed75e85f22717f6463b1b3bc0b822a0a66b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 13710ed75e85f22717f6463b1b3bc0b822a0a66b
parent d1b5c77738368fcf645c325b912383d5c69f22ed
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 09:59:29 +0200

V2: partial landmarks work (agent hit limit)

</content>
</entry>
<entry>
<id>d1b5c77738368fcf645c325b912383d5c69f22ed</id>
<published>2026-04-11T07:12:39Z</published>
<updated>2026-04-11T07:12:39Z</updated>
<title>V2: stricter rotation test requires distinct rotation states</title>
<link rel="alternate" type="text/html" href="commit/d1b5c77738368fcf645c325b912383d5c69f22ed.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d1b5c77738368fcf645c325b912383d5c69f22ed
parent 669aa68861617a446e5ae56aea0a462995d183f0
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 09:12:39 +0200

V2: stricter rotation test requires distinct rotation states

Previous rotate test passed if pressing rotate caused ANY grid change.
Games with broken rotation (only 1 of 4 states works) would pass.

New test:
- Press rotate 4 times, wait 100ms between each
- Record normalized active piece shape after each press
- rotate test: passes if 2+ distinct shapes (baseline + rotation)
- all_pieces_rotate: passes if 2+ J/L/T pieces reach 3+ distinct shapes
- Skips if fewer than 2 J/L/T piece types seen

Uses position-invariant shape keys (normalized to top-left origin) so
auto-drop during the test doesn&#39;t confuse the comparison.
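
A minimal Python sketch of the position-invariant shape key described
above (illustrative only; the project code is TypeScript, and the names
shape_key and count_distinct_shapes are hypothetical). It assumes the
active piece is available as a list of (row, col) cells.

```python
def shape_key(cells):
    """Return a position-invariant key for a list of (row, col) cells."""
    min_row = min(r for r, _ in cells)
    min_col = min(c for _, c in cells)
    # Shift so the bounding box starts at (0, 0), then sort for a stable order.
    return tuple(sorted((r - min_row, c - min_col) for r, c in cells))


def count_distinct_shapes(observations):
    """Count distinct normalized shapes across a list of observations."""
    return len({shape_key(cells) for cells in observations})


# A T piece at spawn, the same piece two rows lower after auto-drop,
# and the piece after one rotation.
baseline = [(0, 3), (0, 4), (0, 5), (1, 4)]
dropped = [(2, 3), (2, 4), (2, 5), (3, 4)]
rotated = [(2, 4), (3, 4), (4, 4), (3, 3)]

distinct = count_distinct_shapes([baseline, dropped, rotated])
rotate_passes = distinct >= 2  # baseline + at least one rotation
```

Because the key ignores absolute position, the dropped observation
collapses onto the baseline and only the rotation counts as a new shape.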

Tracking:
- session.distinctRotationShapes: max observed in Phase 3 probe
- session.rotationShapesByPiece: Map&lt;piece_type, Set&lt;shape_key&gt;&gt;
- playGame accepts rotationTrack param for gameplay-phase probing

Results:
- 9805c24a (broken): rotate now FAIL, all_pieces_rotate FAIL (was PASS/PASS)
- cbbff570 (flaky): rotate FAIL (was PASS)
- 4c7db3b9 (working): 100% score (up from 94%)
- 1d08ee76: 95% (up from 89%)
- 8fe72fce: 94% (unchanged)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>669aa68861617a446e5ae56aea0a462995d183f0</id>
<published>2026-04-11T05:30:25Z</published>
<updated>2026-04-11T05:30:25Z</updated>
<title>V2: game_over_display test passes on overlay OR restart presence</title>
<link rel="alternate" type="text/html" href="commit/669aa68861617a446e5ae56aea0a462995d183f0.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 669aa68861617a446e5ae56aea0a462995d183f0
parent 71f0c4b7931a2d58ae370587a6a0f62ba352b716
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 07:30:25 +0200

V2: game_over_display test passes on overlay OR restart presence

Previous check required BOTH a modal and a restart button. Now accepts
either signal because different games show game-over UI differently
(some have a full modal, some just show a restart button overlay).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>71f0c4b7931a2d58ae370587a6a0f62ba352b716</id>
<published>2026-04-11T05:28:13Z</published>
<updated>2026-04-11T05:28:13Z</updated>
<title>V2: language-agnostic game over detection, capture in Phase 6</title>
<link rel="alternate" type="text/html" href="commit/71f0c4b7931a2d58ae370587a6a0f62ba352b716.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 71f0c4b7931a2d58ae370587a6a0f62ba352b716
parent e0a13b62466491d470aedf8a0eb251d5075f2951
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 07:28:13 +0200

V2: language-agnostic game over detection, capture in Phase 6

Two fixes:

1. Game over display detection happens in Phase 6 (when game over is
   actually triggered) and is stored on session. Phase 8 no longer needs
   to re-trigger game over. Added gameOverText and gameOverRestartAvailable
   to GameSession.

2. detectGameOverText() and detectRestartOption() are now language-agnostic.
   Instead of matching text patterns, they detect structural modals:
   position fixed/absolute elements covering &gt;15% of viewport with visible
   background/content and z-index. Restart detection finds clickable
   elements inside the detected modal.
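
A rough Python sketch of the structural check in fix 2 (illustrative
only; the project code is TypeScript). The element is modelled as a plain
dict, and every field name here (position, width, height, visible,
z_index) is a hypothetical stand-in for whatever the driver reads from
the page. Treating a missing z-index as a non-modal is also an assumption.

```python
def looks_like_modal(el, viewport_w, viewport_h, min_cover=0.15):
    """Return True when el resembles a game-over modal by structure alone."""
    if el["position"] not in ("fixed", "absolute"):
        return False
    if not el["visible"] or el["z_index"] is None:
        return False
    # Fraction of the viewport area covered by the element.
    cover = (el["width"] * el["height"]) / (viewport_w * viewport_h)
    return cover > min_cover


overlay = {"position": "fixed", "width": 400, "height": 300,
           "visible": True, "z_index": 10}
badge = {"position": "absolute", "width": 100, "height": 50,
         "visible": True, "z_index": 10}

overlay_is_modal = looks_like_modal(overlay, 800, 600)  # covers 25%
badge_is_modal = looks_like_modal(badge, 800, 600)      # covers about 1%
```

No text is inspected, so the check behaves the same for any UI language.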

Also updated CLAUDE.md with explicit &quot;driver MUST NOT hard-code language
strings&quot; convention.

Known follow-ups: readLevel, detectNextPiecePreview, detectControls, and
score element search still use text matching. These need language-agnostic
replacements.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>e0a13b62466491d470aedf8a0eb251d5075f2951</id>
<published>2026-04-11T05:25:49Z</published>
<updated>2026-04-11T05:25:49Z</updated>
<title>Calibration cbbff570: rotation is flaky (human was wrong)</title>
<link rel="alternate" type="text/html" href="commit/e0a13b62466491d470aedf8a0eb251d5075f2951.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e0a13b62466491d470aedf8a0eb251d5075f2951
parent 4f28472171324e5ca141cd341697a345ac9438fc
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 07:25:49 +0200

Calibration cbbff570: rotation is flaky (human was wrong)

Re-tested rotation. At best it rotates once per piece; sometimes it
stalls the game or causes blocks to vanish. CCW rotation is also broken.
The previous human test was incorrect.

Game over can be triggered by spamming space and shows proper modal
with &quot;Play Again&quot; button.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4f28472171324e5ca141cd341697a345ac9438fc</id>
<published>2026-04-10T19:24:30Z</published>
<updated>2026-04-10T19:24:30Z</updated>
<title>Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes</title>
<link rel="alternate" type="text/html" href="commit/4f28472171324e5ca141cd341697a345ac9438fc.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4f28472171324e5ca141cd341697a345ac9438fc
parent f2f3ae07a56e601bfb819d07f034eedeabc6a9c8
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 21:24:30 +0200

Methodology: scoring uses SonarQube, code quality is in outputs, no emdashes

Headline score is 50% gameplay bot + 50% SonarQube. The lint/typecheck/
bundle &quot;code quality&quot; metrics are tracked as output metrics, not part of
the headline score.
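
As a worked example of the 50/50 weighting, a minimal Python sketch
(illustrative only). It assumes both components are already expressed on
the same 0-100 scale; the function and parameter names are hypothetical.

```python
def headline_score(gameplay_pct, sonarqube_pct):
    """Combine the two components with equal weight (both on a 0-100 scale)."""
    return 0.5 * gameplay_pct + 0.5 * sonarqube_pct


# A run scoring 80 on the gameplay bot and 60 on SonarQube lands at 70.
example = headline_score(gameplay_pct=80.0, sonarqube_pct=60.0)
```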

Also removed all prose emdashes per project convention.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>f2f3ae07a56e601bfb819d07f034eedeabc6a9c8</id>
<published>2026-04-10T19:20:12Z</published>
<updated>2026-04-10T19:20:12Z</updated>
<title>V2: fix AI player so it actually plays Tetris</title>
<link rel="alternate" type="text/html" href="commit/f2f3ae07a56e601bfb819d07f034eedeabc6a9c8.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit f2f3ae07a56e601bfb819d07f034eedeabc6a9c8
parent 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 21:20:12 +0200

V2: fix AI player so it actually plays Tetris

Four bugs in the LeeYiyuan port were preventing the bot from playing:

Bug 1 (primary): settledGrid pollution
After hard-drop, the bot read the grid and stored it as &#39;settled&#39; but
the next piece had already spawned. detectActivePieceCells then returned
null because the new piece was &#39;baked into settled&#39;. Bot froze after
move 1 (telemetry: pieces_spawned=1, pieces_locked=20).

Fix: stripActivePiece(boardBeforePlacement) before placement, wait
350ms after drop, detect new piece against the saved board, strip it
to get clean settledGrid for next iteration.

Bug 2: pre-rotation column math wrong
currentCol was captured before rotation, but rotation shifts the
piece&#39;s leftmost column, so column moves were off by 1-2 cells.

Fix: Adopt LeeYiyuan&#39;s slam-left strategy. After computing placement:
rotate N times, press Left 10 times (slam to wall), press Right
placement.column times, hard drop. No live position tracking needed.
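
The slam-left sequence can be sketched in Python as follows (illustrative
only; the project code is TypeScript). The key names are assumptions: the
rotate and hard-drop keys differ per game, so they are parameters here,
and slam_left_keys is a hypothetical helper name.

```python
def slam_left_keys(rotations, column, slam_presses=10,
                   rotate_key="ArrowUp", hard_drop_key="Space"):
    """Build the key sequence for one placement without tracking live position."""
    keys = [rotate_key] * rotations        # rotate N times first
    keys += ["ArrowLeft"] * slam_presses   # slam against the left wall
    keys += ["ArrowRight"] * column        # walk right to the target column
    keys.append(hard_drop_key)             # lock the piece
    return keys


sequence = slam_left_keys(rotations=2, column=3)
```

Ten Left presses are enough to reach the wall on a 10-column board from
any starting column, which is why the piece position never needs to be
read back.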

Bug 3: J[3] had negative column offset
PIECES.J[3] = [[0,0],[1,0],[2,0],[2,-1]] -- minCol=-1 broke the
column math for J piece in rotation state 3.

Fix: Normalized to [[0,1],[1,1],[2,1],[2,0]] with minCol=0. All 28
rotation states now have minRow=0 and minCol=0.

Bug 4: First-frame piece detection sloppy
detectActivePieceCells fallback scanned all cells in top 6 rows when
settledGrid was null, picking up UI chrome.

Fix: BFS for connected component of 3-5 cells in top 4 rows, pick the
one closest to spawn center column 4.5.
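
A Python sketch of that fallback (illustrative only; the project code is
TypeScript). It assumes occupied cells arrive as a set of (row, col)
pairs and uses 4-connectivity; find_spawn_piece is a hypothetical name.

```python
from collections import deque


def find_spawn_piece(filled, top_rows=4, spawn_col=4.5):
    """Pick the 3-5 cell connected component in the top rows nearest spawn_col.

    Returns a sorted list of cells, or None when no component qualifies.
    """
    candidates = {(r, c) for r, c in filled if r in range(top_rows)}
    seen = set()
    best, best_dist = None, None
    for start in sorted(candidates):
        if start in seen:
            continue
        # Breadth-first search over 4-connected neighbours.
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            r, c = queue.popleft()
            component.append((r, c))
            for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nxt in candidates and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(component) not in range(3, 6):
            continue  # too small or too large to be a falling piece
        centre = sum(c for _, c in component) / len(component)
        dist = abs(centre - spawn_col)
        if best is None or best_dist > dist:
            best, best_dist = sorted(component), dist
    return best


# A T piece near the spawn column plus two blobs of UI chrome.
filled = {(0, 3), (0, 4), (0, 5), (1, 4),   # T piece
          (0, 0), (0, 1),                   # 2-cell blob, rejected by size
          (2, 9), (3, 8), (3, 9),           # 3-cell blob, far from spawn
          (10, 2)}                          # below the top rows, ignored
piece = find_spawn_piece(filled)
```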

Results: bot now actually plays Tetris.
- Test on 4c7db3b9: 94% score (was 89%)
- Gameplay phase cleared 10 lines (was 0)
- Competitive play cleared 2 lines, scored 200 (was 0)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3d89b1b341dd38bd6e1d3574d07e083fb57b1d62</id>
<published>2026-04-10T19:11:59Z</published>
<updated>2026-04-10T19:11:59Z</updated>
<title>Update methodology page with current bot architecture</title>
<link rel="alternate" type="text/html" href="commit/3d89b1b341dd38bd6e1d3574d07e083fb57b1d62.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3d89b1b341dd38bd6e1d3574d07e083fb57b1d62
parent 7df3ddd793a69cba93ded966d634045f4810a5fc
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 21:11:59 +0200

Update methodology page with current bot architecture

Major rewrite of the bot section to reflect the actual implementation:
- 8 conditional phases (was 4)
- 25 tests across mechanics, lifecycle, gameplay, game state, competitive
- Two-tier architecture (Driver + Bot separation)
- Discovery infrastructure: language-agnostic start, interactivity check,
  control discovery, calibration cache
- All 9 competitive play bug detection tests listed
- 60ms polling (was 150ms)
- Updated limitations: GPU requirement, trail bugs, game over masking,
  hidden elements, etc.
- Pierre Dellacherie attribution

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7df3ddd793a69cba93ded966d634045f4810a5fc</id>
<published>2026-04-10T19:06:34Z</published>
<updated>2026-04-10T19:06:34Z</updated>
<title>Correct attribution: Pierre Dellacherie&#39;s 4-heuristic Tetris AI</title>
<link rel="alternate" type="text/html" href="commit/7df3ddd793a69cba93ded966d634045f4810a5fc.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7df3ddd793a69cba93ded966d634045f4810a5fc
parent 2644610c24ac12d9ef707571aec1e31a934389a8
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 21:06:34 +0200

Correct attribution: Pierre Dellacherie&#39;s 4-heuristic Tetris AI

The algorithm is from Pierre Dellacherie (2003), not LeeYiyuan.
Weights are from Colin Fahey&#39;s GA optimization. LeeYiyuan/tetrisai
is the reference implementation we adapted code from.

Updated methodology page, SPEC.md, player.ts, bot.ts, CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>2644610c24ac12d9ef707571aec1e31a934389a8</id>
<published>2026-04-10T18:04:37Z</published>
<updated>2026-04-10T18:04:37Z</updated>
<title>V2: control discovery system</title>
<link rel="alternate" type="text/html" href="commit/2644610c24ac12d9ef707571aec1e31a934389a8.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 2644610c24ac12d9ef707571aec1e31a934389a8
parent 8dc9ec566791cf32913b7ea8f3ba37a789ef0b86
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 20:04:37 +0200

V2: control discovery system

Bot no longer assumes default Tetris controls. Driver now probes each
candidate key to discover what it actually does:
- ArrowDown might be hard drop instead of soft drop
- Space might be pause instead of hard drop
- Some games have no soft drop at all (skip move_down test as N/A)

Discovery is reload-safe (clears game state between probes), classifies
based on grid delta (movement direction, distance, shape change),
budget capped at 50s.

New types: GameAction, ControlMapping, ControlMap with confidence levels
New driver methods: discoverControls(), getControl()
Bot updated: move_down and soft_drop_distinct skip when soft_drop not found
Report includes control_discovery field showing what each key does

Results:
- 1d08ee76 (control swap): 67% -&gt; 83-89%
- 4c7db3b9 (working game): 86% -&gt; 100%
- 8fe72fce (held): 95%

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>8dc9ec566791cf32913b7ea8f3ba37a789ef0b86</id>
<published>2026-04-10T16:58:30Z</published>
<updated>2026-04-10T16:58:30Z</updated>
<title>V2 fix: handle absolute-positioned active piece overlays</title>
<link rel="alternate" type="text/html" href="commit/8dc9ec566791cf32913b7ea8f3ba37a789ef0b86.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 8dc9ec566791cf32913b7ea8f3ba37a789ef0b86
parent 14d5747dc2f92d77eaa7655145f2fe71bcde4d0a
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 18:58:30 +0200

V2 fix: handle absolute-positioned active piece overlays

Game 8fe72fce uses absolute-positioned div overlays for the falling
piece, separate from the 200 grid cells. The grid reader was missing
the active piece because it only read the first 200 children.

Fix:
- Added refreshGridDetection() in driver: re-detects grid without
  full re-calibration, called by verifyGameStarted() after start clicks
- readDomGrid() now reads overlay children (&gt;200 children) and computes
  which grid cell each absolute-positioned overlay falls into
- Widened child-count ranges from 180-220 to 180-230 to accommodate overlays
- Added screenshotGridArea() and captureGridDomFingerprint() as fallback
  signals for verifyGameStarted when grid-based detection misses
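
One way to map an absolute-positioned overlay to a grid cell, sketched in
Python (illustrative only; the project code is TypeScript and this is not
the actual readDomGrid logic). It assumes a 10x20 grid, pixel rectangles
in the same coordinate space, and that the overlay centre decides the
cell; overlay_to_cell and the dict fields are hypothetical.

```python
def overlay_to_cell(overlay, grid, rows=20, cols=10):
    """Map an overlay rectangle to the (row, col) of the cell under its centre.

    overlay and grid are dicts with left, top, width, height in pixels.
    Returns None when the centre lies outside the grid.
    """
    cell_w = grid["width"] / cols
    cell_h = grid["height"] / rows
    centre_x = overlay["left"] + overlay["width"] / 2
    centre_y = overlay["top"] + overlay["height"] / 2
    col = int((centre_x - grid["left"]) // cell_w)
    row = int((centre_y - grid["top"]) // cell_h)
    if row not in range(rows) or col not in range(cols):
        return None
    return (row, col)


grid = {"left": 100, "top": 50, "width": 300, "height": 600}    # 30px cells
overlay = {"left": 220, "top": 110, "width": 30, "height": 30}
cell = overlay_to_cell(overlay, grid)  # row 2, column 4
```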

Results: 8fe72fce 0% -&gt; 95% (matches human&#39;s 20/20).
Overall V2 vs human: 82% -&gt; 86% agreement.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>14d5747dc2f92d77eaa7655145f2fe71bcde4d0a</id>
<published>2026-04-10T12:38:22Z</published>
<updated>2026-04-10T12:38:22Z</updated>
<title>Add gemma426b run artifacts and results</title>
<link rel="alternate" type="text/html" href="commit/14d5747dc2f92d77eaa7655145f2fe71bcde4d0a.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 14d5747dc2f92d77eaa7655145f2fe71bcde4d0a
parent 3bde26d36a17e8b79525bbe582d3ab13b8d8387b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 14:38:22 +0200

Add gemma426b run artifacts and results

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3bde26d36a17e8b79525bbe582d3ab13b8d8387b</id>
<published>2026-04-10T12:36:48Z</published>
<updated>2026-04-10T12:36:48Z</updated>
<title>V2 bot: caching, bot/driver bridge, fixed CCW rotation test</title>
<link rel="alternate" type="text/html" href="commit/3bde26d36a17e8b79525bbe582d3ab13b8d8387b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3bde26d36a17e8b79525bbe582d3ab13b8d8387b
parent d162c5ba603ac08e3db2a7fe0919dd0494c4f14d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 10 Apr 2026 14:36:48 +0200

V2 bot: caching, bot/driver bridge, fixed CCW rotation test

Three improvements merged:

1. Calibration caching (driver.ts): caches start mechanism, controls,
   grid bounds across reloads. Detects drift, flags conflicts. Eliminates
   timeouts from repeated full calibration.

2. Bot/driver bridge (bot.ts, driver.ts, types.ts): bot verifies game
   actually started before driver commits to a mechanism. Checks grid
   populated, movement responsive, no game-over text. New methods:
   discoverStartCandidates, tryStartMechanism, confirmStartMechanism,
   rejectStartMechanism.

3. CCW rotation test (bot.ts): fixed broken sequential test that was
   tautologically true. Now reloads page between Z and X tests, compares
   rotation states from same baseline.

Results vs human calibration (9 games):
- V1: 56/97 = 58% agreement
- V2: 80/98 = 82% agreement

Major wins: e2e04e75 (Spanish 18% -&gt; 85%, perfect agreement),
4949d521 (trail bug 18% -&gt; 67%), cbbff570 (18% -&gt; 67%),
9805c24a (80% -&gt; 95%), 7a348b81 (correctly finds working start button).

Known regression: 8fe72fce went 44% -&gt; 0% because bridge&#39;s strict
verification rejects start mechanisms when benign startup console
errors occur. Needs follow-up: distinguish pre-start errors from
fatal errors.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>d162c5ba603ac08e3db2a7fe0919dd0494c4f14d</id>
<published>2026-04-09T19:14:39Z</published>
<updated>2026-04-09T19:14:39Z</updated>
<title>Add gameplay-bot-v2: two-tier architecture (Driver + Bot)</title>
<link rel="alternate" type="text/html" href="commit/d162c5ba603ac08e3db2a7fe0919dd0494c4f14d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d162c5ba603ac08e3db2a7fe0919dd0494c4f14d
parent 3012989bb80dca8980569effc60dd0bd59e283c3
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 21:14:39 +0200

Add gameplay-bot-v2: two-tier architecture (Driver + Bot)

New implementation with clean separation:
- driver.ts (1710 lines): TetrisDriver class, all Playwright interaction
- bot.ts (1690 lines): game logic, 25 tests, zero Playwright imports
- types.ts (233 lines): TetrisDriver interface with 17 methods

Improvements over v1:
- Buttons before keyboard in start detection
- 300ms post-click initialization wait
- False start rejection (immediate game-over check)
- Grid re-calibration after start
- playable_30s gates on errors_during_play only
- Interactivity verification via screenshot + DOM state

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3012989bb80dca8980569effc60dd0bd59e283c3</id>
<published>2026-04-09T18:22:15Z</published>
<updated>2026-04-09T18:22:15Z</updated>
<title>Update calibration: 9805c24a (broken rotation, bad randomizer), cbbff570 (mostly works, spurious line clear, weird preview)</title>
<link rel="alternate" type="text/html" href="commit/3012989bb80dca8980569effc60dd0bd59e283c3.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3012989bb80dca8980569effc60dd0bd59e283c3
parent 58c112b941608fed3c655c638c0e4c17daa5bb19
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 20:22:15 +0200

Update calibration: 9805c24a (broken rotation, bad randomizer),
cbbff570 (mostly works, spurious line clear, weird preview)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>58c112b941608fed3c655c638c0e4c17daa5bb19</id>
<published>2026-04-09T18:15:12Z</published>
<updated>2026-04-09T18:15:12Z</updated>
<title>Add test #25 rendering_clean, update calibration data</title>
<link rel="alternate" type="text/html" href="commit/58c112b941608fed3c655c638c0e4c17daa5bb19.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 58c112b941608fed3c655c638c0e4c17daa5bb19
parent 67bd49c6e259f78aade3caeae40c3418dedf8071
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 20:15:12 +0200

Add test #25 rendering_clean, update calibration data

New test detects rendering trail bugs where falling pieces leave old
positions still colored. Checks filled cell growth vs pieces placed
during competitive play (threshold: 8x = trail bug).

Updated calibration: 1d08ee76 (broken rotation, no soft drop),
4949d521 (trail rendering bug, lines never clear).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>67bd49c6e259f78aade3caeae40c3418dedf8071</id>
<published>2026-04-09T10:56:29Z</published>
<updated>2026-04-09T10:56:29Z</updated>
<title>Add two-tier architecture refactor spec for gameplay bot</title>
<link rel="alternate" type="text/html" href="commit/67bd49c6e259f78aade3caeae40c3418dedf8071.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 67bd49c6e259f78aade3caeae40c3418dedf8071
parent 7fbe88ce2a1febb0954305d10f4e1878570e0f14
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 12:56:29 +0200

Add two-tier architecture refactor spec for gameplay bot

Driver (webpage abstraction) + Bot (game logic) separation.
17-method TetrisDriver interface, 4-commit incremental migration plan,
~2740 lines (down from 3500). Bot never imports Playwright.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7fbe88ce2a1febb0954305d10f4e1878570e0f14</id>
<published>2026-04-09T09:48:33Z</published>
<updated>2026-04-09T09:48:33Z</updated>
<title>Verify game interactivity via DOM + screenshot after start detection</title>
<link rel="alternate" type="text/html" href="commit/7fbe88ce2a1febb0954305d10f4e1878570e0f14.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7fbe88ce2a1febb0954305d10f4e1878570e0f14
parent 53c719fefdf6f437deb0b34eb1b8dbff56d06643
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 11:48:33 +0200

Verify game interactivity via DOM + screenshot after start detection

Start detection now requires the game to respond to gameplay inputs
(ArrowLeft/Right/Down) before confirming a mechanism worked. Checks
both screenshot changes AND DOM state changes (class names, styles on
grid children). This catches:
- False starts from Space (visual change but not interactive)
- Games that rebuild DOM via innerHTML (screenshot identical but DOM differs)
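
A minimal sketch of the either-signal check described above, in Python
for illustration only. Snapshot, responded and is_interactive are
hypothetical names, and accepting a start when any one key produces a
change is an assumption, not necessarily what the bot does.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """State captured around one gameplay key press (hypothetical type)."""
    screenshot: bytes    # raw screenshot bytes
    dom_signature: str   # e.g. class names and styles of grid children


def digest(data):
    # Hash the screenshot so two frames compare cheaply.
    return hashlib.sha256(data).hexdigest()


def responded(before, after):
    """True if either the pixels or the DOM state changed."""
    pixels_changed = digest(before.screenshot) != digest(after.screenshot)
    dom_changed = before.dom_signature != after.dom_signature
    return pixels_changed or dom_changed


def is_interactive(pairs):
    """pairs maps a key name to its (before, after) snapshots."""
    return any(responded(b, a) for b, a in pairs.values())
```

Comparing the DOM signature as well as the pixels is what lets an
innerHTML rebuild with an identical screenshot still count as a response.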

Spanish game e2e04e75 now correctly starts via &quot;Iniciar Juego&quot; button
(was falsely starting via Space). Score went from 18% to 75%.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>53c719fefdf6f437deb0b34eb1b8dbff56d06643</id>
<published>2026-04-09T09:20:13Z</published>
<updated>2026-04-09T09:20:13Z</updated>
<title>Add grid re-sampling after game start detection</title>
<link rel="alternate" type="text/html" href="commit/53c719fefdf6f437deb0b34eb1b8dbff56d06643.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 53c719fefdf6f437deb0b34eb1b8dbff56d06643
parent 7ec3ff4435d0c822bb73dc7b4689cc4908ca9883
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 11:20:13 +0200

Add grid re-sampling after game start detection

If initial calibration finds no grid (renderer: unknown), re-calibrate
after the game starts. Games that create DOM cells dynamically via JS
(innerHTML on render loop) have 0 children at page load but 200 after
starting. Records grid_detected_at: &quot;initial&quot; vs &quot;after_start&quot; in report.

Known issue: e2e04e75 game has startBtn button but bot falsely detects
start via Space (something changed but game didn&#39;t actually start).
Needs stronger start verification in next session.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7ec3ff4435d0c822bb73dc7b4689cc4908ca9883</id>
<published>2026-04-09T09:10:57Z</published>
<updated>2026-04-09T09:10:57Z</updated>
<title>Add all 10 DOM games to calibration page</title>
<link rel="alternate" type="text/html" href="commit/7ec3ff4435d0c822bb73dc7b4689cc4908ca9883.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7ec3ff4435d0c822bb73dc7b4689cc4908ca9883
parent 4bffa2cd4b2213f424528e1cc00cecce1fcd1a8e
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 11:10:57 +0200

Add all 10 DOM games to calibration page

5 new entries (human tests unanswered, ready for testing).
5 existing entries already have human test data.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4bffa2cd4b2213f424528e1cc00cecce1fcd1a8e</id>
<published>2026-04-09T07:18:31Z</published>
<updated>2026-04-09T07:18:31Z</updated>
<title>Update gameplay bot results for 10 DOM games with new start detection</title>
<link rel="alternate" type="text/html" href="commit/4bffa2cd4b2213f424528e1cc00cecce1fcd1a8e.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4bffa2cd4b2213f424528e1cc00cecce1fcd1a8e
parent 1d5cce537fb6c78ac946ca42dca010215a97e6fd
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 09:18:31 +0200

Update gameplay bot results for 10 DOM games with new start detection

All 10 DOM games now start successfully (10/10 game_starts: PASS).
Language-agnostic detection works: space, auto, and button click all detected.
Scores range 18-85%. The 18% scores are grid reader failures (canvas
rendering despite DOM classification), not start detection issues.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>1d5cce537fb6c78ac946ca42dca010215a97e6fd</id>
<published>2026-04-09T07:03:26Z</published>
<updated>2026-04-09T07:03:26Z</updated>
<title>Language-agnostic start detection for gameplay bot</title>
<link rel="alternate" type="text/html" href="commit/1d5cce537fb6c78ac946ca42dca010215a97e6fd.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1d5cce537fb6c78ac946ca42dca010215a97e6fd
parent 4ce8d09103c723f23b4f1d266fe3aef143995996
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 09:03:26 +0200

Language-agnostic start detection for gameplay bot

Rewrote start mechanism detection to be fully language-agnostic:
- No text string matching (removed btn/start/play selectors)
- Detects clickable elements by cursor:pointer, background color, size
- Sorts candidates by prominence (size, center proximity, contrast)
- Keyboard triggers (Enter, Space) tried before DOM buttons
- All wait times reduced from 300-500ms to 100ms
- Overlay detection purely structural (position, z-index, viewport %)
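
The prominence ordering could be sketched as follows, in Python for
illustration only. The Candidate fields, the 0.4/0.4/0.2 weights and the
normalisation are all assumptions; the commit only names the three
signals (size, center proximity, contrast).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    area: float             # width * height in px
    center_distance: float  # distance from viewport center in px
    contrast: float         # 0..1 against the page background


def prominence(c, viewport_area, viewport_diagonal):
    # Normalise each signal to 0..1 before weighting (hypothetical weights).
    size_score = min(c.area / viewport_area, 1.0)
    center_score = 1.0 - min(c.center_distance / (viewport_diagonal / 2), 1.0)
    return 0.4 * size_score + 0.4 * center_score + 0.2 * c.contrast


def rank(candidates, viewport_area, viewport_diagonal):
    # Most prominent candidate first; no text content is inspected.
    def key(c):
        return prominence(c, viewport_area, viewport_diagonal)
    return sorted(candidates, key=key, reverse=True)
```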

Tested: Spanish DOM game now starts correctly (game_starts: PASS via space).
Spanish DOM game 2 scores 89% (up from 80%).
Canvas games still blocked by GPU pixel readback issue.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4ce8d09103c723f23b4f1d266fe3aef143995996</id>
<published>2026-04-09T06:07:32Z</published>
<updated>2026-04-09T06:07:32Z</updated>
<title>Update calibration: 93e8feea starts into game over, e2e04e75 no scoring</title>
<link rel="alternate" type="text/html" href="commit/4ce8d09103c723f23b4f1d266fe3aef143995996.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4ce8d09103c723f23b4f1d266fe3aef143995996
parent b19aa539899396ddc9373afcaa71621210b6e113
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 08:07:32 +0200

Update calibration: 93e8feea starts into game over, e2e04e75 no scoring

93e8feea: Game loads into immediate game over state with overlay that
never dismisses. Bot false positive (says game_starts: PASS).
e2e04e75: Spanish game, all mechanics work but score never changes
(real game bug). Bot false negative (18% but most mechanics work).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>b19aa539899396ddc9373afcaa71621210b6e113</id>
<published>2026-04-09T06:04:42Z</published>
<updated>2026-04-09T06:04:42Z</updated>
<title>Calibration: copy button instead of JSON block, update human results</title>
<link rel="alternate" type="text/html" href="commit/b19aa539899396ddc9373afcaa71621210b6e113.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit b19aa539899396ddc9373afcaa71621210b6e113
parent 9fab5af2106b43a76d5ec6827903ccd666eb3945
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 08:04:42 +0200

Calibration: copy button instead of JSON block, update human results

Replace inline JSON pre block with a clean &quot;Copy JSON&quot; button.
Updated 4c7db3b9 (Spanish, all mechanics work) and 8fe72fce (English,
19 human passes including multi-line clear, score scaling, CCW rotation).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>9fab5af2106b43a76d5ec6827903ccd666eb3945</id>
<published>2026-04-09T05:56:41Z</published>
<updated>2026-04-09T05:56:41Z</updated>
<title>Fix calibration UI: connect Human Testing toggle to all cards</title>
<link rel="alternate" type="text/html" href="commit/9fab5af2106b43a76d5ec6827903ccd666eb3945.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 9fab5af2106b43a76d5ec6827903ccd666eb3945
parent cce938f0ee50da98113502ccbc5e5d066efc5137
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 07:56:41 +0200

Fix calibration UI: connect Human Testing toggle to all cards

The editing state from the parent &quot;Human Testing&quot; button now flows
down to each CalibrationCard, showing tri-state test controls, editable
notes, and copyable JSON export on all cards simultaneously.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>cce938f0ee50da98113502ccbc5e5d066efc5137</id>
<published>2026-04-09T05:52:24Z</published>
<updated>2026-04-09T05:52:24Z</updated>
<title>Interactive calibration UI with human testing mode</title>
<link rel="alternate" type="text/html" href="commit/cce938f0ee50da98113502ccbc5e5d066efc5137.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit cce938f0ee50da98113502ccbc5e5d066efc5137
parent d748de6f4a388178c427cd55d11db0149a9d0d5b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 07:52:24 +0200

Interactive calibration UI with human testing mode

Calibrate page now uses a React island with:
- &quot;Human Testing&quot; toggle button reveals clickable tri-state controls
  (yes/no/unanswered) for each test per game
- Short code + game link in title for easy click-and-play
- Editable notes field
- Copyable JSON export per card for pasting results back
- Aggregate agree/disagree stats at top
- Bot results referenced from eval_results.json (not duplicated)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>d748de6f4a388178c427cd55d11db0149a9d0d5b</id>
<published>2026-04-09T05:41:41Z</published>
<updated>2026-04-09T05:41:41Z</updated>
<title>Add bot calibration page with human vs bot comparison</title>
<link rel="alternate" type="text/html" href="commit/d748de6f4a388178c427cd55d11db0149a9d0d5b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d748de6f4a388178c427cd55d11db0149a9d0d5b
parent dcef6a4928511792f670d74ad63b8e1b9a7bde45
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 07:41:41 +0200

Add bot calibration page with human vs bot comparison

Hidden /calibrate page showing hand-picked games with human test results
side-by-side with bot results. Data is JSON-powered (one file per game in
calibration/), references canonical bot results from eval_results.json.

Initial 5 entries from manual testing of DOM-rendered games:
- 2 games match well (80-85% bot vs human &quot;playable&quot;)
- 1 false negative (bot 18%, human says playable -- likely GPU issue)
- 1 overlay bug correctly identified by human, bot confused
- 1 genuinely broken game (both agree: won&#39;t start)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>dcef6a4928511792f670d74ad63b8e1b9a7bde45</id>
<published>2026-04-09T05:23:38Z</published>
<updated>2026-04-09T05:23:38Z</updated>
<title>Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play</title>
<link rel="alternate" type="text/html" href="commit/dcef6a4928511792f670d74ad63b8e1b9a7bde45.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit dcef6a4928511792f670d74ad63b8e1b9a7bde45
parent f978492f1169d00686406170f46df2c1f5f783ca
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Thu,  9 Apr 2026 07:23:38 +0200

Rewrite gameplay bot: 24 tests, 8 conditional phases, competitive play

Major rewrite implementing the full SPEC.md design:

Phase 1: Page load
Phase 2: Start detection with falling piece detector (10 screenshots
  at 100ms, pixel cluster tracking for downward movement), overlay
  detection, cascading trigger sequence (auto/enter/space/button/canvas)
Phase 3: Mechanics (movement, rotation, hard drop) -- conditional on P2
Phase 4: Piece lifecycle (lock, spawn, multiple) -- conditional on P3
Phase 5: Gameplay (60 pieces/45s, integrated score tracking) -- cond. P4
Phase 6: Game over (stack to top via grid reader) -- conditional on P4
Phase 7: Endurance (30s play) -- conditional on P5
Phase 8: Competitive play (60s, 8 bug-detection tests) -- conditional on P5
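
The conditional gating above can be sketched as a dependency map, in
Python for illustration only. DEPENDS_ON and run_phases are hypothetical
names, and P2 depending on P1 is an assumption (the list only states the
dependencies from P3 onward).

```python
# Phase number mapped to the phase it is conditional on.
DEPENDS_ON = {1: None, 2: 1, 3: 2, 4: 3, 5: 4, 6: 4, 7: 5, 8: 5}


def run_phases(runners):
    """runners maps a phase number to a zero-argument callable returning bool."""
    status = {}
    for phase in sorted(DEPENDS_ON):
        parent = DEPENDS_ON[phase]
        if parent is not None and status[parent] != "pass":
            # A failed or skipped parent skips the phase instead of failing it.
            status[phase] = "skipped"
            continue
        status[phase] = "pass" if runners[phase]() else "fail"
    return status
```

With this map a failure in P5 skips P7 and P8 but still lets P6 run,
because P6 is conditional on P4 only.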

New tests 17-24: multi_line_clear, score_scaling, level_progression,
speed_progression, next_piece_preview, game_over_display,
counter_clockwise_rotation, soft_drop_distinct

Score = passed / (total - skipped). Skipped tests don&#39;t penalize.
Added SurveyData, CompetitivePlayResult types. Page survey function
in calibrate.ts. 5-minute timeout for competitive play phase.
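
The scoring rule as a minimal sketch, in Python for illustration only.
The function name and the zero-denominator handling are assumptions; the
commit does not say what happens when every test is skipped.

```python
def score(passed, total, skipped):
    """Fraction of non-skipped tests that passed."""
    counted = total - skipped
    if counted == 0:
        return 0.0  # assumption: nothing ran, so report 0
    return passed / counted


print(score(18, 24, 0))  # 0.75
print(score(12, 24, 8))  # 0.75, the 8 skipped tests do not penalize
```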

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>f978492f1169d00686406170f46df2c1f5f783ca</id>
<published>2026-04-08T21:40:24Z</published>
<updated>2026-04-08T21:40:24Z</updated>
<title>Add comprehensive gameplay bot spec (24 tests, 8 phases)</title>
<link rel="alternate" type="text/html" href="commit/f978492f1169d00686406170f46df2c1f5f783ca.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit f978492f1169d00686406170f46df2c1f5f783ca
parent dcf0b2b68809e147c61745c349e067cdc700b022
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 23:40:24 +0200

Add comprehensive gameplay bot spec (24 tests, 8 phases)

Master spec consolidating all design decisions for the gameplay bot rewrite:
- Conditional phase execution (each depends on previous succeeding)
- Falling piece detector (10 screenshots at 100ms, pixel cluster tracking)
- Start detection cascade: auto -&gt; overlay -&gt; buttons -&gt; keyboard -&gt; canvas
- Competitive play phase (60s, bug detection for multi-line clear, score
  scaling, level/speed progression, rotation, soft drop)
- 24 total tests (16 basic + 8 competitive play bug checks)
- Skipped tests don&#39;t penalize score: passed / (total - skipped)
- GPU requirement documented for canvas pixel readback

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>dcf0b2b68809e147c61745c349e067cdc700b022</id>
<published>2026-04-08T18:54:35Z</published>
<updated>2026-04-08T18:54:35Z</updated>
<title>Checkpoint: 40 runs (438 total)</title>
<link rel="alternate" type="text/html" href="commit/dcf0b2b68809e147c61745c349e067cdc700b022.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit dcf0b2b68809e147c61745c349e067cdc700b022
parent 5df80751cd068bed2ff25120efc40e25ea15b4d8
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 20:54:35 +0200

Checkpoint: 40 runs (438 total)

</content>
</entry>
<entry>
<id>5df80751cd068bed2ff25120efc40e25ea15b4d8</id>
<published>2026-04-08T18:37:51Z</published>
<updated>2026-04-08T18:37:51Z</updated>
<title>Checkpoint: 35 runs (433 total)</title>
<link rel="alternate" type="text/html" href="commit/5df80751cd068bed2ff25120efc40e25ea15b4d8.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 5df80751cd068bed2ff25120efc40e25ea15b4d8
parent a683c185d8f6ef2a499b586db5302236627e7013
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 20:37:51 +0200

Checkpoint: 35 runs (433 total)

</content>
</entry>
<entry>
<id>a683c185d8f6ef2a499b586db5302236627e7013</id>
<published>2026-04-08T18:20:43Z</published>
<updated>2026-04-08T18:20:43Z</updated>
<title>Checkpoint: 30 runs (428 total)</title>
<link rel="alternate" type="text/html" href="commit/a683c185d8f6ef2a499b586db5302236627e7013.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a683c185d8f6ef2a499b586db5302236627e7013
parent 664a80e943fc00e4a5c6e147ebeaec3051c6def5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 20:20:43 +0200

Checkpoint: 30 runs (428 total)

</content>
</entry>
<entry>
<id>664a80e943fc00e4a5c6e147ebeaec3051c6def5</id>
<published>2026-04-08T18:04:16Z</published>
<updated>2026-04-08T18:04:16Z</updated>
<title>Checkpoint: 25 runs (423 total)</title>
<link rel="alternate" type="text/html" href="commit/664a80e943fc00e4a5c6e147ebeaec3051c6def5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 664a80e943fc00e4a5c6e147ebeaec3051c6def5
parent ef35ac28812a8669213b243d85d99716d7caf2cd
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 20:04:16 +0200

Checkpoint: 25 runs (423 total)

</content>
</entry>
<entry>
<id>ef35ac28812a8669213b243d85d99716d7caf2cd</id>
<published>2026-04-08T17:47:23Z</published>
<updated>2026-04-08T17:47:23Z</updated>
<title>Checkpoint: 20 runs (418 total)</title>
<link rel="alternate" type="text/html" href="commit/ef35ac28812a8669213b243d85d99716d7caf2cd.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit ef35ac28812a8669213b243d85d99716d7caf2cd
parent 8199b922d2c75cb7dbe9ca98bf806ca4ce70a962
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 19:47:23 +0200

Checkpoint: 20 runs (418 total)

</content>
</entry>
<entry>
<id>8199b922d2c75cb7dbe9ca98bf806ca4ce70a962</id>
<published>2026-04-08T17:30:46Z</published>
<updated>2026-04-08T17:30:46Z</updated>
<title>Checkpoint: 15 runs (413 total)</title>
<link rel="alternate" type="text/html" href="commit/8199b922d2c75cb7dbe9ca98bf806ca4ce70a962.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 8199b922d2c75cb7dbe9ca98bf806ca4ce70a962
parent 413bf6cd88b73f770d308ed9eade42d2cfc570be
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 19:30:46 +0200

Checkpoint: 15 runs (413 total)

</content>
</entry>
<entry>
<id>413bf6cd88b73f770d308ed9eade42d2cfc570be</id>
<published>2026-04-08T16:09:55Z</published>
<updated>2026-04-08T16:09:55Z</updated>
<title>Checkpoint: 10 runs (408 total)</title>
<link rel="alternate" type="text/html" href="commit/413bf6cd88b73f770d308ed9eade42d2cfc570be.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 413bf6cd88b73f770d308ed9eade42d2cfc570be
parent 077892dc11b20dbd6266eaad70102215e24f8108
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 18:09:55 +0200

Checkpoint: 10 runs (408 total)

</content>
</entry>
<entry>
<id>077892dc11b20dbd6266eaad70102215e24f8108</id>
<published>2026-04-08T11:48:57Z</published>
<updated>2026-04-08T11:48:57Z</updated>
<title>Checkpoint: 5 runs (403 total)</title>
<link rel="alternate" type="text/html" href="commit/077892dc11b20dbd6266eaad70102215e24f8108.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 077892dc11b20dbd6266eaad70102215e24f8108
parent 0fb5a7736cd2f9318f185a4e0a0f2d5b3e73f97d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 13:48:57 +0200

Checkpoint: 5 runs (403 total)

</content>
</entry>
<entry>
<id>0fb5a7736cd2f9318f185a4e0a0f2d5b3e73f97d</id>
<published>2026-04-08T11:23:55Z</published>
<updated>2026-04-08T11:23:55Z</updated>
<title>Fix page load: use waitUntil commit, try root URL first</title>
<link rel="alternate" type="text/html" href="commit/0fb5a7736cd2f9318f185a4e0a0f2d5b3e73f97d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 0fb5a7736cd2f9318f185a4e0a0f2d5b3e73f97d
parent 43fb9fa4943a04511bffd96ccd1ba7e925d1ef15
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 13:23:55 +0200

Fix page load: use waitUntil commit, try root URL first

Games with blocking JS never fire domcontentloaded.
waitUntil: commit just waits for first bytes.
Try root / before /index.html (serve redirects in SPA mode).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>43fb9fa4943a04511bffd96ccd1ba7e925d1ef15</id>
<published>2026-04-08T07:59:16Z</published>
<updated>2026-04-08T07:59:16Z</updated>
<title>Rewrite start detection: 5-phase, language-agnostic, visual change</title>
<link rel="alternate" type="text/html" href="commit/43fb9fa4943a04511bffd96ccd1ba7e925d1ef15.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 43fb9fa4943a04511bffd96ccd1ba7e925d1ef15
parent 69173d2750e5cab2d6a94d1c152116be336341c2
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 09:59:16 +0200

Rewrite start detection: 5-phase, language-agnostic, visual change

Phase 1: auto-start (10 frames at 100ms, no input)
Phase 2: DOM buttons by visual prominence (no text matching)
Phase 3: canvas click grid (center, upper, lower, 3x3)
Phase 4: keyboard triggers with combos
Phase 5: retry all phases
detectVisualChange: Level 1 (any change) + Level 2 (gameplay pattern)
30-second total budget. Stateful button recording.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>69173d2750e5cab2d6a94d1c152116be336341c2</id>
<published>2026-04-08T07:21:29Z</published>
<updated>2026-04-08T07:21:29Z</updated>
<title>Fix large prompt handling: use wrapper script instead of bash -c</title>
<link rel="alternate" type="text/html" href="commit/69173d2750e5cab2d6a94d1c152116be336341c2.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 69173d2750e5cab2d6a94d1c152116be336341c2
parent 77e7e9a06140edc233811dccc1cb005b0402575d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 09:21:29 +0200

Fix large prompt handling: use wrapper script instead of bash -c

bash -c with $(cat) broke on quotes in settings JSON.
Now writes a shell wrapper script that reads the prompt from file.
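
A rough sketch of the wrapper-script idea, in Python for illustration
only. It assumes the target command accepts the prompt on stdin, which
the commit does not confirm; run_with_prompt_file and the file names are
hypothetical.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def run_with_prompt_file(prompt, command):
    """Run command (a list of argv strings) with the prompt piped on stdin."""
    with tempfile.TemporaryDirectory() as tmp:
        prompt_path = Path(tmp) / "prompt.txt"
        prompt_path.write_text(prompt, encoding="utf-8")
        # The wrapper only ever mentions the file path, so quotes inside the
        # prompt never pass through shell parsing or the argv size limit.
        wrapper = Path(tmp) / "run.sh"
        wrapper.write_text(
            "#!/bin/sh\n"
            f"cat {shlex.quote(str(prompt_path))} | {shlex.join(command)}\n",
            encoding="utf-8",
        )
        result = subprocess.run(
            ["sh", str(wrapper)], capture_output=True, text=True, check=True
        )
        return result.stdout
```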

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>77e7e9a06140edc233811dccc1cb005b0402575d</id>
<published>2026-04-08T06:52:25Z</published>
<updated>2026-04-08T06:52:25Z</updated>
<title>Checkpoint: 30 runs (414 total)</title>
<link rel="alternate" type="text/html" href="commit/77e7e9a06140edc233811dccc1cb005b0402575d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 77e7e9a06140edc233811dccc1cb005b0402575d
parent e9c7251cd07c133098a32de1b00898bc7ea79d3f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 08:52:25 +0200

Checkpoint: 30 runs (414 total)

</content>
</entry>
<entry>
<id>e9c7251cd07c133098a32de1b00898bc7ea79d3f</id>
<published>2026-04-08T05:58:36Z</published>
<updated>2026-04-08T05:58:36Z</updated>
<title>Add 95% CI bands, statistical power card, tornado CI whiskers</title>
<link rel="alternate" type="text/html" href="commit/e9c7251cd07c133098a32de1b00898bc7ea79d3f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e9c7251cd07c133098a32de1b00898bc7ea79d3f
parent 4c5457fbc3c2f5ff52de70289b518e2f956800f4
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:58:36 +0200

Add 95% CI bands, statistical power card, tornado CI whiskers

- Box plot: CI band overlay with mean dot, tooltip shows CI range
- Statistical Power card: avg CI width, detectable effect, color status
- Tornado: CI whiskers on effect bars, non-significant dimmed with &quot;n.s.&quot;
- confidenceInterval() function with t-distribution for small samples
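
A sketch of a 95% confidence interval using Student t critical values,
in Python for illustration only. The lookup table and the fallback to
the normal value 1.960 beyond 30 degrees of freedom are assumptions, not
the confidenceInterval() implementation in this repository.

```python
import math
import statistics

# Two-sided 95% critical values of Student t, keyed by degrees of freedom.
T_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
    7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179,
    13: 2.160, 14: 2.145, 15: 2.131, 16: 2.120, 17: 2.110, 18: 2.101,
    19: 2.093, 20: 2.086, 21: 2.080, 22: 2.074, 23: 2.069, 24: 2.064,
    25: 2.060, 26: 2.056, 27: 2.052, 28: 2.048, 29: 2.045, 30: 2.042,
}
Z_95 = 1.960  # normal approximation used once the table runs out


def confidence_interval(samples):
    """Return (low, high) bounds of the 95% CI for the mean."""
    n = len(samples)
    if n in (0, 1):
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    half_width = T_95.get(n - 1, Z_95) * standard_error
    return mean - half_width, mean + half_width
```

For small cells the t value matters: with 5 samples the multiplier is
2.776 rather than 1.960, so the band is roughly 40% wider.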

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4c5457fbc3c2f5ff52de70289b518e2f956800f4</id>
<published>2026-04-08T05:45:51Z</published>
<updated>2026-04-08T05:45:51Z</updated>
<title>Checkpoint: 15 runs (399 total)</title>
<link rel="alternate" type="text/html" href="commit/4c5457fbc3c2f5ff52de70289b518e2f956800f4.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4c5457fbc3c2f5ff52de70289b518e2f956800f4
parent 150e14e6b771b6276fa497d524bfb868bd1cb9d0
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:45:51 +0200

Checkpoint: 15 runs (399 total)

</content>
</entry>
<entry>
<id>150e14e6b771b6276fa497d524bfb868bd1cb9d0</id>
<published>2026-04-08T05:32:40Z</published>
<updated>2026-04-08T05:32:40Z</updated>
<title>Switch qwen-3.6-plus from free to paid endpoint</title>
<link rel="alternate" type="text/html" href="commit/150e14e6b771b6276fa497d524bfb868bd1cb9d0.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 150e14e6b771b6276fa497d524bfb868bd1cb9d0
parent 625d14b3b226e882d25a00909c6d47ab82d0080b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:32:40 +0200

Switch qwen-3.6-plus from free to paid endpoint

</content>
</entry>
<entry>
<id>625d14b3b226e882d25a00909c6d47ab82d0080b</id>
<published>2026-04-08T05:17:24Z</published>
<updated>2026-04-08T05:17:24Z</updated>
<title>Fix argument list too long for noise cells</title>
<link rel="alternate" type="text/html" href="commit/625d14b3b226e882d25a00909c6d47ab82d0080b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 625d14b3b226e882d25a00909c6d47ab82d0080b
parent e59ff443edb659c9d21d3fef8d708bd29176f827
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:17:24 +0200

Fix argument list too long for noise cells

Large prompts (&gt;100KB from context noise) exceeded OS arg limit.
Now writes prompt to temp file and uses bash -c with cat for large prompts.
Also deleted 20 gemma runs with 403 auth errors.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>e59ff443edb659c9d21d3fef8d708bd29176f827</id>
<published>2026-04-08T05:09:29Z</published>
<updated>2026-04-08T05:09:29Z</updated>
<title>Add minimax-m2.7 and kimi-k2.5 via OpenRouter</title>
<link rel="alternate" type="text/html" href="commit/e59ff443edb659c9d21d3fef8d708bd29176f827.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e59ff443edb659c9d21d3fef8d708bd29176f827
parent 7a1efd6efd6f2649b557db27eb8af49fd4795d6e
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:09:29 +0200

Add minimax-m2.7 and kimi-k2.5 via OpenRouter

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7a1efd6efd6f2649b557db27eb8af49fd4795d6e</id>
<published>2026-04-08T05:07:45Z</published>
<updated>2026-04-08T05:07:45Z</updated>
<title>Checkpoint: 30 runs (453 total)</title>
<link rel="alternate" type="text/html" href="commit/7a1efd6efd6f2649b557db27eb8af49fd4795d6e.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7a1efd6efd6f2649b557db27eb8af49fd4795d6e
parent 2c88a48235c8cd6f7a422e556594f7c8dbd3ffbe
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:07:45 +0200

Checkpoint: 30 runs (453 total)

</content>
</entry>
<entry>
<id>2c88a48235c8cd6f7a422e556594f7c8dbd3ffbe</id>
<published>2026-04-08T05:06:28Z</published>
<updated>2026-04-08T05:06:28Z</updated>
<title>Checkpoint: 20 runs (433 total)</title>
<link rel="alternate" type="text/html" href="commit/2c88a48235c8cd6f7a422e556594f7c8dbd3ffbe.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 2c88a48235c8cd6f7a422e556594f7c8dbd3ffbe
parent c7095255ad2f40ee04f9d0e3824a1813ec8603fa
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:06:28 +0200

Checkpoint: 20 runs (433 total)

</content>
</entry>
<entry>
<id>c7095255ad2f40ee04f9d0e3824a1813ec8603fa</id>
<published>2026-04-08T05:05:18Z</published>
<updated>2026-04-08T05:05:18Z</updated>
<title>Checkpoint: 10 runs (433 total)</title>
<link rel="alternate" type="text/html" href="commit/c7095255ad2f40ee04f9d0e3824a1813ec8603fa.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit c7095255ad2f40ee04f9d0e3824a1813ec8603fa
parent 858f9ae354cd621f1969c4ad6c188e41a297451c
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 07:05:18 +0200

Checkpoint: 10 runs (433 total)

</content>
</entry>
<entry>
<id>858f9ae354cd621f1969c4ad6c188e41a297451c</id>
<published>2026-04-08T04:59:26Z</published>
<updated>2026-04-08T04:59:26Z</updated>
<title>Analyze and push 393 runs</title>
<link rel="alternate" type="text/html" href="commit/858f9ae354cd621f1969c4ad6c188e41a297451c.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 858f9ae354cd621f1969c4ad6c188e41a297451c
parent ce57e6ee85e544459288916a5b5c147fb83db69d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 06:59:26 +0200

Analyze and push 393 runs

</content>
</entry>
<entry>
<id>ce57e6ee85e544459288916a5b5c147fb83db69d</id>
<published>2026-04-07T22:11:37Z</published>
<updated>2026-04-07T22:11:37Z</updated>
<title>Checkpoint: 10 runs (396 total)</title>
<link rel="alternate" type="text/html" href="commit/ce57e6ee85e544459288916a5b5c147fb83db69d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit ce57e6ee85e544459288916a5b5c147fb83db69d
parent 1e69260f07274dca928db476f7e78700b94c472a
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed,  8 Apr 2026 00:11:37 +0200

Checkpoint: 10 runs (396 total)

</content>
</entry>
<entry>
<id>1e69260f07274dca928db476f7e78700b94c472a</id>
<published>2026-04-07T21:30:26Z</published>
<updated>2026-04-07T21:30:26Z</updated>
<title>Add 21 new runs (394 total)</title>
<link rel="alternate" type="text/html" href="commit/1e69260f07274dca928db476f7e78700b94c472a.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1e69260f07274dca928db476f7e78700b94c472a
parent a5c7df1a95000d3e881f09d29479b93d4169e27b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue,  7 Apr 2026 23:30:26 +0200

Add 21 new runs (394 total)

Profile: main_effects
Completed: 21 | Skipped: 13 | Failed: 7

</content>
</entry>
<entry>
<id>a5c7df1a95000d3e881f09d29479b93d4169e27b</id>
<published>2026-04-07T21:12:31Z</published>
<updated>2026-04-07T21:12:31Z</updated>
<title>Checkpoint: 20 runs (393 total)</title>
<link rel="alternate" type="text/html" href="commit/a5c7df1a95000d3e881f09d29479b93d4169e27b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a5c7df1a95000d3e881f09d29479b93d4169e27b
parent 91f15a4a0a9b76a012f13ca069eec988584d5f99
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue,  7 Apr 2026 23:12:31 +0200

Checkpoint: 20 runs (393 total)

</content>
</entry>
<entry>
<id>91f15a4a0a9b76a012f13ca069eec988584d5f99</id>
<published>2026-04-07T20:03:59Z</published>
<updated>2026-04-07T20:03:59Z</updated>
<title>Add 33 new runs (373 total)</title>
<link rel="alternate" type="text/html" href="commit/91f15a4a0a9b76a012f13ca069eec988584d5f99.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 91f15a4a0a9b76a012f13ca069eec988584d5f99
parent f078feba3d37071568e8638b9514b41e90605d84
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue,  7 Apr 2026 22:03:59 +0200

Add 33 new runs (373 total)

Profile: main_effects
Completed: 33 | Skipped: 1 | Failed: 6

</content>
</entry>
<entry>
<id>f078feba3d37071568e8638b9514b41e90605d84</id>
<published>2026-04-07T20:03:09Z</published>
<updated>2026-04-07T20:03:09Z</updated>
<title>Checkpoint: 30 runs (373 total)</title>
<link rel="alternate" type="text/html" href="commit/f078feba3d37071568e8638b9514b41e90605d84.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit f078feba3d37071568e8638b9514b41e90605d84
parent 08782c8c43bf0df2b7e5287bab5753f9bd7cee24
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue,  7 Apr 2026 22:03:09 +0200

Checkpoint: 30 runs (373 total)

</content>
</entry>
<entry>
<id>08782c8c43bf0df2b7e5287bab5753f9bd7cee24</id>
<published>2026-04-07T20:02:08Z</published>
<updated>2026-04-07T20:02:08Z</updated>
<title>Checkpoint: 20 runs (365 total)</title>
<link rel="alternate" type="text/html" href="commit/08782c8c43bf0df2b7e5287bab5753f9bd7cee24.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 08782c8c43bf0df2b7e5287bab5753f9bd7cee24
parent d42c3a6342b91bcd6a46a99a86a26d7039dbb749
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue,  7 Apr 2026 22:02:08 +0200

Checkpoint: 20 runs (365 total)

</content>
</entry>
</feed>
