<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>ai-research-survey, branch HEAD</title>
<subtitle>Systematic scan of agentic development research. What&#39;s signal, what&#39;s noise.
</subtitle>
<entry>
<id>4689504f291925f272e9d12a2ec4b69941989d18</id>
<published>2026-04-14T21:31:00Z</published>
<updated>2026-04-14T21:31:00Z</updated>
<title>calibration: ship learned weights (20 anchors, v1 fit)</title>
<link rel="alternate" type="text/html" href="commit/4689504f291925f272e9d12a2ec4b69941989d18.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4689504f291925f272e9d12a2ec4b69941989d18
parent 56999c964a569d3e4507b6766777b41e11f76fdd
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 23:31:00 +0200

calibration: ship learned weights (20 anchors, v1 fit)

Weights learned via scripts/calibration/fit-weights.py against a 20-
anchor labeled set spanning 6 bad / 12 good / 2 middling papers, with
16 pairwise ordering constraints.

Results:
- 15/20 anchors in band; all 16 pair constraints satisfied
- Wakefield 45.7 -&gt; 32.3 (structural ceiling; see below)
- Attention 52.8 -&gt; 58.7
- Corpus median 49.1 -&gt; 54.7, mean 48.1 -&gt; 54.4, distribution widens

Weight story: claims_and_evidence (1.71), setup_transparency (1.08),
conflicts_of_interest (0.61), artifacts (0.51) dominate. Five
categories zero out: statistical_methodology, data_integrity,
human_studies, experimental_rigor, survey_methodology.

The zeros aren&#39;t a fit defect. Wakefield passes the surface-compliance
questions in those categories (IRB disclosed, contemporary controls,
bowel histology), all of which a fraudulent case series can satisfy while
lying about the data. The only way the optimizer can simultaneously
respect pair ordering and the Wakefield band is to down-weight those
surface-compliance categories to zero. That&#39;s a rubric structural
limit, not a weighting problem; a future iteration should add
fraud-adjacent questions (effect-size plausibility, COI magnitude,
extraordinary-evidence thresholds). Until then, this is the best
weight vector the current rubric can produce.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>56999c964a569d3e4507b6766777b41e11f76fdd</id>
<published>2026-04-14T21:11:14Z</published>
<updated>2026-04-14T21:11:14Z</updated>
<title>calibration: pairwise weight fitting against labeled anchors</title>
<link rel="alternate" type="text/html" href="commit/56999c964a569d3e4507b6766777b41e11f76fdd.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 56999c964a569d3e4507b6766777b41e11f76fdd
parent 5ad6af87a22aa18f92dac25f1979ae94c66367bb
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 23:11:14 +0200

calibration: pairwise weight fitting against labeled anchors

Scaffolding for learning per-category rubric weights from a small set
of hand-labeled anchor papers. Keeps uniform flat-question averaging
as the default behavior; opts into learned weights only when
scripts/calibration/weights.json exists.
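
A minimal sketch of that opt-in behavior. The function names come from
this commit; the bodies, the flat category-to-weight JSON shape, and
the (passed, applicable) input format are assumptions for illustration.

```python
import json
from pathlib import Path

WEIGHTS_PATH = Path("scripts/calibration/weights.json")


def load_category_weights(path=WEIGHTS_PATH):
    """Return the learned weights dict, or None when the file is absent."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


def compute_overall_score(categories, category_weights=None):
    """categories maps a category name to (passed, applicable) counts.

    No weights: flat average over every applicable question.
    Weights: weighted mean of per-category pass rates (assumed form).
    """
    scored = {c: pa for c, pa in categories.items() if pa[1]}
    if not scored:
        return None
    if category_weights is None:
        passed = sum(p for p, _ in scored.values())
        applicable = sum(a for _, a in scored.values())
        return 100.0 * passed / applicable
    total = sum(category_weights.get(c, 1.0) for c in scored)
    if total == 0:
        return None
    weighted = sum(
        category_weights.get(c, 1.0) * p / a for c, (p, a) in scored.items()
    )
    return 100.0 * weighted / total
```

Passing the result of load_category_weights() straight through keeps
the uniform path as the default whenever weights.json is missing.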

Files:
- scripts/calibration/anchors.yaml: seed set of 8 anchors (Wakefield at
  0-15, Attention/BERT/ReAct/AlphaCode/ARC/BERT-papers at 70-90, meta
  papers Show Your Work / Deep RL that Matters at 80-92). Comments
  mark candidates to add; aim for 15+ anchors before trusting weights.
- scripts/calibration/fit-weights.py: scipy L-BFGS-B fit over
  per-category weights [0-5] with L2 regularization toward uniform and
  a pairwise ordering hinge. Prints per-anchor predicted scores + pair
  separation check, writes weights.json.
- build-explorer-data.py: compute_overall_score accepts optional
  category_weights. load_category_weights reads the JSON if present.

First fit with 8 seed anchors separates Wakefield (7.5) from Attention
(74.7) by 67 points - was 7 points with uniform weights. But the
optimizer zeros several categories at that anchor count, a classic
overfit signal. Add 7-15 more anchors before shipping weights.json.
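
The loss described above can be sketched roughly as follows. This is
not the code in fit-weights.py: the band-miss term, the margin, the
l2 strength, and the data layout are all assumptions, and the sketch
only evaluates the loss (the real script is described as minimizing
it with scipy L-BFGS-B under 0-5 bounds).

```python
def weighted_score(rates, weights):
    """Weighted mean of per-category pass rates, scaled to 0-100."""
    total = sum(weights[c] for c in rates)
    if total == 0:
        return 0.0
    return 100.0 * sum(weights[c] * rates[c] for c in rates) / total


def hinge(x):
    """max(0, x): zero once a constraint holds, linear penalty otherwise."""
    return max(0.0, x)


def objective(weights, anchors, pairs, margin=5.0, l2=0.1):
    """Band misses + pairwise ordering hinge + L2 pull toward uniform (1.0)."""
    scores = {n: weighted_score(a["rates"], weights) for n, a in anchors.items()}
    loss = 0.0
    for name, anchor in anchors.items():
        lo, hi = anchor["band"]
        loss += hinge(lo - scores[name]) + hinge(scores[name] - hi)
    for better, worse in pairs:
        # Penalize unless "better" outscores "worse" by at least the margin.
        loss += hinge(margin - (scores[better] - scores[worse]))
    loss += l2 * sum((w - 1.0) ** 2 for w in weights.values())
    return loss


# Toy anchors (made-up pass rates, not corpus data).
anchors = {
    "wakefield": {"rates": {"claims": 0.1, "surface": 0.9}, "band": (0, 15)},
    "attention": {"rates": {"claims": 0.9, "surface": 0.6}, "band": (70, 90)},
}
pairs = [("attention", "wakefield")]
uniform = {"claims": 1.0, "surface": 1.0}
zeroed = {"claims": 2.0, "surface": 0.0}
print(objective(uniform, anchors, pairs), objective(zeroed, anchors, pairs))
```

In the toy data, zeroing the category the bad anchor passes lowers the
loss, which is the same pressure that zeros categories in the real fit.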

weights.json is intentionally not committed in this PR; treat it as a
deliverable Brian generates after labeling enough anchors.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>5ad6af87a22aa18f92dac25f1979ae94c66367bb</id>
<published>2026-04-14T20:34:32Z</published>
<updated>2026-04-14T20:34:32Z</updated>
<title>partition benchmark-eval + tag Attention as reference-benchmark</title>
<link rel="alternate" type="text/html" href="commit/5ad6af87a22aa18f92dac25f1979ae94c66367bb.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 5ad6af87a22aa18f92dac25f1979ae94c66367bb
parent 47067ff2e58055add7db030be4ee682e0cd01f43
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 22:34:32 +0200

partition benchmark-eval + tag Attention as reference-benchmark

Two things:

1. attention-is-all-you-need-2017 now tagged reference-benchmark
   (keeps &#39;landmark&#39; too). Foundational transformer paper used as a
   rubric anchor like Wakefield and Ioannidis.

2. Papers tagged benchmark-eval are now partitioned from aggregates too.
   Rationale: they introduce benchmarks used BY the field, so they&#39;re
   reference material rather than subjects of the same kind of rubric
   evaluation. 5 papers affected.

Output: new benchmarks.json alongside calibration.json.

Effect on dashboard: n = 1530 -&gt; 1524, median unchanged at 49.1.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>47067ff2e58055add7db030be4ee682e0cd01f43</id>
<published>2026-04-14T20:24:55Z</published>
<updated>2026-04-14T20:24:55Z</updated>
<title>partition calibration (reference-benchmark) specimens</title>
<link rel="alternate" type="text/html" href="commit/47067ff2e58055add7db030be4ee682e0cd01f43.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 47067ff2e58055add7db030be4ee682e0cd01f43
parent 4b8436506afa1c261f8cd6e046caa136ce386732
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 22:24:55 +0200

partition calibration (reference-benchmark) specimens

Registry entries tagged &quot;reference-benchmark&quot; (currently Wakefield 1998
and Ioannidis 2005, only the first scanned) now skip the agentic-AI
corpus aggregates entirely. They still get per-paper scoring, still get
individual papers/{slug}.json written (so detail pages work), but they
no longer contribute to:

- total_papers / dash.n
- median / mean / full_reproducibility_pct
- histogram, category_rates, year_trends, tag_counts
- archetype_counts, game_counts / game_pcts
- venue_scores, citation_band_scores, funding_groups
- tensions (claim classification)
- papers-index.json (hidden from the papers explorer)

Effect: n = 1530 (was 1531), median unchanged at 49.1, full_repro
4.0 -&gt; 4.1 (Wakefield&#39;s 0% full-reproducibility weight removed).
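
The partition step can be sketched like this. The tag name is from
this commit; the function name, the "tags" field, and the slugs in the
example are hypothetical.

```python
CALIBRATION_TAG = "reference-benchmark"


def partition_corpus(papers):
    """Split registry entries into (corpus, calibration) by tag.

    Calibration specimens keep their per-paper detail but are left out
    of every aggregate computed from the corpus list.
    """
    corpus, calibration = [], []
    for paper in papers:
        if CALIBRATION_TAG in paper.get("tags", []):
            calibration.append(paper)
        else:
            corpus.append(paper)
    return corpus, calibration


papers = [
    {"slug": "example-agent-paper", "tags": ["agents"]},
    {"slug": "wakefield-1998", "tags": ["reference-benchmark"]},
]
corpus, calibration = partition_corpus(papers)
print(len(corpus), len(calibration))
```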

New output: calibration.json listing the calibration specimens with
their full detail + a calibration_notes field carrying the registry
notes so the consumer can explain each specimen&#39;s purpose.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4b8436506afa1c261f8cd6e046caa136ce386732</id>
<published>2026-04-14T19:45:34Z</published>
<updated>2026-04-14T19:45:34Z</updated>
<title>stats: include v1 scans with graceful degradation</title>
<link rel="alternate" type="text/html" href="commit/4b8436506afa1c261f8cd6e046caa136ce386732.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4b8436506afa1c261f8cd6e046caa136ce386732
parent 06cbf721cea34bee65e068cd7363caff35325a3b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 14 Apr 2026 21:45:34 +0200

stats: include v1 scans with graceful degradation

The scan_version &lt; 2 filter was excluding 558 papers (~28% of the
scanned corpus). Inspection showed the v1 rubric is a proper subset
of v2+: 50 identical questions across 11 identical categories, zero
dropped or changed. The v2+ additions (proxy_outcome_distinction +
data_leakage + experimental_rigor + survey_methodology = 17 questions
in one new field + 3 new conditional modules) are purely additive.

compute_overall_score already uses passed/applicable over present
questions, so v1 papers degrade gracefully: their 50 applicable
questions are scored normally and the 7 v2+-only questions are
treated as absent. classify_archetype only touches categories in
the shared 11. detect_games only references questions in the
shared 11. No scoring bias introduced.
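
A sketch of why passed/applicable over present questions degrades
gracefully. The function name and the "applicable"/"passed" field
names are assumptions, not the actual compute_overall_score code.

```python
def score_present_questions(answers):
    """Percent passed among applicable questions present in the scan.

    answers maps a question id to {"applicable": bool, "passed": bool}.
    A question missing from the scan (for example a v2+-only question
    on a v1 scan) touches neither the numerator nor the denominator.
    """
    applicable = [a for a in answers.values() if a.get("applicable")]
    if not applicable:
        return None  # mirrors the "no applicable questions" exclusion
    passed = sum(1 for a in applicable if a.get("passed"))
    return 100.0 * passed / len(applicable)


v1_scan = {
    "q1": {"applicable": True, "passed": True},
    "q2": {"applicable": True, "passed": False},
}
# Same paper, with an extra v2+-only question that does not apply.
v2_scan = dict(v1_scan, q_new={"applicable": False, "passed": False})
print(score_present_questions(v1_scan), score_present_questions(v2_scan))
```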

Effect: n rises from 1,047 to 1,531 (+484 v1 papers that had
scorable data; 74 more v1 scans still excluded via the
&quot;no applicable questions&quot; check). Median moves 47.2 -&gt; 49.1,
all game_pcts within 2 points of prior values.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>06cbf721cea34bee65e068cd7363caff35325a3b</id>
<published>2026-04-13T13:51:42Z</published>
<updated>2026-04-13T13:51:42Z</updated>
<title>CI: trigger deploy on scan-v5.json changes</title>
<link rel="alternate" type="text/html" href="commit/06cbf721cea34bee65e068cd7363caff35325a3b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 06cbf721cea34bee65e068cd7363caff35325a3b
parent 1829fbe2bf4e383bc57aa9285593d195c5e53838
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 15:51:42 +0200

CI: trigger deploy on scan-v5.json changes

The v5 Haiku scans weren&#39;t triggering rebuilds because
the path filter only matched papers/*/scan.json.
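
The miss is easy to reproduce with a glob check. Python fnmatch
stands in for the CI path filter here, whose glob rules may differ,
and the second pattern in NEW_FILTERS is an assumed form of the fix.

```python
from fnmatch import fnmatchcase

OLD_FILTERS = ["papers/*/scan.json"]
NEW_FILTERS = ["papers/*/scan.json", "papers/*/scan-v5.json"]  # assumed


def triggers_deploy(changed_path, patterns):
    """True when a changed file matches any path-filter pattern."""
    return any(fnmatchcase(changed_path, p) for p in patterns)


changed = "papers/example-paper/scan-v5.json"  # hypothetical slug
print(triggers_deploy(changed, OLD_FILTERS))
print(triggers_deploy(changed, NEW_FILTERS))
```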

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>1829fbe2bf4e383bc57aa9285593d195c5e53838</id>
<published>2026-04-13T13:24:21Z</published>
<updated>2026-04-13T13:24:21Z</updated>
<title>V5 Haiku sweep: 531 papers scanned (280 new)</title>
<link rel="alternate" type="text/html" href="commit/1829fbe2bf4e383bc57aa9285593d195c5e53838.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1829fbe2bf4e383bc57aa9285593d195c5e53838
parent 7325e2836dacec068ecd0eebfdca0c729a902baf
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 13 Apr 2026 15:24:21 +0200

V5 Haiku sweep: 531 papers scanned (280 new)

20 failures: 15 claude exit 1, 3 no JSON, 1 JSON parse error, 1 mixed.
Can retry failures separately.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7325e2836dacec068ecd0eebfdca0c729a902baf</id>
<published>2026-04-12T17:49:54Z</published>
<updated>2026-04-12T17:49:54Z</updated>
<title>V5 Haiku sweep: 220 papers scanned</title>
<link rel="alternate" type="text/html" href="commit/7325e2836dacec068ecd0eebfdca0c729a902baf.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7325e2836dacec068ecd0eebfdca0c729a902baf
parent eb6c3464af535659c94124df71ec8abd0e9a2ab3
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 19:49:54 +0200

V5 Haiku sweep: 220 papers scanned

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>eb6c3464af535659c94124df71ec8abd0e9a2ab3</id>
<published>2026-04-12T16:10:46Z</published>
<updated>2026-04-12T16:10:46Z</updated>
<title>Progress bar: 4-segment v5 pipeline, 106 pure Haiku scans</title>
<link rel="alternate" type="text/html" href="commit/eb6c3464af535659c94124df71ec8abd0e9a2ab3.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit eb6c3464af535659c94124df71ec8abd0e9a2ab3
parent 375564a74735195015b853fe4ec2af98ff6e4fa0
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 12 Apr 2026 18:10:46 +0200

Progress bar: 4-segment v5 pipeline, 106 pure Haiku scans

Replace old v3/v2/v1/queued/no-text progress bar with:
V5 Opus | V5 Haiku/Sonnet | Deprecated | Not scanned

Build pipeline counts scan-v5.json (checking source field for
opus vs haiku) and falls back to scan.json as deprecated.
No cascade loading yet — metrics still read from old scan.json.
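
A rough sketch of the counting rule. It assumes a top-level "source"
string in scan-v5.json; the function name is hypothetical and the
real build pipeline may read the field differently.

```python
import json
from collections import Counter
from pathlib import Path


def progress_segment(paper_dir):
    """Classify one paper directory into a progress-bar segment."""
    paper_dir = Path(paper_dir)
    v5 = paper_dir / "scan-v5.json"
    if v5.exists():
        source = json.loads(v5.read_text()).get("source", "")
        return "V5 Opus" if source == "opus" else "V5 Haiku/Sonnet"
    if (paper_dir / "scan.json").exists():
        return "Deprecated"  # old scan only
    return "Not scanned"


def count_segments(papers_root):
    """Tally segments over every paper directory under papers_root."""
    dirs = [p for p in Path(papers_root).iterdir() if p.is_dir()]
    return Counter(progress_segment(p) for p in dirs)
```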

Also: v5 script cosmetic fixes (v4→v5 references) and stderr
capture on claude failures for better error diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>375564a74735195015b853fe4ec2af98ff6e4fa0</id>
<published>2026-04-11T05:45:46Z</published>
<updated>2026-04-11T05:45:46Z</updated>
<title>Add run-scan-v5-haiku.py: pure Haiku output, no Opus merge</title>
<link rel="alternate" type="text/html" href="commit/375564a74735195015b853fe4ec2af98ff6e4fa0.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 375564a74735195015b853fe4ec2af98ff6e4fa0
parent 450388ee71cfe1d95d028029358c8007afb36ba5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 07:45:46 +0200

Add run-scan-v5-haiku.py: pure Haiku output, no Opus merge

v4-haiku script merged Opus answers at write time, contaminating
scan-v4.json with v2 Opus overwrites. v5 script writes raw
Haiku/Sonnet output to scan-v5.json so per-question Haiku-Opus
comparisons remain possible for calibration analysis.

The build pipeline will handle the Opus/Haiku merge at read time,
preferring Opus where available but keeping the raw v5 data around.
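
One way the read-time merge could look. The function name and the
question-id-to-answer dict shape are assumptions; the point is that
both inputs stay untouched on disk and in memory.

```python
def merge_at_read_time(v5_answers, opus_answers):
    """Overlay Opus answers on raw v5 answers without mutating either.

    Both arguments map a question id to an answer. The result records
    which source supplied each answer.
    """
    merged = {}
    for qid, answer in v5_answers.items():
        merged[qid] = {"answer": answer, "source": "haiku"}
    for qid, answer in opus_answers.items():
        merged[qid] = {"answer": answer, "source": "opus"}  # Opus wins
    return merged


raw_v5 = {"q1": True, "q2": False}
opus = {"q2": True}
merged = merge_at_read_time(raw_v5, opus)
print(merged["q2"], raw_v5["q2"])
```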

Usage: python3 scripts/run-scan-v5-haiku.py --parallel 8

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>450388ee71cfe1d95d028029358c8007afb36ba5</id>
<published>2026-04-11T05:31:03Z</published>
<updated>2026-04-11T05:31:03Z</updated>
<title>V4 Haiku sweep: 241 new papers (263 total)</title>
<link rel="alternate" type="text/html" href="commit/450388ee71cfe1d95d028029358c8007afb36ba5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 450388ee71cfe1d95d028029358c8007afb36ba5
parent c9f58bde8535e444a35685593af2b9b4b2f9d55b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 11 Apr 2026 07:31:03 +0200

V4 Haiku sweep: 241 new papers (263 total)

Run: python3 scripts/run-scan-v4-haiku.py --parallel 8
Merged with existing Opus v2/v3 answers where available.

12,347 Opus overrides (92%), 1,806 Haiku answers (14%), 0 Sonnet.
The Haiku answers are on the new v4 questions (scope_and_framing,
type-specific modules) where no v2 Opus answer existed.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>c9f58bde8535e444a35685593af2b9b4b2f9d55b</id>
<published>2026-03-31T06:40:03Z</published>
<updated>2026-03-31T06:40:03Z</updated>
<title>V4 Haiku scan pipeline: type-routed instrument with Opus overlay</title>
<link rel="alternate" type="text/html" href="commit/c9f58bde8535e444a35685593af2b9b4b2f9d55b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit c9f58bde8535e444a35685593af2b9b4b2f9d55b
parent 736a50a032a47708cf7293b93076df2b494eb27b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 31 Mar 2026 08:40:03 +0200

V4 Haiku scan pipeline: type-routed instrument with Opus overlay

New schema (scan-v4.schema.json): shared core (15q) + 5 type-specific
modules (empirical 39q, benchmark 12q, survey 12q, position 12q,
theoretical 10q). Two-field boolean design preserved.

New script (run-scan-v4-haiku.py):
- Haiku for papers &lt;50K chars, auto-fallback to Sonnet for larger
- Reads paper_type.json for routing
- Merges existing Opus v2/v3 answers (Opus always overrides)
- Tracks source per question (opus/haiku/sonnet)
- Free calibration: reports Haiku-Opus agreement rate
- Fetches HN data inline
- Writes scan-v4.json (separate from scan.json)
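
The routing and agreement steps above can be sketched as follows.
Names are hypothetical, 50K is read as 50,000 characters, and the
agreement definition (matching answers over questions both models
answered) is an assumption about what the script reports.

```python
from operator import lt

HAIKU_CHAR_LIMIT = 50_000  # "under 50K chars", read as 50,000


def pick_model(paper_text):
    """Haiku below the character limit, Sonnet otherwise."""
    return "haiku" if lt(len(paper_text), HAIKU_CHAR_LIMIT) else "sonnet"


def agreement_rate(cheap_answers, opus_answers):
    """Share of shared questions where the cheap model matches Opus."""
    shared = set(cheap_answers).intersection(opus_answers)
    if not shared:
        return None
    matches = sum(cheap_answers[q] == opus_answers[q] for q in shared)
    return matches / len(shared)


cheap = {"q1": True, "q2": False, "q3": True, "q4": True, "q5": False}
opus = {"q1": True, "q2": False, "q3": False, "q4": True}
print(pick_model("x" * 1000), agreement_rate(cheap, opus))
```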

Tested:
- Tao (position): 12 opus + 15 haiku. 75% agreement. Position module
  shows 7/7 argument quality, 1/5 clarity. Much better than v2&#39;s 10%.
- METR (empirical): 51 opus + 3 sonnet. 86.3% agreement.

Run: python3 scripts/run-scan-v4-haiku.py --parallel 8

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>736a50a032a47708cf7293b93076df2b494eb27b</id>
<published>2026-03-30T18:17:11Z</published>
<updated>2026-03-30T18:17:11Z</updated>
<title>Classify all 1,206 papers by type via Haiku</title>
<link rel="alternate" type="text/html" href="commit/736a50a032a47708cf7293b93076df2b494eb27b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 736a50a032a47708cf7293b93076df2b494eb27b
parent fbc3c552e124c8c6c91d532e531bbc6f81f4d957
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 30 Mar 2026 20:17:11 +0200

Classify all 1,206 papers by type via Haiku

Distribution:
  empirical           816 (69%)
  benchmark-creation  155 (13%)
  survey              102 (9%)
  position             63 (5%)
  theoretical          49 (4%)

Spot-checked: Tao→position, METR→empirical, SWE-Bench→benchmark-creation,
HalluLens→benchmark-creation, multi-agent survey→survey. All correct.

Separate paper_type.json files, non-destructive to scan data.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>fbc3c552e124c8c6c91d532e531bbc6f81f4d957</id>
<published>2026-03-30T14:40:40Z</published>
<updated>2026-03-30T14:40:40Z</updated>
<title>Add Haiku paper type classification script (preliminary)</title>
<link rel="alternate" type="text/html" href="commit/fbc3c552e124c8c6c91d532e531bbc6f81f4d957.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit fbc3c552e124c8c6c91d532e531bbc6f81f4d957
parent 95f484d01c4aded0fbdb7faed0aa7f17b69da21b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 30 Mar 2026 16:40:40 +0200

Add Haiku paper type classification script (preliminary)

scripts/classify-paper-type.py classifies papers into 5 types:
empirical, benchmark-creation, survey, position, theoretical.

Uses Haiku (cheap, fast) reading title + key_findings + tags from
existing scan.json. Writes papers/{slug}/paper_type.json as a
separate non-destructive file.

20/20 correct on manual verification. Running full corpus in
background. This is preliminary — classification feeds into v4
instrument redesign where each type gets its own question panel.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>95f484d01c4aded0fbdb7faed0aa7f17b69da21b</id>
<published>2026-03-30T14:10:48Z</published>
<updated>2026-03-30T14:10:48Z</updated>
<title>Filter non-empirical papers from findings, tag in UI</title>
<link rel="alternate" type="text/html" href="commit/95f484d01c4aded0fbdb7faed0aa7f17b69da21b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 95f484d01c4aded0fbdb7faed0aa7f17b69da21b
parent b4f6f0caa07a8a5d8d382792a236646c772d9b4b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 30 Mar 2026 16:10:48 +0200

Filter non-empirical papers from findings, tag in UI

Papers without both statistical_methodology and evaluation_design
applicable are classified as non-empirical (159 papers). These are
excluded from all findings aggregations (median, games, tensions,
correlations, year trends). Dashboard now reports empirical-only:
1,047 papers, median 47.2% (was 46.3% mixed).

Non-empirical papers still appear in the papers browser with:
- Score shown in gray with asterisk (e.g., &quot;10.0%*&quot;)
- &quot;Non-empirical&quot; badge instead of archetype
- Tooltip explaining limited criteria

Progress bar shows empirical/non-empirical split.
Paper detail pages still show full checklist for all papers.

This is a stopgap before v4 instrument redesign that will add
paper-type-specific question panels.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>b4f6f0caa07a8a5d8d382792a236646c772d9b4b</id>
<published>2026-03-29T14:13:53Z</published>
<updated>2026-03-29T14:13:53Z</updated>
<title>Update to 1205 scans, new sampling checkpoint, refresh memory</title>
<link rel="alternate" type="text/html" href="commit/b4f6f0caa07a8a5d8d382792a236646c772d9b4b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit b4f6f0caa07a8a5d8d382792a236646c772d9b4b
parent 208801951bbe904415b4a651bd792bc95c8f9241
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 29 Mar 2026 16:13:53 +0200

Update to 1205 scans, new sampling checkpoint, refresh memory

Sampling: n=932→47.1%, n=1205→46.3%. Decline continues — 2025
papers (n=595) have lowest median at 44.9%.

At n=1205:
- Games: Overclaiming 65.4%, Big Numbers 65.3% (both rising)
- Quality contagion widened: 37.2%→50.0% (13pp gradient)
- Funding gap stable at 11.5pp
- Network: 1359 nodes, 4512 edges
- 492 code URLs extracted
- 6 tensions with 3600+ claims

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>208801951bbe904415b4a651bd792bc95c8f9241</id>
<published>2026-03-24T05:51:58Z</published>
<updated>2026-03-24T05:51:58Z</updated>
<title>Add explanatory descriptions to each tension section</title>
<link rel="alternate" type="text/html" href="commit/208801951bbe904415b4a651bd792bc95c8f9241.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 208801951bbe904415b4a651bd792bc95c8f9241
parent ddde6369343ce6a1c7129bb5e8093318815ae2a7
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:51:58 +0100

Add explanatory descriptions to each tension section

Each tension now has 2 sentences below the title explaining what the
tension is and why it matters. E.g., Security Arms Race: &quot;Defense
papers claim their mitigations work; attack papers show they can be
bypassed. Neither side engages seriously with the other.&quot;

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>ddde6369343ce6a1c7129bb5e8093318815ae2a7</id>
<published>2026-03-24T05:49:13Z</published>
<updated>2026-03-24T05:49:13Z</updated>
<title>Add 3 new tensions, expand keyword matching for existing 3</title>
<link rel="alternate" type="text/html" href="commit/ddde6369343ce6a1c7129bb5e8093318815ae2a7.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit ddde6369343ce6a1c7129bb5e8093318815ae2a7
parent 372ecdeaa4a2719d071db645776f137c093a7bc5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:49:13 +0100

Add 3 new tensions, expand keyword matching for existing 3

New tensions:
- Security Arms Race: 379 defense vs 546 attack claims
- Code Quality Paradox: 363 LLMs-help vs 251 LLMs-hurt
- Scaling Debate: 152 efficient vs 452 limits

Expanded keywords for existing tensions:
- Benchmarks: added pass@, accuracy, f1, performance on, sota (103→315 pos)
- Agents: added agentic, tool use, planning, chain-of-thought (72→110 pos)
- Productivity: added developer productivity, coding efficiency (minor)
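
A sketch of keyword-based claim matching, assuming case-insensitive
substring tests. Only the keywords named above are listed; the full
lists and the real matching rule live in the build pipeline.

```python
BENCHMARK_KEYWORDS = ["pass@", "accuracy", "f1", "performance on", "sota"]
AGENT_KEYWORDS = ["agentic", "tool use", "planning", "chain-of-thought"]


def matches_tension(claim_text, keywords):
    """True when any keyword occurs in the lowercased claim text."""
    text = claim_text.lower()
    return any(keyword in text for keyword in keywords)


claim = "Our agentic pipeline reaches SOTA pass@1 on the benchmark."
print(matches_tension(claim, BENCHMARK_KEYWORDS))
print(matches_tension(claim, AGENT_KEYWORDS))
```

Substring matching is broad by design: a short keyword such as f1 can
match inside unrelated tokens, which is one way coverage grows.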

Each tension gets a butterfly chart with bar width encoding.
Total claim coverage: 858→3600 (from 5115 total claims).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>372ecdeaa4a2719d071db645776f137c093a7bc5</id>
<published>2026-03-24T05:43:14Z</published>
<updated>2026-03-24T05:43:14Z</updated>
<title>Use bar width for methodology score, not border thickness</title>
<link rel="alternate" type="text/html" href="commit/372ecdeaa4a2719d071db645776f137c093a7bc5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 372ecdeaa4a2719d071db645776f137c093a7bc5
parent e794391f9bd6ec7b3ed6676a0738aecb985100dc
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:43:14 +0100

Use bar width for methodology score, not border thickness

Height = claim count, width = methodology score. A tall narrow bar
means many claims from weak papers. A short wide bar means few
claims from rigorous papers. Solid fills, no border tricks.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>e794391f9bd6ec7b3ed6676a0738aecb985100dc</id>
<published>2026-03-24T05:40:47Z</published>
<updated>2026-03-24T05:40:47Z</updated>
<title>Replace dot chart with border-thickness encoding for tension quality</title>
<link rel="alternate" type="text/html" href="commit/e794391f9bd6ec7b3ed6676a0738aecb985100dc.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit e794391f9bd6ec7b3ed6676a0738aecb985100dc
parent 48c57f7f304597d6144168e70d4abea8de51fb65
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:40:47 +0100

Replace dot chart with border-thickness encoding for tension quality

Bars are now lightly filled outlines where border thickness encodes
methodology score: 1px at 20% → 5px at 70%. A tall thin-bordered
bar = many claims from weak papers. A short thick-bordered bar =
few claims from rigorous papers. Score shown as text label.
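
The 1px-to-5px mapping, sketched as a clamped linear interpolation.
The endpoints are from this commit; linearity, clamping outside
20-70%, and the function name are assumptions.

```python
def border_px(score_pct, low=(20.0, 1.0), high=(70.0, 5.0)):
    """Map a methodology score (percent) to a border width in pixels."""
    (s0, px0), (s1, px1) = low, high
    t = (score_pct - s0) / (s1 - s0)
    t = max(0.0, min(1.0, t))  # clamp scores outside the 20-70 range
    return px0 + t * (px1 - px0)


for score in (10, 20, 45, 70, 90):
    print(score, border_px(score))
```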

Cleaner than dots (which were cramped) and shades (which were
imperceptible).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>48c57f7f304597d6144168e70d4abea8de51fb65</id>
<published>2026-03-24T05:35:47Z</published>
<updated>2026-03-24T05:35:47Z</updated>
<title>Replace shade gradient with bar+dot chart on tensions</title>
<link rel="alternate" type="text/html" href="commit/48c57f7f304597d6144168e70d4abea8de51fb65.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 48c57f7f304597d6144168e70d4abea8de51fb65
parent 3776fd8528b73d85622c4a27e63d7a6e653b67c9
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:35:47 +0100

Replace shade gradient with bar+dot chart on tensions

Bars show claim count (flat blue up, flat gray down).
Dots show mean methodology score as position on a mini x-axis
per year column, with score number inside each dot.
Two independent visual channels — no color interpretation needed.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3776fd8528b73d85622c4a27e63d7a6e653b67c9</id>
<published>2026-03-24T05:31:56Z</published>
<updated>2026-03-24T05:31:56Z</updated>
<title>Add count scale ticks and clearer axis description to tension charts</title>
<link rel="alternate" type="text/html" href="commit/3776fd8528b73d85622c4a27e63d7a6e653b67c9.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3776fd8528b73d85622c4a27e63d7a6e653b67c9
parent 8667f4fa26aaf394db5fbd7bf771dbcdeffe505e
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:31:56 +0100

Add count scale ticks and clearer axis description to tension charts

Left edge now shows numeric count ticks (midpoint and max) for both
positive and nuanced sides. Description text uses bold labels:
Height = count, Darkness = methodology score.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>8667f4fa26aaf394db5fbd7bf771dbcdeffe505e</id>
<published>2026-03-24T05:29:17Z</published>
<updated>2026-03-24T05:29:17Z</updated>
<title>Tension butterfly: single-hue gradient + quality-weighted balance line</title>
<link rel="alternate" type="text/html" href="commit/8667f4fa26aaf394db5fbd7bf771dbcdeffe505e.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 8667f4fa26aaf394db5fbd7bf771dbcdeffe505e
parent fdf7bf9ec122ae8104c86d45c5563560fafefd4f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:29:17 +0100

Tension butterfly: single-hue gradient + quality-weighted balance line

Bars use single-hue intensity gradients instead of color categories:
- Positive claims: light-to-dark blue (pale = weak methodology)
- Nuanced claims: light-to-dark gray
Works in grayscale and for colorblind viewers.

Added dashed balance line across years: quality-weighted center of
gravity between positive and nuanced claims. Above center line means
optimism dominates (weighted by methodology), below means skepticism.
Shows how each tension&#39;s balance shifts over time.
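
One possible reading of the balance line, as a sketch only. The
formula below (methodology-score sums on each side, normalized to
the range -1 to 1) is an assumption and may not match the chart code.

```python
def balance(positive_scores, nuanced_scores):
    """Quality-weighted balance for one year, in the range -1 to 1.

    Each list holds one methodology score per claim, so a side gains
    weight from both its claim count and its claim quality. Positive
    values mean optimism dominates, negative values mean skepticism.
    """
    pos = sum(positive_scores)
    neg = sum(nuanced_scores)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


# Two positive claims from stronger papers vs one nuanced claim.
print(balance([60, 60], [40]))
```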

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>fdf7bf9ec122ae8104c86d45c5563560fafefd4f</id>
<published>2026-03-24T05:25:59Z</published>
<updated>2026-03-24T05:25:59Z</updated>
<title>Add butterfly timeline charts to tensions view</title>
<link rel="alternate" type="text/html" href="commit/fdf7bf9ec122ae8104c86d45c5563560fafefd4f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit fdf7bf9ec122ae8104c86d45c5563560fafefd4f
parent f41b10dd2346179281c7a9ef5e809bb5cb87c2ec
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:25:59 +0100

Add butterfly timeline charts to tensions view

Each tension now has a diverging bar chart with horizontal time axis:
- Blue bars extend UP for positive claims (count + mean method score)
- Gray bars extend DOWN for nuanced claims (count + mean method score)
- No color encoding for score — numbers shown directly as labels
- Hover tooltips on each bar with full details

Year field added to tension claims in build pipeline.
Replaces the previous vertical color-encoded butterfly prototype.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>f41b10dd2346179281c7a9ef5e809bb5cb87c2ec</id>
<published>2026-03-24T05:19:41Z</published>
<updated>2026-03-24T05:19:41Z</updated>
<title>Expand HN analysis: scatter, tag paradox, repost/controversy signals</title>
<link rel="alternate" type="text/html" href="commit/f41b10dd2346179281c7a9ef5e809bb5cb87c2ec.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit f41b10dd2346179281c7a9ef5e809bb5cb87c2ec
parent 7072c581ce666fdd9dc061902b0e8823888d9732
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:19:41 +0100

Expand HN analysis: scatter, tag paradox, repost/controversy signals

Social Attention section now includes:
- HN points vs methodology scatter (586 papers, log-scale x-axis)
  with &quot;the blob has no slope&quot; annotation
- Case study paradox: tag comparison showing HN attention vs methodology
  side-by-side (case studies: most HN love, worst methodology)
- Repost signal: 8+ reposts = 50.6% method vs 48.0% for single posts
- Controversy signal: high-discussion papers score 50.7% vs 48.9%
- Updated heatmap annotation text for n=932 correlation values

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>7072c581ce666fdd9dc061902b0e8823888d9732</id>
<published>2026-03-24T05:13:09Z</published>
<updated>2026-03-24T05:13:09Z</updated>
<title>Add 187 scans (932 total), new sampling checkpoint, remove corrupt scan</title>
<link rel="alternate" type="text/html" href="commit/7072c581ce666fdd9dc061902b0e8823888d9732.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 7072c581ce666fdd9dc061902b0e8823888d9732
parent 3d52164e9d12fde30413fa1250caeaecb3f678cd
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Tue, 24 Mar 2026 06:13:09 +0100

Add 187 scans (932 total), new sampling checkpoint, remove corrupt scan

Sampling checkpoint: n=745→48.1%, n=932→47.1%. Decline continues.
Removed corrupt your-prompt-safe-2025/scan.json (truncated string).

Key shifts at n=932:
- Games worsened: Overclaiming 63.2%, Big Numbers 63.1%
- Quality contagion gradient widened: 37.8%→50.6% (was 43.1%→52.3%)
- 2025 is worst year at 45.5% median (n=455)
- Funding gap stable at 12.1pp
- Optimism-rigor gap grew for productivity (+6.6pp)
- All correlation structure stable (contamination↔leakage 0.857,
  artifacts↔stats 0.058, two cultures -0.198)

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3d52164e9d12fde30413fa1250caeaecb3f678cd</id>
<published>2026-03-23T18:53:31Z</published>
<updated>2026-03-23T18:53:31Z</updated>
<title>Add engagement factor strip to papers table</title>
<link rel="alternate" type="text/html" href="commit/3d52164e9d12fde30413fa1250caeaecb3f678cd.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3d52164e9d12fde30413fa1250caeaecb3f678cd
parent bdd13e9d815e0e04aaa193ea80a794d756b99f8f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 19:53:31 +0100

Add engagement factor strip to papers table

V3 papers show a second DNA strip (purple tones) next to the
methodology strip (red-yellow-blue-green). 6 cells for practical,
surprise, fear, drama, demo, brand. Hover shows dimension name
and score. Blank for v2-only papers.

53 papers currently have both strips.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>bdd13e9d815e0e04aaa193ea80a794d756b99f8f</id>
<published>2026-03-23T14:41:38Z</published>
<updated>2026-03-23T14:41:38Z</updated>
<title>Show engagement factors on paper detail pages</title>
<link rel="alternate" type="text/html" href="commit/bdd13e9d815e0e04aaa193ea80a794d756b99f8f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit bdd13e9d815e0e04aaa193ea80a794d756b99f8f
parent 96a3826ca9778e25930fd3677f6e78f06b21e146
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 15:41:38 +0100

Show engagement factors on paper detail pages

V3 papers now display a section with 6 horizontal bars showing
engagement scores (0-3) with justification text for each dimension.
Only shown when engagement_factors exist in the paper data.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>96a3826ca9778e25930fd3677f6e78f06b21e146</id>
<published>2026-03-23T14:05:25Z</published>
<updated>2026-03-23T14:05:25Z</updated>
<title>Show v3 scans in survey progress bar</title>
<link rel="alternate" type="text/html" href="commit/96a3826ca9778e25930fd3677f6e78f06b21e146.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 96a3826ca9778e25930fd3677f6e78f06b21e146
parent 51dc81021fc0182b9668043b714be26b1f230a2a
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 15:05:25 +0100

Show v3 scans in survey progress bar

Progress bar now has 5 segments: V3 (blue, 53), V2 (green, 692),
V1 rescan, Queued, No PDF. Header shows engagement factor count.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>51dc81021fc0182b9668043b714be26b1f230a2a</id>
<published>2026-03-23T13:54:04Z</published>
<updated>2026-03-23T13:54:04Z</updated>
<title>Integrate v3 engagement factors into explorer pipeline</title>
<link rel="alternate" type="text/html" href="commit/51dc81021fc0182b9668043b714be26b1f230a2a.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 51dc81021fc0182b9668043b714be26b1f230a2a
parent a85920f8b970cf039362ba691b05a72e8439d3d1
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 14:54:04 +0100

Integrate v3 engagement factors into explorer pipeline

Build script now reads engagement_factors from v3 scans and computes:
- Per-dimension correlation with log(HN points)
- High-HN vs low-HN mean engagement scores per dimension
- All data flows to findings.json and paper detail pages

Currently n=45 v3 papers — correlations are weak but directional:
brand recognition and fear are the only positive signals. Numbers
will sharpen as more v3 catchup batches run.

Findings view shows engagement factor correlations when n&gt;=10.
Paper detail pages include engagement_factors when present.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>a85920f8b970cf039362ba691b05a72e8439d3d1</id>
<published>2026-03-23T12:49:40Z</published>
<updated>2026-03-23T12:49:40Z</updated>
<title>Fix catchup-v3 to read full paper.txt, not just scan summary</title>
<link rel="alternate" type="text/html" href="commit/a85920f8b970cf039362ba691b05a72e8439d3d1.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a85920f8b970cf039362ba691b05a72e8439d3d1
parent d6d31c6cb0ff5d41b80f2e4eaeb2e992c3702dd8
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 13:49:40 +0100

Fix catchup-v3 to read full paper.txt, not just scan summary

The cheap version (title + key_findings only) produced plausible but
ungrounded engagement scores. Now reads the full paper text so Opus
can assess demo-ability from actual URLs, practical relevance from
implementation details, and surprise from the full argument.

Reset 36 papers from cheap v3 back to v2 for re-processing.
Timeout bumped 120s → 300s for longer paper reads.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>d6d31c6cb0ff5d41b80f2e4eaeb2e992c3702dd8</id>
<published>2026-03-23T12:26:54Z</published>
<updated>2026-03-23T12:26:54Z</updated>
<title>Add v3 scan instrument with engagement factors, catchup script</title>
<link rel="alternate" type="text/html" href="commit/d6d31c6cb0ff5d41b80f2e4eaeb2e992c3702dd8.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d6d31c6cb0ff5d41b80f2e4eaeb2e992c3702dd8
parent 781cf7f2cc3cdd41d5fea630408c3cd59982712e
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 13:26:54 +0100

Add v3 scan instrument with engagement factors, catchup script

v3 extends v2 with 6 engagement factor dimensions (0-3 each):
- practical_relevance, surprise_contrarian, fear_safety,
  drama_conflict, demo_ability, brand_recognition

Two scripts:
- scripts/catchup-v3.py: upgrades existing v2 scans to v3 by running
  Opus classification on title + key_findings + claims (cheap, no full
  paper text needed). Supports --parallel, --limit, --id.
- agents/scan-agent.md: updated to produce v3 directly for new scans.

Tested on 13 papers. Scores align with manual prototype:
Codex [3,2,1,1,2,3], Agents of Chaos [2,2,3,2,1,2],
Chain-of-Thought [3,2,0,0,2,2].

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>781cf7f2cc3cdd41d5fea630408c3cd59982712e</id>
<published>2026-03-23T12:09:22Z</published>
<updated>2026-03-23T12:09:22Z</updated>
<title>Add HN social attention data: enrichment script, findings, paper links</title>
<link rel="alternate" type="text/html" href="commit/781cf7f2cc3cdd41d5fea630408c3cd59982712e.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 781cf7f2cc3cdd41d5fea630408c3cd59982712e
parent 4b1edc22cbf261dbc4f797132d00a2067c22d276
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 13:09:22 +0100

Add HN social attention data: enrichment script, findings, paper links

New script: scripts/enrich-hn.py queries HN Algolia API for all v2
papers (arxiv_id search + title fallback). 586/745 papers found,
3,433 HN threads collected. Saves papers/{slug}/hn.json with thread
details (title, points, comments, URL).

Key finding: HN attention is uncorrelated with methodology (r=0.061).
Social media amplifies novelty, not rigor.

New findings section &quot;Social Attention vs Rigor&quot;:
- Hidden gems: 15 papers scoring 65%+ with &lt;=5 HN points
  (e.g., &quot;Measuring Mid-2025 LLM-Assistance&quot; at 86%, 0 HN pts)
- Overhyped: 15 papers scoring &lt;40% with 30+ HN points
  (e.g., &quot;Efficient Guided Generation&quot; at 854 pts, 31.7% method)
- Top 15 most-discussed papers table

Paper detail pages show &quot;HN (Npts)&quot; link button to top thread.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4b1edc22cbf261dbc4f797132d00a2067c22d276</id>
<published>2026-03-23T10:19:23Z</published>
<updated>2026-03-23T10:19:23Z</updated>
<title>Replace jittered scatter with bubble grid for two cultures</title>
<link rel="alternate" type="text/html" href="commit/4b1edc22cbf261dbc4f797132d00a2067c22d276.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4b1edc22cbf261dbc4f797132d00a2067c22d276
parent 8378a8226cb32ffa456ceaab94abc9b42ccb619b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 11:19:23 +0100

Replace jittered scatter with bubble grid for two cultures

Scatter with jitter was dishonest — data is discrete (4-7 questions
per category = quantized scores). Replaced with bubble grid: each
grid intersection shows a circle sized by paper count, colored by
mean score, with count label. Honestly represents the discrete data
while clearly showing the negative correlation pattern.
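
A minimal sketch of the aggregation behind such a bubble grid, assuming each
paper is an (x, y, score) tuple on the quantized axes. The function name and
the sample values are illustrative, not taken from the build script.

```python
from collections import defaultdict


def bubble_grid(papers):
    """Group (x, y, score) tuples by grid point; return count and mean score."""
    cells = defaultdict(list)
    for x, y, score in papers:
        cells[(x, y)].append(score)
    # One bubble per occupied grid point: size by count, color by mean.
    return {point: (len(s), sum(s) / len(s)) for point, s in cells.items()}


# Two papers share the (25, 50) intersection, one sits alone at (75, 0).
papers = [(25, 50, 40.0), (25, 50, 60.0), (75, 0, 30.0)]
grid = bubble_grid(papers)
```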

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>8378a8226cb32ffa456ceaab94abc9b42ccb619b</id>
<published>2026-03-23T10:17:48Z</published>
<updated>2026-03-23T10:17:48Z</updated>
<title>Add jitter to two cultures scatter to show density</title>
<link rel="alternate" type="text/html" href="commit/8378a8226cb32ffa456ceaab94abc9b42ccb619b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 8378a8226cb32ffa456ceaab94abc9b42ccb619b
parent 40085bd948ffe671e0449b2f1e9da4ba121b01d5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 11:17:48 +0100

Add jitter to two cultures scatter to show density

Scores are quantized (4 artifacts questions = 0/25/50/75/100%) so
dots stack on grid intersections. Added deterministic jitter using
golden-angle distribution so overlapping dots spread into visible
clusters while maintaining approximate position.
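
A rough sketch of deterministic golden-angle jitter, assuming each dot stacked
on a grid point gets an index k. The helper name, the radius default, and the
square-root spacing are assumptions for illustration, not the explorer code.

```python
import math

# Golden angle in radians: pi * (3 - sqrt(5)), about 2.39996.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def jitter_offset(k, n, radius=2.0):
    """Offset for the k-th of n dots stacked on one grid point.

    Hypothetical helper: square-root spacing keeps dot density roughly
    even inside a disc of the given radius (in score percentage points).
    """
    if n == 1:
        return (0.0, 0.0)  # a lone dot stays exactly on its grid point
    r = radius * math.sqrt((k + 0.5) / n)
    theta = k * GOLDEN_ANGLE
    return (r * math.cos(theta), r * math.sin(theta))


# Five papers that all scored (50, 75) spread into a small cluster.
offsets = [jitter_offset(k, 5) for k in range(5)]
points = [(50 + dx, 75 + dy) for dx, dy in offsets]
```

The offsets depend only on k and n, so the layout is identical on every build.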

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>40085bd948ffe671e0449b2f1e9da4ba121b01d5</id>
<published>2026-03-23T10:15:30Z</published>
<updated>2026-03-23T10:15:30Z</updated>
<title>Add repro funnel, methodology treemap, two cultures scatter; fix network zoom</title>
<link rel="alternate" type="text/html" href="commit/40085bd948ffe671e0449b2f1e9da4ba121b01d5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 40085bd948ffe671e0449b2f1e9da4ba121b01d5
parent a2c488b4b161129d19bc4aff0445e74a4c93407f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 11:15:30 +0100

Add repro funnel, methodology treemap, two cultures scatter; fix network zoom

New findings visuals:
- Reproducibility funnel: 745→400→351→61→49, cliff at env specs
- Methodology treemap: benchmark-eval dominates (561), colored by score
- Two cultures scatter: human_studies vs artifacts (r=-0.24)

Network fixes:
- Start zoomed out (k=0.5) so full graph is visible on load
- Thicker edges (1.0→1.5 default, 1.8→2.5 hover)

Updated memory files with current state, explorer architecture,
and &quot;when user says more papers scanned&quot; workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>a2c488b4b161129d19bc4aff0445e74a4c93407f</id>
<published>2026-03-23T10:03:05Z</published>
<updated>2026-03-23T10:03:05Z</updated>
<title>Add reproducibility funnel, methodology treemap, two cultures scatter</title>
<link rel="alternate" type="text/html" href="commit/a2c488b4b161129d19bc4aff0445e74a4c93407f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a2c488b4b161129d19bc4aff0445e74a4c93407f
parent 59c5b1043da1db314c2da2b0d833733c9fe627f5
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 11:03:05 +0100

Add reproducibility funnel, methodology treemap, two cultures scatter

Three new visualizations in findings:

Reproducibility funnel: 745 → 400 (code) → 351 (data) → 61 (env) →
49 (instructions). The cliff at environment specs is where
reproducibility collapses — 90% of code-releasing papers stop there.

Methodology landscape treemap: proportional blocks sized by paper
count, colored by mean score. Benchmark-eval dominates (561 papers),
RCTs score highest (64.3%), case studies lowest (39.5%).

Two cultures scatter: human_studies vs artifacts score for 80 papers
with human subjects. Negatively correlated (r=-0.24) — CS researchers
release code but skip IRB; psychology researchers do ethics review
but don&#39;t release data. Four quadrants labeled.

Also: 3 new v2 scans (Codex 71.7%, CoT 56.6%, ReAct 48.2%) and
Agents of Chaos rescan (47.5%).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>59c5b1043da1db314c2da2b0d833733c9fe627f5</id>
<published>2026-03-23T09:23:12Z</published>
<updated>2026-03-23T09:23:12Z</updated>
<title>Rescan Agents of Chaos (2602.20021) as v2: 47.5%</title>
<link rel="alternate" type="text/html" href="commit/59c5b1043da1db314c2da2b0d833733c9fe627f5.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 59c5b1043da1db314c2da2b0d833733c9fe627f5
parent 4d2226787818ffd5455c35bf72eeef923ae3a7ce
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 10:23:12 +0100

Rescan Agents of Chaos (2602.20021) as v2: 47.5%

Red-teaming study of 6 autonomous LLM agents in live lab environment.
Strong on claims/evidence and limitations (100%), weak on artifacts
and human studies. 5 red flags including no IRB and convenience sample.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>4d2226787818ffd5455c35bf72eeef923ae3a7ce</id>
<published>2026-03-23T09:09:30Z</published>
<updated>2026-03-23T09:09:30Z</updated>
<title>Add human-readable descriptions to per-question pass rates</title>
<link rel="alternate" type="text/html" href="commit/4d2226787818ffd5455c35bf72eeef923ae3a7ce.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 4d2226787818ffd5455c35bf72eeef923ae3a7ce
parent 5618e59897d0dae1dd10f47a1d8e147054069a9d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 10:09:30 +0100

Add human-readable descriptions to per-question pass rates

Each question bar now shows a concise description instead of the raw
snake_case field name. Descriptions pre-computed in build script
(67 questions mapped). E.g., &quot;self_comparison_bias_addressed&quot; becomes
&quot;Self-evaluation bias acknowledged&quot;.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>5618e59897d0dae1dd10f47a1d8e147054069a9d</id>
<published>2026-03-23T09:06:09Z</published>
<updated>2026-03-23T09:06:09Z</updated>
<title>Add explainer text for each named game on dashboard</title>
<link rel="alternate" type="text/html" href="commit/5618e59897d0dae1dd10f47a1d8e147054069a9d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 5618e59897d0dae1dd10f47a1d8e147054069a9d
parent 96e47d0e1acc7b77cf730518cd97e6b0cb6f419b
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 10:06:09 +0100

Add explainer text for each named game on dashboard

Each game row now shows a one-line description explaining what the
pattern means and how it&#39;s detected.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>96e47d0e1acc7b77cf730518cd97e6b0cb6f419b</id>
<published>2026-03-23T08:59:27Z</published>
<updated>2026-03-23T08:59:27Z</updated>
<title>Rebuild citation network, add ego mode, quality flow, and network findings</title>
<link rel="alternate" type="text/html" href="commit/96e47d0e1acc7b77cf730518cd97e6b0cb6f419b.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 96e47d0e1acc7b77cf730518cd97e6b0cb6f419b
parent 6d3758b3b52628e5f9c0bb8cb38aae235f766dde
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 09:59:27 +0100

Rebuild citation network, add ego mode, quality flow, and network findings

Network rebuilt from cited_papers (960 nodes, 2952 edges vs old 572/715).
Scanned 3 foundational papers: Codex 71.7%, CoT 56.6%, ReAct 48.2%.

Network view:
- Directed arrows on edges (visible when zoomed in or in ego mode)
- Click node = ego mode: shows 1-hop neighborhood with in/out distinction,
  color-coded edges (blue=cited-by, orange=cites), info panel with stats
- Double-click = navigate to paper detail
- Edge color toggle: default or quality flow (green=good→good, red=weak→good)
- Escape or &quot;Show all&quot; exits ego mode
- Hover shows &quot;Cites N / Cited by M&quot; with directional counts

Findings:
- Citation Network Insights section with foundational paper leaderboard,
  quality contagion gradient (43.1% → 52.3%), and rigor diffusion table

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>6d3758b3b52628e5f9c0bb8cb38aae235f766dde</id>
<published>2026-03-23T07:46:46Z</published>
<updated>2026-03-23T07:46:46Z</updated>
<title>Add 4 new games (10 total), DNA profile strips in paper table</title>
<link rel="alternate" type="text/html" href="commit/6d3758b3b52628e5f9c0bb8cb38aae235f766dde.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 6d3758b3b52628e5f9c0bb8cb38aae235f766dde
parent 0bf67124d60b5a1d8c6d27deb7def340c1f0c0f0
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Mon, 23 Mar 2026 08:46:46 +0100

Add 4 new games (10 total), DNA profile strips in paper table

New games detecting orthogonal methodology failures:
- Trust Us (40.5%): no raw data AND no code — unverifiable
- The Black Box (12.3%): no prompts AND no hyperparameters — unreplicable
- Moving Goalpost (26.9%): causal claims without causal design
- Limitation Theater (4.2%): has limitations section, all boilerplate

DNA strips: colored inline heatmap per paper row showing 11 base
category scores at a glance (red→yellow→blue→green). Replaces
venue column — methodology profile is more useful.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>0bf67124d60b5a1d8c6d27deb7def340c1f0c0f0</id>
<published>2026-03-22T20:54:26Z</published>
<updated>2026-03-22T20:54:26Z</updated>
<title>Add PCA scatter plot — paper methodology map</title>
<link rel="alternate" type="text/html" href="commit/0bf67124d60b5a1d8c6d27deb7def340c1f0c0f0.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 0bf67124d60b5a1d8c6d27deb7def340c1f0c0f0
parent c641e50fbc95253d2debbe8c25dc5e8357e58dc3
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 21:54:26 +0100

Add PCA scatter plot — paper methodology map

Project 708 papers from 9 category scores to 2D via PCA (52.8%
variance explained). Papers colored by archetype, hover for details,
click to navigate.

PC1 = overall rigor (limitations, data_integrity, claims dominate)
PC2 = practical detail vs reflection (cost, setup vs limitations)

Archetypes separate clearly: Complete clusters left (rigorous),
Minimal right (weak), Theater and Mixed overlap in the middle.
Hand-rolled PCA in build script (power iteration, no numpy needed).
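
One way a numpy-free PCA by power iteration can look, as a sketch only. The
function names, the fixed start vector, the iteration count, and the deflation
step for PC2 are assumptions, not the build script itself.

```python
import math


def covariance(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def power_iteration(m, iterations=500):
    """Dominant eigenvector and eigenvalue of a symmetric PSD matrix."""
    d = len(m)
    # Fixed, slightly asymmetric start vector so runs are reproducible.
    v = [1.0 + 0.01 * i for i in range(d)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm  # for unit v, the norm of m*v tends to the eigenvalue
    return v, eigenvalue


def pca_2d(rows):
    """Project rows onto the first two components; also return variance explained."""
    cov, centered = covariance(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    v1, l1 = power_iteration(cov)
    # Deflation: remove the first component so the next run finds PC2.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, l2 = power_iteration(deflated)
    coords = [(sum(c[j] * v1[j] for j in range(d)),
               sum(c[j] * v2[j] for j in range(d))) for c in centered]
    return coords, (l1 + l2) / total


# Toy data with most variance along the first axis.
coords, explained = pca_2d([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```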

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>c641e50fbc95253d2debbe8c25dc5e8357e58dc3</id>
<published>2026-03-22T20:49:48Z</published>
<updated>2026-03-22T20:49:48Z</updated>
<title>Add category correlation heatmap to findings view</title>
<link rel="alternate" type="text/html" href="commit/c641e50fbc95253d2debbe8c25dc5e8357e58dc3.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit c641e50fbc95253d2debbe8c25dc5e8357e58dc3
parent d240203118b1d2332118fdcb2cfd94594a523da2
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 21:49:48 +0100

Add category correlation heatmap to findings view

Pre-compute 14x14 Pearson correlation matrix between category-level
pass rates. Rendered as interactive SVG heatmap with hover tooltips.

Key findings surfaced:
- contamination &lt;-&gt; data_leakage r=0.87 (same decision)
- artifacts &lt;-&gt; stat_methodology r=0.05 (completely independent)
- human_studies &lt;-&gt; artifacts r=-0.24 (two cultures)
- Three independent rigor clusters: transparency, statistics, contamination
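
A small sketch of how such a matrix can be pre-computed, assuming every paper
has a pass rate in every category. The function names and the sample pass
rates are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson r for two equal-length lists, or None if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return None  # correlation is undefined without variance
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(scores):
    """scores maps category name to a list of per-paper pass rates."""
    names = sorted(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Hypothetical pass rates for four papers in two categories.
matrix = correlation_matrix({
    "artifacts": [0.0, 0.25, 0.5, 1.0],
    "human_studies": [1.0, 0.75, 0.5, 0.0],
})
```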

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>d240203118b1d2332118fdcb2cfd94594a523da2</id>
<published>2026-03-22T20:33:58Z</published>
<updated>2026-03-22T20:33:58Z</updated>
<title>Add findings view with 10 analysis sections and code URL extraction</title>
<link rel="alternate" type="text/html" href="commit/d240203118b1d2332118fdcb2cfd94594a523da2.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit d240203118b1d2332118fdcb2cfd94594a523da2
parent 1818e336e2cc2445cd1006f83c3fa66c7eec7259
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 21:33:58 +0100

Add findings view with 10 analysis sections and code URL extraction

New #/findings view surfaces deep analysis that was only in docs:
- Per-question pass rates (67 questions, worst: self_comparison_bias 0.8%)
- Year trends by category with toggleable lines (contamination 29%→7%)
- Venue &amp; citation scoring (500+ cites score below average)
- Optimism-rigor inversion (positive claims from weaker papers)
- Quality homophily (high-quality papers cite high-quality 3x more)
- Sampling effect (median drops as long tail scanned)
- Benchmark monoculture (58% pure benchmark-eval)
- Funding gap (13pp between disclosed/undisclosed)
- Reproducibility drill-down (4.2% fully reproducible)
- All 6 named games (added Cherry-picked Comparisons, All Show No Substance)

Also: extracted 282 code URLs from scan justification text, shown as
&quot;Code&quot; link on paper detail pages.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>1818e336e2cc2445cd1006f83c3fa66c7eec7259</id>
<published>2026-03-22T18:04:29Z</published>
<updated>2026-03-22T18:04:29Z</updated>
<title>Fix network: edge contrast, mouseover hit detection</title>
<link rel="alternate" type="text/html" href="commit/1818e336e2cc2445cd1006f83c3fa66c7eec7259.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1818e336e2cc2445cd1006f83c3fa66c7eec7259
parent 63a3d148e5733c645300c806f4ffdafd4d344f71
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 19:04:29 +0100

Fix network: edge contrast, mouseover hit detection

- Edge opacity 0.25 → 0.5, line width 0.8 → 1.2
- Fixed mouseover/click: canvas CSS scaling wasn&#39;t accounted for in
  mouse coordinate math, making hit detection wildly off on any
  screen where the canvas CSS width != 1200px (i.e. almost always)
- Zoom and drag also fixed for the same CSS-to-canvas scaling issue
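
The coordinate fix comes down to one scale factor per axis. The sketch below
shows only that arithmetic, in Python rather than the explorer TypeScript. The
function name and the 800 px canvas height are assumptions, and rect stands in
for what getBoundingClientRect() reports in the browser.

```python
def css_to_canvas(client_x, client_y, rect, canvas_width, canvas_height):
    """Map a mouse position in CSS pixels to canvas drawing coordinates.

    rect is (left, top, width, height) of the canvas as laid out on the page.
    """
    left, top, css_width, css_height = rect
    scale_x = canvas_width / css_width    # canvas units per CSS pixel
    scale_y = canvas_height / css_height
    return ((client_x - left) * scale_x, (client_y - top) * scale_y)


# A 1200-unit-wide canvas displayed at 600 CSS px: each CSS pixel spans 2 units.
x, y = css_to_canvas(310, 120, (10, 20, 600, 400), 1200, 800)
```

Without the scale factors, the same click would land at (300, 100) and miss
any node drawn at (600, 200).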

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>63a3d148e5733c645300c806f4ffdafd4d344f71</id>
<published>2026-03-22T17:38:54Z</published>
<updated>2026-03-22T17:38:54Z</updated>
<title>Add pipeline progress bar and show all registry papers in browser</title>
<link rel="alternate" type="text/html" href="commit/63a3d148e5733c645300c806f4ffdafd4d344f71.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 63a3d148e5733c645300c806f4ffdafd4d344f71
parent 64fcaa825f5ab1a80912f80abc8e771b5b0a3a81
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 18:38:54 +0100

Add pipeline progress bar and show all registry papers in browser

- Dashboard shows segmented progress bar: v2 scanned, v1 rescan,
  queued, no PDF — updates automatically with each deploy.
- Papers browser lists all 2,687 registry entries. Unscanned papers
  show &quot;--&quot; for score/archetype and are not clickable.
- Pipeline stats added to dashboard.json (registry_total, v2_scanned,
  v1_needs_rescan, has_text_no_scan, no_text, excluded).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>64fcaa825f5ab1a80912f80abc8e771b5b0a3a81</id>
<published>2026-03-22T12:03:16Z</published>
<updated>2026-03-22T12:03:16Z</updated>
<title>Trigger deploy on scan.json and registry changes</title>
<link rel="alternate" type="text/html" href="commit/64fcaa825f5ab1a80912f80abc8e771b5b0a3a81.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 64fcaa825f5ab1a80912f80abc8e771b5b0a3a81
parent ad711786b26cf3ee682d7c97d2e74615983eafda
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 13:03:16 +0100

Trigger deploy on scan.json and registry changes

Deploy only fired on explorer/ changes, so new scan batches
didn&#39;t update the site.

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>ad711786b26cf3ee682d7c97d2e74615983eafda</id>
<published>2026-03-22T11:19:20Z</published>
<updated>2026-03-22T11:19:20Z</updated>
<title>Add 274 v2 scans (741 total), remove corrupt spectr scan</title>
<link rel="alternate" type="text/html" href="commit/ad711786b26cf3ee682d7c97d2e74615983eafda.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit ad711786b26cf3ee682d7c97d2e74615983eafda
parent dd1e239a2cd9d62a8dd1f144070ba1d9b604c983
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 12:19:20 +0100

Add 274 v2 scans (741 total), remove corrupt spectr scan

Batch from parallel scan run with --max-turns 8 fix.
Removed spectr-fast-speculative-2023/scan.json (trailing comma).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>dd1e239a2cd9d62a8dd1f144070ba1d9b604c983</id>
<published>2026-03-22T06:45:01Z</published>
<updated>2026-03-22T06:45:01Z</updated>
<title>Fix scan agent max-turns: 3 → 8, accept --max-turns CLI arg</title>
<link rel="alternate" type="text/html" href="commit/dd1e239a2cd9d62a8dd1f144070ba1d9b604c983.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit dd1e239a2cd9d62a8dd1f144070ba1d9b604c983
parent fab02ffcee6cc4b837e996171068e5654295171d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun, 22 Mar 2026 07:45:01 +0100

Fix scan agent max-turns: 3 → 8, accept --max-turns CLI arg

3 turns was the exact minimum (read agent prompt, read schema, write
scan.json) with zero margin. 129/200 papers silently failed when the
agent needed an extra turn. Bumping to 8 resolved all but the known
persistent failures (bad PDFs, survey_methodology truncation).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>fab02ffcee6cc4b837e996171068e5654295171d</id>
<published>2026-03-18T13:11:09Z</published>
<updated>2026-03-18T13:11:09Z</updated>
<title>Split data pipeline, add light/dark mode, fix network and detail views</title>
<link rel="alternate" type="text/html" href="commit/fab02ffcee6cc4b837e996171068e5654295171d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit fab02ffcee6cc4b837e996171068e5654295171d
parent f40a5cabd9d1f25608787a26f5ebfc43cd07177e
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 18 Mar 2026 14:11:09 +0100

Split data pipeline, add light/dark mode, fix network and detail views

- Split explorer.json into per-view files: dashboard.json (1.6KB),
  papers-index.json (150KB), papers/{slug}.json, network.json, tensions.json.
  Dashboard now loads instantly instead of waiting for 9MB.
- Add light/dark mode toggle with localStorage persistence and
  prefers-color-scheme detection.
- Fix network: higher edge contrast (theme-aware), larger hit radius for
  hover/click, pointer cursor, &quot;click to view&quot; hint, drag vs click
  distinction, node outlines.
- Add arXiv/DOI/source links on paper detail pages.
- Add CSS spinner on all view loads.
- Gitignore all generated data files (explorer/public/data/).

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>f40a5cabd9d1f25608787a26f5ebfc43cd07177e</id>
<published>2026-03-18T12:38:54Z</published>
<updated>2026-03-18T12:38:54Z</updated>
<title>Add 467 v2 scans, metadata, calibration, explorer, and deploy pipeline</title>
<link rel="alternate" type="text/html" href="commit/f40a5cabd9d1f25608787a26f5ebfc43cd07177e.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit f40a5cabd9d1f25608787a26f5ebfc43cd07177e
parent 279c91802101fbb60c70d19a451bfc29babcd85d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Wed, 18 Mar 2026 13:38:54 +0100

Add 467 v2 scans, metadata, calibration, explorer, and deploy pipeline

Bulk commit of accumulated work:
- 1028 scan.json (467 v2 + 561 v1), 730 metadata.json, 60 calibration.json
- Static data explorer (Vite + vanilla TS): dashboard, paper browser,
  paper detail, citation network, claim tensions
- scripts/build-explorer-data.py aggregates scan data into explorer.json
- Forgejo CI/CD workflow with blue/green deployment
- Updated scan schema (proxy_outcome_distinction, scaffold_confound_addressed)
- Analysis artifacts: citation graph, v2 findings, deep patterns
- Playwright test suite (23 tests)
- .gitignore: pdf-finder-result.txt, explorer build artifacts, settings.local

Co-Authored-By: Claude Opus 4.6 (1M context) &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>279c91802101fbb60c70d19a451bfc29babcd85d</id>
<published>2026-03-08T09:22:14Z</published>
<updated>2026-03-08T09:22:14Z</updated>
<title>Implement v2 scan pipeline with conditional modules and enrichment</title>
<link rel="alternate" type="text/html" href="commit/279c91802101fbb60c70d19a451bfc29babcd85d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 279c91802101fbb60c70d19a451bfc29babcd85d
parent 6e63d899afcec26a7a1e6668f9197bfad25b53f0
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sun,  8 Mar 2026 10:22:14 +0100

Implement v2 scan pipeline with conditional modules and enrichment

- claim.py: extend expiry to 1 hour, add take-next atomic command
- validate-scan.py: standalone schema validator (572/572 existing scans pass)
- Schema: add scan_version, active_modules, 3 conditional categories
  (experimental_rigor 8q, data_leakage 4q, survey_methodology 3q)
  sourced from Henderson, Dodge, Lucic, Kapoor meta-research findings
- scan-worker.md: v2 worker loop (triage → 6 parallel category agents)
- scan-triage.md + scan-category-{a..f}.md: split evaluation prompts
- scan.md command: v2 default with v1 fallback flag
- enrich-metadata.py: Semantic Scholar API enrichment
- build-citation-graph.py: cross-reference cited_papers against registry
- methodology.md: document new questions with sources
- V1 scans remain valid (all new fields optional)

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>6e63d899afcec26a7a1e6668f9197bfad25b53f0</id>
<published>2026-02-28T05:55:19Z</published>
<updated>2026-02-28T05:55:19Z</updated>
<title>Tighten scan instrument based on Opus calibration (93.2% agreement)</title>
<link rel="alternate" type="text/html" href="commit/6e63d899afcec26a7a1e6668f9197bfad25b53f0.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 6e63d899afcec26a7a1e6668f9197bfad25b53f0
parent fd2ab321110f363123c295dbaa3862329aec7709
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Sat, 28 Feb 2026 06:55:19 +0100

Tighten scan instrument based on Opus calibration (93.2% agreement)

Calibration of 8 papers found two systematic Sonnet failure modes:
- NA boundary errors (56% of disagreements): added explicit &quot;NA when:&quot;
  guidance to contamination, human_studies, artifacts, cost categories
- Generosity bias (44%): added &quot;does NOT count&quot; examples to
  prompts_provided, variance_reported, model_versions_specified, etc.

Schema: 14 question descriptions updated with sharper criteria.
Agent prompt: added &quot;When to use NA&quot; section, &quot;Common traps&quot; list,
and per-paper-type NA/NO guidance for surveys, mining studies, benchmarks.
Methodology context rewritten to reflect boolean checklist (was still
describing old 0-3 rubric).

All 30 scan.json and 8 calibration.json files removed so they can be
re-run with the improved instrument. Calibration round 1 results are
preserved in analysis/calibration-summary.{json,md}.

Added /scan project command for running scan pipeline.

Co-Authored-By: Claude Sonnet 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>fd2ab321110f363123c295dbaa3862329aec7709</id>
<published>2026-02-27T21:34:30Z</published>
<updated>2026-02-27T21:34:30Z</updated>
<title>Add paper claim system for parallel scan agents</title>
<link rel="alternate" type="text/html" href="commit/fd2ab321110f363123c295dbaa3862329aec7709.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit fd2ab321110f363123c295dbaa3862329aec7709
parent a0bf4b555a37ff061384f55a6807ac4235ed17bf
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 22:34:30 +0100

Add paper claim system for parallel scan agents

scripts/claim.py provides file-based locking to prevent two agents
from scanning the same paper:
- take: claim a paper (fails if already claimed)
- done/fail: release a claim
- list: show unclaimed papers ready to scan
- status: summary of scan progress
- Claims expire after 10 minutes (stale agent recovery)

Claims stored as papers/&lt;slug&gt;/.claimed_&lt;timestamp&gt; files.
Added to .gitignore along with paper.txt (regenerable).
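
A minimal sketch of the claim logic described above, not the actual
scripts/claim.py. Assumptions: the timestamp suffix is Unix epoch
seconds, and the names take, release, and active_claims are
illustrative. The check-then-create step is not atomic, so a real
implementation would need a stronger guard against races.

```python
import time
from pathlib import Path
from typing import Optional

CLAIM_PREFIX = ".claimed_"   # marker name taken from the text above
EXPIRY_SECONDS = 10 * 60     # claims expire after 10 minutes


def active_claims(paper_dir: Path, now: float) -> list:
    """Return unexpired claim files and delete stale ones."""
    live = []
    for claim in paper_dir.glob(CLAIM_PREFIX + "*"):
        # Assumption: the suffix is a Unix timestamp in seconds.
        stamp = float(claim.name[len(CLAIM_PREFIX):])
        if now - stamp >= EXPIRY_SECONDS:
            claim.unlink()   # stale agent recovery
        else:
            live.append(claim)
    return live


def take(paper_dir: Path, now: Optional[float] = None) -> bool:
    """Claim a paper; return False if another claim is still live."""
    now = time.time() if now is None else now
    if active_claims(paper_dir, now):
        return False
    (paper_dir / f"{CLAIM_PREFIX}{int(now)}").touch()
    return True


def release(paper_dir: Path) -> None:
    """Drop every claim file for the paper (the done/fail commands)."""
    for claim in paper_dir.glob(CLAIM_PREFIX + "*"):
        claim.unlink()
```

With this sketch, a second take() on the same directory returns False
until the first claim is released or is at least 10 minutes old.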

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>a0bf4b555a37ff061384f55a6807ac4235ed17bf</id>
<published>2026-02-27T21:31:36Z</published>
<updated>2026-02-27T21:31:36Z</updated>
<title>Replace subjective 0-3 rubric with 50-question boolean checklist</title>
<link rel="alternate" type="text/html" href="commit/a0bf4b555a37ff061384f55a6807ac4235ed17bf.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a0bf4b555a37ff061384f55a6807ac4235ed17bf
parent b4be3d6dbb04eddb183b2a99dad0327adbe38b39
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 22:31:36 +0100

Replace subjective 0-3 rubric with 50-question boolean checklist

Redesigned the scan instrument for verifiability and auditability:
- 50 yes/no/na questions across 11 categories (artifacts, statistical
  methodology, evaluation design, claims &amp; evidence, setup transparency,
  limitations, data integrity, conflicts of interest, contamination,
  human studies, cost &amp; practicality)
- Each question has detailed evaluation guidance in the schema
  description explaining exactly what to look for
- Each answer requires a justification citing specific paper sections
- Inspired by Wakefield case: added data_integrity and
  conflicts_of_interest categories to catch fabrication and undisclosed
  conflicts
- Changed model assignment from Opus to Sonnet (booleans are factual
  lookups, not subjective judgment)

Old rubric (6 dimensions, 0-3 scores) removed. Composite scores are
now derived deterministically from boolean counts.
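
The scoring formula itself is not spelled out in this message. One
plausible reading, sketched under the assumption that NA answers are
excluded and the score is the percentage of yes answers among the
remaining questions (composite_score is a hypothetical name):

```python
def composite_score(answers: dict) -> float:
    """Percentage of applicable questions answered 'yes'.

    Assumption: 'na' answers are left out of the denominator, and a
    paper with no applicable questions scores 0.0.
    """
    yes = sum(1 for a in answers.values() if a == "yes")
    no = sum(1 for a in answers.values() if a == "no")
    applicable = yes + no
    if applicable == 0:
        return 0.0
    return round(100.0 * yes / applicable, 1)
```

Because the inputs are plain yes/no/na values, the same answers always
produce the same score, which is the property the checklist is after.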

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>b4be3d6dbb04eddb183b2a99dad0327adbe38b39</id>
<published>2026-02-27T21:05:52Z</published>
<updated>2026-02-27T21:05:52Z</updated>
<title>Update scan agent prompt and add scan orchestrator</title>
<link rel="alternate" type="text/html" href="commit/b4be3d6dbb04eddb183b2a99dad0327adbe38b39.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit b4be3d6dbb04eddb183b2a99dad0327adbe38b39
parent 08f6c3db4222ef96a668db4bf3fd6b61e3326b67
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 22:05:52 +0100

Update scan agent prompt and add scan orchestrator

scan-agent.md:
- Added file paths (paper.txt input, scan.json output)
- Added write-immediately rule and registry status update
- Added guidance for scoring survey papers (are they rigorous or
  just laundering weak results?)
- Added handling for theoretical/position papers
- Added schema validation requirements
- Added note that important papers can still score poorly

extract-text.py:
- Added extraction-failures.txt log for papers that fail both
  pymupdf and Sonnet fallback

scripts/run-scan.py (new):
- Orchestrates full pipeline: extract text → scan agent → validate
- Calls claude CLI with opus model for each paper
- Validates scan.json output (required fields, rubric dimensions)
- Updates registry status to &#39;scanned&#39;
- Supports --parallel N for concurrent scanning
- Writes scan-failures.txt for debugging

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>08f6c3db4222ef96a668db4bf3fd6b61e3326b67</id>
<published>2026-02-27T20:55:00Z</published>
<updated>2026-02-27T20:55:00Z</updated>
<title>Add DOI-based download script for non-arXiv papers</title>
<link rel="alternate" type="text/html" href="commit/08f6c3db4222ef96a668db4bf3fd6b61e3326b67.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 08f6c3db4222ef96a668db4bf3fd6b61e3326b67
parent 1021d39ac6f95f5694904bc4c19a3953006c570d
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:55:00 +0100

Add DOI-based download script for non-arXiv papers

scripts/download-doi.py tries three strategies for 749 papers without
arxiv_id: (1) Semantic Scholar open access PDF lookup, (2) Unpaywall
API, (3) arXiv fallback if S2 finds a previously unknown arxiv_id.
Updates manual-download-needed.txt with whatever remains.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>1021d39ac6f95f5694904bc4c19a3953006c570d</id>
<published>2026-02-27T20:45:23Z</published>
<updated>2026-02-27T20:45:23Z</updated>
<title>Harvester run 2: 2492 new papers via S2 citation graph + keyword search</title>
<link rel="alternate" type="text/html" href="commit/1021d39ac6f95f5694904bc4c19a3953006c570d.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1021d39ac6f95f5694904bc4c19a3953006c570d
parent 9aa129f9efbb8bf248c15cdd99a3c0205c7295b7
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:45:23 +0100

Harvester run 2: 2492 new papers via S2 citation graph + keyword search

Phase 1 (citation graph): 692 new papers
  - Fetched citations + references for 8 seed papers via Semantic Scholar API
  - Seeds: METR RCT, Emergent abilities mirage, MAST, Sleeper Agents,
    TypeScript type-check, Scaffolded LLMs, Remote Labor Index, Code gen survey

Phase 2 (keyword search): 1800 new papers
  - 15 query clusters via Semantic Scholar paper search
  - Queries: LLM code generation, AI code review, prompt injection,
    alignment deception, APR, test generation, RAG code, multi-agent
    failure, scaling, AI software engineering, code completion, etc.

Registry grows from 155 → 2647 papers (well past 1000 target).

Co-Authored-By: Claude Sonnet 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>9aa129f9efbb8bf248c15cdd99a3c0205c7295b7</id>
<published>2026-02-27T20:32:03Z</published>
<updated>2026-02-27T20:32:03Z</updated>
<title>Harvester: write to registry immediately after each search</title>
<link rel="alternate" type="text/html" href="commit/9aa129f9efbb8bf248c15cdd99a3c0205c7295b7.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 9aa129f9efbb8bf248c15cdd99a3c0205c7295b7
parent fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:32:03 +0100

Harvester: write to registry immediately after each search

Prevent data loss from session timeouts by requiring the agent to
append new entries to registry.jsonl after each API call or web search
instead of accumulating them in memory.
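
The change itself is to the agent prompt. For illustration only, the
same write-immediately pattern in Python might look like the sketch
below; append_entry is a hypothetical helper, not part of the repo.

```python
import json
import os


def append_entry(registry_path: str, entry: dict) -> None:
    """Append one JSON line and push it to disk before returning."""
    with open(registry_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
        fh.flush()               # flush Python's buffer
        os.fsync(fh.fileno())    # ask the OS to persist the write
```

Calling this once per search result means a timeout loses at most the
entry currently being written, not the whole batch.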

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1</id>
<published>2026-02-27T20:30:41Z</published>
<updated>2026-02-27T20:30:41Z</updated>
<title>Harvester run 1 (138 papers) + rewrite harvester prompt</title>
<link rel="alternate" type="text/html" href="commit/fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1
parent 3a05a3aea1e82e39df3a495ed95e9004e4d25b8f
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:30:41 +0100

Harvester run 1 (138 papers) + rewrite harvester prompt

Registry: 17 → 155 entries. All from arXiv web search only.

Rewrote harvester-agent.md to fix the source-coverage gap:
- Added explicit Semantic Scholar API endpoints (keyword search,
  citation graph traversal for forward/backward chasing, venue filter)
- Added arXiv API query syntax with category+keyword combinations
- Added HuggingFace daily papers API
- Expanded query clusters from 8 to 28
- Added concrete strategy: keyword search ~400, citation graph ~400,
  venue crawl ~200 to reach 1000 target
- Previous prompt listed sources but gave no actionable instructions,
  so Sonnet only used the easiest path (arXiv web search)

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>3a05a3aea1e82e39df3a495ed95e9004e4d25b8f</id>
<published>2026-02-27T20:27:14Z</published>
<updated>2026-02-27T20:27:14Z</updated>
<title>Add build pipeline: text extraction, summary aggregation, venue list</title>
<link rel="alternate" type="text/html" href="commit/3a05a3aea1e82e39df3a495ed95e9004e4d25b8f.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 3a05a3aea1e82e39df3a495ed95e9004e4d25b8f
parent 69c92da1bfbb276ddd27e9ba8256d0087e01c43a
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:27:14 +0100

Add build pipeline: text extraction, summary aggregation, venue list

- scripts/extract-text.py: pymupdf text extraction with Sonnet fallback
  for low-quality results. Outputs paper.txt co-located with PDFs.
- scripts/build-summary.py: aggregates all scan.json into
  analysis/summary.json + summary.md (score distributions, ranked lists,
  red flags, breakdowns by year/tag). Static artifact for narrative work.
- context/requirements.md: full pipeline diagram, venue brainstorm
  (TOSEM, EMSE, NeurIPS D&amp;B, ICSE, Nature MI, etc.), output format (LaTeX)

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>69c92da1bfbb276ddd27e9ba8256d0087e01c43a</id>
<published>2026-02-27T20:11:39Z</published>
<updated>2026-02-27T20:11:39Z</updated>
<title>Add downstream pipeline context to harvester agent prompt</title>
<link rel="alternate" type="text/html" href="commit/69c92da1bfbb276ddd27e9ba8256d0087e01c43a.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 69c92da1bfbb276ddd27e9ba8256d0087e01c43a
parent a6c809bdf74b788c3f3285e3a37f649be5193fe9
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:11:39 +0100

Add downstream pipeline context to harvester agent prompt

Explain how registry entries feed into download, scan, and citation
chasing so the agent understands why arxiv_id matters.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>a6c809bdf74b788c3f3285e3a37f649be5193fe9</id>
<published>2026-02-27T20:08:24Z</published>
<updated>2026-02-27T20:08:24Z</updated>
<title>Add arXiv PDF download script</title>
<link rel="alternate" type="text/html" href="commit/a6c809bdf74b788c3f3285e3a37f649be5193fe9.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit a6c809bdf74b788c3f3285e3a37f649be5193fe9
parent 9168d67f29d824ef189647a7b5049acf0e55cdca
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:08:24 +0100

Add arXiv PDF download script

Downloads PDFs for registry entries with status &#39;queued&#39; and an arxiv_id.
Updates registry status to &#39;downloaded&#39; on success. Respects arXiv rate
limits (3s between requests). Supports --dry-run, --limit N, and --id.

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>9168d67f29d824ef189647a7b5049acf0e55cdca</id>
<published>2026-02-27T20:05:53Z</published>
<updated>2026-02-27T20:05:53Z</updated>
<title>Add .gitignore and CLAUDE.md project rules</title>
<link rel="alternate" type="text/html" href="commit/9168d67f29d824ef189647a7b5049acf0e55cdca.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 9168d67f29d824ef189647a7b5049acf0e55cdca
parent 1c6a723f27a3efff840ec524ce2f155237b0429a
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:05:53 +0100

Add .gitignore and CLAUDE.md project rules

- .gitignore: exclude PDFs from papers/ and inbox/, OS files, Python cache
- CLAUDE.md: registry conventions (slug format, dedup rules, status flow),
  model assignments per agent, code style, git rules

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>1c6a723f27a3efff840ec524ce2f155237b0429a</id>
<published>2026-02-27T20:00:54Z</published>
<updated>2026-02-27T20:00:54Z</updated>
<title>Document target scope (~1000 papers) and model assignments</title>
<link rel="alternate" type="text/html" href="commit/1c6a723f27a3efff840ec524ce2f155237b0429a.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit 1c6a723f27a3efff840ec524ce2f155237b0429a
parent aaa7097d653f63eb5a6a63611954e3c1ec4c4887
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 21:00:54 +0100

Document target scope (~1000 papers) and model assignments

- Target: ~1000 papers scanned, subset for deep eval
- Harvester: Sonnet (structured metadata, no deep reasoning)
- Scan agent: Opus (methodology quality judgment)
- Deep-eval agent: Opus
- Add harvester to agent tier design in requirements

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>aaa7097d653f63eb5a6a63611954e3c1ec4c4887</id>
<published>2026-02-27T19:57:52Z</published>
<updated>2026-02-27T19:57:52Z</updated>
<title>Add citation-chasing pipeline: cited_papers in scan + harvest script</title>
<link rel="alternate" type="text/html" href="commit/aaa7097d653f63eb5a6a63611954e3c1ec4c4887.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit aaa7097d653f63eb5a6a63611954e3c1ec4c4887
parent ed00b8092b4a223958bce1880699cb75d0dc4fe7
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 20:57:52 +0100

Add citation-chasing pipeline: cited_papers in scan + harvest script

- Add cited_papers array to scan.schema.json (required field)
- Update scan-agent.md with instructions to extract survey-relevant
  references from each scanned paper (expect 3-15 per paper)
- Add scripts/harvest-citations.py: reads cited_papers from all
  scan.json files, deduplicates against registry by arxiv_id/doi/title,
  and proposes or appends new registry entries (--apply flag)
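
A rough sketch of the dedup step, not the actual
scripts/harvest-citations.py. Assumptions: entries are dicts with
optional arxiv_id, doi, and title keys, titles are compared after a
simple normalization, and all function names are illustrative.

```python
import re


def title_key(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def entry_keys(entry: dict) -> set:
    """Build the identity keys (arxiv_id, doi, title) present on an entry."""
    keys = set()
    if entry.get("arxiv_id"):
        keys.add(("arxiv_id", entry["arxiv_id"]))
    if entry.get("doi"):
        keys.add(("doi", entry["doi"].lower()))
    if entry.get("title"):
        keys.add(("title", title_key(entry["title"])))
    return keys


def new_candidates(cited: list, registry: list) -> list:
    """Return cited papers that match no registry entry on any key."""
    seen = set()
    for entry in registry:
        seen.update(entry_keys(entry))
    fresh = []
    for paper in cited:
        keys = entry_keys(paper)
        if keys and keys.isdisjoint(seen):
            fresh.append(paper)
            seen.update(keys)   # also dedup within the cited batch
    return fresh
```

In this sketch a match on any one key counts as a duplicate, so a
paper cited by title only is still caught if the registry holds the
same title under an arxiv_id.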

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>ed00b8092b4a223958bce1880699cb75d0dc4fe7</id>
<published>2026-02-27T19:54:59Z</published>
<updated>2026-02-27T19:54:59Z</updated>
<title>Add Agents of Chaos paper and Wakefield methodology precedent</title>
<link rel="alternate" type="text/html" href="commit/ed00b8092b4a223958bce1880699cb75d0dc4fe7.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit ed00b8092b4a223958bce1880699cb75d0dc4fe7
parent c75cb779a62ec9a2460f035d431cdca818dcc407
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 20:54:59 +0100

Add Agents of Chaos paper and Wakefield methodology precedent

- Add arXiv:2602.20021 (Shapira et al.) to registry: red-teaming study
  of autonomous LLM agents documenting live-environment failures
- Add Wakefield/MMR section to related-work.md explaining why
  methodological quality assessment matters, with parallels to AI research

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
<entry>
<id>c75cb779a62ec9a2460f035d431cdca818dcc407</id>
<published>2026-02-27T19:51:32Z</published>
<updated>2026-02-27T19:51:32Z</updated>
<title>Initial scaffold for AI research survey project</title>
<link rel="alternate" type="text/html" href="commit/c75cb779a62ec9a2460f035d431cdca818dcc407.html" />
<author>
<name>Brian Graham</name>
<email>brian@buildingbetterteams.de</email>
</author>
<content>commit c75cb779a62ec9a2460f035d431cdca818dcc407
Author: Brian Graham &lt;brian@buildingbetterteams.de&gt;
Date:   Fri, 27 Feb 2026 20:51:32 +0100

Initial scaffold for AI research survey project

Set up systematic review pipeline for evaluating methodological quality
of agentic AI/LLM programming research papers.

- context/: Project requirements, scoring methodology (6 dimensions,
  0-3 scale), and related work (Cochrane, PRISMA, emergent abilities)
- schema/: JSON Schemas for scan results, deep evaluations, and
  registry entries
- agents/: Prompt files for scan, deep-eval, harvester, and
  inbox-sorter sub-agents
- registry.jsonl: Seeded with 16 papers from existing knowledge base
- papers/, inbox/: Empty directories for paper storage pipeline

Co-Authored-By: Claude Opus 4.6 &lt;noreply@anthropic.com&gt;

</content>
</entry>
</feed>
