ai-research-survey, branch HEAD

calibration: ship learned weights (20 anchors, v1 fit)

2026-04-14T21:31:00Z

commit 4689504f291925f272e9d12a2ec4b69941989d18
parent 56999c964a569d3e4507b6766777b41e11f76fdd
Author: Brian Graham
Date: Tue, 14 Apr 2026 23:31:00 +0200

calibration: ship learned weights (20 anchors, v1 fit)

Weights learned via scripts/calibration/fit-weights.py against a 20-anchor labeled set spanning 6 bad / 12 good / 2 middling papers, with 16 pairwise ordering constraints.

Results:
- 15/20 anchors in band; all 16 pair constraints satisfied
- Wakefield 45.7 -> 32.3 (structural ceiling; see below)
- Attention 52.8 -> 58.7
- Corpus median 49.1 -> 54.7, mean 48.1 -> 54.4, distribution widens

Weight story: claims_and_evidence (1.71), setup_transparency (1.08), conflicts_of_interest (0.61), artifacts (0.51) dominate. Five categories zero out: statistical_methodology, data_integrity, human_studies, experimental_rigor, survey_methodology.

The zeros aren't a fit defect. Wakefield passes the surface-compliance questions in those categories (IRB disclosed, contemporary controls, bowel histology) that a fraudulent case series can satisfy while lying about the data. The only way the optimizer can simultaneously respect pair ordering and the Wakefield band is to down-weight those surface-compliance categories to zero. That's a rubric structural limit, not a weighting problem; a future iteration should add fraud-adjacent questions (effect-size plausibility, COI magnitude, extraordinary-evidence thresholds). Until then, this is the best weight vector the current rubric can produce.

Co-Authored-By: Claude Opus 4.6 (1M context)

calibration: pairwise weight fitting against labeled anchors

2026-04-14T21:11:14Z

commit 56999c964a569d3e4507b6766777b41e11f76fdd
parent 5ad6af87a22aa18f92dac25f1979ae94c66367bb
Author: Brian Graham
Date: Tue, 14 Apr 2026 23:11:14 +0200

calibration: pairwise weight fitting against labeled anchors

Scaffolding for learning per-category rubric weights from a small set of hand-labeled anchor papers. Keeps uniform flat-question averaging as the default behavior; opts into learned weights only when scripts/calibration/weights.json exists.

Files:
- scripts/calibration/anchors.yaml: seed set of 8 anchors (Wakefield at 0-15, Attention/BERT/ReAct/AlphaCode/ARC/BERT-papers at 70-90, meta papers Show Your Work / Deep RL that Matters at 80-92). Comments mark candidates to add; aim for 15+ anchors before trusting weights.
- scripts/calibration/fit-weights.py: scipy L-BFGS-B fit over per-category weights [0-5] with L2 regularization toward uniform and a pairwise ordering hinge. Prints per-anchor predicted scores + pair separation check, writes weights.json.
- build-explorer-data.py: compute_overall_score accepts optional category_weights. load_category_weights reads the JSON if present.

First fit with 8 seed anchors separates Wakefield (7.5) from Attention (74.7) by 67 points - was 7 points with uniform weights. But the optimizer zeros several categories at that anchor count, a classic overfit signal. Add 7-15 more anchors before shipping weights.json.

weights.json is intentionally not committed in this PR; treat it as a deliverable Brian generates after labeling enough anchors.

Co-Authored-By: Claude Opus 4.6 (1M context)
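The objective this commit describes (band fit, pairwise ordering hinge, L2 pull toward uniform) can be sketched in pure Python. This is an illustration, not the contents of fit-weights.py: the function names, the anchor dict layout, the squared band penalty, the margin, and the regularization strength are all assumptions. In the real script the loss would be minimized with scipy's L-BFGS-B under [0, 5] bounds per weight.

```python
def predicted_score(weights, category_rates):
    """Weighted mean of per-category pass rates, on a 0-100 scale."""
    total = sum(weights[c] for c in category_rates)
    if total == 0:
        return 0.0
    return 100.0 * sum(weights[c] * r for c, r in category_rates.items()) / total


def fit_loss(weights, anchors, pairs, l2=0.01, margin=5.0):
    """Band penalty + pairwise ordering hinge + L2 toward uniform (hypothetical form)."""
    scores = {name: predicted_score(weights, a["rates"]) for name, a in anchors.items()}
    loss = 0.0
    for name, a in anchors.items():
        lo, hi = a["band"]
        # Squared distance outside the labeled band; zero inside it.
        loss += max(0.0, lo - scores[name]) ** 2 + max(0.0, scores[name] - hi) ** 2
    for better, worse in pairs:
        # Hinge: penalize when `better` fails to beat `worse` by `margin` points.
        loss += max(0.0, margin - (scores[better] - scores[worse]))
    # Pull every weight toward the uniform value 1.0.
    loss += l2 * sum((w - 1.0) ** 2 for w in weights.values())
    return loss
```

With only a handful of anchors, many weight vectors drive the band and hinge terms to zero, which is one way to see why the optimizer can zero out categories without being penalized much.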

partition benchmark-eval + tag Attention as reference-benchmark

2026-04-14T20:34:32Z

commit 5ad6af87a22aa18f92dac25f1979ae94c66367bb
parent 47067ff2e58055add7db030be4ee682e0cd01f43
Author: Brian Graham
Date: Tue, 14 Apr 2026 22:34:32 +0200

partition benchmark-eval + tag Attention as reference-benchmark

Two things:

1. attention-is-all-you-need-2017 now tagged reference-benchmark (keeps 'landmark' too). Foundational transformer paper used as a rubric anchor like Wakefield and Ioannidis.
2. Papers tagged benchmark-eval now partitioned from aggregates too. Rationale: they introduce benchmarks used BY the field, they're reference material rather than subjects of the same kind of rubric evaluation. 5 papers affected. Output: new benchmarks.json alongside calibration.json.

Effect on dashboard: n = 1530 -> 1524, median unchanged at 49.1.

Co-Authored-By: Claude Opus 4.6 (1M context)

partition calibration (reference-benchmark) specimens

2026-04-14T20:24:55Z

commit 47067ff2e58055add7db030be4ee682e0cd01f43
parent 4b8436506afa1c261f8cd6e046caa136ce386732
Author: Brian Graham
Date: Tue, 14 Apr 2026 22:24:55 +0200

partition calibration (reference-benchmark) specimens

Registry entries tagged "reference-benchmark" (currently Wakefield 1998 and Ioannidis 2005, only the first scanned) now skip the agentic-AI corpus aggregates entirely. They still get per-paper scoring, still get individual papers/{slug}.json written (so detail pages work), but they no longer contribute to:
- total_papers / dash.n
- median / mean / full_reproducibility_pct
- histogram, category_rates, year_trends, tag_counts
- archetype_counts, game_counts / game_pcts
- venue_scores, citation_band_scores, funding_groups
- tensions (claim classification)
- papers-index.json (hidden from the papers explorer)

Effect: n = 1530 (was 1531), median unchanged at 49.1, full_repro 4.0 -> 4.1 (Wakefield's 0% full-reproducibility weight removed).

New output: calibration.json listing the calibration specimens with their full detail + a calibration_notes field carrying the registry notes so the consumer can explain each specimen's purpose.

Co-Authored-By: Claude Opus 4.6 (1M context)
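The partition step can be sketched as a single split by tag, run before any aggregate is computed. The function name and the paper dict layout (a "tags" list per entry) are assumptions for illustration; only the tag string comes from the commit.

```python
CALIBRATION_TAG = "reference-benchmark"


def partition_specimens(papers):
    """Split papers into (corpus, calibration) by the reference-benchmark tag."""
    corpus, calibration = [], []
    for paper in papers:
        if CALIBRATION_TAG in paper.get("tags", []):
            calibration.append(paper)  # scored per paper, excluded from aggregates
        else:
            corpus.append(paper)       # feeds n, median, histogram, and so on
    return corpus, calibration
```

Under this shape, every aggregate reads only from the corpus list, while per-paper detail output iterates over both lists.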

stats: include v1 scans with graceful degradation

2026-04-14T19:45:34Z

commit 4b8436506afa1c261f8cd6e046caa136ce386732
parent 06cbf721cea34bee65e068cd7363caff35325a3b
Author: Brian Graham
Date: Tue, 14 Apr 2026 21:45:34 +0200

stats: include v1 scans with graceful degradation

The scan_version < 2 filter was excluding 558 papers (~28% of the scanned corpus). Inspection showed the v1 rubric is a proper subset of v2+: 50 identical questions across 11 identical categories, zero dropped or changed. The v2+ additions (proxy_outcome_distinction + data_leakage + experimental_rigor + survey_methodology = 17 questions in one new field + 3 new conditional modules) are purely additive.

compute_overall_score already uses passed/applicable over present questions, so v1 papers degrade gracefully: their 50 applicable questions are scored normally and the 17 v2+-only questions are treated as absent. classify_archetype only touches categories in the shared 11. detect_games only references questions in the shared 11. No scoring bias introduced.

Effect: n rises from 1,047 to 1,531 (+484 v1 papers that had scorable data; 74 more v1 scans still excluded via the "no applicable questions" check). Median moves 47.2 -> 49.1, all game_pcts within 2 points of prior values.

Co-Authored-By: Claude Opus 4.6 (1M context)
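The graceful-degradation argument rests on scoring passed/applicable over whatever questions are present. A minimal sketch of that rule follows; the answer layout (an "applicable" and a "passed" boolean per question) is an assumption, and this is not the actual compute_overall_score from build-explorer-data.py.

```python
def compute_overall_score(answers):
    """Percent of applicable questions passed, or None if none are applicable.

    answers: {question_id: {"applicable": bool, "passed": bool}} (assumed layout).
    Questions missing from the dict never enter the denominator.
    """
    applicable = [a for a in answers.values() if a.get("applicable")]
    if not applicable:
        return None  # corresponds to the "no applicable questions" exclusion
    passed = sum(1 for a in applicable if a.get("passed"))
    return 100.0 * passed / len(applicable)
```

Because absent questions are simply not in the dict, a v1 scan and a v2+ scan with identical answers on the shared questions differ only by whatever the additional v2+ questions contribute.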

CI: trigger deploy on scan-v5.json changes

2026-04-13T13:51:42Z

commit 06cbf721cea34bee65e068cd7363caff35325a3b
parent 1829fbe2bf4e383bc57aa9285593d195c5e53838
Author: Brian Graham
Date: Mon, 13 Apr 2026 15:51:42 +0200

CI: trigger deploy on scan-v5.json changes

The v5 Haiku scans weren't triggering rebuilds because the path filter only matched papers/*/scan.json.

Co-Authored-By: Claude Opus 4.6 (1M context)
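The miss is easy to reproduce with a glob check. The sketch below uses Python's fnmatch as a stand-in; CI path filters have their own glob semantics, and the added pattern shown here is an assumption, since the commit does not quote the new filter.

```python
from fnmatch import fnmatchcase

OLD_FILTER = "papers/*/scan.json"            # pattern quoted in the commit message
ASSUMED_NEW_FILTER = "papers/*/scan-v5.json"  # hypothetical: exact new pattern not given


def triggers_deploy(path, patterns):
    """True if the changed path matches any of the filter patterns."""
    return any(fnmatchcase(path, p) for p in patterns)
```

A changed file such as papers/foo/scan-v5.json fails the old pattern because the literal file name differs, so a sweep that only writes scan-v5.json never starts a rebuild.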

V5 Haiku sweep: 531 papers scanned (280 new)

2026-04-13T13:24:21Z

commit 1829fbe2bf4e383bc57aa9285593d195c5e53838
parent 7325e2836dacec068ecd0eebfdca0c729a902baf
Author: Brian Graham
Date: Mon, 13 Apr 2026 15:24:21 +0200

V5 Haiku sweep: 531 papers scanned (280 new)

20 failures: 15 claude exit 1, 3 no JSON, 1 JSON parse error, 1 mixed. Can retry failures separately.

Co-Authored-By: Claude Opus 4.6 (1M context)

V5 Haiku sweep: 220 papers scanned

2026-04-12T17:49:54Z

commit 7325e2836dacec068ecd0eebfdca0c729a902baf
parent eb6c3464af535659c94124df71ec8abd0e9a2ab3
Author: Brian Graham
Date: Sun, 12 Apr 2026 19:49:54 +0200

V5 Haiku sweep: 220 papers scanned

Co-Authored-By: Claude Opus 4.6 (1M context)

Progress bar: 4-segment v5 pipeline, 106 pure Haiku scans

2026-04-12T16:10:46Z

commit eb6c3464af535659c94124df71ec8abd0e9a2ab3
parent 375564a74735195015b853fe4ec2af98ff6e4fa0
Author: Brian Graham
Date: Sun, 12 Apr 2026 18:10:46 +0200

Progress bar: 4-segment v5 pipeline, 106 pure Haiku scans

Replace old v3/v2/v1/queued/no-text progress bar with:

V5 Opus | V5 Haiku/Sonnet | Deprecated | Not scanned

Build pipeline counts scan-v5.json (checking source field for opus vs haiku) and falls back to scan.json as deprecated. No cascade loading yet; metrics still read from old scan.json.

Also: v5 script cosmetic fixes (v4→v5 references) and stderr capture on claude failures for better error diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add run-scan-v5-haiku.py: pure Haiku output, no Opus merge

2026-04-11T05:45:46Z

commit 375564a74735195015b853fe4ec2af98ff6e4fa0
parent 450388ee71cfe1d95d028029358c8007afb36ba5
Author: Brian Graham
Date: Sat, 11 Apr 2026 07:45:46 +0200

Add run-scan-v5-haiku.py: pure Haiku output, no Opus merge

v4-haiku script merged Opus answers at write time, contaminating scan-v4.json with v2 Opus overwrites. v5 script writes raw Haiku/Sonnet output to scan-v5.json so per-question Haiku-Opus comparisons remain possible for calibration analysis.

The build pipeline will handle the Opus/Haiku merge at read time, preferring Opus where available but keeping the raw v5 data around.

Usage: python3 scripts/run-scan-v5-haiku.py --parallel 8

Co-Authored-By: Claude Opus 4.6 (1M context)
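A read-time merge of this kind might look like the sketch below. The function name, the per-question dict layout, and the "source" field value written here are assumptions; the point is only that neither input is mutated, so the raw Haiku answers stay comparable against Opus.

```python
def merge_answers(opus, haiku):
    """Build a merged view preferring Opus per question; inputs are left untouched."""
    merged = {}
    for qid in sorted(set(opus) | set(haiku)):
        if qid in opus:
            merged[qid] = {**opus[qid], "source": "opus"}
        else:
            merged[qid] = {**haiku[qid], "source": "haiku"}
    return merged
```

Doing this at read time rather than write time is what keeps scan-v5.json free of Opus overwrites.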

V4 Haiku sweep: 241 new papers (263 total)

2026-04-11T05:31:03Z

commit 450388ee71cfe1d95d028029358c8007afb36ba5
parent c9f58bde8535e444a35685593af2b9b4b2f9d55b
Author: Brian Graham
Date: Sat, 11 Apr 2026 07:31:03 +0200

V4 Haiku sweep: 241 new papers (263 total)

Run: python3 scripts/run-scan-v4-haiku.py --parallel 8

Merged with existing Opus v2/v3 answers where available. 12,347 Opus overrides (92%), 1,806 Haiku answers (14%), 0 Sonnet. The Haiku answers are on the new v4 questions (scope_and_framing, type-specific modules) where no v2 Opus answer existed.

Co-Authored-By: Claude Opus 4.6 (1M context)

V4 Haiku scan pipeline: type-routed instrument with Opus overlay

2026-03-31T06:40:03Z

commit c9f58bde8535e444a35685593af2b9b4b2f9d55b
parent 736a50a032a47708cf7293b93076df2b494eb27b
Author: Brian Graham
Date: Tue, 31 Mar 2026 08:40:03 +0200

V4 Haiku scan pipeline: type-routed instrument with Opus overlay

New schema (scan-v4.schema.json): shared core (15q) + 5 type-specific modules (empirical 39q, benchmark 12q, survey 12q, position 12q, theoretical 10q). Two-field boolean design preserved.

New script (run-scan-v4-haiku.py):
- Haiku for papers <50K chars, auto-fallback to Sonnet for larger
- Reads paper_type.json for routing
- Merges existing Opus v2/v3 answers (Opus always overrides)
- Tracks source per question (opus/haiku/sonnet)
- Free calibration: reports Haiku-Opus agreement rate
- Fetches HN data inline
- Writes scan-v4.json (separate from scan.json)

Tested:
- Tao (position): 12 opus + 15 haiku. 75% agreement. Position module shows 7/7 argument quality, 1/5 clarity. Much better than v2's 10%.
- METR (empirical): 51 opus + 3 sonnet. 86.3% agreement.

Run: python3 scripts/run-scan-v4-haiku.py --parallel 8

Co-Authored-By: Claude Opus 4.6 (1M context)
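Two of the script behaviors listed above, size-based model routing and the agreement report, can be sketched in a few lines. Function names and the answer layout (question id mapped to a boolean) are assumptions; the 50K-character threshold is the one stated in the commit.

```python
SONNET_FALLBACK_CHARS = 50_000  # threshold from the commit: Haiku below, Sonnet at or above


def choose_model(paper_text):
    """Route short papers to Haiku and longer ones to Sonnet."""
    return "haiku" if len(paper_text) < SONNET_FALLBACK_CHARS else "sonnet"


def agreement_rate(opus, other):
    """Percent of questions answered by both models where the answers match."""
    shared = [q for q in opus if q in other]
    if not shared:
        return None
    agree = sum(1 for q in shared if opus[q] == other[q])
    return 100.0 * agree / len(shared)
```

The agreement figure is "free" in the sense that it only needs questions both models already answered; it requires no extra model calls.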

Classify all 1,206 papers by type via Haiku

2026-03-30T18:17:11Z

commit 736a50a032a47708cf7293b93076df2b494eb27b
parent fbc3c552e124c8c6c91d532e531bbc6f81f4d957
Author: Brian Graham
Date: Mon, 30 Mar 2026 20:17:11 +0200

Classify all 1,206 papers by type via Haiku

Distribution:
- empirical 816 (69%)
- benchmark-creation 155 (13%)
- survey 102 (9%)
- position 63 (5%)
- theoretical 49 (4%)

Spot-checked: Tao→position, METR→empirical, SWE-Bench→benchmark-creation, HalluLens→benchmark-creation, multi-agent survey→survey. All correct.

Separate paper_type.json files, non-destructive to scan data.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add Haiku paper type classification script (preliminary)

2026-03-30T14:40:40Z

commit fbc3c552e124c8c6c91d532e531bbc6f81f4d957
parent 95f484d01c4aded0fbdb7faed0aa7f17b69da21b
Author: Brian Graham
Date: Mon, 30 Mar 2026 16:40:40 +0200

Add Haiku paper type classification script (preliminary)

scripts/classify-paper-type.py classifies papers into 5 types: empirical, benchmark-creation, survey, position, theoretical. Uses Haiku (cheap, fast) reading title + key_findings + tags from existing scan.json. Writes papers/{slug}/paper_type.json as a separate non-destructive file.

20/20 correct on manual verification. Running full corpus in background.

This is preliminary: classification feeds into v4 instrument redesign where each type gets its own question panel.

Co-Authored-By: Claude Opus 4.6 (1M context)

Filter non-empirical papers from findings, tag in UI

2026-03-30T14:10:48Z

commit 95f484d01c4aded0fbdb7faed0aa7f17b69da21b
parent b4f6f0caa07a8a5d8d382792a236646c772d9b4b
Author: Brian Graham
Date: Mon, 30 Mar 2026 16:10:48 +0200

Filter non-empirical papers from findings, tag in UI

Papers without both statistical_methodology and evaluation_design applicable are classified as non-empirical (159 papers). These are excluded from all findings aggregations (median, games, tensions, correlations, year trends). Dashboard now reports empirical-only: 1,047 papers, median 47.2% (was 46.3% mixed).

Non-empirical papers still appear in the papers browser with:
- Score shown in gray with asterisk (e.g., "10.0%*")
- "Non-empirical" badge instead of archetype
- Tooltip explaining limited criteria

Progress bar shows empirical/non-empirical split. Paper detail pages still show full checklist for all papers.

This is a stopgap before v4 instrument redesign that will add paper-type-specific question panels.

Co-Authored-By: Claude Opus 4.6 (1M context)
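The classification rule can be read as "empirical only if both categories are applicable". The sketch below encodes that reading; the function name and the input (a set of applicable category names) are assumptions, and if the intended rule is instead "non-empirical only when neither is applicable", the `all` would become `any`.

```python
EMPIRICAL_CATEGORIES = ("statistical_methodology", "evaluation_design")


def is_empirical(applicable_categories):
    """True when both marker categories are applicable to the paper."""
    return all(c in applicable_categories for c in EMPIRICAL_CATEGORIES)
```

Papers failing this check would stay in the browser but drop out of the findings aggregations.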

Update to 1205 scans, new sampling checkpoint, refresh memory

2026-03-29T14:13:53Z

commit b4f6f0caa07a8a5d8d382792a236646c772d9b4b
parent 208801951bbe904415b4a651bd792bc95c8f9241
Author: Brian Graham
Date: Sun, 29 Mar 2026 16:13:53 +0200

Update to 1205 scans, new sampling checkpoint, refresh memory

Sampling: n=932→47.1%, n=1205→46.3%. Decline continues: 2025 papers (n=595) have lowest median at 44.9%.

At n=1205:
- Games: Overclaiming 65.4%, Big Numbers 65.3% (both rising)
- Quality contagion widened: 37.2%→50.0% (13pp gradient)
- Funding gap stable at 11.5pp
- Network: 1359 nodes, 4512 edges
- 492 code URLs extracted
- 6 tensions with 3600+ claims

Co-Authored-By: Claude Opus 4.6 (1M context)

Add explanatory descriptions to each tension section

2026-03-24T05:51:58Z

commit 208801951bbe904415b4a651bd792bc95c8f9241
parent ddde6369343ce6a1c7129bb5e8093318815ae2a7
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:51:58 +0100

Add explanatory descriptions to each tension section

Each tension now has 2 sentences below the title explaining what the tension is and why it matters. E.g., Security Arms Race: "Defense papers claim their mitigations work; attack papers show they can be bypassed. Neither side engages seriously with the other."

Co-Authored-By: Claude Opus 4.6 (1M context)

Add 3 new tensions, expand keyword matching for existing 3

2026-03-24T05:49:13Z

commit ddde6369343ce6a1c7129bb5e8093318815ae2a7
parent 372ecdeaa4a2719d071db645776f137c093a7bc5
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:49:13 +0100

Add 3 new tensions, expand keyword matching for existing 3

New tensions:
- Security Arms Race: 379 defense vs 546 attack claims
- Code Quality Paradox: 363 LLMs-help vs 251 LLMs-hurt
- Scaling Debate: 152 efficient vs 452 limits

Expanded keywords for existing tensions:
- Benchmarks: added pass@, accuracy, f1, performance on, sota (103→315 pos)
- Agents: added agentic, tool use, planning, chain-of-thought (72→110 pos)
- Productivity: added developer productivity, coding efficiency (minor)

Each tension gets a butterfly chart with bar width encoding. Total claim coverage: 858→3600 (from 5115 total claims).

Co-Authored-By: Claude Opus 4.6 (1M context)

Use bar width for methodology score, not border thickness

2026-03-24T05:43:14Z

commit 372ecdeaa4a2719d071db645776f137c093a7bc5
parent e794391f9bd6ec7b3ed6676a0738aecb985100dc
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:43:14 +0100

Use bar width for methodology score, not border thickness

Height = claim count, width = methodology score. A tall narrow bar means many claims from weak papers. A short wide bar means few claims from rigorous papers. Solid fills, no border tricks.

Co-Authored-By: Claude Opus 4.6 (1M context)

Replace dot chart with border-thickness encoding for tension quality

2026-03-24T05:40:47Z

commit e794391f9bd6ec7b3ed6676a0738aecb985100dc
parent 48c57f7f304597d6144168e70d4abea8de51fb65
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:40:47 +0100

Replace dot chart with border-thickness encoding for tension quality

Bars are now lightly filled outlines where border thickness encodes methodology score: 1px at 20% → 5px at 70%. A tall thin-bordered bar = many claims from weak papers. A short thick-bordered bar = few claims from rigorous papers. Score shown as text label.

Cleaner than dots (which were cramped) and shades (which were imperceptible).

Co-Authored-By: Claude Opus 4.6 (1M context)
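The border encoding is a linear map between the two stated endpoints. A minimal sketch, assuming linear interpolation and clamping outside the 20%-70% range (the commit gives only the endpoints, so both are assumptions, as is the function name):

```python
def border_px(score_pct, low=(20.0, 1.0), high=(70.0, 5.0)):
    """Map a methodology score (percent) to a border thickness in pixels."""
    t = (score_pct - low[0]) / (high[0] - low[0])
    t = min(1.0, max(0.0, t))  # clamp: scores outside 20-70% pin to 1px or 5px
    return low[1] + t * (high[1] - low[1])
```

A 4px range across 50 score points means a 10-point difference moves the border by under one pixel.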

Replace shade gradient with bar+dot chart on tensions

2026-03-24T05:35:47Z

commit 48c57f7f304597d6144168e70d4abea8de51fb65
parent 3776fd8528b73d85622c4a27e63d7a6e653b67c9
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:35:47 +0100

Replace shade gradient with bar+dot chart on tensions

Bars show claim count (flat blue up, flat gray down). Dots show mean methodology score as position on a mini x-axis per year column, with score number inside each dot. Two independent visual channels; no color interpretation needed.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add count scale ticks and clearer axis description to tension charts

2026-03-24T05:31:56Z

commit 3776fd8528b73d85622c4a27e63d7a6e653b67c9
parent 8667f4fa26aaf394db5fbd7bf771dbcdeffe505e
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:31:56 +0100

Add count scale ticks and clearer axis description to tension charts

Left edge now shows numeric count ticks (midpoint and max) for both positive and nuanced sides. Description text uses bold labels: Height = count, Darkness = methodology score.

Co-Authored-By: Claude Opus 4.6 (1M context)

Tension butterfly: single-hue gradient + quality-weighted balance line

2026-03-24T05:29:17Z

commit 8667f4fa26aaf394db5fbd7bf771dbcdeffe505e
parent fdf7bf9ec122ae8104c86d45c5563560fafefd4f
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:29:17 +0100

Tension butterfly: single-hue gradient + quality-weighted balance line

Bars use single-hue intensity gradients instead of color categories:
- Positive claims: light-to-dark blue (pale = weak methodology)
- Nuanced claims: light-to-dark gray

Works in grayscale and for colorblind viewers.

Added dashed balance line across years: quality-weighted center of gravity between positive and nuanced claims. Above center line means optimism dominates (weighted by methodology), below means skepticism. Shows how each tension's balance shifts over time.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add butterfly timeline charts to tensions view

2026-03-24T05:25:59Z

commit fdf7bf9ec122ae8104c86d45c5563560fafefd4f
parent f41b10dd2346179281c7a9ef5e809bb5cb87c2ec
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:25:59 +0100

Add butterfly timeline charts to tensions view

Each tension now has a diverging bar chart with horizontal time axis:
- Blue bars extend UP for positive claims (count + mean method score)
- Gray bars extend DOWN for nuanced claims (count + mean method score)
- No color encoding for score: numbers shown directly as labels
- Hover tooltips on each bar with full details

Year field added to tension claims in build pipeline. Replaces the previous vertical color-encoded butterfly prototype.

Co-Authored-By: Claude Opus 4.6 (1M context)

Expand HN analysis: scatter, tag paradox, repost/controversy signals

2026-03-24T05:19:41Z

commit f41b10dd2346179281c7a9ef5e809bb5cb87c2ec
parent 7072c581ce666fdd9dc061902b0e8823888d9732
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:19:41 +0100

Expand HN analysis: scatter, tag paradox, repost/controversy signals

Social Attention section now includes:
- HN points vs methodology scatter (586 papers, log-scale x-axis) with "the blob has no slope" annotation
- Case study paradox: tag comparison showing HN attention vs methodology side-by-side (case studies: most HN love, worst methodology)
- Repost signal: 8+ reposts = 50.6% method vs 48.0% for single posts
- Controversy signal: high-discussion papers score 50.7% vs 48.9%
- Updated heatmap annotation text for n=932 correlation values

Co-Authored-By: Claude Opus 4.6 (1M context)

Add 187 scans (932 total), new sampling checkpoint, remove corrupt scan

2026-03-24T05:13:09Z

commit 7072c581ce666fdd9dc061902b0e8823888d9732
parent 3d52164e9d12fde30413fa1250caeaecb3f678cd
Author: Brian Graham
Date: Tue, 24 Mar 2026 06:13:09 +0100

Add 187 scans (932 total), new sampling checkpoint, remove corrupt scan

Sampling checkpoint: n=745→48.1%, n=932→47.1%. Decline continues. Removed corrupt your-prompt-safe-2025/scan.json (truncated string).

Key shifts at n=932:
- Games worsened: Overclaiming 63.2%, Big Numbers 63.1%
- Quality contagion gradient widened: 37.8%→50.6% (was 43.1%→52.3%)
- 2025 is worst year at 45.5% median (n=455)
- Funding gap stable at 12.1pp
- Optimism-rigor gap grew for productivity (+6.6pp)
- All correlation structure stable (contamination↔leakage 0.857, artifacts↔stats 0.058, two cultures -0.198)

Co-Authored-By: Claude Opus 4.6 (1M context)

Add engagement factor strip to papers table

2026-03-23T18:53:31Z

commit 3d52164e9d12fde30413fa1250caeaecb3f678cd
parent bdd13e9d815e0e04aaa193ea80a794d756b99f8f
Author: Brian Graham
Date: Mon, 23 Mar 2026 19:53:31 +0100

Add engagement factor strip to papers table

V3 papers show a second DNA strip (purple tones) next to the methodology strip (red-yellow-blue-green). 6 cells for practical, surprise, fear, drama, demo, brand. Hover shows dimension name and score. Blank for v2-only papers. 53 papers currently have both strips.

Co-Authored-By: Claude Opus 4.6 (1M context)

Show engagement factors on paper detail pages

2026-03-23T14:41:38Z

commit bdd13e9d815e0e04aaa193ea80a794d756b99f8f
parent 96a3826ca9778e25930fd3677f6e78f06b21e146
Author: Brian Graham
Date: Mon, 23 Mar 2026 15:41:38 +0100

Show engagement factors on paper detail pages

V3 papers now display a section with 6 horizontal bars showing engagement scores (0-3) with justification text for each dimension. Only shown when engagement_factors exist in the paper data.

Co-Authored-By: Claude Opus 4.6 (1M context)

Show v3 scans in survey progress bar

2026-03-23T14:05:25Z

commit 96a3826ca9778e25930fd3677f6e78f06b21e146
parent 51dc81021fc0182b9668043b714be26b1f230a2a
Author: Brian Graham
Date: Mon, 23 Mar 2026 15:05:25 +0100

Show v3 scans in survey progress bar

Progress bar now has 5 segments: V3 (blue, 53), V2 (green, 692), V1 rescan, Queued, No PDF. Header shows engagement factor count.

Co-Authored-By: Claude Opus 4.6 (1M context)

Integrate v3 engagement factors into explorer pipeline

2026-03-23T13:54:04Z

commit 51dc81021fc0182b9668043b714be26b1f230a2a
parent a85920f8b970cf039362ba691b05a72e8439d3d1
Author: Brian Graham
Date: Mon, 23 Mar 2026 14:54:04 +0100

Integrate v3 engagement factors into explorer pipeline

Build script now reads engagement_factors from v3 scans and computes:
- Per-dimension correlation with log(HN points)
- High-HN vs low-HN mean engagement scores per dimension
- All data flows to findings.json and paper detail pages

Currently n=45 v3 papers; correlations are weak but directional: brand recognition and fear are the only positive signals. Numbers will sharpen as more v3 catchup batches run.

Findings view shows engagement factor correlations when n>=10. Paper detail pages include engagement_factors when present.

Co-Authored-By: Claude Opus 4.6 (1M context)

Fix catchup-v3 to read full paper.txt, not just scan summary

2026-03-23T12:49:40Z

commit a85920f8b970cf039362ba691b05a72e8439d3d1
parent d6d31c6cb0ff5d41b80f2e4eaeb2e992c3702dd8
Author: Brian Graham
Date: Mon, 23 Mar 2026 13:49:40 +0100

Fix catchup-v3 to read full paper.txt, not just scan summary

The cheap version (title + key_findings only) produced plausible but ungrounded engagement scores. Now reads the full paper text so Opus can assess demo-ability from actual URLs, practical relevance from implementation details, and surprise from the full argument.

Reset 36 papers from cheap v3 back to v2 for re-processing. Timeout bumped 120s → 300s for longer paper reads.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add v3 scan instrument with engagement factors, catchup script

2026-03-23T12:26:54Z

commit d6d31c6cb0ff5d41b80f2e4eaeb2e992c3702dd8
parent 781cf7f2cc3cdd41d5fea630408c3cd59982712e
Author: Brian Graham
Date: Mon, 23 Mar 2026 13:26:54 +0100

Add v3 scan instrument with engagement factors, catchup script

v3 extends v2 with 6 engagement factor dimensions (0-3 each):
- practical_relevance, surprise_contrarian, fear_safety, drama_conflict, demo_ability, brand_recognition

Two scripts:
- scripts/catchup-v3.py: upgrades existing v2 scans to v3 by running Opus classification on title + key_findings + claims (cheap, no full paper text needed). Supports --parallel, --limit, --id.
- agents/scan-agent.md: updated to produce v3 directly for new scans.

Tested on 13 papers. Scores align with manual prototype: Codex [3,2,1,1,2,3], Agents of Chaos [2,2,3,2,1,2], Chain-of-Thought [3,2,0,0,2,2].

Co-Authored-By: Claude Opus 4.6 (1M context)

Add HN social attention data: enrichment script, findings, paper links

2026-03-23T12:09:22Z

commit 781cf7f2cc3cdd41d5fea630408c3cd59982712e
parent 4b1edc22cbf261dbc4f797132d00a2067c22d276
Author: Brian Graham
Date: Mon, 23 Mar 2026 13:09:22 +0100

Add HN social attention data: enrichment script, findings, paper links

New script: scripts/enrich-hn.py queries HN Algolia API for all v2 papers (arxiv_id search + title fallback). 586/745 papers found, 3,433 HN threads collected. Saves papers/{slug}/hn.json with thread details (title, points, comments, URL).

Key finding: HN attention is uncorrelated with methodology (r=0.061). Social media amplifies novelty, not rigor.

New findings section "Social Attention vs Rigor":
- Hidden gems: 15 papers scoring 65%+ with <=5 HN points (e.g., "Measuring Mid-2025 LLM-Assistance" at 86%, 0 HN pts)
- Overhyped: 15 papers scoring <40% with 30+ HN points (e.g., "Efficient Guided Generation" at 854 pts, 31.7% method)
- Top 15 most-discussed papers table

Paper detail pages show "HN (Npts)" link button to top thread.

Co-Authored-By: Claude Opus 4.6 (1M context)
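The two list definitions reduce to a pair of threshold checks. The sketch below uses the cutoffs stated in the commit; the function name and return labels are assumptions, and how the build script picks the 15 entries for each list is not specified, so no ranking is shown.

```python
def attention_label(method_pct, hn_points):
    """Classify a paper by methodology score vs Hacker News attention."""
    if method_pct >= 65 and hn_points <= 5:
        return "hidden gem"   # rigorous but barely noticed
    if method_pct < 40 and hn_points >= 30:
        return "overhyped"    # widely discussed but weak on method
    return None
```

Both examples quoted in the commit fall on the expected side of these cutoffs: 86% with 0 points, and 31.7% with 854 points.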

Replace jittered scatter with bubble grid for two cultures

2026-03-23T10:19:23Z

commit 4b1edc22cbf261dbc4f797132d00a2067c22d276
parent 8378a8226cb32ffa456ceaab94abc9b42ccb619b
Author: Brian Graham
Date: Mon, 23 Mar 2026 11:19:23 +0100

Replace jittered scatter with bubble grid for two cultures

Scatter with jitter was dishonest: the data is discrete (4-7 questions per category = quantized scores). Replaced with bubble grid: each grid intersection shows a circle sized by paper count, colored by mean score, with count label. Honestly represents the discrete data while clearly showing the negative correlation pattern.

Co-Authored-By: Claude Opus 4.6 (1M context)
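Aggregating quantized scores into grid cells is a group-by on the (x, y) pair. A minimal sketch, assuming each paper is given as a tuple of its two category scores plus an overall score (the tuple layout and function name are hypothetical):

```python
from collections import defaultdict


def bubble_grid(papers):
    """Group papers by their (x, y) grid cell; report count and mean score per cell.

    papers: iterable of (x_score, y_score, overall_score) tuples (assumed layout).
    """
    cells = defaultdict(list)
    for x, y, overall in papers:
        cells[(x, y)].append(overall)
    return {
        cell: {"count": len(scores), "mean_score": sum(scores) / len(scores)}
        for cell, scores in cells.items()
    }
```

Circle size would then come from "count" and fill color from "mean_score", so no point is drawn at a position the data never occupied.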

Add jitter to two cultures scatter to show density

2026-03-23T10:17:48Z

commit 8378a8226cb32ffa456ceaab94abc9b42ccb619b
parent 40085bd948ffe671e0449b2f1e9da4ba121b01d5
Author: Brian Graham
Date: Mon, 23 Mar 2026 11:17:48 +0100

Add jitter to two cultures scatter to show density

Scores are quantized (4 artifacts questions = 0/25/50/75/100%) so dots stack on grid intersections. Added deterministic jitter using golden-angle distribution so overlapping dots spread into visible clusters while maintaining approximate position.

Co-Authored-By: Claude Opus 4.6 (1M context)
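One common way to get deterministic, evenly spread offsets is a golden-angle (sunflower) spiral. The sketch below is that generic technique under assumed names and an assumed radius rule; the commit does not show the actual jitter code.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))  # about 2.39996 radians


def jitter_offsets(n, max_radius):
    """Return n (dx, dy) offsets on a golden-angle spiral inside max_radius."""
    offsets = []
    for i in range(n):
        # sqrt spacing keeps point density roughly uniform over the disc.
        r = max_radius * math.sqrt((i + 0.5) / n)
        theta = i * GOLDEN_ANGLE
        offsets.append((r * math.cos(theta), r * math.sin(theta)))
    return offsets
```

The i-th paper stacked on a grid intersection would be drawn at the intersection plus the i-th offset; with no random source, the layout is identical on every build.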

Add repro funnel, methodology treemap, two cultures scatter; fix network zoom

2026-03-23T10:15:30Z

commit 40085bd948ffe671e0449b2f1e9da4ba121b01d5
parent a2c488b4b161129d19bc4aff0445e74a4c93407f
Author: Brian Graham
Date: Mon, 23 Mar 2026 11:15:30 +0100

Add repro funnel, methodology treemap, two cultures scatter; fix network zoom

New findings visuals:
- Reproducibility funnel: 745→400→351→61→49, cliff at env specs
- Methodology treemap: benchmark-eval dominates (561), colored by score
- Two cultures scatter: human_studies vs artifacts (r=-0.24)

Network fixes:
- Start zoomed out (k=0.5) so full graph is visible on load
- Thicker edges (1.0→1.5 default, 1.8→2.5 hover)

Updated memory files with current state, explorer architecture, and "when user says more papers scanned" workflow.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add reproducibility funnel, methodology treemap, two cultures scatter

2026-03-23T10:03:05Z

commit a2c488b4b161129d19bc4aff0445e74a4c93407f
parent 59c5b1043da1db314c2da2b0d833733c9fe627f5
Author: Brian Graham
Date: Mon, 23 Mar 2026 11:03:05 +0100

Add reproducibility funnel, methodology treemap, two cultures scatter

Three new visualizations in findings:

Reproducibility funnel: 745 → 400 (code) → 351 (data) → 61 (env) → 49 (instructions). The cliff at environment specs is where reproducibility collapses: roughly 85% of code-releasing papers stop there (61 of 400 continue).

Methodology landscape treemap: proportional blocks sized by paper count, colored by mean score. Benchmark-eval dominates (561 papers), RCTs score highest (64.3%), case studies lowest (39.5%).

Two cultures scatter: human_studies vs artifacts score for 80 papers with human subjects. Negatively correlated (r=-0.24): CS researchers release code but skip IRB; psychology researchers do ethics review but don't release data. Four quadrants labeled.

Also: 3 new v2 scans (Codex 71.7%, CoT 56.6%, ReAct 48.2%) and Agents of Chaos rescan (47.5%).

Co-Authored-By: Claude Opus 4.6 (1M context)
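The funnel counts can be turned into step-to-step retention to see where the drop concentrates. The counts below are the ones in the commit; the stage labels and function name are assumptions.

```python
FUNNEL = [
    ("scanned", 745),
    ("code", 400),
    ("data", 351),
    ("env", 61),
    ("instructions", 49),
]


def step_retention(funnel):
    """Percent of papers surviving each step relative to the previous stage."""
    return [
        (name, 100.0 * count / prev_count)
        for (_, prev_count), (name, count) in zip(funnel, funnel[1:])
    ]
```

The data-to-env step keeps 61 of 351 papers, by far the smallest ratio of the four, which is the cliff the funnel chart shows.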

Rescan Agents of Chaos (2602.20021) as v2: 47.5%

2026-03-23T09:23:12Z

commit 59c5b1043da1db314c2da2b0d833733c9fe627f5
parent 4d2226787818ffd5455c35bf72eeef923ae3a7ce
Author: Brian Graham
Date: Mon, 23 Mar 2026 10:23:12 +0100

Rescan Agents of Chaos (2602.20021) as v2: 47.5%

Red-teaming study of 6 autonomous LLM agents in live lab environment. Strong on claims/evidence and limitations (100%), weak on artifacts and human studies. 5 red flags including no IRB and convenience sample.

Co-Authored-By: Claude Opus 4.6 (1M context)

Add human-readable descriptions to per-question pass rates

2026-03-23T09:09:30Z

commit 4d2226787818ffd5455c35bf72eeef923ae3a7ce
parent 5618e59897d0dae1dd10f47a1d8e147054069a9d
Author: Brian Graham
Date: Mon, 23 Mar 2026 10:09:30 +0100

Add human-readable descriptions to per-question pass rates

Each question bar now shows a concise description instead of the raw snake_case field name. Descriptions pre-computed in build script (67 questions mapped). E.g., "self_comparison_bias_addressed" becomes "Self-evaluation bias acknowledged".

Co-Authored-By: Claude Opus 4.6 (1M context)

Add explainer text for each named game on dashboard

2026-03-23T09:06:09Z

commit 5618e59897d0dae1dd10f47a1d8e147054069a9d
parent 96e47d0e1acc7b77cf730518cd97e6b0cb6f419b
Author: Brian Graham
Date: Mon, 23 Mar 2026 10:06:09 +0100

Add explainer text for each named game on dashboard

Each game row now shows a one-line description explaining what the pattern means and how it's detected.

Co-Authored-By: Claude Opus 4.6 (1M context)

Rebuild citation network, add ego mode, quality flow, and network findings

2026-03-23T08:59:27Z

commit 96e47d0e1acc7b77cf730518cd97e6b0cb6f419b
parent 6d3758b3b52628e5f9c0bb8cb38aae235f766dde
Author: Brian Graham
Date: Mon, 23 Mar 2026 09:59:27 +0100

Rebuild citation network, add ego mode, quality flow, and network findings

Network rebuilt from cited_papers (960 nodes, 2952 edges vs old 572/715). Scanned 3 foundational papers: Codex 71.7%, CoT 56.6%, ReAct 48.2%.

Network view:
- Directed arrows on edges (visible when zoomed in or in ego mode)
- Click node = ego mode: shows 1-hop neighborhood with in/out distinction, color-coded edges (blue=cited-by, orange=cites), info panel with stats
- Double-click = navigate to paper detail
- Edge color toggle: default or quality flow (green=good→good, red=weak→good)
- Escape or "Show all" exits ego mode
- Hover shows "Cites N / Cited by M" with directional counts

Findings:
- Citation Network Insights section with foundational paper leaderboard, quality contagion gradient (43.1% → 52.3%), and rigor diffusion table

Co-Authored-By: Claude Opus 4.6 (1M context)

Add 4 new games (10 total), DNA profile strips in paper table

2026-03-23T07:46:46Z

commit 6d3758b3b52628e5f9c0bb8cb38aae235f766dde parent 0bf67124d60b5a1d8c6d27deb7def340c1f0c0f0 Author: Brian Graham Date: Mon, 23 Mar 2026 08:46:46 +0100 Add 4 new games (10 total), DNA profile strips in paper table New games detecting orthogonal methodology failures: - Trust Us (40.5%): no raw data AND no code — unverifiable - The Black Box (12.3%): no prompts AND no hyperparameters — unreplicable - Moving Goalpost (26.9%): causal claims without causal design - Limitation Theater (4.2%): has limitations section, all boilerplate DNA strips: colored inline heatmap per paper row showing 11 base category scores at a glance (red→yellow→blue→green). Replaces venue column — methodology profile is more useful. Co-Authored-By: Claude Opus 4.6 (1M context)

Add PCA scatter plot — paper methodology map

2026-03-22T20:54:26Z

commit 0bf67124d60b5a1d8c6d27deb7def340c1f0c0f0 parent c641e50fbc95253d2debbe8c25dc5e8357e58dc3 Author: Brian Graham Date: Sun, 22 Mar 2026 21:54:26 +0100 Add PCA scatter plot — paper methodology map Project 708 papers from 9 category scores to 2D via PCA (52.8% variance explained). Papers colored by archetype, hover for details, click to navigate. PC1 = overall rigor (limitations, data_integrity, claims dominate) PC2 = practical detail vs reflection (cost, setup vs limitations) Archetypes separate clearly: Complete clusters left (rigorous), Minimal right (weak), Theater and Mixed overlap in the middle. Hand-rolled PCA in build script (power iteration, no numpy needed). Co-Authored-By: Claude Opus 4.6 (1M context)
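
The "power iteration, no numpy" approach mentioned above can be sketched roughly as follows. This is an illustrative reconstruction, not the build script's code: the function names (`center`, `covariance`, `power_iteration`, `deflate`, `pca_2d`), the fixed iteration count, and the deterministic start vector are all assumptions.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def power_iteration(matrix, iterations=500):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair."""
    d = len(matrix)
    # Deterministic, non-uniform start vector (an arbitrary choice).
    v = [1.0 + 0.37 * i for i in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # matrix has no remaining variance
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue estimate.
    av = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * av[i] for i in range(d)), v


def deflate(matrix, value, vector):
    """Remove a found component so the next iteration finds the following one."""
    d = len(matrix)
    return [[matrix[i][j] - value * vector[i] * vector[j] for j in range(d)]
            for i in range(d)]


def pca_2d(rows, iterations=500):
    """Project rows onto the top two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    components, eigenvalues = [], []
    for _ in range(2):
        value, vector = power_iteration(cov, iterations)
        components.append(vector)
        eigenvalues.append(value)
        cov = deflate(cov, value, vector)
    coords = [[sum(r[j] * c[j] for j in range(len(c))) for c in components]
              for r in centered]
    explained = sum(eigenvalues) / total_variance if total_variance else 0.0
    return coords, explained
```

Calling `pca_2d(rows)` with one row of category scores per paper returns the 2D coordinates plus the fraction of variance explained by the two components, which is the kind of figure the commit message reports.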

Add category correlation heatmap to findings view

2026-03-22T20:49:48Z

commit c641e50fbc95253d2debbe8c25dc5e8357e58dc3 parent d240203118b1d2332118fdcb2cfd94594a523da2 Author: Brian Graham Date: Sun, 22 Mar 2026 21:49:48 +0100 Add category correlation heatmap to findings view Pre-compute 14x14 Pearson correlation matrix between category-level pass rates. Rendered as interactive SVG heatmap with hover tooltips. Key findings surfaced: - contamination <-> data_leakage r=0.87 (same decision) - artifacts <-> stat_methodology r=0.05 (completely independent) - human_studies <-> artifacts r=-0.24 (two cultures) - Three independent rigor clusters: transparency, statistics, contamination Co-Authored-By: Claude Opus 4.6 (1M context)
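
As a rough illustration of the pre-computed Pearson matrix described above, a dependency-free sketch might look like this. It assumes each category maps to a list of per-paper pass rates in the same paper order; that data layout and the function names are hypothetical, not taken from the build script.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(pass_rates):
    """Return (sorted category names, square matrix of pairwise correlations)."""
    names = sorted(pass_rates)
    matrix = [[pearson(pass_rates[a], pass_rates[b]) for b in names] for a in names]
    return names, matrix
```

With 14 categories as input this yields a 14x14 matrix whose diagonal is 1.0 for any category that varies across papers.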

Add findings view with 10 analysis sections and code URL extraction

2026-03-22T20:33:58Z

commit d240203118b1d2332118fdcb2cfd94594a523da2 parent 1818e336e2cc2445cd1006f83c3fa66c7eec7259 Author: Brian Graham Date: Sun, 22 Mar 2026 21:33:58 +0100

Add findings view with 10 analysis sections and code URL extraction

New #/findings view surfaces deep analysis that was only in docs:
- Per-question pass rates (67 questions, worst: self_comparison_bias 0.8%)
- Year trends by category with toggleable lines (contamination 29%→7%)
- Venue & citation scoring (500+ cites score below average)
- Optimism-rigor inversion (positive claims from weaker papers)
- Quality homophily (high-quality papers cite high-quality 3x more)
- Sampling effect (median drops as long tail scanned)
- Benchmark monoculture (58% pure benchmark-eval)
- Funding gap (13pp between disclosed/undisclosed)
- Reproducibility drill-down (4.2% fully reproducible)
- All 6 named games (added Cherry-picked Comparisons, All Show No Substance)

Also: extracted 282 code URLs from scan justification text, shown as "Code" link on paper detail pages.

Co-Authored-By: Claude Opus 4.6 (1M context)

Fix network: edge contrast, mouseover hit detection

2026-03-22T18:04:29Z

commit 1818e336e2cc2445cd1006f83c3fa66c7eec7259 parent 63a3d148e5733c645300c806f4ffdafd4d344f71 Author: Brian Graham Date: Sun, 22 Mar 2026 19:04:29 +0100 Fix network: edge contrast, mouseover hit detection - Edge opacity 0.25 → 0.5, line width 0.8 → 1.2 - Fixed mouseover/click: canvas CSS scaling wasn't accounted for in mouse coordinate math, making hit detection wildly off on any screen where the canvas CSS width != 1200px (i.e. almost always) - Zoom and drag also fixed for the same CSS-to-canvas scaling issue Co-Authored-By: Claude Opus 4.6 (1M context)
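
The coordinate fix described above comes down to one scaling step. The sketch below shows only the arithmetic, written in Python for illustration; the actual fix lives in the explorer's front-end code, and the parameter names here are assumptions (in a browser the `rect_*` values would come from the canvas's bounding client rectangle).

```python
def to_canvas_coords(client_x, client_y, rect_left, rect_top,
                     rect_width, rect_height, canvas_width, canvas_height):
    """Map a mouse position in CSS pixels to canvas drawing-buffer pixels."""
    # The canvas buffer (e.g. 1200 px wide) may be displayed at a different
    # CSS size, so offsets must be rescaled before hit-testing nodes.
    scale_x = canvas_width / rect_width
    scale_y = canvas_height / rect_height
    return (client_x - rect_left) * scale_x, (client_y - rect_top) * scale_y
```

For a 1200 px buffer displayed 600 CSS pixels wide, an offset of 300 CSS pixels maps to x = 600 in canvas space; skipping the scale factor would put the hit test off by a factor of two.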

Add pipeline progress bar and show all registry papers in browser

2026-03-22T17:38:54Z

commit 63a3d148e5733c645300c806f4ffdafd4d344f71 parent 64fcaa825f5ab1a80912f80abc8e771b5b0a3a81 Author: Brian Graham Date: Sun, 22 Mar 2026 18:38:54 +0100 Add pipeline progress bar and show all registry papers in browser - Dashboard shows segmented progress bar: v2 scanned, v1 rescan, queued, no PDF — updates automatically with each deploy. - Papers browser lists all 2,687 registry entries. Unscanned papers show "--" for score/archetype and are not clickable. - Pipeline stats added to dashboard.json (registry_total, v2_scanned, v1_needs_rescan, has_text_no_scan, no_text, excluded). Co-Authored-By: Claude Opus 4.6 (1M context)

Trigger deploy on scan.json and registry changes

2026-03-22T12:03:16Z

commit 64fcaa825f5ab1a80912f80abc8e771b5b0a3a81 parent ad711786b26cf3ee682d7c97d2e74615983eafda Author: Brian Graham Date: Sun, 22 Mar 2026 13:03:16 +0100 Trigger deploy on scan.json and registry changes Deploy only fired on explorer/ changes, so new scan batches didn't update the site. Co-Authored-By: Claude Opus 4.6 (1M context)

Add 274 v2 scans (741 total), remove corrupt spectr scan

2026-03-22T11:19:20Z

commit ad711786b26cf3ee682d7c97d2e74615983eafda parent dd1e239a2cd9d62a8dd1f144070ba1d9b604c983 Author: Brian Graham Date: Sun, 22 Mar 2026 12:19:20 +0100 Add 274 v2 scans (741 total), remove corrupt spectr scan Batch from parallel scan run with --max-turns 8 fix. Removed spectr-fast-speculative-2023/scan.json (trailing comma). Co-Authored-By: Claude Opus 4.6 (1M context)
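
A trailing comma is enough to make a scan file unparseable by a strict JSON parser. A minimal check, assuming the scan content is available as a string (the helper name is hypothetical and this is not the project's validator), could be:

```python
import json


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

Python's standard `json` module rejects input such as `{"a": 1,}`, so a check like this would flag a file in that state before it reaches any aggregation step.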

Fix scan agent max-turns: 3 → 8, accept --max-turns CLI arg

2026-03-22T06:45:01Z

commit dd1e239a2cd9d62a8dd1f144070ba1d9b604c983 parent fab02ffcee6cc4b837e996171068e5654295171d Author: Brian Graham Date: Sun, 22 Mar 2026 07:45:01 +0100 Fix scan agent max-turns: 3 → 8, accept --max-turns CLI arg 3 turns was the exact minimum (read agent prompt, read schema, write scan.json) with zero margin. 129/200 papers silently failed when the agent needed an extra turn. Bumping to 8 resolved all but the known persistent failures (bad PDFs, survey_methodology truncation). Co-Authored-By: Claude Opus 4.6 (1M context)

Split data pipeline, add light/dark mode, fix network and detail views

2026-03-18T13:11:09Z

commit fab02ffcee6cc4b837e996171068e5654295171d parent f40a5cabd9d1f25608787a26f5ebfc43cd07177e Author: Brian Graham Date: Wed, 18 Mar 2026 14:11:09 +0100

Split data pipeline, add light/dark mode, fix network and detail views

- Split explorer.json into per-view files: dashboard.json (1.6KB), papers-index.json (150KB), papers/{slug}.json, network.json, tensions.json. Dashboard now loads instantly instead of waiting for 9MB.
- Add light/dark mode toggle with localStorage persistence and prefers-color-scheme detection.
- Fix network: higher edge contrast (theme-aware), larger hit radius for hover/click, pointer cursor, "click to view" hint, drag vs click distinction, node outlines.
- Add arXiv/DOI/source links on paper detail pages.
- Add CSS spinner on all view loads.
- Gitignore all generated data files (explorer/public/data/).

Co-Authored-By: Claude Opus 4.6 (1M context)

Add 467 v2 scans, metadata, calibration, explorer, and deploy pipeline

2026-03-18T12:38:54Z

commit f40a5cabd9d1f25608787a26f5ebfc43cd07177e parent 279c91802101fbb60c70d19a451bfc29babcd85d Author: Brian Graham Date: Wed, 18 Mar 2026 13:38:54 +0100

Add 467 v2 scans, metadata, calibration, explorer, and deploy pipeline

Bulk commit of accumulated work:
- 1028 scan.json (467 v2 + 561 v1), 730 metadata.json, 60 calibration.json
- Static data explorer (Vite + vanilla TS): dashboard, paper browser, paper detail, citation network, claim tensions
- scripts/build-explorer-data.py aggregates scan data into explorer.json
- Forgejo CI/CD workflow with blue/green deployment
- Updated scan schema (proxy_outcome_distinction, scaffold_confound_addressed)
- Analysis artifacts: citation graph, v2 findings, deep patterns
- Playwright test suite (23 tests)
- .gitignore: pdf-finder-result.txt, explorer build artifacts, settings.local

Co-Authored-By: Claude Opus 4.6 (1M context)

Implement v2 scan pipeline with conditional modules and enrichment

2026-03-08T09:22:14Z

commit 279c91802101fbb60c70d19a451bfc29babcd85d parent 6e63d899afcec26a7a1e6668f9197bfad25b53f0 Author: Brian Graham Date: Sun, 8 Mar 2026 10:22:14 +0100

Implement v2 scan pipeline with conditional modules and enrichment

- claim.py: extend expiry to 1 hour, add take-next atomic command
- validate-scan.py: standalone schema validator (572/572 existing scans pass)
- Schema: add scan_version, active_modules, 3 conditional categories (experimental_rigor 8q, data_leakage 4q, survey_methodology 3q) sourced from Henderson, Dodge, Lucic, Kapoor meta-research findings
- scan-worker.md: v2 worker loop (triage → 6 parallel category agents)
- scan-triage.md + scan-category-{a..f}.md: split evaluation prompts
- scan.md command: v2 default with v1 fallback flag
- enrich-metadata.py: Semantic Scholar API enrichment
- build-citation-graph.py: cross-reference cited_papers against registry
- methodology.md: document new questions with sources
- V1 scans remain valid (all new fields optional)

Co-Authored-By: Claude Opus 4.6

Tighten scan instrument based on Opus calibration (93.2% agreement)

2026-02-28T05:55:19Z

commit 6e63d899afcec26a7a1e6668f9197bfad25b53f0 parent fd2ab321110f363123c295dbaa3862329aec7709 Author: Brian Graham Date: Sat, 28 Feb 2026 06:55:19 +0100

Tighten scan instrument based on Opus calibration (93.2% agreement)

Calibration of 8 papers found two systematic Sonnet failure modes:
- NA boundary errors (56% of disagreements): added explicit "NA when:" guidance to contamination, human_studies, artifacts, cost categories
- Generosity bias (44%): added "does NOT count" examples to prompts_provided, variance_reported, model_versions_specified, etc.

Schema: 14 question descriptions updated with sharper criteria. Agent prompt: added "When to use NA" section, "Common traps" list, and per-paper-type NA/NO guidance for surveys, mining studies, benchmarks. Methodology context rewritten to reflect boolean checklist (was still describing old 0-3 rubric). All 30 scan.json and 8 calibration.json removed for re-run with improved instrument. Calibration round 1 results preserved in analysis/calibration-summary.{json,md}. Added /scan project command for running scan pipeline.

Co-Authored-By: Claude Sonnet 4.6

Add paper claim system for parallel scan agents

2026-02-27T21:34:30Z

commit fd2ab321110f363123c295dbaa3862329aec7709 parent a0bf4b555a37ff061384f55a6807ac4235ed17bf Author: Brian Graham Date: Fri, 27 Feb 2026 22:34:30 +0100

Add paper claim system for parallel scan agents

scripts/claim.py provides file-based locking to prevent two agents from scanning the same paper:
- take: claim a paper (fails if already claimed)
- done/fail: release a claim
- list: show unclaimed papers ready to scan
- status: summary of scan progress
- Claims expire after 10 minutes (stale agent recovery)

Claims stored as papers//.claimed_ files. Added to .gitignore along with paper.txt (regenerable).

Co-Authored-By: Claude Opus 4.6
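
One common way to get the "fails if already claimed" behaviour with plain files is an exclusive create. The sketch below is an assumption about how such a lock could work, not the contents of scripts/claim.py; the function names, the `now` parameter, and the use of file modification time for expiry are all hypothetical.

```python
import os
import time

CLAIM_TTL_SECONDS = 10 * 60  # claims expire after 10 minutes


def take_claim(claim_path, agent_id, now=None):
    """Try to claim a paper; return True on success, False if already claimed."""
    now = time.time() if now is None else now
    # Stale-agent recovery: drop a claim file older than the TTL.
    # (Two agents racing on the same stale claim is not handled here.)
    try:
        if now - os.path.getmtime(claim_path) > CLAIM_TTL_SECONDS:
            os.remove(claim_path)
    except FileNotFoundError:
        pass
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so only one agent can win the claim.
        fd = os.open(claim_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(agent_id)
    return True


def release_claim(claim_path):
    """Release a claim (used for both done and fail); missing files are ignored."""
    try:
        os.remove(claim_path)
    except FileNotFoundError:
        pass
```

The exclusive-create flag is what keeps the check and the write a single step; checking for the file first and then writing it would leave a window where two agents both see it as unclaimed.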

Replace subjective 0-3 rubric with 50-question boolean checklist

2026-02-27T21:31:36Z

commit a0bf4b555a37ff061384f55a6807ac4235ed17bf parent b4be3d6dbb04eddb183b2a99dad0327adbe38b39 Author: Brian Graham Date: Fri, 27 Feb 2026 22:31:36 +0100

Replace subjective 0-3 rubric with 50-question boolean checklist

Redesigned the scan instrument for verifiability and auditability:
- 50 yes/no/na questions across 11 categories (artifacts, statistical methodology, evaluation design, claims & evidence, setup transparency, limitations, data integrity, conflicts of interest, contamination, human studies, cost & practicality)
- Each question has detailed evaluation guidance in the schema description explaining exactly what to look for
- Each answer requires a justification citing specific paper sections
- Inspired by Wakefield case: added data_integrity and conflicts_of_interest categories to catch fabrication and undisclosed conflicts
- Changed model assignment from Opus to Sonnet (booleans are factual lookups, not subjective judgment)

Old rubric (6 dimensions, 0-3 scores) removed. Composite scores are now derived deterministically from boolean counts.

Co-Authored-By: Claude Opus 4.6
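
A score "derived deterministically from boolean counts" could be computed along these lines. This sketch assumes a flat average over all answered questions with "na" answers excluded from the denominator; that rule, the 0-100 scale, and the data layout are assumptions for illustration, not the project's confirmed formula.

```python
def composite_score(scan):
    """Percentage of applicable questions answered "yes".

    scan maps category name -> {question name: "yes" | "no" | "na"}.
    Returns None when every question is "na".
    """
    answers = [a for questions in scan.values() for a in questions.values()]
    yes = sum(1 for a in answers if a == "yes")
    no = sum(1 for a in answers if a == "no")
    applicable = yes + no  # "na" answers do not count for or against
    if applicable == 0:
        return None
    return 100.0 * yes / applicable
```

Because the result depends only on the counts, rescoring the same scan always produces the same number, which is the auditability property the checklist redesign is after.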

Update scan agent prompt and add scan orchestrator

2026-02-27T21:05:52Z

commit b4be3d6dbb04eddb183b2a99dad0327adbe38b39 parent 08f6c3db4222ef96a668db4bf3fd6b61e3326b67 Author: Brian Graham Date: Fri, 27 Feb 2026 22:05:52 +0100

Update scan agent prompt and add scan orchestrator

scan-agent.md:
- Added file paths (paper.txt input, scan.json output)
- Added write-immediately rule and registry status update
- Added guidance for scoring survey papers (are they rigorous or just laundering weak results?)
- Added handling for theoretical/position papers
- Added schema validation requirements
- Added note that important papers can still score poorly

extract-text.py:
- Added extraction-failures.txt log for papers that fail both pymupdf and Sonnet fallback

scripts/run-scan.py (new):
- Orchestrates full pipeline: extract text → scan agent → validate
- Calls claude CLI with opus model for each paper
- Validates scan.json output (required fields, rubric dimensions)
- Updates registry status to 'scanned'
- Supports --parallel N for concurrent scanning
- Writes scan-failures.txt for debugging

Co-Authored-By: Claude Opus 4.6

Add DOI-based download script for non-arXiv papers

2026-02-27T20:55:00Z

commit 08f6c3db4222ef96a668db4bf3fd6b61e3326b67 parent 1021d39ac6f95f5694904bc4c19a3953006c570d Author: Brian Graham Date: Fri, 27 Feb 2026 21:55:00 +0100 Add DOI-based download script for non-arXiv papers scripts/download-doi.py tries three strategies for 749 papers without arxiv_id: (1) Semantic Scholar open access PDF lookup, (2) Unpaywall API, (3) arXiv fallback if S2 finds a previously unknown arxiv_id. Updates manual-download-needed.txt with whatever remains. Co-Authored-By: Claude Opus 4.6

Harvester run 2: 2492 new papers via S2 citation graph + keyword search

2026-02-27T20:45:23Z

commit 1021d39ac6f95f5694904bc4c19a3953006c570d parent 9aa129f9efbb8bf248c15cdd99a3c0205c7295b7 Author: Brian Graham Date: Fri, 27 Feb 2026 21:45:23 +0100

Harvester run 2: 2492 new papers via S2 citation graph + keyword search

Phase 1 (citation graph): 692 new papers
- Fetched citations + references for 8 seed papers via Semantic Scholar API
- Seeds: METR RCT, Emergent abilities mirage, MAST, Sleeper Agents, TypeScript type-check, Scaffolded LLMs, Remote Labor Index, Code gen survey

Phase 2 (keyword search): 1800 new papers
- 15 query clusters via Semantic Scholar paper search
- Queries: LLM code generation, AI code review, prompt injection, alignment deception, APR, test generation, RAG code, multi-agent failure, scaling, AI software engineering, code completion, etc.

Registry grows from 155 → 2647 papers (well past 1000 target).

Co-Authored-By: Claude Sonnet 4.6

Harvester: write to registry immediately after each search

2026-02-27T20:32:03Z

commit 9aa129f9efbb8bf248c15cdd99a3c0205c7295b7 parent fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1 Author: Brian Graham Date: Fri, 27 Feb 2026 21:32:03 +0100 Harvester: write to registry immediately after each search Prevent data loss from session timeouts by requiring the agent to append new entries to registry.jsonl after each API call or web search, not accumulate in memory. Co-Authored-By: Claude Opus 4.6

Harvester run 1 (138 papers) + rewrite harvester prompt

2026-02-27T20:30:41Z

commit fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1 parent 3a05a3aea1e82e39df3a495ed95e9004e4d25b8f Author: Brian Graham Date: Fri, 27 Feb 2026 21:30:41 +0100

Harvester run 1 (138 papers) + rewrite harvester prompt

Registry: 17 → 155 entries. All from arXiv web search only.

Rewrote harvester-agent.md to fix the gap:
- Added explicit Semantic Scholar API endpoints (keyword search, citation graph traversal for forward/backward chasing, venue filter)
- Added arXiv API query syntax with category+keyword combinations
- Added HuggingFace daily papers API
- Expanded query clusters from 8 to 28
- Added concrete strategy: keyword search ~400, citation graph ~400, venue crawl ~200 to reach 1000 target
- Previous prompt listed sources but gave no actionable instructions, so Sonnet only used the easiest path (arXiv web search)

Co-Authored-By: Claude Opus 4.6

Add build pipeline: text extraction, summary aggregation, venue list

2026-02-27T20:27:14Z

commit 3a05a3aea1e82e39df3a495ed95e9004e4d25b8f parent 69c92da1bfbb276ddd27e9ba8256d0087e01c43a Author: Brian Graham Date: Fri, 27 Feb 2026 21:27:14 +0100 Add build pipeline: text extraction, summary aggregation, venue list - scripts/extract-text.py: pymupdf text extraction with Sonnet fallback for low-quality results. Outputs paper.txt co-located with PDFs. - scripts/build-summary.py: aggregates all scan.json into analysis/summary.json + summary.md (score distributions, ranked lists, red flags, breakdowns by year/tag). Static artifact for narrative work. - context/requirements.md: full pipeline diagram, venue brainstorm (TOSEM, EMSE, NeurIPS D&B, ICSE, Nature MI, etc.), output format (LaTeX) Co-Authored-By: Claude Opus 4.6

Add downstream pipeline context to harvester agent prompt

2026-02-27T20:11:39Z

commit 69c92da1bfbb276ddd27e9ba8256d0087e01c43a parent a6c809bdf74b788c3f3285e3a37f649be5193fe9 Author: Brian Graham Date: Fri, 27 Feb 2026 21:11:39 +0100 Add downstream pipeline context to harvester agent prompt Explain how registry entries feed into download, scan, and citation chasing so the agent understands why arxiv_id matters. Co-Authored-By: Claude Opus 4.6

Add arXiv PDF download script

2026-02-27T20:08:24Z

commit a6c809bdf74b788c3f3285e3a37f649be5193fe9 parent 9168d67f29d824ef189647a7b5049acf0e55cdca Author: Brian Graham Date: Fri, 27 Feb 2026 21:08:24 +0100 Add arXiv PDF download script Downloads PDFs for registry entries with status 'queued' and an arxiv_id. Updates registry status to 'downloaded' on success. Respects arXiv rate limits (3s between requests). Supports --dry-run, --limit N, and --id. Co-Authored-By: Claude Opus 4.6
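
The selection and rate-limiting behaviour described above can be sketched as a small loop. This is a hypothetical outline rather than the script itself: `fetch` stands in for whatever performs the actual PDF download, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time

ARXIV_DELAY_SECONDS = 3.0  # pause between requests to respect arXiv rate limits


def download_queued(entries, fetch, limit=None, dry_run=False, sleep=time.sleep):
    """Download PDFs for queued registry entries that have an arxiv_id.

    fetch(arxiv_id) should return True on success. Returns the list of
    arxiv_ids that were downloaded (or would be, when dry_run is set).
    """
    todo = [e for e in entries if e.get("status") == "queued" and e.get("arxiv_id")]
    if limit is not None:
        todo = todo[:limit]
    selected = []
    for index, entry in enumerate(todo):
        if dry_run:
            selected.append(entry["arxiv_id"])  # report only, change nothing
            continue
        if index > 0:
            sleep(ARXIV_DELAY_SECONDS)
        if fetch(entry["arxiv_id"]):
            entry["status"] = "downloaded"  # only mark success
            selected.append(entry["arxiv_id"])
    return selected
```

Entries whose download fails keep their 'queued' status, so a later run picks them up again.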

Add .gitignore and CLAUDE.md project rules

2026-02-27T20:05:53Z

commit 9168d67f29d824ef189647a7b5049acf0e55cdca parent 1c6a723f27a3efff840ec524ce2f155237b0429a Author: Brian Graham Date: Fri, 27 Feb 2026 21:05:53 +0100 Add .gitignore and CLAUDE.md project rules - .gitignore: exclude PDFs from papers/ and inbox/, OS files, Python cache - CLAUDE.md: registry conventions (slug format, dedup rules, status flow), model assignments per agent, code style, git rules Co-Authored-By: Claude Opus 4.6

Document target scope (~1000 papers) and model assignments

2026-02-27T20:00:54Z

commit 1c6a723f27a3efff840ec524ce2f155237b0429a parent aaa7097d653f63eb5a6a63611954e3c1ec4c4887 Author: Brian Graham Date: Fri, 27 Feb 2026 21:00:54 +0100 Document target scope (~1000 papers) and model assignments - Target: ~1000 papers scanned, subset for deep eval - Harvester: Sonnet (structured metadata, no deep reasoning) - Scan agent: Opus (methodology quality judgment) - Deep-eval agent: Opus - Add harvester to agent tier design in requirements Co-Authored-By: Claude Opus 4.6

Add citation-chasing pipeline: cited_papers in scan + harvest script

2026-02-27T19:57:52Z

commit aaa7097d653f63eb5a6a63611954e3c1ec4c4887 parent ed00b8092b4a223958bce1880699cb75d0dc4fe7 Author: Brian Graham Date: Fri, 27 Feb 2026 20:57:52 +0100

Add citation-chasing pipeline: cited_papers in scan + harvest script

- Add cited_papers array to scan.schema.json (required field)
- Update scan-agent.md with instructions to extract survey-relevant references from each scanned paper (expect 3-15 per paper)
- Add scripts/harvest-citations.py: reads cited_papers from all scan.json files, deduplicates against registry by arxiv_id/doi/title, and proposes or appends new registry entries (--apply flag)

Co-Authored-By: Claude Opus 4.6
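
Deduplicating by arxiv_id, doi, and title can be approached by giving each entry a set of lookup keys and skipping any citation that shares a key with something already seen. The sketch below is one possible reading of that step; the normalization rules (lowercasing, stripping arXiv version suffixes, collapsing punctuation in titles) and all function names are assumptions, not taken from harvest-citations.py.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def keys_for(entry):
    """Build the set of identity keys an entry can be matched on."""
    keys = set()
    if entry.get("arxiv_id"):
        # Treat "1234.56789v2" and "1234.56789" as the same paper.
        keys.add(("arxiv", re.sub(r"v\d+$", "", entry["arxiv_id"].lower())))
    if entry.get("doi"):
        keys.add(("doi", entry["doi"].lower()))
    if entry.get("title"):
        keys.add(("title", normalize_title(entry["title"])))
    return keys


def new_citations(registry, cited_papers):
    """Return cited papers that match no registry entry and no earlier citation."""
    seen = set()
    for entry in registry:
        seen |= keys_for(entry)
    fresh = []
    for cited in cited_papers:
        keys = keys_for(cited)
        if not keys or keys & seen:
            continue  # unidentifiable, or already known under some key
        seen |= keys
        fresh.append(cited)
    return fresh
```

Matching on any shared key means a citation that lists only a title is still caught when the registry entry has the same title plus an arxiv_id.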

Add Agents of Chaos paper and Wakefield methodology precedent

2026-02-27T19:54:59Z

commit ed00b8092b4a223958bce1880699cb75d0dc4fe7 parent c75cb779a62ec9a2460f035d431cdca818dcc407 Author: Brian Graham Date: Fri, 27 Feb 2026 20:54:59 +0100 Add Agents of Chaos paper and Wakefield methodology precedent - Add arXiv:2602.20021 (Shapira et al.) to registry: red-teaming study of autonomous LLM agents documenting live-environment failures - Add Wakefield/MMR section to related-work.md explaining why methodological quality assessment matters, with parallels to AI research Co-Authored-By: Claude Opus 4.6

Initial scaffold for AI research survey project

2026-02-27T19:51:32Z

commit c75cb779a62ec9a2460f035d431cdca818dcc407 Author: Brian Graham Date: Fri, 27 Feb 2026 20:51:32 +0100 Initial scaffold for AI research survey project Set up systematic review pipeline for evaluating methodological quality of agentic AI/LLM programming research papers. - context/: Project requirements, scoring methodology (6 dimensions, 0-3 scale), and related work (Cochrane, PRISMA, emergent abilities) - schema/: JSON Schemas for scan results, deep evaluations, and registry entries - agents/: Prompt files for scan, deep-eval, harvester, and inbox-sorter sub-agents - registry.jsonl: Seeded with 16 papers from existing knowledge base - papers/, inbox/: Empty directories for paper storage pipeline Co-Authored-By: Claude Opus 4.6