methodology.md - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

1 # Survey Methodology
2
3 ## Precedent
4
5 This project adapts systematic review methodology from medical research, particularly the Cochrane Review process and PRISMA reporting guidelines. These frameworks provide decades of refinement on how to:
6
7 - Define inclusion/exclusion criteria defensibly
8 - Score study quality on structured rubrics
9 - Minimize reviewer bias through structured extraction
10 - Report findings transparently
11
12 The key adaptation is that we are reviewing *methodological quality*, not synthesizing effect sizes. We are not doing a meta-analysis of "how much does AI help"; we are asking "how well did each study support its claims."
13
14 ## Quality Assessment Instrument
15
16 ### Design: 51-question two-field boolean checklist
17
18 We use a 51-question base checklist with two boolean fields per question rather than subjective Likert-style scores or a three-way yes/no/na enum. This was a deliberate design decision refined across two calibration rounds.
19
20 **Each question has:**
21 - `applies` (boolean): Is this criterion relevant to this paper type?
22 - `answer` (boolean): Does the paper satisfy this criterion? (Only meaningful when applies=true.)
23 - `justification` (string): 1-3 sentence explanation citing specific paper sections.
24
25 **Why two fields instead of yes/no/na:**
26 The original design used a single `answer: yes/no/na` field. Calibration rounds 1 and 2 (93.2% and 96.2% agreement) showed that 56% and then 47% of Sonnet-Opus disagreements were "NA boundary errors": the model conflated "the paper didn't do this" (should be no) with "this doesn't apply" (should be na). The two-field design forces explicit, separate decisions on applicability and compliance, so the two can no longer be collapsed into a single value.
27
28 **Design principles:**
29 - **Verifiable**: each question has a factually correct answer checkable against the paper
30 - **Auditable**: justification text cites specific sections/quotes
31 - **High inter-rater reliability**: booleans have much less variance than 0-3 scores
32 - **Fast human calibration**: checking 50 booleans takes ~15 min, not hours
33 - **Derived scores**: composite scores computed deterministically from boolean counts
34 - **Separate denominators**: compliance rates computed only over papers where applies=true
35 - **The questions are findings**: "only 34% of papers release code" is concrete and publishable
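
The "derived scores" and "separate denominators" principles can be illustrated with a minimal Python sketch. The `ChecklistAnswer` class, the `compliance_rate` helper, and the four sample answers are hypothetical and exist only for illustration; the field names `applies`, `answer`, and `justification` are the ones defined above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChecklistAnswer:
    """One question's result for one paper (hypothetical container)."""
    applies: bool
    answer: bool
    justification: str


def compliance_rate(answers: List[ChecklistAnswer]) -> Optional[float]:
    """Share of applicable answers that are True, or None if none apply."""
    applicable = [a for a in answers if a.applies]
    if not applicable:
        return None
    return sum(a.answer for a in applicable) / len(applicable)


# Made-up answers to one question ("code released") across four papers.
code_released = [
    ChecklistAnswer(True, True, "Sec. 5 links a public repository."),
    ChecklistAnswer(True, False, "No repository or artifact is mentioned."),
    ChecklistAnswer(True, False, "Code is available on request only."),
    ChecklistAnswer(False, False, "Position paper with no implementation."),
]

rate = compliance_rate(code_released)
print(f"{rate:.1%} of applicable papers release code")
```

The fourth paper is excluded from the denominator, so the rate is one in three rather than one in four.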
36
37 **Previous designs (discarded):**
38 1. 6-dimension 0-3 rubric — discarded because LLM-assigned subjective scores are hard to defend in a paper about methodological rigor.
39 2. Single yes/no/na field — discarded after calibration showed NA boundary confusion was the dominant error mode.
40
41 ### Categories (11 groups, 51 questions)
42
43 1. **Artifacts** (4q): code released, data released, environment specs, reproduction instructions
44 2. **Statistical methodology** (5q): CIs/error bars, significance tests, effect sizes, sample size justification, variance
45 3. **Evaluation design** (9q): baselines, contemporary baselines, ablation, multiple metrics, human eval, held-out test, breakdowns, failure cases, negative results
46 4. **Claims & evidence** (5q): abstract supported, causal claims justified, generalization bounded, alternatives discussed, proxy-outcome distinction
47 5. **Setup transparency** (5q): model versions, prompts, hyperparameters, scaffolding, data preprocessing
48 6. **Limitations & scope** (3q): limitations section, specific threats, scope boundaries
49 7. **Data integrity** (4q): raw data available, collection described, recruitment described, pipeline documented
50 8. **Conflicts of interest** (4q): funding disclosed, affiliations disclosed, funder independent, financial interests declared
51 9. **Contamination** (3q): training cutoff stated, train/test overlap discussed, contamination addressed
52 10. **Human studies** (7q): pre-registered, IRB, demographics, inclusion/exclusion, randomization, blinding, attrition
53 11. **Cost & practicality** (2q): inference cost, compute budget
54
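55 The data integrity and conflicts of interest categories were inspired by the Wakefield MMR case: a question such as "Is raw data available for independent verification?" could have flagged the missing transparency years earlier, even though a checklist of this kind cannot detect fabrication itself.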
56
57 ### Conditional modules (v2, 16 questions)
58
59 V2 scans add conditional question modules activated by `methodology_tags`. These target systematic issues identified by meta-research papers in the corpus.
60
61 **12. Experimental rigor** (9q, activated by `benchmark-eval`):
62 - `seed_sensitivity_reported` — Henderson et al. (2018) showed RL results vary 2x across seeds
63 - `number_of_runs_stated` — exact run count, not implicit
64 - `hyperparameter_search_budget` — Dodge et al. (2019) showed search budget dramatically affects results
65 - `best_config_selection_justified` — selection on validation set, not cherry-picked
66 - `multiple_comparison_correction` — Bonferroni/Holm/BH for multiple tests
67 - `self_comparison_bias_addressed` — Lucic et al. (2018) showed authors' baseline re-implementations systematically underperform
68 - `compute_budget_vs_performance` — performance as function of compute, not just peak
69 - `benchmark_construct_validity` — Kapoor & Narayanan (2024) documented widespread validity gaps
70 - `scaffold_confound_addressed` — SWE-bench scores vary 2.7–28.3% for the same model depending on scaffold; scaffold effect often exceeds model effect
71
72 **13. Data leakage** (4q, activated by `benchmark-eval`):
73 - `temporal_leakage_addressed` — training data from after prediction target
74 - `feature_leakage_addressed` — input features leak answer information
75 - `non_independence_addressed` — train/test share structural similarities
76 - `leakage_detection_method` — concrete detection (canary strings, n-gram overlap, etc.)
77
78 Source: Kapoor & Narayanan (2024) leakage taxonomy.
79
80 **14. Survey methodology** (3q, activated by `meta-analysis`):
81 - `prisma_or_structured_protocol` — PRISMA or equivalent systematic protocol
82 - `quality_assessment_of_sources` — quality scoring of included studies (Leech et al.)
83 - `publication_bias_discussed` — funnel plots, negative-result underrepresentation
84
85 Total: 51 base + 16 conditional = 67 max per paper. V1 scans remain valid (new fields optional).
86
87 Full schema with evaluation criteria for each question: `schema/scan.schema.json`
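
How tag-based activation changes the number of questions per paper can be sketched as follows. This is an illustrative sketch, not the project's scanner: the module keys and the dictionary layout are assumptions, while the question identifiers, the activating tags, and the base count of 51 are taken from the lists and total above.

```python
from typing import Dict, List, Set

BASE_QUESTION_COUNT = 51  # base checklist size as stated in the total above

# Hypothetical layout: module name -> activating tag and question ids.
CONDITIONAL_MODULES: Dict[str, dict] = {
    "experimental_rigor": {
        "activated_by": "benchmark-eval",
        "questions": [
            "seed_sensitivity_reported",
            "number_of_runs_stated",
            "hyperparameter_search_budget",
            "best_config_selection_justified",
            "multiple_comparison_correction",
            "self_comparison_bias_addressed",
            "compute_budget_vs_performance",
            "benchmark_construct_validity",
            "scaffold_confound_addressed",
        ],
    },
    "data_leakage": {
        "activated_by": "benchmark-eval",
        "questions": [
            "temporal_leakage_addressed",
            "feature_leakage_addressed",
            "non_independence_addressed",
            "leakage_detection_method",
        ],
    },
    "survey_methodology": {
        "activated_by": "meta-analysis",
        "questions": [
            "prisma_or_structured_protocol",
            "quality_assessment_of_sources",
            "publication_bias_discussed",
        ],
    },
}


def active_modules(methodology_tags: Set[str]) -> List[str]:
    """Names of the conditional modules switched on by a paper's tags."""
    return [
        name
        for name, module in CONDITIONAL_MODULES.items()
        if module["activated_by"] in methodology_tags
    ]


def question_count(methodology_tags: Set[str]) -> int:
    """Base questions plus the questions of every activated module."""
    extra = sum(
        len(CONDITIONAL_MODULES[name]["questions"])
        for name in active_modules(methodology_tags)
    )
    return BASE_QUESTION_COUNT + extra


for tags in [set(), {"benchmark-eval"}, {"meta-analysis"},
             {"benchmark-eval", "meta-analysis"}]:
    print(sorted(tags), question_count(tags))
```

Only a paper carrying both tags reaches the 67-question maximum; a paper with no matching tag is scored on the base checklist alone.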
88
89 ### Answer rules
90
91 - **`applies: true, answer: true`** = the paper clearly satisfies the criterion; you can point to where
92 - **`applies: true, answer: false`** = the paper does not satisfy the criterion, or evidence is absent. Absence of evidence is `answer: false`, not `applies: false`.
93 - **`applies: false, answer: false`** = the criterion is structurally inapplicable to this paper type (e.g., human_studies questions for a benchmark paper, contamination questions for a mining study)
94
95 Each answer includes a 1-3 sentence justification citing specific paper sections.
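
The answer rules can be checked mechanically. The sketch below is one possible validator, not the project's own; `validate_answer` is a hypothetical helper, and treating `applies: false, answer: true` as invalid is an assumption inferred from the three combinations listed above.

```python
from typing import List


def validate_answer(applies: bool, answer: bool, justification: str) -> List[str]:
    """Return a list of rule violations for one checklist entry (empty if valid)."""
    problems = []
    # Assumption: only the three listed combinations are valid, so an
    # inapplicable criterion must always carry answer=False.
    if not applies and answer:
        problems.append("answer must be false when applies is false")
    # Every entry needs a justification citing the paper.
    if not justification.strip():
        problems.append("justification is required")
    return problems


print(validate_answer(True, False, "No code link in Sec. 4 or the appendix."))
print(validate_answer(False, True, "No human participants."))
```

The first call returns an empty list because absence of evidence is recorded as `applies: true, answer: false`; the second call is flagged.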
96
97 ### Model assignment
98
99 - **Primary rater (v1)**: Sonnet — switched to Opus after Round 3 calibration showed persistent Sonnet generosity bias
100 - **Primary rater (v2)**: Opus (Claude Opus 4.6) — all scanning from Round 3 onward
101 - **Calibration rater**: Opus (independent re-evaluation of subset to measure inter-rater agreement)
102
103 ### Calibration results
104
105 **Round 1 (2026-02-28, yes/no/na format):** 8 papers, **93.2%** agreement (373/400). Two systematic issues:
106 1. NA boundary errors (56% of disagreements): Sonnet confused "didn't do it" with "doesn't apply." Fixed by adding explicit NA guidance per question.
107 2. Generosity bias (44%): Sonnet credited partial information. Fixed by adding "does NOT count" examples.
108
109 **Round 2 (2026-02-28, yes/no/na format, post-fixes):** 10 papers, **96.2%** agreement (481/500). Improvement confirmed, but NA boundary errors still 47% of remaining disagreements. Led to two-field redesign (applies + answer) to structurally eliminate the conflation.
110
111 **Round 3 (2026-02-28, applies/answer format):** 60 papers, **97.0%** agreement (2,911/3,000). Two-field design validated at scale. Remaining disagreements: applies_boundary 52%, sonnet_generous 36%, opus_generous 12%. Sonnet generosity bias persisted, leading to switch to Opus as primary rater.
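
The agreement percentages follow directly from the reported counts. A small Python check, where the list layout and the helper name `agreement_pct` are only illustrative:

```python
def agreement_pct(agreed: int, total: int) -> float:
    """Raw percent agreement between the two raters."""
    return 100 * agreed / total


# (round, matching answers, total answers) as reported above.
ROUNDS = [
    ("Round 1", 373, 400),
    ("Round 2", 481, 500),
    ("Round 3", 2911, 3000),
]

for name, agreed, total in ROUNDS:
    print(f"{name}: {agreement_pct(agreed, total):.2f}% agreement, "
          f"{total - agreed} disagreements")
```

This is raw percent agreement; it does not correct for agreement expected by chance.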
112
113 ## LLM-Assisted Systematic Review
114
115 This survey is itself a contribution to the methodology of large-scale systematic reviews. The entire pipeline — from paper harvesting and PDF acquisition through structured quality assessment — is conducted using Claude Opus 4.6 (Anthropic) as the primary evaluation instrument, with human oversight for calibration and editorial judgment.
116
117 ### Why this matters
118
119 Traditional systematic reviews face a scale ceiling: human reviewers can assess tens to low hundreds of papers before fatigue and inconsistency degrade quality. This survey targets ~1,000 papers with a structured instrument of up to 67 questions, which is infeasible for a small research team using manual review alone.
120
121 ### What the LLM does
122
123 1. **Structured extraction**: Each paper is read in full and evaluated against the boolean checklist. The LLM cites specific sections, tables, and figures in its justifications — these are auditable by human reviewers.
124 2. **Consistency at scale**: Unlike human reviewers who drift over hundreds of papers, the LLM applies the same criteria to paper #1 and paper #900. Calibration data (97% inter-rater agreement at 60 papers) quantifies this consistency.
125 3. **Conditional evaluation**: V2 scans activate domain-specific question modules (experimental rigor, data leakage, survey methodology) based on paper type, applying targeted criteria from meta-research literature.
126
127 ### What the LLM does not do
128
129 - **Editorial judgment**: Decisions about what findings mean, which narrative threads to pursue, and how to position the work are human decisions.
130 - **Instrument design**: The 67-question checklist was designed by humans, informed by meta-research literature (Henderson, Dodge, Kapoor, Leech, REFORMS). The LLM's role is to apply the instrument, not design it.
131 - **Calibration**: Inter-rater reliability was measured by comparing independent Opus evaluations against prior scans. Disagreement analysis and instrument refinement were human-driven.
132
133 ### Transparency
134
135 - The scanning model (Claude Opus 4.6), prompt text (`agents/scan-agent.md`), and schema (`schema/scan.schema.json`) are all published as part of the replication package.
136 - Every checklist answer includes a justification citing specific paper content, enabling human spot-checking at any granularity.
137 - The calibration journey (93.2% → 96.2% → 97.0%) and the specific failure modes discovered (NA boundary confusion, generosity bias) are documented as methodological findings in their own right.
138 - V1 vs V2 scan versions are tracked per paper, so analysis can control for instrument version.
139
140 ### Limitations of LLM-assisted review
141
142 - The LLM cannot verify claims against external sources (e.g., checking if a claimed GitHub repo actually exists and contains working code).
143 - Fabricated data would not be detected: the checklist assesses transparency and methodology, not the truthfulness of reported numbers. The Wakefield benchmark scored 45.7%, which reflects its transparency gaps but not the fabrication itself.
144 - The LLM's training data includes many of the papers being reviewed, creating a potential bias toward charitable interpretation of familiar work. The strict "absence of evidence is false" rule and calibration process mitigate but do not eliminate this.
145
146 ## Paper Selection
147
148 ### Inclusion Criteria
149 - Published 2023 or later (the GPT-4 era, when agentic AI research accelerated)
150 - Makes empirical claims about AI/LLM capability, productivity, or safety
151 - Relevant to software development, code generation, or agentic workflows
152 - Available in English
153
154 ### Exclusion Criteria
155 - Pure opinion pieces or blog posts (no empirical content)
156 - Product announcements without methodology
157 - Papers focused exclusively on non-code domains (unless methodology is transferable)
158
159 ### Sampling Strategy
160 - **Seed set**: Papers cited in existing survey documents and well-known references
161 - **Forward/backward citation chasing**: Follow citations from seed papers
162 - **Venue monitoring**: arXiv cs.SE, cs.AI, cs.CL; major ML and NLP conferences (NeurIPS, ICML, ACL)
163 - **Community sources**: HuggingFace trending, Semantic Scholar alerts
164
165 This is a purposive sample, not a random one. The goal is coverage of the most influential and most cited papers, not statistical representativeness of all papers published.
166
167 ## PDF Acquisition Pipeline
168
169 PDFs are obtained through a multi-stage automated pipeline before falling back to manual retrieval. All stages are documented so that the PRISMA flow diagram can report transparently how full texts were obtained.
170
171 ### Automated stages (scripts/download-arxiv.py, scripts/download-doi.py)
172
173 1. **arXiv direct download** — papers with `arxiv_id` downloaded from `arxiv.org/pdf/<id>.pdf`. Also catches arXiv DOIs (`10.48550/arXiv.*`) where the arXiv ID is embedded in the DOI.
174 2. **Semantic Scholar open access** — queries S2 API for open-access PDF URL; also recovers arXiv IDs missed during harvesting.
175 3. **Unpaywall** — queries Unpaywall API for green/gold OA versions.
176 4. **CORE API** — queries CORE (core.ac.uk) for author manuscripts and institutional repository copies.
177 5. **OpenAlex** — queries OpenAlex for additional OA links not indexed by Unpaywall.
178 6. **Sci-Hub** — opt-in (`--scihub` flag); parses mirror HTML to find embedded PDF URL.
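
The staged fall-through and the arXiv DOI handling can be sketched as below. This is a minimal illustration under stated assumptions, not the contents of `scripts/download-arxiv.py`: all function names, the `paper` dictionary layout, and the sample identifier are hypothetical, and the later stages are stubbed out rather than calling any real API.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# arXiv DOIs have the form 10.48550/arXiv.<id>; the id is embedded in the DOI.
ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(.+)$", re.IGNORECASE)


def arxiv_id_from_doi(doi: str) -> Optional[str]:
    """Extract the arXiv id from an arXiv DOI, or None for any other DOI."""
    match = ARXIV_DOI.match(doi.strip())
    return match.group(1) if match else None


def resolve_arxiv(paper: dict) -> Optional[str]:
    """Stage 1: build the direct PDF URL from arxiv_id or an arXiv DOI."""
    arxiv_id = paper.get("arxiv_id") or arxiv_id_from_doi(paper.get("doi") or "")
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf" if arxiv_id else None


def resolve_stub(paper: dict) -> Optional[str]:
    """Placeholder for a later stage (an API lookup would go here)."""
    return None


Resolver = Callable[[dict], Optional[str]]


def first_pdf_url(
    paper: dict, stages: Iterable[Tuple[str, Resolver]]
) -> Optional[Tuple[str, str]]:
    """Try each stage in order; return (stage name, url) for the first hit."""
    for name, resolver in stages:
        url = resolver(paper)
        if url:
            return name, url
    return None


STAGES = [("arxiv", resolve_arxiv), ("semantic-scholar", resolve_stub)]

# Placeholder DOI used only to exercise the code.
print(first_pdf_url({"doi": "10.48550/arXiv.2401.00001"}, STAGES))
print(first_pdf_url({"doi": "10.1000/example"}, STAGES))
```

Returning the stage name alongside the URL is a design choice of this sketch; it would make a per-stage breakdown straightforward to compute.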
179
180 ### Claude web search stage (scripts/run-pdf-finder.py / Agent tool)
181
182 For papers that survive all automated stages without a PDF, Claude agents (Sonnet, WebSearch + WebFetch + Bash) perform targeted web searches:
183 - DOI landing page crawl for embedded PDF links
184 - Author institutional page and publication list
185 - Preprint servers (arXiv, SSRN, bioRxiv, OSF)
186 - ResearchGate and Semantic Scholar pages
187 - Publisher "free access" or author-accepted-manuscript versions
188
189 Each agent writes `papers/<slug>/pdf-finder-result.txt` with `FOUND <url>` or `NOT_FOUND`. The orchestrator updates registry status on success.
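
Reading the result file back might look like the following sketch. It assumes the file holds a single line in exactly one of the two forms above; `parse_finder_result` and `read_finder_result` are hypothetical helpers, not part of the orchestrator.

```python
from pathlib import Path
from typing import Optional


def parse_finder_result(text: str) -> Optional[str]:
    """Return the URL for 'FOUND <url>', None for 'NOT_FOUND'.

    Raises ValueError for anything else so malformed agent output is not
    silently counted as a miss.
    """
    line = text.strip()
    if line == "NOT_FOUND":
        return None
    status, _, url = line.partition(" ")
    if status == "FOUND" and url.strip():
        return url.strip()
    raise ValueError(f"unrecognised pdf-finder result: {line!r}")


def read_finder_result(papers_dir: Path, slug: str) -> Optional[str]:
    """Load and parse papers/<slug>/pdf-finder-result.txt."""
    path = papers_dir / slug / "pdf-finder-result.txt"
    return parse_finder_result(path.read_text(encoding="utf-8"))


print(parse_finder_result("FOUND https://example.org/paper.pdf\n"))
print(parse_finder_result("NOT_FOUND\n"))
```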
190
191 **Observed hit rate**: ~50% of papers attempted via web search are found (primarily through author pages and preprint servers). The remaining failures are documented as genuinely paywalled with no open-access version available.
192
193 ### Reporting
194
195 Papers that could not be obtained are counted in the PRISMA flow diagram under "full text not available." The acquisition method (arXiv, OA repository, author page, etc.) is not tracked per paper but the overall breakdown is available from registry metadata.
