# Survey Methodology

## Precedent

This project adapts systematic review methodology from medical research, particularly the Cochrane Review process and PRISMA reporting guidelines. These frameworks provide decades of refinement on how to:

- Define inclusion/exclusion criteria defensibly
- Score study quality on structured rubrics
- Minimize reviewer bias through structured extraction
- Report findings transparently

The key adaptation is that we are reviewing *methodological quality*, not synthesizing effect sizes. We are not doing a meta-analysis of "how much does AI help"; we are asking "how well did each study support its claims."

## Quality Assessment Instrument

### Design: 50-question two-field boolean checklist

We use a 50-question checklist with two boolean fields per question rather than subjective Likert-style scores or a three-way yes/no/na enum. This was a deliberate design decision refined across two calibration rounds.

**Each question has:**
- `applies` (boolean): Is this criterion relevant to this paper type?
- `answer` (boolean): Does the paper satisfy this criterion? (Only meaningful when applies=true.)
- `justification` (string): 1-3 sentence explanation citing specific paper sections.

**Why two fields instead of yes/no/na:**
The original design used a single `answer: yes/no/na` field. Calibration rounds 1 and 2 (93.2% and 96.2% agreement) revealed that 47-56% of Sonnet-Opus disagreements were "NA boundary errors": the model conflated "the paper didn't do this" (should be no) with "this doesn't apply" (should be na). The two-field design forces explicit, separate decisions on applicability and compliance, which removes the single-field ambiguity behind this conflation.
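
The two-field record described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the `ChecklistAnswer` class and the `from_legacy` helper are hypothetical names, and rejecting the combination `applies=False, answer=True` is an assumption drawn from the note that `answer` is only meaningful when `applies` is true. Only the field names `applies`, `answer`, and `justification` come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistAnswer:
    """One checklist response (hypothetical class; field names follow the text)."""

    applies: bool
    answer: bool
    justification: str

    def __post_init__(self) -> None:
        # Assumption: an inapplicable criterion cannot also be satisfied.
        if not self.applies and self.answer:
            raise ValueError("answer must be False when applies is False")


def from_legacy(value: str, justification: str) -> ChecklistAnswer:
    """Map a legacy yes/no/na value onto the two-field form (hypothetical helper)."""
    mapping = {
        "yes": (True, True),    # criterion applies and is satisfied
        "no": (True, False),    # criterion applies but is not satisfied
        "na": (False, False),   # criterion does not apply to this paper type
    }
    if value not in mapping:
        raise ValueError(f"unknown legacy answer: {value!r}")
    applies, answer = mapping[value]
    return ChecklistAnswer(applies, answer, justification)
```

Keeping "no" and "na" as distinct `(applies, answer)` pairs means the rater has to commit to applicability before compliance, which is the separation the redesign was after.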

**Design principles:**
- **Verifiable**: each question has a factually correct answer checkable against the paper
- **Auditable**: justification text cites specific sections/quotes
- **High inter-rater reliability**: booleans have much less variance than 0-3 scores
- **Fast human calibration**: checking 50 booleans takes ~15 min, not hours
- **Derived scores**: composite scores computed deterministically from boolean counts
- **Separate denominators**: compliance rates computed only over papers where applies=true
- **The questions are findings**: "only 34% of papers release code" is concrete and publishable

**Previous designs (discarded):**
1. 6-dimension 0-3 rubric: discarded because LLM-assigned subjective scores are hard to defend in a paper about methodological rigor.
2. Single yes/no/na field: discarded after calibration showed NA boundary confusion was the dominant error mode.

### Categories (11 groups, 50 questions)

1. **Artifacts** (4q): code released, data released, environment specs, reproduction instructions
2. **Statistical methodology** (5q): CIs/error bars, significance tests, effect sizes, sample size justification, variance
3. **Evaluation design** (9q): baselines, contemporary baselines, ablation, multiple metrics, human eval, held-out test, breakdowns, failure cases, negative results
4. **Claims & evidence** (5q): abstract supported, causal claims justified, generalization bounded, alternatives discussed, proxy-outcome distinction
5. **Setup transparency** (5q): model versions, prompts, hyperparameters, scaffolding, data preprocessing
6. **Limitations & scope** (3q): limitations section, specific threats, scope boundaries
7. **Data integrity** (4q): raw data available, collection described, recruitment described, pipeline documented
8. **Conflicts of interest** (4q): funding disclosed, affiliations disclosed, funder independent, financial interests declared
9. **Contamination** (3q): training cutoff stated, train/test overlap discussed, contamination addressed
10. **Human studies** (7q): pre-registered, IRB, demographics, inclusion/exclusion, randomization, blinding, attrition
11. **Cost & practicality** (2q): inference cost, compute budget

The data integrity and conflicts of interest categories are inspired by the Wakefield MMR case: "Is raw data available for independent verification?" would have caught the fabrication years earlier.

### Conditional modules (v2, 15 questions)

V2 scans add conditional question modules activated by `methodology_tags`. These target systematic issues identified by meta-research papers in the corpus.

**12. Experimental rigor** (8q, activated by `benchmark-eval`):
- `seed_sensitivity_reported`: Henderson et al. (2018) showed RL results vary 2x across seeds
- `number_of_runs_stated`: exact run count, not implicit
- `hyperparameter_search_budget`: Dodge et al. (2019) showed search budget dramatically affects results
- `best_config_selection_justified`: selection on validation set, not cherry-picked
- `multiple_comparison_correction`: Bonferroni/Holm/BH for multiple tests
- `self_comparison_bias_addressed`: Lucic et al. (2018) showed authors' baseline re-implementations systematically underperform
- `compute_budget_vs_performance`: performance as function of compute, not just peak
- `benchmark_construct_validity`: Kapoor & Narayanan (2024) documented widespread validity gaps
- `scaffold_confound_addressed`: SWE-bench scores vary 2.7–28.3% for the same model depending on scaffold; scaffold effect often exceeds model effect

**13. Data leakage** (4q, activated by `benchmark-eval`):
- `temporal_leakage_addressed`: training data from after prediction target
- `feature_leakage_addressed`: input features leak answer information
- `non_independence_addressed`: train/test share structural similarities
- `leakage_detection_method`: concrete detection (canary strings, n-gram overlap, etc.)

Source: Kapoor & Narayanan (2024) leakage taxonomy.

**14. Survey methodology** (3q, activated by `meta-analysis`):
- `prisma_or_structured_protocol`: PRISMA or equivalent systematic protocol
- `quality_assessment_of_sources`: quality scoring of included studies (Leech et al.)
- `publication_bias_discussed`: funnel plots, negative-result underrepresentation

Total: 51 base + 16 conditional = 67 max per paper. V1 scans remain valid (new fields optional).

Full schema with evaluation criteria for each question: `schema/scan.schema.json`

### Answer rules

- **`applies: true, answer: true`** = the paper clearly satisfies the criterion; you can point to where
- **`applies: true, answer: false`** = the paper does not satisfy the criterion, or evidence is absent. Absence of evidence is `answer: false`, not `applies: false`.
- **`applies: false, answer: false`** = the criterion is structurally inapplicable to this paper type (e.g., human_studies questions for a benchmark paper, contamination questions for a mining study)

Each answer includes a 1-3 sentence justification citing specific paper sections.
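
The difference between `answer: false` and `applies: false` matters most when rates are computed, because inapplicable papers leave the denominator while non-compliant papers stay in it. The sketch below shows that calculation for a single question. It assumes each paper's response has been reduced to an `(applies, answer)` pair; the function name `compliance_rate` and the choice to return `None` for an empty denominator are illustrative, not taken from the project's code.

```python
from typing import Optional, Sequence, Tuple


def compliance_rate(responses: Sequence[Tuple[bool, bool]]) -> Optional[float]:
    """Share of applicable papers that satisfy one criterion.

    Each item is an (applies, answer) pair for one paper. Papers with
    applies=False are excluded from the denominator entirely.
    """
    applicable = [answer for applies, answer in responses if applies]
    if not applicable:
        # Assumption: report "no data" rather than 0% when nothing applies.
        return None
    return sum(applicable) / len(applicable)


# Four papers scored on one question; the third is structurally inapplicable.
responses = [(True, True), (True, False), (False, False), (True, True)]
rate = compliance_rate(responses)  # 2 of the 3 applicable papers comply
```

Had the third paper been recorded as `applies: true, answer: false` instead, the same data would yield 2 of 4, which is why the boundary between the two cases is worth calibrating.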

### Model assignment

- **Primary rater (v1)**: Sonnet, switched to Opus after Round 3 calibration showed persistent Sonnet generosity bias
- **Primary rater (v2)**: Opus (Claude Opus 4.6), all scanning from Round 3 onward
- **Calibration rater**: Opus (independent re-evaluation of subset to measure inter-rater agreement)

### Calibration results

**Round 1 (2026-02-28, yes/no/na format):** 8 papers, **93.2%** agreement (373/400). Two systematic issues:
1. NA boundary errors (56% of disagreements): Sonnet confused "didn't do it" with "doesn't apply." Fixed by adding explicit NA guidance per question.
2. Generosity bias (44%): Sonnet credited partial information. Fixed by adding "does NOT count" examples.

**Round 2 (2026-02-28, yes/no/na format, post-fixes):** 10 papers, **96.2%** agreement (481/500). Improvement confirmed, but NA boundary errors still 47% of remaining disagreements. Led to two-field redesign (applies + answer) to structurally eliminate the conflation.

**Round 3 (2026-02-28, applies/answer format):** 60 papers, **97.0%** agreement (2,911/3,000). Two-field design validated at scale. Remaining disagreements: applies_boundary 52%, sonnet_generous 36%, opus_generous 12%. Sonnet generosity bias persisted, leading to switch to Opus as primary rater.

## LLM-Assisted Systematic Review

This survey is itself a contribution to the methodology of large-scale systematic reviews. The entire pipeline, from paper harvesting and PDF acquisition through structured quality assessment, is conducted using Claude Opus 4.6 (Anthropic) as the primary evaluation instrument, with human oversight for calibration and editorial judgment.

### Why this matters

Traditional systematic reviews face a scale ceiling: human reviewers can assess tens to low hundreds of papers before fatigue and inconsistency degrade quality. This survey targets ~1,000 papers with a 65-question structured instrument, which is infeasible for a small research team using manual review alone.

### What the LLM does

1. **Structured extraction**: Each paper is read in full and evaluated against the boolean checklist. The LLM cites specific sections, tables, and figures in its justifications; these are auditable by human reviewers.
2. **Consistency at scale**: Unlike human reviewers who drift over hundreds of papers, the LLM applies the same criteria to paper #1 and paper #900. Calibration data (97% inter-rater agreement at 60 papers) quantifies this consistency.
3. **Conditional evaluation**: V2 scans activate domain-specific question modules (experimental rigor, data leakage, survey methodology) based on paper type, applying targeted criteria from meta-research literature.

### What the LLM does not do

- **Editorial judgment**: Decisions about what findings mean, which narrative threads to pursue, and how to position the work are human decisions.
- **Instrument design**: The 65-question checklist was designed by humans, informed by meta-research literature (Henderson, Dodge, Kapoor, Leech, REFORMS). The LLM's role is to apply the instrument, not design it.
- **Calibration**: Inter-rater reliability was measured by comparing independent Opus evaluations against prior scans. Disagreement analysis and instrument refinement were human-driven.

### Transparency

- The scanning model (Claude Opus 4.6), prompt text (`agents/scan-agent.md`), and schema (`schema/scan.schema.json`) are all published as part of the replication package.
- Every checklist answer includes a justification citing specific paper content, enabling human spot-checking at any granularity.
- The calibration journey (93.2% → 96.2% → 97.0%) and the specific failure modes discovered (NA boundary confusion, generosity bias) are documented as methodological findings in their own right.
- V1 vs V2 scan versions are tracked per paper, so analysis can control for instrument version.

### Limitations of LLM-assisted review

- The LLM cannot verify claims against external sources (e.g., checking if a claimed GitHub repo actually exists and contains working code).
- Fabricated data would not be detected: the checklist assesses transparency and methodology, not truthfulness of reported numbers (the Wakefield benchmark scored 45.7%, catching transparency gaps but not fabrication).
- The LLM's training data includes many of the papers being reviewed, creating a potential bias toward charitable interpretation of familiar work. The strict "absence of evidence is false" rule and calibration process mitigate but do not eliminate this.

## Paper Selection

### Inclusion Criteria
- Published 2023 or later (post-GPT-4, when agentic AI research accelerated)
- Makes empirical claims about AI/LLM capability, productivity, or safety
- Relevant to software development, code generation, or agentic workflows
- Available in English

### Exclusion Criteria
- Pure opinion pieces or blog posts (no empirical content)
- Product announcements without methodology
- Papers focused exclusively on non-code domains (unless methodology is transferable)

### Sampling Strategy
- **Seed set**: Papers cited in existing survey documents and well-known references
- **Forward/backward citation chasing**: Follow citations from seed papers
- **Venue monitoring**: arXiv cs.SE, cs.AI, cs.CL; major ML conferences (NeurIPS, ICML, ACL)
- **Community sources**: HuggingFace trending, Semantic Scholar alerts

This is a purposive sample, not a random one. The goal is coverage of the most influential and most cited papers, not statistical representativeness of all papers published.

## PDF Acquisition Pipeline

PDFs are obtained through a multi-stage automated pipeline before falling back to manual retrieval. All stages are fully documented for transparency in the PRISMA flow diagram.

### Automated stages (`scripts/download-arxiv.py`, `scripts/download-doi.py`)

1. **arXiv direct download**: papers with `arxiv_id` downloaded from `arxiv.org/pdf/<id>.pdf`. Also catches arXiv DOIs (`10.48550/arXiv.*`) where the arXiv ID is embedded in the DOI.
2. **Semantic Scholar open access**: queries S2 API for open-access PDF URL; also recovers arXiv IDs missed during harvesting.
3. **Unpaywall**: queries Unpaywall API for green/gold OA versions.
4. **CORE API**: queries CORE (core.ac.uk) for author manuscripts and institutional repository copies.
5. **OpenAlex**: queries OpenAlex for additional OA links not indexed by Unpaywall.
6. **Sci-Hub**: opt-in (`--scihub` flag); parses mirror HTML to find embedded PDF URL.

### Claude web search stage (`scripts/run-pdf-finder.py` / Agent tool)

For papers that survive all automated stages without a PDF, Claude agents (Sonnet, WebSearch + WebFetch + Bash) perform targeted web searches:
- DOI landing page crawl for embedded PDF links
- Author institutional page and publication list
- Preprint servers (arXiv, SSRN, bioRxiv, OSF)
- ResearchGate and Semantic Scholar pages
- Publisher "free access" or author-accepted-manuscript versions

Each agent writes `papers/<slug>/pdf-finder-result.txt` with `FOUND <url>` or `NOT_FOUND`. The orchestrator updates registry status on success.

**Observed hit rate**: ~50% of papers attempted via web search are found (primarily through author pages and preprint servers). The remaining failures are documented as genuinely paywalled with no open-access version available.
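
The result-file convention above (`FOUND <url>` or `NOT_FOUND`) is simple enough to parse strictly. The following is a minimal sketch of how an orchestrator might read one result. It assumes the file holds a single line in one of those two forms; the function name `parse_finder_result` is hypothetical, and raising on anything else is an illustrative choice, not the behavior of `scripts/run-pdf-finder.py`.

```python
from typing import Optional


def parse_finder_result(text: str) -> Optional[str]:
    """Return the URL from a 'FOUND <url>' result, or None for 'NOT_FOUND'.

    Hypothetical helper: assumes the result file contains exactly one line.
    """
    line = text.strip()
    if line == "NOT_FOUND":
        return None
    prefix = "FOUND "
    if line.startswith(prefix):
        url = line[len(prefix):].strip()
        if url:
            return url
    # Anything else is treated as a malformed result rather than guessed at.
    raise ValueError(f"unrecognized pdf-finder result: {line!r}")
```

Failing loudly on malformed output keeps an agent's free-form text from being silently recorded as either a hit or a miss, which would distort the observed hit rate.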

### Reporting

Papers that could not be obtained are counted in the PRISMA flow diagram under "full text not available." The acquisition method (arXiv, OA repository, author page, etc.) is not tracked per paper, but the overall breakdown is available from registry metadata.