ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

methodology.md (14379B)


      1 # Survey Methodology
      2 
      3 ## Precedent
      4 
      5 This project adapts systematic review methodology from medical research, particularly the Cochrane Review process and PRISMA reporting guidelines. These frameworks provide decades of refinement on how to:
      6 
      7 - Define inclusion/exclusion criteria defensibly
      8 - Score study quality on structured rubrics
      9 - Minimize reviewer bias through structured extraction
     10 - Report findings transparently
     11 
     12 The key adaptation is that we are reviewing *methodological quality*, not synthesizing effect sizes. We are not doing a meta-analysis of "how much does AI help"; we are asking "how well did each study support its claims."
     13 
     14 ## Quality Assessment Instrument
     15 
     16 ### Design: 50-question two-field boolean checklist
     17 
     18 We use a 50-question checklist with two boolean fields per question rather than subjective Likert-style scores or a three-way yes/no/na enum. This was a deliberate design decision refined across two calibration rounds.
     19 
     20 **Each question has:**
     21 - `applies` (boolean): Is this criterion relevant to this paper type?
     22 - `answer` (boolean): Does the paper satisfy this criterion? (Only meaningful when applies=true.)
     23 - `justification` (string): 1-3 sentence explanation citing specific paper sections.
     24 
     25 **Why two fields instead of yes/no/na:**
     26 The original design used a single `answer: yes/no/na` field. Calibration rounds 1 and 2 showed that "NA boundary errors" made up 56% and 47% of Sonnet-Opus disagreements, respectively: the model conflated "the paper didn't do this" (should be no) with "this doesn't apply" (should be na). The two-field design forces explicit, separate decisions on applicability and compliance, so the two cases can no longer collapse into a single value.
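
A minimal sketch of how a legacy `yes/no/na` value could map onto the two-field form. The names `ChecklistAnswer` and `from_legacy` are hypothetical and the mapping is an assumption inferred from the answer rules in this document; the authoritative definition is `schema/scan.schema.json`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistAnswer:
    applies: bool        # is the criterion relevant to this paper type?
    answer: bool         # does the paper satisfy it? (meaningful only if applies)
    justification: str   # 1-3 sentences citing specific paper sections


def from_legacy(value: str, justification: str) -> ChecklistAnswer:
    """Map a legacy yes/no/na value onto (applies, answer). Assumed mapping."""
    mapping = {
        "yes": (True, True),    # applicable and satisfied
        "no": (True, False),    # applicable, not satisfied or evidence absent
        "na": (False, False),   # structurally inapplicable
    }
    key = value.strip().lower()
    if key not in mapping:
        raise ValueError(f"unknown legacy answer: {value!r}")
    applies, answer = mapping[key]
    return ChecklistAnswer(applies, answer, justification)
```

The point of the split is visible in the mapping: "no" and "na" differ in `applies`, not in `answer`, so a rater has to decide applicability before deciding compliance.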
     27 
     28 **Design principles:**
     29 - **Verifiable**: each question has a factually correct answer checkable against the paper
     30 - **Auditable**: justification text cites specific sections/quotes
     31 - **High inter-rater reliability**: booleans have much less variance than 0-3 scores
     32 - **Fast human calibration**: checking 50 booleans takes ~15 min, not hours
     33 - **Derived scores**: composite scores computed deterministically from boolean counts
     34 - **Separate denominators**: compliance rates computed only over papers where applies=true
     35 - **The questions are findings**: "only 34% of papers release code" is concrete and publishable
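
The "derived scores" and "separate denominators" principles can be illustrated with a short sketch. It assumes each scan is a dict mapping a question id to its `applies`/`answer` booleans; the function name and the question ids used in the usage note are hypothetical.

```python
def compliance_rates(scans):
    """Per-question compliance, counted only over papers where applies is true.

    `scans` is a list of dicts: question id -> {"applies": bool, "answer": bool}.
    Returns question id -> (satisfied, applicable, rate), with rate None when
    the question applied to no paper.
    """
    counts = {}
    for scan in scans:
        for qid, item in scan.items():
            satisfied, applicable = counts.get(qid, (0, 0))
            if item["applies"]:
                applicable += 1
                satisfied += int(item["answer"])
            counts[qid] = (satisfied, applicable)
    return {
        qid: (satisfied, applicable, satisfied / applicable if applicable else None)
        for qid, (satisfied, applicable) in counts.items()
    }
```

With three papers where a hypothetical `code_released` question applies to all three and is satisfied once, the rate is 1/3; a hypothetical `irb_approval` question that applies to only one paper and is satisfied there has a rate of 1/1, because the two inapplicable papers never enter its denominator.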
     36 
     37 **Previous designs (discarded):**
     38 1. 6-dimension 0-3 rubric — discarded because LLM-assigned subjective scores are hard to defend in a paper about methodological rigor.
     39 2. Single yes/no/na field — discarded after calibration showed NA boundary confusion was the dominant error mode.
     40 
     41 ### Categories (11 groups, 51 questions)
     42 
     43 1. **Artifacts** (4q): code released, data released, environment specs, reproduction instructions
     44 2. **Statistical methodology** (5q): CIs/error bars, significance tests, effect sizes, sample size justification, variance
     45 3. **Evaluation design** (9q): baselines, contemporary baselines, ablation, multiple metrics, human eval, held-out test, breakdowns, failure cases, negative results
     46 4. **Claims & evidence** (5q): abstract supported, causal claims justified, generalization bounded, alternatives discussed, proxy-outcome distinction
     47 5. **Setup transparency** (5q): model versions, prompts, hyperparameters, scaffolding, data preprocessing
     48 6. **Limitations & scope** (3q): limitations section, specific threats, scope boundaries
     49 7. **Data integrity** (4q): raw data available, collection described, recruitment described, pipeline documented
     50 8. **Conflicts of interest** (4q): funding disclosed, affiliations disclosed, funder independent, financial interests declared
     51 9. **Contamination** (3q): training cutoff stated, train/test overlap discussed, contamination addressed
     52 10. **Human studies** (7q): pre-registered, IRB, demographics, inclusion/exclusion, randomization, blinding, attrition
     53 11. **Cost & practicality** (2q): inference cost, compute budget
     54 
     55 The data integrity and conflicts of interest categories were inspired by the Wakefield MMR case: "Is raw data available for independent verification?" would have flagged the unavailable raw data years earlier, even though a checklist of this kind cannot detect fabrication itself.
     56 
     57 ### Conditional modules (v2, 16 questions)
     58 
     59 V2 scans add conditional question modules activated by `methodology_tags`. These target systematic issues identified by meta-research papers in the corpus.
     60 
     61 **12. Experimental rigor** (9q, activated by `benchmark-eval`):
     62 - `seed_sensitivity_reported` — Henderson et al. (2018) showed RL results vary 2x across seeds
     63 - `number_of_runs_stated` — exact run count, not implicit
     64 - `hyperparameter_search_budget` — Dodge et al. (2019) showed search budget dramatically affects results
     65 - `best_config_selection_justified` — selection on validation set, not cherry-picked
     66 - `multiple_comparison_correction` — Bonferroni/Holm/BH for multiple tests
     67 - `self_comparison_bias_addressed` — Lucic et al. (2018) showed authors' baseline re-implementations systematically underperform
     68 - `compute_budget_vs_performance` — performance as function of compute, not just peak
     69 - `benchmark_construct_validity` — Kapoor & Narayanan (2024) documented widespread validity gaps
     70 - `scaffold_confound_addressed` — SWE-bench scores vary 2.7–28.3% for the same model depending on scaffold; scaffold effect often exceeds model effect
     71 
     72 **13. Data leakage** (4q, activated by `benchmark-eval`):
     73 - `temporal_leakage_addressed` — training data from after prediction target
     74 - `feature_leakage_addressed` — input features leak answer information
     75 - `non_independence_addressed` — train/test share structural similarities
     76 - `leakage_detection_method` — concrete detection (canary strings, n-gram overlap, etc.)
     77 
     78 Source: Kapoor & Narayanan (2024) leakage taxonomy.
     79 
     80 **14. Survey methodology** (3q, activated by `meta-analysis`):
     81 - `prisma_or_structured_protocol` — PRISMA or equivalent systematic protocol
     82 - `quality_assessment_of_sources` — quality scoring of included studies (Leech et al.)
     83 - `publication_bias_discussed` — funnel plots, negative-result underrepresentation
     84 
     85 Total: 51 base + 16 conditional = 67 max per paper. V1 scans remain valid (new fields optional).
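
As a rough sketch, tag-based activation could look like the following. The tags `benchmark-eval` and `meta-analysis` come from this document; the module identifiers and the function name are hypothetical, and the real activation logic may differ.

```python
# Tag -> conditional modules it activates (module identifiers are illustrative).
MODULES_BY_TAG = {
    "benchmark-eval": ["experimental_rigor", "data_leakage"],
    "meta-analysis": ["survey_methodology"],
}


def active_modules(methodology_tags):
    """Return the conditional modules to run, de-duplicated, in a stable order."""
    active = []
    for tag in methodology_tags:
        for module in MODULES_BY_TAG.get(tag, []):
            if module not in active:
                active.append(module)
    return active
```

A paper tagged only with something outside the table gets no conditional modules and is scored on the base checklist alone, which is also why V1 scans without these fields remain valid.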
     86 
     87 Full schema with evaluation criteria for each question: `schema/scan.schema.json`
     88 
     89 ### Answer rules
     90 
     91 - **`applies: true, answer: true`** = the paper clearly satisfies the criterion; you can point to where
     92 - **`applies: true, answer: false`** = the paper does not satisfy the criterion, or evidence is absent. Absence of evidence is `answer: false`, not `applies: false`.
     93 - **`applies: false, answer: false`** = the criterion is structurally inapplicable to this paper type (e.g., human_studies questions for a benchmark paper, contamination questions for a mining study)
     94 
     95 Each answer includes a 1-3 sentence justification citing specific paper sections.
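
The rules above lend themselves to a mechanical check. This is a sketch only: `check_answer` is a hypothetical helper, and treating `applies: false, answer: true` as invalid is an assumption drawn from the three combinations listed, not a rule quoted from the schema.

```python
def check_answer(item):
    """Return a list of rule violations for one checklist item (empty if valid)."""
    problems = []
    for field in ("applies", "answer"):
        if not isinstance(item.get(field), bool):
            problems.append(f"{field} must be a boolean")
    # Only three combinations are listed; applies=false with answer=true is not one.
    if item.get("applies") is False and item.get("answer") is True:
        problems.append("answer must be false when applies is false")
    if not str(item.get("justification", "")).strip():
        problems.append("justification is required")
    return problems
```

Running a check like this before analysis keeps malformed items out of the denominators used for compliance rates.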
     96 
     97 ### Model assignment
     98 
     99 - **Primary rater (v1)**: Sonnet, replaced by Opus after Round 3 calibration showed a persistent Sonnet generosity bias
    100 - **Primary rater (v2)**: Opus (Claude Opus 4.6), used for all scanning after Round 3
    101 - **Calibration rater**: Opus (independent re-evaluation of subset to measure inter-rater agreement)
    102 
    103 ### Calibration results
    104 
    105 **Round 1 (2026-02-28, yes/no/na format):** 8 papers, **93.2%** agreement (373/400). Two systematic issues:
    106 1. NA boundary errors (56% of disagreements): Sonnet confused "didn't do it" with "doesn't apply." Mitigated by adding explicit NA guidance per question.
    107 2. Generosity bias (44%): Sonnet credited partial information. Mitigated by adding "does NOT count" examples.
    108 
    109 **Round 2 (2026-02-28, yes/no/na format, post-fixes):** 10 papers, **96.2%** agreement (481/500). Improvement confirmed, but NA boundary errors still 47% of remaining disagreements. Led to two-field redesign (applies + answer) to structurally eliminate the conflation.
    110 
    111 **Round 3 (2026-02-28, applies/answer format):** 60 papers, **97.0%** agreement (2,911/3,000). Two-field design validated at scale. Remaining disagreements: applies_boundary 52%, sonnet_generous 36%, opus_generous 12%. Sonnet generosity bias persisted, leading to switch to Opus as primary rater.
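
The agreement figures are simple proportions over (paper, question) cells. The sketch below shows one way to compute them; it assumes a cell counts as agreement only when both `applies` and `answer` match, which this document does not state explicitly, and the function name is hypothetical.

```python
def percent_agreement(primary, calibration):
    """Agreement between two raters over the cells they both scored.

    Each argument maps (paper_id, question_id) -> (applies, answer).
    Returns (matching_cells, compared_cells, percent).
    """
    shared = primary.keys() & calibration.keys()
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(primary[key] == calibration[key] for key in shared)
    return matches, len(shared), 100.0 * matches / len(shared)


# Totals reported for the three calibration rounds: (agreeing cells, all cells).
REPORTED = {"Round 1": (373, 400), "Round 2": (481, 500), "Round 3": (2911, 3000)}
REPORTED_PERCENT = {
    name: round(100.0 * agree / total, 2) for name, (agree, total) in REPORTED.items()
}
```

The reported totals work out to 93.25%, 96.2% and 97.03%, in line with the rounded figures quoted above. Each denominator corresponds to 50 questions per paper (for example, 8 papers × 50 = 400).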
    112 
    113 ## LLM-Assisted Systematic Review
    114 
    115 This survey is itself a contribution to the methodology of large-scale systematic reviews. The entire pipeline — from paper harvesting and PDF acquisition through structured quality assessment — is conducted using Claude Opus 4.6 (Anthropic) as the primary evaluation instrument, with human oversight for calibration and editorial judgment.
    116 
    117 ### Why this matters
    118 
    119 Traditional systematic reviews face a scale ceiling: human reviewers can assess tens to low hundreds of papers before fatigue and inconsistency degrade quality. This survey targets ~1,000 papers with a structured instrument of up to 67 questions per paper, which is infeasible for a small research team using manual review alone.
    120 
    121 ### What the LLM does
    122 
    123 1. **Structured extraction**: Each paper is read in full and evaluated against the boolean checklist. The LLM cites specific sections, tables, and figures in its justifications — these are auditable by human reviewers.
    124 2. **Consistency at scale**: Unlike human reviewers, who drift over hundreds of papers, the LLM applies the same written criteria to paper #1 and paper #900. The Round 3 calibration (97.0% Sonnet-Opus agreement on 60 papers) gives a quantitative estimate of this consistency.
    125 3. **Conditional evaluation**: V2 scans activate domain-specific question modules (experimental rigor, data leakage, survey methodology) based on paper type, applying targeted criteria from meta-research literature.
    126 
    127 ### What the LLM does not do
    128 
    129 - **Editorial judgment**: Decisions about what findings mean, which narrative threads to pursue, and how to position the work are human decisions.
    130 - **Instrument design**: The checklist (51 base questions plus 16 conditional) was designed by humans, informed by meta-research literature (Henderson, Dodge, Kapoor, Leech, REFORMS). The LLM's role is to apply the instrument, not design it.
    131 - **Calibration**: Inter-rater reliability was measured by comparing independent Opus evaluations against prior scans. Disagreement analysis and instrument refinement were human-driven.
    132 
    133 ### Transparency
    134 
    135 - The scanning model (Claude Opus 4.6), prompt text (`agents/scan-agent.md`), and schema (`schema/scan.schema.json`) are all published as part of the replication package.
    136 - Every checklist answer includes a justification citing specific paper content, enabling human spot-checking at any granularity.
    137 - The calibration journey (93.2% → 96.2% → 97.0%) and the specific failure modes discovered (NA boundary confusion, generosity bias) are documented as methodological findings in their own right.
    138 - V1 vs V2 scan versions are tracked per paper, so analysis can control for instrument version.
    139 
    140 ### Limitations of LLM-assisted review
    141 
    142 - The LLM cannot verify claims against external sources (e.g., checking if a claimed GitHub repo actually exists and contains working code).
    143 - Fabricated data would not be detected — the checklist assesses transparency and methodology, not truthfulness of reported numbers (the Wakefield benchmark scored 45.7%, catching transparency gaps but not fabrication).
    144 - The LLM's training data includes many of the papers being reviewed, creating a potential bias toward charitable interpretation of familiar work. The strict "absence of evidence is false" rule and calibration process mitigate but do not eliminate this.
    145 
    146 ## Paper Selection
    147 
    148 ### Inclusion Criteria
    149 - Published 2023 or later (post-GPT-4, when agentic AI research accelerated)
    150 - Makes empirical claims about AI/LLM capability, productivity, or safety
    151 - Relevant to software development, code generation, or agentic workflows
    152 - Available in English
    153 
    154 ### Exclusion Criteria
    155 - Pure opinion pieces or blog posts (no empirical content)
    156 - Product announcements without methodology
    157 - Papers focused exclusively on non-code domains (unless methodology is transferable)
    158 
    159 ### Sampling Strategy
    160 - **Seed set**: Papers cited in existing survey documents and well-known references
    161 - **Forward/backward citation chasing**: Follow citations from seed papers
    162 - **Venue monitoring**: arXiv cs.SE, cs.AI, cs.CL; major ML conferences (NeurIPS, ICML, ACL)
    163 - **Community sources**: HuggingFace trending, Semantic Scholar alerts
    164 
    165 This is a purposive sample, not a random one. The goal is coverage of the most influential and most cited papers, not statistical representativeness of all papers published.
    166 
    167 ## PDF Acquisition Pipeline
    168 
    169 PDFs are obtained through a multi-stage automated pipeline before falling back to manual retrieval. All stages are fully documented for transparency in the PRISMA flow diagram.
    170 
    171 ### Automated stages (scripts/download-arxiv.py, scripts/download-doi.py)
    172 
    173 1. **arXiv direct download** — papers with `arxiv_id` downloaded from `arxiv.org/pdf/<id>.pdf`. Also catches arXiv DOIs (`10.48550/arXiv.*`) where the arXiv ID is embedded in the DOI.
    174 2. **Semantic Scholar open access** — queries S2 API for open-access PDF URL; also recovers arXiv IDs missed during harvesting.
    175 3. **Unpaywall** — queries Unpaywall API for green/gold OA versions.
    176 4. **CORE API** — queries CORE (core.ac.uk) for author manuscripts and institutional repository copies.
    177 5. **OpenAlex** — queries OpenAlex for additional OA links not indexed by Unpaywall.
    178 6. **Sci-Hub** — opt-in (`--scihub` flag); parses mirror HTML to find embedded PDF URL.
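
The ordered fallback and the arXiv-DOI recovery in stage 1 can be sketched as follows. This is not the code in `scripts/download-arxiv.py` or `scripts/download-doi.py`: the function names and the `paper` dict layout are hypothetical, the `https://` scheme is assumed, and stages other than arXiv are left as caller-supplied resolvers so that no real API is misrepresented.

```python
import re

# arXiv DOIs embed the arXiv ID after the "10.48550/arXiv." prefix.
ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(.+)$", re.IGNORECASE)


def arxiv_id_for(paper):
    """Return `arxiv_id` if present, else recover it from an arXiv DOI."""
    if paper.get("arxiv_id"):
        return paper["arxiv_id"]
    match = ARXIV_DOI.match(paper.get("doi") or "")
    return match.group(1) if match else None


def arxiv_pdf_url(paper):
    """Stage 1 resolver: build the direct PDF URL, or None if no arXiv ID."""
    arxiv_id = arxiv_id_for(paper)
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf" if arxiv_id else None


def first_pdf_url(paper, stages):
    """Try each (name, resolver) stage in order and return the first hit."""
    for name, resolver in stages:
        url = resolver(paper)
        if url:
            return name, url
    return None, None
```

Passing the stages as an ordered list keeps the priority explicit and makes it easy to record which stage produced each PDF, or to leave an opt-in stage out of the list entirely.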
    179 
    180 ### Claude web search stage (scripts/run-pdf-finder.py / Agent tool)
    181 
    182 For papers that survive all automated stages without a PDF, Claude agents (Sonnet, WebSearch + WebFetch + Bash) perform targeted web searches:
    183 - DOI landing page crawl for embedded PDF links
    184 - Author institutional page and publication list
    185 - Preprint servers (arXiv, SSRN, bioRxiv, OSF)
    186 - ResearchGate and Semantic Scholar pages
    187 - Publisher "free access" or author-accepted-manuscript versions
    188 
    189 Each agent writes `papers/<slug>/pdf-finder-result.txt` with `FOUND <url>` or `NOT_FOUND`. The orchestrator updates registry status on success.
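
A sketch of how an orchestrator might read these result files. It assumes the file holds a single line in exactly one of the two forms described above; `parse_finder_result` and `read_finder_result` are hypothetical names, not functions from `scripts/run-pdf-finder.py`.

```python
from pathlib import Path


def parse_finder_result(text):
    """Parse 'FOUND <url>' or 'NOT_FOUND'; return the URL, or None if not found."""
    line = text.strip()
    if line == "NOT_FOUND":
        return None
    status, _, url = line.partition(" ")
    if status == "FOUND" and url.strip():
        return url.strip()
    raise ValueError(f"unrecognized pdf-finder result: {line!r}")


def read_finder_result(slug, root=Path("papers")):
    """Read papers/<slug>/pdf-finder-result.txt and parse it."""
    path = root / slug / "pdf-finder-result.txt"
    return parse_finder_result(path.read_text(encoding="utf-8"))
```

Raising on anything other than the two expected forms means a malformed agent output is surfaced rather than silently counted as "not found".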
    190 
    191 **Observed hit rate**: ~50% of papers attempted via web search are found (primarily through author pages and preprint servers). The remaining papers are recorded as paywalled, with no open-access version found by any stage.
    192 
    193 ### Reporting
    194 
    195 Papers that could not be obtained are counted in the PRISMA flow diagram under "full text not available." The acquisition method (arXiv, OA repository, author page, etc.) is not tracked per paper but the overall breakdown is available from registry metadata.
