ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

v2-findings.md (8203B)


# V2 Scan Findings: Methodological Quality in Agentic AI/LLM Research

*Updated 2026-03-15 from 467 v2 scans (Opus-rated, two-field boolean instrument)*

## 1. Overall Quality Distribution

- **N**: 467 papers (all scanned by Opus with v2 instrument)
- **Mean**: 51.0%
- **Median**: 50.0%
- **Std dev**: 16.1%
- **Range**: 10.0% – 93.3%

| Tier | Count | % |
|------|-------|---|
| >= 70% | 55 | 12% |
| 60-69% | 93 | 20% |
| 50-59% | 94 | 20% |
| 40-49% | 107 | 23% |
| 30-39% | 68 | 15% |
| < 30% | 50 | 11% |

The typical paper satisfies half of applicable criteria. The median moved from 53.3% (n=135, priority papers) to 55.6% (n=271) to 50.0% (n=467) as less prominent papers entered the pool. The net drop of about 3 points is consistent with visibility correlating with quality. The headline numbers have stayed within a few points of each other across the three checkpoints.
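As a rough illustration of how a per-paper score and tier could be derived from a two-field boolean instrument, here is a minimal sketch. The field names `applicable` and `satisfied`, and the half-open tier boundaries, are assumptions made for this example; the actual scan schema is not shown in this file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Answer:
    """One instrument question. Both field names are assumptions."""
    applicable: bool  # does the question apply to this paper?
    satisfied: bool   # if it applies, does the paper meet it?


def paper_score(answers: list[Answer]) -> Optional[float]:
    """Percentage of applicable questions that are satisfied."""
    applicable = [a for a in answers if a.applicable]
    if not applicable:
        return None  # nothing applicable, so there is nothing to score
    return 100.0 * sum(a.satisfied for a in applicable) / len(applicable)


def tier(score: float) -> str:
    """Map a score to the tier labels used in the table above."""
    if score >= 70:
        return ">= 70%"
    if score < 30:
        return "< 30%"
    low = int(score // 10) * 10  # e.g. 69.9 falls in the 60-69% tier
    return f"{low}-{low + 9}%"


# Hypothetical paper: two applicable questions, one satisfied.
answers = [Answer(True, True), Answer(True, False), Answer(False, False)]
score = paper_score(answers)
print(score, tier(score))
```

Non-applicable questions are dropped from the denominator, which is why the N column varies from question to question in the tables that follow.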

## 2. The Methodological Theater Gap

The headline finding from v2: **papers look like they have evaluations, but the evaluations lack rigor.**

| Layer | Pass Rate | What It Measures |
|-------|-----------|-----------------|
| Base: evaluation_design | **86%** | Has baselines, metrics, breakdowns |
| Conditional: experimental_rigor | **28%** | Seeds, search budgets, self-comparison bias |
| Conditional: data_leakage | **15%** | Temporal/feature leakage, non-independence |
| Conditional: survey_methodology | **15%** | PRISMA, quality assessment, publication bias |

Papers have the *form* of rigorous evaluation without the *substance*. 86% include baselines and multiple metrics, but only 28% pass the experimental_rigor layer (seed sensitivity, search budgets, self-comparison bias). This is methodological theater.

## 3. Category Pass Rates

| Category | Pass Rate |
|----------|-----------|
| evaluation_design | 86% |
| setup_transparency | 70% |
| data_integrity | 65% |
| claims_and_evidence | 64% |
| limitations_and_scope | 61% |
| human_studies | 42% |
| conflicts_of_interest | 41% |
| statistical_methodology | 36% |
| cost_and_practicality | 34% |
| artifacts | 33% |
| experimental_rigor | 28% |
| contamination | 19% |
| data_leakage | 15% |
| survey_methodology | 15% |

## 4. Near-Zero Questions (The Damning Statistics)

These questions are almost universally failed. Each is a single publishable finding.

| Question | Pass Rate | N |
|----------|-----------|---|
| self_comparison_bias_addressed | **2.3%** | 3/132 |
| financial_interests_declared | **3.3%** | 9/271 |
| multiple_comparison_correction | **3.8%** | 3/80 |
| hyperparameter_search_budget | **5.2%** | 7/135 |
| sample_size_justified | **5.2%** | 11/212 |
| quality_assessment_of_sources (surveys) | **8.3%** | 3/36 |
| reproduction_instructions | **8.5%** | 21/247 |
| environment_specified | **11.0%** | 25/227 |
| pre_registered | **11.4%** | 5/44 |
| non_independence_addressed | **11.5%** | 14/122 |

**97%** of papers don't declare financial interests. **95%** of benchmark papers don't report hyperparameter search budgets. **91%** don't provide reproduction instructions. **89%** don't specify environments.
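Several of these rates rest on small counts (for example 3/132 or 5/44), so an interval alongside each percentage may help readers judge how precise they are. The sketch below uses the Wilson score interval, a standard choice for binomial proportions. It is an illustration only, not part of the scan pipeline, and the helper name `wilson_interval` is made up for this example.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Counts copied from the table above.
for name, passes, n in [
    ("self_comparison_bias_addressed", 3, 132),
    ("pre_registered", 5, 44),
]:
    lo, hi = wilson_interval(passes, n)
    print(f"{name}: {passes / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The interval for `pre_registered` comes out noticeably wider than the one for `self_comparison_bias_addressed`, because it is based on only 44 applicable papers.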

## 5. The Reproducibility Gap

- Code released: **58.6%** (78/133 applicable)
- Data released: **56.1%** (74/132)
- Environment specified: **13.4%** (16/119)
- Reproduction instructions: **10.9%** (14/129)
- **Full reproducibility (all 4)**: **6.7%** (9/135)

Most papers release code. Almost none tell you how to run it.
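The "all 4" criterion is a conjunction of the four checks listed above. A minimal sketch of that rule follows, assuming each paper is a dict of booleans. The keys `code_released` and `data_released` are hypothetical names, and treating a missing key as a failure is also an assumption about how non-applicable answers are handled.

```python
REPRO_FIELDS = (
    "code_released",              # hypothetical key name
    "data_released",              # hypothetical key name
    "environment_specified",
    "reproduction_instructions",
)


def fully_reproducible(paper: dict[str, bool]) -> bool:
    """True only if all four checks pass; a missing key counts as a failure."""
    return all(paper.get(field, False) for field in REPRO_FIELDS)


def full_reproducibility_rate(papers: list[dict[str, bool]]) -> float:
    """Percentage of papers that pass all four checks."""
    if not papers:
        return 0.0
    return 100.0 * sum(fully_reproducible(p) for p in papers) / len(papers)


# Hypothetical papers, for illustration only.
papers = [
    {"code_released": True, "data_released": True,
     "environment_specified": True, "reproduction_instructions": True},
    {"code_released": True, "data_released": True,
     "environment_specified": False, "reproduction_instructions": False},
    {"code_released": False, "data_released": False},
]
print(f"{full_reproducibility_rate(papers):.1f}%")
```

Because the rule is a conjunction, the full-reproducibility rate can never exceed the lowest of the four individual rates.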

## 6. Statistical Negligence

| Question | Pass Rate | N |
|----------|-----------|---|
| Confidence intervals / error bars | **24.0%** | 25/104 |
| Variance reported | **26.0%** | 27/104 |
| Significance tests | **31.4%** | 32/102 |
| Sample size justified | **6.4%** | 7/109 |
| Effect sizes reported | **84.5%** | 87/103 |

Papers report that "X outperforms Y by 12%" (effect sizes: 84.5%) but rarely say whether that 12% is within noise (confidence intervals: 24.0%, significance tests: 31.4%, variance: 26.0%). A point estimate without uncertainty cannot distinguish a real improvement from run-to-run variation.
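To make the "within noise" question concrete, the sketch below computes a percentile bootstrap confidence interval for the difference in mean score between two systems. The per-seed scores are invented for illustration, and the bootstrap is just one common way to attach uncertainty to such a difference, not a method prescribed by the instrument.

```python
import random
import statistics


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled independently with replacement.
    Returns (observed_difference, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    diffs = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(a) - statistics.fmean(b), lower, upper


# Hypothetical accuracy over five seeds for two systems.
system_x = [0.62, 0.48, 0.71, 0.55, 0.66]
system_y = [0.50, 0.47, 0.58, 0.41, 0.52]

diff, lower, upper = bootstrap_diff_ci(system_x, system_y)
print(f"difference = {diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

If the interval includes zero, the observed gap could plausibly be seed noise. With only five seeds per system the interval is itself rough, which is one reason reporting the number of seeds matters.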

## 7. The Contamination Blindspot

| Question | Pass Rate | N |
|----------|-----------|---|
| Training cutoff stated | **16.7%** | 10/60 |
| Benchmark contamination addressed | **33.3%** | 20/60 |
| Train/test overlap discussed | **40.0%** | 24/60 |
| Temporal leakage addressed | **39.0%** | 23/59 |
| Feature leakage addressed | **20.0%** | 12/60 |
| Non-independence addressed | **9.8%** | 6/61 |
| Leakage detection method | **31.7%** | 19/60 |

Two-thirds of benchmark papers don't address contamination at all (33.3% pass). 90% don't address non-independence between train and test sets. Without these checks, results on public benchmarks are hard to interpret and may be meaningless.

## 8. Conflicts of Interest

| Question | Pass Rate | N |
|----------|-----------|---|
| Affiliations disclosed | **98.5%** | 133/135 |
| Funding disclosed | **32.6%** | 44/135 |
| Funder independent of outcome | **24.6%** | 33/134 |
| Financial interests declared | **1.5%** | 2/135 |

Everyone lists their employer. Almost nobody declares financial interests. Only a quarter of papers (33/134) establish that the funder is independent of the outcome. In a field where major companies fund research that evaluates their own products, this is a transparency crisis.

## 9. Scores by Methodology Tag

| Tag | N | Mean | Median |
|-----|---|------|--------|
| rct | 9 | 63.4% | 60.3% |
| observational | 19 | 60.9% | 62.5% |
| qualitative | 22 | 56.0% | 61.1% |
| benchmark-eval | 79 | 50.4% | 50.0% |
| theoretical | 10 | 48.1% | 49.0% |
| meta-analysis | 25 | 45.2% | 37.5% |
| case-study | 14 | 43.1% | 41.6% |

**RCTs have the highest mean** (63.4%), although the sample is small (n=9) and observational studies have the highest median (62.5%). The methodology that requires the most effort tends to produce the best quality. **Case studies have the lowest mean** (43.1%) and **meta-analyses the lowest median** (37.5%): survey papers that preach methodological rigor often don't practice it themselves.

## 10. Quality Declining Over Time

| Year | N | Mean | Median |
|------|---|------|--------|
| 2022 | 1 | 73.2% | 73.2% |
| 2023 | 14 | 49.3% | 47.1% |
| 2024 | 28 | 54.7% | 55.3% |
| 2025 | 65 | 46.6% | 47.3% |
| 2026 | 26 | 60.6% | 61.4% |

2025 papers (n=65) average 46.6%, **8 points below 2024** (54.7%). The field is growing faster than its standards. The series is not monotonic, though: 2022 is a single paper, 2023 (49.3%) sits below 2024, and the 2026 bump (60.6%, n=26) likely reflects selection bias toward higher-profile early-year papers.

## 11. What Papers Do Well

| Question | Pass Rate |
|----------|-----------|
| affiliations_disclosed | 98.5% |
| abstract_claims_supported | 97.8% |
| per_category_breakdown | 97.0% |
| negative_results_reported | 92.4% |
| failure_cases_discussed | 90.2% |
| data_collection_described | 88.6% |
| scaffolding_described | 88.4% |
| multiple_metrics | 87.6% |

Papers are good at the visible parts: disclosing affiliations, supporting abstract claims, reporting breakdowns. The failures are in the invisible infrastructure: statistical foundations, contamination controls, reproducibility details, financial transparency.

## 12. Headline Statistics for the Paper

Across 467 papers in agentic AI/LLM programming research (2022-2026), scanned by Claude Opus 4.6 with a 67-question two-field boolean instrument:

- The median paper satisfies **50%** of applicable methodological criteria
- **86%** have formal evaluations; **28%** address experimental rigor (methodological theater)
- **97%** don't declare financial interests
- **95%** don't report hyperparameter search budgets
- **91%** don't provide reproduction instructions
- **54%** report effect sizes with zero uncertainty (no CIs, no sig tests, no variance)
- **84%** of benchmark papers don't address contamination
- **84%** of code releases are "open source theater": no env or instructions
- Only **4%** achieve full reproducibility (code + data + environment + instructions)
- Contamination awareness **collapsing**: 32.5% (2024) → 12.5% (2026)
- Data leakage awareness **collapsing faster**: 24.2% (2024) → 9.3% (2026)
- Even evaluation_design is **eroding**: 88.7% (2024) → 82.3% (2026)
- One bright spot: statistical_methodology **improving** 30% → 42% (2023→2026)
- Surveys score **lowest** (43.6%): papers about rigor lack rigor
- RCTs score **highest** (64.3%): effort correlates with quality
- Funding disclosure predicts quality: **14pp gap** (60% vs 46%)
- Visibility correlates with quality: prominent papers scored higher than long tail
- Findings **stable across three checkpoints** (n=135, 271, 467), which suggests they are not sampling artifacts
