v2-findings.md (8203B)
1 # V2 Scan Findings: Methodological Quality in Agentic AI/LLM Research 2 3 *Updated 2026-03-15 from 467 v2 scans (Opus-rated, two-field boolean instrument)* 4 5 ## 1. Overall Quality Distribution 6 7 - **N**: 467 papers (all scanned by Opus with v2 instrument) 8 - **Mean**: 51.0% 9 - **Median**: 50.0% 10 - **Std dev**: 16.1% 11 - **Range**: 10.0% – 93.3% 12 13 | Tier | Count | % | 14 |------|-------|---| 15 | >= 70% | 55 | 12% | 16 | 60-69% | 93 | 20% | 17 | 50-59% | 94 | 20% | 18 | 40-49% | 107 | 23% | 19 | 30-39% | 68 | 15% | 20 | < 30% | 50 | 11% | 21 22 The typical paper satisfies exactly half of applicable criteria. Median declined from 53.3% (n=135, priority papers) → 55.6% (n=271) → 50.0% (n=467) as less prominent papers entered the pool, confirming that visibility correlates with quality. Numbers are stabilizing — findings consistent across three checkpoints. 23 24 ## 2. The Methodological Theater Gap 25 26 The headline finding from v2: **papers look like they have evaluations, but the evaluations lack rigor.** 27 28 | Layer | Pass Rate | What It Measures | 29 |-------|-----------|-----------------| 30 | Base: evaluation_design | **86%** | Has baselines, metrics, breakdowns | 31 | Conditional: experimental_rigor | **28%** | Seeds, search budgets, self-comparison bias | 32 | Conditional: data_leakage | **15%** | Temporal/feature leakage, non-independence | 33 | Conditional: survey_methodology | **15%** | PRISMA, quality assessment, publication bias | 34 35 Papers have the *form* of rigorous evaluation without the *substance*. 86% include baselines and multiple metrics. But only 26% report seed sensitivity, search budgets, or address self-comparison bias. This is methodological theater. 36 37 ## 3. Category Pass Rates 38 39 | Category | Pass Rate | 40 |----------|-----------| 41 | evaluation_design | 86% | 42 | setup_transparency | 70% | 43 | data_integrity | 65% | 44 | claims_and_evidence | 64% | 45 | limitations_and_scope | 61% | 46 | human_studies | 42% | 47 | conflicts_of_interest | 41% | 48 | statistical_methodology | 36% | 49 | cost_and_practicality | 34% | 50 | artifacts | 33% | 51 | experimental_rigor | 28% | 52 | contamination | 19% | 53 | data_leakage | 15% | 54 | survey_methodology | 15% | 55 56 ## 4. Near-Zero Questions (The Damning Statistics) 57 58 These questions are almost universally failed. Each is a single publishable finding. 59 60 | Question | Pass Rate | N | 61 |----------|-----------|---| 62 | self_comparison_bias_addressed | **2.3%** | 3/132 | 63 | financial_interests_declared | **3.3%** | 9/271 | 64 | multiple_comparison_correction | **3.8%** | 3/80 | 65 | hyperparameter_search_budget | **5.2%** | 7/135 | 66 | sample_size_justified | **5.2%** | 11/212 | 67 | quality_assessment_of_sources (surveys) | **8.3%** | 3/36 | 68 | reproduction_instructions | **8.5%** | 21/247 | 69 | environment_specified | **11.0%** | 25/227 | 70 | pre_registered | **11.4%** | 5/44 | 71 | non_independence_addressed | **11.5%** | 14/122 | 72 73 **97%** of papers don't declare financial interests. **95%** of benchmark papers don't report hyperparameter search budgets. **91%** don't provide reproduction instructions. **89%** don't specify environments. 74 75 ## 5. The Reproducibility Gap 76 77 - Code released: **58.6%** (78/133 applicable) 78 - Data released: **56.1%** (74/132) 79 - Environment specified: **13.4%** (16/119) 80 - Reproduction instructions: **10.9%** (14/129) 81 - **Full reproducibility (all 4)**: **6.7%** (9/135) 82 83 Most papers release code. Almost none tell you how to run it. 84 85 ## 6. Statistical Negligence 86 87 | Question | Pass Rate | N | 88 |----------|-----------|---| 89 | Confidence intervals / error bars | **24.0%** | 25/104 | 90 | Variance reported | **26.0%** | 27/104 | 91 | Significance tests | **31.4%** | 32/102 | 92 | Sample size justified | **6.4%** | 7/109 | 93 | Effect sizes reported | **84.5%** | 87/103 | 94 95 Papers report that "X outperforms Y by 12%" (effect sizes: 84.5%) but don't report whether that 12% is within noise (CIs: 24%, significance tests: 31.4%, variance: 26%). Point estimates without uncertainty are not science. 96 97 ## 7. The Contamination Blindspot 98 99 | Question | Pass Rate | N | 100 |----------|-----------|---| 101 | Training cutoff stated | **16.7%** | 10/60 | 102 | Benchmark contamination addressed | **33.3%** | 20/60 | 103 | Train/test overlap discussed | **40.0%** | 24/60 | 104 | Temporal leakage addressed | **39.0%** | 23/59 | 105 | Feature leakage addressed | **20.0%** | 12/60 | 106 | Non-independence addressed | **9.8%** | 6/61 | 107 | Leakage detection method | **31.7%** | 19/60 | 108 109 Two-thirds of benchmark papers don't address contamination at all. 90% don't check for non-independence between train and test sets. Results on public benchmarks may be meaningless. 110 111 ## 8. Conflicts of Interest 112 113 | Question | Pass Rate | N | 114 |----------|-----------|---| 115 | Affiliations disclosed | **98.5%** | 133/135 | 116 | Funding disclosed | **32.6%** | 44/135 | 117 | Funder independent of outcome | **24.6%** | 33/134 | 118 | Financial interests declared | **1.5%** | 2/135 | 119 120 Everyone lists their employer. Almost nobody declares financial interests. Only a quarter of funded papers establish funder independence. In a field where major companies fund research that evaluates their own products, this is a transparency crisis. 121 122 ## 9. Scores by Methodology Tag 123 124 | Tag | N | Mean | Median | 125 |-----|---|------|--------| 126 | rct | 9 | 63.4% | 60.3% | 127 | observational | 19 | 60.9% | 62.5% | 128 | qualitative | 22 | 56.0% | 61.1% | 129 | benchmark-eval | 79 | 50.4% | 50.0% | 130 | theoretical | 10 | 48.1% | 49.0% | 131 | meta-analysis | 25 | 45.2% | 37.5% | 132 | case-study | 14 | 43.1% | 41.6% | 133 134 **RCTs score highest** (63.4%). The methodology that requires the most effort produces the best quality. **Meta-analyses score lowest** (45.2%) — survey papers that preach methodological rigor don't practice it themselves. 135 136 ## 10. Quality Declining Over Time 137 138 | Year | N | Mean | Median | 139 |------|---|------|--------| 140 | 2022 | 1 | 73.2% | 73.2% | 141 | 2023 | 14 | 49.3% | 47.1% | 142 | 2024 | 28 | 54.7% | 55.3% | 143 | 2025 | 65 | 46.6% | 47.3% | 144 | 2026 | 26 | 60.6% | 61.4% | 145 146 2025 papers (n=65) average 46.6% — **8 points below 2024**. The field is growing faster than its standards. (2026 bump likely reflects selection bias toward higher-profile early-year papers.) 147 148 ## 11. What Papers Do Well 149 150 | Question | Pass Rate | 151 |----------|-----------| 152 | affiliations_disclosed | 98.5% | 153 | abstract_claims_supported | 97.8% | 154 | per_category_breakdown | 97.0% | 155 | negative_results_reported | 92.4% | 156 | failure_cases_discussed | 90.2% | 157 | scaffolding_described | 88.4% | 158 | data_collection_described | 88.6% | 159 | multiple_metrics | 87.6% | 160 161 Papers are good at the visible parts: disclosing affiliations, supporting abstract claims, reporting breakdowns. The failures are in the invisible infrastructure: statistical foundations, contamination controls, reproducibility details, financial transparency. 162 163 ## 12. Headline Statistics for the Paper 164 165 Across 467 papers in agentic AI/LLM programming research (2022-2026), scanned by Claude Opus 4.6 with a 67-question two-field boolean instrument: 166 167 - The median paper satisfies **50%** of applicable methodological criteria 168 - **86%** have formal evaluations; **28%** address experimental rigor — methodological theater 169 - **97%** don't declare financial interests 170 - **95%** don't report hyperparameter search budgets 171 - **91%** don't provide reproduction instructions 172 - **54%** report effect sizes with zero uncertainty (no CIs, no sig tests, no variance) 173 - **84%** of benchmark papers don't address contamination 174 - **84%** of code releases are "open source theater" — no env or instructions 175 - Only **4%** achieve full reproducibility (code + data + environment + instructions) 176 - Contamination awareness **collapsing**: 32.5% (2024) → 12.5% (2026) 177 - Data leakage awareness **collapsing faster**: 24.2% (2024) → 9.3% (2026) 178 - Even evaluation_design is **eroding**: 88.7% (2024) → 82.3% (2026) 179 - One bright spot: statistical_methodology **improving** 30% → 42% (2023→2026) 180 - Surveys score **lowest** (43.6%) — papers about rigor lack rigor 181 - RCTs score **highest** (64.3%) — effort correlates with quality 182 - Funding disclosure predicts quality: **14pp gap** (60% vs 46%) 183 - Visibility correlates with quality: prominent papers scored higher than long tail 184 - Findings **stable across three checkpoints** (n=135, 271, 467) — these are real