# Early Findings Report: Methodological Quality in Agentic AI/LLM Programming Research

*Generated 2026-03-08 from 561 scanned papers (~21% of registry)*

## 1. Overall Quality Distribution

- **Mean composite score**: 50.9%
- **Median**: 52.8%
- **Std dev**: 17.4%
- **Range**: 2.4% – 87.8%
- **Q25–Q75**: 41.5% – 62.5%

### Score Distribution (histogram)
| Bucket | Count | Bar (one █ ≈ 5 papers) |
|--------|-------|-----|
| 0-10% | 13 | ███ |
| 10-20% | 18 | ████ |
| 20-30% | 43 | █████████ |
| 30-40% | 51 | ██████████ |
| 40-50% | 105 | █████████████████████ |
| 50-60% | 156 | ███████████████████████████████ |
| 60-70% | 103 | █████████████████████ |
| 70-80% | 53 | ███████████ |
| 80-90% | 19 | ████ |

## 2. Per-Category Scores

| Category | Mean | Median | Std | N |
|----------|------|--------|-----|---|
| contamination | 20.5% | 0.0% | 33.5% | 328 |
| human_studies | 25.6% | 20.0% | 23.0% | 80 |
| cost_and_practicality | 27.0% | 0.0% | 37.0% | 471 |
| statistical_methodology | 27.3% | 20.0% | 24.2% | 490 |
| artifacts | 33.0% | 25.0% | 26.5% | 548 |
| conflicts_of_interest | 41.9% | 25.0% | 23.8% | 561 |
| limitations_and_scope | 52.2% | 66.7% | 41.6% | 561 |
| claims_and_evidence | 57.4% | 50.0% | 29.9% | 561 |
| setup_transparency | 61.9% | 60.0% | 31.1% | 539 |
| data_integrity | 65.5% | 66.7% | 31.9% | 542 |
| evaluation_design | 79.2% | 87.5% | 24.1% | 556 |

## 3. Most Commonly Failed Questions (Lowest Compliance)

| Question | Applies | Compliance | Yes/Applicable |
|----------|---------|------------|----------------|
| human_studies.pre_registered | 79 (14%) | 1.3% | 1/79 |
| conflicts_of_interest.financial_interests_declared | 561 (100%) | 4.1% | 23/561 |
| statistical_methodology.sample_size_justified | 488 (87%) | 4.5% | 22/488 |
| artifacts.reproduction_instructions | 534 (95%) | 5.6% | 30/534 |
| contamination.training_cutoff_stated | 327 (58%) | 7.0% | 23/327 |
| artifacts.environment_specified | 511 (91%) | 9.4% | 48/511 |
| human_studies.irb_or_ethics_approval | 79 (14%) | 12.7% | 10/79 |
| statistical_methodology.significance_tests | 474 (84%) | 19.4% | 92/474 |
| human_studies.attrition_reported | 77 (14%) | 19.5% | 15/77 |
| cost_and_practicality.compute_budget_stated | 470 (84%) | 20.4% | 96/470 |
| statistical_methodology.confidence_intervals_or_error_bars | 485 (86%) | 21.0% | 102/485 |
| statistical_methodology.variance_reported | 482 (86%) | 21.6% | 104/482 |
| contamination.benchmark_contamination_addressed | 323 (58%) | 25.4% | 82/323 |
| human_studies.inclusion_exclusion_criteria | 80 (14%) | 28.7% | 23/80 |
| contamination.train_test_overlap_discussed | 328 (58%) | 29.0% | 95/328 |

## 4. Most Commonly Passed Questions (Highest Compliance)

| Question | Applies | Compliance | Yes/Applicable |
|----------|---------|------------|----------------|
| evaluation_design.baselines_contemporary | 507 (90%) | 78.7% | 399/507 |
| evaluation_design.failure_cases_discussed | 555 (99%) | 82.9% | 460/555 |
| evaluation_design.multiple_metrics | 491 (88%) | 84.5% | 415/491 |
| data_integrity.data_collection_described | 541 (96%) | 86.7% | 469/541 |
| evaluation_design.baselines_included | 521 (93%) | 87.9% | 458/521 |
| evaluation_design.negative_results_reported | 545 (97%) | 88.4% | 482/545 |
| evaluation_design.per_category_breakdown | 539 (96%) | 89.2% | 481/539 |
| claims_and_evidence.abstract_claims_supported | 561 (100%) | 91.4% | 513/561 |
| setup_transparency.scaffolding_described | 238 (42%) | 94.1% | 224/238 |
| conflicts_of_interest.affiliations_disclosed | 561 (100%) | 98.9% | 555/561 |

## 5. Lowest-Scoring Papers

| Rank | Score | Slug | Title |
|------|-------|------|-------|
| 1 | 2.4% | automating-rest-api-2024 | Automating REST API Postman Test Cases Using LLM |
| 2 | 2.6% | chatofthought-collaborative-multiagent-2025 | Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific |
| 3 | 2.7% | aidriven-software-engineering-2023 | AI-driven software engineering |
| 4 | 4.3% | aiassisted-code-editors-2025 | AI-Assisted Code Editors with Real-Time Collaboration: A Comprehensive Review |
| 5 | 5.3% | ai-agents-software-2025 | AI Agents in Software Engineering Optimizing Software Development Processes and |
| 6 | 5.3% | attacking-llms-ai-2025 | Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Lang |
| 7 | 5.6% | breaking-prompt-wall-2025 | Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via L |
| 8 | 6.7% | agentic-ai-software-2025 | Agentic AI for Software: thoughts from Software Engineering community |
| 9 | 7.7% | aipowered-code-review-2024 | AI-powered Code Review with LLMs: Early Results |
| 10 | 7.7% | aipowered-software-development-2025-2 | AI-Powered Software Development Life Cycle: From Requirements to Maintenance |

## 6. Highest-Scoring Papers

| Rank | Score | Slug | Title |
|------|-------|------|-------|
| 1 | 87.8% | assessing-latent-automated-2024 | Assessing the Latent Automated Program Repair Capabilities of Large Language Mod |
| 2 | 87.2% | annotation-alignment-comparing-2024 | Annotation alignment: Comparing LLM and human annotations of conversational safe |
| 3 | 87.0% | bridging-mde-ai-2023 | Bridging MDE and AI: A Systematic Review of Domain-Specific Languages and Model- |
| 4 | 86.8% | attention-pruning-automated-2025 | Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Si |
| 5 | 86.1% | agentdojo-dynamic-environment-2024 | AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defens |
| 6 | 85.7% | alphacode-competition-level-2022 | Competition-Level Code Generation with AlphaCode |
| 7 | 85.7% | appworld-controllable-world-2024 | AppWorld: A Controllable World of Apps and People for Benchmarking Interactive C |
| 8 | 84.6% | ai-alignment-strategies-2025 | AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms o |
| 9 | 84.6% | ai-testing-should-2025 | AI Testing Should Account for Sophisticated Strategic Behaviour |
| 10 | 83.3% | concrete-roadmap-safety-2025 | A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring |

## 7. Methodology Tag Distribution

Tags are not mutually exclusive: the counts sum to 757 across 561 papers.

| Tag | Count | % |
|-----|-------|---|
| benchmark-eval | 424 | 75.6% |
| case-study | 115 | 20.5% |
| qualitative | 68 | 12.1% |
| theoretical | 62 | 11.1% |
| observational | 50 | 8.9% |
| meta-analysis | 31 | 5.5% |
| rct | 7 | 1.2% |

## 8. Claim Support Distribution

| Support Level | Count | % |
|---------------|-------|---|
| moderate | 1243 | 43.1% |
| strong | 1163 | 40.3% |
| weak | 404 | 14.0% |
| unsupported | 75 | 2.6% |

## 9. Scores by Year

| Year | N | Mean | Median |
|------|---|------|--------|
| 2017 | 1 | 52.8% | 52.8% |
| 2018 | 1 | 55.0% | 55.0% |
| 2019 | 1 | 64.7% | 64.7% |
| 2020 | 1 | 52.5% | 52.5% |
| 2021 | 1 | 70.7% | 70.7% |
| 2022 | 18 | 62.4% | 65.8% |
| 2023 | 61 | 50.7% | 52.5% |
| 2024 | 129 | 50.8% | 52.8% |
| 2025 | 282 | 49.5% | 51.8% |
| 2026 | 66 | 53.7% | 55.8% |

## 10. Scores by Methodology Tag

| Tag | N | Mean | Median |
|-----|---|------|--------|
| meta-analysis | 31 | 40.8% | 35.1% |
| case-study | 115 | 41.3% | 41.7% |
| qualitative | 68 | 44.7% | 50.0% |
| theoretical | 62 | 51.0% | 55.6% |
| rct | 7 | 52.7% | 48.7% |
| benchmark-eval | 424 | 53.5% | 53.8% |
| observational | 50 | 57.7% | 59.7% |

## 11. Code Released vs. Quality

- Papers with code released: **295** (mean score: 53.7%)
- Papers without code released: **252** (mean score: 48.5%)
- **Difference: +5.2 percentage points**

### Code Released vs. Per-Category Scores
| Category | With Code | Without Code | Diff |
|----------|-----------|--------------|------|
| data_integrity | 79.3% | 50.1% | +29.2pp |
| setup_transparency | 69.7% | 52.9% | +16.9pp |
| contamination | 24.3% | 12.8% | +11.5pp |
| claims_and_evidence | 62.5% | 51.1% | +11.4pp |
| cost_and_practicality | 31.4% | 20.4% | +11.0pp |
| limitations_and_scope | 57.6% | 46.8% | +10.8pp |
| conflicts_of_interest | 46.3% | 35.9% | +10.4pp |
| evaluation_design | 83.1% | 74.9% | +8.2pp |
| statistical_methodology | 30.2% | 23.2% | +7.0pp |
| human_studies | 27.1% | 23.8% | +3.3pp |

## 12. Red Flags Distribution

Near-duplicate labels (for example "No statistical significance tests" and "No statistical significance testing") are listed as recorded and have not been merged.

| Flag | Count |
|------|-------|
| No limitations section | 121 |
| No uncertainty quantification | 59 |
| No statistical significance tests | 50 |
| No statistical significance testing | 41 |
| Benchmark contamination risk unaddressed | 36 |
| No contamination analysis | 27 |
| Benchmark contamination not addressed | 24 |
| Contamination risk unaddressed | 19 |
| No variance or uncertainty quantification | 18 |
| Missing hyperparameters | 18 |
| No statistical uncertainty quantification | 17 |
| No statistical rigor | 15 |
| No funding disclosure | 14 |
| No ablation study | 14 |
| Company evaluating its own product | 13 |
| Company evaluating own product | 13 |
| No systematic review methodology | 12 |
| Benchmark contamination risk | 11 |
| No code or data release | 11 |
| No quantitative evaluation | 10 |

## 13. Research Paper Ideas

### Surprising / Counterintuitive Findings

1. **Code release rate is 54%** (295 of the 547 papers counted in Section 11), lower than expected given open-science norms.
2. **Only 21% of applicable papers report confidence intervals or error bars**: most results are bare point estimates with no uncertainty quantification.
3. **Statistical significance tests are used in only 19% of papers making comparative claims**: papers claim 'X outperforms Y' based on eyeballing two numbers.
4. **Benchmark contamination is addressed by only 25% of applicable papers**: the elephant in the room that most papers ignore.
5. **Pre-registration rate: 1%** (1/79) among papers with human participants. It is standard in medicine but nearly absent in AI/SE research.
6. **Prompts are provided in only 65% of applicable papers**: prompt-based research that does not release the actual prompts is unreproducible.
7. **Generalization is bounded in only 36% of papers**: a typical case is testing on Python + GPT-4 but claiming results for 'code generation' broadly.
8. **Funding is disclosed in only 33% of applicable papers**; financial interests are declared in only 4%.

### Strong Patterns

- **Code release correlates with higher quality**: +5.2pp overall score gap between papers that release code (53.7%) and those that don't (48.5%).
- **Year trend**: the mean score falls from about 53.7% (<=2023, N=84) to about 50.3% (>=2025, N=348) when every paper is weighted equally (recomputed from the rounded per-year means in Section 9). The 58.4% and 51.6% figures come from averaging the per-year means without weighting, which gives each single-paper year from 2017 to 2021 as much weight as 2025 (N=282). The decline is therefore closer to 3pp than 7pp, and the 2026 mean (53.7%) is above both 2024 and 2025. The pattern is consistent with the field growing faster than its rigor, but the trend is modest.
- **Median score is 52.8%**: the typical paper in this space satisfies about half of applicable methodological criteria.
- **RCTs score 52.7% vs. benchmark-evals at 53.5%**: benchmark papers score on par with RCTs, possibly because the checklist favors artifact-heavy work. With only 7 RCTs in the sample, this comparison is tentative.

### Publication-Worthy Statistics

- Across 561 papers in agentic AI/LLM programming (2017-2026):
  - 54% release source code
  - 64% release data
  - 21% report confidence intervals
  - 19% use significance tests
  - 25% address benchmark contamination
  - 36% bound their generalization claims
  - 33% disclose funding
  - 1% pre-register (among human studies)

### Potential Narrative Threads

1. **The reproducibility gap**: Most papers don't release enough to reproduce (code + data + environment + instructions). What fraction hits all four?
   - **Answer: 18/510 (4%)** of papers where all four artifact criteria apply satisfy all four.
2. **The Wakefield parallel**: How many papers have undisclosed conflicts of interest, such as evaluating their employer's product without acknowledging the conflict?
3. **Statistical theater**: Papers report numbers that look precise but have no statistical foundation: no CIs, no significance tests, no variance across runs.
4. **The contamination blind spot**: Papers evaluate LLMs on public benchmarks without any contamination analysis. Results may be meaningless.
5. **Overclaiming**: Papers test narrow configurations but make broad claims in titles and abstracts.
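
The proportions quoted in the threads above, such as 18/510 for the reproducibility gap, are themselves bare point estimates. A minimal sketch of how an interval could be attached to them, assuming a Wilson score interval at the 95% level is acceptable for this purpose; the function name `wilson_interval` is illustrative and not part of the scanning pipeline:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage (an assumed choice).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Counts taken from this report: all-four-artifacts and pre-registration.
    for label, yes, applicable in [
        ("all four artifact criteria", 18, 510),
        ("pre-registered", 1, 79),
    ]:
        low, high = wilson_interval(yes, applicable)
        print(f"{label}: {yes}/{applicable} = {yes / applicable:.1%} "
              f"(95% CI {low:.1%} to {high:.1%})")
```

The Wilson interval is used here instead of the simpler normal approximation because several of the rates in this report sit close to 0%, where the normal approximation can produce a lower bound below zero.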

### Subgroup Comparisons to Investigate Further

- Do papers from top venues score higher? (requires venue data)
- Do papers with more authors score higher?
- Do industry and academic papers differ?
- Do survey/meta-analysis papers score higher on methodology?
- Is there a correlation between number of red flags and composite score?
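
For the two-group questions in this list (for example industry vs. academic), one option is a permutation test on the difference in mean composite score. The sketch below is only an illustration under stated assumptions: the function name `permutation_test`, the choice of test, and the example scores are all hypothetical and are not drawn from the registry.

```python
import random


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value). The p-value uses the
    (extreme + 1) / (n_perm + 1) correction so it is never exactly zero.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        # Small tolerance guards against floating-point summation order.
        if abs(mean_diff(pooled)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical composite scores (percent), for illustration only.
    industry = [48.0, 55.5, 41.2, 60.1, 52.3, 39.8]
    academic = [57.4, 62.0, 49.9, 66.3, 58.8, 54.1]
    diff, p = permutation_test(industry, academic)
    print(f"mean difference: {diff:+.1f}pp, permutation p-value: {p:.3f}")
```

A permutation test makes no normality assumption, which suits composite scores that are bounded between 0% and 100%. The count-based questions (author count, number of red flags) would need a correlation measure instead.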