ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

early-findings.md (13826B)


# Early Findings Report: Methodological Quality in Agentic AI/LLM Programming Research

*Generated 2026-03-08 from 561 scanned papers (~21% of registry)*

## 1. Overall Quality Distribution

- **Mean composite score**: 50.9%
- **Median**: 52.8%
- **Std dev**: 17.4%
- **Range**: 2.4% – 87.8%
- **Q25–Q75**: 41.5% – 62.5%

### Score Distribution (histogram)
| Bucket | Count | Bar (█ ≈ 5 papers) |
|--------|-------|--------------------|
| 0-10% | 13 | ███ |
| 10-20% | 18 | ████ |
| 20-30% | 43 | █████████ |
| 30-40% | 51 | ██████████ |
| 40-50% | 105 | █████████████████████ |
| 50-60% | 156 | ███████████████████████████████ |
| 60-70% | 103 | █████████████████████ |
| 70-80% | 53 | ███████████ |
| 80-90% | 19 | ████ |
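
As a rough consistency check on the figures above, the mean can be re-estimated from the bucket counts alone. The sketch below assumes every paper sits at the midpoint of its 10-point bucket, which is an approximation and not how the report computed its mean; `grouped_mean` is an illustrative name, not part of the survey tooling.

```python
# (bucket lower edge in %, paper count), copied from the histogram table.
buckets = [(0, 13), (10, 18), (20, 43), (30, 51), (40, 105),
           (50, 156), (60, 103), (70, 53), (80, 19)]
BUCKET_WIDTH = 10


def grouped_mean(buckets, width):
    """Estimate the mean by placing every paper at its bucket midpoint."""
    total = sum(count for _, count in buckets)
    weighted = sum((low + width / 2) * count for low, count in buckets)
    return weighted / total, total


mean, total = grouped_mean(buckets, BUCKET_WIDTH)
print(f"papers: {total}, midpoint-estimated mean: {mean:.1f}%")
```

The bucket counts sum to 561, and the midpoint estimate comes out at about 51.3%, within half a point of the reported 50.9%.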

## 2. Per-Category Scores

| Category | Mean | Median | Std | N |
|----------|------|--------|-----|---|
| contamination | 20.5% | 0.0% | 33.5% | 328 |
| human_studies | 25.6% | 20.0% | 23.0% | 80 |
| cost_and_practicality | 27.0% | 0.0% | 37.0% | 471 |
| statistical_methodology | 27.3% | 20.0% | 24.2% | 490 |
| artifacts | 33.0% | 25.0% | 26.5% | 548 |
| conflicts_of_interest | 41.9% | 25.0% | 23.8% | 561 |
| limitations_and_scope | 52.2% | 66.7% | 41.6% | 561 |
| claims_and_evidence | 57.4% | 50.0% | 29.9% | 561 |
| setup_transparency | 61.9% | 60.0% | 31.1% | 539 |
| data_integrity | 65.5% | 66.7% | 31.9% | 542 |
| evaluation_design | 79.2% | 87.5% | 24.1% | 556 |

## 3. Most Commonly Failed Questions (Lowest Compliance)

| Question | Applies | Compliance | Yes/Applicable |
|----------|---------|------------|----------------|
| human_studies.pre_registered | 79 (14%) | 1.3% | 1/79 |
| conflicts_of_interest.financial_interests_declared | 561 (100%) | 4.1% | 23/561 |
| statistical_methodology.sample_size_justified | 488 (87%) | 4.5% | 22/488 |
| artifacts.reproduction_instructions | 534 (95%) | 5.6% | 30/534 |
| contamination.training_cutoff_stated | 327 (58%) | 7.0% | 23/327 |
| artifacts.environment_specified | 511 (91%) | 9.4% | 48/511 |
| human_studies.irb_or_ethics_approval | 79 (14%) | 12.7% | 10/79 |
| statistical_methodology.significance_tests | 474 (84%) | 19.4% | 92/474 |
| human_studies.attrition_reported | 77 (14%) | 19.5% | 15/77 |
| cost_and_practicality.compute_budget_stated | 470 (84%) | 20.4% | 96/470 |
| statistical_methodology.confidence_intervals_or_error_bars | 485 (86%) | 21.0% | 102/485 |
| statistical_methodology.variance_reported | 482 (86%) | 21.6% | 104/482 |
| contamination.benchmark_contamination_addressed | 323 (58%) | 25.4% | 82/323 |
| human_studies.inclusion_exclusion_criteria | 80 (14%) | 28.7% | 23/80 |
| contamination.train_test_overlap_discussed | 328 (58%) | 29.0% | 95/328 |
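
Several of these rates rest on small denominators (for example 1/79), so an interval says more than the point estimate. A minimal sketch using the Wilson score interval, with counts copied from the table; choosing Wilson over other interval methods is an assumption here, not something the report specifies, and `wilson_interval` is an illustrative helper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# (yes, applicable) pairs copied from three rows of the table.
rows = {
    "human_studies.pre_registered": (1, 79),
    "conflicts_of_interest.financial_interests_declared": (23, 561),
    "statistical_methodology.significance_tests": (92, 474),
}
for question, (yes, applicable) in rows.items():
    lo, hi = wilson_interval(yes, applicable)
    print(f"{question}: {yes / applicable:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

For the pre-registration row, the interval runs from roughly 0.2% to 6.8%: the rate is clearly low, but "1%" is less precise than it looks.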

## 4. Most Commonly Passed Questions (Highest Compliance)

| Question | Applies | Compliance | Yes/Applicable |
|----------|---------|------------|----------------|
| evaluation_design.baselines_contemporary | 507 (90%) | 78.7% | 399/507 |
| evaluation_design.failure_cases_discussed | 555 (99%) | 82.9% | 460/555 |
| evaluation_design.multiple_metrics | 491 (88%) | 84.5% | 415/491 |
| data_integrity.data_collection_described | 541 (96%) | 86.7% | 469/541 |
| evaluation_design.baselines_included | 521 (93%) | 87.9% | 458/521 |
| evaluation_design.negative_results_reported | 545 (97%) | 88.4% | 482/545 |
| evaluation_design.per_category_breakdown | 539 (96%) | 89.2% | 481/539 |
| claims_and_evidence.abstract_claims_supported | 561 (100%) | 91.4% | 513/561 |
| setup_transparency.scaffolding_described | 238 (42%) | 94.1% | 224/238 |
| conflicts_of_interest.affiliations_disclosed | 561 (100%) | 98.9% | 555/561 |

## 5. Lowest-Scoring Papers

| Rank | Score | Slug | Title |
|------|-------|------|-------|
| 1 | 2.4% | automating-rest-api-2024 | Automating REST API Postman Test Cases Using LLM |
| 2 | 2.6% | chatofthought-collaborative-multiagent-2025 | Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific |
| 3 | 2.7% | aidriven-software-engineering-2023 | AI-driven software engineering |
| 4 | 4.3% | aiassisted-code-editors-2025 | AI-Assisted Code Editors with Real-Time Collaboration: A Comprehensive Review |
| 5 | 5.3% | ai-agents-software-2025 | AI Agents in Software Engineering Optimizing Software Development Processes and |
| 6 | 5.3% | attacking-llms-ai-2025 | Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Lang |
| 7 | 5.6% | breaking-prompt-wall-2025 | Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via L |
| 8 | 6.7% | agentic-ai-software-2025 | Agentic AI for Software: thoughts from Software Engineering community |
| 9 | 7.7% | aipowered-code-review-2024 | AI-powered Code Review with LLMs: Early Results |
| 10 | 7.7% | aipowered-software-development-2025-2 | AI-Powered Software Development Life Cycle: From Requirements to Maintenance |

## 6. Highest-Scoring Papers

| Rank | Score | Slug | Title |
|------|-------|------|-------|
| 1 | 87.8% | assessing-latent-automated-2024 | Assessing the Latent Automated Program Repair Capabilities of Large Language Mod |
| 2 | 87.2% | annotation-alignment-comparing-2024 | Annotation alignment: Comparing LLM and human annotations of conversational safe |
| 3 | 87.0% | bridging-mde-ai-2023 | Bridging MDE and AI: A Systematic Review of Domain-Specific Languages and Model- |
| 4 | 86.8% | attention-pruning-automated-2025 | Attention Pruning: Automated Fairness Repair of Language Models via Surrogate Si |
| 5 | 86.1% | agentdojo-dynamic-environment-2024 | AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defens |
| 6 | 85.7% | alphacode-competition-level-2022 | Competition-Level Code Generation with AlphaCode |
| 7 | 85.7% | appworld-controllable-world-2024 | AppWorld: A Controllable World of Apps and People for Benchmarking Interactive C |
| 8 | 84.6% | ai-alignment-strategies-2025 | AI Alignment Strategies from a Risk Perspective: Independent Safety Mechanisms o |
| 9 | 84.6% | ai-testing-should-2025 | AI Testing Should Account for Sophisticated Strategic Behaviour |
| 10 | 83.3% | concrete-roadmap-safety-2025 | A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring |

## 7. Methodology Tag Distribution

| Tag | Count | % of papers |
|-----|-------|-------------|
| benchmark-eval | 424 | 75.6% |
| case-study | 115 | 20.5% |
| qualitative | 68 | 12.1% |
| theoretical | 62 | 11.1% |
| observational | 50 | 8.9% |
| meta-analysis | 31 | 5.5% |
| rct | 7 | 1.2% |

## 8. Claim Support Distribution

| Support Level | Count | % of claims |
|---------------|-------|-------------|
| moderate | 1243 | 43.1% |
| strong | 1163 | 40.3% |
| weak | 404 | 14.0% |
| unsupported | 75 | 2.6% |

## 9. Scores by Year

| Year | N | Mean | Median |
|------|---|------|--------|
| 2017 | 1 | 52.8% | 52.8% |
| 2018 | 1 | 55.0% | 55.0% |
| 2019 | 1 | 64.7% | 64.7% |
| 2020 | 1 | 52.5% | 52.5% |
| 2021 | 1 | 70.7% | 70.7% |
| 2022 | 18 | 62.4% | 65.8% |
| 2023 | 61 | 50.7% | 52.5% |
| 2024 | 129 | 50.8% | 52.8% |
| 2025 | 282 | 49.5% | 51.8% |
| 2026 | 66 | 53.7% | 55.8% |
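
Because five of these years contain a single paper, how the yearly rows are pooled matters. The sketch below pools the table two ways, as a plain mean of yearly means and as a paper-weighted mean; `pooled_means` is an illustrative helper, and the weighted figures are approximate because they are rebuilt from the rounded yearly means rather than from per-paper scores.

```python
# (year, N, mean score in %), copied from the Scores by Year table.
by_year = [
    (2017, 1, 52.8), (2018, 1, 55.0), (2019, 1, 64.7), (2020, 1, 52.5),
    (2021, 1, 70.7), (2022, 18, 62.4), (2023, 61, 50.7), (2024, 129, 50.8),
    (2025, 282, 49.5), (2026, 66, 53.7),
]


def pooled_means(rows):
    """Return (mean of yearly means, paper-weighted mean) for the rows."""
    unweighted = sum(mean for _, _, mean in rows) / len(rows)
    papers = sum(n for _, n, _ in rows)
    weighted = sum(n * mean for _, n, mean in rows) / papers
    return unweighted, weighted


early = [r for r in by_year if r[0] <= 2023]
late = [r for r in by_year if r[0] >= 2025]
for label, rows in (("<=2023", early), (">=2025", late)):
    unweighted, weighted = pooled_means(rows)
    print(f"{label}: mean of yearly means {unweighted:.1f}%, "
          f"paper-weighted {weighted:.1f}%")
```

The mean of yearly means gives 58.4% and 51.6%; weighting by paper count gives about 53.7% and 50.3%, so the apparent drop roughly halves once single-paper years stop counting as much as years with dozens of papers.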

## 10. Scores by Methodology Tag

| Tag | N | Mean | Median |
|-----|---|------|--------|
| meta-analysis | 31 | 40.8% | 35.1% |
| case-study | 115 | 41.3% | 41.7% |
| qualitative | 68 | 44.7% | 50.0% |
| theoretical | 62 | 51.0% | 55.6% |
| rct | 7 | 52.7% | 48.7% |
| benchmark-eval | 424 | 53.5% | 53.8% |
| observational | 50 | 57.7% | 59.7% |

## 11. Code Released vs. Quality

- Papers with code released: **295** (mean score: 53.7%)
- Papers without code released: **252** (mean score: 48.5%)
- **Difference: +5.2 percentage points**
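
Whether a gap of this size could arise by chance can be checked with a permutation test on the per-paper scores. The sketch below is one way to do that under stated assumptions: it is not the survey's own analysis, `permutation_test` is an illustrative helper, and the two score lists are made-up placeholders because the report does not include per-paper data.

```python
import random


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed mean(a) - mean(b), estimated p-value).
    """
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = sum(left) / len(left) - sum(right) / len(right)
        # Small tolerance so float noise does not drop exact ties.
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_resamples + 1)


# Hypothetical scores for illustration only; NOT data from the survey.
with_code = [62.0, 55.5, 71.0, 48.0, 66.5, 59.0]
without_code = [51.0, 44.5, 58.0, 39.0, 61.5, 47.0]
diff, p = permutation_test(with_code, without_code)
print(f"difference: {diff:+.1f}pp, p = {p:.3f}")
```

With the real per-paper scores substituted for the placeholder lists, the same call would report the observed gap alongside an estimated p-value.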

### Code Released vs. Per-Category Scores
| Category | With Code | Without Code | Diff |
|----------|-----------|--------------|------|
| data_integrity | 79.3% | 50.1% | +29.2pp |
| setup_transparency | 69.7% | 52.9% | +16.9pp |
| contamination | 24.3% | 12.8% | +11.5pp |
| claims_and_evidence | 62.5% | 51.1% | +11.4pp |
| cost_and_practicality | 31.4% | 20.4% | +11.0pp |
| limitations_and_scope | 57.6% | 46.8% | +10.8pp |
| conflicts_of_interest | 46.3% | 35.9% | +10.4pp |
| evaluation_design | 83.1% | 74.9% | +8.2pp |
| statistical_methodology | 30.2% | 23.2% | +7.0pp |
| human_studies | 27.1% | 23.8% | +3.3pp |

## 12. Red Flags Distribution

| Flag | Count |
|------|-------|
| No limitations section | 121 |
| No uncertainty quantification | 59 |
| No statistical significance tests | 50 |
| No statistical significance testing | 41 |
| Benchmark contamination risk unaddressed | 36 |
| No contamination analysis | 27 |
| Benchmark contamination not addressed | 24 |
| Contamination risk unaddressed | 19 |
| No variance or uncertainty quantification | 18 |
| Missing hyperparameters | 18 |
| No statistical uncertainty quantification | 17 |
| No statistical rigor | 15 |
| No funding disclosure | 14 |
| No ablation study | 14 |
| Company evaluating its own product | 13 |
| Company evaluating own product | 13 |
| No systematic review methodology | 12 |
| Benchmark contamination risk | 11 |
| No code or data release | 11 |
| No quantitative evaluation | 10 |
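
Many of these rows are free-text variants of the same flag (for example "No statistical significance tests" and "No statistical significance testing"), which splits their counts. A minimal sketch of merging them with keyword rules follows; the rules and group names are hypothetical, not part of the survey tooling, and summing counts assumes no paper carries two variants of the same flag, so merged totals are at best an upper bound on papers affected.

```python
from collections import Counter

# (label, count) pairs copied from the Red Flags Distribution table.
raw_flags = [
    ("No limitations section", 121),
    ("No uncertainty quantification", 59),
    ("No statistical significance tests", 50),
    ("No statistical significance testing", 41),
    ("Benchmark contamination risk unaddressed", 36),
    ("No contamination analysis", 27),
    ("Benchmark contamination not addressed", 24),
    ("Contamination risk unaddressed", 19),
    ("No variance or uncertainty quantification", 18),
    ("Missing hyperparameters", 18),
    ("No statistical uncertainty quantification", 17),
    ("No statistical rigor", 15),
    ("No funding disclosure", 14),
    ("No ablation study", 14),
    ("Company evaluating its own product", 13),
    ("Company evaluating own product", 13),
    ("No systematic review methodology", 12),
    ("Benchmark contamination risk", 11),
    ("No code or data release", 11),
    ("No quantitative evaluation", 10),
]

# Hypothetical keyword rules (assumption); the first matching rule wins.
RULES = [
    ("contamination", "contamination risk"),
    ("significance", "no significance tests"),
    ("uncertainty", "no uncertainty quantification"),
    ("own product", "company evaluating own product"),
]


def canonical(label):
    """Map a free-text flag to a canonical group, or keep it lowercased."""
    lowered = label.lower()
    for keyword, name in RULES:
        if keyword in lowered:
            return name
    return lowered


merged = Counter()
for label, count in raw_flags:
    merged[canonical(label)] += count

for name, count in merged.most_common(5):
    print(f"{name}: {count}")
```

Under these rules the contamination variants total 117 occurrences, which puts contamination close behind "No limitations section" (121) instead of scattered across five smaller rows.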

## 13. Research Paper Ideas

### Surprising / Counterintuitive Findings

1. **Code release rate is 54%** (295 of the 547 papers counted in Section 11), lower than expected given open-science norms.
2. **Only 21% of applicable papers report confidence intervals or error bars** (102/485): most results are bare point estimates with no uncertainty quantification.
3. **Statistical significance tests used in only 19% of papers making comparative claims** (92/474): papers claim 'X outperforms Y' by eyeballing two numbers.
4. **Benchmark contamination addressed by only 25% of applicable papers** (82/323): the elephant in the room that most papers ignore.
5. **Pre-registration rate: 1%** among papers with human participants (1/79). It is standard in medicine and nearly absent in AI/SE research.
6. **Prompts provided in only 65% of applicable papers**: prompt-based research that withholds the actual prompts cannot be reproduced.
7. **Generalization bounded in only 36% of papers**: the rest test on, say, Python + GPT-4 but claim results for 'code generation' broadly.
8. **Funding disclosed in only 33% of applicable papers**; financial interests declared in only 4% (23/561).

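### Strong Patterns

- **Code release correlates with higher quality**: +5.2pp overall score gap between papers that release code and those that don't.
- **Year trend**: The mean of yearly means falls from 58.4% (<=2023) to 51.6% (>=2025), but five of the seven early years contain a single paper each. Weighted by paper count (Section 9), the drop is smaller, roughly 53.7% to 50.3%, and the partial 2026 sample (53.7%) sits above both 2024 and 2025. The data hint that quality is declining as the field grows faster than rigor, but do not yet establish it.
- **Median score is 52.8%**: the typical paper in this space satisfies about half of applicable methodological criteria.
- **RCTs score 52.7% vs benchmark-evals at 53.5%**: benchmark papers roughly match RCTs on this checklist, possibly because it favors artifact-heavy work. With only 7 RCTs in the sample, the comparison is tentative.
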
### Publication-Worthy Statistics

- Across 561 papers in agentic AI/LLM programming (2017-2026; percentages are of applicable papers):
  - 54% release source code
  - 64% release data
  - 21% report confidence intervals
  - 19% use significance tests
  - 25% address benchmark contamination
  - 36% bound their generalization claims
  - 33% disclose funding
  - 1% pre-register (among human studies)

### Potential Narrative Threads

1. **The reproducibility gap**: Most papers don't release enough to reproduce (code + data + environment + instructions). What fraction hits all four?
   - **Answer: 18/510 (3.5%)** of papers where all four artifact criteria apply satisfy all four.
2. **The Wakefield parallel**: How many papers have undisclosed conflicts of interest, for example papers evaluating their employer's product without acknowledging the conflict?
3. **Statistical theater**: Papers report numbers that look precise but have no statistical foundation: no CIs, no significance tests, no variance across runs.
4. **The contamination blindspot**: Papers evaluating LLMs on public benchmarks without any contamination analysis. Results may be meaningless.
5. **Overclaiming**: Papers testing narrow configurations but making broad claims in titles and abstracts.
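
The all-four count in the first thread (18/510) is a conjunction over per-paper answers, restricted to papers where every criterion applies. A minimal sketch of that computation follows. Only `artifacts.environment_specified` and `artifacts.reproduction_instructions` appear in this report's tables; the other two criterion names, the record layout, and the toy records are hypothetical.

```python
# The four artifact criteria. The first two names are hypothetical
# placeholders; the last two appear in the report's tables.
ARTIFACT_CRITERIA = (
    "artifacts.code_released",
    "artifacts.data_released",
    "artifacts.environment_specified",
    "artifacts.reproduction_instructions",
)


def fully_reproducible(papers, criteria=ARTIFACT_CRITERIA):
    """Count papers satisfying every criterion, among those where all apply.

    Each paper maps a criterion name to True (yes), False (no), or
    None (not applicable). Returns (hits, applicable).
    """
    applicable = [p for p in papers
                  if all(p.get(c) is not None for c in criteria)]
    hits = sum(all(p[c] for c in criteria) for p in applicable)
    return hits, len(applicable)


# Toy records for illustration only; NOT survey data.
all_yes = dict.fromkeys(ARTIFACT_CRITERIA, True)
papers = [
    all_yes,
    {**all_yes, "artifacts.environment_specified": False},
    {**all_yes, "artifacts.data_released": None},  # excluded: not applicable
]
hits, applicable = fully_reproducible(papers)
print(f"{hits}/{applicable} ({hits / applicable:.0%})")
```

Excluding papers with a not-applicable criterion, rather than counting them as failures, is what makes the denominator 510 instead of 561 in the report's figure.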

### Subgroup Comparisons to Investigate Further

- Do papers from top venues score higher? (requires venue data)
- Do papers with more authors score higher?
- Do industry and academic papers differ?
- Do survey/meta-analysis papers score higher on methodology? (Section 10 gives meta-analysis the lowest overall mean, 40.8%; a per-category breakdown is still needed.)
- Is there a correlation between number of red flags and composite score?
