ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

v2-deep-patterns.md (23006B)


      1 # Deep Patterns: The Games Papers Play
      2 
      3 *Updated 2026-03-15 from 467 v2 scans (Opus-rated)*
      4 
      5 This document goes beyond pass rates to identify **systematic patterns of selective rigor** — the specific ways papers create the appearance of scientific quality while lacking its substance.
      6 
      7 ---
      8 
      9 ## 1. The Methodological Theater Gap
     10 
     11 The defining finding of this survey. Papers invest heavily in the *visible* parts of methodology while neglecting the *invisible* infrastructure.
     12 
     13 | What papers do well (visible) | Rate | What papers neglect (invisible) | Rate |
     14 |-----|------|------|------|
     15 | Include baselines | 84.5% | Report seed sensitivity | 12.8% |
     16 | Multiple metrics | 87.6% | Hyperparameter search budget | 3.0% |
     17 | Per-category breakdowns | 97.0% | Confidence intervals | 24.0% |
     18 | Report effect sizes | 84.5% | Significance tests | 31.4% |
     19 | Failure cases discussed | 90.2% | Sample size justified | 6.4% |
     20 | Limitations section present | 68.1% | Specific threats to validity | 63.7% |
     21 | Affiliations disclosed | 98.5% | Financial interests declared | 1.5% |
     22 | Code released | 58.6% | Reproduction instructions | 10.9% |
     23 
     24 The pattern is consistent: **papers perform the rituals of science without its substance.** They have baselines but don't report variance. They report effect sizes but not confidence intervals. They release code but not instructions to run it. They disclose affiliations but not financial interests.
     25 
     26 ## 2. Six Named Games
     27 
     28 ### Game 1: "Big Numbers, No Error Bars"
     29 Papers that report effect sizes but provide NO confidence intervals, NO significance tests, AND NO variance.
     30 
     31 **54% of papers reporting effect sizes play this game** (172/318).
     32 
     33 These papers say "our method improves by 12%" but can't tell you whether that 12% is signal or noise. The effect size without uncertainty quantification is scientifically meaningless — it's a marketing number.
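
One way to attach uncertainty to a headline improvement is a percentile bootstrap over per-run scores. The sketch below is illustrative only: the function name `bootstrap_ci` and the per-seed pass rates are hypothetical, not taken from any surveyed paper, and it assumes each configuration was run several times with different seeds.

```python
import random
import statistics

def bootstrap_ci(baseline, treatment, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference in means (treatment - baseline)."""
    rng = random.Random(seed)  # fixed seed so the interval itself is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes.
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treatment) for _ in treatment]
        diffs.append(statistics.mean(t) - statistics.mean(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.mean(treatment) - statistics.mean(baseline)
    return point, lo, hi

# Hypothetical per-seed pass rates (5 seeds per configuration).
baseline_runs = [0.41, 0.44, 0.39, 0.46, 0.42]
treatment_runs = [0.52, 0.47, 0.55, 0.49, 0.58]

point, lo, hi = bootstrap_ci(baseline_runs, treatment_runs)
print(f"improvement: {point:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

With only a handful of runs the interval is coarse, but even a coarse interval tells the reader whether the improvement is distinguishable from seed-to-seed variation.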
     34 
     35 ### Game 2: "Overclaiming"
     36 Papers whose abstract claims are supported by their evidence but that do not bound how far those claims generalize.
     37 
     38 **53% of papers overclaim** (249/467). More than half of papers with supported abstracts fail to bound their generalization claims.
     39 
     40 The typical pattern: test on Python + GPT-4 on SWE-bench, then claim results for "AI-assisted software engineering" broadly. The abstract is technically accurate about what they measured, but the framing implies far more generality than the evidence supports.
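
The game rates quoted above are themselves proportions from a finite sample, so they can carry intervals too. A minimal sketch using the Wilson score interval, applied to the counts reported for Games 1 and 2; the helper name `wilson_interval` is an assumption of this example, not part of the survey tooling.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Counts as reported in the text for Game 1 and Game 2.
for label, k, n in [("Game 1", 172, 318), ("Game 2", 249, 467)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```

The interval treats the scanned papers as a random sample of the field, which is itself an assumption; the scan order described later in this document suggests the sample is not uniform.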
     41 
     42 ### Game 3: "Open Source Theater"
     43 Papers that release code but provide NO environment specification AND NO reproduction instructions.
     44 
     45 **84% of papers releasing code play this game** (193/231).
     46 
     47 Releasing a GitHub repo with no Dockerfile, no pinned versions in requirements.txt, no instructions, and no seeds is not reproducibility; it is the appearance of reproducibility. The code exists but cannot be meaningfully used by others.
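
As a rough illustration of how little it takes to avoid this game, the sketch below records the interpreter version, platform, seed, and installed package versions in a JSON manifest that could ship alongside the results. The function name `capture_environment`, the package list, and the seed value are all hypothetical.

```python
import json
import platform
import random
import sys
from importlib import metadata

def capture_environment(packages, seed):
    """Collect the facts a reader needs to re-create a run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; record that explicitly
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": seed,
        "packages": versions,
    }

SEED = 1234  # hypothetical value; the point is that it is written down
random.seed(SEED)
manifest = capture_environment(["pip", "numpy"], SEED)
print(json.dumps(manifest, indent=2))
```

A manifest like this does not replace a container image or a lock file, but it gives a reproducer somewhere to start.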
     48 
     49 ### Game 4: "The Contamination Dodge"
     50 Benchmark-eval papers that don't address whether their LLM's training data included the benchmark.
     51 
     52 **84% of benchmark papers dodge contamination** (264/314). This worsened steadily from 51% (n=135) → 78% (n=271) → 84% (n=467) as less prominent papers entered the pool. The long tail almost never addresses contamination.
     53 
     54 If GPT-4 was trained on HumanEval solutions, then evaluating it on HumanEval measures memorization, not capability. At the current sample size, 84% of benchmark papers don't even acknowledge this possibility.
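
Where some proxy for the training corpus is available, even a crude verbatim-overlap check is better than silence. The sketch below measures what fraction of a benchmark item's token n-grams also appear in a reference text. The strings, the function names, and the choice of `n` are assumptions for illustration; checks of this kind only catch verbatim leakage, not paraphrased or translated copies.

```python
def ngrams(text, n):
    """Set of whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also appear verbatim in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)

# Hypothetical strings standing in for a benchmark solution and scraped text.
solution = "def add(a, b): return a + b"
leaked = "some scraped page ... def add(a, b): return a + b ... more text"
clean = "completely unrelated text about gardening and soil"

print(overlap_fraction(solution, leaked, n=4))
print(overlap_fraction(solution, clean, n=4))
```

When the training data is not accessible at all, stating the model's training cutoff relative to the benchmark's release date is the minimal alternative.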
     55 
     56 ### Game 5: "Cherry-picked Comparisons"
     57 Papers that include baselines but don't perform ablation studies.
     58 
     59 **52 papers** include baselines but skip ablation. Without ablation, you can't tell which component of a system drives the improvement — maybe it's the prompt template, not the novel technique.
     60 
     61 ### Game 6: "All Show, No Substance"
     62 Papers that pass 80%+ of easy checklist questions (baselines, metrics, breakdowns, abstract support, affiliations) but pass 0% of hard questions (CIs, significance, seeds, contamination, environment, sample size, funding, generalization).
     63 
     64 **~50 papers fit this pattern** at n=271. These papers have all the surface markers of a scientific paper — evaluation section, results table, comparison chart — but none of the methodological foundations.
     65 
     66 Examples:
     67 - a2hcoder-llmdriven-coding-2025 (33%)
     68 - deepseek-coder-v2-2024 (42%)
     69 - enterprise-ai-coding-requirements-2026 (26%)
     70 - hidden-risks-llm-web-code-2025 (22%)
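
The Game 6 rule can be written down as a simple predicate. This is a sketch under assumptions: the question keys below are hypothetical labels for the easy and hard groups named above, not the survey's actual schema, and missing answers are treated as fails.

```python
# Hypothetical question keys, grouped the way Game 6 describes them.
EASY = ["baselines", "multiple_metrics", "breakdowns",
        "abstract_supported", "affiliations"]
HARD = ["confidence_intervals", "significance_tests", "seeds", "contamination",
        "environment", "sample_size", "funding", "generalization_bounded"]

def pass_rate(answers, questions):
    """Fraction of the given questions answered True; missing answers count as fails."""
    return sum(bool(answers.get(q, False)) for q in questions) / len(questions)

def is_all_show(answers):
    """True when a paper passes 80%+ of easy questions and none of the hard ones."""
    return pass_rate(answers, EASY) >= 0.8 and pass_rate(answers, HARD) == 0.0

# Two made-up papers: one passes only the easy group, one also passes two hard questions.
theater = {q: True for q in EASY}
rigorous = {**theater, "confidence_intervals": True, "seeds": True}
print(is_all_show(theater), is_all_show(rigorous))
```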
     71 
     72 ## 3. The Confidence-Rigor Inversion
     73 
     74 Papers making **stronger claims** have **lower rigor** on the hard questions:
     75 
     76 | Claim support level | N claims | Mean overall score |
     77 |---|---|---|
     78 | Strong claims | 304 | 59.8% |
     79 | Moderate claims | 279 | 49.6% |
     80 | Weak claims | 89 | 41.6% |
     81 | Unsupported claims | 9 | 37.4% |
     82 
     83 This looks like strong claims correlate with quality — but it's an artifact. Papers making strong claims tend to have elaborate evaluation sections (high eval_design), which inflates their overall score. When you look at only the hard questions, the relationship inverts: papers with strong claims are the *least* likely to justify sample sizes, report variance, or bound generalization.
     84 
     85 ## 4. Rigor Tiers
     86 
     87 Grouping papers by what fraction of "hard" questions they pass (CIs, significance, variance, sample size, seeds, hyperparameter search, contamination, environment, reproduction instructions, funding, generalization):
     88 
     89 | Hard question pass rate | N | Mean overall score |
     90 |---|---|---|
     91 | 0% (zero hard questions) | 40 | 34.7% |
     92 | 1-24% | 31 | 52.6% |
     93 | 25-49% | 42 | 57.9% |
     94 | 50-74% | 19 | 69.8% |
     95 | 75-100% | 3 | 60.5% |
     96 
     97 **40 papers (30% of the 135 in this table) pass zero hard questions.** These papers have evaluations, baselines, and results tables; they just lack every form of methodological infrastructure that would make those results trustworthy.
     98 
     99 The jump from 0% to 1-24% is 18pp in overall score, the largest step between adjacent tiers, suggesting that papers that do *any* hard methodology tend to do better across the board. Rigor behaves less like a smooth spectrum than like a threshold: papers either care about methodology or they don't.
    100 
    101 ## 5. The "Trust Me Bro" Divide
    102 
    103 Papers that report effect sizes split cleanly into two populations:
    104 
    105 | Group | N | Mean score |
    106 |---|---|---|
    107 | Effect sizes WITHOUT CIs | 64 | 51.0% |
    108 | Effect sizes WITH CIs | 23 | 64.3% |
    109 
    110 **13pp gap.** Papers that bother with confidence intervals are systematically better across every dimension — not just statistics, but artifacts, transparency, and claims. CIs are a marker of methodological seriousness, not just a statistical checkbox.
    111 
    112 ## 6. Funding Disclosure Predicts Quality
    113 
    114 | Group | N | Mean score |
    115 |---|---|---|
    116 | Funding disclosed | 44 | 61.5% |
    117 | Funding NOT disclosed | 91 | 46.7% |
    118 
    119 **15pp gap.** Papers that disclose funding are meaningfully better. This could be causal (funded research has more resources for rigor) or selective (journals/venues requiring funding disclosure also require higher standards). Either way, 67% of papers don't disclose funding at all.
    120 
    121 ## 7. Year Trends by Category
    122 
    123 | Category | 2023 | 2024 | 2025 | 2026 |
    124 |---|---|---|---|---|
    125 | evaluation_design | 89.3% | 84.9% | 83.8% | 87.8% |
    126 | statistical_methodology | 27.7% | 34.5% | 29.5% | 50.4% |
    127 | artifacts | 32.1% | 41.7% | 30.3% | 41.0% |
    128 | contamination | 33.3% | 39.6% | 26.7% | 19.0% |
    129 | conflicts_of_interest | 42.9% | 40.2% | 36.9% | 41.7% |
    130 
    131 Two alarming trends:
    132 - **Contamination awareness is declining**: 39.6% (2024) → 26.7% (2025) → 19.0% (2026). As LLM benchmarks proliferate, contamination is becoming *more* of a problem and *less* discussed.
    133 - **Artifact quality dropped in 2025**: 41.7% (2024) → 30.3% (2025). The volume explosion brought papers that don't release code or data.
    134 
    135 Evaluation design remains stable (~85%) — the easy parts stay easy. The hard parts are getting harder.
    136 
    137 ## 8. The Overclaiming Gap
    138 
    139 - **97.8%** of paper abstracts are supported by their evidence
    140 - **35.6%** bound their generalization claims
    141 - **64%** of papers with supported abstracts overclaim
    142 
    143 The typical paper accurately describes narrow results in the body, then frames them as broad contributions in the abstract and title. This isn't fabrication — it's systematic inflation. "We improve Python code completion with GPT-4 on HumanEval" becomes "Advancing AI-Assisted Software Engineering."
    144 
    145 ## 9. Red Flag Combinations
    146 
    147 The most concerning co-occurrences:
    148 
    149 | Combination | Count | Why it matters |
    150 |---|---|---|
    151 | No limitations + no uncertainty | 3 | Complete methodological absence |
    152 | Contamination risk + no limitations | 3 | Known risk, zero acknowledgment |
    153 | No publication bias + no quality assessment (surveys) | 3 | Surveys laundering weak results |
    154 | Company evaluating own product + no blinding | 2 | Undisclosed conflict + biased design |
    155 | Company evaluating own product + no limitations | 2 | Conflict without self-awareness |
    156 
    157 ## 10. Emerging Taxonomy of Paper Types
    158 
    159 From these patterns, five archetypes emerge:
    160 
    161 1. **The Rigorous Paper** (score 65%+, passes hard questions): ~15% of corpus. Discloses funding, reports uncertainty, bounds claims. Usually from established research groups.
    162 
    163 2. **The Competent Paper** (score 50-65%, passes some hard questions): ~35%. Has a real evaluation but misses statistical infrastructure. Usually fixable with reviewer feedback.
    164 
    165 3. **The Theater Paper** (score 35-50%, high eval_design, low everything else): ~30%. Elaborate results section, zero statistical foundation. Looks scientific at a glance, falls apart under scrutiny.
    166 
    167 4. **The Position Paper Masquerading as Empirical Work** (score 20-35%): ~15%. Has an "evaluation" section that's really a demo. One model, one benchmark, no baselines, no statistics.
    168 
    169 5. **The Non-Paper** (score <20%): ~5%. Blog post submitted to arXiv. No methodology, no evaluation framework, often just screenshots of ChatGPT output.
    170 
    171 ---
    172 
    173 ## Summary: What Papers Actually Do vs. What They Should Do
    174 
    175 The field has converged on a template: include an evaluation section with baselines and metrics, write a limitations paragraph, release a GitHub repo. This template creates a strong *appearance* of rigor (evaluation_design: 86%) while systematically omitting the foundations that would make results trustworthy (experimental_rigor: 26%, statistical_methodology: 34%, contamination: 30%).
    176 
    177 The result is a literature that looks scientific but isn't — where "X outperforms Y" means "we ran it once without reporting variance on a benchmark that might be in the training data, using hyperparameters we selected without documenting our search, and we didn't check whether our reimplementation of the baseline was fair."
    178 
    179 This isn't a few bad papers. It's the **median paper** in the field.
    180 
    181 ---
    182 
    183 ## Part II: What's Actually Working
    184 
    185 The corpus isn't all bad. 12% of papers score above 70%, and the top papers reveal a clear template for what good looks like.
    186 
    187 ## 11. The Quality Distribution Is a Spectrum, Not a Disaster
    188 
    189 | Tier | N | % |
    190 |---|---|---|
    191 | >= 70% | 16 | 12% |
    192 | 60-69% | 31 | 23% |
    193 | 50-59% | 29 | 21% |
    194 | 40-49% | 24 | 18% |
    195 | 30-39% | 21 | 16% |
    196 | < 30% | 14 | 10% |
    197 
    198 **35% of papers score 60% or above.** These aren't perfect but they're doing real science — reporting variance, bounding claims, disclosing conflicts. The problem isn't that nobody knows how; it's that the majority don't bother.
    199 
    200 ## 12. The Top Papers and What They Do Right
    201 
    202 The top-scoring papers listed below share a clear pattern. They do the hard things, not just the easy things:
    203 
    204 | Score | Paper | Hard Passes |
    205 |---|---|---|
    206 | 86% | measuring-mid2025-llmassistance-2026 (RCT) | CIs, sig tests, variance, sample size, env, repro, funding, gen bound |
    207 | 86% | 2025-ai-agent-2026 (meta-analysis) | sample size, funding, gen bound |
    208 | 85% | cursor-speed-quality-tradeoff-2025 (observational) | CIs, sig tests, variance, funding, gen bound |
    209 | 80% | fuzz4all-universal-fuzzing-2023 (benchmark) | CIs, sig tests, variance, seeds, env, funding, gen bound |
    210 | 79% | swe-agent-2024 (benchmark) | CIs, variance, seeds, hyper search, env, repro, funding |
    211 | 76% | how-ai-impacts-2026 (RCT) | CIs, sig tests, sample size, funding |
    212 | 75% | swe-bench-2023 (benchmark) | contamination, env, repro, funding |
    213 | 74% | design-evaluation-assisted-2026 (RCT) | CIs, sig tests, variance, sample size, funding |
    214 | 74% | library-hallucinations-llm-2025 (benchmark) | contamination, env, repro |
    215 
    216 **The pattern**: top papers don't do one thing well — they do *everything*. The top RCT passes 8/9 hard questions. The top benchmark paper (fuzz4all) passes 7/9. Good methodology is a package deal: papers either invest in rigor across the board or they don't.
    217 
    218 Note that **benchmark papers can score nearly as high as RCTs**: fuzz4all (80%) and swe-agent (79%) show that benchmark methodology can be rigorous. The tool isn't biased against benchmarks; most benchmark papers just don't try.
    219 
    220 ## 13. Things That Are Getting Better
    221 
    222 Three questions show genuine improvement (2025-26 vs 2023-24):
    223 
    224 | Question | 2023-24 | 2025-26 | Change |
    225 |---|---|---|---|
    226 | significance_tests | 23% | 37% | **+14pp** |
    227 | scope_boundaries_stated | 45% | 59% | **+14pp** |
    228 | alternative_explanations_discussed | 42% | 52% | **+10pp** |
    229 
    230 **Statistical testing is increasing.** From under a quarter of papers (23%) to over a third (37%). This is slow, but it's the right direction.
    231 
    232 **Claims are getting more bounded.** Scope boundaries and alternative explanations are improving — papers are becoming more careful about what they claim.
    233 
    234 ## 14. Things That Are Getting Worse
    235 
    236 | Question | 2023-24 | 2025-26 | Change |
    237 |---|---|---|---|
    238 | training_cutoff_stated | 30% | 8% | **-22pp** |
    239 | leakage_detection_method | 42% | 25% | **-17pp** |
    240 | causal_claims_justified | 67% | 51% | **-16pp** |
    241 | limitations_section_present | 79% | 64% | **-15pp** |
    242 | train_test_overlap_discussed | 48% | 35% | **-13pp** |
    243 | baselines_included | 93% | 80% | **-13pp** |
    244 | model_versions_specified | 54% | 42% | **-13pp** |
    245 | funding_disclosed | 40% | 29% | **-12pp** |
    246 | data_released | 63% | 53% | **-11pp** |
    247 
    248 **Contamination awareness is collapsing.** Training cutoff stated dropped from 30% to 8%. Leakage detection from 42% to 25%. As LLMs train on everything and benchmarks proliferate, the problem grows while the field's attention to it shrinks.
    249 
    250 **Basic professionalism is eroding.** Limitations sections, baselines, funding disclosure, and data release are all declining. The volume explosion of 2025 brought a wave of papers that skip fundamentals.
    251 
    252 ## 15. What Even Good Papers Miss
    253 
    254 Questions that trip up even the top quartile (passed by <50% of papers scoring 65%+):
    255 
    256 | Question | Top quartile pass rate |
    257 |---|---|
    258 | financial_interests_declared | **3%** |
    259 | environment_specified | **16%** |
    260 | inference_cost_reported | **18%** |
    261 | reproduction_instructions | **18%** |
    262 | hyperparameter_search_budget | **18%** |
    263 | sample_size_justified | **21%** |
    264 | seed_sensitivity_reported | **42%** |
    265 | funder_independent_of_outcome | **45%** |
    266 
    267 **Even the best papers rarely declare financial interests (3%) or specify environments (16%).** These aren't judgment calls — they're simple disclosures that the field's norms don't require. This is a venue/reviewer problem, not an author problem. If top conferences required these, compliance would be near-universal.
    268 
    269 ## 16. What Would Move the Needle
    270 
    271 Simulation: if every paper just added CIs, bounded generalization, and addressed contamination — three straightforward additions:
    272 
    273 - Current median: **53.3%**
    274 - Simulated median: **55.8%**
    275 - Lift: **+2.4pp**
    276 
    277 Only 2.4pp — because the problems are broader than any three questions. The field needs systematic reform, not quick fixes. But these three are high-impact starting points: they're well-understood, easy to implement, and address the biggest credibility gaps.
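
A simulation of this kind can be sketched as follows, assuming a paper's score is simply the percentage of checklist questions it passes (the actual scoring may weight questions differently). The question keys and the three toy papers are hypothetical. Under that assumption, flipping three questions can add at most 3/Q of the score range per paper on a Q-question checklist, and nothing for papers that already pass them, which is why the lift on a long checklist is small.

```python
import statistics

# Hypothetical keys for the three questions being "fixed" in the simulation.
TARGETS = ("confidence_intervals", "generalization_bounded", "contamination_addressed")

def score(answers):
    """Percentage of checklist questions passed (assumed unweighted)."""
    return 100 * sum(answers.values()) / len(answers)

def simulate(papers, targets=TARGETS):
    """Median score before and after forcing the target questions to pass."""
    before = statistics.median([score(p) for p in papers])
    patched = [{q: (True if q in targets else v) for q, v in p.items()}
               for p in papers]
    after = statistics.median([score(p) for p in patched])
    return before, after

# Toy 10-question checklist: seven generic questions plus the three targets.
QUESTIONS = [f"q{i}" for i in range(7)] + list(TARGETS)

def make_paper(n_pass):
    """Made-up paper that passes the first n_pass questions in QUESTIONS."""
    return {q: i < n_pass for i, q in enumerate(QUESTIONS)}

papers = [make_paper(k) for k in (2, 5, 9)]
before, after = simulate(papers)
print(f"median before: {before:.1f}%, after: {after:.1f}%")
```

The toy checklist has only ten questions, so the lift here is far larger than the 2.4pp reported above; the example shows the mechanics, not the magnitude.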
    278 
    279 ## 17. The Real Story
    280 
    281 The corpus tells a nuanced story, not a simple one:
    282 
    283 1. **There is a template for good work** — the top 12% prove it's possible to do rigorous AI research. These papers report uncertainty, bound claims, disclose conflicts, and release reproducible artifacts. They exist across all methodology types — RCTs, benchmarks, observational, even meta-analyses.
    284 
    285 2. **The middle 50% could be fixed by reviewer pressure.** These papers have evaluations but skip statistical foundations. If venues required CIs, contamination analysis, and environment specs, most of these authors could comply — they just don't because nobody asks.
    286 
    287 3. **The bottom 25% are a different problem.** These aren't sloppy science; they're not science at all. Position papers masquerading as empirical work, demos submitted as research, blog posts on arXiv.
    288 
    289 4. **The trend is concerning.** Quality is declining as volume grows. Contamination awareness is collapsing. Basic professionalism (limitations, funding, baselines) is eroding. The field is outgrowing its standards.
    290 
    291 5. **The fix is institutional, not individual.** Top papers prove authors *can* do this. The problem is that venues don't require it, reviewers don't enforce it, and the incentive is to publish fast, not publish well. A checklist like this one — required at submission — would shift the distribution overnight.
    292 
    293 6. **Visibility correlates with quality.** The first 135 papers scanned (priority/prominent papers) had a median of 53.3%. As less prominent papers entered the pool, the Contamination Dodge went from 51% to 84% and Open Source Theater from 74% to 84%. The long tail of the field is worse than its visible front. This has implications for meta-analyses that sample only from top venues.
    294 
    295 ---
    296 
    297 ## Part III: What the Papers Say to Each Other
    298 
    299 Analysis of 2,349 extracted claims across 467 papers reveals three major tensions where the literature contradicts itself — and a meta-finding about which side has better evidence.
    300 
    301 ## 18. Tension 1: The Productivity Paradox
    302 
    303 28 papers claim AI tools improve developer productivity. 11 papers complicate or contradict this.
    304 
    305 **The positive case:**
    306 - "55.8% faster task completion" (Copilot RCT, 69% score)
    307 - "15% increase in resolutions per hour" (generative-ai-at-2023, 69%)
    308 - "82.3% of survey respondents perceive productivity gains" (evolving-ai-longitudinal-2026, 72%)
    309 
    310 **The complication:**
    311 - "Inexperience and limited development practices are associated with *greater* perceived productivity gains" (more-code-less-2025, 78%) — the people who think they benefit most are least able to judge
    312 - "A Productivity Pressure Paradox exists where organizational expectations for rapid productivity gains create new pressures" (maybe-we-need-2025, 69%)
    313 - "Less-skilled workers see ~36% gains while the most skilled see no significant improvement" (generative-ai-at-2023, 69%) — from the *same paper* claiming 15% overall
    314 
    315 **The evidence quality gap:** Papers finding nuance score higher (mean 57% vs 55%). The field hasn't reconciled "faster" with "but worse code" with "and less-skilled benefit more."
    316 
    317 ## 19. Tension 2: The Benchmark Validity Crisis
    318 
    319 Papers simultaneously build benchmarks and question whether benchmarks work. 14 papers trust their evaluation frameworks. 7 papers distrust them.
    320 
    321 **The trust case:**
    322 - Papers build increasingly elaborate benchmarks (SWE-bench, AgentDojo, AppWorld) with high evaluation_design scores
    323 
    324 **The distrust case:**
    325 - "Only 53.4% of articles presented evidence for the construct validity of their benchmark" (measuring-what-matters-2025, 71%)
    326 - "Gap between in silico benchmark performance and real-world utility" (measuring-mid2025-llmassistance-2026, 86%)
    327 - "Gold-patch-only evaluation dramatically overestimates performance" (omnicode-benchmark-2026, 50%)
    328 
    329 **The irony:** 84% of benchmark papers don't address contamination, and only 15% address data leakage. The field builds elaborate measurement instruments while leaving the most fundamental validity threats unexamined. Distrust papers score higher (53% vs 51%).
    330 
    331 ## 20. Tension 3: The Agent Capability Gap
    332 
    333 94 claims that agents succeed vs 12 that they're limited — an **8:1 ratio**. The imbalance alone is suspicious.
    334 
    335 **The success case (from benchmarks):**
    336 - "SWE-agent with GPT-4 Turbo solves 12.47% of SWE-bench" (swe-agent-2024, 79%)
    337 - "87.7% pass@1 on HumanEvalFix Python" (swe-agent-2024, 79%)
    338 - Dozens of papers claiming SOTA on various coding benchmarks
    339 
    340 **The limitation case (from deployment):**
    341 - "Reasoning errors and retrieval/tooling failures are the dominant failure modes" (proof-time-benchmark-2026, 68%)
    342 - "Multi-agent systems introduce collective action problems, adversarial dynamics, and coordination failures" (multiagent-risks-from-2025, 63%)
    343 - "Browser-based agents operate at L4-L5 autonomy with limited mid-execution intervention" (2025-ai-agent-2026, 86%)
    344 
    345 **The pattern:** Success is measured in sandboxes. Failure is found in deployment. The field overwhelmingly produces evidence that agents work, but that evidence comes from controlled benchmarks, while the minority finding limitations comes from real-world observation — and scores higher (52% vs 48%).
    346 
    347 ## 21. The Optimism-Rigor Inversion
    348 
    349 Across all three tensions, **the skeptical/nuanced position comes from higher-scoring papers:**
    350 
    351 | Tension | Optimistic papers | Skeptical papers | Gap |
    352 |---|---|---|---|
    353 | Productivity | 55% mean score | 57% mean score | +2pp |
    354 | Benchmarks | 51% | 53% | +2pp |
    355 | Agent capability | 48% | 52% | +4pp |
    356 
    357 This is consistent and concerning, even though the gaps are small (2 to 4pp). The literature is biased toward positive results, and those positive results come from less methodologically rigorous papers. The most confident claims have the weakest foundations.
    358 
    359 This doesn't mean the optimistic claims are wrong. It means the field lacks the evidence quality to know either way — and the incentive structure rewards publishing positive results with minimal validation.
    360 
    361 ## 22. Consensus Points
    362 
    363 Some things the corpus agrees on:
    364 - **LLMs hallucinate** (55 claims, no contradictions). The one thing everyone acknowledges.
    365 - **Code quality is a concern** (241 claims). Both boosters and skeptics agree AI-generated code has quality issues.
    366 - **Scale helps** — larger models, more data, more compute generally improve results. No papers argue against scaling, though some find diminishing returns.
    367 - **The field is moving toward agents** — shift from fine-tuning to zero-shot and agent-based approaches is documented across multiple papers.
    368 
    369 ## 23. The Benchmark Monoculture
    370 
    371 51% of all papers are pure benchmark-eval (sole methodology tag). This share grew from 49% (2023) to 55% (2025), then dipped to 42% in 2026.
    372 
    373 The field is converging on a single way of doing science: run a model on a benchmark, report a number, claim improvement. This monoculture interacts with every other finding:
    374 - Benchmark papers score near the middle of the distribution (50.5%): not terrible, not good
    375 - They drive the 84% Contamination Dodge rate
    376 - They produce the 8:1 agent success ratio
    377 - They're where the Open Source Theater is worst
    378 
    379 The 2026 dip in benchmark share (42%) may indicate the easy benchmark papers have been written, or it may reflect growing awareness that benchmarks alone aren't sufficient.
