# Deep Patterns: The Games Papers Play

*Updated 2026-03-15 from 467 v2 scans (Opus-rated)*

This document goes beyond pass rates to identify **systematic patterns of selective rigor**: the specific ways papers create the appearance of scientific quality while lacking its substance.

---

## 1. The Methodological Theater Gap

The defining finding of this survey. Papers invest heavily in the *visible* parts of methodology while neglecting the *invisible* infrastructure.

| What papers do well (visible) | Rate | What papers neglect (invisible) | Rate |
|-----|------|------|------|
| Include baselines | 84.5% | Report seed sensitivity | 12.8% |
| Multiple metrics | 87.6% | Hyperparameter search budget | 3.0% |
| Per-category breakdowns | 97.0% | Confidence intervals | 24.0% |
| Report effect sizes | 84.5% | Significance tests | 31.4% |
| Failure cases discussed | 90.2% | Sample size justified | 6.4% |
| Limitations section present | 68.1% | Specific threats to validity | 63.7% |
| Affiliations disclosed | 98.5% | Financial interests declared | 1.5% |
| Code released | 58.6% | Reproduction instructions | 10.9% |

The pattern is consistent: **papers perform the rituals of science without its substance.** They have baselines but don't report variance. They report effect sizes but not confidence intervals. They release code but not instructions to run it. They disclose affiliations but not financial interests.

## 2. Six Named Games

### Game 1: "Big Numbers, No Error Bars"
Papers that report effect sizes but provide NO confidence intervals, NO significance tests, AND NO variance.

**54% of papers reporting effect sizes play this game** (172/318).

These papers say "our method improves by 12%" but can't tell you whether that 12% is signal or noise. An effect size without uncertainty quantification is scientifically meaningless; it's a marketing number.

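
A minimal sketch of what attaching an interval to such a number could look like, assuming per-task scores are available for both the baseline and the new method on the same tasks. The function name `bootstrap_ci`, its parameters, and the example scores are illustrative assumptions, not something taken from the scanned papers. It uses a paired percentile bootstrap, which is one common choice among several.

```python
import random
import statistics


def bootstrap_ci(baseline, method, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean paired difference (method - baseline).

    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(method) or not baseline:
        raise ValueError("need non-empty, paired per-task scores")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [m - b for b, m in zip(baseline, method)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement and record the mean difference.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Made-up per-task pass/fail results (1 = solved), for illustration only.
baseline_scores = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
method_scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]

mean_diff, lower, upper = bootstrap_ci(baseline_scores, method_scores)
print(f"improvement: {mean_diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

If the resulting interval includes zero, the headline improvement cannot be distinguished from noise at that confidence level, which is exactly the information a bare effect size withholds.
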

### Game 2: "Overclaiming"
Papers whose abstract claims are supported by their evidence, but who don't bound their generalization.

**53% of papers overclaim** (249/467). More than half of papers with supported abstracts fail to bound their generalization claims.

The typical pattern: test on Python + GPT-4 on SWE-bench, then claim results for "AI-assisted software engineering" broadly. The abstract is technically accurate about what was measured, but the framing implies far more generality than the evidence supports.

### Game 3: "Open Source Theater"
Papers that release code but provide NO environment specification AND NO reproduction instructions.

**84% of papers releasing code play this game** (193/231).

Releasing a GitHub repo with no Dockerfile, no pinned versions in requirements.txt, no instructions, and no seeds is not reproducibility; it's the appearance of reproducibility. The code exists but cannot be meaningfully used by others.

### Game 4: "The Contamination Dodge"
Benchmark-eval papers that don't address whether their LLM's training data included the benchmark.

**84% of benchmark papers dodge contamination** (264/314). This worsened steadily from 51% (n=135) → 78% (n=271) → 84% (n=467) as less prominent papers entered the pool. The long tail almost never addresses contamination.

If GPT-4 was trained on HumanEval solutions, then evaluating it on HumanEval measures memorization, not capability. More than four in five benchmark papers don't even acknowledge this possibility.

### Game 5: "Cherry-picked Comparisons"
Papers that include baselines but don't perform ablation studies.

**52 papers** include baselines but skip ablation. Without ablation, you can't tell which component of a system drives the improvement. Maybe it's the prompt template, not the novel technique.

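
The definitions of Games 1 through 5 above are all of the form "passes a visible check, fails the matching invisible ones", so they can be sketched as simple boolean rules. The following is an illustrative sketch only: the function `games_played` and every checklist key in it are hypothetical names chosen for this example, not the scanner's actual schema.

```python
def games_played(answers):
    """Return the names of Games 1-5 that a paper plays.

    `answers` maps hypothetical checklist keys to True (pass) or False (fail).
    Missing keys are treated as fails.
    """

    def passed(key):
        return bool(answers.get(key, False))

    games = []
    # Game 1: effect sizes, but no CIs, no significance tests, and no variance.
    if passed("effect_sizes_reported") and not (
        passed("confidence_intervals")
        or passed("significance_tests")
        or passed("variance_reported")
    ):
        games.append("Big Numbers, No Error Bars")
    # Game 2: abstract is supported, but generalization is not bounded.
    if passed("abstract_supported") and not passed("generalization_bounded"):
        games.append("Overclaiming")
    # Game 3: code released, but neither environment spec nor repro instructions.
    if passed("code_released") and not (
        passed("environment_specified") or passed("reproduction_instructions")
    ):
        games.append("Open Source Theater")
    # Game 4: benchmark-eval paper that never addresses contamination.
    if passed("is_benchmark_eval") and not passed("contamination_addressed"):
        games.append("The Contamination Dodge")
    # Game 5: baselines included, but no ablation.
    if passed("baselines_included") and not passed("ablation_performed"):
        games.append("Cherry-picked Comparisons")
    return games


# A made-up paper that does all the visible things and none of the invisible ones.
theater_paper = {
    "effect_sizes_reported": True,
    "abstract_supported": True,
    "code_released": True,
    "is_benchmark_eval": True,
    "baselines_included": True,
}
print(games_played(theater_paper))
```

Treating missing keys as fails is a design choice for the sketch; a real pipeline might prefer to distinguish "not applicable" from "not done".
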

### Game 6: "All Show, No Substance"
Papers that pass 80%+ of easy checklist questions (baselines, metrics, breakdowns, abstract support, affiliations) but pass 0% of hard questions (CIs, significance, seeds, contamination, environment, sample size, funding, generalization).

**~50 papers fit this pattern** at n=271. These papers have all the surface markers of a scientific paper (evaluation section, results table, comparison chart) but none of the methodological foundations.

Examples:
- a2hcoder-llmdriven-coding-2025 (33%)
- deepseek-coder-v2-2024 (42%)
- enterprise-ai-coding-requirements-2026 (26%)
- hidden-risks-llm-web-code-2025 (22%)

## 3. The Confidence-Rigor Inversion

Papers making **stronger claims** have **lower rigor** on the hard questions:

| Claim support level | N claims | Mean overall score |
|---|---|---|
| Strong claims | 304 | 59.8% |
| Moderate claims | 279 | 49.6% |
| Weak claims | 89 | 41.6% |
| Unsupported claims | 9 | 37.4% |

At face value this table says strong claims correlate with quality, but that is an artifact. Papers making strong claims tend to have elaborate evaluation sections (high eval_design), which inflates their overall score. When you look at only the hard questions, the relationship inverts: papers with strong claims are the *least* likely to justify sample sizes, report variance, or bound generalization.

## 4. Rigor Tiers

Grouping papers by what fraction of "hard" questions they pass (CIs, significance, variance, sample size, seeds, hyperparameter search, contamination, environment, reproduction instructions, funding, generalization):

| Hard question pass rate | N | Mean overall score |
|---|---|---|
| 0% (zero hard questions) | 40 | 34.7% |
| 1-24% | 31 | 52.6% |
| 25-49% | 42 | 57.9% |
| 50-74% | 19 | 69.8% |
| 75-100% | 3 | 60.5% |

**40 papers (30% of the 135 in this breakdown) pass zero hard questions.** These papers have evaluations, baselines, and results tables. They just lack every form of methodological infrastructure that would make those results trustworthy.

The jump from 0% to 1-24% is 18pp in overall score, suggesting that papers that do *any* hard methodology tend to do better across the board. Rigor is not a spectrum; it's binary. Papers either care about methodology or they don't.

## 5. The "Trust Me Bro" Divide

Papers that report effect sizes split cleanly into two populations:

| Group | N | Mean score |
|---|---|---|
| Effect sizes WITHOUT CIs | 64 | 51.0% |
| Effect sizes WITH CIs | 23 | 64.3% |

**13pp gap.** Papers that bother with confidence intervals are systematically better across every dimension: not just statistics, but artifacts, transparency, and claims. CIs are a marker of methodological seriousness, not just a statistical checkbox.

## 6. Funding Disclosure Predicts Quality

| Group | N | Mean score |
|---|---|---|
| Funding disclosed | 44 | 61.5% |
| Funding NOT disclosed | 91 | 46.7% |

**15pp gap.** Papers that disclose funding are meaningfully better. This could be causal (funded research has more resources for rigor) or selective (journals/venues requiring funding disclosure also require higher standards). Either way, 67% of papers don't disclose funding at all.

## 7. Year Trends by Category

| Category | 2023 | 2024 | 2025 | 2026 |
|---|---|---|---|---|
| evaluation_design | 89.3% | 84.9% | 83.8% | 87.8% |
| statistical_methodology | 27.7% | 34.5% | 29.5% | 50.4% |
| artifacts | 32.1% | 41.7% | 30.3% | 41.0% |
| contamination | 33.3% | 39.6% | 26.7% | 19.0% |
| conflicts_of_interest | 42.9% | 40.2% | 36.9% | 41.7% |

Two alarming trends:
- **Contamination awareness is declining**: 39.6% (2024) → 26.7% (2025) → 19.0% (2026). As LLM benchmarks proliferate, contamination is becoming *more* of a problem and *less* discussed.
- **Artifact quality dropped in 2025**: 41.7% (2024) → 30.3% (2025). The volume explosion brought papers that don't release code or data.

Evaluation design remains stable (~85%): the easy parts stay easy. The hard parts are getting harder.

## 8. The Overclaiming Gap

- **97.8%** of paper abstracts are supported by their evidence
- **35.6%** bound their generalization claims
- **64%** of papers with supported abstracts overclaim

The typical paper accurately describes narrow results in the body, then frames them as broad contributions in the abstract and title. This isn't fabrication; it's systematic inflation. "We improve Python code completion with GPT-4 on HumanEval" becomes "Advancing AI-Assisted Software Engineering."

## 9. Red Flag Combinations

The most concerning co-occurrences:

| Combination | Count | Why it matters |
|---|---|---|
| No limitations + no uncertainty | 3 | Complete methodological absence |
| Contamination risk + no limitations | 3 | Known risk, zero acknowledgment |
| No publication bias + no quality assessment (surveys) | 3 | Surveys laundering weak results |
| Company evaluating own product + no blinding | 2 | Undisclosed conflict + biased design |
| Company evaluating own product + no limitations | 2 | Conflict without self-awareness |

## 10. Emerging Taxonomy of Paper Types

From these patterns, five archetypes emerge:

1. **The Rigorous Paper** (score 65%+, passes hard questions): ~15% of corpus. Discloses funding, reports uncertainty, bounds claims. Usually from established research groups.

2. **The Competent Paper** (score 50-65%, passes some hard questions): ~35%. Has a real evaluation but misses statistical infrastructure. Usually fixable with reviewer feedback.

3. **The Theater Paper** (score 35-50%, high eval_design, low everything else): ~30%. Elaborate results section, zero statistical foundation. Looks scientific at a glance, falls apart under scrutiny.

4. **The Position Paper Masquerading as Empirical Work** (score 20-35%): ~15%. Has an "evaluation" section that's really a demo. One model, one benchmark, no baselines, no statistics.

5. **The Non-Paper** (score <20%): ~5%. Blog post submitted to arXiv. No methodology, no evaluation framework, often just screenshots of ChatGPT output.

---

## Summary: What Papers Actually Do vs. What They Should Do

The field has converged on a template: include an evaluation section with baselines and metrics, write a limitations paragraph, release a GitHub repo. This template creates a strong *appearance* of rigor (evaluation_design: 86%) while systematically omitting the foundations that would make results trustworthy (experimental_rigor: 26%, statistical_methodology: 34%, contamination: 30%).

The result is a literature that looks scientific but isn't, where "X outperforms Y" means "we ran it once without reporting variance on a benchmark that might be in the training data, using hyperparameters we selected without documenting our search, and we didn't check whether our reimplementation of the baseline was fair."

This isn't a few bad papers. It's the **median paper** in the field.

---

## Part II: What's Actually Working

The corpus isn't all bad.
12% of papers score above 70%, and the top papers reveal a clear template for what good looks like.

## 11. The Quality Distribution Is a Spectrum, Not a Disaster

| Tier | N | % |
|---|---|---|
| >= 70% | 16 | 12% |
| 60-69% | 31 | 23% |
| 50-59% | 29 | 21% |
| 40-49% | 24 | 18% |
| 30-39% | 21 | 16% |
| < 30% | 14 | 10% |

**35% of papers score 60% or above.** These aren't perfect, but they're doing real science: reporting variance, bounding claims, disclosing conflicts. The problem isn't that nobody knows how; it's that the majority don't bother.

## 12. The Top Papers and What They Do Right

The top papers share a clear pattern: they do the hard things, not just the easy things.

| Score | Paper | Hard Passes |
|---|---|---|
| 86% | measuring-mid2025-llmassistance-2026 (RCT) | CIs, sig tests, variance, sample size, env, repro, funding, gen bound |
| 86% | 2025-ai-agent-2026 (meta-analysis) | sample size, funding, gen bound |
| 85% | cursor-speed-quality-tradeoff-2025 (observational) | CIs, sig tests, variance, funding, gen bound |
| 80% | fuzz4all-universal-fuzzing-2023 (benchmark) | CIs, sig tests, variance, seeds, env, funding, gen bound |
| 79% | swe-agent-2024 (benchmark) | CIs, variance, seeds, hyper search, env, repro, funding |
| 76% | how-ai-impacts-2026 (RCT) | CIs, sig tests, sample size, funding |
| 75% | swe-bench-2023 (benchmark) | contamination, env, repro, funding |
| 74% | design-evaluation-assisted-2026 (RCT) | CIs, sig tests, variance, sample size, funding |
| 74% | library-hallucinations-llm-2025 (benchmark) | contamination, env, repro |

**The pattern**: top papers don't do one thing well; they do *everything*. The top RCT passes 8/9 hard questions. The top benchmark paper (fuzz4all) passes 7/9. Good methodology is a package deal: papers either invest in rigor across the board or they don't.

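
The pass counts quoted above (8 for the top RCT, 7 for fuzz4all) can be read straight off the Hard Passes column. A small sketch of that tally, with a few rows copied from the table; the helper name `count_hard_passes` is an assumption for illustration, and the denominator of 9 is taken from the text rather than recomputed here.

```python
# Rows copied from the "Hard Passes" column of the table above.
hard_passes = {
    "measuring-mid2025-llmassistance-2026": (
        "CIs, sig tests, variance, sample size, env, repro, funding, gen bound"
    ),
    "2025-ai-agent-2026": "sample size, funding, gen bound",
    "fuzz4all-universal-fuzzing-2023": (
        "CIs, sig tests, variance, seeds, env, funding, gen bound"
    ),
    "swe-agent-2024": "CIs, variance, seeds, hyper search, env, repro, funding",
}


def count_hard_passes(cell):
    """Count the comma-separated entries in one 'Hard Passes' table cell."""
    return len([item for item in cell.split(",") if item.strip()])


for paper, cell in hard_passes.items():
    print(f"{paper}: {count_hard_passes(cell)} hard passes")
```
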

Note that **benchmark papers can score just as high as RCTs**: fuzz4all (80%) and swe-agent (79%) prove that benchmark methodology can be rigorous. The tool isn't biased against benchmarks; most benchmark papers just don't try.

## 13. Things That Are Getting Better

Three questions show genuine improvement (2025-26 vs 2023-24):

| Question | 2023-24 | 2025-26 | Change |
|---|---|---|---|
| significance_tests | 23% | 37% | **+14pp** |
| scope_boundaries_stated | 45% | 59% | **+14pp** |
| alternative_explanations_discussed | 42% | 52% | **+10pp** |

**Statistical testing is increasing.** From under a quarter of papers to more than a third. This is slow, but it's the right direction.

**Claims are getting more bounded.** Scope boundaries and alternative explanations are improving: papers are becoming more careful about what they claim.

## 14. Things That Are Getting Worse

| Question | 2023-24 | 2025-26 | Change |
|---|---|---|---|
| training_cutoff_stated | 30% | 8% | **-22pp** |
| leakage_detection_method | 42% | 25% | **-17pp** |
| causal_claims_justified | 67% | 51% | **-16pp** |
| limitations_section_present | 79% | 64% | **-15pp** |
| train_test_overlap_discussed | 48% | 35% | **-13pp** |
| baselines_included | 93% | 80% | **-13pp** |
| model_versions_specified | 54% | 42% | **-13pp** |
| funding_disclosed | 40% | 29% | **-12pp** |
| data_released | 63% | 53% | **-11pp** |

**Contamination awareness is collapsing.** Training cutoff stated dropped from 30% to 8%. Leakage detection from 42% to 25%. As LLMs train on everything and benchmarks proliferate, the problem grows while the field's attention to it shrinks.

**Basic professionalism is eroding.** Limitations sections, baselines, funding disclosure, and data release are all declining. The volume explosion of 2025 brought a wave of papers that skip fundamentals.

## 15. What Even Good Papers Miss

Questions that trip up even the top quartile (passed by <50% of papers scoring 65%+):

| Question | Top quartile pass rate |
|---|---|
| financial_interests_declared | **3%** |
| environment_specified | **16%** |
| inference_cost_reported | **18%** |
| reproduction_instructions | **18%** |
| hyperparameter_search_budget | **18%** |
| sample_size_justified | **21%** |
| seed_sensitivity_reported | **42%** |
| funder_independent_of_outcome | **45%** |

**Even the best papers rarely declare financial interests (3%) or specify environments (16%).** These aren't judgment calls; they're simple disclosures that the field's norms don't require. This is a venue/reviewer problem, not an author problem. If top conferences required these, compliance would be near-universal.

## 16. What Would Move the Needle

Simulation: if every paper just added CIs, bounded generalization, and addressed contamination (three straightforward additions):

- Current median: **53.3%**
- Simulated median: **55.8%**
- Lift: **+2.4pp**

Only 2.4pp, because the problems are broader than any three questions. The field needs systematic reform, not quick fixes. But these three are high-impact starting points: they're well-understood, easy to implement, and address the biggest credibility gaps.

## 17. The Real Story

The corpus tells a nuanced story, not a simple one:

1. **There is a template for good work.** The top 12% prove it's possible to do rigorous AI research. These papers report uncertainty, bound claims, disclose conflicts, and release reproducible artifacts. They exist across all methodology types: RCTs, benchmarks, observational, even meta-analyses.

2. **The middle 50% could be fixed by reviewer pressure.** These papers have evaluations but skip statistical foundations.
   If venues required CIs, contamination analysis, and environment specs, most of these authors could comply. They just don't because nobody asks.

3. **The bottom 25% are a different problem.** These aren't sloppy science; they're not science at all. Position papers masquerading as empirical work, demos submitted as research, blog posts on arXiv.

4. **The trend is concerning.** Quality is declining as volume grows. Contamination awareness is collapsing. Basic professionalism (limitations, funding, baselines) is eroding. The field is outgrowing its standards.

5. **The fix is institutional, not individual.** Top papers prove authors *can* do this. The problem is that venues don't require it, reviewers don't enforce it, and the incentive is to publish fast, not publish well. A checklist like this one, required at submission, would shift the distribution overnight.

6. **Visibility correlates with quality.** The first 135 papers scanned (priority/prominent papers) had a median of 53.3%. As less prominent papers entered the pool, the Contamination Dodge went from 51% to 84% and Open Source Theater from 74% to 84%. The long tail of the field is worse than its visible front. This has implications for meta-analyses that sample only from top venues.

---

## Part III: What the Papers Say to Each Other

Analysis of 2,349 extracted claims across 467 papers reveals three major tensions where the literature contradicts itself, and a meta-finding about which side has better evidence.

## 18. Tension 1: The Productivity Paradox

28 papers claim AI tools improve developer productivity. 11 papers complicate or contradict this.

**The positive case:**
- "55.8% faster task completion" (Copilot RCT, 69% score)
- "15% increase in resolutions per hour" (generative-ai-at-2023, 69%)
- "82.3% of survey respondents perceive productivity gains" (evolving-ai-longitudinal-2026, 72%)

**The complication:**
- "Inexperience and limited development practices are associated with *greater* perceived productivity gains" (more-code-less-2025, 78%): the people who think they benefit most are least able to judge
- "A Productivity Pressure Paradox exists where organizational expectations for rapid productivity gains create new pressures" (maybe-we-need-2025, 69%)
- "Less-skilled workers see ~36% gains while the most skilled see no significant improvement" (generative-ai-at-2023, 69%), from the *same paper* claiming 15% overall

**The evidence quality gap:** Papers finding nuance score higher (mean 57% vs 55%). The field hasn't reconciled "faster" with "but worse code" with "and less-skilled benefit more."

## 19. Tension 2: The Benchmark Validity Crisis

Papers simultaneously build benchmarks and question whether benchmarks work. 14 papers trust their evaluation frameworks. 7 papers distrust them.

**The trust case:**
- Papers build increasingly elaborate benchmarks (SWE-bench, AgentDojo, AppWorld) with high evaluation_design scores

**The distrust case:**
- "Only 53.4% of articles presented evidence for the construct validity of their benchmark" (measuring-what-matters-2025, 71%)
- "Gap between in silico benchmark performance and real-world utility" (measuring-mid2025-llmassistance-2026, 86%)
- "Gold-patch-only evaluation dramatically overestimates performance" (omnicode-benchmark-2026, 50%)

**The irony:** 84% of benchmark papers don't address contamination, and only 15% address data leakage. The field builds elaborate measurement instruments while leaving the most fundamental validity threats unexamined.
Distrust papers score higher (53% vs 51%).

## 20. Tension 3: The Agent Capability Gap

94 claims that agents succeed vs 12 that they're limited: roughly an **8:1 ratio**. The imbalance alone is suspicious.

**The success case (from benchmarks):**
- "SWE-agent with GPT-4 Turbo solves 12.47% of SWE-bench" (swe-agent-2024, 79%)
- "87.7% pass@1 on HumanEvalFix Python" (swe-agent-2024, 79%)
- Dozens of papers claiming SOTA on various coding benchmarks

**The limitation case (from deployment):**
- "Reasoning errors and retrieval/tooling failures are the dominant failure modes" (proof-time-benchmark-2026, 68%)
- "Multi-agent systems introduce collective action problems, adversarial dynamics, and coordination failures" (multiagent-risks-from-2025, 63%)
- "Browser-based agents operate at L4-L5 autonomy with limited mid-execution intervention" (2025-ai-agent-2026, 86%)

**The pattern:** Success is measured in sandboxes. Failure is found in deployment. The field overwhelmingly produces evidence that agents work, but that evidence comes from controlled benchmarks, while the minority finding limitations comes from real-world observation and scores higher (52% vs 48%).

## 21. The Optimism-Rigor Inversion

Across all three tensions, **the skeptical/nuanced position comes from higher-scoring papers:**

| Tension | Optimistic papers | Skeptical papers | Gap |
|---|---|---|---|
| Productivity | 55% mean score | 57% mean score | +2pp |
| Benchmarks | 51% | 53% | +2pp |
| Agent capability | 48% | 52% | +4pp |

The gaps are small (2 to 4pp), but the direction is the same in all three tensions, and that is concerning. The literature is biased toward positive results, and those positive results come from less methodologically rigorous papers. The most confident claims have the weakest foundations.

This doesn't mean the optimistic claims are wrong.
It means the field lacks the evidence quality to know either way, and the incentive structure rewards publishing positive results with minimal validation.

## 22. Consensus Points

Some things the corpus agrees on:
- **LLMs hallucinate** (55 claims, no contradictions). The one thing everyone acknowledges.
- **Code quality is a concern** (241 claims). Both boosters and skeptics agree AI-generated code has quality issues.
- **Scale helps**: larger models, more data, and more compute generally improve results. No papers argue against scaling, though some find diminishing returns.
- **The field is moving toward agents**: the shift from fine-tuning to zero-shot and agent-based approaches is documented across multiple papers.

## 23. The Benchmark Monoculture

51% of all papers are pure benchmark-eval (sole methodology tag). This share grew from 49% (2023) to 55% (2025), then dipped to 42% in 2026.

The field is converging on a single way of doing science: run a model on a benchmark, report a number, claim improvement. This monoculture interacts with every other finding:
- Benchmark papers score around the middle of the corpus (50.5%): not terrible, not good
- They drive the 84% Contamination Dodge rate
- They produce the 8:1 agent success ratio
- They're where the Open Source Theater is worst

The 2026 dip in benchmark share (42%) may indicate the easy benchmark papers have been written, or it may reflect growing awareness that benchmarks alone aren't sufficient.

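
The per-year share of pure benchmark-eval papers quoted above could be computed along these lines. This is a sketch under stated assumptions: the record layout (`year`, `methodology_tags`) and the function name `pure_benchmark_share` are hypothetical, and the three example records are made up. Only the tag name `benchmark-eval` and the "sole methodology tag" rule come from the text.

```python
from collections import defaultdict


def pure_benchmark_share(papers):
    """Per-year fraction of papers whose only methodology tag is 'benchmark-eval'."""
    totals = defaultdict(int)
    pure = defaultdict(int)
    for paper in papers:
        totals[paper["year"]] += 1
        # "Pure" means benchmark-eval is the sole methodology tag.
        if paper["methodology_tags"] == ["benchmark-eval"]:
            pure[paper["year"]] += 1
    return {year: pure[year] / totals[year] for year in sorted(totals)}


# Made-up records, for illustration only.
example_papers = [
    {"year": 2023, "methodology_tags": ["benchmark-eval"]},
    {"year": 2023, "methodology_tags": ["benchmark-eval", "rct"]},
    {"year": 2024, "methodology_tags": ["rct"]},
]
print(pure_benchmark_share(example_papers))
```
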