related-work.md - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

1 # Related Work
2
3 ## Methodology Precedents
4
5 ### Cochrane Reviews
6
7 The Cochrane Collaboration has produced systematic reviews of medical research since 1993. Their methodology is the gold standard for structured, reproducible literature review:
8
9 - **Structured extraction**: Every study is assessed against a predefined rubric (Risk of Bias tool)
10 - **Pre-registered protocols**: Review methodology is published before data extraction begins
11 - **Multiple reviewers**: At least two independent reviewers extract data, with conflict resolution procedures
12 - **GRADE framework**: Explicit scoring of evidence certainty across dimensions
13
14 We adapt this approach for CS/AI research, recognizing that the field has different norms (preprints vs. peer review, code release vs. clinical trial registration) but the same underlying need for structured quality assessment.
15
16 ### PRISMA Reporting Guidelines
17
18 PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) provides a checklist for transparent reporting. Key elements we adopt:
19
20 - Explicit inclusion/exclusion criteria
21 - Search strategy documentation
22 - Flow diagram of paper selection
23 - Structured data extraction
24
25 ### Limitations of the Analogy
26
27 Medical systematic reviews typically synthesize effect sizes across comparable studies (e.g., "does drug X reduce mortality?"). Our review assesses *methodological quality* rather than synthesizing a single outcome. The studies we review measure different things with different methods, so meta-analytic pooling is inappropriate. We are closer to a "scoping review" or "critical appraisal" than a traditional Cochrane review.
28
29 ## Relevant Meta-Research
30
31 ### "Are Emergent Abilities of Large Language Models a Mirage?" (arXiv:2304.15004)
32
33 NeurIPS 2023 Outstanding Paper. Schaeffer, Miranda, & Koyejo (Stanford) argued that many claimed "emergent abilities" in LLMs are artifacts of metric choice rather than genuine phase transitions. They observed that over 92% of the emergent abilities reported on BIG-Bench appear under just two metrics, Multiple Choice Grade and Exact String Match. Under nonlinear or discontinuous metrics such as exact-match accuracy, abilities appear to emerge suddenly at certain scales. Under continuous metrics that award partial credit, such as token edit distance, the same model outputs show smooth, predictable improvement.
34
35 **Relevance to this project**: This paper is a paradigmatic example of "you measured it wrong" meta-research. It demonstrates that the *method of measurement* can create or destroy dramatic findings. Our survey asks the same question across a broader set of papers: are the claimed results genuine, or artifacts of how they were measured?
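The metric effect described above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's code: the logistic scaling curve, the 10-token answer length, and the function names are all assumptions made for illustration. It assumes per-token errors are independent, so the chance of an exact match on an answer of length L is roughly the per-token accuracy raised to the power L.

```python
import math

def per_token_accuracy(log10_params: float) -> float:
    """Hypothetical smooth scaling curve: a logistic in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 8.0)))

def exact_match_accuracy(p_token: float, answer_len: int) -> float:
    """Probability that every token is correct, assuming independent per-token errors."""
    return p_token ** answer_len

ANSWER_LEN = 10               # assumed answer length in tokens
scales = [7, 8, 9, 10, 11]    # assumed model sizes as log10(parameters)

# Each row: (log10 parameter count, per-token accuracy, exact-match accuracy)
rows = []
for s in scales:
    p = per_token_accuracy(s)
    rows.append((s, p, exact_match_accuracy(p, ANSWER_LEN)))

for s, p, em in rows:
    print(f"10^{s} params  per-token={p:.3f}  exact-match={em:.4f}")
```

In this toy setup the per-token accuracy climbs steadily across the whole range, while exact match stays near zero for the smaller models and then rises steeply. The underlying capability is the same in both columns; only the scoring rule differs.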
36
37 ### Why This Matters: The Wakefield Precedent
38
39 In 1998, Andrew Wakefield published a study in *The Lancet* linking the MMR vaccine to autism. The study had 12 participants, undisclosed financial conflicts of interest, and ethical violations in how the children were recruited and tested. *The Lancet* did not retract it until 2010, 12 years later, and Wakefield was struck off the UK medical register that same year. By then, the damage was done: vaccination rates dropped, measles outbreaks returned, and the anti-vaccination movement it fueled persists decades later.
40
41 Wakefield is the canonical example of what happens when a methodologically weak study escapes into public discourse without adequate scrutiny. The paper scored poorly on every dimension we measure: no reproducibility (data later found to be fabricated), no statistical rigor (N=12, no controls), inappropriate methodology (case series presented as causal evidence), claims wildly exceeding the evidence, and zero honest limitations discussion.
42
43 The lesson is not that peer review failed (it did, but that is a systemic problem). The lesson is that *methodological quality assessment after publication* is a necessary check. The Cochrane Collaboration exists precisely because individual studies, even peer-reviewed ones, cannot be taken at face value. Someone has to do the structured, dispassionate evaluation.
44
45 The AI/LLM research space has analogous risk factors:
46 - **Strong commercial incentives** distort what gets published and how findings are framed
47 - **Rapid publication pace** (preprints, blog posts) bypasses traditional review
48 - **Hype-driven media coverage** amplifies dramatic claims without methodological scrutiny
49 - **Limited replication culture** means few papers are independently verified
50 - **Benchmark optimization** creates results that look impressive but do not generalize
51
52 No AI methodology paper is likely to cause a public health crisis. But inflated productivity claims influence hiring decisions, investment allocation, and engineering practice. Inflated safety claims create false confidence. Deflated capability claims cause premature dismissal of useful techniques. Getting the methodology right matters because people make real decisions based on these numbers.
53
54 ### Broader Meta-Research Context
55
56 The "replication crisis" in psychology and social science (beginning ~2011) demonstrated that many published findings did not hold up under scrutiny. Key lessons:
57
58 - Publication bias toward positive results inflates effect sizes
59 - Small sample sizes produce unstable estimates
60 - Researcher degrees of freedom (flexible analysis choices) enable p-hacking
61 - Pre-registration and replication requirements improve reliability
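The first two lessons interact, and a small simulation shows how. This is a hedged sketch under stated assumptions rather than a model of any real literature: the true effect of 0.2, the sample size of 20, the unit standard deviation, and the "publish only significant positive results" filter are all hypothetical choices, as are the function names.

```python
import random
import statistics

def simulate_studies(true_effect, n_per_study, n_studies, seed=0):
    """Return one effect estimate (sample mean) per simulated study."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        estimates.append(statistics.fmean(sample))
    return estimates

def published_subset(estimates, n_per_study, z_crit=1.96):
    """Keep only studies whose z-score clears the threshold (known sd = 1)."""
    standard_error = 1.0 / n_per_study ** 0.5
    return [e for e in estimates if e / standard_error > z_crit]

TRUE_EFFECT = 0.2   # assumed true effect, in standard-deviation units
N_PER_STUDY = 20    # assumed (small) sample size per study

estimates = simulate_studies(TRUE_EFFECT, N_PER_STUDY, n_studies=2000)
published = published_subset(estimates, N_PER_STUDY)

mean_all = statistics.fmean(estimates)
mean_published = statistics.fmean(published)

print(f"true effect:            {TRUE_EFFECT}")
print(f"mean of all studies:    {mean_all:.3f}")
print(f"mean of published only: {mean_published:.3f}")
print(f"share published:        {len(published) / len(estimates):.2%}")
```

With 20 participants per study, an estimate has to exceed roughly 0.44 to clear the significance threshold, so every "published" estimate in this sketch is more than double the true effect. Averaging over all studies recovers the true value; averaging over the published subset does not.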
62
63 The AI/LLM research space is exposed to the same failure modes, and its own risk factors make them harder to catch: rapid publication pace, strong commercial incentives, limited replication culture, and benchmarks that can be optimized against. Our survey explicitly checks for these patterns.
