# Related Work

## Methodology Precedents

### Cochrane Reviews

The Cochrane Collaboration has produced systematic reviews of medical research since 1993. Their methodology is the gold standard for structured, reproducible literature review:

- **Structured extraction**: Every study is assessed against a predefined rubric (Risk of Bias tool)
- **Pre-registered protocols**: Review methodology is published before data extraction begins
- **Multiple reviewers**: At least two independent reviewers extract data, with conflict resolution procedures
- **GRADE framework**: Explicit scoring of evidence certainty across dimensions

We adapt this approach for CS/AI research, recognizing that the field has different norms (preprints vs. peer review, code release vs. clinical trial registration) but the same underlying need for structured quality assessment.

### PRISMA Reporting Guidelines

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) provides a checklist for transparent reporting. Key elements we adopt:

- Explicit inclusion/exclusion criteria
- Search strategy documentation
- Flow diagram of paper selection
- Structured data extraction

### Limitations of the Analogy

Medical systematic reviews typically synthesize effect sizes across comparable studies (e.g., "does drug X reduce mortality?"). Our review assesses *methodological quality* rather than synthesizing a single outcome. The studies we review measure different things with different methods, so meta-analytic pooling is inappropriate. We are closer to a "scoping review" or "critical appraisal" than a traditional Cochrane review.

## Relevant Meta-Research

### "Are Emergent Abilities of Large Language Models a Mirage?" (arXiv:2304.15004)

NeurIPS 2023 Outstanding Paper. Schaeffer, Miranda, & Koyejo (Stanford) found that more than 92% of claimed "emergent abilities" on BIG-Bench appeared under just two metrics (Multiple Choice Grade and Exact String Match), and argued that these abilities are artifacts of metric choice, not genuine phase transitions. When researchers used discontinuous metrics (exact-match accuracy), abilities appeared to emerge suddenly at certain scales. When they switched to continuous metrics (partial credit), the same data showed smooth, predictable improvement.

**Relevance to this project**: This paper is a paradigmatic example of "you measured it wrong" meta-research. It demonstrates that the *method of measurement* can create or destroy dramatic findings. Our survey asks the same question across a broader set of papers: are the claimed results genuine, or artifacts of how they were measured?

### Why This Matters: The Wakefield Precedent

In 1998, Andrew Wakefield published a study in *The Lancet* linking the MMR vaccine to autism. The study had 12 participants, undisclosed financial conflicts of interest, and ethical violations in how the children were recruited and tested. It took 12 years for the paper to be retracted (2010), and Wakefield was struck off the UK medical register. By then, the damage was done: vaccination rates dropped, measles outbreaks returned, and the anti-vaccination movement it fueled persists decades later.

Wakefield is the canonical example of what happens when a methodologically weak study escapes into public discourse without adequate scrutiny. The paper scored poorly on every dimension we measure: no reproducibility (data later found to be fabricated), no statistical rigor (N=12, no controls), inappropriate methodology (case series presented as causal evidence), claims wildly exceeding the evidence, and zero honest limitations discussion.

The lesson is not that peer review failed (it did, but that is a systemic problem). The lesson is that *methodological quality assessment after publication* is a necessary check. The Cochrane Collaboration exists precisely because individual studies, even peer-reviewed ones, cannot be taken at face value. Someone has to do the structured, dispassionate evaluation.

The AI/LLM research space has analogous risk factors:

- **Strong commercial incentives** distort what gets published and how findings are framed
- **Rapid publication pace** (preprints, blog posts) bypasses traditional review
- **Hype-driven media coverage** amplifies dramatic claims without methodological scrutiny
- **Limited replication culture** means few papers are independently verified
- **Benchmark optimization** creates results that look impressive but do not generalize

No AI methodology paper is likely to cause a public health crisis. But inflated productivity claims influence hiring decisions, investment allocation, and engineering practice. Inflated safety claims create false confidence. Deflated capability claims cause premature dismissal of useful techniques. Getting the methodology right matters because people make real decisions based on these numbers.

### Broader Meta-Research Context

The "replication crisis" in psychology and social science (beginning ~2011) demonstrated that many published findings did not hold up under scrutiny. Key lessons:

- Publication bias toward positive results inflates effect sizes
- Small sample sizes produce unstable estimates
- Researcher degrees of freedom (flexible analysis choices) enable p-hacking
- Pre-registration and replication requirements improve reliability

The AI/LLM research space shares several of these risk factors: rapid publication pace, strong commercial incentives, limited replication culture, and benchmarks that can be optimized against. Our survey explicitly checks for these patterns.
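
To make the "small sample sizes produce unstable estimates" lesson concrete, here is a minimal Python sketch that compares the uncertainty around the same observed rate at two sample sizes. It uses the Wilson score interval, a standard confidence interval for a proportion. The function name `wilson_interval` and the counts (9 of 12 vs. 900 of 1200) are hypothetical, chosen only for illustration; this is not a description of the survey's actual scoring procedure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return center - half_width, center + half_width


# Hypothetical counts: the same observed 75% rate at two sample sizes.
for successes, n in [(9, 12), (900, 1200)]:
    low, high = wilson_interval(successes, n)
    print(f"n={n:5d}  observed={successes / n:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

With N=12 the interval spans tens of percentage points, while with N=1200 it narrows to a few. A headline figure from a very small study is therefore compatible with a wide range of true rates, which is one reason sample size is worth checking before taking a reported effect at face value.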