requirements.md (7744B)
# Project Requirements

## Motivation

Research papers in the agentic AI / LLM programming space are proliferating rapidly, but their methodological quality varies enormously. Productivity claims range from "19% slower" to "10x faster" depending on what was measured, how it was measured, and what was left unmeasured. Most widely cited studies do not describe *how* the AI was used. They report percentage gains without distinguishing autocomplete from chat from agentic workflows with feedback loops.

This project is a **systematic review** evaluating the methodological quality of research papers in this space. The goal is not to determine which AI tool is best, but to assess how rigorously the claims are supported and help readers calibrate their confidence in reported findings.

**Target scope**: ~1,000 papers. The scan agent is cheap enough to run on every paper. Deep eval is reserved for a smaller subset selected after scan results are in.

## Pipeline Architecture

```
discover -> download -> scan -> deep eval (optional) -> aggregate
```

1. **Discover**: Harvester agent searches arXiv, HuggingFace trending, Semantic Scholar, and other sources. Adds entries to `registry.jsonl` with status `queued`.
2. **Download**: Papers are either manually dropped into `inbox/` or downloaded by a future automation step. The inbox-sorter agent identifies, files, and registers them.
3. **Scan**: Scan agent reads each paper and produces a structured `scan.json` per the schema. Extracts claims, scores six rubric dimensions, assigns methodology tags, and flags red flags.
4. **Deep Eval** (optional): Deep-eval agent attempts to run released code, reproduce key results, and check for benchmark contamination. Triggered selectively for high-impact or suspicious papers.
5. **Aggregate**: Final analysis across all scanned papers to identify patterns in methodological quality.
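
The Discover step above could append to the registry roughly as follows. This is a minimal sketch, not the project's implementation: the field names (`slug`, `title`, `source`) and the helper functions are hypothetical, since the registry schema is not specified here. Only the file name `registry.jsonl` and the `queued` status come from the description above.

```python
import json
from pathlib import Path


def append_entry(registry: Path, slug: str, title: str, source: str) -> dict:
    """Append one queued paper to the registry as a single JSON line."""
    # Field names are illustrative assumptions, not the project's schema.
    entry = {"slug": slug, "title": title, "source": source, "status": "queued"}
    with registry.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def load_entries(registry: Path) -> list[dict]:
    """Read every non-empty line of the registry as a JSON object."""
    if not registry.exists():
        return []
    with registry.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_by_status(entries: list[dict]) -> dict[str, int]:
    """Tally entries per status, e.g. how many papers are still queued."""
    counts: dict[str, int] = {}
    for entry in entries:
        status = entry.get("status", "unknown")
        counts[status] = counts.get(status, 0) + 1
    return counts
```

Because each entry is one line, adding a paper is an append rather than a rewrite of the file, and a status tally needs only a single pass over the lines.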

## Key Decisions

- **JSONL for registry**: One JSON object per line in `registry.jsonl`. Easy to append, easy to grep, easy to diff. No need for a database.
- **Paper-per-directory**: Each paper gets its own directory under `papers/` containing the PDF (local only), `scan.json`, and optionally `deep_eval.json`.
- **No PDF redistribution**: PDFs are stored locally for analysis but never committed or published. Only structured outputs (scan results, evaluations) are publishable.
- **Agent tier design**: Two tiers of evaluation. The scan agent is cheap and fast, producing a structured assessment from reading the paper. The deep-eval agent is expensive and slow, attempting to reproduce results. Most papers only get scanned; deep eval is reserved for high-impact or contested papers.

## Agent Tier Design

### Harvester Agent (Discovery)
- Bulk discovery from arXiv, Semantic Scholar, HuggingFace, conferences
- Fills registry entries with metadata only
- **Model: Sonnet** (fast, cheap, structured metadata extraction)

### Scan Agent (Tier 1)
- Reads paper text
- Fills out `scan.json` per schema
- Scores six rubric dimensions
- Extracts claims with supporting evidence
- Extracts cited papers for citation-chasing pipeline
- Fast, cheap, runs on every paper
- **Model: Opus** (requires judgment on methodology quality)

### Deep-Eval Agent (Tier 2)
- Attempts to run released code
- Tries to reproduce key results
- Checks for benchmark contamination
- Expensive, slow, runs selectively
- **Model: Opus**

## Tag Taxonomy

### Topic Tags
- `productivity` - Developer productivity studies
- `reliability` - Compounding reliability, error rates
- `scaling` - Scaling laws, model size effects
- `security` - Prompt injection, alignment, supply chain
- `benchmarks` - Benchmark design and evaluation
- `agents` - Agentic workflows and scaffolding
- `code-generation` - Code generation quality and techniques
- `alignment` - Model alignment and safety
- `survey` - Meta-research, literature reviews
- `economics` - Cost, inference economics, democratization

### Methodology Tags (assigned by scan agent)
- `rct` - Randomized controlled trial
- `observational` - Observational study
- `benchmark-eval` - Benchmark evaluation
- `case-study` - Case study or anecdotal
- `meta-analysis` - Meta-analysis or systematic review
- `theoretical` - Theoretical / analytical
- `qualitative` - Qualitative research

## Build Pipeline

```
registry.jsonl (queued)
  → scripts/download-arxiv.py → papers/<slug>/paper.pdf (downloaded)
  → scripts/extract-text.py → papers/<slug>/paper.txt
  → scan agent (Opus) → papers/<slug>/scan.json (scanned)
  → scripts/harvest-citations.py → new registry entries
  → scripts/build-summary.py → analysis/summary.json + summary.md
  → LaTeX build → paper.pdf
```

Text extraction uses pymupdf (free, fast). Falls back to Sonnet via `claude` CLI if pymupdf output fails quality checks (too short, garbled, low words-per-page).

The summary artifact (`analysis/summary.json` and `analysis/summary.md`) is built before the LaTeX paper. It contains score distributions, ranked lists, red flag counts, and breakdowns by year/tag/methodology. This is the working document for developing the narrative sections.

## Output Format

LaTeX paper. Submittable to academic venues.

### Venue Brainstorm

**Top targets (SE + AI intersection):**
- **ICSE** (International Conference on Software Engineering): premier SE venue, has had AI4SE tracks
- **FSE/ESEC** (Foundations of Software Engineering): strong SE venue, accepts empirical studies
- **ASE** (Automated Software Engineering): good fit for tooling/methodology papers
- **MSR** (Mining Software Repositories): empirical SE, data-heavy studies welcome

**AI/ML venues:**
- **NeurIPS** (Datasets and Benchmarks track): good fit for "the benchmarks are broken" angle
- **ICML**: possible but harder sell for a survey
- **AAAI**: broad AI, accepts surveys and position papers
- **COLM** (Conference on Language Modeling): new venue, directly relevant

**NLP venues:**
- **ACL**: accepts survey/position papers, has done "methodology critique" before
- **EMNLP**: similar to ACL, slightly more empirical focus
- **NAACL**: regional but well-regarded

**Journals:**
- **TOSEM** (ACM Transactions on Software Engineering and Methodology): perfect fit for a systematic review
- **TSE** (IEEE Transactions on Software Engineering): prestigious, accepts surveys
- **EMSE** (Empirical Software Engineering): Springer journal, literally made for this
- **Nature Machine Intelligence**: high impact, accepts perspective/review articles
- **Communications of the ACM**: broad reach, good for "the field has a methodology problem" message

**Meta-research / open science:**
- **MetaArXiv**: preprint server for meta-research
- **Royal Society Open Science**: open access, accepts methodological critiques across fields

**Workshop / special tracks:**
- **NeurIPS Datasets and Benchmarks**: if framed as benchmark quality assessment
- **ICSE NIER** (New Ideas and Emerging Results): shorter format, good for early results
- **LLM4Code** workshop (co-located with ICSE): directly on topic

### Venue Strategy

TOSEM and EMSE are the most natural fits: they publish systematic reviews routinely, their reviewers understand the format, and "methodological quality of AI/SE research" is exactly their beat. For maximum impact outside SE, NeurIPS Datasets & Benchmarks or Nature Machine Intelligence would reach the ML audience that needs to hear it most.

## Virality Tracking

Deferred until after initial data collection. Future work may track citation counts, social media mentions, and media coverage to correlate with methodological quality.