ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

requirements.md (7744B)


# Project Requirements

## Motivation

Research papers in the agentic AI / LLM programming space are proliferating rapidly, but their methodological quality varies enormously. Productivity claims range from "19% slower" to "10x faster" depending on what was measured, how it was measured, and what was left unmeasured. Most widely cited studies do not describe *how* the AI was used. They report percentage gains without distinguishing autocomplete from chat from agentic workflows with feedback loops.

This project is a **systematic review** evaluating the methodological quality of research papers in this space. The goal is not to determine which AI tool is best, but to assess how rigorously the claims are supported and help readers calibrate their confidence in reported findings.

**Target scope**: ~1,000 papers. The scan agent is cheap enough to run on every paper. Deep eval is reserved for a smaller subset selected after scan results are in.

## Pipeline Architecture

```
discover -> download -> scan -> deep eval (optional) -> aggregate
```

1. **Discover**: Harvester agent searches arXiv, HuggingFace trending, Semantic Scholar, and other sources. It adds entries to `registry.jsonl` with status `queued`.
2. **Download**: Papers are either manually dropped into `inbox/` or downloaded by a future automation step. The inbox-sorter agent identifies, files, and registers them.
3. **Scan**: Scan agent reads each paper and produces a structured `scan.json` per the schema. It extracts claims, scores six rubric dimensions, assigns methodology tags, and records red flags.
4. **Deep Eval** (optional): Deep-eval agent attempts to run released code, reproduce key results, and check for benchmark contamination. It is triggered selectively for high-impact or suspicious papers.
5. **Aggregate**: Final analysis across all scanned papers to identify patterns in methodological quality.

## Key Decisions

- **JSONL for registry**: One JSON object per line in `registry.jsonl`. Easy to append, easy to grep, easy to diff. No need for a database.
- **Paper-per-directory**: Each paper gets its own directory under `papers/` containing the PDF (local only), `scan.json`, and optionally `deep_eval.json`.
- **No PDF redistribution**: PDFs are stored locally for analysis but never committed or published. Only structured outputs (scan results, evaluations) are publishable.
- **Agent tier design**: Two tiers of evaluation. The scan agent is cheap and fast, producing a structured assessment from reading the paper. The deep-eval agent is expensive and slow, attempting to reproduce results. Most papers only get scanned; deep eval is reserved for high-impact or contested papers.
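
The JSONL and paper-per-directory decisions could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names (`append_entry`, `read_entries`, `paper_dir`) and the entry fields other than `status` are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def append_entry(registry: Path, entry: dict) -> None:
    """Append one JSON object as a single line (hypothetical helper)."""
    with registry.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_entries(registry: Path) -> list[dict]:
    """Read every non-empty line of the registry as one JSON object."""
    if not registry.exists():
        return []
    with registry.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def paper_dir(root: Path, slug: str) -> Path:
    """Directory that would hold the PDF, scan.json and deep_eval.json."""
    return root / "papers" / slug


if __name__ == "__main__":
    # Demo in a throwaway directory; "slug" and "title" are assumed field names.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        registry = root / "registry.jsonl"
        append_entry(registry, {"slug": "example-paper", "title": "Example", "status": "queued"})
        for entry in read_entries(registry):
            print(entry["slug"], entry["status"], paper_dir(root, entry["slug"]))
```

Because each entry is one line, appending never rewrites existing lines, which is what keeps diffs and `grep` output readable.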

## Agent Tier Design

### Harvester Agent (Discovery)
- Bulk discovery from arXiv, Semantic Scholar, HuggingFace, conferences
- Fills registry entries with metadata only
- **Model: Sonnet** (fast, cheap, structured metadata extraction)

### Scan Agent (Tier 1)
- Reads paper text
- Fills out `scan.json` per schema
- Scores six rubric dimensions
- Extracts claims with supporting evidence
- Extracts cited papers for citation-chasing pipeline
- Fast, cheap, runs on every paper
- **Model: Opus** (requires judgment on methodology quality)

### Deep-Eval Agent (Tier 2)
- Attempts to run released code
- Tries to reproduce key results
- Checks for benchmark contamination
- Expensive, slow, runs selectively
- **Model: Opus**

## Tag Taxonomy

### Topic Tags
- `productivity` - Developer productivity studies
- `reliability` - Compounding reliability, error rates
- `scaling` - Scaling laws, model size effects
- `security` - Prompt injection, alignment, supply chain
- `benchmarks` - Benchmark design and evaluation
- `agents` - Agentic workflows and scaffolding
- `code-generation` - Code generation quality and techniques
- `alignment` - Model alignment and safety
- `survey` - Meta-research, literature reviews
- `economics` - Cost, inference economics, democratization

### Methodology Tags (assigned by scan agent)
- `rct` - Randomized controlled trial
- `observational` - Observational study
- `benchmark-eval` - Benchmark evaluation
- `case-study` - Case study or anecdotal
- `meta-analysis` - Meta-analysis or systematic review
- `theoretical` - Theoretical / analytical
- `qualitative` - Qualitative research
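
A closed taxonomy like this is easy to enforce mechanically. The sketch below shows one possible check; the tag sets are copied from the lists above, while the function name `validate_tags` and its return shape are assumptions for illustration.

```python
# Tag sets copied from the taxonomy above.
TOPIC_TAGS = frozenset({
    "productivity", "reliability", "scaling", "security", "benchmarks",
    "agents", "code-generation", "alignment", "survey", "economics",
})
METHODOLOGY_TAGS = frozenset({
    "rct", "observational", "benchmark-eval", "case-study",
    "meta-analysis", "theoretical", "qualitative",
})


def validate_tags(topic_tags: list[str], methodology_tags: list[str]) -> list[str]:
    """Return a list of problems; an empty list means every tag is known.

    Hypothetical helper: the real pipeline may validate against a schema instead.
    """
    problems = []
    for tag in topic_tags:
        if tag not in TOPIC_TAGS:
            problems.append(f"unknown topic tag: {tag}")
    for tag in methodology_tags:
        if tag not in METHODOLOGY_TAGS:
            problems.append(f"unknown methodology tag: {tag}")
    return problems
```

Rejecting unknown tags at scan time would keep near-duplicates such as `agent` versus `agents` from fragmenting the later breakdowns by tag.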

## Build Pipeline

```
registry.jsonl (queued)
  → scripts/download-arxiv.py         → papers/<slug>/paper.pdf  (downloaded)
  → scripts/extract-text.py           → papers/<slug>/paper.txt
  → scan agent (Opus)                 → papers/<slug>/scan.json  (scanned)
  → scripts/harvest-citations.py      → new registry entries
  → scripts/build-summary.py          → analysis/summary.json + summary.md
  → LaTeX build                       → paper.pdf
```
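
The parenthesized labels in the diagram read like registry status values. Assuming that is what they are, and assuming a strictly linear order (neither is confirmed here), the status progression could be sketched as:

```python
# Assumed linear progression, taken from the labels in the diagram above.
NEXT_STATUS = {"queued": "downloaded", "downloaded": "scanned"}


def advance(entry: dict) -> dict:
    """Return a copy of a registry entry moved to its next status.

    Hypothetical helper; raises if the entry is already at the last
    known status or carries an unrecognized one.
    """
    current = entry.get("status")
    if current not in NEXT_STATUS:
        raise ValueError(f"cannot advance from status {current!r}")
    return {**entry, "status": NEXT_STATUS[current]}
```

Returning a copy rather than mutating in place keeps the original line available for comparison when the registry is rewritten.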

Text extraction uses pymupdf (free, fast). If the pymupdf output fails quality checks (too short, garbled, or low words-per-page), extraction falls back to Sonnet via the `claude` CLI.
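
The three quality checks named above might look something like the sketch below. All thresholds and the "garbled" heuristic are assumptions chosen for illustration, not the values used by `scripts/extract-text.py`, and the fallback call itself is left out.

```python
import re

# Assumed thresholds; the project's real values are not stated.
MIN_CHARS = 2000
MIN_WORDS_PER_PAGE = 100
MIN_WORDLIKE_RATIO = 0.7

# A token counts as "word-like" if it is letters with optional trailing punctuation.
WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'\-]*[.,;:!?)]?")


def quality_problems(text: str, page_count: int) -> list[str]:
    """Return the failed checks; an empty list means the text looks usable."""
    problems = []
    tokens = text.split()
    if len(text) < MIN_CHARS:
        problems.append("too short")
    if page_count > 0 and len(tokens) / page_count < MIN_WORDS_PER_PAGE:
        problems.append("low words-per-page")
    if tokens:
        wordlike = sum(1 for t in tokens if WORDLIKE.fullmatch(t))
        if wordlike / len(tokens) < MIN_WORDLIKE_RATIO:
            problems.append("garbled")
    return problems


def extract_with_pymupdf(pdf_path: str) -> tuple[str, int]:
    """Extract plain text and page count with PyMuPDF."""
    import pymupdf  # imported lazily; older releases expose the module as `fitz`

    with pymupdf.open(pdf_path) as doc:
        pages = [page.get_text() for page in doc]
    return "\n".join(pages), len(pages)
```

A caller would run `extract_with_pymupdf`, pass the result to `quality_problems`, and trigger the fallback only when the returned list is non-empty.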

The summary artifact (`analysis/summary.json` and `analysis/summary.md`) is built before the LaTeX paper. It contains score distributions, ranked lists, red flag counts, and breakdowns by year/tag/methodology. This is the working document for developing the narrative sections.
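
As a rough illustration of the kind of aggregation involved, the sketch below builds a few of these summary fields from in-memory scan records. The record fields (`slug`, `methodology_tags`, `red_flags`, `scores`) and the output keys are assumptions; the actual `scan.json` schema is not reproduced here.

```python
from collections import Counter
from statistics import mean


def build_summary(scans: list[dict]) -> dict:
    """Aggregate scan records into counts and a ranking (hypothetical fields)."""
    by_methodology = Counter(
        tag for s in scans for tag in s.get("methodology_tags", [])
    )
    red_flags = Counter(flag for s in scans for flag in s.get("red_flags", []))
    # Total score per paper: sum over whatever rubric dimensions are present.
    totals = {s["slug"]: sum(s["scores"].values()) for s in scans}
    ranked = sorted(totals, key=lambda slug: totals[slug], reverse=True)
    return {
        "paper_count": len(scans),
        "by_methodology": dict(by_methodology),
        "red_flag_counts": dict(red_flags),
        "ranked_by_total_score": ranked,
        "mean_total_score": mean(totals.values()) if totals else None,
    }
```

Summing rubric dimensions into one total is only one possible ranking; per-dimension distributions would be computed the same way from `scores`.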

## Output Format

LaTeX paper. Submittable to academic venues.

### Venue Brainstorm

**Top targets (SE + AI intersection):**
- **ICSE** (International Conference on Software Engineering) - premier SE venue, has had AI4SE tracks
- **FSE/ESEC** (Foundations of Software Engineering) - strong SE venue, accepts empirical studies
- **ASE** (Automated Software Engineering) - good fit for tooling/methodology papers
- **MSR** (Mining Software Repositories) - empirical SE, data-heavy studies welcome

**AI/ML venues:**
- **NeurIPS** (Datasets and Benchmarks track) - good fit for "the benchmarks are broken" angle
- **ICML** - possible but harder sell for a survey
- **AAAI** - broad AI, accepts surveys and position papers
- **COLM** (Conference on Language Modeling) - new venue, directly relevant

**NLP venues:**
- **ACL** - accepts survey/position papers, has done "methodology critique" before
- **EMNLP** - similar to ACL, slightly more empirical focus
- **NAACL** - regional but well-regarded

**Journals:**
- **TOSEM** (ACM Transactions on Software Engineering and Methodology) - perfect fit for a systematic review
- **TSE** (IEEE Transactions on Software Engineering) - prestigious, accepts surveys
- **EMSE** (Empirical Software Engineering) - Springer journal, literally made for this
- **Nature Machine Intelligence** - high impact, accepts perspective/review articles
- **Communications of the ACM** - broad reach, good for "the field has a methodology problem" message

**Meta-research / open science:**
- **MetaArXiv** - preprint server for meta-research
- **Royal Society Open Science** - open access, accepts methodological critiques across fields

**Workshop / special tracks:**
- **NeurIPS Datasets and Benchmarks** - if framed as benchmark quality assessment
- **ICSE NIER** (New Ideas and Emerging Results) - shorter format, good for early results
- **LLM4Code** workshop (co-located with ICSE) - directly on topic

### Venue Strategy

TOSEM and EMSE are the most natural fits: they publish systematic reviews routinely, the reviewers understand the format, and "methodological quality of AI/SE research" is exactly their beat. For maximum impact outside SE, NeurIPS Datasets and Benchmarks or Nature Machine Intelligence would reach the ML audience that needs to hear it most.

## Virality Tracking

Deferred until after initial data collection. Future work may track citation counts, social media mentions, and media coverage to correlate with methodological quality.
