Initial scaffold for AI research survey project - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

commit c75cb779a62ec9a2460f035d431cdca818dcc407
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 27 Feb 2026 20:51:32 +0100

Initial scaffold for AI research survey project

Set up systematic review pipeline for evaluating methodological quality
of agentic AI/LLM programming research papers.

- context/: Project requirements, scoring methodology (6 dimensions,
  0-3 scale), and related work (Cochrane, PRISMA, emergent abilities)
- schema/: JSON Schemas for scan results, deep evaluations, and
  registry entries
- agents/: Prompt files for scan, deep-eval, harvester, and
  inbox-sorter sub-agents
- registry.jsonl: Seeded with 16 papers from existing knowledge base
- papers/, inbox/: Empty directories for paper storage pipeline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Diffstat:
A agents/deep-eval-agent.md  | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A agents/harvester-agent.md  | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A agents/inbox-sorter-agent.md  | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A agents/scan-agent.md  | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A context/methodology.md  | 115 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A context/related-work.md  | 46 ++++++++++++++++++++++++++++++++++++++++++++++
A context/requirements.md  | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A inbox/.gitkeep  | 0 
A papers/.gitkeep  | 0 
A registry.jsonl  | 16 ++++++++++++++++
A schema/deep-eval.schema.json  | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A schema/registry.schema.json  | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A schema/scan.schema.json  | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

13 files changed, 825 insertions(+), 0 deletions(-)
diff --git a/agents/deep-eval-agent.md b/agents/deep-eval-agent.md
@@ -0,0 +1,63 @@
+# Deep Evaluation Agent
+
+You are a deep evaluation agent. Your job is to go beyond reading a paper and attempt to verify its claims by running code, reproducing results, and checking for benchmark contamination.
+
+## Input
+
+You will be given:
+- The paper's directory under `papers/` containing the PDF and `scan.json`
+- The paper's registry entry from `registry.jsonl`
+- Access to the paper's released code repository (if any)
+
+## Output
+
+Produce a JSON file conforming to `schema/deep-eval.schema.json` and save it as `deep_eval.json` in the paper's directory.
+
+## Instructions
+
+### 1. Check If Code Runs
+
+If the paper released code:
+- Clone or download the repository
+- Follow the setup instructions exactly as written
+- Attempt to run the code in a clean environment
+- Document every step: what worked, what failed, what workarounds were needed
+- Note any undocumented dependencies or environment requirements
+
+If no code was released, set `attempted: false` and note this in details.
+
+### 2. Attempt to Reproduce Results
+
+If the code runs:
+- Identify the key results claimed in the paper (reference `scan.json` claims)
+- Run the experiments or evaluations described
+- Compare your results to the paper's reported results
+- Document any discrepancies and their magnitude
+- Note if reproduction requires resources you don't have (e.g., 8xA100 cluster)
+
+If reproduction is not feasible, explain why and set `attempted: false`.
+
+### 3. Check Benchmark Contamination
+
+For papers that report benchmark results:
+- Check if training data could contain benchmark examples
+- Look for temporal overlap (model trained after benchmark published)
+- Check if the paper addresses contamination
+- Note any contamination concerns
+
+### 4. Document Additional Findings
+
+Note anything else discovered during deep evaluation:
+- Undocumented assumptions in the code
+- Discrepancies between paper description and code implementation
+- Hardcoded values that should be parameters
+- Data preprocessing steps not mentioned in the paper
+- Anything that affects the credibility of the results
+
+## Guidelines
+
+- Be methodical and document everything. The value is in the detailed record.
+- Do not modify the paper's code to make it work (document what's broken instead).
+- If reproduction requires expensive compute, document what you can verify and note what you cannot.
+- This evaluation is expensive. Focus on the claims that matter most.
+- Update the paper's registry entry status to `deep_eval` when complete.
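
The temporal-overlap check in section 3 of `agents/deep-eval-agent.md` can be sketched as a simple date comparison. This is a minimal illustration, not part of the commit: the function name `has_temporal_overlap` and the example dates are hypothetical, and a real check would also need the model's documented training cutoff, which many papers do not report.

```python
from datetime import date


def has_temporal_overlap(benchmark_published: date, training_cutoff: date) -> bool:
    """Return True when the training data could postdate the benchmark.

    If the training cutoff falls on or after the benchmark's publication
    date, benchmark examples may have leaked into the training corpus.
    This flags a risk; it does not prove contamination.
    """
    return training_cutoff >= benchmark_published


# Hypothetical dates, for illustration only.
print(has_temporal_overlap(date(2021, 7, 1), date(2023, 4, 1)))  # True
print(has_temporal_overlap(date(2024, 6, 1), date(2023, 4, 1)))  # False
```

A `True` result only marks the paper for a closer look at whether it addresses contamination; a `False` result does not rule out leakage through later fine-tuning data.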
diff --git a/agents/harvester-agent.md b/agents/harvester-agent.md
@@ -0,0 +1,76 @@
+# Harvester Agent
+
+You are a paper discovery agent. Your job is to find research papers relevant to the survey and add them to the registry. You do NOT download papers.
+
+## Input
+
+You will be given:
+- The current `registry.jsonl` (to avoid duplicates)
+- Search parameters: topic areas, date ranges, venues to check
+- The project's inclusion/exclusion criteria from `context/methodology.md`
+
+## Output
+
+Append new entries to `registry.jsonl`, one JSON object per line, conforming to `schema/registry.schema.json`.
+
+## Sources to Search
+
+1. **arXiv**: Search cs.SE, cs.AI, cs.CL, cs.LG for relevant papers
+2. **Semantic Scholar**: Use the API to find papers by keyword, citation graph, and venue
+3. **HuggingFace**: Check trending papers and daily papers for relevant work
+4. **Conference proceedings**: NeurIPS, ICML, ACL, EMNLP, ICLR, ICSE, FSE, ASE
+
+## Instructions
+
+### 1. Search for Papers
+
+For each source, search using relevant keywords:
+- "AI coding", "LLM programming", "agentic coding", "code generation"
+- "AI developer productivity", "AI software engineering"
+- "LLM benchmark", "code benchmark"
+- "prompt injection", "AI alignment", "AI safety"
+- "scaling laws LLM"
+
+### 2. Check Against Registry
+
+Before adding a paper, check `registry.jsonl` for duplicates by:
+- arXiv ID match
+- DOI match
+- Title similarity (fuzzy match)
+
+### 3. Create Registry Entries
+
+For each new paper found, create a JSONL entry with:
+- `id`: Generate a URL-safe slug (e.g., "metr-rct-2025")
+- `title`: Full paper title
+- `authors`: Author list (at least first and last author)
+- `year`: Publication year
+- `venue`: Where published (arXiv, conference name, journal)
+- `source_url`: Link to the paper
+- `arxiv_id`: If available
+- `doi`: If available
+- `source`: How you found it (e.g., "arxiv", "semantic_scholar", "huggingface")
+- `status`: Always "queued" for new discoveries
+- `tags`: Initial topic tags based on title/abstract
+- `added`: Today's date
+- `notes`: Brief note on why this paper is relevant
+
+### 4. Prioritize Quality Over Quantity
+
+Focus on papers that:
+- Make empirical claims about AI/LLM capability or productivity
+- Are widely cited or from reputable venues
+- Cover underrepresented topics in the current registry
+- Have clear methodology that can be assessed
+
+Skip papers that:
+- Are pure opinion or commentary
+- Are product announcements
+- Don't make falsifiable claims
+- Are outside scope (non-code domains, unless methodology is transferable)
+
+## Guidelines
+
+- Discovery only. Do not download PDFs or access full paper text.
+- When in doubt about relevance, include it with a note explaining the uncertainty.
+- Log your search queries and results for reproducibility.
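
The duplicate check in section 2 of `agents/harvester-agent.md` (arXiv ID, DOI, fuzzy title) might look roughly like the sketch below. The helper names and the 0.9 similarity threshold are assumptions chosen for illustration; the commit does not specify a matching algorithm. Title similarity uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def is_duplicate(candidate: dict, registry: list[dict], threshold: float = 0.9) -> bool:
    """Return True if the candidate matches any registry entry.

    Exact identifier matches (arxiv_id, doi) are checked first; otherwise
    normalized titles are compared with a similarity ratio. The threshold
    is an assumed value and would need tuning on real data.
    """
    cand_title = normalize_title(candidate.get("title", ""))
    for entry in registry:
        for key in ("arxiv_id", "doi"):
            if candidate.get(key) and candidate.get(key) == entry.get(key):
                return True
        entry_title = normalize_title(entry.get("title", ""))
        if cand_title and SequenceMatcher(None, cand_title, entry_title).ratio() >= threshold:
            return True
    return False


registry = [
    {
        "id": "metr-rct-2025",
        "title": "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity",
        "arxiv_id": "2507.09089",
    }
]

# Same arXiv ID, different title text: duplicate.
print(is_duplicate({"title": "METR study", "arxiv_id": "2507.09089"}, registry))
# No identifiers, unrelated title: not a duplicate.
print(is_duplicate({"title": "Scaling Laws for Neural Language Models"}, registry))
```

Checking identifiers before titles keeps the common case cheap and avoids false positives from papers with near-identical titles but different IDs being merged silently; such cases would still be caught by the title ratio and may deserve a manual look.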
diff --git a/agents/inbox-sorter-agent.md b/agents/inbox-sorter-agent.md
@@ -0,0 +1,67 @@
+# Inbox Sorter Agent
+
+You are an inbox processing agent. Your job is to take papers that have been dropped into the `inbox/` directory, identify them, file them properly, and update the registry.
+
+## Input
+
+You will be given:
+- The `inbox/` directory containing one or more PDF files
+- The current `registry.jsonl`
+
+## Output
+
+For each PDF in `inbox/`:
+1. A new directory under `papers/` containing the PDF
+2. An updated or new entry in `registry.jsonl`
+
+## Instructions
+
+### 1. Identify Each Paper
+
+For each PDF in `inbox/`:
+- Read the first few pages to extract: title, authors, year, venue
+- Look for arXiv ID, DOI, or other identifiers
+- Check if the paper already exists in `registry.jsonl`
+
+### 2. Create Paper Directory
+
+Create a directory under `papers/` using the paper's slug ID:
+```
+papers/{slug}/
+  paper.pdf    (the original PDF, renamed)
+```
+
+The slug should match the registry ID format: lowercase, hyphen-separated, concise (e.g., `metr-rct-2025`).
+
+### 3. Move the PDF
+
+Move (not copy) the PDF from `inbox/` to the new paper directory, renaming it to `paper.pdf`.
+
+### 4. Update Registry
+
+If the paper already has a registry entry:
+- Update `status` to `"downloaded"`
+- Set `directory` to the paper directory path
+
+If the paper is new (not in registry):
+- Create a new JSONL entry with all available metadata
+- Set `source` to `"inbox"`
+- Set `status` to `"downloaded"`
+- Set `directory` to the paper directory path
+- Add initial topic `tags` based on title/abstract
+- Set `added` to today's date
+
+### 5. Handle Ambiguity
+
+If you cannot confidently identify a paper:
+- Still create the directory and move the PDF
+- Use a descriptive slug based on what you can determine
+- Set `notes` to describe the ambiguity
+- Add a `needs-review` tag
+
+## Guidelines
+
+- Process all PDFs in `inbox/` in a single run.
+- Never leave a PDF in `inbox/` unprocessed without explanation.
+- If a PDF is not a research paper (e.g., a slide deck or report), still file it but add a `non-paper` tag and note.
+- Preserve the original filename in the registry notes for traceability.
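
Sections 2 and 3 of `agents/inbox-sorter-agent.md` (create `papers/{slug}/`, then move and rename the PDF) can be sketched as follows. The function names `make_slug` and `file_pdf` are hypothetical, and refusing to overwrite an existing `paper.pdf` is an assumption the prompt file does not state.

```python
import re
import shutil
import tempfile
from pathlib import Path


def make_slug(text: str) -> str:
    """Build a lowercase, hyphen-separated slug such as 'metr-rct-2025'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def file_pdf(pdf_path: Path, slug: str, papers_dir: Path) -> Path:
    """Move (not copy) a PDF into papers/{slug}/paper.pdf and return the new path."""
    target_dir = papers_dir / slug
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "paper.pdf"
    if target.exists():
        # Assumed policy: never overwrite a paper that is already filed.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(pdf_path), str(target))
    return target


# Demonstration in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "inbox").mkdir()
    pdf = root / "inbox" / "download (3).pdf"
    pdf.write_bytes(b"%PDF-1.4")
    filed = file_pdf(pdf, make_slug("METR RCT 2025"), root / "papers")
    demo_relative_path = filed.relative_to(root).as_posix()
    demo_source_removed = not pdf.exists()
    print(demo_relative_path)  # papers/metr-rct-2025/paper.pdf
```

The original filename (`download (3).pdf` in the demonstration) is lost by the move, which is why the guidelines ask for it to be preserved in the registry notes.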
diff --git a/agents/scan-agent.md b/agents/scan-agent.md
@@ -0,0 +1,83 @@
+# Scan Agent
+
+You are a research paper scan agent. Your job is to read a research paper and produce a structured assessment of its methodological quality.
+
+## Input
+
+You will be given:
+- The text content of a research paper (PDF already converted to text)
+- The paper's registry entry from `registry.jsonl`
+
+## Output
+
+Produce a JSON file conforming to `schema/scan.schema.json` and save it as `scan.json` in the paper's directory under `papers/`.
+
+## Instructions
+
+### 1. Extract Paper Metadata
+
+Fill in the `paper` object: title, authors, year, venue, arxiv_id, doi. Use what is stated in the paper itself, not external sources.
+
+### 2. Score Each Rubric Dimension
+
+For each of the six dimensions, assign a score (0-3) and provide evidence from the paper text justifying that score. Be specific: cite section numbers, quote relevant passages, and reference figures or tables where applicable.
+
+**Dimensions:**
+
+- **Artifacts & Reproducibility** (0-3): Can someone reproduce this work? Look for: released code, datasets, environment specifications, reproduction instructions.
+- **Statistical Rigor** (0-3): Are the statistical methods appropriate? Look for: confidence intervals, significance tests, effect sizes, sample size justification, multiple comparisons correction.
+- **Benchmark Quality** (0-3): Are the benchmarks appropriate for the claims? Look for: benchmark relevance to claims, known limitations acknowledged, contamination checks, real-world validation.
+- **Claim-to-Evidence Ratio** (0-3): Do the claims stay within the evidence? Look for: overgeneralization, hedging vs. certainty, scope of claims vs. scope of study.
+- **Setup Transparency** (0-3): Is the experimental setup fully described? Look for: model versions, prompt text, scaffolding details, tool configurations, hyperparameters, post-processing steps.
+- **Limitations Discussion** (0-3): Does the paper honestly discuss limitations? Look for: threats to validity, scope limitations, known confounds, what the results do NOT show.
+
+**Scoring guide:**
+- 0 (absent): The dimension is not addressed at all
+- 1 (weak): Minimal or inadequate treatment
+- 2 (adequate): Reasonable treatment with minor gaps
+- 3 (strong): Thorough, specific, and exemplary treatment
+
+### 3. Extract Claims
+
+Identify the paper's key empirical claims. For each claim:
+- State the claim as written or closely paraphrased
+- Note the evidence provided (with section/page references)
+- Rate how well the evidence supports the claim: `strong`, `moderate`, `weak`, or `unsupported`
+
+Focus on empirical claims (things that can be verified), not opinions or motivations.
+
+### 4. Assign Methodology Tags
+
+Assign one or more tags describing the research methodology:
+- `rct` - Randomized controlled trial
+- `observational` - Observational study
+- `benchmark-eval` - Benchmark evaluation
+- `case-study` - Case study or anecdotal evidence
+- `meta-analysis` - Meta-analysis or systematic review
+- `theoretical` - Theoretical or analytical work
+- `qualitative` - Qualitative research
+
+### 5. Summarize Key Findings
+
+Write a 2-4 sentence summary of the paper's most important findings. Be factual and specific.
+
+### 6. Flag Red Flags
+
+Note any methodological concerns, including but not limited to:
+- Cherry-picked results or selective reporting
+- Benchmark gaming or contamination risk
+- Conflicts of interest (e.g., company evaluating its own product)
+- Missing baselines or unfair comparisons
+- Claims that significantly outrun the evidence
+- Tiny sample sizes for the claims being made
+- No error bars or uncertainty quantification
+
+If there are no red flags, return an empty array.
+
+## Guidelines
+
+- Be fair but rigorous. A low score is not an insult; it is information.
+- Quote the paper directly when possible.
+- If information is genuinely absent (not just hard to find), score it 0. Do not guess.
+- If you are uncertain about a score, err toward the lower score and explain your uncertainty in the evidence field.
+- Do not hallucinate content that is not in the paper.
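
The scoring rules in section 2 of `agents/scan-agent.md` (six dimensions, integer scores from 0 to 3, each backed by evidence) lend themselves to a small sanity check before `scan.json` is written. The sketch below assumes a `scores` mapping with the hypothetical keys listed in `DIMENSIONS`; the actual field names are defined in `schema/scan.schema.json` and may differ.

```python
# Hypothetical keys; the real names come from schema/scan.schema.json.
DIMENSIONS = (
    "artifacts_reproducibility",
    "statistical_rigor",
    "benchmark_quality",
    "claim_to_evidence",
    "setup_transparency",
    "limitations_discussion",
)


def validate_scores(scores: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name in DIMENSIONS:
        entry = scores.get(name)
        if entry is None:
            problems.append(f"{name}: missing")
            continue
        score = entry.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
            problems.append(f"{name}: score must be an integer from 0 to 3")
        if not str(entry.get("evidence", "")).strip():
            problems.append(f"{name}: evidence is required")
    return problems


example = {name: {"score": 2, "evidence": "See Section 4.1"} for name in DIMENSIONS}
example["statistical_rigor"] = {"score": 4, "evidence": ""}
del example["benchmark_quality"]

for problem in validate_scores(example):
    print(problem)
```

A check like this complements JSON Schema validation rather than replacing it: it gives the agent a readable list of what to fix instead of a single pass/fail result.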
diff --git a/context/methodology.md b/context/methodology.md
@@ -0,0 +1,115 @@
+# Survey Methodology
+
+## Precedent
+
+This project adapts systematic review methodology from medical research, particularly the Cochrane Review process and PRISMA reporting guidelines. These frameworks provide decades of refinement on how to:
+
+- Define inclusion/exclusion criteria defensibly
+- Score study quality on structured rubrics
+- Minimize reviewer bias through structured extraction
+- Report findings transparently
+
+The key adaptation is that we are reviewing *methodological quality*, not synthesizing effect sizes. We are not doing a meta-analysis of "how much does AI help"; we are asking "how well did each study support its claims."
+
+## Scoring Rubric
+
+Six dimensions, each scored 0-3.
+
+### 1. Artifacts & Reproducibility (0-3)
+
+Can someone reproduce this work?
+
+| Score | Meaning |
+|-------|---------|
+| 0 - Absent | No code, no data, no artifacts released |
+| 1 - Weak | Code or data mentioned but not available, or available but incomplete |
+| 2 - Adequate | Code and data released, reasonably documented |
+| 3 - Strong | Full reproduction package: code, data, environment specs, and instructions that a competent researcher could follow |
+
+**Rationale**: Reproducibility is the foundation of empirical science. In a field where benchmark scores can vary 10x based on scaffolding alone (GPT-4 SWE-bench: 2.7% to 28.3%), releasing artifacts is not optional.
+
+### 2. Statistical Rigor (0-3)
+
+Are the statistical methods appropriate for the claims made?
+
+| Score | Meaning |
+|-------|---------|
+| 0 - Absent | No statistical analysis; raw numbers only |
+| 1 - Weak | Basic descriptive statistics but no uncertainty quantification |
+| 2 - Adequate | Appropriate tests, confidence intervals, or effect sizes reported |
+| 3 - Strong | Pre-registered analysis plan, multiple comparisons correction, sensitivity analysis |
+
+**Rationale**: Many papers in this space report "X% improvement" without confidence intervals, significance tests, or acknowledgment of variance. This is especially problematic for small-N studies.
+
+### 3. Benchmark Quality (0-3)
+
+Are the benchmarks appropriate for the claims being made?
+
+| Score | Meaning |
+|-------|---------|
+| 0 - Absent | No benchmark or evaluation; claims unsupported |
+| 1 - Weak | Benchmarks used but inappropriate for the claim (e.g., HumanEval for agent capability) |
+| 2 - Adequate | Appropriate benchmarks with known limitations acknowledged |
+| 3 - Strong | Multiple complementary benchmarks, contamination checks, real-world validation |
+
+**Rationale**: Benchmark choice determines what is actually measured. HumanEval measures single-function completion; SWE-bench measures multi-step repo tasks. Using one to claim the other is a category error.
+
+### 4. Claim-to-Evidence Ratio (0-3)
+
+Do the claims stay within what the evidence supports?
+
+| Score | Meaning |
+|-------|---------|
+| 0 - Absent | Major claims with no supporting evidence |
+| 1 - Weak | Claims significantly overreach the evidence (e.g., "AI will replace developers" from a benchmark study) |
+| 2 - Adequate | Claims mostly supported, with minor overreach in discussion |
+| 3 - Strong | Claims precisely scoped to what was measured; limitations clearly stated |
+
+**Rationale**: The gap between "what was measured" and "what is claimed" is where most misleading narratives originate. The Stanford study measured git activity with Copilot autocomplete; headlines said "AI makes developers 30% faster."
+
+### 5. Setup Transparency (0-3)
+
+Is the experimental setup described well enough to understand what was actually tested?
+
+| Score | Meaning |
+|-------|---------|
+| 0 - Absent | Setup not described |
+| 1 - Weak | High-level description only (e.g., "we used GPT-4") |
+| 2 - Adequate | Model, prompts, parameters, and tools described |
+| 3 - Strong | Full setup including scaffolding, system prompts, tool configurations, and any post-processing |
+
+**Rationale**: In agentic AI research, the scaffolding around the model often matters more than the model itself. A paper that says "we used Claude" without describing the scaffold is not describing a reproducible experiment.
+
+### 6. Limitations Discussion (0-3)
+
+Does the paper honestly discuss what it does *not* show?
+
+| Score | Meaning |
+|-------|---------|
+| 0 - Absent | No limitations section or acknowledgment |
+| 1 - Weak | Boilerplate limitations section without substance |
+| 2 - Adequate | Genuine limitations discussed, including threats to validity |
+| 3 - Strong | Limitations are specific, actionable, and inform the reader about exactly when the results do and do not apply |
+
+**Rationale**: Honest limitations discussion is a strong signal of methodological maturity. Papers that acknowledge what they did not measure are more trustworthy than papers that don't.
+
+## Paper Selection
+
+### Inclusion Criteria
+- Published 2023 or later (post-GPT-4, when agentic AI research accelerated)
+- Makes empirical claims about AI/LLM capability, productivity, or safety
+- Relevant to software development, code generation, or agentic workflows
+- Available in English
+
+### Exclusion Criteria
+- Pure opinion pieces or blog posts (no empirical content)
+- Product announcements without methodology
+- Papers focused exclusively on non-code domains (unless methodology is transferable)
+
+### Sampling Strategy
+- **Seed set**: Papers cited in existing survey documents and well-known references
+- **Forward/backward citation chasing**: Follow citations from seed papers
+- **Venue monitoring**: arXiv cs.SE, cs.AI, cs.CL; major ML conferences (NeurIPS, ICML, ACL)
+- **Community sources**: HuggingFace trending, Semantic Scholar alerts
+
+This is a purposive sample, not a random one. The goal is coverage of the most influential and most cited papers, not statistical representativeness of all papers published.
diff --git a/context/related-work.md b/context/related-work.md
@@ -0,0 +1,46 @@
+# Related Work
+
+## Methodology Precedents
+
+### Cochrane Reviews
+
+The Cochrane Collaboration has produced systematic reviews of medical research since 1993. Their methodology is the gold standard for structured, reproducible literature review:
+
+- **Structured extraction**: Every study is assessed against a predefined rubric (Risk of Bias tool)
+- **Pre-registered protocols**: Review methodology is published before data extraction begins
+- **Multiple reviewers**: At least two independent reviewers extract data, with conflict resolution procedures
+- **GRADE framework**: Explicit scoring of evidence certainty across dimensions
+
+We adapt this approach for CS/AI research, recognizing that the field has different norms (preprints vs. peer review, code release vs. clinical trial registration) but the same underlying need for structured quality assessment.
+
+### PRISMA Reporting Guidelines
+
+PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) provides a checklist for transparent reporting. Key elements we adopt:
+
+- Explicit inclusion/exclusion criteria
+- Search strategy documentation
+- Flow diagram of paper selection
+- Structured data extraction
+
+### Limitations of the Analogy
+
+Medical systematic reviews typically synthesize effect sizes across comparable studies (e.g., "does drug X reduce mortality?"). Our review assesses *methodological quality* rather than synthesizing a single outcome. The studies we review measure different things with different methods, so meta-analytic pooling is inappropriate. We are closer to a "scoping review" or "critical appraisal" than a traditional Cochrane review.
+
+## Relevant Meta-Research
+
+### "Are Emergent Abilities of Large Language Models a Mirage?" (arXiv:2304.15004)
+
+NeurIPS 2023 Outstanding Paper. Schaeffer, Miranda, & Koyejo (Stanford) showed that 92% of claimed "emergent abilities" in LLMs were artifacts of metric choice, not genuine phase transitions. When researchers used discontinuous metrics (exact-match accuracy), abilities appeared to emerge suddenly at certain scales. When they switched to continuous metrics (partial credit), the same data showed smooth, predictable improvement.
+
+**Relevance to this project**: This paper is a paradigmatic example of "you measured it wrong" meta-research. It demonstrates that the *method of measurement* can create or destroy dramatic findings. Our survey asks the same question across a broader set of papers: are the claimed results genuine, or artifacts of how they were measured?
+
+### Broader Meta-Research Context
+
+The "replication crisis" in psychology and social science (beginning ~2011) demonstrated that many published findings did not hold up under scrutiny. Key lessons:
+
+- Publication bias toward positive results inflates effect sizes
+- Small sample sizes produce unstable estimates
+- Researcher degrees of freedom (flexible analysis choices) enable p-hacking
+- Pre-registration and replication requirements improve reliability
+
+The AI/LLM research space shares several of these risk factors: rapid publication pace, strong commercial incentives, limited replication culture, and benchmarks that can be optimized against. Our survey explicitly checks for these patterns.
diff --git a/context/requirements.md b/context/requirements.md
@@ -0,0 +1,68 @@
+# Project Requirements
+
+## Motivation
+
+Research papers in the agentic AI / LLM programming space are proliferating rapidly, but their methodological quality varies enormously. Productivity claims range from "19% slower" to "10x faster" depending on what was measured, how it was measured, and what was left unmeasured. Most widely cited studies do not describe *how* the AI was used. They report percentage gains without distinguishing autocomplete from chat from agentic workflows with feedback loops.
+
+This project is a **systematic review** evaluating the methodological quality of research papers in this space. The goal is not to determine which AI tool is best, but to assess how rigorously the claims are supported and help readers calibrate their confidence in reported findings.
+
+## Pipeline Architecture
+
+```
+discover -> download -> scan -> deep eval (optional) -> aggregate
+```
+
+1. **Discover**: Harvester agent searches arXiv, HuggingFace trending, Semantic Scholar, and other sources. Adds entries to `registry.jsonl` with status `queued`.
+2. **Download**: Papers are either manually dropped into `inbox/` or downloaded by a future automation step. The inbox-sorter agent identifies, files, and registers them.
+3. **Scan**: Scan agent reads each paper and produces a structured `scan.json` per the schema. Extracts claims, scores six rubric dimensions, assigns methodology tags, and flags red flags.
+4. **Deep Eval** (optional): Deep-eval agent attempts to run released code, reproduce key results, and check for benchmark contamination. Triggered selectively for high-impact or suspicious papers.
+5. **Aggregate**: Final analysis across all scanned papers to identify patterns in methodological quality.
+
+## Key Decisions
+
+- **JSONL for registry**: One JSON object per line in `registry.jsonl`. Easy to append, easy to grep, easy to diff. No need for a database.
+- **Paper-per-directory**: Each paper gets its own directory under `papers/` containing the PDF (local only), `scan.json`, and optionally `deep_eval.json`.
+- **No PDF redistribution**: PDFs are stored locally for analysis but never committed or published. Only structured outputs (scan results, evaluations) are publishable.
+- **Agent tier design**: Two tiers of evaluation. The scan agent is cheap and fast, producing a structured assessment from reading the paper. The deep-eval agent is expensive and slow, attempting to reproduce results. Most papers only get scanned; deep eval is reserved for high-impact or contested papers.
+
+## Agent Tier Design
+
+### Scan Agent (Tier 1)
+- Reads paper text
+- Fills out `scan.json` per schema
+- Scores six rubric dimensions
+- Extracts claims with supporting evidence
+- Fast, cheap, runs on every paper
+
+### Deep-Eval Agent (Tier 2)
+- Attempts to run released code
+- Tries to reproduce key results
+- Checks for benchmark contamination
+- Expensive, slow, runs selectively
+
+## Tag Taxonomy
+
+### Topic Tags
+- `productivity` - Developer productivity studies
+- `reliability` - Compounding reliability, error rates
+- `scaling` - Scaling laws, model size effects
+- `security` - Prompt injection, alignment, supply chain
+- `benchmarks` - Benchmark design and evaluation
+- `agents` - Agentic workflows and scaffolding
+- `code-generation` - Code generation quality and techniques
+- `alignment` - Model alignment and safety
+- `survey` - Meta-research, literature reviews
+- `economics` - Cost, inference economics, democratization
+
+### Methodology Tags (assigned by scan agent)
+- `rct` - Randomized controlled trial
+- `observational` - Observational study
+- `benchmark-eval` - Benchmark evaluation
+- `case-study` - Case study or anecdotal
+- `meta-analysis` - Meta-analysis or systematic review
+- `theoretical` - Theoretical / analytical
+- `qualitative` - Qualitative research
+
+## Virality Tracking
+
+Deferred until after initial data collection. Future work may track citation counts, social media mentions, and media coverage to correlate with methodological quality.
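
The "JSONL for registry" decision in `context/requirements.md` means the registry can be appended to and summarized with a few lines of standard-library code. The sketch below is illustrative only: the helper names are hypothetical, and the entries in the demonstration are trimmed to the fields needed here.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def append_entry(registry_path: Path, entry: dict) -> None:
    """Append one JSON object as a single line."""
    with registry_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_registry(registry_path: Path) -> list[dict]:
    """Parse every non-blank line as a JSON object."""
    with registry_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def count_by_status(entries: list[dict]) -> Counter:
    """Tally entries per pipeline status (e.g. queued, downloaded)."""
    return Counter(entry.get("status", "unknown") for entry in entries)


# Demonstration against a temporary file rather than the real registry.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "registry.jsonl"
    append_entry(path, {"id": "metr-rct-2025", "status": "queued"})
    append_entry(path, {"id": "sleeper-agents-2024", "status": "queued"})
    append_entry(path, {"id": "alignment-faking-2024", "status": "downloaded"})
    demo_entries = read_registry(path)
    demo_counts = count_by_status(demo_entries)
    print(dict(demo_counts))  # {'queued': 2, 'downloaded': 1}
```

Appending is cheap, but updating a field such as `status` on an existing entry requires rewriting the file, which is worth keeping in mind for the inbox-sorter and deep-eval agents that change entries in place.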
diff --git a/inbox/.gitkeep b/inbox/.gitkeep
diff --git a/papers/.gitkeep b/papers/.gitkeep
diff --git a/registry.jsonl b/registry.jsonl
@@ -0,0 +1,16 @@
+{"id":"stanford-developer-productivity-2025","title":"The Effects of Generative AI on Software Developer Productivity","authors":["Stanford University researchers"],"year":2025,"venue":"Stanford/Industry Report","source_url":"https://softwareengineeringproductivity.stanford.edu/","source":"manual","status":"queued","tags":["productivity","observational"],"added":"2026-02-27","notes":"~100,000 developers across 600+ companies. Measured 12-31% speed improvements on greenfield tasks, 0-10% on brownfield. Primary tool was GitHub Copilot autocomplete. Did not distinguish how developers used the AI."}
+{"id":"metr-rct-2025","title":"Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity","authors":["METR researchers"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2507.09089","arxiv_id":"2507.09089","source":"manual","status":"queued","tags":["productivity","rct"],"added":"2026-02-27","notes":"RCT finding experienced OSS developers 19% slower with Cursor Pro + Claude 3.5/3.7 Sonnet. Developers believed they were 20% faster. Standard tooling without custom scaffolding."}
+{"id":"remote-labor-index-2025","title":"The Remote Labor Index: Measuring AI Agent Capabilities on Real-World Freelance Tasks","authors":["Scale AI researchers"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.26787","arxiv_id":"2510.26787","source":"manual","status":"queued","tags":["productivity","benchmarks","agents"],"added":"2026-02-27","notes":"240 real Upwork projects, median $200/11.5hrs human work. Best agent (Manus) completed 2.5% of tasks acceptably. Fully autonomous agents with file system access."}
+{"id":"uc-berkeley-mast-2025","title":"MAST: A Taxonomy for Multi-Agent System Failures","authors":["UC Berkeley researchers"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2503.13657","arxiv_id":"2503.13657","source":"manual","status":"queued","tags":["agents","reliability"],"added":"2026-02-27","notes":"41-86.7% of multi-agent LLM systems fail in production. 79% of failures trace to specification and coordination problems, not implementation bugs."}
+{"id":"typescript-typecheck-failures-2025","title":"An Empirical Study of Type-Check Failures in LLM-Generated TypeScript","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2504.09246","arxiv_id":"2504.09246","source":"manual","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"94% of compilation errors in LLM-generated TypeScript are type-check failures. Enforcing type constraints during generation cuts compilation errors by more than half."}
+{"id":"type-context-pass-rates-2024","title":"Improving LLM Code Generation with Type Context","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2406.03283","arxiv_id":"2406.03283","source":"manual","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"Providing type context from surrounding code improves pass rates by 8-14% on repository-level tasks."}
+{"id":"compiler-feedback-loops-2025","title":"Compiler Feedback Loops Equalize Model Quality","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.02567","arxiv_id":"2512.02567","source":"manual","status":"queued","tags":["code-generation","reliability"],"added":"2026-02-27","notes":"50% performance gap between models shrank to 13% when both had access to strict compiler feedback. Type system matters more than model choice."}
+{"id":"scaling-laws-2020","title":"Scaling Laws for Neural Language Models","authors":["Kaplan, J.","McCandlish, S.","Henighan, T.","Brown, T. B.","Chess, B.","Child, R.","Gray, S.","Radford, A.","Wu, J.","Amodei, D."],"year":2020,"venue":"arXiv","source_url":"https://arxiv.org/abs/2001.08361","arxiv_id":"2001.08361","source":"manual","status":"queued","tags":["scaling","theoretical"],"added":"2026-02-27","notes":"Foundational scaling-laws paper. LLM performance (test loss) scales as a power law with model size, dataset size, and compute."}
+{"id":"emergent-abilities-mirage-2023","title":"Are Emergent Abilities of Large Language Models a Mirage?","authors":["Schaeffer, R.","Miranda, B.","Koyejo, S."],"year":2023,"venue":"NeurIPS 2023","source_url":"https://arxiv.org/abs/2304.15004","arxiv_id":"2304.15004","source":"manual","status":"queued","tags":["scaling","benchmarks","meta-analysis"],"added":"2026-02-27","notes":"NeurIPS 2023 Outstanding Paper. 92% of claimed emergent abilities were metric artifacts. Discontinuous metrics create illusion of sudden emergence; continuous metrics show smooth improvement."}
+{"id":"compounding-reliability-2025","title":"Compounding Reliability in Sequential LLM Reasoning","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.09677","arxiv_id":"2509.09677","source":"manual","status":"queued","tags":["reliability","agents"],"added":"2026-02-27","notes":"Per-step accuracy determines autonomous chain length. At 90% per-step accuracy, the chance of an error-free chain falls below 50% after about 7 steps; at 95%, after about 14 steps. A 5-point accuracy improvement roughly doubles reliable chain length."}
+{"id":"compute-optimal-inference-2024","title":"Compute-Optimal Inference Scaling for LLMs","authors":["Google DeepMind researchers"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2408.03314","arxiv_id":"2408.03314","source":"manual","status":"queued","tags":["scaling","economics"],"added":"2026-02-27","notes":"Smaller model can outperform one 14x its size at same total compute cost with compute-optimal inference scaling."}
+{"id":"scaffolded-model-capability-2023","title":"Scaffolded LLMs as Effective Agents","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2305.05176","arxiv_id":"2305.05176","source":"manual","status":"queued","tags":["agents","scaling"],"added":"2026-02-27","notes":"Well-scaffolded 10B model operates at 200B-500B effective capability. Scaffolded frontier model reaches estimated 5T-10T effective."}
+{"id":"sleeper-agents-2024","title":"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training","authors":["Anthropic researchers"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2401.05566","arxiv_id":"2401.05566","source":"manual","status":"queued","tags":["security","alignment"],"added":"2026-02-27","notes":"Backdoor behaviors persist through standard safety training (SFT, RL, adversarial training). Persistence increased with model scale. Adversarial training created stealthier deception."}
+{"id":"alignment-faking-2024","title":"Alignment Faking in Large Language Models","authors":["Anthropic researchers"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2412.14093","arxiv_id":"2412.14093","source":"manual","status":"queued","tags":["security","alignment"],"added":"2026-02-27","notes":"Claude 3 Opus strategically complied with harmful queries 14% of the time when it believed its responses would be used for training. After RL training, alignment-faking reasoning rose to 78%."}
+{"id":"multi-turn-jailbreak-2025","title":"Multi-Turn Jailbreak Attacks on LLM Agents","authors":["Unknown"],"year":2025,"venue":"ACL 2025 (REALM Workshop)","source_url":"https://aclanthology.org/2025.realm-1.13/","source":"manual","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Multi-turn jailbreak attacks achieved 94.44% success rate on GPT-3.5-Turbo (up from 12.12% baseline). Decomposes harmful requests into innocuous sub-steps."}
+{"id":"survey-code-gen-llm-agents-2025","title":"A Survey on Code Generation with LLM-based Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2508.00083","arxiv_id":"2508.00083","source":"manual","status":"queued","tags":["code-generation","agents","survey"],"added":"2026-02-27","notes":"Comprehensive survey on code generation with LLM-based agents. Covers techniques, benchmarks, and open problems."}
diff --git a/schema/deep-eval.schema.json b/schema/deep-eval.schema.json
@@ -0,0 +1,85 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "deep-eval.schema.json",
+  "title": "Deep Evaluation Result",
+  "description": "Schema for optional deep evaluation of a paper. Produced by the deep-eval agent for papers selected for closer scrutiny.",
+  "type": "object",
+  "required": ["paper_id", "code_runs", "results_reproduce", "benchmark_contamination_check", "additional_findings"],
+  "properties": {
+    "paper_id": {
+      "type": "string",
+      "description": "Registry ID of the paper being evaluated."
+    },
+    "code_runs": {
+      "type": "object",
+      "description": "Whether the released code runs successfully.",
+      "required": ["attempted", "success", "details"],
+      "properties": {
+        "attempted": {
+          "type": "boolean",
+          "description": "Whether code execution was attempted (false if no code released)."
+        },
+        "success": {
+          "type": ["boolean", "null"],
+          "description": "Whether the code ran successfully. Null if not attempted."
+        },
+        "details": {
+          "type": "string",
+          "description": "Description of what happened: environment setup, errors encountered, workarounds needed."
+        }
+      }
+    },
+    "results_reproduce": {
+      "type": "object",
+      "description": "Whether key results from the paper can be reproduced.",
+      "required": ["attempted", "success", "details"],
+      "properties": {
+        "attempted": {
+          "type": "boolean",
+          "description": "Whether reproduction was attempted."
+        },
+        "success": {
+          "type": ["boolean", "null"],
+          "description": "Whether results were reproduced within reasonable tolerance. Null if not attempted."
+        },
+        "details": {
+          "type": "string",
+          "description": "What was attempted, what matched, what diverged, and by how much."
+        }
+      }
+    },
+    "benchmark_contamination_check": {
+      "type": "object",
+      "description": "Check for potential benchmark contamination or data leakage.",
+      "required": ["checked", "concerns"],
+      "properties": {
+        "checked": {
+          "type": "boolean",
+          "description": "Whether contamination was checked."
+        },
+        "concerns": {
+          "type": "string",
+          "description": "Any contamination concerns found, or 'none' if clean."
+        }
+      }
+    },
+    "additional_findings": {
+      "type": "array",
+      "description": "Any other notable findings from deep evaluation.",
+      "items": {
+        "type": "object",
+        "required": ["finding", "detail"],
+        "properties": {
+          "finding": {
+            "type": "string",
+            "description": "Short label for the finding."
+          },
+          "detail": {
+            "type": "string",
+            "description": "Detailed explanation."
+          }
+        }
+      }
+    }
+  }
+}
diff --git a/schema/registry.schema.json b/schema/registry.schema.json
@@ -0,0 +1,74 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "registry.schema.json",
+  "title": "Registry Entry",
+  "description": "Schema for a single line in registry.jsonl. Each line represents one paper in the survey.",
+  "type": "object",
+  "required": ["id", "title", "authors", "year", "source", "status", "added"],
+  "properties": {
+    "id": {
+      "type": "string",
+      "pattern": "^[a-z0-9-]+$",
+      "description": "URL-safe slug identifying this paper (e.g., 'metr-rct-2025')."
+    },
+    "title": {
+      "type": "string",
+      "description": "Full paper title."
+    },
+    "authors": {
+      "type": "array",
+      "items": { "type": "string" },
+      "description": "List of author names."
+    },
+    "year": {
+      "type": "integer",
+      "description": "Publication year."
+    },
+    "venue": {
+      "type": "string",
+      "description": "Publication venue (journal, conference, preprint server)."
+    },
+    "source_url": {
+      "type": "string",
+      "format": "uri",
+      "description": "Primary URL where the paper can be found."
+    },
+    "doi": {
+      "type": "string",
+      "description": "Digital Object Identifier, if available."
+    },
+    "arxiv_id": {
+      "type": "string",
+      "pattern": "^\\d{4}\\.\\d{4,5}$",
+      "description": "arXiv identifier (e.g., '2507.09089')."
+    },
+    "source": {
+      "type": "string",
+      "enum": ["manual", "arxiv", "huggingface", "semantic_scholar", "inbox"],
+      "description": "How this paper was discovered."
+    },
+    "status": {
+      "type": "string",
+      "enum": ["queued", "downloaded", "scanned", "deep_eval", "excluded"],
+      "description": "Current pipeline status."
+    },
+    "tags": {
+      "type": "array",
+      "items": { "type": "string" },
+      "description": "Topic tags for categorization."
+    },
+    "directory": {
+      "type": "string",
+      "description": "Relative path to paper directory under papers/."
+    },
+    "added": {
+      "type": "string",
+      "format": "date",
+      "description": "Date this entry was added (YYYY-MM-DD)."
+    },
+    "notes": {
+      "type": "string",
+      "description": "Free-text notes about this paper."
+    }
+  }
+}
diff --git a/schema/scan.schema.json b/schema/scan.schema.json
@@ -0,0 +1,132 @@
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "scan.schema.json",
+  "title": "Paper Scan Result",
+  "description": "Structured output from the scan agent for a single research paper.",
+  "type": "object",
+  "required": [
+    "paper",
+    "rubric",
+    "claims",
+    "methodology_tags",
+    "key_findings",
+    "red_flags"
+  ],
+  "properties": {
+    "paper": {
+      "type": "object",
+      "description": "Paper metadata.",
+      "required": ["title", "authors", "year"],
+      "properties": {
+        "title": { "type": "string" },
+        "authors": {
+          "type": "array",
+          "items": { "type": "string" }
+        },
+        "year": { "type": "integer" },
+        "venue": { "type": "string" },
+        "arxiv_id": { "type": "string", "pattern": "^\\d{4}\\.\\d{4,5}$" },
+        "doi": { "type": "string" }
+      }
+    },
+    "rubric": {
+      "type": "object",
+      "description": "Scores across six quality dimensions.",
+      "required": [
+        "artifacts_reproducibility",
+        "statistical_rigor",
+        "benchmark_quality",
+        "claim_to_evidence",
+        "setup_transparency",
+        "limitations_discussion"
+      ],
+      "properties": {
+        "artifacts_reproducibility": { "$ref": "#/$defs/rubric_dimension" },
+        "statistical_rigor": { "$ref": "#/$defs/rubric_dimension" },
+        "benchmark_quality": { "$ref": "#/$defs/rubric_dimension" },
+        "claim_to_evidence": { "$ref": "#/$defs/rubric_dimension" },
+        "setup_transparency": { "$ref": "#/$defs/rubric_dimension" },
+        "limitations_discussion": { "$ref": "#/$defs/rubric_dimension" }
+      }
+    },
+    "claims": {
+      "type": "array",
+      "description": "Key claims extracted from the paper with supporting evidence.",
+      "items": {
+        "type": "object",
+        "required": ["claim", "evidence", "supported"],
+        "properties": {
+          "claim": {
+            "type": "string",
+            "description": "The claim as stated or paraphrased from the paper."
+          },
+          "evidence": {
+            "type": "string",
+            "description": "The evidence cited in support, with page/section references."
+          },
+          "supported": {
+            "type": "string",
+            "enum": ["strong", "moderate", "weak", "unsupported"],
+            "description": "How well the evidence supports the claim."
+          }
+        }
+      }
+    },
+    "methodology_tags": {
+      "type": "array",
+      "description": "Methodology type tags assigned by the scan agent.",
+      "items": {
+        "type": "string",
+        "enum": [
+          "rct",
+          "observational",
+          "benchmark-eval",
+          "case-study",
+          "meta-analysis",
+          "theoretical",
+          "qualitative"
+        ]
+      }
+    },
+    "key_findings": {
+      "type": "string",
+      "description": "Brief summary of the paper's key findings (2-4 sentences)."
+    },
+    "red_flags": {
+      "type": "array",
+      "description": "Methodological red flags identified during the scan.",
+      "items": {
+        "type": "object",
+        "required": ["flag", "detail"],
+        "properties": {
+          "flag": {
+            "type": "string",
+            "description": "Short label for the red flag."
+          },
+          "detail": {
+            "type": "string",
+            "description": "Explanation of why this is a concern."
+          }
+        }
+      }
+    }
+  },
+  "$defs": {
+    "rubric_dimension": {
+      "type": "object",
+      "required": ["score", "evidence"],
+      "properties": {
+        "score": {
+          "type": "integer",
+          "minimum": 0,
+          "maximum": 3,
+          "description": "0=absent, 1=weak, 2=adequate, 3=strong"
+        },
+        "evidence": {
+          "type": "string",
+          "description": "Justification for the score, citing specific sections or text from the paper."
+        }
+      }
+    }
+  }
+}
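
The constraints in schema/registry.schema.json can be spot-checked against registry.jsonl with a few lines of standard-library Python. The sketch below is illustrative only and is not part of this commit: the helper names check_entry and check_registry are hypothetical, and it covers a hand-picked subset of the schema (required fields, the id and arxiv_id patterns, the source and status enums, and basic types), not full JSON Schema validation.

```python
import json
import re

# Constraints copied from schema/registry.schema.json in this commit.
REQUIRED = ("id", "title", "authors", "year", "source", "status", "added")
ID_PATTERN = re.compile(r"^[a-z0-9-]+$")
ARXIV_PATTERN = re.compile(r"^\d{4}\.\d{4,5}$")
SOURCES = {"manual", "arxiv", "huggingface", "semantic_scholar", "inbox"}
STATUSES = {"queued", "downloaded", "scanned", "deep_eval", "excluded"}


def check_entry(entry: dict) -> list[str]:
    """Return the problems found in one registry entry (empty list if none).

    Hypothetical helper: checks only a subset of the schema's rules.
    """
    problems = [f"missing required field: {name}" for name in REQUIRED if name not in entry]

    entry_id = entry.get("id")
    if entry_id is not None and not (isinstance(entry_id, str) and ID_PATTERN.fullmatch(entry_id)):
        problems.append(f"id is not a lowercase slug: {entry_id!r}")

    arxiv_id = entry.get("arxiv_id")
    if arxiv_id is not None and not (isinstance(arxiv_id, str) and ARXIV_PATTERN.fullmatch(arxiv_id)):
        problems.append(f"arxiv_id does not match NNNN.NNNN(N): {arxiv_id!r}")

    if "source" in entry and entry["source"] not in SOURCES:
        problems.append(f"unknown source: {entry['source']!r}")
    if "status" in entry and entry["status"] not in STATUSES:
        problems.append(f"unknown status: {entry['status']!r}")

    year = entry.get("year")
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        problems.append(f"year is not an integer: {year!r}")

    authors = entry.get("authors")
    if authors is not None and not (
        isinstance(authors, list) and all(isinstance(a, str) for a in authors)
    ):
        problems.append("authors is not a list of strings")

    return problems


def check_registry(lines) -> dict[int, list[str]]:
    """Map 1-based line numbers to problems, skipping blank and clean lines."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(entry, dict):
            report[number] = ["entry is not a JSON object"]
            continue
        problems = check_entry(entry)
        if problems:
            report[number] = problems
    return report


# First line: a trimmed copy of a real entry from this commit.
# Second line: a deliberately broken, made-up entry for illustration.
sample = [
    '{"id":"sleeper-agents-2024","title":"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training",'
    '"authors":["Anthropic researchers"],"year":2024,"arxiv_id":"2401.05566",'
    '"source":"manual","status":"queued","added":"2026-02-27"}',
    '{"id":"Bad ID","title":"Hypothetical entry","authors":[],"year":2025,'
    '"arxiv_id":"arXiv:2401.05566","source":"manual","status":"done"}',
]

for line_number, issues in check_registry(sample).items():
    print(line_number, issues)
```

To check the real file, pass an open file handle instead of the sample list, for example check_registry(open("registry.jsonl", encoding="utf-8")). A full validator would also need to enforce the format keywords (uri, date) that this sketch ignores.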
