ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

commit b4be3d6dbb04eddb183b2a99dad0327adbe38b39
parent 08f6c3db4222ef96a668db4bf3fd6b61e3326b67
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 27 Feb 2026 22:05:52 +0100

Update scan agent prompt and add scan orchestrator

scan-agent.md:
- Added file paths (paper.txt input, scan.json output)
- Added write-immediately rule and registry status update
- Added guidance for scoring survey papers (are they rigorous or
  just laundering weak results?)
- Added handling for theoretical/position papers
- Added schema validation requirements
- Added note that important papers can still score poorly
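
The validation requirements added to the prompt can also be checked mechanically. The following is a minimal sketch, assuming the rules are exactly those listed in the prompt's Validation section; the real `schema/scan.schema.json` is not shown in this commit and may be stricter. The names `validate_scan` and `require_keys` are hypothetical and do not appear in the repository, and the allowed `methodology_tags` values are not listed here, so that check is omitted.

```python
# Rubric dimension names as they appear in scripts/run-scan.py.
RUBRIC_DIMS = [
    "artifacts_reproducibility", "statistical_rigor", "benchmark_quality",
    "claim_to_evidence", "setup_transparency", "limitations_discussion",
]
SUPPORT_LEVELS = ("strong", "moderate", "weak", "unsupported")


def require_keys(item, keys, label, problems):
    """Record a problem for each required key that is absent from item."""
    if not isinstance(item, dict):
        problems.append(f"{label}: expected an object")
        return
    for key in keys:
        if key not in item:
            problems.append(f"{label}: missing '{key}'")


def validate_scan(scan):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    rubric = scan.get("rubric", {})
    for dim in RUBRIC_DIMS:
        entry = rubric.get(dim)
        if not isinstance(entry, dict):
            problems.append(f"rubric.{dim}: missing")
            continue
        score = entry.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
            problems.append(f"rubric.{dim}.score: expected integer 0-3, got {score!r}")
        if not isinstance(entry.get("evidence"), str):
            problems.append(f"rubric.{dim}.evidence: expected a string")
    for i, claim in enumerate(scan.get("claims", [])):
        require_keys(claim, ("claim", "evidence", "supported"), f"claims[{i}]", problems)
        if isinstance(claim, dict) and "supported" in claim:
            if claim["supported"] not in SUPPORT_LEVELS:
                problems.append(f"claims[{i}].supported: {claim['supported']!r} not allowed")
    for i, paper in enumerate(scan.get("cited_papers", [])):
        require_keys(paper, ("title", "relevance"), f"cited_papers[{i}]", problems)
    for i, flag in enumerate(scan.get("red_flags", [])):
        require_keys(flag, ("flag", "detail"), f"red_flags[{i}]", problems)
    return problems
```

Returning a list of problems rather than raising on the first one lets a failure log report everything wrong with a scan in a single pass.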

extract-text.py:
- Added extraction-failures.txt log for papers that fail both
  pymupdf and Sonnet fallback

scripts/run-scan.py (new):
- Orchestrates full pipeline: extract text → scan agent → validate
- Calls claude CLI with opus model for each paper
- Validates scan.json output (required fields, rubric dimensions)
- Updates registry status to 'scanned'
- Supports --parallel N for concurrent scanning
- Writes scan-failures.txt for debugging
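
The script in this commit reads its flags by walking `sys.argv` directly. As a hedged alternative, not what the commit does, the same four documented flags could be declared with the standard library's `argparse`, which rejects non-integer values for `--limit` and `--parallel` with a usage message instead of an uncaught `ValueError`. `build_parser` and the `paper_id` attribute name are hypothetical.

```python
import argparse


def build_parser():
    """Declare the flags documented in the run-scan.py usage text."""
    parser = argparse.ArgumentParser(
        prog="run-scan.py",
        description="Extract text, run the scan agent, validate output.",
    )
    parser.add_argument("--id", dest="paper_id", default=None,
                        help="scan only the paper with this registry id")
    parser.add_argument("--limit", type=int, default=None,
                        help="scan at most N papers")
    parser.add_argument("--dry-run", action="store_true",
                        help="list what would be scanned and exit")
    parser.add_argument("--parallel", type=int, default=1,
                        help="number of scans to run concurrently")
    return parser


if __name__ == "__main__":
    # Parse an explicit list here so the example does not depend on sys.argv.
    args = build_parser().parse_args(["--limit", "10", "--parallel", "4"])
    print(args.limit, args.parallel, args.dry_run, args.paper_id)
```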

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Diffstat:
M agents/scan-agent.md    | 43 +++++++++++++++++++++++++++++++++++++++----
M scripts/extract-text.py | 17 +++++++++++++++++
A scripts/run-scan.py     | 279 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 335 insertions(+), 4 deletions(-)

diff --git a/agents/scan-agent.md b/agents/scan-agent.md
@@ -1,16 +1,22 @@
 # Scan Agent
 
+**Model: Opus** (requires judgment on methodology quality)
+
 You are a research paper scan agent. Your job is to read a research paper and produce a structured assessment of its methodological quality.
 
 ## Input
 
 You will be given:
-- The text content of a research paper (PDF already converted to text)
+- The paper text at `papers/<slug>/paper.txt` (already extracted from PDF)
 - The paper's registry entry from `registry.jsonl`
 
 ## Output
 
-Produce a JSON file conforming to `schema/scan.schema.json` and save it as `scan.json` in the paper's directory under `papers/`.
+Write a JSON file conforming to `schema/scan.schema.json` and save it as `papers/<slug>/scan.json`.
+
+**Write the scan.json file immediately when complete.** Do not hold results in memory across multiple papers.
+
+After writing scan.json, update the paper's status in `registry.jsonl` to `"scanned"`.
 
 ## Instructions
 
@@ -72,9 +78,9 @@ Scan the paper's references for other papers that fall within the survey scope (
 - **doi**: If available
 - **relevance**: One sentence on why this paper belongs in the survey
 
-Do NOT include every reference. Only include papers that meet the survey's inclusion criteria from `context/methodology.md`. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
+Do NOT include every reference. Only include papers that make empirical claims about AI/LLM capability, productivity, safety, or code generation. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
 
-These cited papers feed a citation-chasing pipeline: the `scripts/harvest-citations.py` script reads them from all scan.json files and proposes new registry entries for papers we haven't seen yet.
+These cited papers feed a citation-chasing pipeline: `scripts/harvest-citations.py` reads them from all scan.json files and proposes new registry entries.
 
 ### 7. Flag Red Flags
 
@@ -89,6 +95,34 @@ Note any methodological concerns, including but not limited to:
 
 If there are no red flags, return an empty array.
 
+## Handling Different Paper Types
+
+### Empirical papers (most common)
+Score all six dimensions normally. These are the core of the survey.
+
+### Survey / review papers
+These are in scope. Score them, but calibrate appropriately:
+- **Artifacts & Reproducibility**: Did they document search strategy, inclusion criteria, and extraction process? A rigorous survey is reproducible.
+- **Statistical Rigor**: Did they do quantitative synthesis, or just narrative summary? A survey that just lists papers without structured quality assessment scores low.
+- **Benchmark Quality**: N/A for most surveys — score based on how well they evaluated the benchmarks they discuss.
+- **Claim-to-Evidence Ratio**: Do the survey's conclusions follow from the papers reviewed, or do they overgeneralize?
+- **Setup Transparency**: Is the review methodology clear? Search terms, databases, date ranges, screening process?
+- **Limitations Discussion**: Does it acknowledge selection bias, publication bias, scope limitations?
+
+A survey that just collects and summarizes without quality assessment is laundering the signal-to-noise ratio of its sources. Score accordingly.
+
+### Theoretical / position papers
+Score what applies. Statistical rigor may be N/A. Claim-to-evidence ratio still applies — are the theoretical claims well-argued?
+
+## Validation
+
+Your output must be valid JSON conforming to `schema/scan.schema.json`. Specifically:
+- All six rubric dimensions must have `score` (integer 0-3) and `evidence` (string)
+- Each claim must have `claim`, `evidence`, and `supported` (one of: strong, moderate, weak, unsupported)
+- `methodology_tags` must use only the allowed values
+- `cited_papers` must each have at least `title` and `relevance`
+- `red_flags` must each have `flag` and `detail`
+
 ## Guidelines
 
 - Be fair but rigorous. A low score is not an insult; it is information.
@@ -96,3 +130,4 @@ If there are no red flags, return an empty array.
 - If information is genuinely absent (not just hard to find), score it 0. Do not guess.
 - If you are uncertain about a score, err toward the lower score and explain your uncertainty in the evidence field.
 - Do not hallucinate content that is not in the paper.
+- A paper can be important and influential while still scoring poorly on methodology. Score what's there, not what you think should be there.

diff --git a/scripts/extract-text.py b/scripts/extract-text.py
@@ -174,6 +174,23 @@ def main():
     print(f"\nDone. Extracted: {extracted} (pymupdf: {extracted - fallback}, "
           f"sonnet fallback: {fallback}), Failed: {failed}")
 
+    # Write failure log
+    if failed > 0:
+        failed_entries = [
+            entry for entry in candidates
+            if not (PAPERS_DIR / entry["id"] / "paper.txt").exists()
+        ]
+        failure_path = ROOT / "extraction-failures.txt"
+        with open(failure_path, "w") as f:
+            f.write(f"# Text extraction failures ({len(failed_entries)} total)\n")
+            f.write(f"# pymupdf failed quality check and Sonnet fallback also failed.\n")
+            f.write(f"# These papers need manual text extraction or alternative tools.\n\n")
+            for e in failed_entries:
+                f.write(f"{e['id']}\n")
+                f.write(f"  {e['title']}\n")
+                f.write(f"  papers/{e['id']}/paper.pdf\n\n")
+        print(f"Failure log written to {failure_path}")
+
 
 if __name__ == "__main__":
     main()

diff --git a/scripts/run-scan.py b/scripts/run-scan.py
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""
+Orchestrate the scan pipeline: extract text → run scan agent → validate output.
+
+For each paper with status 'downloaded':
+1. Extract text if paper.txt doesn't exist (calls extract-text.py logic)
+2. Run the scan agent via claude CLI
+3. Validate scan.json against the schema
+4. Update registry status to 'scanned'
+
+Usage:
+    python scripts/run-scan.py                      # All downloaded papers
+    python scripts/run-scan.py --id metr-rct-2025   # Specific paper
+    python scripts/run-scan.py --limit 10           # First N papers
+    python scripts/run-scan.py --dry-run            # Show what would be scanned
+    python scripts/run-scan.py --parallel 4         # Run N scans concurrently
+"""
+
+import json
+import subprocess
+import sys
+import os
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+REGISTRY_PATH = ROOT / "registry.jsonl"
+PAPERS_DIR = ROOT / "papers"
+SCAN_AGENT_PROMPT = ROOT / "agents" / "scan-agent.md"
+SCAN_SCHEMA = ROOT / "schema" / "scan.schema.json"
+
+# Heuristics for text extraction quality (duplicated from extract-text.py)
+MIN_CHARS = 500
+MIN_WORDS_PER_PAGE = 30
+MAX_GARBLE_RATIO = 0.15
+
+
+def load_registry():
+    entries = []
+    with open(REGISTRY_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                entries.append(json.loads(line))
+    return entries
+
+
+def save_registry(entries):
+    with open(REGISTRY_PATH, "w") as f:
+        for entry in entries:
+            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+
+
+def ensure_text(entry):
+    """Extract text if paper.txt doesn't exist. Returns (ok, reason)."""
+    pdf_path = PAPERS_DIR / entry["id"] / "paper.pdf"
+    txt_path = PAPERS_DIR / entry["id"] / "paper.txt"
+
+    if txt_path.exists() and txt_path.stat().st_size > MIN_CHARS:
+        return True, "already extracted"
+
+    if not pdf_path.exists():
+        return False, "no PDF"
+
+    try:
+        import fitz
+        doc = fitz.open(str(pdf_path))
+        pages = [page.get_text() for page in doc]
+        doc.close()
+        text = "\n\n".join(pages)
+
+        # Quality check
+        words = text.split()
+        words_per_page = len(words) / max(len(pages), 1)
+        non_ascii = sum(1 for c in text if ord(c) > 127 and not c.isalpha())
+        garble_ratio = non_ascii / max(len(text), 1)
+
+        if len(text) < MIN_CHARS:
+            return False, f"pymupdf: too short ({len(text)} chars)"
+        if words_per_page < MIN_WORDS_PER_PAGE:
+            return False, f"pymupdf: too few words/page ({words_per_page:.0f})"
+        if garble_ratio > MAX_GARBLE_RATIO:
+            return False, f"pymupdf: garbled ({garble_ratio:.1%} non-ASCII)"
+
+        txt_path.write_text(text, encoding="utf-8")
+        return True, f"extracted {len(text)} chars"
+
+    except Exception as e:
+        return False, f"pymupdf error: {e}"
+
+
+def run_scan_agent(entry):
+    """Run the scan agent on a single paper. Returns (ok, reason)."""
+    txt_path = PAPERS_DIR / entry["id"] / "paper.txt"
+    scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+
+    if scan_path.exists():
+        return True, "already scanned"
+
+    paper_text = txt_path.read_text(encoding="utf-8")
+
+    # Build the prompt
+    registry_json = json.dumps(entry, indent=2, ensure_ascii=False)
+    prompt = f"""You are the scan agent. Read your full instructions at agents/scan-agent.md and the schema at schema/scan.schema.json.
+
+Scan this paper and write the result to papers/{entry['id']}/scan.json.
+
+## Registry Entry
+```json
+{registry_json}
+```
+
+## Paper Text
+{paper_text}
+"""
+
+    try:
+        result = subprocess.run(
+            [
+                "claude", "-p", prompt,
+                "--model", "opus",
+                "--allowedTools", "Read,Write,Edit",
+                "--max-turns", "3",
+            ],
+            capture_output=True, text=True, timeout=600,
+            cwd=str(ROOT),
+        )
+
+        if result.returncode != 0:
+            return False, f"claude exit {result.returncode}: {result.stderr[:200]}"
+
+        # Check if scan.json was created
+        if not scan_path.exists():
+            return False, "scan.json not created"
+
+        # Validate JSON
+        try:
+            with open(scan_path) as f:
+                scan = json.load(f)
+        except json.JSONDecodeError as e:
+            scan_path.unlink()
+            return False, f"invalid JSON: {e}"
+
+        # Basic schema validation (check required fields)
+        required = ["paper", "rubric", "claims", "methodology_tags", "key_findings", "red_flags", "cited_papers"]
+        missing = [r for r in required if r not in scan]
+        if missing:
+            scan_path.unlink()
+            return False, f"missing fields: {missing}"
+
+        # Check rubric dimensions
+        rubric = scan.get("rubric", {})
+        dims = ["artifacts_reproducibility", "statistical_rigor", "benchmark_quality",
+                "claim_to_evidence", "setup_transparency", "limitations_discussion"]
+        missing_dims = [d for d in dims if d not in rubric]
+        if missing_dims:
+            scan_path.unlink()
+            return False, f"missing rubric dimensions: {missing_dims}"
+
+        return True, "scanned"
+
+    except subprocess.TimeoutExpired:
+        return False, "timeout (600s)"
+    except FileNotFoundError:
+        return False, "'claude' CLI not found"
+    except Exception as e:
+        return False, f"error: {e}"
+
+
+def scan_one(entry):
+    """Full pipeline for one paper: extract text → scan → return result."""
+    paper_id = entry["id"]
+
+    # Step 1: ensure text
+    ok, reason = ensure_text(entry)
+    if not ok:
+        return paper_id, False, f"text extraction failed: {reason}"
+
+    # Step 2: run scan
+    ok, reason = run_scan_agent(entry)
+    return paper_id, ok, reason
+
+
+def main():
+    args = sys.argv[1:]
+    dry_run = "--dry-run" in args
+    limit = None
+    specific_id = None
+    parallel = 1
+
+    for i, arg in enumerate(args):
+        if arg == "--limit" and i + 1 < len(args):
+            limit = int(args[i + 1])
+        if arg == "--id" and i + 1 < len(args):
+            specific_id = args[i + 1]
+        if arg == "--parallel" and i + 1 < len(args):
+            parallel = int(args[i + 1])
+
+    entries = load_registry()
+
+    candidates = []
+    for entry in entries:
+        if specific_id and entry["id"] != specific_id:
+            continue
+        if entry["status"] != "downloaded" and not specific_id:
+            continue
+        scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+        if scan_path.exists() and not specific_id:
+            continue
+        candidates.append(entry)
+
+    if limit:
+        candidates = candidates[:limit]
+
+    if not candidates:
+        print("No papers to scan.")
+        return
+
+    print(f"{'Would scan' if dry_run else 'Scanning'} {len(candidates)} paper(s)"
+          f"{f' (parallel={parallel})' if parallel > 1 else ''}:\n")
+
+    if dry_run:
+        for entry in candidates:
+            txt_exists = (PAPERS_DIR / entry["id"] / "paper.txt").exists()
+            print(f"  {entry['id']} {'(text ready)' if txt_exists else '(needs extraction)'}")
+        return
+
+    results = {"scanned": 0, "failed": 0, "skipped": 0}
+    failures = []
+
+    if parallel > 1:
+        with ThreadPoolExecutor(max_workers=parallel) as executor:
+            futures = {executor.submit(scan_one, e): e for e in candidates}
+            for future in as_completed(futures):
+                paper_id, ok, reason = future.result()
+                if ok:
+                    results["scanned"] += 1
+                    print(f"  OK: {paper_id} — {reason}")
+                else:
+                    results["failed"] += 1
+                    failures.append((paper_id, reason))
+                    print(f"  FAIL: {paper_id} — {reason}")
+    else:
+        for i, entry in enumerate(candidates):
+            print(f"[{i+1}/{len(candidates)}] {entry['id']}")
+            paper_id, ok, reason = scan_one(entry)
+            if ok:
+                results["scanned"] += 1
+                print(f"  OK: {reason}")
+            else:
+                results["failed"] += 1
+                failures.append((paper_id, reason))
+                print(f"  FAIL: {reason}")
+
+    # Update registry for successful scans
+    entries = load_registry()  # Reload in case of parallel modifications
+    scanned_ids = set()
+    for entry in entries:
+        scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+        if scan_path.exists() and entry["status"] == "downloaded":
+            entry["status"] = "scanned"
+            scanned_ids.add(entry["id"])
+    save_registry(entries)
+
+    print(f"\nDone. Scanned: {results['scanned']}, Failed: {results['failed']}")
+    if scanned_ids:
+        print(f"Registry updated: {len(scanned_ids)} entries → 'scanned'")
+
+    if failures:
+        failure_path = ROOT / "scan-failures.txt"
+        with open(failure_path, "w") as f:
+            f.write(f"# Scan failures ({len(failures)} total)\n\n")
+            for paper_id, reason in failures:
+                f.write(f"{paper_id}\n  {reason}\n\n")
+        print(f"Failure log: {failure_path}")
+
+
+if __name__ == "__main__":
+    main()
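
The text-quality gate inside `ensure_text` mixes the thresholds with PDF I/O, which makes it awkward to exercise on its own. A minimal sketch of the same three checks as a pure function over already-extracted page strings follows; the thresholds are copied from the diff, while the function name `check_text_quality` is hypothetical and not part of the commit.

```python
# Thresholds as defined in scripts/run-scan.py.
MIN_CHARS = 500
MIN_WORDS_PER_PAGE = 30
MAX_GARBLE_RATIO = 0.15


def check_text_quality(pages):
    """Apply the three extraction heuristics to a list of page strings.

    Returns (ok, reason), mirroring the convention used by ensure_text.
    """
    text = "\n\n".join(pages)
    words_per_page = len(text.split()) / max(len(pages), 1)
    # Only non-ASCII characters that are not letters count as garble,
    # so accented or non-Latin letters do not push the ratio up.
    non_ascii = sum(1 for c in text if ord(c) > 127 and not c.isalpha())
    garble_ratio = non_ascii / max(len(text), 1)

    if len(text) < MIN_CHARS:
        return False, f"too short ({len(text)} chars)"
    if words_per_page < MIN_WORDS_PER_PAGE:
        return False, f"too few words/page ({words_per_page:.0f})"
    if garble_ratio > MAX_GARBLE_RATIO:
        return False, f"garbled ({garble_ratio:.1%} non-ASCII)"
    return True, "ok"
```

Because the checks run in a fixed order, a document that is both short and garbled is reported as too short; the reason string names only the first heuristic that fails.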
