Update scan agent prompt and add scan orchestrator - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

commit b4be3d6dbb04eddb183b2a99dad0327adbe38b39
parent 08f6c3db4222ef96a668db4bf3fd6b61e3326b67
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 27 Feb 2026 22:05:52 +0100

Update scan agent prompt and add scan orchestrator

scan-agent.md:
- Added file paths (paper.txt input, scan.json output)
- Added write-immediately rule and registry status update
- Added guidance for scoring survey papers (are they rigorous or
  just laundering weak results?)
- Added handling for theoretical/position papers
- Added schema validation requirements
- Added note that important papers can still score poorly

extract-text.py:
- Added extraction-failures.txt log for papers that fail both
  pymupdf and Sonnet fallback

scripts/run-scan.py (new):
- Orchestrates full pipeline: extract text → scan agent → validate
- Calls claude CLI with opus model for each paper
- Validates scan.json output (required fields, rubric dimensions)
- Updates registry status to 'scanned'
- Supports --parallel N for concurrent scanning
- Writes scan-failures.txt for debugging

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Diffstat:
M agents/scan-agent.md  | 43 +++++++++++++++++++++++++++++++++++++++----
M scripts/extract-text.py  | 17 +++++++++++++++++
A scripts/run-scan.py  | 279 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

3 files changed, 335 insertions(+), 4 deletions(-)
diff --git a/agents/scan-agent.md b/agents/scan-agent.md
@@ -1,16 +1,22 @@
 # Scan Agent
 
+**Model: Opus** (requires judgment on methodology quality)
+
 You are a research paper scan agent. Your job is to read a research paper and produce a structured assessment of its methodological quality.
 
 ## Input
 
 You will be given:
-- The text content of a research paper (PDF already converted to text)
+- The paper text at `papers/<slug>/paper.txt` (already extracted from PDF)
 - The paper's registry entry from `registry.jsonl`
 
 ## Output
 
-Produce a JSON file conforming to `schema/scan.schema.json` and save it as `scan.json` in the paper's directory under `papers/`.
+Write a JSON file conforming to `schema/scan.schema.json` and save it as `papers/<slug>/scan.json`.
+
+**Write the scan.json file immediately when complete.** Do not hold results in memory across multiple papers.
+
+After writing scan.json, update the paper's status in `registry.jsonl` to `"scanned"`.
 
 ## Instructions
 
@@ -72,9 +78,9 @@ Scan the paper's references for other papers that fall within the survey scope (
 - **doi**: If available
 - **relevance**: One sentence on why this paper belongs in the survey
 
-Do NOT include every reference. Only include papers that meet the survey's inclusion criteria from `context/methodology.md`. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
+Do NOT include every reference. Only include papers that make empirical claims about AI/LLM capability, productivity, safety, or code generation. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
 
-These cited papers feed a citation-chasing pipeline: the `scripts/harvest-citations.py` script reads them from all scan.json files and proposes new registry entries for papers we haven't seen yet.
+These cited papers feed a citation-chasing pipeline: `scripts/harvest-citations.py` reads them from all scan.json files and proposes new registry entries.
 
 ### 7. Flag Red Flags
 
@@ -89,6 +95,34 @@ Note any methodological concerns, including but not limited to:
 
 If there are no red flags, return an empty array.
 
+## Handling Different Paper Types
+
+### Empirical papers (most common)
+Score all six dimensions normally. These are the core of the survey.
+
+### Survey / review papers
+These are in scope. Score them, but calibrate appropriately:
+- **Artifacts & Reproducibility**: Did they document search strategy, inclusion criteria, and extraction process? A rigorous survey is reproducible.
+- **Statistical Rigor**: Did they do quantitative synthesis, or just narrative summary? A survey that just lists papers without structured quality assessment scores low.
+- **Benchmark Quality**: N/A for most surveys — score based on how well they evaluated the benchmarks they discuss.
+- **Claim-to-Evidence Ratio**: Do the survey's conclusions follow from the papers reviewed, or do they overgeneralize?
+- **Setup Transparency**: Is the review methodology clear? Search terms, databases, date ranges, screening process?
+- **Limitations Discussion**: Does it acknowledge selection bias, publication bias, scope limitations?
+
+A survey that just collects and summarizes without quality assessment is laundering the signal-to-noise ratio of its sources. Score accordingly.
+
+### Theoretical / position papers
+Score what applies. Statistical rigor may be N/A. Claim-to-evidence ratio still applies — are the theoretical claims well-argued?
+
+## Validation
+
+Your output must be valid JSON conforming to `schema/scan.schema.json`. Specifically:
+- All six rubric dimensions must have `score` (integer 0-3) and `evidence` (string)
+- Each claim must have `claim`, `evidence`, and `supported` (one of: strong, moderate, weak, unsupported)
+- `methodology_tags` must use only the allowed values
+- `cited_papers` must each have at least `title` and `relevance`
+- `red_flags` must each have `flag` and `detail`
+
 ## Guidelines
 
 - Be fair but rigorous. A low score is not an insult; it is information.
@@ -96,3 +130,4 @@ If there are no red flags, return an empty array.
 - If information is genuinely absent (not just hard to find), score it 0. Do not guess.
 - If you are uncertain about a score, err toward the lower score and explain your uncertainty in the evidence field.
 - Do not hallucinate content that is not in the paper.
+- A paper can be important and influential while still scoring poorly on methodology. Score what's there, not what you think should be there.
diff --git a/scripts/extract-text.py b/scripts/extract-text.py
@@ -174,6 +174,23 @@ def main():
         print(f"\nDone. Extracted: {extracted} (pymupdf: {extracted - fallback}, "
               f"sonnet fallback: {fallback}), Failed: {failed}")
 
+        # Write failure log
+        if failed > 0:
+            failed_entries = [
+                entry for entry in candidates
+                if not (PAPERS_DIR / entry["id"] / "paper.txt").exists()
+            ]
+            failure_path = ROOT / "extraction-failures.txt"
+            with open(failure_path, "w") as f:
+                f.write(f"# Text extraction failures ({len(failed_entries)} total)\n")
+                f.write("# pymupdf failed quality check and Sonnet fallback also failed.\n")
+                f.write("# These papers need manual text extraction or alternative tools.\n\n")
+                for e in failed_entries:
+                    f.write(f"{e['id']}\n")
+                    f.write(f"  {e['title']}\n")
+                    f.write(f"  papers/{e['id']}/paper.pdf\n\n")
+            print(f"Failure log written to {failure_path}")
+
 
 if __name__ == "__main__":
     main()
diff --git a/scripts/run-scan.py b/scripts/run-scan.py
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""
+Orchestrate the scan pipeline: extract text → run scan agent → validate output.
+
+For each paper with status 'downloaded':
+1. Extract text if paper.txt doesn't exist (calls extract-text.py logic)
+2. Run the scan agent via claude CLI
+3. Validate scan.json against the schema
+4. Update registry status to 'scanned'
+
+Usage:
+    python scripts/run-scan.py                    # All downloaded papers
+    python scripts/run-scan.py --id metr-rct-2025 # Specific paper
+    python scripts/run-scan.py --limit 10         # First N papers
+    python scripts/run-scan.py --dry-run          # Show what would be scanned
+    python scripts/run-scan.py --parallel 4       # Run N scans concurrently
+"""
+
+import json
+import subprocess
+import sys
+import os
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+REGISTRY_PATH = ROOT / "registry.jsonl"
+PAPERS_DIR = ROOT / "papers"
+SCAN_AGENT_PROMPT = ROOT / "agents" / "scan-agent.md"
+SCAN_SCHEMA = ROOT / "schema" / "scan.schema.json"
+
+# Heuristics for text extraction quality (duplicated from extract-text.py)
+MIN_CHARS = 500
+MIN_WORDS_PER_PAGE = 30
+MAX_GARBLE_RATIO = 0.15
+
+
+def load_registry():
+    entries = []
+    with open(REGISTRY_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                entries.append(json.loads(line))
+    return entries
+
+
+def save_registry(entries):
+    with open(REGISTRY_PATH, "w") as f:
+        for entry in entries:
+            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+
+
+def ensure_text(entry):
+    """Extract text if paper.txt doesn't exist. Returns (ok, reason)."""
+    pdf_path = PAPERS_DIR / entry["id"] / "paper.pdf"
+    txt_path = PAPERS_DIR / entry["id"] / "paper.txt"
+
+    if txt_path.exists() and txt_path.stat().st_size > MIN_CHARS:
+        return True, "already extracted"
+
+    if not pdf_path.exists():
+        return False, "no PDF"
+
+    try:
+        import fitz
+        doc = fitz.open(str(pdf_path))
+        pages = [page.get_text() for page in doc]
+        doc.close()
+        text = "\n\n".join(pages)
+
+        # Quality check
+        words = text.split()
+        words_per_page = len(words) / max(len(pages), 1)
+        non_ascii = sum(1 for c in text if ord(c) > 127 and not c.isalpha())
+        garble_ratio = non_ascii / max(len(text), 1)
+
+        if len(text) < MIN_CHARS:
+            return False, f"pymupdf: too short ({len(text)} chars)"
+        if words_per_page < MIN_WORDS_PER_PAGE:
+            return False, f"pymupdf: too few words/page ({words_per_page:.0f})"
+        if garble_ratio > MAX_GARBLE_RATIO:
+            return False, f"pymupdf: garbled ({garble_ratio:.1%} non-ASCII)"
+
+        txt_path.write_text(text, encoding="utf-8")
+        return True, f"extracted {len(text)} chars"
+
+    except Exception as e:
+        return False, f"pymupdf error: {e}"
+
+
+def run_scan_agent(entry):
+    """Run the scan agent on a single paper. Returns (ok, reason)."""
+    txt_path = PAPERS_DIR / entry["id"] / "paper.txt"
+    scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+
+    if scan_path.exists():
+        return True, "already scanned"
+
+    paper_text = txt_path.read_text(encoding="utf-8")
+
+    # Build the prompt
+    registry_json = json.dumps(entry, indent=2, ensure_ascii=False)
+    prompt = f"""You are the scan agent. Read your full instructions at agents/scan-agent.md and the schema at schema/scan.schema.json.
+
+Scan this paper and write the result to papers/{entry['id']}/scan.json.
+
+## Registry Entry
+```json
+{registry_json}
+```
+
+## Paper Text
+{paper_text}
+"""
+
+    try:
+        result = subprocess.run(
+            [
+                "claude", "-p", prompt,
+                "--model", "opus",
+                "--allowedTools", "Read,Write,Edit",
+                "--max-turns", "3",
+            ],
+            capture_output=True, text=True, timeout=600,
+            cwd=str(ROOT),
+        )
+
+        if result.returncode != 0:
+            return False, f"claude exit {result.returncode}: {result.stderr[:200]}"
+
+        # Check if scan.json was created
+        if not scan_path.exists():
+            return False, "scan.json not created"
+
+        # Validate JSON
+        try:
+            with open(scan_path) as f:
+                scan = json.load(f)
+        except json.JSONDecodeError as e:
+            scan_path.unlink()
+            return False, f"invalid JSON: {e}"
+
+        # Basic schema validation (check required fields)
+        required = ["paper", "rubric", "claims", "methodology_tags", "key_findings", "red_flags", "cited_papers"]
+        missing = [r for r in required if r not in scan]
+        if missing:
+            scan_path.unlink()
+            return False, f"missing fields: {missing}"
+
+        # Check rubric dimensions
+        rubric = scan.get("rubric", {})
+        dims = ["artifacts_reproducibility", "statistical_rigor", "benchmark_quality",
+                "claim_to_evidence", "setup_transparency", "limitations_discussion"]
+        missing_dims = [d for d in dims if d not in rubric]
+        if missing_dims:
+            scan_path.unlink()
+            return False, f"missing rubric dimensions: {missing_dims}"
+
+        return True, "scanned"
+
+    except subprocess.TimeoutExpired:
+        return False, "timeout (600s)"
+    except FileNotFoundError:
+        return False, "'claude' CLI not found"
+    except Exception as e:
+        return False, f"error: {e}"
+
+
+def scan_one(entry):
+    """Full pipeline for one paper: extract text → scan → return result."""
+    paper_id = entry["id"]
+
+    # Step 1: ensure text
+    ok, reason = ensure_text(entry)
+    if not ok:
+        return paper_id, False, f"text extraction failed: {reason}"
+
+    # Step 2: run scan
+    ok, reason = run_scan_agent(entry)
+    return paper_id, ok, reason
+
+
+def main():
+    args = sys.argv[1:]
+    dry_run = "--dry-run" in args
+    limit = None
+    specific_id = None
+    parallel = 1
+
+    for i, arg in enumerate(args):
+        if arg == "--limit" and i + 1 < len(args):
+            limit = int(args[i + 1])
+        if arg == "--id" and i + 1 < len(args):
+            specific_id = args[i + 1]
+        if arg == "--parallel" and i + 1 < len(args):
+            parallel = int(args[i + 1])
+
+    entries = load_registry()
+
+    candidates = []
+    for entry in entries:
+        if specific_id and entry["id"] != specific_id:
+            continue
+        if entry["status"] != "downloaded" and not specific_id:
+            continue
+        scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+        if scan_path.exists() and not specific_id:
+            continue
+        candidates.append(entry)
+
+    if limit:
+        candidates = candidates[:limit]
+
+    if not candidates:
+        print("No papers to scan.")
+        return
+
+    print(f"{'Would scan' if dry_run else 'Scanning'} {len(candidates)} paper(s)"
+          f"{f' (parallel={parallel})' if parallel > 1 else ''}:\n")
+
+    if dry_run:
+        for entry in candidates:
+            txt_exists = (PAPERS_DIR / entry["id"] / "paper.txt").exists()
+            print(f"  {entry['id']} {'(text ready)' if txt_exists else '(needs extraction)'}")
+        return
+
+    results = {"scanned": 0, "failed": 0, "skipped": 0}
+    failures = []
+
+    if parallel > 1:
+        with ThreadPoolExecutor(max_workers=parallel) as executor:
+            futures = {executor.submit(scan_one, e): e for e in candidates}
+            for future in as_completed(futures):
+                paper_id, ok, reason = future.result()
+                if ok:
+                    results["scanned"] += 1
+                    print(f"  OK: {paper_id} — {reason}")
+                else:
+                    results["failed"] += 1
+                    failures.append((paper_id, reason))
+                    print(f"  FAIL: {paper_id} — {reason}")
+    else:
+        for i, entry in enumerate(candidates):
+            print(f"[{i+1}/{len(candidates)}] {entry['id']}")
+            paper_id, ok, reason = scan_one(entry)
+            if ok:
+                results["scanned"] += 1
+                print(f"  OK: {reason}")
+            else:
+                results["failed"] += 1
+                failures.append((paper_id, reason))
+                print(f"  FAIL: {reason}")
+
+    # Update registry for successful scans
+    entries = load_registry()  # Reload in case of parallel modifications
+    scanned_ids = set()
+    for entry in entries:
+        scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+        if scan_path.exists() and entry["status"] == "downloaded":
+            entry["status"] = "scanned"
+            scanned_ids.add(entry["id"])
+    save_registry(entries)
+
+    print(f"\nDone. Scanned: {results['scanned']}, Failed: {results['failed']}")
+    if scanned_ids:
+        print(f"Registry updated: {len(scanned_ids)} entries → 'scanned'")
+
+    if failures:
+        failure_path = ROOT / "scan-failures.txt"
+        with open(failure_path, "w") as f:
+            f.write(f"# Scan failures ({len(failures)} total)\n\n")
+            for paper_id, reason in failures:
+                f.write(f"{paper_id}\n  {reason}\n\n")
+        print(f"Failure log: {failure_path}")
+
+
+if __name__ == "__main__":
+    main()
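
Review note: the Validation section added to agents/scan-agent.md asks for more than run_scan_agent() checks. The orchestrator only confirms that the top-level fields and the six rubric dimension keys exist. Below is a minimal sketch of a stricter check, assuming each rubric dimension is an object with `score` and `evidence`, and that `claims`, `cited_papers`, and `red_flags` are arrays of objects, as the prompt describes. The names `validate_scan` and `_require` are hypothetical and not part of this commit. The real layout lives in schema/scan.schema.json, which this diff does not show, so the field shapes here are assumptions. The allowed `methodology_tags` values are not listed in the diff and are therefore not checked.

```python
# Hypothetical helper, not part of the commit. Field shapes are taken from the
# Validation section of agents/scan-agent.md, not from scan.schema.json itself.
RUBRIC_DIMS = (
    "artifacts_reproducibility", "statistical_rigor", "benchmark_quality",
    "claim_to_evidence", "setup_transparency", "limitations_discussion",
)
SUPPORT_LEVELS = ("strong", "moderate", "weak", "unsupported")

# Required keys for each array-valued section named in the prompt.
LIST_RULES = {
    "claims": ("claim", "evidence", "supported"),
    "cited_papers": ("title", "relevance"),
    "red_flags": ("flag", "detail"),
}


def _require(problems, label, item, keys):
    """Record each key missing from item. Return False if item is not a dict."""
    if not isinstance(item, dict):
        problems.append(f"{label}: expected an object")
        return False
    for key in keys:
        if key not in item:
            problems.append(f"{label}: missing '{key}'")
    return True


def validate_scan(scan):
    """Return a list of problems in a parsed scan.json; an empty list means it passed."""
    problems = []

    rubric = scan.get("rubric")
    if isinstance(rubric, dict):
        for dim in RUBRIC_DIMS:
            cell = rubric.get(dim)
            if not _require(problems, f"rubric.{dim}", cell, ("score", "evidence")):
                continue
            if "score" in cell:
                score = cell["score"]
                # bool is a subclass of int in Python, so reject it explicitly.
                if (isinstance(score, bool) or not isinstance(score, int)
                        or not 0 <= score <= 3):
                    problems.append(
                        f"rubric.{dim}.score: expected integer 0-3, got {score!r}")
            if "evidence" in cell and not isinstance(cell["evidence"], str):
                problems.append(f"rubric.{dim}.evidence: expected a string")
    else:
        problems.append("rubric: expected an object")

    for field, keys in LIST_RULES.items():
        items = scan.get(field)
        if not isinstance(items, list):
            problems.append(f"{field}: expected an array")
            continue
        for i, item in enumerate(items):
            if not _require(problems, f"{field}[{i}]", item, keys):
                continue
            if field == "claims" and "supported" in item:
                if item["supported"] not in SUPPORT_LEVELS:
                    problems.append(
                        f"claims[{i}].supported: unexpected value {item['supported']!r}")

    return problems
```

If something like this were adopted, run_scan_agent() could treat a non-empty result the same way it treats missing fields today: delete scan.json and return the joined problems as the failure reason, so they end up in scan-failures.txt.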
