commit b4be3d6dbb04eddb183b2a99dad0327adbe38b39
parent 08f6c3db4222ef96a668db4bf3fd6b61e3326b67
Author: Brian Graham <brian@buildingbetterteams.de>
Date: Fri, 27 Feb 2026 22:05:52 +0100
Update scan agent prompt and add scan orchestrator
scan-agent.md:
- Added file paths (paper.txt input, scan.json output)
- Added write-immediately rule and registry status update
- Added guidance for scoring survey papers (are they rigorous or
just laundering weak results?)
- Added handling for theoretical/position papers
- Added schema validation requirements
- Added note that important papers can still score poorly
extract-text.py:
- Added extraction-failures.txt log for papers that fail both
pymupdf and Sonnet fallback
scripts/run-scan.py (new):
- Orchestrates full pipeline: extract text → scan agent → validate
- Calls claude CLI with opus model for each paper
- Validates scan.json output (required fields, rubric dimensions)
- Updates registry status to 'scanned'
- Supports --parallel N for concurrent scanning
- Writes scan-failures.txt for debugging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Diffstat:
3 files changed, 335 insertions(+), 4 deletions(-)
diff --git a/agents/scan-agent.md b/agents/scan-agent.md
@@ -1,16 +1,22 @@
# Scan Agent
+**Model: Opus** (requires judgment on methodology quality)
+
You are a research paper scan agent. Your job is to read a research paper and produce a structured assessment of its methodological quality.
## Input
You will be given:
-- The text content of a research paper (PDF already converted to text)
+- The paper text at `papers/<slug>/paper.txt` (already extracted from PDF)
- The paper's registry entry from `registry.jsonl`
## Output
-Produce a JSON file conforming to `schema/scan.schema.json` and save it as `scan.json` in the paper's directory under `papers/`.
+Write a JSON file conforming to `schema/scan.schema.json` and save it as `papers/<slug>/scan.json`.
+
+**Write the scan.json file immediately when complete.** Do not hold results in memory across multiple papers.
+
+After writing scan.json, update the paper's status in `registry.jsonl` to `"scanned"`.
## Instructions
@@ -72,9 +78,9 @@ Scan the paper's references for other papers that fall within the survey scope (
- **doi**: If available
- **relevance**: One sentence on why this paper belongs in the survey
-Do NOT include every reference. Only include papers that meet the survey's inclusion criteria from `context/methodology.md`. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
+Do NOT include every reference. Only include papers that make empirical claims about AI/LLM capability, productivity, safety, or code generation. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
-These cited papers feed a citation-chasing pipeline: the `scripts/harvest-citations.py` script reads them from all scan.json files and proposes new registry entries for papers we haven't seen yet.
+These cited papers feed a citation-chasing pipeline: `scripts/harvest-citations.py` reads them from all scan.json files and proposes new registry entries.
### 7. Flag Red Flags
@@ -89,6 +95,34 @@ Note any methodological concerns, including but not limited to:
If there are no red flags, return an empty array.
+## Handling Different Paper Types
+
+### Empirical papers (most common)
+Score all six dimensions normally. These are the core of the survey.
+
+### Survey / review papers
+These are in scope. Score them, but calibrate appropriately:
+- **Artifacts & Reproducibility**: Did they document search strategy, inclusion criteria, and extraction process? A rigorous survey is reproducible.
+- **Statistical Rigor**: Did they do quantitative synthesis, or just narrative summary? A survey that just lists papers without structured quality assessment scores low.
+- **Benchmark Quality**: N/A for most surveys — score based on how well they evaluated the benchmarks they discuss.
+- **Claim-to-Evidence Ratio**: Do the survey's conclusions follow from the papers reviewed, or do they overgeneralize?
+- **Setup Transparency**: Is the review methodology clear? Search terms, databases, date ranges, screening process?
+- **Limitations Discussion**: Does it acknowledge selection bias, publication bias, scope limitations?
+
+A survey that just collects and summarizes without quality assessment is laundering the signal-to-noise ratio of its sources. Score accordingly.
+
+### Theoretical / position papers
+Score what applies. Statistical rigor may be N/A. Claim-to-evidence ratio still applies — are the theoretical claims well-argued?
+
+## Validation
+
+Your output must be valid JSON conforming to `schema/scan.schema.json`. Specifically:
+- All six rubric dimensions must have `score` (integer 0-3) and `evidence` (string)
+- Each claim must have `claim`, `evidence`, and `supported` (one of: strong, moderate, weak, unsupported)
+- `methodology_tags` must use only the allowed values
+- `cited_papers` must each have at least `title` and `relevance`
+- `red_flags` must each have `flag` and `detail`
+
## Guidelines
- Be fair but rigorous. A low score is not an insult; it is information.
@@ -96,3 +130,4 @@ If there are no red flags, return an empty array.
- If information is genuinely absent (not just hard to find), score it 0. Do not guess.
- If you are uncertain about a score, err toward the lower score and explain your uncertainty in the evidence field.
- Do not hallucinate content that is not in the paper.
+- A paper can be important and influential while still scoring poorly on methodology. Score what's there, not what you think should be there.
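The Validation rules added to scan-agent.md above are specific enough to check mechanically. A minimal sketch follows, assuming the rubric keys match the `dims` list used in scripts/run-scan.py. The names `validate_scan`, `_items`, and `_missing_keys` are hypothetical and not part of this commit. The `methodology_tags` check is left out because the allowed values live in schema/scan.schema.json, which this diff does not show.

```python
RUBRIC_DIMENSIONS = (
    "artifacts_reproducibility", "statistical_rigor", "benchmark_quality",
    "claim_to_evidence", "setup_transparency", "limitations_discussion",
)
SUPPORT_LEVELS = ("strong", "moderate", "weak", "unsupported")


def _items(scan, key):
    """Return scan[key] if it is a list, else an empty list."""
    value = scan.get(key)
    return value if isinstance(value, list) else []


def _missing_keys(item, keys):
    """Return the keys absent from item (all of them if item is not a dict)."""
    if not isinstance(item, dict):
        return list(keys)
    return [k for k in keys if k not in item]


def validate_scan(scan):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    rubric = scan.get("rubric")
    if not isinstance(rubric, dict):
        rubric = {}
    for dim in RUBRIC_DIMENSIONS:
        cell = rubric.get(dim)
        if not isinstance(cell, dict):
            problems.append(f"rubric.{dim}: missing")
            continue
        score = cell.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
            problems.append(f"rubric.{dim}.score: expected integer 0-3, got {score!r}")
        if not isinstance(cell.get("evidence"), str):
            problems.append(f"rubric.{dim}.evidence: expected a string")

    for i, claim in enumerate(_items(scan, "claims")):
        for key in _missing_keys(claim, ("claim", "evidence", "supported")):
            problems.append(f"claims[{i}]: missing {key}")
        if (isinstance(claim, dict) and "supported" in claim
                and claim["supported"] not in SUPPORT_LEVELS):
            problems.append(f"claims[{i}].supported: {claim['supported']!r} not allowed")

    for i, paper in enumerate(_items(scan, "cited_papers")):
        for key in _missing_keys(paper, ("title", "relevance")):
            problems.append(f"cited_papers[{i}]: missing {key}")

    for i, flag in enumerate(_items(scan, "red_flags")):
        for key in _missing_keys(flag, ("flag", "detail")):
            problems.append(f"red_flags[{i}]: missing {key}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a caller log every defect in a single pass, which may help when a scan has to be re-run.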
diff --git a/scripts/extract-text.py b/scripts/extract-text.py
@@ -174,6 +174,23 @@ def main():
print(f"\nDone. Extracted: {extracted} (pymupdf: {extracted - fallback}, "
f"sonnet fallback: {fallback}), Failed: {failed}")
+    # Write failure log
+    if failed > 0:
+        failed_entries = [
+            entry for entry in candidates
+            if not (PAPERS_DIR / entry["id"] / "paper.txt").exists()
+        ]
+        failure_path = ROOT / "extraction-failures.txt"
+        with open(failure_path, "w") as f:
+            f.write(f"# Text extraction failures ({len(failed_entries)} total)\n")
+            f.write(f"# pymupdf failed quality check and Sonnet fallback also failed.\n")
+            f.write(f"# These papers need manual text extraction or alternative tools.\n\n")
+            for e in failed_entries:
+                f.write(f"{e['id']}\n")
+                f.write(f" {e['title']}\n")
+                f.write(f" papers/{e['id']}/paper.pdf\n\n")
+        print(f"Failure log written to {failure_path}")
+
if __name__ == "__main__":
main()
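The failure log written above uses a simple layout: `#` comment lines, then one blank-line-separated record per paper with the id on the first line and indented details below it. A minimal sketch of reading that layout back follows. `parse_failure_log` is a hypothetical helper, not part of this commit, and it assumes ids never start with `#`.

```python
def parse_failure_log(text):
    """Parse the failure-log layout into a list of {'id', 'details'} dicts."""
    records = []
    for block in text.split("\n\n"):
        # Drop blank lines and the '#' header comments at the top of the file.
        lines = [ln for ln in block.splitlines()
                 if ln.strip() and not ln.startswith("#")]
        if not lines:
            continue
        records.append({
            "id": lines[0].strip(),
            "details": [ln.strip() for ln in lines[1:]],
        })
    return records
```

The recovered ids could then be retried one at a time with `python scripts/run-scan.py --id <id>`. The same parser should also read scan-failures.txt, which follows the same id-then-indented-detail layout.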
diff --git a/scripts/run-scan.py b/scripts/run-scan.py
@@ -0,0 +1,279 @@
+#!/usr/bin/env python3
+"""
+Orchestrate the scan pipeline: extract text → run scan agent → validate output.
+
+For each paper with status 'downloaded':
+1. Extract text if paper.txt doesn't exist (calls extract-text.py logic)
+2. Run the scan agent via claude CLI
+3. Validate scan.json against the schema
+4. Update registry status to 'scanned'
+
+Usage:
+    python scripts/run-scan.py                      # All downloaded papers
+    python scripts/run-scan.py --id metr-rct-2025   # Specific paper
+    python scripts/run-scan.py --limit 10           # First N papers
+    python scripts/run-scan.py --dry-run            # Show what would be scanned
+    python scripts/run-scan.py --parallel 4         # Run N scans concurrently
+"""
+
+import json
+import subprocess
+import sys
+import os
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+REGISTRY_PATH = ROOT / "registry.jsonl"
+PAPERS_DIR = ROOT / "papers"
+SCAN_AGENT_PROMPT = ROOT / "agents" / "scan-agent.md"
+SCAN_SCHEMA = ROOT / "schema" / "scan.schema.json"
+
+# Heuristics for text extraction quality (duplicated from extract-text.py)
+MIN_CHARS = 500
+MIN_WORDS_PER_PAGE = 30
+MAX_GARBLE_RATIO = 0.15
+
+
+def load_registry():
+    entries = []
+    with open(REGISTRY_PATH, encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                entries.append(json.loads(line))
+    return entries
+
+
+def save_registry(entries):
+    with open(REGISTRY_PATH, "w", encoding="utf-8") as f:
+        for entry in entries:
+            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
+
+
+def ensure_text(entry):
+    """Extract text if paper.txt doesn't exist. Returns (ok, reason)."""
+    pdf_path = PAPERS_DIR / entry["id"] / "paper.pdf"
+    txt_path = PAPERS_DIR / entry["id"] / "paper.txt"
+
+    if txt_path.exists() and txt_path.stat().st_size > MIN_CHARS:
+        return True, "already extracted"
+
+    if not pdf_path.exists():
+        return False, "no PDF"
+
+    try:
+        import fitz
+        doc = fitz.open(str(pdf_path))
+        pages = [page.get_text() for page in doc]
+        doc.close()
+        text = "\n\n".join(pages)
+
+        # Quality check
+        words = text.split()
+        words_per_page = len(words) / max(len(pages), 1)
+        non_ascii = sum(1 for c in text if ord(c) > 127 and not c.isalpha())
+        garble_ratio = non_ascii / max(len(text), 1)
+
+        if len(text) < MIN_CHARS:
+            return False, f"pymupdf: too short ({len(text)} chars)"
+        if words_per_page < MIN_WORDS_PER_PAGE:
+            return False, f"pymupdf: too few words/page ({words_per_page:.0f})"
+        if garble_ratio > MAX_GARBLE_RATIO:
+            return False, f"pymupdf: garbled ({garble_ratio:.1%} non-ASCII)"
+
+        txt_path.write_text(text, encoding="utf-8")
+        return True, f"extracted {len(text)} chars"
+
+    except Exception as e:
+        return False, f"pymupdf error: {e}"
+
+
+def run_scan_agent(entry):
+    """Run the scan agent on a single paper. Returns (ok, reason)."""
+    txt_path = PAPERS_DIR / entry["id"] / "paper.txt"
+    scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+
+    if scan_path.exists():
+        return True, "already scanned"
+
+    paper_text = txt_path.read_text(encoding="utf-8")
+
+    # Build the prompt
+    registry_json = json.dumps(entry, indent=2, ensure_ascii=False)
+    prompt = f"""You are the scan agent. Read your full instructions at agents/scan-agent.md and the schema at schema/scan.schema.json.
+
+Scan this paper and write the result to papers/{entry['id']}/scan.json.
+
+## Registry Entry
+```json
+{registry_json}
+```
+
+## Paper Text
+{paper_text}
+"""
+
+    try:
+        result = subprocess.run(
+            [
+                "claude", "-p", prompt,
+                "--model", "opus",
+                "--allowedTools", "Read,Write,Edit",
+                "--max-turns", "3",
+            ],
+            capture_output=True, text=True, timeout=600,
+            cwd=str(ROOT),
+        )
+
+        if result.returncode != 0:
+            return False, f"claude exit {result.returncode}: {result.stderr[:200]}"
+
+        # Check if scan.json was created
+        if not scan_path.exists():
+            return False, "scan.json not created"
+
+        # Validate JSON
+        try:
+            with open(scan_path) as f:
+                scan = json.load(f)
+        except json.JSONDecodeError as e:
+            scan_path.unlink()
+            return False, f"invalid JSON: {e}"
+
+        # Basic schema validation (check required fields)
+        required = ["paper", "rubric", "claims", "methodology_tags", "key_findings", "red_flags", "cited_papers"]
+        missing = [r for r in required if r not in scan]
+        if missing:
+            scan_path.unlink()
+            return False, f"missing fields: {missing}"
+
+        # Check rubric dimensions
+        rubric = scan.get("rubric", {})
+        dims = ["artifacts_reproducibility", "statistical_rigor", "benchmark_quality",
+                "claim_to_evidence", "setup_transparency", "limitations_discussion"]
+        missing_dims = [d for d in dims if d not in rubric]
+        if missing_dims:
+            scan_path.unlink()
+            return False, f"missing rubric dimensions: {missing_dims}"
+
+        return True, "scanned"
+
+    except subprocess.TimeoutExpired:
+        return False, "timeout (600s)"
+    except FileNotFoundError:
+        return False, "'claude' CLI not found"
+    except Exception as e:
+        return False, f"error: {e}"
+
+
+def scan_one(entry):
+    """Full pipeline for one paper: extract text → scan → return result."""
+    paper_id = entry["id"]
+
+    # Step 1: ensure text
+    ok, reason = ensure_text(entry)
+    if not ok:
+        return paper_id, False, f"text extraction failed: {reason}"
+
+    # Step 2: run scan
+    ok, reason = run_scan_agent(entry)
+    return paper_id, ok, reason
+
+
+def main():
+    args = sys.argv[1:]
+    dry_run = "--dry-run" in args
+    limit = None
+    specific_id = None
+    parallel = 1
+
+    for i, arg in enumerate(args):
+        if arg == "--limit" and i + 1 < len(args):
+            limit = int(args[i + 1])
+        if arg == "--id" and i + 1 < len(args):
+            specific_id = args[i + 1]
+        if arg == "--parallel" and i + 1 < len(args):
+            parallel = int(args[i + 1])
+
+    entries = load_registry()
+
+    candidates = []
+    for entry in entries:
+        if specific_id and entry["id"] != specific_id:
+            continue
+        if entry["status"] != "downloaded" and not specific_id:
+            continue
+        scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+        if scan_path.exists() and not specific_id:
+            continue
+        candidates.append(entry)
+
+    if limit:
+        candidates = candidates[:limit]
+
+    if not candidates:
+        print("No papers to scan.")
+        return
+
+    print(f"{'Would scan' if dry_run else 'Scanning'} {len(candidates)} paper(s)"
+          f"{f' (parallel={parallel})' if parallel > 1 else ''}:\n")
+
+    if dry_run:
+        for entry in candidates:
+            txt_exists = (PAPERS_DIR / entry["id"] / "paper.txt").exists()
+            print(f" {entry['id']} {'(text ready)' if txt_exists else '(needs extraction)'}")
+        return
+
+    results = {"scanned": 0, "failed": 0, "skipped": 0}
+    failures = []
+
+    if parallel > 1:
+        with ThreadPoolExecutor(max_workers=parallel) as executor:
+            futures = {executor.submit(scan_one, e): e for e in candidates}
+            for future in as_completed(futures):
+                paper_id, ok, reason = future.result()
+                if ok:
+                    results["scanned"] += 1
+                    print(f" OK: {paper_id} — {reason}")
+                else:
+                    results["failed"] += 1
+                    failures.append((paper_id, reason))
+                    print(f" FAIL: {paper_id} — {reason}")
+    else:
+        for i, entry in enumerate(candidates):
+            print(f"[{i+1}/{len(candidates)}] {entry['id']}")
+            paper_id, ok, reason = scan_one(entry)
+            if ok:
+                results["scanned"] += 1
+                print(f" OK: {reason}")
+            else:
+                results["failed"] += 1
+                failures.append((paper_id, reason))
+                print(f" FAIL: {reason}")
+
+    # Update registry for successful scans
+    entries = load_registry()  # Reload in case of parallel modifications
+    scanned_ids = set()
+    for entry in entries:
+        scan_path = PAPERS_DIR / entry["id"] / "scan.json"
+        if scan_path.exists() and entry["status"] == "downloaded":
+            entry["status"] = "scanned"
+            scanned_ids.add(entry["id"])
+    save_registry(entries)
+
+    print(f"\nDone. Scanned: {results['scanned']}, Failed: {results['failed']}")
+    if scanned_ids:
+        print(f"Registry updated: {len(scanned_ids)} entries → 'scanned'")
+
+    if failures:
+        failure_path = ROOT / "scan-failures.txt"
+        with open(failure_path, "w") as f:
+            f.write(f"# Scan failures ({len(failures)} total)\n\n")
+            for paper_id, reason in failures:
+                f.write(f"{paper_id}\n {reason}\n\n")
+        print(f"Failure log: {failure_path}")
+
+
+if __name__ == "__main__":
+    main()
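The quality heuristics in `ensure_text` are marked as duplicated from extract-text.py. One way to keep the two copies from drifting would be to isolate the check as a pure function that both scripts could import. The sketch below assumes the same three thresholds shown in the diff. `check_text_quality` is a hypothetical name and is not part of this commit.

```python
MIN_CHARS = 500
MIN_WORDS_PER_PAGE = 30
MAX_GARBLE_RATIO = 0.15


def check_text_quality(text, page_count):
    """Return (ok, reason) using the same three thresholds as ensure_text."""
    if len(text) < MIN_CHARS:
        return False, f"too short ({len(text)} chars)"

    words_per_page = len(text.split()) / max(page_count, 1)
    if words_per_page < MIN_WORDS_PER_PAGE:
        return False, f"too few words/page ({words_per_page:.0f})"

    # Count non-ASCII characters that are not letters; accented or CJK
    # letters pass str.isalpha() and are therefore not treated as garble.
    suspicious = sum(1 for c in text if ord(c) > 127 and not c.isalpha())
    garble_ratio = suspicious / max(len(text), 1)
    if garble_ratio > MAX_GARBLE_RATIO:
        return False, f"garbled ({garble_ratio:.1%} non-ASCII)"

    return True, "ok"
```

Because the function takes only a string and a page count, it can be exercised without a PDF or the pymupdf dependency, for example `check_text_quality("word " * 200, 1)`.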