CLAUDE.md (3488B)
# AI Research Survey - Project Rules

## What This Is
Systematic review of ~1,000 research papers evaluating methodological quality in the agentic AI / LLM programming space.

## Project Structure
- `registry.jsonl`: One JSON object per line, one per paper. Source of truth for paper inventory.
- `papers/<slug>/`: One directory per paper. Contains `paper.pdf` (local only), `scan.json`, optionally `calibration.json`, `deep_eval.json`.
- `schema/`: JSON Schemas that agent outputs must conform to.
- `agents/`: Prompt files for each agent type.
- `context/`: Project requirements, methodology, related work.
- `scripts/`: Pipeline tooling (harvest-citations, etc.).
- `inbox/`: Drop PDFs here for the inbox-sorter agent to process.

## Registry Conventions
- **ID slugs**: lowercase, hyphen-separated, concise. Include year. E.g., `metr-rct-2025`.
- **ID slugs must be unique** across the entire registry.
- **Status values**: `queued` → `downloaded` → `scanned` → `deep_eval`. Also `excluded`.
- **Source values**: `manual`, `arxiv`, `huggingface`, `semantic_scholar`, `inbox`.
- **Dedup before inserting**: Check `arxiv_id`, `doi`, and title (case-insensitive) against existing entries.
- **Never delete registry entries**. Set status to `excluded` with a note explaining why.

## Model Assignments
- **Harvester agent**: Sonnet (structured metadata extraction, no deep reasoning needed)
- **Scan agent**: Opus (calibration showed persistent Sonnet generosity bias: 36% of disagreements)
- **Audit/calibration agent**: Opus only (independent re-evaluation to measure inter-rater agreement; `/audit` command)
- **Deep-eval agent**: Opus (requires careful verification work)
- **Inbox sorter**: Sonnet

## PDFs
- PDFs are stored locally for analysis but **never committed to git**.
- `.gitignore` excludes `papers/*/*.pdf` and `inbox/*.pdf`.
- Do not redistribute PDFs. Only structured outputs (scan.json, deep_eval.json) are publishable.

## Scan Output
- Must conform to `schema/scan.schema.json`.
- V1: 50 base questions. V2: 50 base + up to 15 conditional (experimental_rigor, data_leakage, survey_methodology).
- Each checklist item has two boolean fields: `applies` (is the criterion relevant?) and `answer` (does the paper satisfy it?), plus a `justification` string.
- V2 scans include `scan_version: 2` and `active_modules` array.
- `cited_papers` array is required: extract 3-15 survey-relevant references per paper for citation chasing.
- Validate with `python3 scripts/validate-scan.py` (supports both v1 and v2).
- Run `scripts/harvest-citations.py` after scanning to discover new candidates.

## Calibration / Audit
- `/audit` runs Opus calibration on existing scans. Always uses Opus, never Sonnet.
- Opus independently answers the same 50-question checklist, then compares with scan.json.
- Output: `papers/<slug>/calibration.json` with agreement rate, per-question disagreements, and Opus's full checklist.
- Purpose: measure inter-rater reliability for the published paper. Target: >95% agreement.
- Round 3 results (60 papers): 97.0% agreement. Existing calibration.json files compare Sonnet scans vs Opus.

## Code Style
- Scripts in Python 3. No external dependencies unless unavoidable.
- JSON output: `ensure_ascii=False`, one object per line for JSONL.
- Dates in ISO 8601 (`YYYY-MM-DD`).

## Git
- Commit messages: imperative mood, concise first line, body for detail.
- Never commit PDFs.
- Never amend published commits.
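
The "Dedup before inserting" rule under Registry Conventions, together with the JSONL rule under Code Style, could be sketched roughly as follows. This is a minimal illustration and not the project's actual tooling: the helper names `find_duplicate` and `to_jsonl_line` are hypothetical, and the `title` key is an assumed field name (the rules name `arxiv_id` and `doi` but do not spell out the title field). Only the title comparison is case-insensitive, as the rule states; `arxiv_id` and `doi` are compared exactly here, which is an assumption.

```python
import json


def find_duplicate(entries, candidate):
    """Return the first existing entry that matches the candidate, else None.

    A match is an equal, non-empty `arxiv_id` or `doi`, or an equal title
    after stripping whitespace and ignoring case.
    """
    # Assumed field name: "title". Empty or missing titles never match.
    title = (candidate.get("title") or "").strip().casefold()
    for entry in entries:
        for key in ("arxiv_id", "doi"):
            value = candidate.get(key)
            if value and entry.get(key) == value:
                return entry
        if title and (entry.get("title") or "").strip().casefold() == title:
            return entry
    return None


def to_jsonl_line(entry):
    """Serialize one entry as a single line, keeping non-ASCII characters."""
    return json.dumps(entry, ensure_ascii=False)
```

A caller would load every line of `registry.jsonl` into `entries`, call `find_duplicate` with the new record, and append `to_jsonl_line(record)` plus a newline only when the result is `None`.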