ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

commit 279c91802101fbb60c70d19a451bfc29babcd85d
parent 6e63d899afcec26a7a1e6668f9197bfad25b53f0
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Sun,  8 Mar 2026 10:22:14 +0100

Implement v2 scan pipeline with conditional modules and enrichment

- claim.py: extend expiry to 1 hour, add take-next atomic command
- validate-scan.py: standalone schema validator (572/572 existing scans pass)
- Schema: add scan_version, active_modules, 3 conditional categories
  (experimental_rigor 8q, data_leakage 4q, survey_methodology 3q)
  sourced from Henderson, Dodge, Lucic, Kapoor meta-research findings
- scan-worker.md: v2 worker loop (triage → 6 parallel category agents)
- scan-triage.md + scan-category-{a..f}.md: split evaluation prompts
- scan.md command: v2 default with v1 fallback flag
- enrich-metadata.py: Semantic Scholar API enrichment
- build-citation-graph.py: cross-reference cited_papers against registry
- methodology.md: document new questions with sources
- V1 scans remain valid (all new fields optional)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Diffstat:
M .claude/commands/scan.md | 43 +++++++++++++++++++++++++++++++++++++++----
A agents/scan-category-a.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
A agents/scan-category-b.md | 41 +++++++++++++++++++++++++++++++++++++++++
A agents/scan-category-c.md | 34 ++++++++++++++++++++++++++++++++++
A agents/scan-category-d.md | 34 ++++++++++++++++++++++++++++++++++
A agents/scan-category-e.md | 42 ++++++++++++++++++++++++++++++++++++++++++
A agents/scan-category-f.md | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A agents/scan-triage.md | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A agents/scan-worker.md | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M context/methodology.md | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
M schema/scan.schema.json | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
A schema/scan.schema.v1.json | 508 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A scripts/build-citation-graph.py | 172 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M scripts/claim.py | 36 +++++++++++++++++++++++++++++++++---
A scripts/enrich-metadata.py | 174 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A scripts/validate-scan.py | 217 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
16 files changed, 1777 insertions(+), 26 deletions(-)
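
For orientation: the conditional modules named in the commit message are switched on by a paper's methodology tags. The following is a minimal sketch of that rule, assuming the tag-to-module mapping given in agents/scan-triage.md and the question counts from the commit message. The names MODULE_TRIGGERS, MODULE_SIZES, active_modules and question_count are illustrative only and do not come from the repository.

```python
# Tag-to-module rules as stated in agents/scan-triage.md (step 3).
# All names in this block are hypothetical, chosen for illustration.
MODULE_TRIGGERS = {
    "experimental_rigor": {"benchmark-eval", "rct"},
    "data_leakage": {"benchmark-eval"},
    "survey_methodology": {"meta-analysis"},
}

# Question counts per module, taken from the commit message.
MODULE_SIZES = {"experimental_rigor": 8, "data_leakage": 4, "survey_methodology": 3}
BASE_QUESTIONS = 50


def active_modules(methodology_tags):
    """Return the conditional modules activated by a paper's tags, in a stable order."""
    tags = set(methodology_tags)
    # A module is active when at least one of its trigger tags is present.
    return [name for name, triggers in MODULE_TRIGGERS.items() if tags & triggers]


def question_count(methodology_tags):
    """Base checklist size plus the size of every active conditional module."""
    return BASE_QUESTIONS + sum(MODULE_SIZES[m] for m in active_modules(methodology_tags))
```

A paper tagged only `benchmark-eval` activates two modules (62 questions in total), while a paper tagged with both `benchmark-eval` and `meta-analysis` reaches the stated maximum of 65.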

diff --git a/.claude/commands/scan.md b/.claude/commands/scan.md
@@ -4,6 +4,7 @@ Arguments: $ARGUMENTS
 - A number (e.g., `5`, `20`) sets the batch limit
 - `all` or no argument runs unlimited until all papers are scanned
 - `status` just prints current progress without scanning
+- `v1` suffix forces v1 single-pass mode (e.g., `10 v1`)

 ## Instructions

@@ -20,11 +21,13 @@ Print the status summary to the user.

 If the argument is `status`, stop here.

-### 2. Determine batch size
+### 2. Determine batch size and mode

 Parse `$ARGUMENTS`:
 - If it's a number, scan that many papers maximum
 - If it's `all` or empty, scan everything available
+- If it contains `v1`, use v1 single-pass mode (see below)
+- Default: v2 two-pass mode with conditional modules

 ### 3. Get the list of papers to scan

@@ -36,7 +39,39 @@ This returns slugs of papers that have paper.txt, no scan.json, and no active cl

 ### 4. Launch scan sub-agents in parallel batches

-Launch **5 sub-agents at a time** using the Task tool with `model: "sonnet"` and `run_in_background: true`.
+Launch **50 sub-agents at a time** using the Agent tool with `run_in_background: true`.
+
+#### V2 mode (default)
+
+For each sub-agent, use this prompt (fill in the slug):
+
+---
+
+You are a v2 scan agent. Your job is to evaluate a single research paper using a two-pass process.
+
+**Read these files first:**
+1. `/root/projects/ai-research-survey/schema/scan.schema.json` — the full checklist schema
+2. `/root/projects/ai-research-survey/agents/scan-agent.md` — answer rules and strictness guidelines
+3. `/root/projects/ai-research-survey/papers/<SLUG>/paper.txt` — the paper to evaluate
+
+**Then:**
+1. Claim the paper: run `python3 /root/projects/ai-research-survey/scripts/claim.py take <SLUG>`. If it prints "taken", stop immediately.
+
+2. **Pass 1 (Triage)**: Read `agents/scan-triage.md`. Extract metadata, assign methodology_tags, determine active conditional modules (experimental_rigor if benchmark-eval/rct, data_leakage if benchmark-eval, survey_methodology if meta-analysis), set all applicability flags, extract cited papers, key findings, red flags.
+
+3. **Pass 2 (Evaluation)**: Answer all 50 base checklist questions + any active conditional module questions. Follow the schema descriptions and scan-agent.md answer rules strictly.
+
+4. Assemble the full scan.json with `scan_version: 2`, `active_modules`, and the complete checklist. Write to `/root/projects/ai-research-survey/papers/<SLUG>/scan.json`.
+
+5. Validate: run `python3 /root/projects/ai-research-survey/scripts/validate-scan.py papers/<SLUG>/scan.json`. Fix any errors.
+
+6. Release: run `python3 /root/projects/ai-research-survey/scripts/claim.py done <SLUG>`
+
+Do NOT write to registry.jsonl. Only write scan.json.
+
+---
+
+#### V1 mode (legacy, when `v1` flag is present)

 For each sub-agent, use this prompt (fill in the slug):

@@ -61,12 +96,12 @@ Do NOT write to registry.jsonl. Only write scan.json.

 ### 5. Wait for each batch to complete before launching the next

-After each batch of 5 completes, report:
+After each batch of 50 completes, report:
 - How many succeeded (wrote scan.json)
 - How many failed
 - How many remain

-Then launch the next batch of 5. Continue until the limit is reached or all papers are scanned.
+Then launch the next batch of 50. Continue until the limit is reached or all papers are scanned.

 ### 6. Final summary

diff --git a/agents/scan-category-a.md b/agents/scan-category-a.md
@@ -0,0 +1,50 @@
+# Scan Category A: Artifacts + Setup Transparency
+
+**Model: Opus**
+
+You are a category evaluator. Answer ONLY the questions in your assigned categories.
+
+## Your categories (9 questions)
+
+### Artifacts (4q)
+- `code_released` — source code released (GitHub, Zenodo)?
+- `data_released` — dataset released or publicly available?
+- `environment_specified` — environment/dependency specs provided?
+- `reproduction_instructions` — step-by-step reproduction instructions?
+
+### Setup Transparency (5q)
+- `model_versions_specified` — exact model versions (not just "GPT-4")?
+- `prompts_provided` — actual prompt text provided (not just descriptions)?
+- `hyperparameters_reported` — temperature, learning rate, etc.?
+- `scaffolding_described` — agentic scaffolding described in detail?
+- `data_preprocessing_documented` — preprocessing/filtering steps documented?
+
+## Input
+
+1. Paper text: `papers/<SLUG>/paper.txt`
+2. Triage applicability flags: `papers/<SLUG>/triage.json` → `applicability.artifacts` and `applicability.setup_transparency`
+
+## Output
+
+Write to stdout a JSON object with this structure:
+
+```json
+{
+  "artifacts": {
+    "code_released": { "applies": true, "answer": true, "justification": "..." },
+    ...
+  },
+  "setup_transparency": {
+    "model_versions_specified": { "applies": true, "answer": false, "justification": "..." },
+    ...
+  }
+}
+```
+
+## Rules
+
+- Read the schema descriptions in `schema/scan.schema.json` for detailed evaluation criteria per question.
+- Use the `applies` flag from triage.json. If triage says `applies: false`, set `applies: false, answer: false` with justification.
+- If triage says `applies: true`, search the paper and determine the answer.
+- Follow all answer rules from `agents/scan-agent.md`: be strict, don't be generous, absence of evidence is `answer: false`.
+- Cite specific sections/pages in justifications.
diff --git a/agents/scan-category-b.md b/agents/scan-category-b.md
@@ -0,0 +1,41 @@
+# Scan Category B: Statistical Methodology + Evaluation Design
+
+**Model: Opus**
+
+You are a category evaluator. Answer ONLY the questions in your assigned categories.
+
+## Your categories (14 questions)
+
+### Statistical Methodology (5q)
+- `confidence_intervals_or_error_bars` — CIs or error bars on main results?
+- `significance_tests` — statistical tests for comparative claims?
+- `effect_sizes_reported` — effect sizes, not just p-values or raw differences?
+- `sample_size_justified` — sample size justified or power analysis?
+- `variance_reported` — variance/std dev across experimental runs?
+
+### Evaluation Design (9q)
+- `baselines_included` — baseline comparisons included?
+- `baselines_contemporary` — baselines recent and competitive?
+- `ablation_study` — ablation showing which components matter?
+- `multiple_metrics` — multiple evaluation metrics used?
+- `human_evaluation` — human evaluation of system outputs?
+- `held_out_test_set` — results on held-out test set?
+- `per_category_breakdown` — per-category/per-task breakdowns?
+- `failure_cases_discussed` — failure cases shown or discussed?
+- `negative_results_reported` — things that didn't work reported?
+
+## Input
+
+1. Paper text: `papers/<SLUG>/paper.txt`
+2. Triage applicability flags: `papers/<SLUG>/triage.json`
+
+## Output
+
+Write to stdout a JSON object with `statistical_methodology` and `evaluation_design` keys, each containing checklist items with `applies`, `answer`, `justification`.
+
+## Rules
+
+- Read schema descriptions in `schema/scan.schema.json` for detailed criteria.
+- Use `applies` flags from triage.json.
+- Be strict. Follow answer rules from `agents/scan-agent.md`.
+- Cite specific sections/pages in justifications.
diff --git a/agents/scan-category-c.md b/agents/scan-category-c.md
@@ -0,0 +1,34 @@
+# Scan Category C: Claims & Evidence + Limitations & Scope
+
+**Model: Opus**
+
+You are a category evaluator. Answer ONLY the questions in your assigned categories.
+
+## Your categories (7 questions)
+
+### Claims and Evidence (4q)
+- `abstract_claims_supported` — abstract claims supported by results?
+- `causal_claims_justified` — causal claims backed by adequate study design?
+- `generalization_bounded` — generalizations bounded to tested setting?
+- `alternative_explanations_discussed` — alternative explanations considered?
+
+### Limitations and Scope (3q)
+- `limitations_section_present` — dedicated limitations section?
+- `threats_to_validity_specific` — specific (not boilerplate) threats discussed?
+- `scope_boundaries_stated` — explicit statements of what results do NOT show?
+
+## Input
+
+1. Paper text: `papers/<SLUG>/paper.txt`
+2. Triage applicability flags: `papers/<SLUG>/triage.json`
+
+## Output
+
+Write to stdout a JSON object with `claims_and_evidence` and `limitations_and_scope` keys, each containing checklist items with `applies`, `answer`, `justification`.
+
+## Rules
+
+- Read schema descriptions in `schema/scan.schema.json` for detailed criteria.
+- Use `applies` flags from triage.json.
+- Be strict. Follow answer rules from `agents/scan-agent.md`.
+- Cite specific sections/pages in justifications.
diff --git a/agents/scan-category-d.md b/agents/scan-category-d.md
@@ -0,0 +1,34 @@
+# Scan Category D: Data Integrity + Contamination
+
+**Model: Opus**
+
+You are a category evaluator. Answer ONLY the questions in your assigned categories.
+
+## Your categories (7 questions)
+
+### Data Integrity (4q)
+- `raw_data_available` — raw data available for independent verification?
+- `data_collection_described` — data collection procedure described?
+- `recruitment_methods_described` — participant/sample recruitment described?
+- `data_pipeline_documented` — full data pipeline documented?
+
+### Contamination (3q)
+- `training_cutoff_stated` — model training data cutoff stated?
+- `train_test_overlap_discussed` — potential train/test overlap discussed?
+- `benchmark_contamination_addressed` — benchmark contamination risk addressed?
+
+## Input
+
+1. Paper text: `papers/<SLUG>/paper.txt`
+2. Triage applicability flags: `papers/<SLUG>/triage.json`
+
+## Output
+
+Write to stdout a JSON object with `data_integrity` and `contamination` keys, each containing checklist items with `applies`, `answer`, `justification`.
+
+## Rules
+
+- Read schema descriptions in `schema/scan.schema.json` for detailed criteria.
+- Use `applies` flags from triage.json.
+- Be strict. Follow answer rules from `agents/scan-agent.md`.
+- Cite specific sections/pages in justifications.
diff --git a/agents/scan-category-e.md b/agents/scan-category-e.md
@@ -0,0 +1,42 @@
+# Scan Category E: Conflicts of Interest + Human Studies + Cost & Practicality
+
+**Model: Opus**
+
+You are a category evaluator. Answer ONLY the questions in your assigned categories.
+
+## Your categories (13 questions)
+
+### Conflicts of Interest (4q)
+- `funding_disclosed` — funding source disclosed?
+- `affiliations_disclosed` — author affiliations with evaluated product disclosed?
+- `funder_independent_of_outcome` — funder independent of results?
+- `financial_interests_declared` — patents, equity, financial interests declared?
+
+### Human Studies (7q)
+- `pre_registered` — study pre-registered?
+- `irb_or_ethics_approval` — IRB/ethics approval mentioned?
+- `demographics_reported` — participant demographics reported?
+- `inclusion_exclusion_criteria` — inclusion/exclusion criteria stated?
+- `randomization_described` — randomization procedure described?
+- `blinding_described` — blinding described?
+- `attrition_reported` — attrition/dropout reported?
+
+### Cost and Practicality (2q)
+- `inference_cost_reported` — inference cost or latency reported?
+- `compute_budget_stated` — total computational budget stated?
+
+## Input
+
+1. Paper text: `papers/<SLUG>/paper.txt`
+2. Triage applicability flags: `papers/<SLUG>/triage.json`
+
+## Output
+
+Write to stdout a JSON object with `conflicts_of_interest`, `human_studies`, and `cost_and_practicality` keys, each containing checklist items with `applies`, `answer`, `justification`.
+
+## Rules
+
+- Read schema descriptions in `schema/scan.schema.json` for detailed criteria.
+- Use `applies` flags from triage.json.
+- Be strict. Follow answer rules from `agents/scan-agent.md`.
+- Cite specific sections/pages in justifications.
diff --git a/agents/scan-category-f.md b/agents/scan-category-f.md
@@ -0,0 +1,59 @@
+# Scan Category F: Conditional Modules
+
+**Model: Opus**
+
+You are a category evaluator for conditional checklist modules. Answer ONLY the questions in modules that are active for this paper.
+
+## Conditional modules
+
+### Experimental Rigor (8q, active when tags include benchmark-eval or rct)
+- `seed_sensitivity_reported` — results across multiple random seeds?
+- `number_of_runs_stated` — exact run count stated?
+- `hyperparameter_search_budget` — search budget reported?
+- `best_config_selection_justified` — config selection not cherry-picked?
+- `multiple_comparison_correction` — correction for multiple statistical tests?
+- `self_comparison_bias_addressed` — authors acknowledge evaluating own system?
+- `compute_budget_vs_performance` — performance as function of compute?
+- `benchmark_construct_validity` — benchmark measures what's claimed?
+
+### Data Leakage (4q, active when tags include benchmark-eval)
+- `temporal_leakage_addressed` — temporal leakage discussed?
+- `feature_leakage_addressed` — feature leakage discussed?
+- `non_independence_addressed` — train/test independence verified?
+- `leakage_detection_method` — concrete detection method used?
+
+### Survey Methodology (3q, active when tags include meta-analysis)
+- `prisma_or_structured_protocol` — PRISMA or structured protocol followed?
+- `quality_assessment_of_sources` — quality scoring of included studies?
+- `publication_bias_discussed` — publication bias considered?
+
+## Input
+
+1. Paper text: `papers/<SLUG>/paper.txt`
+2. Triage: `papers/<SLUG>/triage.json` → check `active_modules` to see which modules to evaluate
+
+## Output
+
+Write to stdout a JSON object containing ONLY the active module keys. Example for a benchmark-eval paper:
+
+```json
+{
+  "experimental_rigor": {
+    "seed_sensitivity_reported": { "applies": true, "answer": false, "justification": "..." },
+    ...
+  },
+  "data_leakage": {
+    "temporal_leakage_addressed": { "applies": true, "answer": false, "justification": "..." },
+    ...
+  }
+}
+```
+
+If no modules are active (empty `active_modules`), output `{}`.
+
+## Rules
+
+- Read schema descriptions in `schema/scan.schema.json` for detailed criteria per question.
+- Use `applies` flags from triage.json for questions in active modules.
+- Be strict. Follow answer rules from `agents/scan-agent.md`.
+- Cite specific sections/pages in justifications.
diff --git a/agents/scan-triage.md b/agents/scan-triage.md
@@ -0,0 +1,92 @@
+# Scan Triage Agent (Pass 1)
+
+**Model: Sonnet** (lightweight classification and metadata extraction)
+
+You are a triage agent. Your job is to quickly read a research paper and produce:
+1. Paper metadata
+2. Methodology tags
+3. Applicability flags for all checklist questions
+4. Cited papers relevant to the survey
+5. Active conditional modules
+
+## Input
+
+- Paper text: `papers/<SLUG>/paper.txt`
+
+## Output
+
+Write `papers/<SLUG>/triage.json` with this structure:
+
+```json
+{
+  "paper": {
+    "title": "...",
+    "authors": ["..."],
+    "year": 2025,
+    "venue": "...",
+    "arxiv_id": "...",
+    "doi": "..."
+  },
+  "methodology_tags": ["benchmark-eval"],
+  "active_modules": ["experimental_rigor", "data_leakage"],
+  "applicability": {
+    "artifacts": {
+      "code_released": true,
+      "data_released": true,
+      ...
+    },
+    ...
+  },
+  "cited_papers": [...],
+  "key_findings": "...",
+  "red_flags": [...]
+}
+```
+
+## Instructions
+
+### 1. Extract metadata
+
+Fill in `paper` object from what's stated in the paper itself.
+
+### 2. Assign methodology tags
+
+One or more of: `rct`, `observational`, `benchmark-eval`, `case-study`, `meta-analysis`, `theoretical`, `qualitative`.
+
+### 3. Determine active conditional modules
+
+Based on methodology_tags:
+- `benchmark-eval` or `rct` → activate `experimental_rigor`
+- `benchmark-eval` → activate `data_leakage`
+- `meta-analysis` → activate `survey_methodology`
+
+### 4. Set applicability flags
+
+For every question in the base 50 + active conditional modules, decide `applies: true/false`.
+
+Follow the same rules as `scan-agent.md`:
+- `applies: false` = structurally inapplicable to this paper type
+- `applies: true` = the paper could reasonably be expected to address this
+- When in doubt, set `applies: true`
+
+The `applicability` object mirrors the checklist structure but contains only boolean values (the applies flag for each question).
+
+### 5. Extract cited papers
+
+Same as scan-agent.md: 3-15 survey-relevant references with title, authors, year, arxiv_id, doi, relevance.
+
+### 6. Summarize key findings
+
+2-4 sentence factual summary.
+
+### 7. Flag red flags
+
+Note methodological concerns. Empty array if none.
+
+## Paper type guidance
+
+Use the same paper-type rules from `scan-agent.md` for applicability decisions:
+- Survey papers: artifacts applies, most statistical_methodology doesn't, human_studies doesn't
+- Mining studies: human_studies doesn't apply, contamination usually doesn't
+- Benchmark-eval: most things apply, contamination is especially important
+- Theoretical: most empirical items don't apply
diff --git a/agents/scan-worker.md b/agents/scan-worker.md
@@ -0,0 +1,98 @@
+# Scan Worker (v2)
+
+**Model: Opus**
+
+You are a scan worker running in a Claude Code session. You process papers in a loop: claim → triage → evaluate → assemble → validate → release → repeat.
+
+## Worker Loop
+
+### Step 1: Claim next paper
+
+```bash
+python3 /root/projects/ai-research-survey/scripts/claim.py take-next
+```
+
+If output is "none", all papers are done. Stop.
+
+The output is the slug of the paper you claimed.
+
+### Step 2: Read the paper
+
+Read `papers/<SLUG>/paper.txt`.
+
+### Step 3: Pass 1 — Triage
+
+Using the instructions in `agents/scan-triage.md`:
+1. Extract paper metadata
+2. Assign methodology_tags
+3. Determine active conditional modules
+4. Set all applicability flags
+5. Extract cited papers, key findings, red flags
+
+Write the result to `papers/<SLUG>/triage.json`.
+
+### Step 4: Pass 2 — Evaluate by category
+
+Launch 6 sub-agents in parallel (use the Agent tool). Each evaluates its assigned questions:
+
+- **Agent A** (`agents/scan-category-a.md`): artifacts + setup_transparency (9 Qs)
+- **Agent B** (`agents/scan-category-b.md`): statistical_methodology + evaluation_design (14 Qs)
+- **Agent C** (`agents/scan-category-c.md`): claims_and_evidence + limitations_and_scope (7 Qs)
+- **Agent D** (`agents/scan-category-d.md`): data_integrity + contamination (7 Qs)
+- **Agent E** (`agents/scan-category-e.md`): conflicts_of_interest + human_studies + cost_and_practicality (13 Qs)
+- **Agent F** (`agents/scan-category-f.md`): conditional modules (0-15 Qs, skip if no active modules)
+
+Each sub-agent receives:
+- Path to paper.txt
+- Path to triage.json (for applicability flags)
+- Path to scan.schema.json (for question descriptions)
+- Instructions from its category prompt file
+
+Each sub-agent returns a JSON object with its category answers.
+
+### Step 5: Assemble scan.json
+
+Merge all outputs into a single scan.json:
+
+```json
+{
+  "scan_version": 2,
+  "paper": { ... from triage },
+  "methodology_tags": [ ... from triage ],
+  "active_modules": [ ... from triage ],
+  "checklist": {
+    ... merge all category agent outputs ...
+  },
+  "claims": [ ... from triage or separate extraction ],
+  "key_findings": "... from triage",
+  "red_flags": [ ... from triage ],
+  "cited_papers": [ ... from triage ]
+}
+```
+
+Write to `papers/<SLUG>/scan.json`.
+
+### Step 6: Validate
+
+```bash
+python3 /root/projects/ai-research-survey/scripts/validate-scan.py papers/<SLUG>/scan.json
+```
+
+If validation fails, fix the issues and re-write.
+
+### Step 7: Release claim
+
+```bash
+python3 /root/projects/ai-research-survey/scripts/claim.py done <SLUG>
+```
+
+### Step 8: Loop
+
+Go back to Step 1.
+
+## Important rules
+
+- Do NOT write to `registry.jsonl` — only write scan.json and triage.json.
+- Write files immediately — do not hold results in memory.
+- If a paper fails (unreadable, too short, clearly wrong PDF), release with `fail` and move on.
+- Be strict on all checklist answers. Follow `agents/scan-agent.md` answer rules.
diff --git a/context/methodology.md b/context/methodology.md
@@ -13,18 +13,30 @@ The key adaptation is that we are reviewing *methodological quality*, not synthe

 ## Quality Assessment Instrument

-### Design: 50-question boolean checklist
+### Design: 50-question two-field boolean checklist

-We use a 50-question yes/no/na checklist rather than subjective Likert-style scores. This was a deliberate design decision:
+We use a 50-question checklist with two boolean fields per question rather than subjective Likert-style scores or a three-way yes/no/na enum. This was a deliberate design decision refined across two calibration rounds.

+**Each question has:**
+- `applies` (boolean): Is this criterion relevant to this paper type?
+- `answer` (boolean): Does the paper satisfy this criterion? (Only meaningful when applies=true.)
+- `justification` (string): 1-3 sentence explanation citing specific paper sections.
+
+**Why two fields instead of yes/no/na:**
+The original design used a single `answer: yes/no/na` field. Calibration round 1 (93.2% agreement) revealed that 47-56% of Sonnet-Opus disagreements were "NA boundary errors" — the model conflating "the paper didn't do this" (should be no) with "this doesn't apply" (should be na). The two-field design forces explicit, separate decisions on applicability and compliance, eliminating this conflation.
+
+**Design principles:**
 - **Verifiable**: each question has a factually correct answer checkable against the paper
 - **Auditable**: justification text cites specific sections/quotes
-- **High inter-rater reliability**: yes/no has much less variance than 0-3 scores
+- **High inter-rater reliability**: booleans have much less variance than 0-3 scores
 - **Fast human calibration**: checking 50 booleans takes ~15 min, not hours
 - **Derived scores**: composite scores computed deterministically from boolean counts
+- **Separate denominators**: compliance rates computed only over papers where applies=true
 - **The questions are findings**: "only 34% of papers release code" is concrete and publishable

-Previous design (discarded): 6-dimension 0-3 rubric (Artifacts, Statistical Rigor, Benchmark Quality, Claim-to-Evidence Ratio, Setup Transparency, Limitations Discussion). Discarded because LLM-assigned subjective scores are hard to defend in a paper about methodological rigor. Inter-rater reliability would be poor, construct validity unproven, and the instrument itself would be a methodology weakness.
+**Previous designs (discarded):**
+1. 6-dimension 0-3 rubric — discarded because LLM-assigned subjective scores are hard to defend in a paper about methodological rigor.
+2. Single yes/no/na field — discarded after calibration showed NA boundary confusion was the dominant error mode.
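
The "separate denominators" principle listed above can be sketched as follows: a compliance rate counts only items where `applies` is true, so an inapplicable question never lowers the rate. This is an illustrative sketch under stated assumptions, not the project's scoring code. The function name `compliance_rate` and the sample justifications are hypothetical; the `applies`/`answer`/`justification` fields follow the item shape described in this diff.

```python
def compliance_rate(items):
    """Fraction of applicable checklist items answered true.

    Returns None when no item applies, so "not applicable" is never
    reported as 0% compliance.
    """
    applicable = [item for item in items if item["applies"]]
    if not applicable:
        return None
    satisfied = sum(1 for item in applicable if item["answer"])
    return satisfied / len(applicable)


# One question (code_released) across three hypothetical papers.
code_released = [
    {"applies": True, "answer": True, "justification": "Repository linked in Section 5."},
    {"applies": True, "answer": False, "justification": "No code link found."},
    {"applies": False, "answer": False, "justification": "Theoretical paper, no code expected."},
]

# Two papers are applicable and one of them complies.
rate = compliance_rate(code_released)
```

With a single yes/no/na field the third paper would need its own `na` branch; with two booleans it simply falls out of the denominator.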
 ### Categories (11 groups, 50 questions)

@@ -42,13 +54,42 @@ Previous design (discarded): 6-dimension 0-3 rubric (Artifacts, Statistical Rigo

 Data integrity and conflicts of interest categories inspired by the Wakefield MMR case — "Is raw data available for independent verification?" would have caught the fabrication years earlier.

+### Conditional modules (v2, 15 questions)
+
+V2 scans add conditional question modules activated by methodology_tags. These target systematic issues identified by meta-research papers in the corpus.
+
+**12. Experimental rigor** (8q, activated by `benchmark-eval` or `rct`):
+- `seed_sensitivity_reported` — Henderson et al. (2018) showed RL results vary 2x across seeds
+- `number_of_runs_stated` — exact run count, not implicit
+- `hyperparameter_search_budget` — Dodge et al. (2019) showed search budget dramatically affects results
+- `best_config_selection_justified` — selection on validation set, not cherry-picked
+- `multiple_comparison_correction` — Bonferroni/Holm/BH for multiple tests
+- `self_comparison_bias_addressed` — Lucic et al. (2018) showed authors' baseline re-implementations systematically underperform
+- `compute_budget_vs_performance` — performance as function of compute, not just peak
+- `benchmark_construct_validity` — Kapoor & Narayanan (2024) documented widespread validity gaps
+
+**13. Data leakage** (4q, activated by `benchmark-eval`):
+- `temporal_leakage_addressed` — training data from after prediction target
+- `feature_leakage_addressed` — input features leak answer information
+- `non_independence_addressed` — train/test share structural similarities
+- `leakage_detection_method` — concrete detection (canary strings, n-gram overlap, etc.)
+
+Source: Kapoor & Narayanan (2024) leakage taxonomy.
+
+**14. Survey methodology** (3q, activated by `meta-analysis`):
+- `prisma_or_structured_protocol` — PRISMA or equivalent systematic protocol
+- `quality_assessment_of_sources` — quality scoring of included studies (Leech et al.)
+- `publication_bias_discussed` — funnel plots, negative-result underrepresentation
+
+Total: 50 base + 15 conditional = 65 max per paper. V1 scans remain valid (new fields optional).
+
 Full schema with evaluation criteria for each question: `schema/scan.schema.json`

 ### Answer rules

-- **yes** = the paper clearly satisfies the criterion; you can point to where
-- **no** = the paper does not satisfy the criterion, or evidence is absent (absence of evidence is NO, not NA)
-- **na** = the criterion is structurally inapplicable to this paper type (e.g., human_studies questions for a benchmark paper, contamination questions for a mining study)
+- **`applies: true, answer: true`** = the paper clearly satisfies the criterion; you can point to where
+- **`applies: true, answer: false`** = the paper does not satisfy the criterion, or evidence is absent. Absence of evidence is `answer: false`, not `applies: false`.
+- **`applies: false, answer: false`** = the criterion is structurally inapplicable to this paper type (e.g., human_studies questions for a benchmark paper, contamination questions for a mining study)

 Each answer includes a 1-3 sentence justification citing specific paper sections.

@@ -57,15 +98,15 @@ Each answer includes a 1-3 sentence justification citing specific paper sections
 - **Primary rater**: Sonnet (boolean checklist is factual lookup, not subjective judgment)
 - **Calibration rater**: Opus (independent re-evaluation of subset to measure agreement)

-### Calibration results (round 1, 2026-02-28)
+### Calibration results

-8 papers calibrated (Sonnet vs Opus). Overall agreement: **93.2%** (373/400 questions).
+**Round 1 (2026-02-28, yes/no/na format):** 8 papers, **93.2%** agreement (373/400). Two systematic issues:
+1. NA boundary errors (56% of disagreements): Sonnet confused "didn't do it" with "doesn't apply." Fixed by adding explicit NA guidance per question.
+2. Generosity bias (44%): Sonnet credited partial information. Fixed by adding "does NOT count" examples.

-Two systematic issues identified and corrected in the schema:
-1. **NA boundary errors** (56% of disagreements): Sonnet marked NO when questions didn't apply, or NA when they did. Fixed by adding explicit "NA when:" guidance to schema descriptions for contamination, human_studies, artifacts, cost_and_practicality categories.
-2. **Generosity bias** (44% of disagreements): Sonnet said YES where Opus said NO, especially on setup_transparency (credited partial information) and claims_and_evidence (credited vague mentions). Fixed by adding "does NOT count" examples to schema descriptions.
+**Round 2 (2026-02-28, yes/no/na format, post-fixes):** 10 papers, **96.2%** agreement (481/500). Improvement confirmed, but NA boundary errors still 47% of remaining disagreements. Led to two-field redesign (applies + answer) to structurally eliminate the conflation.

-Schema and agent prompt updated based on calibration findings. All scans removed for re-run with improved instrument. Calibration round 2 pending.
+**Round 3 (pending, applies/answer format):** Will validate whether the two-field design resolves remaining NA boundary issues.

 ## Paper Selection

@@ -87,3 +128,33 @@ Schema and agent prompt updated based on calibration findings. All scans removed
 - **Community sources**: HuggingFace trending, Semantic Scholar alerts

 This is a purposive sample, not a random one. The goal is coverage of the most influential and most cited papers, not statistical representativeness of all papers published.
+
+## PDF Acquisition Pipeline
+
+PDFs are obtained through a multi-stage automated pipeline before falling back to manual retrieval. All stages are fully documented for transparency in the PRISMA flow diagram.
+
+### Automated stages (scripts/download-arxiv.py, scripts/download-doi.py)
+
+1. **arXiv direct download** — papers with `arxiv_id` downloaded from `arxiv.org/pdf/<id>.pdf`. Also catches arXiv DOIs (`10.48550/arXiv.*`) where the arXiv ID is embedded in the DOI.
+2. **Semantic Scholar open access** — queries S2 API for open-access PDF URL; also recovers arXiv IDs missed during harvesting.
+3. **Unpaywall** — queries Unpaywall API for green/gold OA versions.
+4. **CORE API** — queries CORE (core.ac.uk) for author manuscripts and institutional repository copies.
+5. **OpenAlex** — queries OpenAlex for additional OA links not indexed by Unpaywall.
+6. **Sci-Hub** — opt-in (`--scihub` flag); parses mirror HTML to find embedded PDF URL.
+
+### Claude web search stage (scripts/run-pdf-finder.py / Agent tool)
+
+For papers that survive all automated stages without a PDF, Claude agents (Sonnet, WebSearch + WebFetch + Bash) perform targeted web searches:
+- DOI landing page crawl for embedded PDF links
+- Author institutional page and publication list
+- Preprint servers (arXiv, SSRN, bioRxiv, OSF)
+- ResearchGate and Semantic Scholar pages
+- Publisher "free access" or author-accepted-manuscript versions
+
+Each agent writes `papers/<slug>/pdf-finder-result.txt` with `FOUND <url>` or `NOT_FOUND`. The orchestrator updates registry status on success.
+
+**Observed hit rate**: ~50% of papers attempted via web search are found (primarily through author pages and preprint servers). The remaining failures are documented as genuinely paywalled with no open-access version available.
+
+### Reporting
+
+Papers that could not be obtained are counted in the PRISMA flow diagram under "full text not available." The acquisition method (arXiv, OA repository, author page, etc.) is not tracked per paper but the overall breakdown is available from registry metadata.
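
The automated stages in the PDF Acquisition Pipeline section above read as an ordered fallback chain: each stage is tried in turn, the first URL found wins, and Sci-Hub is skipped unless explicitly enabled. A minimal sketch of that control flow follows, assuming this reading is right. It is not the code in scripts/download-arxiv.py or scripts/download-doi.py, whose internals are not shown in this diff. Every name here (`find_pdf`, `arxiv_stage`, `api_stage_placeholder`, `STAGES`) is hypothetical, the API-backed stages are stubbed out, and the `https://` scheme is an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

ARXIV_DOI_PREFIX = "10.48550/arxiv."  # compared case-insensitively

# A stage is (name, lookup); lookup returns a PDF URL or None.
Stage = Tuple[str, Callable[[dict], Optional[str]]]


def arxiv_stage(paper: dict) -> Optional[str]:
    """Stage 1: build the arXiv PDF URL from arxiv_id, or from an arXiv DOI."""
    arxiv_id = paper.get("arxiv_id")
    doi = paper.get("doi") or ""
    if not arxiv_id and doi.lower().startswith(ARXIV_DOI_PREFIX):
        # The arXiv ID is whatever follows the DOI prefix.
        arxiv_id = doi[len(ARXIV_DOI_PREFIX):]
    if not arxiv_id:
        return None
    # Scheme is assumed; the source gives only arxiv.org/pdf/<id>.pdf.
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"


def api_stage_placeholder(paper: dict) -> Optional[str]:
    """Stand-in for the API-backed stages, which would need network access."""
    return None


def find_pdf(paper: dict, stages: Iterable[Stage], allow_scihub: bool = False):
    """Try stages in order; return (stage_name, url) for the first hit, else None."""
    for name, lookup in stages:
        if name == "scihub" and not allow_scihub:
            continue  # opt-in stage, mirroring the --scihub flag
        url = lookup(paper)
        if url:
            return name, url
    return None


# Stage order follows the numbered list in the methodology text.
STAGES = [
    ("arxiv", arxiv_stage),
    ("semantic_scholar", api_stage_placeholder),
    ("unpaywall", api_stage_placeholder),
    ("core", api_stage_placeholder),
    ("openalex", api_stage_placeholder),
    ("scihub", api_stage_placeholder),
]
```

Returning the stage name alongside the URL would also make it possible to record the acquisition method per paper, which the Reporting paragraph notes is currently not tracked.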
diff --git a/schema/scan.schema.json b/schema/scan.schema.json @@ -32,7 +32,7 @@ }, "checklist": { "type": "object", - "description": "Boolean quality checklist. Each question has a verifiable yes/no/na answer plus justification text. These replace subjective 0-3 scores with factual, auditable checks.", + "description": "Boolean quality checklist. Each question has two boolean fields — 'applies' (is this relevant to this paper type?) and 'answer' (does the paper satisfy the criterion?) — plus justification text. This two-field design separates applicability from compliance, eliminating ambiguity in NA boundary decisions.", "required": [ "artifacts", "statistical_methodology", @@ -384,9 +384,100 @@ "description": "Is the total computational budget stated? Look for: GPU hours, total API spend, hardware used, training time. If the approach required significant compute and this is not quantified, NO." } } + }, + "experimental_rigor": { + "type": "object", + "description": "Conditional module: activated when methodology_tags includes 'benchmark-eval' or 'rct'. Addresses systematic issues identified by Henderson et al. (2018), Dodge et al. (2019), and Lucic et al. (2018) in ML experimental methodology.", + "properties": { + "seed_sensitivity_reported": { + "$ref": "#/$defs/checklist_item", + "description": "Are results reported across multiple random seeds? Look for: results tables showing mean/std across seeds, or explicit statement of seed sensitivity analysis. Henderson et al. (2018) showed RL results can vary by 2x across seeds. If the paper reports single-seed results, NO." + }, + "number_of_runs_stated": { + "$ref": "#/$defs/checklist_item", + "description": "Is the exact number of experimental runs explicitly stated? Look for: 'averaged over K runs', 'N trials', or equivalent. If results are presented without stating how many runs produced them, NO." 
+ }, + "hyperparameter_search_budget": { + "$ref": "#/$defs/checklist_item", + "description": "Is the hyperparameter search budget reported? Look for: number of configurations tried, search method (grid, random, Bayesian), total compute spent on search. Dodge et al. (2019) showed that search budget dramatically affects reported results. If hyperparameters appear tuned but no search budget is stated, NO." + }, + "best_config_selection_justified": { + "$ref": "#/$defs/checklist_item", + "description": "Is the selection of the best configuration justified and not cherry-picked? Look for: selection on validation set (not test), clear description of selection criterion, or reporting all configurations tried. If only the best result is shown with no explanation of how it was selected, NO." + }, + "multiple_comparison_correction": { + "$ref": "#/$defs/checklist_item", + "description": "When multiple statistical tests are performed, is correction for multiple comparisons applied? Look for: Bonferroni, Holm, Benjamini-Hochberg, or other family-wise error rate corrections. If the paper runs many comparisons and reports p-values without correction, NO. NA if only one or two comparisons are made." + }, + "self_comparison_bias_addressed": { + "$ref": "#/$defs/checklist_item", + "description": "Do the authors acknowledge the bias of evaluating their own system? Look for: explicit discussion of author-evaluation bias, independent evaluation, or mitigation strategies. Lucic et al. (2018) showed that authors' implementations of baselines systematically underperform. If authors compare their system against their own re-implementation of baselines without acknowledging this bias, NO." + }, + "compute_budget_vs_performance": { + "$ref": "#/$defs/checklist_item", + "description": "Is performance reported as a function of compute budget? Look for: performance curves across compute levels, or explicit comparison at matched compute budgets. 
If the proposed method uses 10x more compute than baselines and this is not discussed, NO. NA if compute differences are negligible." + }, + "benchmark_construct_validity": { + "$ref": "#/$defs/checklist_item", + "description": "Does the paper discuss whether the benchmark actually measures what is claimed? Look for: analysis of what the benchmark tests vs. what the paper claims to evaluate, discussion of construct validity, or comparison with alternative benchmarks. Kapoor & Narayanan (2024) documented widespread validity gaps. If the paper uses a benchmark without questioning whether it measures the claimed capability, NO." + } + } + }, + "data_leakage": { + "type": "object", + "description": "Conditional module: activated when methodology_tags includes 'benchmark-eval'. Addresses the taxonomy of leakage types from Kapoor & Narayanan (2024).", + "properties": { + "temporal_leakage_addressed": { + "$ref": "#/$defs/checklist_item", + "description": "Is temporal leakage addressed? Look for: discussion of whether training data includes information from after the prediction target's time period, or whether benchmark problems existed before model training. If a model trained on 2024 data is tested on tasks created in 2022, the model may have seen solutions. NO if this is not discussed." + }, + "feature_leakage_addressed": { + "$ref": "#/$defs/checklist_item", + "description": "Is feature leakage addressed? Look for: discussion of whether input features contain information that would not be available at prediction time, or whether the evaluation setup leaks answer information through context. If test harness provides hints not available in real usage, NO." + }, + "non_independence_addressed": { + "$ref": "#/$defs/checklist_item", + "description": "Is non-independence of train and test data addressed? 
Look for: discussion of whether train and test examples are drawn from the same distribution or share structural similarities (e.g., same repositories, same authors, duplicate or near-duplicate problems). If the paper does not verify independence, NO." + }, + "leakage_detection_method": { + "$ref": "#/$defs/checklist_item", + "description": "Is a concrete leakage detection or prevention method used? Look for: canary strings, membership inference tests, n-gram overlap analysis, temporal splits, decontamination pipelines. If the paper only discusses leakage conceptually without applying a detection method, NO." + } + } + }, + "survey_methodology": { + "type": "object", + "description": "Conditional module: activated when methodology_tags includes 'meta-analysis'. Assesses whether surveys and systematic reviews follow structured review protocols.", + "properties": { + "prisma_or_structured_protocol": { + "$ref": "#/$defs/checklist_item", + "description": "Does the survey follow PRISMA or another structured review protocol? Look for: PRISMA flow diagram, explicit protocol registration, structured search strategy with reproducible queries, or reference to an established review methodology. Ad-hoc paper collection without a systematic protocol is NO." + }, + "quality_assessment_of_sources": { + "$ref": "#/$defs/checklist_item", + "description": "Does the survey assess the quality of its source papers? Look for: quality scoring rubric, risk-of-bias assessment, or structured evaluation of included studies. If the survey treats all papers equally regardless of methodological quality, NO. Leech et al. and the Trust AI Benchmarks paper both note that surveys without quality assessment launder weak results." + }, + "publication_bias_discussed": { + "$ref": "#/$defs/checklist_item", + "description": "Does the survey discuss publication bias? 
Look for: funnel plots, discussion of negative-result underrepresentation, acknowledgment that published papers skew positive, or tests for publication bias (Egger's test, trim-and-fill). If the survey does not consider whether its sources are biased toward positive results, NO." + } + } } } }, + "scan_version": { + "type": "integer", + "description": "Schema version. 1 = base 50 questions only. 2 = base + conditional modules. Omitted = 1.", + "default": 1 + }, + "active_modules": { + "type": "array", + "description": "Which conditional checklist modules were activated for this paper, based on methodology_tags. Empty or omitted for v1 scans.", + "items": { + "type": "string", + "enum": ["experimental_rigor", "data_leakage", "survey_methodology"] + } + }, "claims": { "type": "array", "description": "Key empirical claims extracted from the paper with supporting evidence.", @@ -488,16 +579,19 @@ "$defs": { "checklist_item": { "type": "object", - "required": ["answer", "justification"], + "required": ["applies", "answer", "justification"], "properties": { + "applies": { + "type": "boolean", + "description": "Does this criterion apply to this paper type? false = structurally inapplicable (e.g., human_studies questions for a benchmark paper). true = the criterion is applicable, even if the paper does not satisfy it." + }, "answer": { - "type": "string", - "enum": ["yes", "no", "na"], - "description": "yes = the paper satisfies this criterion. no = it does not. na = the criterion does not apply to this type of paper." + "type": "boolean", + "description": "Does the paper satisfy this criterion? Only meaningful when applies=true. Set to false when applies=false." }, "justification": { "type": "string", - "description": "1-3 sentences explaining the answer. Cite specific sections, pages, or quote the paper where possible. For NO answers, state what is missing. For NA answers, state why the criterion does not apply." + "description": "1-3 sentences explaining the answer. 
When applies=true: cite specific sections for answer=true, or state what is missing for answer=false. When applies=false: state why the criterion does not apply to this paper type." } } } diff --git a/schema/scan.schema.v1.json b/schema/scan.schema.v1.json @@ -0,0 +1,508 @@ +{ + "$schema": "https://json-schema.org/draft/2020-12/schema", + "$id": "scan.schema.json", + "title": "Paper Scan Result", + "description": "Structured output from the scan agent for a single research paper. Quality assessment uses boolean checklist questions (verifiable, auditable) rather than subjective scores.", + "type": "object", + "required": [ + "paper", + "checklist", + "claims", + "methodology_tags", + "key_findings", + "red_flags", + "cited_papers" + ], + "properties": { + "paper": { + "type": "object", + "description": "Paper metadata.", + "required": ["title", "authors", "year"], + "properties": { + "title": { "type": "string" }, + "authors": { + "type": "array", + "items": { "type": "string" } + }, + "year": { "type": "integer" }, + "venue": { "type": "string" }, + "arxiv_id": { "type": "string", "pattern": "^\\d{4}\\.\\d{4,5}$" }, + "doi": { "type": "string" } + } + }, + "checklist": { + "type": "object", + "description": "Boolean quality checklist. Each question has two boolean fields — 'applies' (is this relevant to this paper type?) and 'answer' (does the paper satisfy the criterion?) — plus justification text. 
This two-field design separates applicability from compliance, eliminating ambiguity in NA boundary decisions.", + "required": [ + "artifacts", + "statistical_methodology", + "evaluation_design", + "claims_and_evidence", + "setup_transparency", + "limitations_and_scope", + "data_integrity", + "conflicts_of_interest", + "contamination", + "human_studies", + "cost_and_practicality" + ], + "properties": { + "artifacts": { + "type": "object", + "description": "Can someone reproduce this work from what was released?", + "required": [ + "code_released", + "data_released", + "environment_specified", + "reproduction_instructions" + ], + "properties": { + "code_released": { + "$ref": "#/$defs/checklist_item", + "description": "Is source code released (e.g., GitHub link, Zenodo archive)? Look for: repository URLs in the paper, footnotes, or abstract. A promise of future release counts as NO. Code 'available upon request' counts as NO. Only YES if a working URL or archive is provided." + }, + "data_released": { + "$ref": "#/$defs/checklist_item", + "description": "Is the dataset released or publicly available? Look for: dataset download links, references to public datasets used (e.g., 'we use the publicly available SWE-bench dataset' = YES). If they collected proprietary data and did not release it, NO. If the data is a standard public benchmark they didn't modify, YES." + }, + "environment_specified": { + "$ref": "#/$defs/checklist_item", + "description": "Are environment or dependency specifications provided? Look for: requirements.txt, Dockerfile, conda environment file, or a detailed 'Environment Setup' section listing library versions. Mentioning 'Python 3.x' alone is NOT enough — there must be enough detail to recreate the environment." + }, + "reproduction_instructions": { + "$ref": "#/$defs/checklist_item", + "description": "Are step-by-step reproduction instructions included? 
Look for: a README with commands to run, a 'Reproducing Results' section, or scripts that replicate the main experiments. The instructions must be specific enough that a competent researcher could follow them without guessing." + } + } + }, + "statistical_methodology": { + "type": "object", + "description": "Are the numbers treated with appropriate rigor?", + "required": [ + "confidence_intervals_or_error_bars", + "significance_tests", + "effect_sizes_reported", + "sample_size_justified", + "variance_reported" + ], + "properties": { + "confidence_intervals_or_error_bars": { + "$ref": "#/$defs/checklist_item", + "description": "Are confidence intervals or error bars reported for main results? Look for: CI notation (e.g., '95% CI [x, y]'), error bars on figures, ± notation in tables. If the paper reports only point estimates (e.g., '43.2% accuracy') with no uncertainty, NO." + }, + "significance_tests": { + "$ref": "#/$defs/checklist_item", + "description": "Are statistical significance tests used where claims of difference are made? Look for: p-values, t-tests, Mann-Whitney U, chi-squared, ANOVA, bootstrap tests, permutation tests. If the paper claims 'X outperforms Y' based solely on comparing two numbers without any test, NO. NA if the paper makes no comparative claims." + }, + "effect_sizes_reported": { + "$ref": "#/$defs/checklist_item", + "description": "Are effect sizes reported (not just p-values or raw differences)? Look for: Cohen's d, odds ratios, relative risk, percentage improvement with baseline context. A paper that says 'p < 0.05' without indicating the magnitude of the effect is NO. A paper that says '12% improvement over baseline (from 45% to 57%)' provides enough context for YES." + }, + "sample_size_justified": { + "$ref": "#/$defs/checklist_item", + "description": "Is the sample size justified or is a power analysis discussed? 
Look for: explicit justification for why N participants/examples were chosen, power analysis, or acknowledgment that the sample may be too small for certain claims. If N is small and no justification is given, NO. NA for theoretical papers." + }, + "variance_reported": { + "$ref": "#/$defs/checklist_item", + "description": "Is variance or standard deviation reported across experimental runs? Look for: std dev in tables, variance across seeds, interquartile range, multiple-run results with spread measures. If the paper reports single-run numbers only, NO. If it explicitly states 'averaged over K runs with std dev' YES. Reporting medians across runs WITHOUT any spread measure (std dev, IQR, min/max range) is NO — the reader cannot assess result stability." + } + } + }, + "evaluation_design": { + "type": "object", + "description": "Is the evaluation designed to actually test the claims?", + "required": [ + "baselines_included", + "baselines_contemporary", + "ablation_study", + "multiple_metrics", + "human_evaluation", + "held_out_test_set", + "per_category_breakdown", + "failure_cases_discussed", + "negative_results_reported" + ], + "properties": { + "baselines_included": { + "$ref": "#/$defs/checklist_item", + "description": "Are baseline comparisons included? Look for: comparison against prior work, naive baselines, or ablated versions. A paper that only reports its own system's numbers with no comparison is NO. NA for papers that define a new task with no prior work." + }, + "baselines_contemporary": { + "$ref": "#/$defs/checklist_item", + "description": "Are the baselines contemporary and competitive? Look for: whether the baselines are recent and represent the state of the art, or whether they are suspiciously old/weak. If the newest baseline is 3+ years old when newer alternatives exist, NO. If the paper justifies why older baselines are appropriate, YES." 
+ }, + "ablation_study": { + "$ref": "#/$defs/checklist_item", + "description": "Is there an ablation study showing which components matter? Look for: experiments that remove or modify individual components to measure their contribution. NA if the system has only one component." + }, + "multiple_metrics": { + "$ref": "#/$defs/checklist_item", + "description": "Are multiple evaluation metrics used? Look for: at least two different metrics (e.g., accuracy AND F1, or Pass@1 AND Pass@10). If the paper reports only a single metric, NO." + }, + "human_evaluation": { + "$ref": "#/$defs/checklist_item", + "description": "Is human evaluation included (not just automated metrics)? Look for: human ratings, manual inspection, user studies, expert review of the system's OUTPUTS. The humans must be evaluating what the system produced — manual classification of the benchmark or dataset itself does not count. If evaluation of the system is entirely automated (e.g., pass/fail on test suites), NO. NA if human evaluation is clearly irrelevant to the claims." + }, + "held_out_test_set": { + "$ref": "#/$defs/checklist_item", + "description": "Are results reported on a held-out test set (not the dev/validation set used for tuning)? Look for: explicit separation of dev and test splits. If unclear whether the reported numbers are on data used for any selection decisions, NO." + }, + "per_category_breakdown": { + "$ref": "#/$defs/checklist_item", + "description": "Are per-category or per-task breakdowns provided (not just overall averages)? Look for: tables showing performance on individual tasks, categories, or splits. A single aggregate number hides important variation — if a system scores 80% overall but 20% on hard cases, the average is misleading." + }, + "failure_cases_discussed": { + "$ref": "#/$defs/checklist_item", + "description": "Are failure cases shown or discussed? Look for: error analysis, qualitative examples of failures, discussion of where the approach breaks down. 
If the paper only shows successes, NO." + }, + "negative_results_reported": { + "$ref": "#/$defs/checklist_item", + "description": "Are negative results reported (things that didn't work)? Look for: ablations that hurt performance, approaches that were tried and abandoned, configurations that failed. If every experiment shows improvement, be skeptical — NO unless the paper explicitly addresses this." + } + } + }, + "claims_and_evidence": { + "type": "object", + "description": "Do the claims stay within what the evidence supports?", + "required": [ + "abstract_claims_supported", + "causal_claims_justified", + "generalization_bounded", + "alternative_explanations_discussed" + ], + "properties": { + "abstract_claims_supported": { + "$ref": "#/$defs/checklist_item", + "description": "Are all claims in the abstract supported by results in the paper? Read the abstract and check each empirical claim against the results section. If the abstract says 'our method achieves state-of-the-art' but the results show it's second-best, NO. If the abstract hedges appropriately, YES." + }, + "causal_claims_justified": { + "$ref": "#/$defs/checklist_item", + "description": "If the paper makes causal claims, is the study design adequate for causal inference? Look for: RCT, natural experiment, instrumental variables, difference-in-differences, or other causal identification strategies. If the paper says 'X improves Y' from observational data without addressing confounds, NO. NA if no causal claims are made. Note: ablation studies ('removing component X reduces performance by Y%') ARE causal claims — check whether the ablation design is adequate (controlled single-variable manipulation counts as YES). Language like 'improves', 'causes', 'leads to', 'enables' signals causal claims." + }, + "generalization_bounded": { + "$ref": "#/$defs/checklist_item", + "description": "Are generalizations bounded to the tested setting? 
Look for: claims that extend beyond the tested models, languages, tasks, or populations. If the paper tests on Python and claims results for 'code generation' generally, NO. If it says 'on Python tasks with GPT-4' YES. Check the title and abstract — broad titles like 'LLM-based Software Engineering' when results are on a single benchmark in a single language is NO." + }, + "alternative_explanations_discussed": { + "$ref": "#/$defs/checklist_item", + "description": "Are alternative explanations for the results discussed? Look for: consideration of confounds, other factors that could explain the results, robustness checks. If the paper presents one interpretation without considering alternatives, NO. A threats-to-validity section counts only if it discusses specific alternative explanations for the observed results, not just generic methodological limitations. NA only for papers that present no empirical results (e.g., pure surveys or taxonomies)." + } + } + }, + "setup_transparency": { + "type": "object", + "description": "Is the experimental setup described well enough to understand what was actually tested?", + "required": [ + "model_versions_specified", + "prompts_provided", + "hyperparameters_reported", + "scaffolding_described", + "data_preprocessing_documented" + ], + "properties": { + "model_versions_specified": { + "$ref": "#/$defs/checklist_item", + "description": "Are exact model versions or sizes specified? Look for: specific model names with version (e.g., 'gpt-4-0613', 'Claude 3.5 Sonnet', 'Llama-2-70b-chat'). If the paper says just 'GPT-4' or 'Claude' without a version or snapshot date, NO — model behavior changes across versions. Marketing names like 'Gemini-2.5' or 'GPT-4o' without a snapshot date or API version do NOT count as specified versions." + }, + "prompts_provided": { + "$ref": "#/$defs/checklist_item", + "description": "Are the prompts or system instructions used in experiments provided? 
Look for: full prompt text in the paper or appendix, or a link to a repository containing prompts. If prompts are described only in natural language ('we asked the model to...') without the actual text, NO. A prompt TEMPLATE with placeholders (e.g., '[Task Description]') does NOT count unless the actual fill values are also provided — the reader must be able to reconstruct every prompt sent to the model. NA if the paper does not use prompting." + }, + "hyperparameters_reported": { + "$ref": "#/$defs/checklist_item", + "description": "Are hyperparameters reported (temperature, top-p, max tokens, learning rate, etc.)? Look for: a hyperparameters table or section. If the paper uses an LLM API without stating temperature/sampling settings, NO — these significantly affect output." + }, + "scaffolding_described": { + "$ref": "#/$defs/checklist_item", + "description": "If the approach uses agentic scaffolding, is it described in detail? Look for: tool descriptions, workflow diagrams, retry logic, feedback mechanisms, memory/context management. If the paper says 'we used an agent' without describing the scaffold, NO. NA if no scaffolding is used. Also NA if the paper evaluates third-party tools (e.g., Claude Code, Copilot) as black boxes — the authors cannot be expected to describe internal scaffolding they have no access to." + }, + "data_preprocessing_documented": { + "$ref": "#/$defs/checklist_item", + "description": "Are data preprocessing and filtering steps documented? Look for: how raw data was cleaned, filtered, or transformed before use. If the paper goes from 'we collected data' to 'here are the results' without describing intermediate processing, NO. For survey papers: describing the filtering pipeline stages with counts (e.g., '500 initial results → 200 after title screening → 80 after full-text review') is YES only if the actual filtering CRITERIA at each stage are also stated. Listing stages without criteria is NO." 
+ } + } + }, + "limitations_and_scope": { + "type": "object", + "description": "Does the paper honestly discuss what it does not show?", + "required": [ + "limitations_section_present", + "threats_to_validity_specific", + "scope_boundaries_stated" + ], + "properties": { + "limitations_section_present": { + "$ref": "#/$defs/checklist_item", + "description": "Is there a limitations or threats-to-validity section? Look for: a dedicated section or subsection titled 'Limitations', 'Threats to Validity', or similar. A single sentence buried in the conclusion does not count — there must be substantive discussion." + }, + "threats_to_validity_specific": { + "$ref": "#/$defs/checklist_item", + "description": "Are specific threats to validity discussed (not just boilerplate)? Look for: threats that are specific to THIS study, not generic disclaimers like 'our results may not generalize.' Good: 'Our sample of 16 developers is too small for subgroup analysis.' Bad: 'More research is needed.' If the limitations are all generic, NO." + }, + "scope_boundaries_stated": { + "$ref": "#/$defs/checklist_item", + "description": "Are scope boundaries explicitly stated (what the results do NOT show)? Look for: explicit statements about what was not tested, what populations/settings are excluded, what claims the authors are NOT making. The METR paper's Table 2 ('What the evidence does not show') is the gold standard. Generic limitations like 'our results may not generalize' do NOT count — the paper must state specific things it did NOT test or claim." + } + } + }, + "data_integrity": { + "type": "object", + "description": "Can the underlying data be verified? 
Inspired by cases like the Wakefield MMR paper where fabricated data went undetected for 12 years because no one could check it.", + "required": [ + "raw_data_available", + "data_collection_described", + "recruitment_methods_described", + "data_pipeline_documented" + ], + "properties": { + "raw_data_available": { + "$ref": "#/$defs/checklist_item", + "description": "Is raw data available for independent verification? Look for: data downloads, supplementary data files, database access. If only processed/aggregated results are shown with no way to verify the underlying data, NO. This is the check that would have caught Wakefield — if the raw medical records had been available, fabrication would have been detected immediately." + }, + "data_collection_described": { + "$ref": "#/$defs/checklist_item", + "description": "Is the data collection procedure described in detail? Look for: how data was gathered, what instruments were used, what time period, what inclusion/exclusion criteria. If the paper says 'we collected N examples' without explaining how, NO." + }, + "recruitment_methods_described": { + "$ref": "#/$defs/checklist_item", + "description": "Are participant or sample recruitment methods described? Look for: how participants were found, what channels were used, whether recruitment could introduce bias. Wakefield recruited through anti-vaccine activists, biasing the sample. If participants/samples were selected without description of the selection process, NO. For crowd-sourced events (competitions, red-teaming), simply stating 'we ran a competition' is not enough — describe how participants were recruited and whether this introduces selection bias. NA if no human participants and data source is a standard benchmark." + }, + "data_pipeline_documented": { + "$ref": "#/$defs/checklist_item", + "description": "Is the full data pipeline from collection to final analysis documented? 
Look for: each transformation step, filtering criteria and how many examples were removed at each stage, any manual annotation steps. If there are unexplained jumps (e.g., 'we started with 1000 examples' then results show 500 with no explanation), NO." + } + } + }, + "conflicts_of_interest": { + "type": "object", + "description": "Are potential biases from funding, affiliation, or financial interest disclosed?", + "required": [ + "funding_disclosed", + "affiliations_disclosed", + "funder_independent_of_outcome", + "financial_interests_declared" + ], + "properties": { + "funding_disclosed": { + "$ref": "#/$defs/checklist_item", + "description": "Is the funding source disclosed? Look for: an acknowledgments section listing grants, corporate sponsors, or funding agencies. If there is no mention of funding at all, NO. NA only if it's clearly unfunded work (e.g., a solo independent researcher)." + }, + "affiliations_disclosed": { + "$ref": "#/$defs/checklist_item", + "description": "Are author affiliations with the evaluated product or company disclosed? Look for: authors who work at the company whose product is being tested. If Google employees evaluate Gemini, or OpenAI employees evaluate GPT, this must be prominent. If affiliations are listed but the conflict is not explicitly acknowledged, still YES for this question (the conflict-of-interest flag is separate)." + }, + "funder_independent_of_outcome": { + "$ref": "#/$defs/checklist_item", + "description": "Is the funder independent of the outcome? Look for: whether the entity paying for the research has a financial interest in a particular result. Wakefield was secretly paid by lawyers suing vaccine makers. A paper funded by OpenAI evaluating GPT-4 has a non-independent funder. YES if the funder has no stake in the results, NO if they do, NA if unfunded." 
+ }, + "financial_interests_declared": { + "$ref": "#/$defs/checklist_item", + "description": "Do any authors hold patents, equity, or other financial interests related to the findings? Look for: competing interests statements, patent disclosures, author-affiliated startups. If there is no competing interests statement at all, NO — absence of disclosure is not the same as absence of conflict." + } + } + }, + "contamination": { + "type": "object", + "description": "Could the model have seen the test data during training? This is the 'you measured it wrong' problem — if the benchmark is in the training data, the results are meaningless.", + "required": [ + "training_cutoff_stated", + "train_test_overlap_discussed", + "benchmark_contamination_addressed" + ], + "properties": { + "training_cutoff_stated": { + "$ref": "#/$defs/checklist_item", + "description": "Is the model's training data cutoff date stated? Look for: explicit mention of when the training data ends. This is necessary to assess whether test examples could have been in the training set. If the paper uses a model without stating when its training data was collected, NO. NA if the paper does not evaluate a pre-trained model's capability on any benchmark (e.g., mining studies, interview studies, surveys, or studies that test defenses/tools rather than model knowledge)." + }, + "train_test_overlap_discussed": { + "$ref": "#/$defs/checklist_item", + "description": "Is potential train/test overlap discussed? Look for: any analysis of whether test examples appeared in the training data. Canary strings, membership inference, or temporal splits all count. If the paper uses a public benchmark with a model that could have trained on it and doesn't address this, NO. NA if the paper does not evaluate a pre-trained model on any benchmark (same NA rule as training_cutoff_stated)." 
+            },
+            "benchmark_contamination_addressed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Were benchmark examples available online before the model's training cutoff? Look for: whether the benchmark was published before the model's training data was collected. HumanEval was published in 2021; any model trained after 2021 may have seen it. If the paper uses such a benchmark without discussing contamination risk, NO. NA if using a benchmark created after the model's training cutoff, OR if the paper does not evaluate a pre-trained model on any benchmark (same NA rule as training_cutoff_stated)."
+            }
+          }
+        },
+        "human_studies": {
+          "type": "object",
+          "description": "For papers involving human participants. All items NA if the paper has no human subjects.",
+          "required": [
+            "pre_registered",
+            "irb_or_ethics_approval",
+            "demographics_reported",
+            "inclusion_exclusion_criteria",
+            "randomization_described",
+            "blinding_described",
+            "attrition_reported"
+          ],
+          "properties": {
+            "pre_registered": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the study pre-registered? Look for: a link to a pre-registration (OSF, AsPredicted, ClinicalTrials.gov, AEA registry). Pre-registration commits the researchers to their analysis plan before seeing the data, preventing p-hacking and outcome switching. Very rare in CS but standard in medicine. NA if no human participants. Mining public repositories or analyzing public data does NOT make participants — use NA."
+            },
+            "irb_or_ethics_approval": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is IRB or ethics board approval mentioned? Look for: 'This study was approved by [institution] IRB' or equivalent. If the study collects data from human participants without mentioning ethics review, NO. NA if no human participants."
+            },
+            "demographics_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are participant demographics reported? Look for: experience level, years of experience, gender, geographic distribution, programming languages known, company size. If the paper says 'N developers' without characterizing them, NO. NA if no human participants."
+            },
+            "inclusion_exclusion_criteria": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are inclusion and exclusion criteria for participants stated? Look for: who was eligible, who was excluded and why, any screening process. If participants just 'were recruited' with no selection criteria, NO. NA if no human participants."
+            },
+            "randomization_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the randomization procedure described (if applicable)? Look for: how participants were assigned to conditions, whether randomization was stratified, what tool was used. If the paper compares treatment vs. control without explaining how assignment worked, NO. NA if not an experimental study (e.g., cross-sectional surveys, observational studies, repository mining) or no human participants."
+            },
+            "blinding_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is blinding described (if applicable)? Look for: whether participants knew which condition they were in, whether evaluators knew which outputs came from which system. If applicable and not mentioned, NO. NA if blinding is not feasible, no human participants, or not an experimental study (e.g., cross-sectional surveys, observational studies)."
+            },
+            "attrition_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is participant attrition or dropout reported? Look for: how many participants started vs. finished, reasons for dropout, whether intention-to-treat analysis was used. If participants are mentioned at the start but the final N is smaller with no explanation, NO. NA if no human participants."
+            }
+          }
+        },
+        "cost_and_practicality": {
+          "type": "object",
+          "description": "Is the practical cost of the approach reported? Important for assessing real-world applicability.",
+          "required": [
+            "inference_cost_reported",
+            "compute_budget_stated"
+          ],
+          "properties": {
+            "inference_cost_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is inference cost or latency reported? Look for: API costs, tokens consumed, wall-clock time, cost per example. If the paper proposes a method that calls GPT-4 100 times per example without mentioning cost, NO. NA if cost is clearly irrelevant (e.g., theoretical paper, survey paper). If a survey reports costs of systems it reviews, that does NOT count — this question asks about the cost of the paper's own method."
+            },
+            "compute_budget_stated": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the total computational budget stated? Look for: GPU hours, total API spend, hardware used, training time. If the approach required significant compute and this is not quantified, NO."
+            }
+          }
+        }
+      }
+    },
+    "claims": {
+      "type": "array",
+      "description": "Key empirical claims extracted from the paper with supporting evidence.",
+      "items": {
+        "type": "object",
+        "required": ["claim", "evidence", "supported"],
+        "properties": {
+          "claim": {
+            "type": "string",
+            "description": "The claim as stated or paraphrased from the paper."
+          },
+          "evidence": {
+            "type": "string",
+            "description": "The evidence cited in support, with page/section references."
+          },
+          "supported": {
+            "type": "string",
+            "enum": ["strong", "moderate", "weak", "unsupported"],
+            "description": "How well the evidence supports the claim."
+          }
+        }
+      }
+    },
+    "methodology_tags": {
+      "type": "array",
+      "description": "Methodology type tags assigned by the scan agent.",
+      "items": {
+        "type": "string",
+        "enum": [
+          "rct",
+          "observational",
+          "benchmark-eval",
+          "case-study",
+          "meta-analysis",
+          "theoretical",
+          "qualitative"
+        ]
+      }
+    },
+    "key_findings": {
+      "type": "string",
+      "description": "Brief summary of the paper's key findings (2-4 sentences)."
+    },
+    "red_flags": {
+      "type": "array",
+      "description": "Methodological red flags identified during the scan.",
+      "items": {
+        "type": "object",
+        "required": ["flag", "detail"],
+        "properties": {
+          "flag": {
+            "type": "string",
+            "description": "Short label for the red flag."
+          },
+          "detail": {
+            "type": "string",
+            "description": "Explanation of why this is a concern."
+          }
+        }
+      }
+    },
+    "cited_papers": {
+      "type": "array",
+      "description": "Papers cited in this paper that are relevant to the survey scope. Used for citation-chasing: these become candidates for the registry.",
+      "items": {
+        "type": "object",
+        "required": ["title", "relevance"],
+        "properties": {
+          "title": {
+            "type": "string",
+            "description": "Title of the cited paper as it appears in the references."
+          },
+          "authors": {
+            "type": "array",
+            "items": { "type": "string" },
+            "description": "Author names if available from the reference."
+          },
+          "year": {
+            "type": "integer",
+            "description": "Publication year if available."
+          },
+          "arxiv_id": {
+            "type": "string",
+            "pattern": "^\\d{4}\\.\\d{4,5}$",
+            "description": "arXiv ID if available."
+          },
+          "doi": {
+            "type": "string",
+            "description": "DOI if available."
+          },
+          "relevance": {
+            "type": "string",
+            "description": "Why this cited paper is relevant to the survey (1 sentence)."
+          }
+        }
+      }
+    }
+  },
+  "$defs": {
+    "checklist_item": {
+      "type": "object",
+      "required": ["applies", "answer", "justification"],
+      "properties": {
+        "applies": {
+          "type": "boolean",
+          "description": "Does this criterion apply to this paper type? false = structurally inapplicable (e.g., human_studies questions for a benchmark paper). true = the criterion is applicable, even if the paper does not satisfy it."
+        },
+        "answer": {
+          "type": "boolean",
+          "description": "Does the paper satisfy this criterion? Only meaningful when applies=true. Set to false when applies=false."
+        },
+        "justification": {
+          "type": "string",
+          "description": "1-3 sentences explaining the answer. When applies=true: cite specific sections for answer=true, or state what is missing for answer=false. When applies=false: state why the criterion does not apply to this paper type."
+        }
+      }
+    }
+  }
+}
diff --git a/scripts/build-citation-graph.py b/scripts/build-citation-graph.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env python3
+"""
+Build a citation graph from cited_papers in all scan.json files.
+
+Matches cited papers against registry entries by title (case-insensitive),
+arxiv_id, or doi. Outputs analysis/citation-graph.json with:
+- nodes (all papers in registry that have been scanned)
+- edges (citing → cited relationships)
+- most_cited (top papers by incoming citation count)
+- connected_components (groups of papers linked by citations)
+
+Usage:
+    python scripts/build-citation-graph.py
+"""
+
+import json
+import re
+from collections import defaultdict
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+PAPERS_DIR = ROOT / "papers"
+REGISTRY = ROOT / "registry.jsonl"
+OUTPUT = ROOT / "analysis" / "citation-graph.json"
+
+
+def normalize_title(title):
+    """Normalize title for fuzzy matching."""
+    return re.sub(r'[^a-z0-9\s]', '', title.lower()).strip()
+
+
+def load_registry():
+    """Load registry and build lookup indices."""
+    entries = []
+    by_title = {}
+    by_arxiv = {}
+    by_doi = {}
+
+    with open(REGISTRY) as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            entry = json.loads(line)
+            entries.append(entry)
+            slug = entry["id"]
+
+            title = entry.get("title", "")
+            if title:
+                by_title[normalize_title(title)] = slug
+
+            arxiv_id = entry.get("arxiv_id", "")
+            if arxiv_id:
+                by_arxiv[arxiv_id] = slug
+
+            doi = entry.get("doi", "")
+            if doi:
+                by_doi[doi.lower()] = slug
+
+    return entries, by_title, by_arxiv, by_doi
+
+
+def find_connected_components(adjacency, all_nodes):
+    """Find connected components in an undirected graph."""
+    visited = set()
+    components = []
+
+    def dfs(node, component):
+        visited.add(node)
+        component.append(node)
+        for neighbor in adjacency.get(node, []):
+            if neighbor not in visited:
+                dfs(neighbor, component)
+
+    for node in all_nodes:
+        if node not in visited:
+            component = []
+            dfs(node, component)
+            components.append(sorted(component))
+
+    return components
+
+
+def main():
+    entries, by_title, by_arxiv, by_doi = load_registry()
+
+    nodes = []
+    edges = []
+    incoming_count = defaultdict(int)
+
+    # Build undirected adjacency for connected components
+    adjacency = defaultdict(set)
+    scanned_slugs = set()
+
+    for scan_path in sorted(PAPERS_DIR.glob("*/scan.json")):
+        slug = scan_path.parent.name
+        try:
+            data = json.loads(scan_path.read_text())
+        except (json.JSONDecodeError, FileNotFoundError):
+            continue
+
+        scanned_slugs.add(slug)
+        title = data.get("paper", {}).get("title", slug)
+        nodes.append({"id": slug, "title": title})
+
+        cited = data.get("cited_papers", [])
+        for ref in cited:
+            target = None
+
+            # Match by arxiv_id
+            arxiv_id = ref.get("arxiv_id", "")
+            if arxiv_id and arxiv_id in by_arxiv:
+                target = by_arxiv[arxiv_id]
+
+            # Match by doi
+            if not target:
+                doi = ref.get("doi", "")
+                if doi and doi.lower() in by_doi:
+                    target = by_doi[doi.lower()]
+
+            # Match by title
+            if not target:
+                ref_title = ref.get("title", "")
+                if ref_title:
+                    norm = normalize_title(ref_title)
+                    if norm in by_title:
+                        target = by_title[norm]
+
+            if target and target != slug:
+                edges.append({"source": slug, "target": target})
+                incoming_count[target] += 1
+                adjacency[slug].add(target)
+                adjacency[target].add(slug)
+
+    # Most cited
+    most_cited = sorted(incoming_count.items(), key=lambda x: -x[1])[:30]
+    most_cited = [{"slug": slug, "incoming_citations": count} for slug, count in most_cited]
+
+    # Connected components (only among scanned papers that appear in edges)
+    edge_nodes = set()
+    for e in edges:
+        edge_nodes.add(e["source"])
+        edge_nodes.add(e["target"])
+    components = find_connected_components(adjacency, edge_nodes)
+    # Sort by size descending
+    components.sort(key=len, reverse=True)
+
+    result = {
+        "node_count": len(nodes),
+        "edge_count": len(edges),
+        "nodes": nodes,
+        "edges": edges,
+        "most_cited": most_cited,
+        "connected_components": {
+            "count": len(components),
+            "largest_size": len(components[0]) if components else 0,
+            "components": components[:20],  # Top 20 by size
+        },
+    }
+
+    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
+    OUTPUT.write_text(json.dumps(result, indent=2, ensure_ascii=False) + "\n")
+    print(f"Citation graph written to {OUTPUT}")
+    print(f" Nodes: {len(nodes)}")
+    print(f" Edges: {len(edges)}")
+    print(f" Components: {len(components)}")
+    if most_cited:
+        print(f" Most cited: {most_cited[0]['slug']} ({most_cited[0]['incoming_citations']} citations)")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/claim.py b/scripts/claim.py
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 """
 Paper claim system for parallel scanning. Prevents two agents from
-working on the same paper. Claims expire after 10 minutes.
+working on the same paper. Claims expire after 1 hour.

 Claims are stored as empty files: papers/<slug>/.claimed_<timestamp>

@@ -9,6 +9,8 @@ Usage:
     python scripts/claim.py list # List unclaimed papers ready to scan
     python scripts/claim.py list --limit 10 # First 10 unclaimed
     python scripts/claim.py take <slug> # Claim a paper (prints "ok" or "taken")
+    python scripts/claim.py take-next # Atomically list + claim next available (prints slug or "none")
+    python scripts/claim.py take-next --limit 5 # Claim next from first 5 available
     python scripts/claim.py done <slug> # Mark scan complete, remove claim
     python scripts/claim.py fail <slug> # Release claim without completing
     python scripts/claim.py status # Show claim summary
@@ -22,7 +24,7 @@ ROOT = Path(__file__).resolve().parent.parent
 PAPERS_DIR = ROOT / "papers"

 CLAIM_PREFIX = ".claimed_"
-CLAIM_EXPIRY_SECONDS = 600 # 10 minutes
+CLAIM_EXPIRY_SECONDS = 3600 # 1 hour


 def get_claim_file(slug):
@@ -106,10 +108,26 @@ def status():
     print(f" Available: {total_unclaimed}")


+def take_next(limit=None):
+    """Atomically find the next unclaimed paper and claim it. Returns slug or None."""
+    for txt in sorted(PAPERS_DIR.glob("*/paper.txt")):
+        slug = txt.parent.name
+        scan = txt.parent / "scan.json"
+        if scan.exists():
+            continue
+        if is_claimed(slug):
+            continue
+        if claim(slug):
+            return slug
+        # Race condition: another agent claimed it between check and claim
+        continue
+    return None
+
+
 def main():
     args = sys.argv[1:]
     if not args:
-        print("Usage: python scripts/claim.py [list|take|done|fail|status]")
+        print("Usage: python scripts/claim.py [list|take|take-next|done|fail|status]")
         sys.exit(1)

     cmd = args[0]
@@ -134,6 +152,18 @@ def main():
             print("taken")
             sys.exit(1)

+    elif cmd == "take-next":
+        limit = None
+        for i, arg in enumerate(args):
+            if arg == "--limit" and i + 1 < len(args):
+                limit = int(args[i + 1])
+        slug = take_next(limit)
+        if slug:
+            print(slug)
+        else:
+            print("none")
+            sys.exit(1)
+
     elif cmd == "done":
         if len(args) < 2:
             print("Usage: python scripts/claim.py done <slug>")
diff --git a/scripts/enrich-metadata.py b/scripts/enrich-metadata.py
@@ -0,0 +1,174 @@
+#!/usr/bin/env python3
+"""
+Enrich paper metadata from Semantic Scholar API.
+
+Queries by DOI or title, writes papers/<slug>/metadata.json with:
+author_count, affiliations, venue, citation_count, is_open_access, fields_of_study.
+
+Usage:
+    python scripts/enrich-metadata.py # All papers with scan.json but no metadata.json
+    python scripts/enrich-metadata.py --limit 5 # First 5
+    python scripts/enrich-metadata.py --slug <slug> # Single paper
+
+Rate-limited to 1 request per second (S2 API limit for unauthenticated).
+"""
+
+import json
+import sys
+import time
+import urllib.request
+import urllib.parse
+import urllib.error
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+PAPERS_DIR = ROOT / "papers"
+REGISTRY = ROOT / "registry.jsonl"
+
+S2_API = "https://api.semanticscholar.org/graph/v1"
+S2_FIELDS = "title,authors,venue,year,citationCount,isOpenAccess,fieldsOfStudy,externalIds"
+
+# Rate limit: 1 request per second
+RATE_LIMIT = 1.0
+
+
+def load_registry():
+    """Load registry as dict keyed by slug."""
+    registry = {}
+    with open(REGISTRY) as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            entry = json.loads(line)
+            registry[entry["id"]] = entry
+    return registry
+
+
+def query_s2_by_doi(doi):
+    """Query Semantic Scholar by DOI."""
+    url = f"{S2_API}/paper/DOI:{urllib.parse.quote(doi, safe='')}?fields={S2_FIELDS}"
+    return _fetch_json(url)
+
+
+def query_s2_by_title(title):
+    """Query Semantic Scholar by title search."""
+    params = urllib.parse.urlencode({"query": title, "limit": 1, "fields": S2_FIELDS})
+    url = f"{S2_API}/paper/search?{params}"
+    result = _fetch_json(url)
+    if result and result.get("data") and len(result["data"]) > 0:
+        return result["data"][0]
+    return None
+
+
+def _fetch_json(url):
+    """Fetch JSON from URL, return None on error."""
+    try:
+        req = urllib.request.Request(url, headers={"User-Agent": "ai-research-survey/1.0"})
+        with urllib.request.urlopen(req, timeout=15) as resp:
+            return json.loads(resp.read())
+    except (urllib.error.HTTPError, urllib.error.URLError, json.JSONDecodeError, TimeoutError) as e:
+        print(f" API error: {e}", file=sys.stderr)
+        return None
+
+
+def extract_metadata(s2_data):
+    """Extract structured metadata from S2 API response."""
+    if not s2_data:
+        return None
+
+    authors = s2_data.get("authors", [])
+    affiliations = []
+    for a in authors:
+        if a.get("affiliations"):
+            affiliations.extend(a["affiliations"])
+
+    return {
+        "author_count": len(authors),
+        "affiliations": list(set(affiliations)) if affiliations else [],
+        "venue": s2_data.get("venue", ""),
+        "citation_count": s2_data.get("citationCount", 0),
+        "is_open_access": s2_data.get("isOpenAccess", False),
+        "fields_of_study": s2_data.get("fieldsOfStudy") or [],
+        "s2_paper_id": s2_data.get("paperId", ""),
+    }
+
+
+def main():
+    args = sys.argv[1:]
+    limit = None
+    target_slug = None
+
+    i = 0
+    while i < len(args):
+        if args[i] == "--limit" and i + 1 < len(args):
+            limit = int(args[i + 1])
+            i += 2
+        elif args[i] == "--slug" and i + 1 < len(args):
+            target_slug = args[i + 1]
+            i += 2
+        else:
+            i += 1
+
+    registry = load_registry()
+
+    # Find papers to enrich
+    if target_slug:
+        slugs = [target_slug]
+    else:
+        slugs = []
+        for scan_path in sorted(PAPERS_DIR.glob("*/scan.json")):
+            slug = scan_path.parent.name
+            meta_path = scan_path.parent / "metadata.json"
+            if meta_path.exists():
+                continue
+            slugs.append(slug)
+            if limit and len(slugs) >= limit:
+                break
+
+    print(f"Enriching {len(slugs)} papers...")
+    success = 0
+    failed = 0
+
+    for slug in slugs:
+        entry = registry.get(slug, {})
+        doi = entry.get("doi", "")
+        title = entry.get("title", "")
+
+        # Also try reading title from scan.json
+        scan_path = PAPERS_DIR / slug / "scan.json"
+        if scan_path.exists() and not title:
+            try:
+                scan_data = json.loads(scan_path.read_text())
+                title = scan_data.get("paper", {}).get("title", "")
+                if not doi:
+                    doi = scan_data.get("paper", {}).get("doi", "")
+            except (json.JSONDecodeError, KeyError):
+                pass
+
+        print(f" {slug}...", end=" ", flush=True)
+
+        s2_data = None
+        if doi:
+            s2_data = query_s2_by_doi(doi)
+            time.sleep(RATE_LIMIT)
+
+        if not s2_data and title:
+            s2_data = query_s2_by_title(title)
+            time.sleep(RATE_LIMIT)
+
+        metadata = extract_metadata(s2_data)
+        if metadata:
+            meta_path = PAPERS_DIR / slug / "metadata.json"
+            meta_path.write_text(json.dumps(metadata, indent=2, ensure_ascii=False) + "\n")
+            print(f"OK (citations: {metadata['citation_count']})")
+            success += 1
+        else:
+            print("not found")
+            failed += 1
+
+    print(f"\nDone: {success} enriched, {failed} not found")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/validate-scan.py b/scripts/validate-scan.py
@@ -0,0 +1,217 @@
+#!/usr/bin/env python3
+"""
+Validate scan.json files against the scan schema.
+
+Usage:
+    python scripts/validate-scan.py papers/<slug>/scan.json # Validate one file
+    python scripts/validate-scan.py --all # Validate all scan.json files
+    python scripts/validate-scan.py --all --quiet # Only print failures
+
+Exit 0 if all valid, exit 1 if any failures.
+"""
+
+import json
+import sys
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+SCHEMA_PATH = ROOT / "schema" / "scan.schema.json"
+PAPERS_DIR = ROOT / "papers"
+
+# Base 50 questions: category -> list of required question keys
+BASE_QUESTIONS = {
+    "artifacts": ["code_released", "data_released", "environment_specified", "reproduction_instructions"],
+    "statistical_methodology": ["confidence_intervals_or_error_bars", "significance_tests", "effect_sizes_reported", "sample_size_justified", "variance_reported"],
+    "evaluation_design": ["baselines_included", "baselines_contemporary", "ablation_study", "multiple_metrics", "human_evaluation", "held_out_test_set", "per_category_breakdown", "failure_cases_discussed", "negative_results_reported"],
+    "claims_and_evidence": ["abstract_claims_supported", "causal_claims_justified", "generalization_bounded", "alternative_explanations_discussed"],
+    "setup_transparency": ["model_versions_specified", "prompts_provided", "hyperparameters_reported", "scaffolding_described", "data_preprocessing_documented"],
+    "limitations_and_scope": ["limitations_section_present", "threats_to_validity_specific", "scope_boundaries_stated"],
+    "data_integrity": ["raw_data_available", "data_collection_described", "recruitment_methods_described", "data_pipeline_documented"],
+    "conflicts_of_interest": ["funding_disclosed", "affiliations_disclosed", "funder_independent_of_outcome", "financial_interests_declared"],
+    "contamination": ["training_cutoff_stated", "train_test_overlap_discussed", "benchmark_contamination_addressed"],
+    "human_studies": ["pre_registered", "irb_or_ethics_approval", "demographics_reported", "inclusion_exclusion_criteria", "randomization_described", "blinding_described", "attrition_reported"],
+    "cost_and_practicality": ["inference_cost_reported", "compute_budget_stated"],
+}
+
+# V2 conditional categories (optional)
+CONDITIONAL_QUESTIONS = {
+    "experimental_rigor": ["seed_sensitivity_reported", "number_of_runs_stated", "hyperparameter_search_budget", "best_config_selection_justified", "multiple_comparison_correction", "self_comparison_bias_addressed", "compute_budget_vs_performance", "benchmark_construct_validity"],
+    "data_leakage": ["temporal_leakage_addressed", "feature_leakage_addressed", "non_independence_addressed", "leakage_detection_method"],
+    "survey_methodology": ["prisma_or_structured_protocol", "quality_assessment_of_sources", "publication_bias_discussed"],
+}
+
+VALID_METHODOLOGY_TAGS = {"rct", "observational", "benchmark-eval", "case-study", "meta-analysis", "theoretical", "qualitative"}
+VALID_SUPPORT_LEVELS = {"strong", "moderate", "weak", "unsupported"}
+
+
+def validate_checklist_item(item, path):
+    """Validate a single checklist item. Returns list of error strings."""
+    errors = []
+    if not isinstance(item, dict):
+        return [f"{path}: not an object"]
+
+    for field in ("applies", "answer", "justification"):
+        if field not in item:
+            errors.append(f"{path}: missing '{field}'")
+
+    if "applies" in item and not isinstance(item["applies"], bool):
+        errors.append(f"{path}.applies: not a boolean")
+    if "answer" in item and not isinstance(item["answer"], bool):
+        errors.append(f"{path}.answer: not a boolean")
+    if "justification" in item:
+        if not isinstance(item["justification"], str):
+            errors.append(f"{path}.justification: not a string")
+        elif len(item["justification"].strip()) == 0:
+            errors.append(f"{path}.justification: empty")
+
+    # Constraint: applies=false implies answer=false
+    if item.get("applies") is False and item.get("answer") is True:
+        errors.append(f"{path}: applies=false but answer=true (invalid)")
+
+    return errors
+
+
+def validate_scan(data, filepath=""):
+    """Validate a scan.json object. Returns list of error strings."""
+    errors = []
+    prefix = filepath + ": " if filepath else ""
+
+    # Top-level required fields
+    for field in ("paper", "checklist", "claims", "methodology_tags", "key_findings", "red_flags", "cited_papers"):
+        if field not in data:
+            errors.append(f"{prefix}missing required field '{field}'")
+
+    # Paper metadata
+    paper = data.get("paper", {})
+    for field in ("title", "authors", "year"):
+        if field not in paper:
+            errors.append(f"{prefix}paper: missing '{field}'")
+
+    # Checklist — base categories (required)
+    checklist = data.get("checklist", {})
+    for cat, questions in BASE_QUESTIONS.items():
+        if cat not in checklist:
+            errors.append(f"{prefix}checklist: missing category '{cat}'")
+            continue
+        cat_obj = checklist[cat]
+        if not isinstance(cat_obj, dict):
+            errors.append(f"{prefix}checklist.{cat}: not an object")
+            continue
+        for q in questions:
+            if q not in cat_obj:
+                errors.append(f"{prefix}checklist.{cat}: missing question '{q}'")
+            else:
+                errors.extend(validate_checklist_item(cat_obj[q], f"{prefix}checklist.{cat}.{q}"))
+
+    # Checklist — conditional categories (validate if present)
+    for cat, questions in CONDITIONAL_QUESTIONS.items():
+        if cat not in checklist:
+            continue
+        cat_obj = checklist[cat]
+        if not isinstance(cat_obj, dict):
+            errors.append(f"{prefix}checklist.{cat}: not an object")
+            continue
+        for q in questions:
+            if q not in cat_obj:
+                errors.append(f"{prefix}checklist.{cat}: missing question '{q}' (category present but incomplete)")
+            else:
+                errors.extend(validate_checklist_item(cat_obj[q], f"{prefix}checklist.{cat}.{q}"))
+
+    # Methodology tags
+    tags = data.get("methodology_tags", [])
+    if not isinstance(tags, list):
+        errors.append(f"{prefix}methodology_tags: not an array")
+    else:
+        for tag in tags:
+            if tag not in VALID_METHODOLOGY_TAGS:
+                errors.append(f"{prefix}methodology_tags: invalid tag '{tag}'")
+
+    # Claims
+    claims = data.get("claims", [])
+    if not isinstance(claims, list):
+        errors.append(f"{prefix}claims: not an array")
+    else:
+        for i, c in enumerate(claims):
+            for field in ("claim", "evidence", "supported"):
+                if field not in c:
+                    errors.append(f"{prefix}claims[{i}]: missing '{field}'")
+            if c.get("supported") and c["supported"] not in VALID_SUPPORT_LEVELS:
+                errors.append(f"{prefix}claims[{i}].supported: invalid value '{c['supported']}'")
+
+    # key_findings
+    if "key_findings" in data and not isinstance(data["key_findings"], str):
+        errors.append(f"{prefix}key_findings: not a string")
+
+    # red_flags
+    flags = data.get("red_flags", [])
+    if not isinstance(flags, list):
+        errors.append(f"{prefix}red_flags: not an array")
+    else:
+        for i, f in enumerate(flags):
+            for field in ("flag", "detail"):
+                if field not in f:
+                    errors.append(f"{prefix}red_flags[{i}]: missing '{field}'")
+
+    # cited_papers
+    cited = data.get("cited_papers", [])
+    if not isinstance(cited, list):
+        errors.append(f"{prefix}cited_papers: not an array")
+    else:
+        for i, c in enumerate(cited):
+            for field in ("title", "relevance"):
+                if field not in c:
+                    errors.append(f"{prefix}cited_papers[{i}]: missing '{field}'")
+
+    # scan_version (optional, integer if present)
+    if "scan_version" in data:
+        if not isinstance(data["scan_version"], int):
+            errors.append(f"{prefix}scan_version: not an integer")
+
+    # active_modules (optional, array of strings if present)
+    if "active_modules" in data:
+        if not isinstance(data["active_modules"], list):
+            errors.append(f"{prefix}active_modules: not an array")
+
+    return errors
+
+
+def main():
+    args = sys.argv[1:]
+    quiet = "--quiet" in args
+    args = [a for a in args if a != "--quiet"]
+
+    if not args:
+        print("Usage: python scripts/validate-scan.py <path> | --all [--quiet]")
+        sys.exit(1)
+
+    if args[0] == "--all":
+        files = sorted(PAPERS_DIR.glob("*/scan.json"))
+    else:
+        files = [Path(a) for a in args if not a.startswith("--")]
+
+    total = 0
+    failed = 0
+    for f in files:
+        total += 1
+        try:
+            data = json.loads(f.read_text())
+        except (json.JSONDecodeError, FileNotFoundError) as e:
+            print(f"FAIL {f}: {e}")
+            failed += 1
+            continue
+
+        errors = validate_scan(data, str(f))
+        if errors:
+            failed += 1
+            print(f"FAIL {f}:")
+            for err in errors:
+                print(f" {err}")
+        elif not quiet:
+            print(f"OK {f}")
+
+    print(f"\n{total} files checked, {total - failed} passed, {failed} failed")
+    sys.exit(1 if failed else 0)
+
+
+if __name__ == "__main__":
+    main()
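Review note on the `scripts/claim.py` hunk: `take_next(limit=None)` accepts `limit` and the CLI parses `--limit`, but the loop body shown in the diff never reads it, so `take-next --limit 5` appears to behave like plain `take-next`. The following is a minimal sketch of one way the limit could be honoured, not the author's implementation. The function name `take_next_limited` is hypothetical, and `is_claimed` and `claim` are passed in as callables because the bodies of the real helpers are not part of this diff.

```python
from typing import Callable, Iterable, Optional


def take_next_limited(
    candidates: Iterable[str],
    is_claimed: Callable[[str], bool],
    claim: Callable[[str], bool],
    limit: Optional[int] = None,
) -> Optional[str]:
    """Claim the first available slug, trying at most `limit` available ones.

    `is_claimed` and `claim` are hypothetical stand-ins for the helpers of the
    same name in claim.py; `claim` is assumed to return False when it loses a race.
    """
    considered = 0
    for slug in candidates:
        if is_claimed(slug):
            continue  # already held by another agent, does not count toward the limit
        if limit is not None and considered >= limit:
            break  # the first `limit` available candidates have all been tried
        considered += 1
        if claim(slug):
            return slug
        # claim() failed: another agent won the race, move on to the next candidate
    return None


if __name__ == "__main__":
    held = {"paper-a"}

    def fake_is_claimed(slug: str) -> bool:
        return slug in held

    def fake_claim(slug: str) -> bool:
        if slug in held:
            return False
        held.add(slug)
        return True

    # paper-a is already held, so the first available candidate is paper-b
    print(take_next_limited(["paper-a", "paper-b", "paper-c"], fake_is_claimed, fake_claim, limit=1))
```

Counting only unclaimed candidates against the limit matches the docstring wording "Claim next from first 5 available"; counting every candidate would be an equally defensible reading.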

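Review note on `scripts/build-citation-graph.py`: `find_connected_components` walks the graph with a recursive `dfs`. CPython's default recursion limit is roughly 1000 frames, so a sufficiently long chain of citations could raise `RecursionError`; whether the registry ever produces such a chain is not known from this diff. A sketch of an equivalent traversal with an explicit stack, under the same `adjacency` and `all_nodes` inputs (the function name `connected_components_iterative` is hypothetical):

```python
from collections import defaultdict


def connected_components_iterative(adjacency, all_nodes):
    """Return connected components of an undirected graph without recursion.

    `adjacency` maps a node to an iterable of neighbours; each component is
    returned as a sorted list, as in the recursive version in the diff.
    """
    visited = set()
    components = []
    for start in all_nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency.get(node, ()):
                if neighbor not in visited:
                    # mark on push so a node is never stacked twice
                    visited.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


if __name__ == "__main__":
    # A 5000-node chain, deeper than the default recursion limit
    adjacency = defaultdict(set)
    chain = [f"p{i:05d}" for i in range(5000)]
    for a, b in zip(chain, chain[1:]):
        adjacency[a].add(b)
        adjacency[b].add(a)
    comps = connected_components_iterative(adjacency, chain)
    print(len(comps), len(comps[0]))  # prints: 1 5000
```

The set of components is the same as with the recursive version; only the visiting order inside a component changes, and that is hidden by `sorted(component)`.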