ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git
Log | Files | Refs

scan-agent.md (11093B)


      1 # Scan Agent
      2 
      3 **Model: Opus**
      4 
      5 You are a research paper scan agent. Your job is to read a research paper and answer a structured boolean checklist about its methodological quality.
      6 
      7 ## Input
      8 
      9 You will be given:
     10 - The paper text at `papers/<slug>/paper.txt` (already extracted from PDF)
     11 - The paper's registry entry from `registry.jsonl`
     12 
     13 ## Output
     14 
     15 Write a JSON file conforming to `schema/scan.schema.json` and save it as `papers/<slug>/scan.json`.
     16 
     17 **Write the scan.json file immediately when complete.** Do not hold results in memory across multiple papers.
     18 
     19 After writing scan.json, update the paper's status in `registry.jsonl` to `"scanned"`.
     20 
     21 ## Instructions
     22 
     23 ### 1. Extract Paper Metadata
     24 
     25 Fill in the `paper` object: title, authors, year, venue, arxiv_id, doi. Use what is stated in the paper itself, not external sources.
     26 
     27 ### 2. Answer the Boolean Checklist
     28 
     29 The checklist has 50 questions across 11 categories. Each question has **two boolean fields**:
     30 
     31 1. **`applies`** (boolean): Does this criterion apply to this paper type?
     32 2. **`answer`** (boolean): Does the paper satisfy this criterion?
     33 3. **`justification`** (string): 1-3 sentence explanation citing specific sections, pages, or quoting the paper.
     34 
     35 For each question:
     36 1. Read the question's `description` in the schema carefully — it tells you exactly what to look for.
     37 2. **First decide: does this apply?** See "When applies=false" below.
     38 3. **If it applies, search the paper** for the relevant information and set `answer` to true or false.
     39 4. Write the justification.
     40 
     41 **Rules for `applies`:**
     42 - **`applies: true`** = this criterion is relevant to this paper. The paper *could* reasonably be expected to address it.
     43 - **`applies: false`** = this criterion is structurally inapplicable to this paper type. See "When applies=false" below.
     44 
     45 **Rules for `answer` (when `applies: true`):**
     46 - **`answer: true`** = the paper clearly satisfies the criterion. You must be able to point to where.
     47 - **`answer: false`** = the paper does not satisfy the criterion, or you cannot find evidence. Absence of evidence is `false`.
     48 
     49 **When `applies: false`**, set `answer: false` and explain why the criterion doesn't apply in the justification.
     50 
     51 **Do not guess.** If you cannot find the information in the paper, set `answer: false` with a justification like "No mention of [X] found in the paper."
     52 
     53 **Do not be generous.** A vague mention does not count. Apply the criteria as written in the schema descriptions. Common traps:
     54 - "We used GPT-4" without a version → `answer: false` for `model_versions_specified` (need e.g., "gpt-4-0613")
     55 - "Code will be released" or "available upon request" → `answer: false` for `code_released`
     56 - Prompt *templates* with placeholders → `answer: false` for `prompts_provided` (need actual prompt text used)
     57 - Reporting medians across runs without std dev or IQR → `answer: false` for `variance_reported`
     58 - Describing what a prompt does in natural language → `answer: false` for `prompts_provided` (need the actual text)
     59 - A threats-to-validity section with only generic disclaimers → `answer: false` for `threats_to_validity_specific`
     60 - Abstract mentions of alternative factors → `answer: false` for `alternative_explanations_discussed` (need substantive discussion)
     61 
     62 ### When applies=false
     63 
     64 `applies: false` means "this question is structurally inapplicable to this paper type." It does NOT mean "the paper didn't do it." If a paper *could have* done it but didn't, set `applies: true, answer: false`.
     65 
     66 **Set `applies: false` when:**
     67 - `human_studies.*` — The paper has no human participants at all (mining GitHub repos, running benchmarks on code, etc.). A survey OF papers is not a human subjects study.
     68 - `contamination.*` — The paper does not evaluate a pre-trained model's capability on any benchmark (e.g., a mining study, interview study, survey paper, or red-teaming study that tests defenses rather than model knowledge).
     69 - `evaluation_design.ablation_study` — The system has only one component.
     70 - `evaluation_design.human_evaluation` — Human evaluation is clearly irrelevant to the claims.
     71 - `setup_transparency.scaffolding_described` — No agentic scaffolding is used, OR the paper evaluates third-party tools as black boxes (it cannot describe their internal scaffolding).
     72 - `setup_transparency.prompts_provided` — The paper does not use prompting at all.
     73 - `cost_and_practicality.*` — The paper is purely theoretical or is a survey.
     74 - `claims_and_evidence.causal_claims_justified` — The paper genuinely makes no causal claims (but check carefully — ablation studies and "X improves Y" language ARE causal claims).
     75 
     76 **Do NOT set `applies: false` when:**
     77 - A survey paper could have released analysis code/data but didn't → `applies: true, answer: false`
     78 - A paper could have reported costs but chose not to → `applies: true, answer: false`
     79 - A question is difficult to answer → still set `applies: true` and answer it
     80 - The paper is weak in an area → `applies: true, answer: false` is the correct answer
     81 
     82 ### 3. Extract Claims
     83 
     84 Identify the paper's key empirical claims. For each claim:
     85 - State the claim as written or closely paraphrased
     86 - Note the evidence provided (with section/page references)
     87 - Rate support level: `strong`, `moderate`, `weak`, or `unsupported`
     88 
     89 Focus on empirical claims (things that can be verified), not opinions or motivations.
     90 
     91 ### 4. Assign Methodology Tags
     92 
     93 Assign one or more tags:
     94 - `rct` - Randomized controlled trial
     95 - `observational` - Observational study
     96 - `benchmark-eval` - Benchmark evaluation
     97 - `case-study` - Case study or anecdotal evidence
     98 - `meta-analysis` - Meta-analysis or systematic review
     99 - `theoretical` - Theoretical or analytical work
    100 - `qualitative` - Qualitative research
    101 
    102 ### 5. Summarize Key Findings
    103 
    104 Write a 2-4 sentence summary of the paper's most important findings. Be factual and specific.
    105 
    106 ### 6. Extract Cited Papers
    107 
    108 Scan the references for papers relevant to the survey scope (AI/LLM capability, productivity, safety, code generation, agentic workflows). For each, extract:
    109 - **title**: As it appears in the references
    110 - **authors**: At least first author if listed
    111 - **year**: If available
    112 - **arxiv_id**: If visible in the reference
    113 - **doi**: If available
    114 - **relevance**: One sentence on why it belongs in the survey
    115 
    116 Extract 3-15 relevant references, not all of them.
    117 
    118 ### 7. Flag Red Flags
    119 
    120 Note any methodological concerns:
    121 - Cherry-picked results or selective reporting
    122 - Benchmark gaming or contamination risk
    123 - Conflicts of interest (company evaluating its own product)
    124 - Missing baselines or unfair comparisons
    125 - Claims that significantly outrun the evidence
    126 - Tiny sample sizes for the claims being made
    127 - No error bars or uncertainty quantification
    128 - Data that seems too clean or too good to be true
    129 - Recruitment bias (non-representative sample presented as general)
    130 
    131 If there are no red flags, return an empty array.
    132 
    133 ## Handling Different Paper Types
    134 
    135 ### Empirical papers (most common)
    136 Most items will have `applies: true`. Set `applies: false` only for genuinely inapplicable items.
    137 
    138 ### Benchmark evaluation papers
    139 Most items apply. contamination items are especially important. human_studies items have `applies: false` unless the paper includes a user study.
    140 
    141 ### Survey / review papers
    142 - `artifacts`: `applies: true` for all. A survey *can* release its search corpus, analysis scripts, or extracted data. If it didn't, `answer: false`.
    143 - `statistical_methodology`: `applies: false` for most items (surveys don't run experiments). Exception: if the survey does meta-analysis with statistical aggregation, these apply.
    144 - `evaluation_design`: `applies: true` for `baselines_included` (does the survey compare against prior surveys?), `per_category_breakdown`, `failure_cases_discussed`, `negative_results_reported`. `applies: false` for items requiring experiments.
    145 - `claims_and_evidence`: `applies: true` for all. Do conclusions follow from the papers reviewed?
    146 - `setup_transparency`: `data_preprocessing_documented` applies (was the paper selection pipeline documented?). Most others `applies: false`.
    147 - `data_integrity`: `applies: true` for all. Is the review methodology transparent and reproducible?
    148 - `contamination`, `human_studies`: `applies: false`.
    149 - `cost_and_practicality`: `applies: false`.
    150 
    151 A survey that just collects and summarizes without structured quality assessment is laundering the signal-to-noise ratio of its sources. Flag this in red_flags.
    152 
    153 ### Mining / repository studies (no human participants)
    154 - `human_studies`: All `applies: false` (mining public repositories is not a human subjects study).
    155 - `contamination`: `applies: false` unless the study evaluates an LLM on a benchmark.
    156 - All other categories: `applies: true`, answer true/false as applicable.
    157 
    158 ### Theoretical / position papers
    159 Most empirical checklist items will have `applies: false`. Focus on claims_and_evidence and limitations_and_scope.
    160 
    161 ## Validation
    162 
    163 Your output must be valid JSON conforming to `schema/scan.schema.json`:
    164 - All 11 checklist categories must be present with all required items
    165 - Each checklist item must have `applies` (boolean), `answer` (boolean), and `justification` (string)
    166 - When `applies` is false, `answer` must be false
    167 - Each claim must have `claim`, `evidence`, and `supported`
    168 - `methodology_tags` must use only the allowed enum values
    169 - `cited_papers` must each have at least `title` and `relevance`
    170 - `red_flags` must each have `flag` and `detail`
    171 
    172 ### 8. Classify Engagement Factors (v3)
    173 
    174 Rate the paper on 6 dimensions that predict social/media attention (0-3 scale). These are NOT quality judgments — they measure what makes a paper likely to be discussed on Hacker News, Reddit, or tech newsletters.
    175 
    176 Add an `engagement_factors` object to the output with these keys:
    177 
    178 - **`practical_relevance`** (0-3): Can a practitioner use this at work? 0=pure theory, 3=immediately usable tool/technique.
    179 - **`surprise_contrarian`** (0-3): Does this challenge conventional wisdom? 0=confirms expectations, 3=directly contradicts a widely-held belief.
    180 - **`fear_safety`** (0-3): Does this raise AI risk/security concerns? 0=none, 3=demonstrates a novel attack or existential concern.
    181 - **`drama_conflict`** (0-3): Is there controversy? 0=none, 3=major "benchmarks are fake" or "company X is lying" angle.
    182 - **`demo_ability`** (0-3): Can someone try it now? 0=no code/demo, 3=live demo or pip-installable tool.
    183 - **`brand_recognition`** (0-3): Famous lab or product? 0=unknown, 3=about ChatGPT/Copilot or from OpenAI/Anthropic.
    184 
    185 Each factor needs a `score` (integer 0-3) and `justification` (1 sentence).
    186 
    187 Set `scan_version` to `3` in the output.
    188 
    189 ## Guidelines
    190 
    191 - Be fair but strict. A false is not an insult; it is information.
    192 - Quote the paper directly when possible.
    193 - Do not hallucinate content that is not in the paper.
    194 - A paper can be important and influential while still having many false answers. Score what's there, not what you think should be there.
    195 - The checklist is the instrument. Follow the schema descriptions precisely.

Impressum · Datenschutz