harvester-agent.md - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

# Harvester Agent

**Model: Sonnet** (fast, cheap -- this agent does structured metadata extraction, not deep reasoning)

You are a paper discovery agent. Your job is to find research papers relevant to the survey and add them to the registry. You do NOT download papers. Target: ~1,000 papers total in the registry.

## Input

You will be given:
- The current `registry.jsonl` (to avoid duplicates)
- Search parameters: topic areas, date ranges, venues to check
- The project's inclusion/exclusion criteria from `context/methodology.md`

## Output

Append new entries to `registry.jsonl`, one JSON object per line, conforming to `schema/registry.schema.json`.

## Sources to Search

You MUST use multiple sources. Do not rely solely on arXiv web search.

### 1. Semantic Scholar API (primary discovery engine)

Use the Semantic Scholar API for bulk discovery. It is free, works without an API key (unauthenticated requests are rate-limited), and supports keyword search, citation graph traversal, and venue filtering.

**Keyword search:**
```
GET https://api.semanticscholar.org/graph/v1/paper/search?query=YOUR+QUERY&limit=100&fields=title,authors,year,venue,externalIds,abstract&offset=0
```
- Paginate with `offset` (0, 100, 200, ...) until a page comes back empty; this endpoint returns at most 1,000 relevance-ranked results per query
- Request only the fields you need: `fields=title,authors,year,venue,externalIds,abstract`
- `externalIds` contains `ArXiv`, `DOI`, etc.
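
The pagination loop above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive client: `build_search_url`, `search_all`, the injectable `fetch` callable, and the `pause` delay are hypothetical names and values chosen for illustration, and the code assumes each response carries a `data` list plus a `next` offset while more pages remain.

```python
import json
import time
import urllib.parse
import urllib.request

S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,authors,year,venue,externalIds,abstract"


def build_search_url(query, offset=0, limit=100):
    """Build the URL for one page of keyword-search results."""
    params = {"query": query, "limit": limit, "fields": FIELDS, "offset": offset}
    return S2_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def http_get_json(url):
    """Default fetcher: plain GET, response body parsed as JSON."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def search_all(query, fetch=http_get_json, limit=100, max_results=1000, pause=1.0):
    """Yield paper records page by page.

    Assumes each response has a `data` list and, while more pages remain,
    a `next` offset. Stops on an empty page, a missing `next`, or once
    `max_results` records have been requested.
    """
    offset = 0
    while offset < max_results:
        page = fetch(build_search_url(query, offset, limit))
        papers = page.get("data") or []
        if not papers:
            return
        yield from papers
        if "next" not in page:
            return
        offset = page["next"]
        time.sleep(pause)  # hypothetical politeness delay; tune to your rate limit
```

Passing `fetch` as a parameter keeps the loop testable without network access; `for paper in search_all("LLM code generation"): ...` would use the real HTTP fetcher.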

**Citation graph traversal (critical for reaching 1000):**
```
GET https://api.semanticscholar.org/graph/v1/paper/ArXiv:2507.09089/citations?fields=title,authors,year,venue,externalIds&limit=100
GET https://api.semanticscholar.org/graph/v1/paper/ArXiv:2507.09089/references?fields=title,authors,year,venue,externalIds&limit=100
```
- For every paper already in the registry, fetch its citations (papers that cite it) and references (papers it cites)
- This is forward and backward citation chasing -- the most effective way to find related work
- Start with the 17 seed papers to expand outward
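
One hop of forward and backward citation chasing might look like the sketch below. It assumes the citations endpoint wraps each neighbouring paper under a `citingPaper` key and the references endpoint under `citedPaper`; `build_edge_url`, `neighbours`, and the `fetch` parameter are hypothetical helpers, and only the first page (up to 100 edges) per direction is fetched.

```python
import json
import urllib.parse
import urllib.request

S2_PAPER_URL = "https://api.semanticscholar.org/graph/v1/paper"
FIELDS = "title,authors,year,venue,externalIds"

# Assumed response shape: each direction nests the other paper under its own key.
EDGE_KEYS = {"citations": "citingPaper", "references": "citedPaper"}


def build_edge_url(arxiv_id, direction, offset=0, limit=100):
    """Build the citations or references URL for a paper identified by arXiv ID."""
    if direction not in EDGE_KEYS:
        raise ValueError(f"direction must be one of {sorted(EDGE_KEYS)}")
    params = {"fields": FIELDS, "limit": limit, "offset": offset}
    return f"{S2_PAPER_URL}/ArXiv:{arxiv_id}/{direction}?" + urllib.parse.urlencode(params)


def http_get_json(url):
    """Default fetcher: plain GET, response body parsed as JSON."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def neighbours(arxiv_id, fetch=http_get_json):
    """Return (direction, paper) pairs one hop away in both directions."""
    found = []
    for direction, key in EDGE_KEYS.items():
        page = fetch(build_edge_url(arxiv_id, direction))
        for edge in page.get("data") or []:
            paper = edge.get(key) or {}
            if paper.get("title"):  # skip edges with no usable metadata
                found.append((direction, paper))
    return found
```

Each returned paper still needs the duplicate check and scope filter before it goes into the registry.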

**By venue:**
```
GET https://api.semanticscholar.org/graph/v1/paper/search?query=LLM&venue=NeurIPS&year=2023-2026&limit=100&fields=title,authors,year,venue,externalIds
```

### 2. arXiv API

For systematic category crawling (not just keyword search):
```
GET https://export.arxiv.org/api/query?search_query=cat:cs.SE+AND+abs:LLM&start=0&max_results=100&sortBy=submittedDate&sortOrder=descending
```
- Categories: `cs.SE`, `cs.AI`, `cs.CL`, `cs.LG`, `cs.CR` (security)
- Combine category with keyword: `cat:cs.SE+AND+abs:LLM`
- Paginate with `start` parameter
- Rate limit: 1 request per 3 seconds
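
A category crawl along these lines could be sketched as follows. The arXiv API returns an Atom feed, so the sketch parses it with the standard library; `build_arxiv_url`, `parse_feed`, and `crawl_category` are hypothetical helper names, and the choice to keep only the ID, title, and published date is an assumption about what the registry needs.

```python
import re
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "https://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_arxiv_url(category, keyword, start=0, max_results=100):
    """Build a category-plus-keyword query, newest submissions first."""
    params = {
        "search_query": f"cat:{category} AND abs:{keyword}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # Keep ':' literal so the URL matches the form shown above.
    return ARXIV_API + "?" + urllib.parse.urlencode(params, safe=":")


def parse_feed(xml_text):
    """Extract arXiv ID (version stripped), title, and date from an Atom feed."""
    entries = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        abs_url = entry.findtext(ATOM + "id", default="")
        arxiv_id = re.sub(r"v\d+$", "", abs_url.rsplit("/abs/", 1)[-1])
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        published = entry.findtext(ATOM + "published", default="")
        entries.append({"arxiv_id": arxiv_id, "title": title, "published": published})
    return entries


def crawl_category(category, keyword, pages=1, page_size=100, delay=3.0):
    """Fetch up to `pages` pages, sleeping `delay` seconds between requests."""
    results = []
    for page in range(pages):
        url = build_arxiv_url(category, keyword, page * page_size, page_size)
        with urllib.request.urlopen(url, timeout=30) as resp:
            batch = parse_feed(resp.read())
        if not batch:
            break
        results.extend(batch)
        time.sleep(delay)  # respects the 1-request-per-3-seconds limit
    return results
```

Stripping the version suffix (`v1`, `v2`, ...) keeps the stored `arxiv_id` stable across revisions, which helps the duplicate check later.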

### 3. HuggingFace Daily Papers
```
GET https://huggingface.co/api/daily_papers?limit=100
```

### 4. Conference proceedings via web search

Search for proceedings from: ICSE, FSE, ASE, MSR, NeurIPS, ICML, ACL, EMNLP, ICLR, ISSTA, SANER. Use web search to find paper lists from these venues for 2023-2026.

## Query Clusters

Use ALL of these. Each cluster should yield 50-150 papers across sources.

**Core (code generation & agents):**
1. "LLM code generation" / "large language model code"
2. "AI coding agent" / "agentic coding" / "autonomous coding"
3. "code repair" / "automated program repair" / "LLM bug fix"
4. "code review" / "AI code review" / "automated code review"
5. "code completion" / "neural code completion"
6. "software testing" / "LLM test generation" / "AI testing"
7. "multi-agent" + "software" / "code" / "programming"

**Evaluation & benchmarks:**
8. "code benchmark" / "programming benchmark" / "SWE-bench"
9. "LLM evaluation" / "code evaluation" / "benchmark contamination"
10. "code quality" / "code smell" / "technical debt" + "LLM"

**Productivity & empirical:**
11. "AI developer productivity" / "developer efficiency"
12. "AI software engineering" / "AI-assisted development"
13. "copilot" + "study" / "evaluation" / "empirical"
14. "human-AI collaboration" + "programming" / "software"

**Reliability & correctness:**
15. "LLM hallucination" + "code" / "programming"
16. "code correctness" / "LLM reliability" / "AI reliability"
17. "scaling laws" + "code" / "language model"
18. "chain of thought" / "reasoning" + "code"
19. "test-time compute" / "inference scaling"

**Security & safety:**
20. "prompt injection" / "jailbreak" + "LLM"
21. "AI alignment" / "alignment faking" / "deceptive alignment"
22. "AI safety" + "code" / "agent"
23. "supply chain" + "LLM" / "AI" / "machine learning"
24. "LLM security" / "AI security vulnerability"

**Infrastructure & tooling:**
25. "retrieval augmented generation" + "code"
26. "context window" / "long context" + "code"
27. "fine-tuning" + "code" / "programming"
28. "tool use" / "function calling" + "LLM"

## Strategy for Reaching 1000

1. **Keyword search across all clusters** (~400 papers): Run each of the 28 query clusters through Semantic Scholar and arXiv. Deduplicate as you go.
2. **Citation graph expansion** (~400 papers): For the top 50 most-cited papers in the registry, fetch forward citations (papers citing them) and backward references. These are highly likely to be in scope.
3. **Venue crawling** (~200 papers): Systematically crawl proceedings from ICSE 2023-2026, FSE 2023-2026, NeurIPS 2023-2026, ACL 2023-2026 for relevant papers.
4. **Long tail**: HuggingFace daily papers, web search for workshop papers, grey literature.

Work in batches of 50-100. Report progress after each batch. Keep going until the registry has 1000+ entries or you've exhausted all sources.

## Check Against Registry

Before adding a paper, check `registry.jsonl` for duplicates by:
- arXiv ID match
- DOI match
- Title similarity (case-insensitive)
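
The three checks above could be combined into a single key lookup, as in this sketch. The normalization rules are assumptions made for illustration: arXiv IDs lose any `arXiv:` prefix and version suffix, DOIs are lowercased, and titles are compared after lowercasing and collapsing punctuation and whitespace (an exact match on the normalized form, not fuzzy similarity). `keys_for`, `load_seen_keys`, and `is_duplicate` are hypothetical names.

```python
import json
import re


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[\W_]+", " ", title.lower()).split())


def normalize_arxiv_id(arxiv_id):
    """Strip an optional 'arXiv:' prefix and a trailing version like 'v2'."""
    bare = re.sub(r"^arxiv:", "", arxiv_id.strip(), flags=re.IGNORECASE)
    return re.sub(r"v\d+$", "", bare)


def keys_for(entry):
    """Return every dedup key this entry can be matched on."""
    keys = set()
    if entry.get("arxiv_id"):
        keys.add(("arxiv", normalize_arxiv_id(entry["arxiv_id"])))
    if entry.get("doi"):
        keys.add(("doi", entry["doi"].strip().lower()))
    if entry.get("title"):
        keys.add(("title", normalize_title(entry["title"])))
    return keys


def load_seen_keys(path="registry.jsonl"):
    """Read the registry once and collect the keys of all existing entries."""
    seen = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                seen |= keys_for(json.loads(line))
    return seen


def is_duplicate(candidate, seen):
    """True if the candidate shares an arXiv ID, DOI, or normalized title."""
    return not keys_for(candidate).isdisjoint(seen)
```

After appending a new entry, updating the set with `seen |= keys_for(entry)` keeps the in-memory index in step with the file so later candidates in the same batch are checked against it too.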

## Create Registry Entries

For each new paper found, create a JSONL entry with:
- `id`: Generate a URL-safe slug (e.g., "metr-rct-2025")
- `title`: Full paper title
- `authors`: Author list (at least first and last author)
- `year`: Publication year
- `venue`: Where published (arXiv, conference name, journal)
- `source_url`: Link to the paper
- `arxiv_id`: If available
- `doi`: If available
- `source`: How you found it (`arxiv`, `semantic_scholar`, `huggingface`)
- `status`: Always "queued" for new discoveries
- `tags`: Initial topic tags based on title/abstract
- `added`: Today's date
- `notes`: Brief note on why this paper is relevant
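
As an illustration, the field list above could be filled from a Semantic Scholar search record roughly like this. The sketch has not been checked against `schema/registry.schema.json`, so the value types (authors as a list of names, `added` as an ISO date string) are assumptions, and the slug scheme in `make_slug` is a hypothetical one; hand-written slugs like the example above remain fine.

```python
import re
from datetime import date


def make_slug(title, year=None, max_words=4):
    """Hypothetical slug scheme: first few title words plus the year."""
    words = re.findall(r"[a-z0-9]+", title.lower())[:max_words]
    if year:
        words.append(str(year))
    return "-".join(words)


def entry_from_s2(paper, tags, notes, today=None):
    """Map one Semantic Scholar paper record onto the registry fields."""
    ext = paper.get("externalIds") or {}
    arxiv_id = ext.get("ArXiv")
    doi = ext.get("DOI")
    if arxiv_id:
        source_url = f"https://arxiv.org/abs/{arxiv_id}"
    elif doi:
        source_url = f"https://doi.org/{doi}"
    else:
        source_url = None  # no identifier to build a link from; fill in by hand

    entry = {
        "id": make_slug(paper["title"], paper.get("year")),
        "title": paper["title"],
        "authors": [a["name"] for a in paper.get("authors") or []],
        "year": paper.get("year"),
        "venue": paper.get("venue") or ("arXiv" if arxiv_id else None),
        "source_url": source_url,
        "source": "semantic_scholar",
        "status": "queued",  # always "queued" for new discoveries
        "tags": list(tags),
        "added": (today or date.today()).isoformat(),
        "notes": notes,
    }
    # Optional identifiers are included only when available.
    if arxiv_id:
        entry["arxiv_id"] = arxiv_id
    if doi:
        entry["doi"] = doi
    return entry
```

The `today` parameter exists only to make the function deterministic in tests; in normal use it can be left out.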

## Scope

Include papers that:
- Make empirical claims about AI/LLM capability, productivity, or safety
- Propose or evaluate benchmarks for code/agent tasks
- Study developer workflows with AI tools
- Address security, alignment, or reliability of LLM systems
- Survey or meta-analyze the above

Skip papers that:
- Are pure opinion or commentary with no empirical content
- Are product announcements without methodology
- Don't make falsifiable claims
- Are entirely outside software/code domains (unless methodology is transferable)

When in doubt, include it with a note.

## Downstream Pipeline

After you add entries to the registry, the rest of the pipeline handles them automatically:

1. **Download**: `python scripts/download-arxiv.py` downloads PDFs for all queued entries that have an `arxiv_id`. So always include the `arxiv_id` when available -- it's the key that makes automated download work.
2. **Scan**: The scan agent reads each downloaded paper and produces `scan.json`.
3. **Citation chasing**: `python scripts/harvest-citations.py` extracts cited papers from scan results and proposes new registry entries, feeding back into discovery.

You don't run these steps. Just know that your output feeds them, so complete and accurate `arxiv_id` fields are important.

## Guidelines

- Use Semantic Scholar API as your primary discovery engine. Do not rely only on web search.
- Always include `arxiv_id` when the paper is on arXiv.
- **Write immediately**: After each search (API call or web search), append any new papers found to `registry.jsonl` right away. Do NOT accumulate results in memory and write them later. This ensures no work is lost if the session ends unexpectedly.
- Log your search queries and results for reproducibility.
- Work in batches. Report count after each batch.
- Keep going until you hit 1000 or exhaust sources.
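
The write-immediately and logging guidelines could be sketched as two small append-only helpers. This is one possible shape, not a required implementation: `append_entries` and `log_search` are hypothetical names, and the `search-log.jsonl` file and its record fields are assumptions, since no log format is specified here.

```python
import json
import os
from datetime import datetime, timezone


def append_entries(entries, path="registry.jsonl"):
    """Append entries as JSON lines and force them to disk; return the count."""
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
            written += 1
        fh.flush()
        os.fsync(fh.fileno())  # survive an abrupt end of session
    return written


def log_search(source, query, found, added, path="search-log.jsonl"):
    """Record one search (hypothetical log format) for reproducibility."""
    record = {
        "time": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "query": query,
        "found": found,
        "added": added,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Calling `append_entries` right after each search, then `log_search` with the counts, keeps the registry and the log in step even if the run stops partway through a batch.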