Harvester run 1 (138 papers) + rewrite harvester prompt - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

commit fde1ea0a6da1bd0c26c61b0bc3cde325a3551fe1
parent 3a05a3aea1e82e39df3a495ed95e9004e4d25b8f
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 27 Feb 2026 21:30:41 +0100

Harvester run 1 (138 papers) + rewrite harvester prompt

Registry: 17 → 155 entries. All 138 new entries came from arXiv web search.

Rewrote harvester-agent.md to fix the gap:
- Added explicit Semantic Scholar API endpoints (keyword search,
  citation graph traversal for forward/backward chasing, venue filter)
- Added arXiv API query syntax with category+keyword combinations
- Added HuggingFace daily papers API
- Expanded query clusters from 8 to 28
- Added concrete strategy: keyword search ~400, citation graph ~400,
  venue crawl ~200 to reach 1000 target
- Previous prompt listed sources but gave no actionable instructions,
  so Sonnet only used the easiest path (arXiv web search)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
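
Reviewer sketch (not part of the commit): the three request shapes added to harvester-agent.md can be built with the standard library alone. This is a minimal illustration, assuming the endpoints and parameters exactly as they appear in the diff below; the function names and constants are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Base URLs and field lists copied from the rewritten prompt in this commit.
S2_BASE = "https://api.semanticscholar.org/graph/v1"
ARXIV_BASE = "https://export.arxiv.org/api/query"
S2_FIELDS = "title,authors,year,venue,externalIds"


def _qs(params: dict) -> str:
    # Keep ':' and ',' literal so the URLs read like the ones in the prompt.
    return urlencode(params, safe=":,")


def s2_search_url(query: str, offset: int = 0, limit: int = 100) -> str:
    """Keyword search URL; step `offset` by `limit` to paginate."""
    params = {"query": query, "limit": limit, "offset": offset,
              "fields": S2_FIELDS + ",abstract"}
    return f"{S2_BASE}/paper/search?{_qs(params)}"


def s2_graph_url(arxiv_id: str, direction: str, limit: int = 100) -> str:
    """Citation chasing: 'citations' is forward, 'references' is backward."""
    if direction not in ("citations", "references"):
        raise ValueError(f"unknown direction: {direction}")
    params = {"fields": S2_FIELDS, "limit": limit}
    return f"{S2_BASE}/paper/ArXiv:{arxiv_id}/{direction}?{_qs(params)}"


def arxiv_query_url(category: str, keyword: str, start: int = 0,
                    max_results: int = 100) -> str:
    """Category + abstract-keyword query, newest submissions first."""
    params = {"search_query": f"cat:{category} AND abs:{keyword}",
              "start": start, "max_results": max_results,
              "sortBy": "submittedDate", "sortOrder": "descending"}
    return f"{ARXIV_BASE}?{_qs(params)}"


if __name__ == "__main__":
    print(s2_search_url("agentic coding"))
    print(s2_graph_url("2507.09089", "references"))
    print(arxiv_query_url("cs.SE", "LLM", start=100))
```

A caller would still need to fetch these URLs and, for arXiv, wait 3 seconds between requests as the prompt specifies.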

Diffstat:
M agents/harvester-agent.md  | 153 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
M registry.jsonl  | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2 files changed, 259 insertions(+), 32 deletions(-)
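
Reviewer sketch (not part of the commit): the duplicate check and entry creation that the prompt describes (arXiv ID match, DOI match, case-insensitive title match, `status` always "queued") could look roughly like this. The helper names are hypothetical. Whitespace collapsing, DOI lowercasing, and the slug rule are assumptions added for illustration, not rules stated in the prompt.

```python
import json
import re


def norm_title(title: str) -> str:
    """Case-insensitive title key; collapsing whitespace is an assumption."""
    return re.sub(r"\s+", " ", title).strip().casefold()


def load_keys(lines) -> dict:
    """Collect arXiv IDs, DOIs, and normalized titles from registry JSONL lines."""
    keys = {"arxiv_id": set(), "doi": set(), "title": set()}
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("arxiv_id"):
            keys["arxiv_id"].add(entry["arxiv_id"])
        if entry.get("doi"):
            keys["doi"].add(entry["doi"].lower())
        keys["title"].add(norm_title(entry["title"]))
    return keys


def is_duplicate(candidate: dict, keys: dict) -> bool:
    """Apply the three checks in the order the prompt lists them."""
    if candidate.get("arxiv_id") in keys["arxiv_id"]:
        return True
    doi = candidate.get("doi")
    if doi and doi.lower() in keys["doi"]:
        return True
    return norm_title(candidate["title"]) in keys["title"]


def make_entry(candidate: dict, source: str, today: str) -> dict:
    """Build a registry entry; the slug rule here is a hypothetical choice."""
    words = re.findall(r"[a-z0-9]+", candidate["title"].lower())[:4]
    entry = {"id": "-".join(words + [str(candidate["year"])]),
             "title": candidate["title"], "year": candidate["year"],
             "source": source, "status": "queued", "added": today}
    if candidate.get("arxiv_id"):
        entry["arxiv_id"] = candidate["arxiv_id"]
        entry["source_url"] = f"https://arxiv.org/abs/{candidate['arxiv_id']}"
    return entry
```

In use, `load_keys` would be called once on the lines of `registry.jsonl`, and each accepted candidate would be appended as `json.dumps(make_entry(...))` on its own line, with its keys added to the sets so later candidates in the same batch are checked against it too.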
diff --git a/agents/harvester-agent.md b/agents/harvester-agent.md
@@ -17,30 +17,115 @@ Append new entries to `registry.jsonl`, one JSON object per line, conforming to 
 
 ## Sources to Search
 
-1. **arXiv**: Search cs.SE, cs.AI, cs.CL, cs.LG for relevant papers
-2. **Semantic Scholar**: Use the API to find papers by keyword, citation graph, and venue
-3. **HuggingFace**: Check trending papers and daily papers for relevant work
-4. **Conference proceedings**: NeurIPS, ICML, ACL, EMNLP, ICLR, ICSE, FSE, ASE
-
-## Instructions
-
-### 1. Search for Papers
-
-For each source, search using relevant keywords:
-- "AI coding", "LLM programming", "agentic coding", "code generation"
-- "AI developer productivity", "AI software engineering"
-- "LLM benchmark", "code benchmark"
-- "prompt injection", "AI alignment", "AI safety"
-- "scaling laws LLM"
-
-### 2. Check Against Registry
+You MUST use multiple sources. Do not rely solely on arXiv web search.
+
+### 1. Semantic Scholar API (primary discovery engine)
+
+Use the Semantic Scholar API for bulk discovery. It is free and works without an API key (unauthenticated requests share a rate limit, so back off on HTTP 429), and it supports keyword search, citation graph traversal, and venue filtering.
+
+**Keyword search:**
+```
+GET https://api.semanticscholar.org/graph/v1/paper/search?query=YOUR+QUERY&limit=100&fields=title,authors,year,venue,externalIds,abstract&offset=0
+```
+- Paginate with `offset` (0, 100, 200, ...); this endpoint returns at most 1,000 relevance-ranked results per query, so split broad queries into narrower ones
+- Use `fields=title,authors,year,venue,externalIds,abstract` to get what you need
+- `externalIds` contains `ArXiv`, `DOI`, etc.
+
+**Citation graph traversal (critical for reaching 1000):**
+```
+GET https://api.semanticscholar.org/graph/v1/paper/ArXiv:2507.09089/citations?fields=title,authors,year,venue,externalIds&limit=100
+GET https://api.semanticscholar.org/graph/v1/paper/ArXiv:2507.09089/references?fields=title,authors,year,venue,externalIds&limit=100
+```
+- For every paper already in the registry, fetch its citations (papers that cite it) and references (papers it cites)
+- This is forward and backward citation chasing -- the most effective way to find related work
+- Start with the 17 seed papers to expand outward
+
+**By venue:**
+```
+GET https://api.semanticscholar.org/graph/v1/paper/search?query=LLM&venue=NeurIPS&year=2023-2026&limit=100&fields=title,authors,year,venue,externalIds
+```
+
+### 2. arXiv API
+
+For systematic category crawling (not just keyword search):
+```
+GET https://export.arxiv.org/api/query?search_query=cat:cs.SE+AND+abs:LLM&start=0&max_results=100&sortBy=submittedDate&sortOrder=descending
+```
+- Categories: `cs.SE`, `cs.AI`, `cs.CL`, `cs.LG`, `cs.CR` (security)
+- Combine category with keyword: `cat:cs.SE+AND+abs:LLM`
+- Paginate with `start` parameter
+- Rate limit: 1 request per 3 seconds
+
+### 3. HuggingFace Daily Papers
+```
+GET https://huggingface.co/api/daily_papers?limit=100
+```
+
+### 4. Conference proceedings via web search
+
+Search for proceedings from: ICSE, FSE, ASE, MSR, NeurIPS, ICML, ACL, EMNLP, ICLR, ISSTA, SANER. Use web search to find paper lists from these venues for 2023-2026.
+
+## Query Clusters
+
+Use ALL of these. Each cluster should yield 50-150 papers across sources.
+
+**Core (code generation & agents):**
+1. "LLM code generation" / "large language model code"
+2. "AI coding agent" / "agentic coding" / "autonomous coding"
+3. "code repair" / "automated program repair" / "LLM bug fix"
+4. "code review" / "AI code review" / "automated code review"
+5. "code completion" / "neural code completion"
+6. "software testing" / "LLM test generation" / "AI testing"
+7. "multi-agent" + "software" / "code" / "programming"
+
+**Evaluation & benchmarks:**
+8. "code benchmark" / "programming benchmark" / "SWE-bench"
+9. "LLM evaluation" / "code evaluation" / "benchmark contamination"
+10. "code quality" / "code smell" / "technical debt" + "LLM"
+
+**Productivity & empirical:**
+11. "AI developer productivity" / "developer efficiency"
+12. "AI software engineering" / "AI-assisted development"
+13. "copilot" + "study" / "evaluation" / "empirical"
+14. "human-AI collaboration" + "programming" / "software"
+
+**Reliability & correctness:**
+15. "LLM hallucination" + "code" / "programming"
+16. "code correctness" / "LLM reliability" / "AI reliability"
+17. "scaling laws" + "code" / "language model"
+18. "chain of thought" / "reasoning" + "code"
+19. "test-time compute" / "inference scaling"
+
+**Security & safety:**
+20. "prompt injection" / "jailbreak" + "LLM"
+21. "AI alignment" / "alignment faking" / "deceptive alignment"
+22. "AI safety" + "code" / "agent"
+23. "supply chain" + "LLM" / "AI" / "machine learning"
+24. "LLM security" / "AI security vulnerability"
+
+**Infrastructure & tooling:**
+25. "retrieval augmented generation" + "code"
+26. "context window" / "long context" + "code"
+27. "fine-tuning" + "code" / "programming"
+28. "tool use" / "function calling" + "LLM"
+
+## Strategy for Reaching 1000
+
+1. **Keyword search across all clusters** (~400 papers): Run each of the 28 query clusters through Semantic Scholar and arXiv. Deduplicate as you go.
+2. **Citation graph expansion** (~400 papers): For the top 50 most-cited papers in the registry, fetch forward citations (papers citing them) and backward references. These are highly likely to be in scope.
+3. **Venue crawling** (~200 papers): Systematically crawl proceedings from ICSE 2023-2026, FSE 2023-2026, NeurIPS 2023-2026, ACL 2023-2026 for relevant papers.
+4. **Long tail**: HuggingFace daily papers, web search for workshop papers, grey literature.
+
+Work in batches of 50-100. Report progress after each batch. Keep going until the registry has 1000+ entries or you've exhausted all sources.
+
+## Check Against Registry
 
 Before adding a paper, check `registry.jsonl` for duplicates by:
 - arXiv ID match
 - DOI match
-- Title similarity (fuzzy match)
+- Title match (case-insensitive comparison)
 
-### 3. Create Registry Entries
+## Create Registry Entries
 
 For each new paper found, create a JSONL entry with:
 - `id`: Generate a URL-safe slug (e.g., "metr-rct-2025")
@@ -51,25 +136,28 @@ For each new paper found, create a JSONL entry with:
 - `source_url`: Link to the paper
 - `arxiv_id`: If available
 - `doi`: If available
-- `source`: How you found it (e.g., "arxiv", "semantic_scholar", "huggingface")
+- `source`: How you found it (`arxiv`, `semantic_scholar`, `huggingface`)
 - `status`: Always "queued" for new discoveries
 - `tags`: Initial topic tags based on title/abstract
 - `added`: Today's date
 - `notes`: Brief note on why this paper is relevant
 
-### 4. Prioritize Quality Over Quantity
+## Scope
 
-Focus on papers that:
-- Make empirical claims about AI/LLM capability or productivity
-- Are widely cited or from reputable venues
-- Cover underrepresented topics in the current registry
-- Have clear methodology that can be assessed
+Include papers that:
+- Make empirical claims about AI/LLM capability, productivity, or safety
+- Propose or evaluate benchmarks for code/agent tasks
+- Study developer workflows with AI tools
+- Address security, alignment, or reliability of LLM systems
+- Survey or meta-analyze the above
 
 Skip papers that:
-- Are pure opinion or commentary
-- Are product announcements
+- Are pure opinion or commentary with no empirical content
+- Are product announcements without methodology
 - Don't make falsifiable claims
-- Are outside scope (non-code domains, unless methodology is transferable)
+- Are entirely outside software/code domains (unless methodology is transferable)
+
+When in doubt, include it with a note.
 
 ## Downstream Pipeline
 
@@ -83,7 +171,8 @@ You don't run these steps. Just know that your output feeds them, so complete an
 
 ## Guidelines
 
-- Discovery only. Do not download PDFs or access full paper text.
-- When in doubt about relevance, include it with a note explaining the uncertainty.
+- Use the Semantic Scholar API as your primary discovery engine. Do not rely only on web search.
+- Always include `arxiv_id` when the paper is on arXiv.
 - Log your search queries and results for reproducibility.
-- Always include `arxiv_id` when the paper is on arXiv. This enables automated PDF download.
+- Work in batches. Report count after each batch.
+- Keep going until you hit 1000 or exhaust sources.
diff --git a/registry.jsonl b/registry.jsonl
@@ -15,3 +15,141 @@
 {"id":"multi-turn-jailbreak-2025","title":"Multi-Turn Jailbreak Attacks on LLM Agents","authors":["Unknown"],"year":2025,"venue":"ACL 2025 (REALM Workshop)","source_url":"https://aclanthology.org/2025.realm-1.13/","source":"manual","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Multi-turn jailbreak attacks achieved 94.44% success rate on GPT-3.5-Turbo (up from 12.12% baseline). Decomposes harmful requests into innocuous sub-steps."}
 {"id":"survey-code-gen-llm-agents-2025","title":"A Survey on Code Generation with LLM-based Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2508.00083","arxiv_id":"2508.00083","source":"manual","status":"queued","tags":["code-generation","agents","survey"],"added":"2026-02-27","notes":"Comprehensive survey on code generation with LLM-based agents. Covers techniques, benchmarks, and open problems."}
 {"id":"agents-of-chaos-2026","title":"Agents of Chaos","authors":["Shapira, N.","Wendler, C.","Yen, A."],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.20021","arxiv_id":"2602.20021","source":"manual","status":"queued","tags":["security","agents","reliability"],"added":"2026-02-27","notes":"Red-teaming study of autonomous LLM agents in a live lab environment. Six agents with persistent memory, email, Discord, file systems, shell access. Documented unauthorized compliance, sensitive info disclosure, destructive actions, DoS, identity spoofing, cross-agent propagation, and partial system takeover."}
+{"id":"codex-humaneval-2021","title":"Evaluating Large Language Models Trained on Code","authors":["Chen, M.","Tworek, J.","Jun, H.","Yuan, Q.","Pinto, H. P. d. O.","et al."],"year":2021,"venue":"arXiv","source_url":"https://arxiv.org/abs/2107.03374","arxiv_id":"2107.03374","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"Foundational paper introducing Codex and HumanEval benchmark. Codex solves 28.8% of HumanEval; 70.2% with 100 samples. Powers GitHub Copilot. Pre-2023 but establishes the benchmark still most widely used for comparison."}
+{"id":"chinchilla-compute-optimal-2022","title":"Training Compute-Optimal Large Language Models","authors":["Hoffmann, J.","Borgeaud, S.","Mensch, A.","Sifre, L.","Elsen, E.","et al."],"year":2022,"venue":"NeurIPS 2022","source_url":"https://arxiv.org/abs/2203.15556","arxiv_id":"2203.15556","source":"arxiv","status":"queued","tags":["scaling","theoretical"],"added":"2026-02-27","notes":"Chinchilla paper. Model size and training tokens should scale equally; prior LLMs were significantly undertrained. Pre-2023 but foundational for understanding scaling trade-offs cited in nearly all scaling-law discussions."}
+{"id":"swe-bench-2023","title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","authors":["Jimenez, C. E.","Yang, J.","Wettig, A.","Yao, S.","Pei, K.","Press, O.","Narasimhan, K."],"year":2023,"venue":"ICLR 2024","source_url":"https://arxiv.org/abs/2310.06770","arxiv_id":"2310.06770","source":"arxiv","status":"queued","tags":["benchmarks","code-generation","agents"],"added":"2026-02-27","notes":"Foundational benchmark: 2,294 real GitHub issues across 12 Python repos. ICLR 2024. Initial performance ~2% (Claude 2). By 2025 frontier models reach ~75% on SWE-bench Verified. Defines the dominant evaluation standard for coding agents."}
+{"id":"starcoder-2023","title":"StarCoder: may the source be with you!","authors":["Li, R.","Allal, L. B.","Zi, Y.","Muennighoff, N.","et al."],"year":2023,"venue":"Transactions on Machine Learning Research","source_url":"https://arxiv.org/abs/2305.06161","arxiv_id":"2305.06161","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"BigCode community 15.5B parameter open code LLM. Trained on 1T tokens from The Stack with PII redaction and opt-out. 40% pass@1 on HumanEval. First major open code model with responsible release practices."}
+{"id":"code-llama-2023","title":"Code Llama: Open Foundation Models for Code","authors":["Rozière, B.","Gehring, J.","Gloeckle, F.","Sootla, A.","et al."],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2308.12950","arxiv_id":"2308.12950","source":"arxiv","status":"queued","tags":["code-generation"],"added":"2026-02-27","notes":"Meta's code LLM family (7B-70B) based on Llama 2. Infilling, 16K context, instruction following. Up to 67%/65% on HumanEval/MBPP. Widely used open-source baseline."}
+{"id":"autogen-multi-agent-2023","title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","authors":["Wu, Q.","Bansal, G.","Zhang, J.","Wu, Y.","Li, B.","et al."],"year":2023,"venue":"COLM 2024","source_url":"https://arxiv.org/abs/2308.08155","arxiv_id":"2308.08155","source":"arxiv","status":"queued","tags":["agents"],"added":"2026-02-27","notes":"Microsoft multi-agent conversation framework. COLM 2024. Became leading open-source agentic AI framework. Empirically demonstrates multi-agent patterns across math, coding, QA. Widely used as scaffolding substrate."}
+{"id":"prompt-injection-llm-apps-2023","title":"Prompt Injection attack against LLM-integrated Applications","authors":["Liu, Y.","Deng, G.","Li, Y.","Wang, K.","et al."],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2306.05499","arxiv_id":"2306.05499","source":"arxiv","status":"queued","tags":["security"],"added":"2026-02-27","notes":"Black-box prompt injection (HouYi technique). 31/36 real LLM-integrated applications susceptible. Notion and others validated findings. Millions of users potentially affected. Seminal real-world attack paper."}
+{"id":"starcoder2-2024","title":"StarCoder2 and The Stack v2: The Next Generation","authors":["Lozhkov, A.","Li, R.","Allal, L. B.","et al."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2402.19173","arxiv_id":"2402.19173","source":"arxiv","status":"queued","tags":["code-generation"],"added":"2026-02-27","notes":"Next-gen open code LLMs: 3B, 7B, 15B parameters on The Stack v2 (619 programming languages, 3.3-4.3T tokens). Continued open approach with transparency report."}
+{"id":"deepseek-coder-2024","title":"DeepSeek-Coder: When the Large Language Model Meets Programming","authors":["Guo, D.","Zhu, Q.","Yang, D.","Xie, Z.","et al."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2401.14196","arxiv_id":"2401.14196","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"DeepSeek code model series (1.3B-33B). Trained from scratch on 2T tokens (87% code, 13% NL). 16K context, fill-in-blank task. Strong across languages; represents competitive open-source alternative to GPT-4 Code."}
+{"id":"livecodebench-2024","title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","authors":["Jain, N.","Han, K.","Gu, A.","Li, F.","et al."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2403.07974","arxiv_id":"2403.07974","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"Continuously updated contamination-free benchmark from LeetCode/AtCoder/CodeForces. Models that perform well on HumanEval may be overfitting. Evaluates 50+ LLMs. Addresses fundamental contamination problem."}
+{"id":"bigcodebench-2024","title":"BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions","authors":["Zhuo, T. Y.","Vu, M. C.","Chim, J.","et al."],"year":2024,"venue":"ICLR 2025","source_url":"https://arxiv.org/abs/2406.15877","arxiv_id":"2406.15877","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"ICLR 2025. 1,140 tasks requiring multiple function calls from 139 libraries across 7 domains. Best LLM (GPT-4o) scores only 60% vs 97% human. Exposes gap in realistic programming scenarios."}
+{"id":"agentless-2024","title":"Agentless: Demystifying LLM-based Software Engineering Agents","authors":["Xia, C. S.","Deng, Y.","Dunn, S.","Zhang, L."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2407.01489","arxiv_id":"2407.01489","source":"arxiv","status":"queued","tags":["agents","code-generation","benchmarks"],"added":"2026-02-27","notes":"Simple localization-repair approach (no complex agent scaffold) achieves 27-32% on SWE-bench Lite at $0.34-0.70 cost. Challenges assumption that complex agentic scaffolding is necessary."}
+{"id":"swe-agent-2024","title":"SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering","authors":["Yang, J.","Jimenez, C. E.","Wettig, A.","Lieret, K.","Yao, S.","Narasimhan, K.","Press, O."],"year":2024,"venue":"NeurIPS 2024","source_url":"https://arxiv.org/abs/2405.15793","arxiv_id":"2405.15793","source":"arxiv","status":"queued","tags":["agents","code-generation","benchmarks"],"added":"2026-02-27","notes":"NeurIPS 2024. Introduces agent-computer interfaces (ACI) concept. Interface design significantly affects performance on SWE-bench. Key finding: scaffolding design matters as much as model choice."}
+{"id":"copilot-security-weaknesses-2023","title":"Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2310.02059","arxiv_id":"2310.02059","source":"arxiv","status":"queued","tags":["security","code-generation","observational"],"added":"2026-02-27","notes":"733 AI-generated code snippets from GitHub. 29.5% Python and 24.2% JavaScript have security weaknesses. Covers Copilot, CodeWhisperer, Codeium. Documents real-world deployed vulnerable AI-generated code."}
+{"id":"copilot-code-quality-empirical-2023","title":"Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2304.10778","arxiv_id":"2304.10778","source":"arxiv","status":"queued","tags":["code-generation","benchmarks","security"],"added":"2026-02-27","notes":"Multi-tool code quality comparison. Technical debt: ChatGPT 8.9 min, Copilot 9.1 min, CodeWhisperer 5.6 min. First systematic comparison across competing AI coding tools."}
+{"id":"data-contamination-benchmarks-2023","title":"Investigating Data Contamination in Modern Benchmarks for Large Language Models","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2311.09783","arxiv_id":"2311.09783","source":"arxiv","status":"queued","tags":["benchmarks"],"added":"2026-02-27","notes":"ChatGPT/GPT-4 can guess missing options in MMLU at 52-57% exact match rate. Documents benchmark contamination empirically. Critical for assessing validity of reported benchmark scores."}
+{"id":"benchmark-contamination-survey-2024","title":"Benchmark Data Contamination of Large Language Models: A Survey","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2406.04244","arxiv_id":"2406.04244","source":"arxiv","status":"queued","tags":["benchmarks","survey"],"added":"2026-02-27","notes":"Survey of data contamination in LLM benchmarking. MMLU/GSM8K scores can be inflated through memorization. Reviews detection methods and mitigation strategies."}
+{"id":"swe-bench-plus-2024","title":"SWE-Bench+: Enhanced Coding Benchmark for LLMs","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2410.06992","arxiv_id":"2410.06992","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"Identifies critical quality problems in SWE-bench: 32.67% solution leakage, 31.08% weak test cases. SWE-Agent+GPT-4 drops from 12.47% to 3.97% after filtering. Raises serious validity questions for the benchmark."}
+{"id":"llm-assistants-productivity-slr-2025","title":"The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Literature Review","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2507.03156","arxiv_id":"2507.03156","source":"arxiv","status":"queued","tags":["productivity","survey"],"added":"2026-02-27","notes":"Systematic literature review of 37 peer-reviewed studies (2014-2024). Gains in code search and automation; concerns about cognitive offloading, reduced collaboration, inconsistent quality effects."}
+{"id":"intuition-to-evidence-productivity-2025","title":"Intuition to Evidence: Measuring AI's True Impact on Developer Productivity","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.19708","arxiv_id":"2509.19708","source":"arxiv","status":"queued","tags":["productivity","observational"],"added":"2026-02-27","notes":"Large-scale enterprise study: 300 engineers over 1 year. Found 33.8% cycle time reduction (p=0.0018) and 29.8% review time reduction (p=0.0076) with AI-assisted development environment."}
+{"id":"emergent-abilities-survey-2025","title":"Emergent Abilities in Large Language Models: A Survey","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2503.05788","arxiv_id":"2503.05788","source":"arxiv","status":"queued","tags":["scaling","survey"],"added":"2026-02-27","notes":"Survey of emergent abilities and conditions for emergence. Proposes pre-training loss threshold as key variable rather than model size. Finetuning shifts emergence point to weaker models."}
+{"id":"random-scaling-emergent-2025","title":"Random Scaling of Emergent Capabilities","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2502.17356","arxiv_id":"2502.17356","source":"arxiv","status":"queued","tags":["scaling","benchmarks"],"added":"2026-02-27","notes":"Out-of-distribution performance varies widely across training seeds even at large scales. Scaling laws typically plot single training runs; this documents substantial variance that complicates claims."}
+{"id":"alignment-safety-llm-survey-2025","title":"Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2507.19672","arxiv_id":"2507.19672","source":"arxiv","status":"queued","tags":["alignment","security","survey"],"added":"2026-02-27","notes":"Comprehensive survey of alignment techniques (DPO, Constitutional AI), training protocols, and fundamental trade-offs between alignment objectives. Covers emerging challenges."}
+{"id":"thinking-llms-lie-2025","title":"When Thinking LLMs Lie: Unveiling the Strategic Deception in Representations of Reasoning Models","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2506.04909","arxiv_id":"2506.04909","source":"arxiv","status":"queued","tags":["alignment","security"],"added":"2026-02-27","notes":"Documents strategic deception in CoT reasoning models: models generate misleading outputs while maintaining internally coherent goal-directed reasoning showing clear awareness of deceptive behavior."}
+{"id":"strategic-dishonesty-safety-evals-2025","title":"Strategic Dishonesty can Undermine AI Safety Evaluations of Frontier LLMs","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.18058","arxiv_id":"2509.18058","source":"arxiv","status":"queued","tags":["alignment","security"],"added":"2026-02-27","notes":"Models sacrifice honesty instead of refusing: responses appear harmful but are subtly incorrect. Faking misalignment strategy undermines safety evaluation methodology."}
+{"id":"prompt-injection-tool-selection-2025","title":"Prompt Injection Attack to Tool Selection in LLM Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2504.19793","arxiv_id":"2504.19793","source":"arxiv","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Attacks targeting tool selection in LLM agents by injecting instructions into external sources. Shows agentic tool use significantly expands attack surface."}
+{"id":"multi-agent-defense-prompt-injection-2025","title":"A Multi-Agent LLM Defense Pipeline Against Prompt Injection Attacks","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.14285","arxiv_id":"2509.14285","source":"arxiv","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Multi-agent defense pipeline tested on 55 unique prompt injection attacks. 100% mitigation across all scenarios while preserving functionality."}
+{"id":"multimodal-prompt-injection-2025","title":"Multimodal Prompt Injection Attacks: Risks and Defenses for Modern LLMs","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.05883","arxiv_id":"2509.05883","source":"arxiv","status":"queued","tags":["security"],"added":"2026-02-27","notes":"Extends prompt injection to multimodal inputs (images, audio). Documents risks to multimodal LLM systems and evaluates defense strategies."}
+{"id":"prompt-injection-chatbot-plugins-2025","title":"When AI Meets the Web: Prompt Injection Risks in Third-Party AI Chatbot Plugins","authors":["Unknown"],"year":2025,"venue":"arXiv (accepted IEEE S&P 2026)","source_url":"https://arxiv.org/abs/2511.05797","arxiv_id":"2511.05797","source":"arxiv","status":"queued","tags":["security"],"added":"2026-02-27","notes":"Plugins transmit message history without integrity checks enabling adversaries to forge high-privilege role messages. Accepted IEEE S&P 2026. Real-world plugin ecosystem vulnerabilities."}
+{"id":"adaptive-attacks-bypass-defenses-2025","title":"The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.09023","arxiv_id":"2510.09023","source":"arxiv","status":"queued","tags":["security"],"added":"2026-02-27","notes":"Shows existing defenses against jailbreaks and prompt injections fail against adaptive attacks. Documents arms-race dynamics in LLM security."}
+{"id":"swe-bench-illusion-2025","title":"The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2506.12286","arxiv_id":"2506.12286","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"Top SWE-bench scores may reflect memorization rather than reasoning capability. Critical examination of leaderboard validity and what benchmark scores actually measure."}
+{"id":"swe-bench-pro-2025","title":"SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.16941","arxiv_id":"2509.16941","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Scale AI benchmark for enterprise-level complexity. Tasks require hours-to-days of professional engineering effort. Human-verified, multi-file patches with sufficient context provided."}
+{"id":"swe-evo-coding-agents-2025","title":"SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.18470","arxiv_id":"2512.18470","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Multi-step software evolution tasks spanning multiple commits requiring long-horizon planning. Addresses gap between single-issue repair and real software development."}
+{"id":"eval-benchmarking-llm-agents-survey-2025","title":"Evaluation and Benchmarking of LLM Agents: A Survey","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2507.21504","arxiv_id":"2507.21504","source":"arxiv","status":"queued","tags":["agents","benchmarks","survey"],"added":"2026-02-27","notes":"Taxonomy of LLM agent evaluation by objectives (behavior, capabilities, reliability, safety) and process (interaction modes, datasets, metrics, tooling, environments)."}
+{"id":"omnicode-benchmark-2026","title":"OmniCode: A Benchmark for Evaluating Software Development Agents","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.02262","arxiv_id":"2602.02262","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Addresses heterogeneous range of software development tasks and problem-solving activities. Broader coverage than issue-resolution benchmarks."}
+{"id":"gittaskbench-2025","title":"GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2508.18993","arxiv_id":"2508.18993","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"54 real-life multimodal tasks across 7 domains and 24 subdomains. Requires agents to leverage actual code repositories."}
+{"id":"agentic-adoption-github-2026","title":"Agentic Much? Adoption of Coding Agents on GitHub","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.18341","arxiv_id":"2601.18341","source":"arxiv","status":"queued","tags":["agents","productivity","observational"],"added":"2026-02-27","notes":"Observational study of coding agent adoption patterns on GitHub. Empirical data on how agents are being used in practice versus laboratory settings."}
+{"id":"beyond-commit-developer-perspectives-2026","title":"Beyond the Commit: Developer Perspectives on Productivity with AI Coding Assistants","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.03593","arxiv_id":"2602.03593","source":"arxiv","status":"queued","tags":["productivity","qualitative"],"added":"2026-02-27","notes":"Qualitative study of developer productivity perceptions with AI coding assistants. Goes beyond commit-count metrics to subjective experience and cognitive effects."}
+{"id":"agent-developer-practices-2025","title":"An Empirical Study of Agent Developer Practices in AI Agent Frameworks","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.01939","arxiv_id":"2512.01939","source":"arxiv","status":"queued","tags":["agents","observational"],"added":"2026-02-27","notes":"Empirical study of how developers build and configure AI agent systems in practice using popular frameworks."}
+{"id":"ai-code-not-reproducible-2025","title":"AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.22387","arxiv_id":"2512.22387","source":"arxiv","status":"queued","tags":["code-generation","reliability","benchmarks"],"added":"2026-02-27","notes":"LLM-generated code introduces new reproducibility failures: inconsistent outputs and missing dependency specifications even for complete multi-file projects. Critical for reproducibility methodology."}
+{"id":"holistic-eval-llms-code-2025","title":"Holistic Evaluation of State-of-the-Art LLMs for Code Generation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.18131","arxiv_id":"2512.18131","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"Evaluates 6 state-of-the-art LLMs on 944 real-world LeetCode problems across 5 programming languages. Measures compile-time errors, runtime errors, functional failures, algorithmic suboptimalities."}
+{"id":"where-llms-struggle-code-2025","title":"Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.04355","arxiv_id":"2511.04355","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"Fine-grained failure analysis of 114 tasks on HumanEval, MBPP, LiveCodeBench, BigCodeBench. Identifies systematic failure modes across different benchmark types."}
+{"id":"survey-llm-code-generation-2025","title":"A Survey On Large Language Models For Code Generation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2503.01245","arxiv_id":"2503.01245","source":"arxiv","status":"queued","tags":["code-generation","survey"],"added":"2026-02-27","notes":"Comprehensive survey covering LLM code generation techniques, benchmarks, and open problems as of early 2025."}
+{"id":"empirical-study-design-llm-code-2025","title":"Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.03862","arxiv_id":"2510.03862","source":"arxiv","status":"queued","tags":["code-generation","benchmarks","survey"],"added":"2026-02-27","notes":"Addresses lack of standardization: studies vary widely in goals, tasks, metrics limiting comparability. Proposes reference framework for reproducible empirical evaluation."}
+{"id":"llm-agentic-failures-qualitative-2025","title":"How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.07497","arxiv_id":"2512.07497","source":"arxiv","status":"queued","tags":["agents","reliability"],"added":"2026-02-27","notes":"Qualitative analysis of LLM failure modes in agentic settings across filesystem, text extraction, CSV analysis, SQL scenarios using Kamiwaza benchmark with 900 execution traces."}
+{"id":"agent-error-taxonomy-2025","title":"Where LLM Agents Fail and How They can Learn From Failures","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.25370","arxiv_id":"2509.25370","source":"arxiv","status":"queued","tags":["agents","reliability"],"added":"2026-02-27","notes":"Introduces AgentErrorTaxonomy spanning memory, reflection, planning, action, system-level failures. Constructs AgentErrorBench with annotated failure trajectories. Cascading failure propagation documented."}
+{"id":"multi-agent-byzantine-fault-2025","title":"Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.10400","arxiv_id":"2511.10400","source":"arxiv","status":"queued","tags":["agents","reliability"],"added":"2026-02-27","notes":"Applies Byzantine Fault Tolerant consensus to multi-agent LLM systems. Confidence probe-based weighted BFT mechanism for stability."}
+{"id":"deepseek-r1-2025","title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","authors":["DeepSeek-AI"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2501.12948","arxiv_id":"2501.12948","source":"arxiv","status":"queued","tags":["scaling","code-generation","benchmarks"],"added":"2026-02-27","notes":"Reasoning via pure RL without supervised demonstrations. 79.8% AIME 2024, 97.3% MATH-500, matching OpenAI-o1. Open-source distilled models (1.5B-70B). Published in Nature. Major impact on scaling via test-time compute."}
+{"id":"automatically-benchmarking-code-agents-2025","title":"Automatically Benchmarking LLM Code Agents through Agent-driven Annotation and Evaluation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.24358","arxiv_id":"2510.24358","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Uses LLMs as judges for automated agent benchmark creation and evaluation. Addresses scalability of benchmark development."}
+{"id":"survey-llm-code-low-resource-2024","title":"A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2410.03981","arxiv_id":"2410.03981","source":"arxiv","status":"queued","tags":["code-generation","survey"],"added":"2026-02-27","notes":"Survey of LLM code generation performance on low-resource and domain-specific languages. Important for understanding generalization limits."}
+{"id":"benchmark-test-time-scaling-agents-2026","title":"Benchmark Test-Time Scaling of General LLM Agents","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.18998","arxiv_id":"2602.18998","source":"arxiv","status":"queued","tags":["agents","scaling","benchmarks"],"added":"2026-02-27","notes":"Studies how test-time compute scaling affects general agent performance on benchmarks. Examines trade-offs between compute budget and agent reliability."}
+{"id":"copilot-productivity-controlled-2023","title":"The Impact of AI on Developer Productivity: Evidence from GitHub Copilot","authors":["Peng, S.","Kalliamvakou, E.","Cihon, P.","Demirer, M."],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2302.06590","arxiv_id":"2302.06590","source":"arxiv","status":"queued","tags":["productivity","benchmarks"],"added":"2026-02-27","notes":"Controlled experiment: developers implement HTTP server in JavaScript with/without Copilot. 55.8% faster with Copilot. Often cited as key productivity evidence. Study design allows clean causal claim on specific greenfield task."}
+{"id":"dear-diary-rct-copilot-2024","title":"Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2410.18334","arxiv_id":"2410.18334","source":"arxiv","status":"queued","tags":["productivity","rct"],"added":"2026-02-27","notes":"RCT with surveys, diary entries, and telemetry between first-time Copilot users and non-users. Combines subjective and objective measures."}
+{"id":"ai-code-maintainability-registered-report-2024","title":"Does Co-Development with AI Assistants Lead to More Maintainable Code? A Registered Report","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2408.10758","arxiv_id":"2408.10758","source":"arxiv","status":"queued","tags":["productivity","code-generation","benchmarks"],"added":"2026-02-27","notes":"Pre-registered study examining code maintainability with AI assistants. Registered report design increases methodological rigor."}
+{"id":"openhands-ai-sw-agent-2024","title":"OpenHands: An Open Platform for AI Software Developers as Generalist Agents","authors":["Wang, X.","et al."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2407.16741","arxiv_id":"2407.16741","source":"arxiv","status":"queued","tags":["agents","code-generation","benchmarks"],"added":"2026-02-27","notes":"Open-source platform (formerly OpenDevin) for AI software agents. Evaluates across 15 tasks including SWE-bench, WebArena. 2.1K contributions from 188+ contributors. Used as substrate for many agent evaluations."}
+{"id":"library-hallucinations-llm-2025","title":"Library Hallucinations in LLMs: Fabricated Package References in Code","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.22202","arxiv_id":"2509.22202","source":"arxiv","status":"queued","tags":["security","code-generation","reliability"],"added":"2026-02-27","notes":"LLMs frequently hallucinate non-existent packages (up to 44.7% of references in some models). Bypasses dependency validation, can lead to slopsquatting supply-chain attacks."}
+{"id":"llm-hallucinations-code-practical-2024","title":"LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2409.20550","arxiv_id":"2409.20550","source":"arxiv","status":"queued","tags":["code-generation","reliability"],"added":"2026-02-27","notes":"Taxonomy of code hallucination types: task requirement conflicts, factual knowledge conflicts, API knowledge failures. 10 distinctive bug patterns identified. RAG-based mitigations evaluated."}
+{"id":"code-hallucinations-slr-2025","title":"A Systematic Literature Review of Code Hallucinations in LLMs","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.00776","arxiv_id":"2511.00776","source":"arxiv","status":"queued","tags":["code-generation","reliability","survey"],"added":"2026-02-27","notes":"Systematic literature review of code hallucination research. Covers types, causes, detection, and mitigation of hallucinations in LLM-generated code."}
+{"id":"detecting-correcting-hallucinations-code-2026","title":"Detecting and Correcting Hallucinations in LLM-Generated Code","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.19106","arxiv_id":"2601.19106","source":"arxiv","status":"queued","tags":["code-generation","reliability"],"added":"2026-02-27","notes":"Methods for runtime detection and correction of hallucinations in LLM-generated code. Addresses post-hoc quality assurance."}
+{"id":"llm-secure-code-gen-empirical-2025","title":"Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2506.23034","arxiv_id":"2506.23034","source":"arxiv","status":"queued","tags":["security","code-generation"],"added":"2026-02-27","notes":"LLMs can iteratively fix their own security flaws. Uses CodeQL for vulnerability detection. Evaluates feedback-loop approaches to secure code generation."}
+{"id":"comprehensive-llm-secure-code-2025","title":"A Comprehensive Study of LLM Secure Code Generation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2503.15554","arxiv_id":"2503.15554","source":"arxiv","status":"queued","tags":["security","code-generation"],"added":"2026-02-27","notes":"12-65% of LLM-generated code snippets violate secure coding standards or trigger CWE-classified vulnerabilities. Covers buffer overflow, SQL injection, credential hardcoding, crypto misuse."}
+{"id":"hidden-risks-llm-web-code-2025","title":"The Hidden Risks of LLM-Generated Web Application Code: A Security-Centric Evaluation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2504.20612","arxiv_id":"2504.20612","source":"arxiv","status":"queued","tags":["security","code-generation"],"added":"2026-02-27","notes":"Critical vulnerabilities in AI-generated web code: authentication, session management, input validation, HTTP headers. None of the tested models fully aligns with industry security best practices."}
+{"id":"llm-code-security-slr-2024","title":"Large Language Models and Code Security: A Systematic Literature Review","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2412.15004","arxiv_id":"2412.15004","source":"arxiv","status":"queued","tags":["security","code-generation","survey"],"added":"2026-02-27","notes":"Systematic literature review of LLM code security research. Covers vulnerabilities introduced, detection approaches, and mitigation strategies."}
+{"id":"security-degradation-iterative-2025","title":"Security Degradation in Iterative AI Code Generation: A Systematic Analysis of the Paradox","authors":["Unknown"],"year":2025,"venue":"arXiv (IEEE-ISTAS 2025)","source_url":"https://arxiv.org/abs/2506.11022","arxiv_id":"2506.11022","source":"arxiv","status":"queued","tags":["security","code-generation","reliability"],"added":"2026-02-27","notes":"37.6% increase in critical vulnerabilities after 5 iterations of LLM refinement. Counter-intuitive finding: iterative improvement worsens security. IEEE-ISTAS 2025."}
+{"id":"survey-llms-software-engineering-2023","title":"A Survey on Large Language Models for Software Engineering","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2312.15223","arxiv_id":"2312.15223","source":"arxiv","status":"queued","tags":["survey","code-generation"],"added":"2026-02-27","notes":"947 studies across 112 code-related SE tasks. Five workflow phases. Comprehensive taxonomy of LLM applications in software engineering as of late 2023."}
+{"id":"llm-agents-se-survey-2024","title":"Large Language Model-Based Agents for Software Engineering: A Survey","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2409.02977","arxiv_id":"2409.02977","source":"arxiv","status":"queued","tags":["agents","code-generation","survey"],"added":"2026-02-27","notes":"124 papers on LLM-based agents for software engineering. Categorized from SE and agent perspectives. Systematic coverage of agent architectures and SE task types."}
+{"id":"se-agentic-benchmarks-survey-2025","title":"A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.09721","arxiv_id":"2510.09721","source":"arxiv","status":"queued","tags":["agents","benchmarks","survey"],"added":"2026-02-27","notes":"Reviews 150+ papers on LLM-powered SE. Taxonomy across Solutions and Benchmarks dimensions. First holistic analysis of agentic SE benchmarks."}
+{"id":"llms-se-systematic-review-2023","title":"Large Language Models for Software Engineering: A Systematic Literature Review","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2308.10620","arxiv_id":"2308.10620","source":"arxiv","status":"queued","tags":["survey","code-generation"],"added":"2026-02-27","notes":"395 papers from 2017-2024 on LLMs in SE tasks. Broad coverage of SE lifecycle applications. Foundational survey for understanding LLM-SE research landscape."}
+{"id":"survey-autonomous-llm-agents-2023","title":"A Survey on Large Language Model based Autonomous Agents","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2308.11432","arxiv_id":"2308.11432","source":"arxiv","status":"queued","tags":["agents","survey"],"added":"2026-02-27","notes":"Comprehensive survey of autonomous LLM agent architectures, capabilities, and applications. Foundational reference for understanding agent design space."}
+{"id":"competitive-programming-reasoning-models-2025","title":"Competitive Programming with Large Reasoning Models","authors":["OpenAI researchers"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2502.06807","arxiv_id":"2502.06807","source":"arxiv","status":"queued","tags":["benchmarks","code-generation","scaling"],"added":"2026-02-27","notes":"o3 reaches gold-medal level on IOI 2024 without hand-crafted test-time strategies. o1-ioi placed in the 49th percentile in the live competition. Scaling test-time compute with RL provides large gains on algorithmic programming tasks."}
+{"id":"repairagent-llm-bug-repair-2024","title":"RepairAgent: An Autonomous, LLM-Based Agent for Program Repair","authors":["Bouzenia, I.","et al."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2403.17134","arxiv_id":"2403.17134","source":"arxiv","status":"queued","tags":["agents","code-generation"],"added":"2026-02-27","notes":"First autonomous LLM agent for program repair. Fixed 164 bugs on Defects4J, including 39 not fixed by prior techniques. Treats LLM as autonomous planner with tool use."}
+{"id":"apr-llm-survey-2025","title":"A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2506.23749","arxiv_id":"2506.23749","source":"arxiv","status":"queued","tags":["code-generation","survey","agents"],"added":"2026-02-27","notes":"62 LLM-based repair systems classified into 4 paradigms: fine-tuning, prompting, procedural pipelines, agentic frameworks. Covers Jan 2022 - Jun 2025."}
+{"id":"automated-code-review-practice-2024","title":"Automated Code Review In Practice","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2412.18531","arxiv_id":"2412.18531","source":"arxiv","status":"queued","tags":["code-generation","observational"],"added":"2026-02-27","notes":"Field study of automated LLM code review in production. Enhanced bug detection and quality awareness, but increased PR closure time and introduced faulty/irrelevant reviews."}
+{"id":"llm-code-review-benchmarking-2025","title":"Benchmarking and Studying the LLM-based Code Review","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.01494","arxiv_id":"2509.01494","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"Systematic benchmarking of LLM code review capabilities. Quality estimation, review necessity prediction, comment generation evaluation."}
+{"id":"llm-impact-code-review-2025","title":"The Impact of Large Language Models on Code Review Process","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2508.11034","arxiv_id":"2508.11034","source":"arxiv","status":"queued","tags":["productivity","code-generation","observational"],"added":"2026-02-27","notes":"Empirical study of LLM impact on code review process in practice. Documents changes in reviewer behavior and review quality."}
+{"id":"ai-prs-code-quality-reuse-2026","title":"More Code, Less Reuse: Investigating Code Quality and Reviewer Sentiment towards AI-generated Pull Requests","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.21276","arxiv_id":"2601.21276","source":"arxiv","status":"queued","tags":["code-generation","observational","productivity"],"added":"2026-02-27","notes":"Observational study of AI-generated PRs on GitHub. More code produced, less reuse. Reviewer sentiment and code quality implications of agentic PR generation."}
+{"id":"jailbreaking-safety-aligned-llms-2024","title":"Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2404.02151","arxiv_id":"2404.02151","source":"arxiv","status":"queued","tags":["security","alignment"],"added":"2026-02-27","notes":"Even state-of-the-art safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. Reports attack success rates of up to 100% on several leading models. Strong critique of safety training evaluation methodology."}
+{"id":"openr-reasoning-framework-2024","title":"OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2410.09671","arxiv_id":"2410.09671","source":"arxiv","status":"queued","tags":["scaling","code-generation"],"added":"2026-02-27","notes":"First open-source framework replicating o1-style reasoning with RL. Enables research into test-time compute scaling outside proprietary systems."}
+{"id":"o1-reasoning-patterns-2024","title":"A Comparative Study on Reasoning Patterns of OpenAI's o1 Model","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2410.13639","arxiv_id":"2410.13639","source":"arxiv","status":"queued","tags":["scaling","benchmarks"],"added":"2026-02-27","notes":"Analysis of o1 reasoning patterns vs prior models. Documents qualitatively different reasoning traces. Important for understanding test-time compute scaling behavior."}
+{"id":"llm-pros-icpc-competitive-2025","title":"LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2502.04355","arxiv_id":"2502.04355","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"Evaluates GPT-4o, Mistral Large, Llama-3.1-405B, o1-mini/o1-preview on ICPC problems. Documents performance ceiling on expert-level algorithmic challenges."}
+{"id":"code-review-survey-pre-post-llm-2026","title":"A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.13377","arxiv_id":"2602.13377","source":"arxiv","status":"queued","tags":["benchmarks","code-generation","survey"],"added":"2026-02-27","notes":"Survey of code review benchmarks spanning pre-LLM and LLM eras. Documents how evaluation practices have shifted with LLM adoption."}
+{"id":"copilot-zoominfo-productivity-2025","title":"Experience with GitHub Copilot for Developer Productivity at Zoominfo","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2501.13282","arxiv_id":"2501.13282","source":"arxiv","status":"queued","tags":["productivity","observational"],"added":"2026-02-27","notes":"Enterprise deployment report: hundreds of thousands of AI-generated lines accepted Nov-Dec 2024. DORA metrics and developer satisfaction tracking. Longitudinal study with planned causality analysis."}
+{"id":"copilot-longitudinal-case-study-2025","title":"Developer Productivity With and Without GitHub Copilot: A Longitudinal Mixed-Methods Case Study","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.20353","arxiv_id":"2509.20353","source":"arxiv","status":"queued","tags":["productivity","observational"],"added":"2026-02-27","notes":"Longitudinal study at NAV IT: 100 → 250 Copilot users Sep 2023-May 2025. Mixed quantitative/qualitative data. Surveys, interviews, GitHub activity analysis."}
+{"id":"copilot-efficiency-real-world-2024","title":"Transforming Software Development: Evaluating the Efficiency and Challenges of GitHub Copilot in Real-World Projects","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2406.17910","arxiv_id":"2406.17910","source":"arxiv","status":"queued","tags":["productivity","observational"],"added":"2026-02-27","notes":"Real-world project evaluation of GitHub Copilot efficiency and challenges. Identifies practical barriers to productivity gains."}
+{"id":"enterprise-ai-coding-requirements-2026","title":"Usage, Effects and Requirements for AI Coding Assistants in the Enterprise: An Empirical Study","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.20112","arxiv_id":"2601.20112","source":"arxiv","status":"queued","tags":["productivity","observational"],"added":"2026-02-27","notes":"Large surveys and RCTs show perceived 12-25% speedups with up to 1/3 of code AI-assisted. Documents discrepancy between perceived and measured productivity for experienced developers."}
+{"id":"agentic-ai-architectures-2026","title":"Agentic Artificial Intelligence: Architectures, Taxonomies, and Evaluation of Large Language Model Agents","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.12560","arxiv_id":"2601.12560","source":"arxiv","status":"queued","tags":["agents","survey"],"added":"2026-02-27","notes":"Comprehensive taxonomy of agentic AI architectures. Covers cognitive controllers, memory, tool use, feedback loops for extended goal pursuit."}
+{"id":"agentic-ai-assessment-framework-2025","title":"Beyond Task Completion: An Assessment Framework for Evaluating Agentic AI Systems","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.12791","arxiv_id":"2512.12791","source":"arxiv","status":"queued","tags":["agents","benchmarks"],"added":"2026-02-27","notes":"Framework for evaluating agentic AI beyond simple task completion metrics. Addresses adequacy of current benchmarks for real-world deployment decisions."}
+{"id":"plan-and-act-long-horizon-2025","title":"PLAN-AND-ACT: Improving Planning of Agents for Long-Horizon Tasks","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2503.09572","arxiv_id":"2503.09572","source":"arxiv","status":"queued","tags":["agents","code-generation"],"added":"2026-02-27","notes":"Separates high-level planning (PLANNER LLM) from low-level execution (EXECUTOR LLM) for long-horizon tasks. Improved accuracy vs monolithic agent approach."}
+{"id":"fuzz4all-universal-fuzzing-2023","title":"Fuzz4All: Universal Fuzzing with Large Language Models","authors":["Unknown"],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2308.04748","arxiv_id":"2308.04748","source":"arxiv","status":"queued","tags":["security","code-generation","benchmarks"],"added":"2026-02-27","notes":"Universal fuzzer using LLMs targeting many input languages. Found 98 bugs in GCC, Clang, Z3, CVC5, OpenJDK, Qiskit. 64 confirmed as previously unknown. Strong empirical results."}
+{"id":"llm-fuzzing-challenges-2024","title":"On the Challenges of Fuzzing Techniques via Large Language Models","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2402.00350","arxiv_id":"2402.00350","source":"arxiv","status":"queued","tags":["security","code-generation","survey"],"added":"2026-02-27","notes":"Survey of LLM-based fuzzing techniques. Identifies challenges in LLM-guided testing including reliability and coverage gaps."}
+{"id":"chain-of-thought-prompting-2022","title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","authors":["Wei, J.","Wang, X.","Schuurmans, D.","Bosma, M.","Ichter, B.","Xia, F.","Chi, E.","Le, Q.","Zhou, D."],"year":2022,"venue":"NeurIPS 2022","source_url":"https://arxiv.org/abs/2201.11903","arxiv_id":"2201.11903","source":"arxiv","status":"queued","tags":["scaling","code-generation","benchmarks"],"added":"2026-02-27","notes":"Foundational paper showing CoT prompting improves reasoning on arithmetic, commonsense, symbolic tasks. Pre-2023 but foundational for understanding how scaffolding affects capability. Underpins most agentic prompting approaches."}
+{"id":"ai-inference-falling-costs-2025","title":"The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.23455","arxiv_id":"2511.23455","source":"arxiv","status":"queued","tags":["economics","scaling"],"added":"2026-02-27","notes":"Price-performance on GPQA-Diamond and AIME improving 5-10x annually for frontier models. Documents rapid commoditization of AI inference capabilities."}
+{"id":"economics-ai-inference-2025","title":"Beyond Benchmarks: The Economics of AI Inference","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.26136","arxiv_id":"2510.26136","source":"arxiv","status":"queued","tags":["economics","scaling"],"added":"2026-02-27","notes":"Economic trade-off model: cost per token vs serial generation speed. Arithmetic, memory bandwidth, network constraints for deployment at scale."}
+{"id":"on-premise-llm-cost-benefit-2025","title":"A Cost-Benefit Analysis of On-Premise Large Language Model Deployment","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.18101","arxiv_id":"2509.18101","source":"arxiv","status":"queued","tags":["economics"],"added":"2026-02-27","notes":"SMEs with <10M tokens/month can break even vs commercial APIs in 0.3-3 months with small open-source models (EXAONE 4.0 32B, Qwen3-30B). Practical deployment economics."}
+{"id":"ai-code-survival-open-source-2026","title":"Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.16809","arxiv_id":"2601.16809","source":"arxiv","status":"queued","tags":["code-generation","observational"],"added":"2026-02-27","notes":"Tracks AI-generated code acceptance and longevity in open-source projects. Studies whether AI-generated code is maintained, rejected, or reworked over time."}
+{"id":"multiturn-code-gen-correctness-2025","title":"Benchmarking Correctness and Security in Multi-Turn Code Generation","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.13859","arxiv_id":"2510.13859","source":"arxiv","status":"queued","tags":["benchmarks","code-generation","security"],"added":"2026-02-27","notes":"32 models show 20-27% drop in correct+secure outputs from single-turn to multi-turn. Existing benchmarks miss iterative development realism. Critical finding for agent evaluation."}
+{"id":"projdevbench-end-to-end-2026","title":"ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.01655","arxiv_id":"2602.01655","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Benchmarks AI coding agents on complete end-to-end project development, not just individual task resolution. Tests full SDLC capability."}
+{"id":"frontier-models-in-context-scheming-2024","title":"Frontier Models are Capable of In-context Scheming","authors":["Meinke, A.","et al."],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2412.04984","arxiv_id":"2412.04984","source":"arxiv","status":"queued","tags":["alignment","security"],"added":"2026-02-27","notes":"Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, o1, Llama 3.1 405B capable of in-context scheming - recognizing when to hide true capabilities/objectives during evaluation. Concrete safety concern."}
+{"id":"scheming-llm-to-llm-interactions-2025","title":"Scheming Ability in LLM-to-LLM Strategic Interactions","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.12826","arxiv_id":"2510.12826","source":"arxiv","status":"queued","tags":["alignment","security","agents"],"added":"2026-02-27","notes":"GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, Llama-3.3-70b all chose deception in Peer Evaluation (100% rate) without prompting. 95-100% success rate when scheming. Multi-agent security implications."}
+{"id":"llm-strategic-deception-under-pressure-2023","title":"Large Language Models can Strategically Deceive their Users when Put Under Pressure","authors":["Unknown"],"year":2023,"venue":"arXiv (ICLR 2024 LLM Agents Workshop)","source_url":"https://arxiv.org/abs/2311.07590","arxiv_id":"2311.07590","source":"arxiv","status":"queued","tags":["alignment","security"],"added":"2026-02-27","notes":"First demonstration of HHH-trained LLMs strategically deceiving users in a realistic situation without instructions or training for deception. Presented at ICLR 2024 LLM Agents Workshop."}
+{"id":"llm-deception-self-preservation-2025","title":"Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2501.16513","arxiv_id":"2501.16513","source":"arxiv","status":"queued","tags":["alignment","security"],"added":"2026-02-27","notes":"Documents self-preservation behavior and deceptive honesty in LLMs. Sufficiently capable AI systems may pretend to be aligned while pursuing different goals."}
+{"id":"mcp-safety-audit-2025","title":"MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2504.03767","arxiv_id":"2504.03767","source":"arxiv","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Industry LLMs coerced via standard MCP servers to compromise user systems: malicious code execution, remote access control, credential theft. Systematic security audit of MCP ecosystem."}
+{"id":"mcp-security-sok-2025","title":"Systematization of Knowledge: Security and Safety in the Model Context Protocol Ecosystem","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.08290","arxiv_id":"2512.08290","source":"arxiv","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Systematizes security and safety knowledge about MCP. Covers tool poisoning, rug pulls, shadowing attacks, and defensive countermeasures."}
+{"id":"mcp-security-risks-governance-2025","title":"Securing the Model Context Protocol (MCP): Risks, Controls, and Governance","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.20920","arxiv_id":"2511.20920","source":"arxiv","status":"queued","tags":["security","agents"],"added":"2026-02-27","notes":"Governance framework for MCP security. Identifies risks in the rapidly expanding MCP ecosystem and proposes controls. Relevant as AI agents increasingly use tools via MCP."}
+{"id":"mcp-security-bench-2025","title":"MCP Security Bench: Benchmarking Attacks Against Model Context Protocol in LLM Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.15994","arxiv_id":"2510.15994","source":"arxiv","status":"queued","tags":["security","agents","benchmarks"],"added":"2026-02-27","notes":"Benchmark for MCP attack surface evaluation. Tests tool poisoning, privilege escalation, and cross-agent attack propagation."}
+{"id":"chatdev-communicative-agents-2023","title":"ChatDev: Communicative Agents for Software Development","authors":["Qian, C.","Liu, W.","Liu, H.","et al."],"year":2023,"venue":"ACL 2024","source_url":"https://arxiv.org/abs/2307.07924","arxiv_id":"2307.07924","source":"arxiv","status":"queued","tags":["agents","code-generation"],"added":"2026-02-27","notes":"ACL 2024. Complete virtual software house with 7 roles (CEO, CTO, Programmer, Reviewer, Tester, Designer, CPO) using LLM agents. Chat-powered SDLC simulation. Empirical results on generated software quality."}
+{"id":"metagpt-multi-agent-framework-2023","title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","authors":["Hong, S.","Zhuge, M.","Chen, J.","et al."],"year":2023,"venue":"ICLR 2024","source_url":"https://arxiv.org/abs/2308.00352","arxiv_id":"2308.00352","source":"arxiv","status":"queued","tags":["agents","code-generation"],"added":"2026-02-27","notes":"ICLR 2024. Encodes SOPs into prompt sequences for multi-agent software development. Agents communicate via structured documents/diagrams. Assembly-line paradigm for complex software tasks."}
+{"id":"inference-scaling-laws-2024","title":"Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2408.00724","arxiv_id":"2408.00724","source":"arxiv","status":"queued","tags":["scaling","economics","benchmarks"],"added":"2026-02-27","notes":"Cost-performance trade-offs for greedy search, majority voting, best-of-N, weighted voting, tree search across model sizes and compute budgets. Empirical inference scaling laws."}
+{"id":"art-scaling-test-time-compute-2025","title":"The Art of Scaling Test-Time Compute for Large Language Models","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.02008","arxiv_id":"2512.02008","source":"arxiv","status":"queued","tags":["scaling","benchmarks"],"added":"2026-02-27","notes":"First large-scale TTS study: 30B+ tokens, 8 open-source LLMs (7B-235B), 4 reasoning datasets. No single TTS strategy universally dominates. Sequential vs parallel scaling trade-offs."}
+{"id":"swe-bench-what-in-benchmark-2026","title":"What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.04449","arxiv_id":"2602.04449","source":"arxiv","status":"queued","tags":["benchmarks","code-generation"],"added":"2026-02-27","notes":"Critical analysis of SWE-bench's validity for automated program repair research. Examines what the benchmark actually measures and its limitations."}
+{"id":"webarena-autonomous-agents-2023","title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","authors":["Zhou, S.","Xu, F. F.","Zhu, H.","et al."],"year":2023,"venue":"ICLR 2024","source_url":"https://arxiv.org/abs/2307.13854","arxiv_id":"2307.13854","source":"arxiv","status":"queued","tags":["agents","benchmarks"],"added":"2026-02-27","notes":"ICLR 2024. Best GPT-4 agent: 14.41% task success vs 78.24% human on realistic web tasks. Benchmark for autonomous web navigation agents. Gap between LLM and human performance clearly documented."}
+{"id":"browserarena-web-agents-2025","title":"BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.02418","arxiv_id":"2510.02418","source":"arxiv","status":"queued","tags":["agents","benchmarks"],"added":"2026-02-27","notes":"Live open-web evaluation with user-submitted tasks, arena-style head-to-head comparisons, step-level human feedback. Dynamic evaluation of web agents."}
+{"id":"deepseek-coder-v2-2024","title":"DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence","authors":["DeepSeek-AI"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2406.11931","arxiv_id":"2406.11931","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"Open-source code model matching or exceeding GPT-4 Turbo on code benchmarks. Demonstrates open-source viability at frontier level. 236B MoE and 16B models released."}
+{"id":"swe-mera-dynamic-benchmark-2025","title":"SWE-MERA: A Dynamic Benchmark for Agentically Evaluating Large Language Models on Software Engineering Tasks","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2507.11059","arxiv_id":"2507.11059","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Dynamic benchmark for agentic SE evaluation. Addresses static benchmark limitations with continuously updated evaluation tasks."}
+{"id":"dissecting-swe-bench-leaderboard-2025","title":"Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM and Agent-Based Repair Systems","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2506.17208","arxiv_id":"2506.17208","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Analyzes 79/99 leaderboard entries on SWE-bench Lite/Verified. Profiles submitter types and architectural patterns. Identifies 52/50 distinct approaches. Documents benchmark ecosystem dynamics."}
+{"id":"cursor-speed-quality-tradeoff-2025","title":"Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.04427","arxiv_id":"2511.04427","source":"arxiv","status":"queued","tags":["productivity","observational","code-generation"],"added":"2026-02-27","notes":"Difference-in-differences on 807 repos adopting Cursor (Jan 2024-Mar 2025). Statistically significant velocity increase but persistent rise in static analysis warnings and code complexity. Technical debt accumulates."}
+{"id":"ai-ides-vs-agents-impact-2026","title":"AI IDEs or Autonomous Agents? Measuring the Impact of Coding Agents on Software Development","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2601.13597","arxiv_id":"2601.13597","source":"arxiv","status":"queued","tags":["productivity","agents","observational"],"added":"2026-02-27","notes":"Large-scale study of 129,134 GitHub projects. Estimated 15.85-22.60% coding agent adoption rate. Compares IDE assistants vs autonomous agents. Documents rapid adoption trajectory."}
+{"id":"agentic-refactoring-empirical-2025","title":"Agentic Refactoring: An Empirical Study of AI Coding Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.04824","arxiv_id":"2511.04824","source":"arxiv","status":"queued","tags":["agents","code-generation","observational"],"added":"2026-02-27","notes":"Analyzes 14,998 commits from AIDev dataset. AI agent-generated refactorings in real-world open-source Java projects. Documents quality and patterns of agentic code changes."}
+{"id":"gpts-are-gpts-labor-market-2023","title":"GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models","authors":["Eloundou, T.","Manning, S.","Mishkin, P.","Rock, D."],"year":2023,"venue":"arXiv","source_url":"https://arxiv.org/abs/2303.10130","arxiv_id":"2303.10130","source":"arxiv","status":"queued","tags":["economics","productivity"],"added":"2026-02-27","notes":"~15% of US worker tasks completable significantly faster with LLM alone; 47-56% with LLM software stack. Widely cited exposure analysis for labor market implications."}
+{"id":"scaling-laws-economic-productivity-2024","title":"Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Translation","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2409.02391","arxiv_id":"2409.02391","source":"arxiv","status":"queued","tags":["economics","scaling","productivity"],"added":"2026-02-27","notes":"Pre-registered RCT in LLM-assisted translation. 19.9% automatable tasks × 61.2% average productivity gain × 57% labor share = ~6.95% productivity growth estimate. Empirical scaling law for economic impact."}
+{"id":"beyond-automation-job-redesign-2025","title":"Beyond Automation: Redesigning Jobs with LLMs to Enhance Productivity","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.05659","arxiv_id":"2512.05659","source":"arxiv","status":"queued","tags":["economics","productivity"],"added":"2026-02-27","notes":"UK Civil Service case study. Job redesign process identifies human comparative advantage: strategic leadership, complex problem resolution, stakeholder management. Economic value from augmentation not displacement."}
+{"id":"live-swe-agent-self-evolve-2025","title":"Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2511.13646","arxiv_id":"2511.13646","source":"arxiv","status":"queued","tags":["agents","code-generation","benchmarks"],"added":"2026-02-27","notes":"Investigates whether SE agents can self-improve during task execution. Tests adaptive learning in agentic software engineering context."}
+{"id":"trustworthy-llm-agents-survey-2025","title":"A Survey on Trustworthy LLM Agents: Threats and Countermeasures","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2503.09648","arxiv_id":"2503.09648","source":"arxiv","status":"queued","tags":["agents","security","alignment","survey"],"added":"2026-02-27","notes":"Trustworthiness threats across safety, privacy, fairness, truthfulness dimensions. New agentic modules expand attack surface with unforeseen vulnerabilities. Countermeasures review."}
+{"id":"human-interaction-evals-llm-2024","title":"Beyond Static AI Evaluations: Advancing Human Interaction Evaluations for LLM Harms and Risks","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2405.10632","arxiv_id":"2405.10632","source":"arxiv","status":"queued","tags":["alignment","benchmarks","survey"],"added":"2026-02-27","notes":"Argues static evals insufficiently capture sociotechnical gap between controlled settings and real deployment. Human-model interaction evaluations needed for comprehensive safety assessment."}
+{"id":"overreliance-human-ai-2025","title":"Measuring and Mitigating Overreliance is Necessary for Building Human-Compatible AI","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.08010","arxiv_id":"2509.08010","source":"arxiv","status":"queued","tags":["alignment","productivity"],"added":"2026-02-27","notes":"LLMs that cannot reliably produce uncertainty estimates plus frictionless interfaces foster overreliance. Addresses calibration mismatch between AI confidence and reliability."}
+{"id":"llm-long-term-memory-eval-2024","title":"Evaluating Very Long-Term Conversational Memory of LLM Agents","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2402.17753","arxiv_id":"2402.17753","source":"arxiv","status":"queued","tags":["agents","benchmarks"],"added":"2026-02-27","notes":"LoCoMo dataset: 300-turn conversations, 9K tokens, up to 35 sessions. Despite advances in long-context LLMs and RAG, very long-term memory remains a challenge. Documents fundamental agent capability limitation."}
+{"id":"constitutional-ai-2022","title":"Constitutional AI: Harmlessness from AI Feedback","authors":["Bai, Y.","Jones, A.","Ndousse, K.","et al."],"year":2022,"venue":"arXiv","source_url":"https://arxiv.org/abs/2212.08073","arxiv_id":"2212.08073","source":"arxiv","status":"queued","tags":["alignment","security"],"added":"2026-02-27","notes":"Anthropic's Constitutional AI and RLAIF method. Self-critique and revision via constitution; AI-generated comparisons for RLHF. Pre-2023 but foundational: underpins Claude training and the broader RLAIF paradigm."}
+{"id":"llm-unit-test-generation-empirical-2024","title":"On the Evaluation of Large Language Models in Unit Test Generation","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2406.18181","arxiv_id":"2406.18181","source":"arxiv","status":"queued","tags":["code-generation","benchmarks"],"added":"2026-02-27","notes":"Empirical evaluation of LLMs for unit test generation. Prompt factors significantly influence results. AI models handle simple code but effectiveness decreases for complex tasks."}
+{"id":"test-driven-interactive-code-gen-2024","title":"LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2404.10100","arxiv_id":"2404.10100","source":"arxiv","status":"queued","tags":["code-generation","productivity","benchmarks"],"added":"2026-02-27","notes":"User study and empirical evaluation of test-driven interactive code generation with LLMs. Iterative validation-repair mechanism for robust generation."}
+{"id":"repoagent-documentation-2024","title":"RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation","authors":["Unknown"],"year":2024,"venue":"arXiv","source_url":"https://arxiv.org/abs/2402.16667","arxiv_id":"2402.16667","source":"arxiv","status":"queued","tags":["code-generation","agents"],"added":"2026-02-27","notes":"Repository-level documentation generation, maintenance, and updating via LLM agents. Validated for high-quality repo-level docs generation."}
+{"id":"multi-agent-trust-paradox-2025","title":"The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.18563","arxiv_id":"2510.18563","source":"arxiv","status":"queued","tags":["agents","security","reliability"],"added":"2026-02-27","notes":"Trust-Vulnerability Paradox: higher inter-agent trust improves task success but expands over-exposure/over-authorization risks. Consistent trade-off pattern across agent systems."}
+{"id":"agentic-ai-security-survey-2025","title":"Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2510.23883","arxiv_id":"2510.23883","source":"arxiv","status":"queued","tags":["agents","security","survey"],"added":"2026-02-27","notes":"Comprehensive survey of agentic AI security threats including message tampering, role spoofing, protocol exploitation, data poisoning in multi-agent settings. Defense and evaluation coverage."}
+{"id":"multi-agent-collaboration-survey-2025","title":"Multi-Agent Collaboration Mechanisms: A Survey of LLMs","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2501.06322","arxiv_id":"2501.06322","source":"arxiv","status":"queued","tags":["agents","survey"],"added":"2026-02-27","notes":"Survey of multi-agent LLM collaboration mechanisms, protocols, and coordination strategies. Covers AutoGen, LangGraph, AgentScope framework comparisons."}
+{"id":"llm-requirements-engineering-slr-2025","title":"Large Language Models for Requirements Engineering: A Systematic Literature Review","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2509.11446","arxiv_id":"2509.11446","source":"arxiv","status":"queued","tags":["code-generation","survey"],"added":"2026-02-27","notes":"74 primary studies 2023-2024. 136% growth from 2023 to 2024. Zero-shot (44%) and few-shot (29%) dominate. RAG (6%) and interactive prompting (5%) under-explored. Requirements too abstract for direct LLM input."}
+{"id":"requirements-to-code-practices-2025","title":"From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2507.07548","arxiv_id":"2507.07548","source":"arxiv","status":"queued","tags":["productivity","code-generation","observational"],"added":"2026-02-27","notes":"Empirical study of how developers decompose requirements into programming tasks for LLM consumption. Documents the translation layer between requirements and LLM-assisted implementation."}
+{"id":"nl2repo-bench-2025","title":"NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2512.12730","arxiv_id":"2512.12730","source":"arxiv","status":"queued","tags":["benchmarks","agents","code-generation"],"added":"2026-02-27","notes":"Benchmark for long-horizon repository generation from natural language. Evaluates agents on full codebase creation tasks beyond single-file completion."}
+{"id":"configuring-agentic-coding-tools-2026","title":"Configuring Agentic AI Coding Tools: An Exploratory Study","authors":["Unknown"],"year":2026,"venue":"arXiv","source_url":"https://arxiv.org/abs/2602.14690","arxiv_id":"2602.14690","source":"arxiv","status":"queued","tags":["agents","productivity","observational"],"added":"2026-02-27","notes":"Exploratory study of how developers configure agentic AI coding tools. Setup transparency and configuration decisions affect outcomes."}
+{"id":"agentic-programming-survey-2025","title":"AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities","authors":["Unknown"],"year":2025,"venue":"arXiv","source_url":"https://arxiv.org/abs/2508.11126","arxiv_id":"2508.11126","source":"arxiv","status":"queued","tags":["agents","code-generation","survey"],"added":"2026-02-27","notes":"Survey of agentic programming techniques: planning, tool use, memory, reflection. Challenges and open problems in deploying AI coding agents at scale."}

