ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

commit 736a50a032a47708cf7293b93076df2b494eb27b
parent fbc3c552e124c8c6c91d532e531bbc6f81f4d957
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Mon, 30 Mar 2026 20:17:11 +0200

Classify all 1,206 papers by type via Haiku

Distribution:
  empirical           816 (69%)
  benchmark-creation  155 (13%)
  survey              102 (9%)
  position             63 (5%)
  theoretical          49 (4%)
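
The table above can be reproduced from a flat list of labels. The following is a minimal sketch, not the script used for this commit: the function name `distribution` is hypothetical, and the counts are copied from the table. Note that the five listed counts sum to 1,185, and the rounded percentages are relative to that sum.

```python
from collections import Counter


def distribution(labels):
    """Return (label, count, percent) rows sorted by descending count.

    Hypothetical helper for illustration; percent is rounded to a whole number.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, round(100 * n / total)) for label, n in counts.most_common()]


# Counts copied from the commit message; expanded into one label per paper.
reported = {
    "empirical": 816,
    "benchmark-creation": 155,
    "survey": 102,
    "position": 63,
    "theoretical": 49,
}
labels = [label for label, n in reported.items() for _ in range(n)]

rows = distribution(labels)
for label, n, pct in rows:
    print(f"  {label:<20}{n:>4} ({pct}%)")
```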

Spot-checked: Tao→position, METR→empirical, SWE-Bench→benchmark-creation,
HalluLens→benchmark-creation, multi-agent survey→survey. All correct.

Separate paper_type.json files, non-destructive to scan data.
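
One way to keep classification non-destructive is to write the label to its own sidecar file and never open the scan data for writing. The sketch below assumes a single `paper_type` key and a scan file named `scan.json`; both names are assumptions for illustration, since the actual schema and file names are not shown in this commit.

```python
import json
import tempfile
from pathlib import Path


def write_paper_type(paper_dir, paper_type):
    """Write paper_type.json inside paper_dir, leaving all other files untouched.

    The {"paper_type": ...} layout is an assumed schema, not the repo's confirmed one.
    """
    target = Path(paper_dir) / "paper_type.json"
    target.write_text(json.dumps({"paper_type": paper_type}, indent=2) + "\n")
    return target


# Demo in a throwaway directory so nothing real is modified.
with tempfile.TemporaryDirectory() as root:
    paper = Path(root) / "papers" / "swe-bench-2023"
    paper.mkdir(parents=True)
    scan = paper / "scan.json"  # hypothetical name for the existing scan data
    scan.write_text('{"title": "placeholder"}\n')
    before = scan.read_text()

    out = write_paper_type(paper, "benchmark-creation")
    unchanged = scan.read_text() == before  # scan data is byte-identical
    written = json.loads(out.read_text())
```

Because the label lives in its own file, a reclassification run only has to overwrite `paper_type.json`, and the scan data keeps a clean history.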

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
Apapers/future-ml-systems-2022/paper_type.json | 5+++++
Apapers/multiagent-risks-from-2025/paper_type.json | 5+++++
Apapers/multiple-scalable-polyglot-2023/paper_type.json | 5+++++
Apapers/multiturn-code-gen-correctness-2025/paper_type.json | 5+++++
Apapers/nalamainz-at-blp2025-2025/paper_type.json | 5+++++
Apapers/narrow-finetuning-leaves-2025/paper_type.json | 5+++++
Apapers/narrowing-complexity-gap-2026/paper_type.json | 5+++++
Apapers/natural-language-outlines-2024/paper_type.json | 5+++++
Apapers/navigating-copyright-aienhanced-2025/paper_type.json | 5+++++
Apapers/navigating-representation-utilizing-2025/paper_type.json | 5+++++
Apapers/nested-learning-illusion-2025/paper_type.json | 5+++++
Apapers/neural-chameleons-language-2025/paper_type.json | 5+++++
Apapers/neural-exec-learning-2024/paper_type.json | 5+++++
Apapers/neural-network-decoder-2023/paper_type.json | 5+++++
Apapers/neural-neural-scaling-2026/paper_type.json | 5+++++
Apapers/neural-program-repair-2023/paper_type.json | 5+++++
Apapers/neurosymbolic-verification-instruction-2026/paper_type.json | 5+++++
Apapers/new-compiler-stack-2026/paper_type.json | 5+++++
Apapers/next-paradigm-usercentric-2026/paper_type.json | 5+++++
Apapers/nl2repo-bench-2025/paper_type.json | 5+++++
Apapers/nlp-evaluation-trouble-2023/paper_type.json | 5+++++
Apapers/nmusketeers-reinforcement-learning-2026/paper_type.json | 5+++++
Apapers/no-more-manual-2023/paper_type.json | 5+++++
Apapers/no-need-lift-2023/paper_type.json | 5+++++
Apapers/not-all-metrics-2023/paper_type.json | 5+++++
Apapers/not-everyone-wins-2025-2/paper_type.json | 5+++++
Apapers/not-everyone-wins-2025/paper_type.json | 5+++++
Apapers/not-what-youve-2023/paper_type.json | 5+++++
Apapers/novel-differential-feature-2025/paper_type.json | 5+++++
Apapers/novel-preprocessing-technique-2023/paper_type.json | 5+++++
Apapers/o1-reasoning-patterns-2024/paper_type.json | 5+++++
Apapers/obliinjection-orderoblivious-prompt-2025/paper_type.json | 5+++++
Apapers/ocrmediated-modality-dominance-2026/paper_type.json | 5+++++
Apapers/oet-optimizationbased-prompt-2025/paper_type.json | 5+++++
Apapers/ojbench-competition-level-2025/paper_type.json | 5+++++
Apapers/ojbkq-objectivejoint-babaiklein-2026/paper_type.json | 5+++++
Apapers/omnicode-benchmark-2026/paper_type.json | 5+++++
Apapers/omnigrok-grokking-beyond-2022/paper_type.json | 5+++++
Apapers/on-premise-llm-cost-benefit-2025/paper_type.json | 5+++++
Apapers/one-token-embedding-2025/paper_type.json | 5+++++
Apapers/openhands-ai-sw-agent-2024/paper_type.json | 5+++++
Apapers/openllmrtl-open-dataset-2024/paper_type.json | 5+++++
Apapers/openr-reasoning-framework-2024/paper_type.json | 5+++++
Apapers/openrubrics-scalable-synthetic-2025/paper_type.json | 5+++++
Apapers/optima-optimizing-effectiveness-2024/paper_type.json | 5+++++
Apapers/optimal-attention-temperature-2025/paper_type.json | 5+++++
Apapers/optimal-scaling-laws-2026/paper_type.json | 5+++++
Apapers/optimalagentselection-stateaware-routing-2025/paper_type.json | 5+++++
Apapers/optimizationbased-prompt-injection-2024/paper_type.json | 5+++++
Apapers/optimizing-code-runtime-2025/paper_type.json | 5+++++
Apapers/optimizing-netgpt-routingbased-2025/paper_type.json | 5+++++
Apapers/optimizing-pytorch-inference-2025/paper_type.json | 5+++++
Apapers/orchestrating-intelligence-confidenceaware-2026/paper_type.json | 5+++++
Apapers/orllmagent-automating-modeling-2025/paper_type.json | 5+++++
Apapers/osc-cognitive-orchestration-2025/paper_type.json | 5+++++
Apapers/osworld-benchmarking-multimodal-2024/paper_type.json | 5+++++
Apapers/outofcontext-outofscope-manipulating-2026/paper_type.json | 5+++++
Apapers/overreliance-human-ai-2025/paper_type.json | 5+++++
Apapers/overseeing-agents-without-2026/paper_type.json | 5+++++
Apapers/pacost-paired-confidence-2024/paper_type.json | 5+++++
Apapers/pairwise-proximal-policy-2023/paper_type.json | 5+++++
Apapers/palm-scaling-language-2022/paper_type.json | 5+++++
Apapers/pangucoder2-boosting-large-2023/paper_type.json | 5+++++
Apapers/paperbench-evaluating-ais-2025/paper_type.json | 5+++++
Apapers/parameterefficient-finetuning-attributed-2025/paper_type.json | 5+++++
Apapers/patchzero-zeroshot-automatic-2023/paper_type.json | 5+++++
Apapers/pathfix-automated-program-2025/paper_type.json | 5+++++
Apapers/pay-hints-not-2026/paper_type.json | 5+++++
Apapers/pedagogical-alignment-large-2024/paper_type.json | 5+++++
Apapers/peeping-at-creaitivity-2025/paper_type.json | 5+++++
Apapers/pennylang-pioneering-llmbased-2025/paper_type.json | 5+++++
Apapers/performance-review-llm-2024/paper_type.json | 5+++++
Apapers/performance-study-llmgenerated-2024/paper_type.json | 5+++++
Apapers/perplexity-paradox-why-2026/paper_type.json | 5+++++
Apapers/persistent-backdoor-attacks-2025/paper_type.json | 5+++++
Apapers/personadual-balancing-personalization-2026/paper_type.json | 5+++++
Apapers/personalizedrouter-personalized-llm-2025/paper_type.json | 5+++++
Apapers/phantom-transfer-datalevel-2026/paper_type.json | 5+++++
Apapers/picking-right-specialist-2026/paper_type.json | 5+++++
Apapers/piguard-prompt-injection-2025/paper_type.json | 5+++++
Apapers/pina-prompt-injection-2026/paper_type.json | 5+++++
Apapers/pisanitizer-preventing-prompt-2025/paper_type.json | 5+++++
Apapers/plan-and-act-long-horizon-2025/paper_type.json | 5+++++
Apapers/planning-natural-language-2024/paper_type.json | 5+++++
Apapers/planrag-planthenretrieval-augmented-2024/paper_type.json | 5+++++
Apapers/play-by-type-2025/paper_type.json | 5+++++
Apapers/plot2code-comprehensive-benchmark-2025/paper_type.json | 5+++++
Apapers/poetics-code-generative-2025/paper_type.json | 5+++++
Apapers/poison-once-refuse-2025/paper_type.json | 5+++++
Apapers/poisoning-attacks-llms-2025/paper_type.json | 5+++++
Apapers/poisoning-real-threat-2024/paper_type.json | 5+++++
Apapers/policy-compiler-secure-2026/paper_type.json | 5+++++
Apapers/policyasprompt-turning-ai-2025/paper_type.json | 5+++++
Apapers/poodle-seamlessly-scaling-2025/paper_type.json | 5+++++
Apapers/position-explaining-behavioral-2026/paper_type.json | 5+++++
Apapers/position-require-frontier-2025/paper_type.json | 5+++++
Apapers/position-vibe-coding-2025/paper_type.json | 5+++++
Apapers/pots-proofoftrainingsteps-backdoor-2025/paper_type.json | 5+++++
Apapers/power-limitations-aggregation-2026/paper_type.json | 5+++++
Apapers/practical-program-repair-2022/paper_type.json | 5+++++
Apapers/practical-useful-automated-2024/paper_type.json | 5+++++
Apapers/pragmatic-reasoning-improves-2025/paper_type.json | 5+++++
Apapers/prattack-coordinated-promptrag-2025/paper_type.json | 5+++++
Apapers/precedentbased-professional-role-2025/paper_type.json | 5+++++
Apapers/predictable-artificial-intelligence-2023/paper_type.json | 5+++++
Apapers/predicting-llm-reasoning-2025/paper_type.json | 5+++++
Apapers/prefillshare-shared-prefill-2026/paper_type.json | 5+++++
Apapers/pretraining-scaling-laws-2025/paper_type.json | 5+++++
Apapers/primg-efficient-llmdriven-2025/paper_type.json | 5+++++
Apapers/proactive-hardening-llm-2026/paper_type.json | 5+++++
Apapers/probing-emergence-crosslingual-2024/paper_type.json | 5+++++
Apapers/probing-language-models-2024/paper_type.json | 5+++++
Apapers/processcentric-analysis-agentic-2025/paper_type.json | 5+++++
Apapers/programmed-please-moral-2026/paper_type.json | 5+++++
Apapers/programming-language-techniques-2025/paper_type.json | 5+++++
Apapers/projdevbench-end-to-end-2026/paper_type.json | 5+++++
Apapers/projecteval-benchmark-programming-2025/paper_type.json | 5+++++
Apapers/projecttest-projectlevel-llm-2025/paper_type.json | 5+++++
Apapers/promises-perils-timely-2026/paper_type.json | 5+++++
Apapers/prompt-alchemist-automated-2025/paper_type.json | 5+++++
Apapers/prompt-infection-llmtollm-2024/paper_type.json | 5+++++
Apapers/prompt-injection-attacks-2024/paper_type.json | 5+++++
Apapers/prompt-injection-attacks-2025-2-2/paper_type.json | 5+++++
Apapers/prompt-injection-attacks-2025-2/paper_type.json | 5+++++
Apapers/prompt-injection-attacks-2025/paper_type.json | 5+++++
Apapers/prompt-injection-attacks-2026-2-2/paper_type.json | 5+++++
Apapers/prompt-injection-attacks-2026/paper_type.json | 5+++++
Apapers/prompt-injection-chatbot-plugins-2025/paper_type.json | 5+++++
Apapers/prompt-injection-detection-2025/paper_type.json | 5+++++
Apapers/prompt-injection-llm-apps-2023/paper_type.json | 5+++++
Apapers/prompt-injection-tool-selection-2025/paper_type.json | 5+++++
Apapers/prompt-injection-vulnerability-2025/paper_type.json | 5+++++
Apapers/prompt-less-smile-2025/paper_type.json | 5+++++
Apapers/prompt-perturbation-fraction-2025/paper_type.json | 5+++++
Apapers/prompt-sapper-llmempowered-2023/paper_type.json | 5+++++
Apapers/prompt-variability-effects-2025/paper_type.json | 5+++++
Apapers/promptarmor-simple-yet-2025/paper_type.json | 5+++++
Apapers/promptbased-code-completion-2024/paper_type.json | 5+++++
Apapers/prompting-programming-query-2022/paper_type.json | 5+++++
Apapers/promptlocate-localizing-prompt-2025/paper_type.json | 5+++++
Apapers/promptpex-automatic-test-2025/paper_type.json | 5+++++
Apapers/prompts-first-precision-2025/paper_type.json | 5+++++
Apapers/promptscreen-efficient-jailbreak-2025/paper_type.json | 5+++++
Apapers/promptsleuth-detecting-prompt-2025/paper_type.json | 5+++++
Apapers/promptware-kill-chain-2026/paper_type.json | 5+++++
Apapers/proof-time-benchmark-2026/paper_type.json | 5+++++
Apapers/prophetfuzz-fully-automated-2024/paper_type.json | 5+++++
Apapers/protect-llm-agent-2025/paper_type.json | 5+++++
Apapers/proteus-slaaware-routing-2026/paper_type.json | 5+++++
Apapers/proververifier-games-improve-2024/paper_type.json | 5+++++
Apapers/proving-coding-interview-2025/paper_type.json | 5+++++
Apapers/psychometric-personality-shaping-2025/paper_type.json | 5+++++
Apapers/pyramid-moa-probabilistic-2026/paper_type.json | 5+++++
Apapers/python-symbolic-execution-2024/paper_type.json | 5+++++
Apapers/pythonsaga-redefining-benchmark-2024/paper_type.json | 5+++++
Apapers/qiskit-humaneval-evaluation-2024/paper_type.json | 5+++++
Apapers/quantifying-contamination-evaluating-2024/paper_type.json | 5+++++
Apapers/quantization-model-neural-2023/paper_type.json | 5+++++
Apapers/queryipi-queryagnostic-indirect-2025/paper_type.json | 5+++++
Apapers/quo-vadis-code-2025/paper_type.json | 5+++++
Apapers/qwen25-technical-report-2024/paper_type.json | 5+++++
Apapers/qwen25coder-technical-report-2024/paper_type.json | 5+++++
Apapers/r2router-new-paradigm-2026/paper_type.json | 5+++++
Apapers/ragmcp-mitigating-prompt-2025/paper_type.json | 5+++++
Apapers/ral2m-retrieval-augmented-2026/paper_type.json | 5+++++
Apapers/ramon-llulls-thinking-2025/paper_type.json | 5+++++
Apapers/random-scaling-emergent-2025/paper_type.json | 5+++++
Apapers/rankllm-python-package-2025/paper_type.json | 5+++++
Apapers/rathandravidianlangtech-2025-annaparavai-2025/paper_type.json | 5+++++
Apapers/raudit-blind-auditing-2026/paper_type.json | 5+++++
Apapers/react-synergizing-reasoning-2022/paper_type.json | 5+++++
Apapers/real-time-ai-2025/paper_type.json | 5+++++
Apapers/realist-pluralist-conceptions-2025/paper_type.json | 5+++++
Apapers/realmath-continuous-benchmark-2025/paper_type.json | 5+++++
Apapers/reasalign-reasoning-enhanced-2026/paper_type.json | 5+++++
Apapers/reasoning-large-language-2023/paper_type.json | 5+++++
Apapers/reasoning-runtime-behavior-2024/paper_type.json | 5+++++
Apapers/rebench-evaluating-frontier-2024/paper_type.json | 5+++++
Apapers/recode-improving-llmbased-2025/paper_type.json | 5+++++
Apapers/red-teaming-mind-2025/paper_type.json | 5+++++
Apapers/redcode-risky-code-2024/paper_type.json | 5+++++
Apapers/reducing-hallucinations-llmgenerated-2025/paper_type.json | 5+++++
Apapers/redvisor-reasoningaware-prompt-2026/paper_type.json | 5+++++
Apapers/refinestat-efficient-exploration-2025/paper_type.json | 5+++++
Apapers/refining-input-guardrails-2025/paper_type.json | 5+++++
Apapers/reinforcement-learning-mutation-2023/paper_type.json | 5+++++
Apapers/relative-preference-optimization-2024/paper_type.json | 5+++++
Apapers/relative-scaling-laws-2025/paper_type.json | 5+++++
Apapers/relativebased-scaling-law-2025/paper_type.json | 5+++++
Apapers/relaygen-intrageneration-model-2026/paper_type.json | 5+++++
Apapers/rele-scalable-system-2026/paper_type.json | 5+++++
Apapers/reliability-explainability-language-2023/paper_type.json | 5+++++
Apapers/reliable-agent-engineering-2025/paper_type.json | 5+++++
Apapers/reliable-llmbased-edgecloudexpert-2025/paper_type.json | 5+++++
Apapers/relrepair-enhancing-automated-2025/paper_type.json | 5+++++
Apapers/remote-labor-index-2025/paper_type.json | 5+++++
Apapers/repaca-leveraging-reasoning-2025/paper_type.json | 5+++++
Apapers/repair-automated-program-2024/paper_type.json | 5+++++
Apapers/repair-ingredients-all-2025/paper_type.json | 5+++++
Apapers/repairagent-llm-bug-repair-2024/paper_type.json | 5+++++
Apapers/repairing-bugs-python-2022/paper_type.json | 5+++++
Apapers/repairllama-efficient-representations-2023/paper_type.json | 5+++++
Apapers/repairr1-better-test-2025/paper_type.json | 5+++++
Apapers/repoagent-documentation-2024/paper_type.json | 5+++++
Apapers/repogenreflex-enhancing-repositorylevel-2024/paper_type.json | 5+++++
Apapers/repotransbench-realworld-multilingual-2024/paper_type.json | 5+++++
Apapers/requirements-to-code-practices-2025/paper_type.json | 5+++++
Apapers/rescue-ranking-llm-2023/paper_type.json | 5+++++
Apapers/researchcodebench-benchmarking-llms-2025/paper_type.json | 5+++++
Apapers/researchrubrics-benchmark-prompts-2025/paper_type.json | 5+++++
Apapers/reshaping-higher-education-2025/paper_type.json | 5+++++
Apapers/resourceefficient-multimodal-intelligence-2025/paper_type.json | 5+++++
Apapers/responsible-artificial-intelligence-2025/paper_type.json | 5+++++
Apapers/rethinking-benchmark-contamination-2023/paper_type.json | 5+++++
Apapers/rethinking-code-review-2025/paper_type.json | 5+++++
Apapers/rethinking-kernel-program-2025/paper_type.json | 5+++++
Apapers/rethinking-knowledge-distillation-2025/paper_type.json | 5+++++
Apapers/rethinking-verification-llm-2025/paper_type.json | 5+++++
Apapers/retrievalaugmented-code-generation-2025/paper_type.json | 5+++++
Apapers/retrievalaugmented-code-review-2025/paper_type.json | 5+++++
Apapers/retrievalaugmented-generation-approach-2025/paper_type.json | 5+++++
Apapers/retrievalaugmented-generation-electrocardiogramlanguage-2025/paper_type.json | 5+++++
Apapers/retrievalaugmented-generation-multilingual-2024/paper_type.json | 5+++++
Apapers/reversum-multistaged-retrievalaugmented-2025/paper_type.json | 5+++++
Apapers/review-advances-aipowered-2024/paper_type.json | 5+++++
Apapers/review-aidriven-approaches-2025/paper_type.json | 5+++++
Apapers/review-generative-ai-2024/paper_type.json | 5+++++
Apapers/review-generative-ai-2025/paper_type.json | 5+++++
Apapers/review-hallucination-understanding-2025/paper_type.json | 5+++++
Apapers/review-research-aiassisted-2025/paper_type.json | 5+++++
Apapers/review-tools-zerocode-2025/paper_type.json | 5+++++
Apapers/revisiting-evolutionary-program-2024/paper_type.json | 5+++++
Apapers/revisiting-unnaturalness-automated-2024/paper_type.json | 5+++++
Apapers/revolution-hype-seeking-2025/paper_type.json | 5+++++
Apapers/rexbench-can-coding-2025/paper_type.json | 5+++++
Apapers/rgfl-reasoning-guided-2026/paper_type.json | 5+++++
Apapers/right-prompts-job-2023/paper_type.json | 5+++++
Apapers/rise-potential-large-2023/paper_type.json | 5+++++
Apapers/rise-potential-opportunities-2025/paper_type.json | 5+++++
Apapers/rl-hammer-llms-2025/paper_type.json | 5+++++
Apapers/rltf-reinforcement-learning-2023/paper_type.json | 5+++++
Apapers/rlthf-targeted-human-2025/paper_type.json | 5+++++
Apapers/rmb-comprehensively-benchmarking-2024/paper_type.json | 5+++++
Apapers/robon-routed-online-2025/paper_type.json | 5+++++
Apapers/robots-here-navigating-2023/paper_type.json | 5+++++
Apapers/robust-llm-alignment-2025/paper_type.json | 5+++++
Apapers/robust-retrievalbased-summarization-2024/paper_type.json | 5+++++
Apapers/robustness-referencing-defending-2025/paper_type.json | 5+++++
Apapers/role-artificial-intelligence-2025/paper_type.json | 5+++++
Apapers/role-genai-automated-2023/paper_type.json | 5+++++
Apapers/role-generative-ai-2025/paper_type.json | 5+++++
Apapers/rooflinebench-benchmarking-framework-2026/paper_type.json | 5+++++
Apapers/routing-cascades-user-2026/paper_type.json | 5+++++
Apapers/rtbas-defending-llm-2025/paper_type.json | 5+++++
Apapers/rtl-graphenhanced-llm-2025/paper_type.json | 5+++++
Apapers/rtlsquad-multiagent-based-2025/paper_type.json | 5+++++
Apapers/rubric-all-you-2025-2/paper_type.json | 5+++++
Apapers/rubric-all-you-2025/paper_type.json | 5+++++
Apapers/runbugrun-executable-dataset-2023/paper_type.json | 5+++++
Apapers/rustassistant-llms-fix-2025/paper_type.json | 5+++++
Apapers/saber-efficient-sampling-2025/paper_type.json | 5+++++
Apapers/saber-small-actions-2025/paper_type.json | 5+++++
Apapers/safegenbench-benchmark-framework-2025/paper_type.json | 5+++++
Apapers/safeguarding-visionlanguage-models-2024/paper_type.json | 5+++++
Apapers/safepro-evaluating-safety-2026/paper_type.json | 5+++++
Apapers/safetyefficacy-trade-off-2026/paper_type.json | 5+++++
Apapers/sage-steerable-agentic-2026/paper_type.json | 5+++++
Apapers/salad-systematic-assessment-2025/paper_type.json | 5+++++
Apapers/sampleefficient-human-evaluation-2024/paper_type.json | 5+++++
Apapers/saro-enhancing-llm-2025/paper_type.json | 5+++++
Apapers/scaffolded-model-capability-2023/paper_type.json | 5+++++
Apapers/scalable-oversight-partitioned-2025/paper_type.json | 5+++++
Apapers/scales-justitia-comprehensive-2025/paper_type.json | 5+++++
Apapers/scaling-laws-2020/paper_type.json | 5+++++
Apapers/scaling-laws-code-2025/paper_type.json | 5+++++
Apapers/scaling-laws-data-2025/paper_type.json | 5+++++
Apapers/scaling-laws-economic-productivity-2024/paper_type.json | 5+++++
Apapers/scaling-laws-multiagent-2022/paper_type.json | 5+++++
Apapers/scaling-testtime-compute-2025/paper_type.json | 5+++++
Apapers/scenarios-transition-agi-2024/paper_type.json | 5+++++
Apapers/scheming-llm-to-llm-interactions-2025/paper_type.json | 5+++++
Apapers/science-scaling-agent-2025/paper_type.json | 5+++++
Apapers/scissorhands-exploiting-persistence-2023/paper_type.json | 5+++++
Apapers/scmas-constructing-costefficient-2026/paper_type.json | 5+++++
Apapers/sdag-subjectbased-directed-2025/paper_type.json | 5+++++
Apapers/se-agentic-benchmarks-survey-2025/paper_type.json | 5+++++
Apapers/seakr-selfaware-knowledge-2024/paper_type.json | 5+++++
Apapers/searchbased-automated-program-2024/paper_type.json | 5+++++
Apapers/secalign-defending-against-2024/paper_type.json | 5+++++
Apapers/secbench-automated-benchmarking-2025/paper_type.json | 5+++++
Apapers/seccodeprm-process-reward-2026/paper_type.json | 5+++++
Apapers/secinfer-preventing-prompt-2025/paper_type.json | 5+++++
Apapers/secodeplt-unified-platform-2024/paper_type.json | 5+++++
Apapers/secure-coding-ai-2025/paper_type.json | 5+++++
Apapers/secureagentbench-benchmarking-secure-2025/paper_type.json | 5+++++
Apapers/securecai-injectionresilient-llm-2026/paper_type.json | 5+++++
Apapers/securing-ai-agents-2025/paper_type.json | 5+++++
Apapers/securing-large-language-2025/paper_type.json | 5+++++
Apapers/security-assertions-by-2023/paper_type.json | 5+++++
Apapers/security-degradation-iterative-2025/paper_type.json | 5+++++
Apapers/seeing-fixing-crossmodal-2025-2/paper_type.json | 5+++++
Apapers/selfconsistency-improves-chain-2022/paper_type.json | 5+++++
Apapers/selforganized-agents-llm-2024/paper_type.json | 5+++++
Apapers/semagent-semantics-aware-2025/paper_type.json | 5+++++
Apapers/semantic-compression-memory-2026/paper_type.json | 5+++++
Apapers/semantics-as-shield-2025/paper_type.json | 5+++++
Apapers/semisupervised-cascaded-clustering-2022/paper_type.json | 5+++++
Apapers/sensorium-arc-ai-2025/paper_type.json | 5+++++
Apapers/sentraguard-multilingual-humanai-2025/paper_type.json | 5+++++
Apapers/separator-injection-attack-2025/paper_type.json | 5+++++
Apapers/sequential-enumeration-large-2025/paper_type.json | 5+++++
Apapers/shadowcode-automatic-external-2024/paper_type.json | 5+++++
Apapers/sherlock-reliable-efficient-2025/paper_type.json | 5+++++
Apapers/shieldlearner-new-paradigm-2025/paper_type.json | 5+++++
Apapers/shifting-from-ranking-2025/paper_type.json | 5+++++
Apapers/shroomindelab-at-semeval2024-2024/paper_type.json | 5+++++
Apapers/siadafix-issue-description-2025/paper_type.json | 5+++++
Apapers/sidiffagent-selfimproving-diffusion-2026/paper_type.json | 5+++++
Apapers/signedprompt-new-approach-2024/paper_type.json | 5+++++
Apapers/significant-productivity-gains-2024/paper_type.json | 5+++++
Apapers/simple-llm-baselines-2026/paper_type.json | 5+++++
Apapers/simple-prompt-injection-2025/paper_type.json | 5+++++
Apapers/simulationguided-llmbased-code-2025/paper_type.json | 5+++++
Apapers/single-character-perturbations-2024/paper_type.json | 5+++++
Apapers/single-direction-truth-2025/paper_type.json | 5+++++
Apapers/singleagent-scaling-fails-2025/paper_type.json | 5+++++
Apapers/singlehead-attention-high-2025/paper_type.json | 5+++++
Apapers/singlemulti-evolution-loop-2026/paper_type.json | 5+++++
Apapers/six-sigma-agent-2026/paper_type.json | 5+++++
Apapers/skate-scalable-tournament-2025/paper_type.json | 5+++++
Apapers/skillorchestra-learning-route-2026/paper_type.json | 5+++++
Apapers/sleeper-agents-2024/paper_type.json | 5+++++
Apapers/slidesgenbench-evaluating-slides-2026/paper_type.json | 5+++++
Apapers/sloconditioned-action-routing-2025/paper_type.json | 5+++++
Apapers/smoothquant-accurate-efficient-2022/paper_type.json | 5+++++
Apapers/socialveil-probing-social-2026/paper_type.json | 5+++++
Apapers/societal-alignment-frameworks-2025/paper_type.json | 5+++++
Apapers/sok-comprehensive-causality-2025/paper_type.json | 5+++++
Apapers/sok-trustauthorization-mismatch-2025/paper_type.json | 5+++++
Apapers/soleval-benchmarking-large-2025/paper_type.json | 5+++++
Apapers/source-code-comprehension-2023/paper_type.json | 5+++++
Apapers/spec2rtlagent-automated-hardware-2025/paper_type.json | 5+++++
Apapers/specification-vibing-automated-2026/paper_type.json | 5+++++
Apapers/specificationguided-vulnerability-detection-2025/paper_type.json | 5+++++
Apapers/specifications-missing-link-2024/paper_type.json | 5+++++
Apapers/speed-at-cost-2025/paper_type.json | 5+++++
Apapers/spin-selfsupervised-prompt-2024/paper_type.json | 5+++++
Apapers/split-personality-training-2026/paper_type.json | 5+++++
Apapers/spread-preference-annotation-2024/paper_type.json | 5+++++
Apapers/starcoder-2023/paper_type.json | 5+++++
Apapers/starcoder2-2024/paper_type.json | 5+++++
Apapers/stateflow-enhancing-llm-2024/paper_type.json | 5+++++
Apapers/static-program-analysis-2024/paper_type.json | 5+++++
Apapers/statically-contextualizing-large-2024/paper_type.json | 5+++++
Apapers/steering-llms-scalable-2026/paper_type.json | 5+++++
Apapers/stellar-searchbased-testing-2026/paper_type.json | 5+++++
Apapers/stelp-secure-transpilation-2026/paper_type.json | 5+++++
Apapers/stepshield-when-not-2026/paper_type.json | 5+++++
Apapers/stop-wasting-your-2025/paper_type.json | 5+++++
Apapers/strategic-dishonesty-safety-evals-2025/paper_type.json | 5+++++
Apapers/strongermas-multiagent-reinforcement-2025/paper_type.json | 5+++++
Apapers/structtest-benchmarking-llms-2024/paper_type.json | 5+++++
Apapers/study-prompt-injection-2024/paper_type.json | 5+++++
Apapers/style-outweighs-substance-2024/paper_type.json | 5+++++
Apapers/subliminal-corruption-mechanisms-2025/paper_type.json | 5+++++
Apapers/subliminal-learning-language-2025/paper_type.json | 5+++++
Apapers/successive-prompting-decomposing-2022/paper_type.json | 5+++++
Apapers/survey-adversarial-examples-2025/paper_type.json | 5+++++
Apapers/survey-agentic-service-2025/paper_type.json | 5+++++
Apapers/survey-automated-program-2023/paper_type.json | 5+++++
Apapers/survey-autonomous-llm-agents-2023/paper_type.json | 5+++++
Apapers/survey-code-gen-llm-agents-2025/paper_type.json | 5+++++
Apapers/survey-code-generation-2024/paper_type.json | 5+++++
Apapers/survey-data-contamination-2025/paper_type.json | 5+++++
Apapers/survey-hallucination-large-2023/paper_type.json | 5+++++
Apapers/survey-large-language-2023/paper_type.json | 5+++++
Apapers/survey-learningbased-automated-2023/paper_type.json | 5+++++
Apapers/survey-llm-code-generation-2025/paper_type.json | 5+++++
Apapers/survey-llm-code-low-resource-2024/paper_type.json | 5+++++
Apapers/survey-llmbased-multiagent-2024/paper_type.json | 5+++++
Apapers/survey-llms-software-engineering-2023/paper_type.json | 5+++++
Apapers/survey-progress-llm-2025/paper_type.json | 5+++++
Apapers/survey-useful-llm-2024/paper_type.json | 5+++++
Apapers/survival-games-humanllm-2025/paper_type.json | 5+++++
Apapers/survivehr-competing-risks-2025/paper_type.json | 5+++++
Apapers/sustainable-llm-inference-2026/paper_type.json | 5+++++
Apapers/svrepair-structured-visual-2026/paper_type.json | 5+++++
Apapers/swe-agent-2024/paper_type.json | 5+++++
Apapers/swe-bench-2023/paper_type.json | 5+++++
Apapers/swe-bench-illusion-2025/paper_type.json | 5+++++
Apapers/swe-bench-plus-2024/paper_type.json | 5+++++
Apapers/swe-bench-pro-2025/paper_type.json | 5+++++
Apapers/swe-bench-what-in-benchmark-2026/paper_type.json | 5+++++
Apapers/swe-evo-coding-agents-2025/paper_type.json | 5+++++
Apapers/swe-mera-dynamic-benchmark-2025/paper_type.json | 5+++++
Apapers/sweeffi-reevaluating-software-2025/paper_type.json | 5+++++
Apapers/swelancer-can-frontier-2025/paper_type.json | 5+++++
Apapers/swenergy-empirical-study-2025/paper_type.json | 5+++++
Apapers/sweprotege-learning-selectively-2026/paper_type.json | 5+++++
Apapers/swerank-multilingual-multiturn-2025/paper_type.json | 5+++++
Apapers/swerebench-automated-pipeline-2025/paper_type.json | 5+++++
Apapers/swtbench-testing-validating-2024/paper_type.json | 5+++++
Apapers/syncode-llm-generation-2024/paper_type.json | 5+++++
Apapers/synergizing-human-expertise-2024/paper_type.json | 5+++++
Apapers/syntactic-robustness-llmbased-2024/paper_type.json | 5+++++
Apapers/synthetic-code-surgery-2025/paper_type.json | 5+++++
Apapers/sysllmatic-large-language-2025/paper_type.json | 5+++++
Apapers/system-automated-unit-2024/paper_type.json | 5+++++
Apapers/system-prompt-poisoning-2025/paper_type.json | 5+++++
Apapers/systematic-evaluation-llmasajudge-2024/paper_type.json | 5+++++
Apapers/systematic-literature-review-2024-2/paper_type.json | 5+++++
Apapers/systematic-literature-review-2024/paper_type.json | 5+++++
Apapers/systematic-literature-review-2025-2/paper_type.json | 5+++++
Apapers/systematic-literature-review-2025/paper_type.json | 5+++++
Apapers/systematic-review-infrastructure-2023/paper_type.json | 5+++++
Apapers/systemlevel-defense-against-2024/paper_type.json | 5+++++
Apapers/syzygy-dual-codetest-2024/paper_type.json | 5+++++
Apapers/t3-multilevel-treebased-2025/paper_type.json | 5+++++
Apapers/t5apr-empowering-automated-2023/paper_type.json | 5+++++
Apapers/tamas-benchmarking-adversarial-2025/paper_type.json | 5+++++
Apapers/target-traffic-rulebased-2023/paper_type.json | 5+++++
Apapers/task-shield-enforcing-2024/paper_type.json | 5+++++
Apapers/tasklevel-evaluation-ai-2026/paper_type.json | 5+++++
Apapers/taxonomy-evaluation-exploitation-2025/paper_type.json | 5+++++
Apapers/teaching-critiquing-conceptualization-2025/paper_type.json | 5+++++
Apapers/teaching-programming-age-2025/paper_type.json | 5+++++
Apapers/teamcraft-benchmark-multimodal-2024/paper_type.json | 5+++++
Apapers/telecomrag-taming-telecom-2024/paper_type.json | 5+++++
Apapers/temporal-knowledgebase-creation-2025/paper_type.json | 5+++++
Apapers/ten-simple-rules-2025/paper_type.json | 5+++++
Apapers/test-driven-interactive-code-gen-2024/paper_type.json | 5+++++
Apapers/test-smells-llmgenerated-2024/paper_type.json | 5+++++
Apapers/test-wars-comparative-2025/paper_type.json | 5+++++
Apapers/testbench-evaluating-classlevel-2024/paper_type.json | 5+++++
Apapers/testdriven-development-llmbased-2024/paper_type.json | 5+++++
Apapers/testgeneval-real-world-2024/paper_type.json | 5+++++
Apapers/testtime-matching-unlocking-2025/paper_type.json | 5+++++
Apapers/text-prompt-injection-2025/paper_type.json | 5+++++
Apapers/textresnet-decoupling-routing-2026/paper_type.json | 5+++++
Apapers/texttoaudio-generation-instructiontuned-2023/paper_type.json | 5+++++
Apapers/textttremind-understanding-deductive-2025/paper_type.json | 5+++++
Apapers/tfhecoder-evaluating-llmagentic-2025/paper_type.json | 5+++++
Apapers/theoretical-foundations-scaling-2025/paper_type.json | 5+++++
Apapers/they-all-good-2025/paper_type.json | 5+++++
Apapers/think-locally-explain-2026/paper_type.json | 5+++++
Apapers/thinking-isnt-illusion-2025/paper_type.json | 5+++++
Apapers/thinking-llms-lie-2025/paper_type.json | 5+++++
Apapers/thinking-longer-not-2025/paper_type.json | 5+++++
Apapers/thinkrepair-selfdirected-automated-2024/paper_type.json | 5+++++
Apapers/thought-communication-multiagent-2025/paper_type.json | 5+++++
Apapers/threatlens-llmguided-threat-2025/paper_type.json | 5+++++
Apapers/throwbench-benchmarking-llms-2025/paper_type.json | 5+++++
Apapers/tigercoder-novel-suite-2025/paper_type.json | 5+++++
Apapers/timecma-llmempowered-multivariate-2024/paper_type.json | 5+++++
Apapers/timecma-llmempowered-time-2024/paper_type.json | 5+++++
Apapers/todo-enhancing-llm-2024/paper_type.json | 5+++++
Apapers/tokenefficient-prompt-injection-2025/paper_type.json | 5+++++
Apapers/too-easily-fooled-2025/paper_type.json | 5+++++
Apapers/top-general-performance-2024/paper_type.json | 5+++++
Apapers/top-leaderboard-ranking-2024/paper_type.json | 5+++++
Apapers/topicattack-indirect-prompt-2025/paper_type.json | 5+++++
Apapers/traceable-latent-variable-2026/paper_type.json | 5+++++
Apapers/tracing-errors-constructing-2025/paper_type.json | 5+++++
Apapers/tracking-moving-target-2025/paper_type.json | 5+++++
Apapers/trae-agent-llmbased-2025/paper_type.json | 5+++++
Apapers/training-generalizable-collaborative-2026/paper_type.json | 5+++++
Apapers/training-llms-generating-2024/paper_type.json | 5+++++
Apapers/training-llms-honesty-2025/paper_type.json | 5+++++
Apapers/traitors-deception-trust-2025/paper_type.json | 5+++++
Apapers/transfer-q-star-2024/paper_type.json | 5+++++
Apapers/transformer-we-trust-2026/paper_type.json | 5+++++
Apapers/transforming-software-development-2024-2/paper_type.json | 5+++++
Apapers/transforming-wearable-data-2024/paper_type.json | 5+++++
Apapers/tree-thoughts-deliberate-2023/paper_type.json | 5+++++
Apapers/trigger-haystack-extracting-2026/paper_type.json | 5+++++
Apapers/trust-by-design-2026/paper_type.json | 5+++++
Apapers/trust-llmcontrolled-robotics-2025/paper_type.json | 5+++++
Apapers/trustworthy-agentic-ai-2025/paper_type.json | 5+++++
Apapers/trustworthy-llm-agents-survey-2025/paper_type.json | 5+++++
Apapers/trustworthy-llms-survey-2023/paper_type.json | 5+++++
Apapers/tsapr-tree-search-2025/paper_type.json | 5+++++
Apapers/turning-tide-repositorybased-2025/paper_type.json | 5+++++
Apapers/type-context-pass-rates-2024/paper_type.json | 5+++++
Apapers/typeaware-llmbased-regression-2025/paper_type.json | 5+++++
Apapers/types-grassroots-logic-2026/paper_type.json | 5+++++
Apapers/typescript-typecheck-failures-2025/paper_type.json | 5+++++
Apapers/uc-berkeley-mast-2025/paper_type.json | 5+++++
Apapers/uda-benchmark-suite-2024/paper_type.json | 5+++++
Apapers/ultrarag-modular-automated-2025/paper_type.json | 5+++++
Apapers/uncertainty-large-language-2026/paper_type.json | 5+++++
Apapers/understand-what-llm-2024/paper_type.json | 5+++++
Apapers/understanding-large-language-2023/paper_type.json | 5+++++
Apapers/understanding-layer-significance-2024/paper_type.json | 5+++++
Apapers/understanding-multimodal-finetuning-2026/paper_type.json | 5+++++
Apapers/understanding-protecting-augmenting-2025/paper_type.json | 5+++++
Apapers/understanding-software-engineering-2025/paper_type.json | 5+++++
Apapers/understanding-subliminal-learning-2025/paper_type.json | 5+++++
Apapers/unicode-augmenting-evaluation-2025/paper_type.json | 5+++++
Apapers/unified-scaling-laws-2022/paper_type.json | 5+++++
Apapers/unified-threat-detection-2025/paper_type.json | 5+++++
Apapers/uniguardian-unified-defense-2025/paper_type.json | 5+++++
Apapers/unintended-impacts-llm-2024/paper_type.json | 5+++++
Apapers/unseen-horizons-unveiling-2024/paper_type.json | 5+++++
Apapers/unveiling-potential-diffusion-2025/paper_type.json | 5+++++
Apapers/upbench-dynamically-evolving-2025/paper_type.json | 5+++++
Apapers/use-generative-ai-2024/paper_type.json | 5+++++
Apapers/use-propertybased-testing-2025/paper_type.json | 5+++++
Apapers/user-centric-evaluation-2024/paper_type.json | 5+++++
Apapers/user-feedback-alignment-2025/paper_type.json | 5+++++
Apapers/user-misconceptions-llmbased-2025/paper_type.json | 5+++++
Apapers/utboost-rigorous-evaluation-2025/paper_type.json | 5+++++
Apapers/validity-what-you-2025/paper_type.json | 5+++++
Apapers/validityguided-workflow-robust-2025/paper_type.json | 5+++++
Apapers/value-variance-mitigating-2026/paper_type.json | 5+++++
Apapers/values-science-ai-2026/paper_type.json | 5+++++
Apapers/vericoder-enhancing-llmbased-2025/paper_type.json | 5+++++
Apapers/vericontaminated-assessing-llmdriven-2025/paper_type.json | 5+++++
Apapers/verification-implicit-world-2026/paper_type.json | 5+++++
Apapers/verifierq-enhancing-llm-2024/paper_type.json | 5+++++
Apapers/verilogeval-evaluating-large-2023/paper_type.json | 5+++++
Apapers/verilogreader-llmaided-hardware-2024/paper_type.json | 5+++++
Apapers/verimind-agentic-llm-2025/paper_type.json | 5+++++
Apapers/verina-benchmarking-verifiable-2025/paper_type.json | 5+++++
Apapers/verpo-verifiable-dense-2026/paper_type.json | 5+++++
Apapers/vhdleval-framework-evaluating-2024/paper_type.json | 5+++++
Apapers/vibe-aigc-new-2026/paper_type.json | 5+++++
Apapers/vibe-coding-ainative-2025/paper_type.json | 5+++++
Apapers/vibe-coding-practice-2025/paper_type.json | 5+++++
Apapers/vibe-coding-product-2025/paper_type.json | 5+++++
Apapers/vibe-learning-education-2025/paper_type.json | 5+++++
Apapers/videot1-testtime-scaling-2025/paper_type.json | 5+++++
Apapers/vieva-llm-conceptual-2024/paper_type.json | 5+++++
Apapers/virtual-lab-ai-2024/paper_type.json | 5+++++
Apapers/virus-infection-attack-2025/paper_type.json | 5+++++
Apapers/viscosity-logic-phase-2026/paper_type.json | 5+++++
Apapers/vision-wormhole-latentspace-2026/paper_type.json | 5+++++
Apapers/visualwebarena-evaluating-multimodal-2024/paper_type.json | 5+++++
Apapers/vlrouterbench-benchmark-visionlanguage-2025/paper_type.json | 5+++++
Apapers/vortexpia-indirect-prompt-2025/paper_type.json | 5+++++
Apapers/vsavisualstructural-alignment-uitocode-2025/paper_type.json | 5+++++
Apapers/vulscriber-exploring-ragbased-2024/paper_type.json | 5+++++
Apapers/walle-world-alignment-2024/paper_type.json | 5+++++
Apapers/wasp-benchmarking-web-2025/paper_type.json | 5+++++
Apapers/watch-weights-unsupervised-2025/paper_type.json | 5+++++
Apapers/webapp1k-practical-codegeneration-2024/paper_type.json | 5+++++
Apapers/webarena-autonomous-agents-2023/paper_type.json | 5+++++
Apapers/webbench-llm-code-2025/paper_type.json | 5+++++
Apapers/webguard-building-generalizable-2025/paper_type.json | 5+++++
Apapers/webinject-prompt-injection-2025/paper_type.json | 5+++++
Apapers/webmmu-benchmark-multimodal-2025/paper_type.json | 5+++++
Apapers/webuibench-comprehensive-benchmark-2025/paper_type.json | 5+++++
Apapers/what-can-youth-2024/paper_type.json | 5+++++
Apapers/what-cut-predicting-2026/paper_type.json | 5+++++
Apapers/what-do-llm-2026/paper_type.json | 5+++++
Apapers/what-does-it-2025/paper_type.json | 5+++++
Apapers/what-retrieve-effective-2025/paper_type.json | 5+++++
Apapers/what-wrong-your-2024/paper_type.json | 5+++++
Apapers/when-agents-fail-2026/paper_type.json | 5+++++
Apapers/when-ai-agents-2025/paper_type.json | 5+++++
Apapers/when-bots-take-2026/paper_type.json | 5+++++
Apapers/when-finetuning-llms-2024/paper_type.json | 5+++++
Apapers/when-large-language-2024/paper_type.json | 5+++++
Apapers/when-nobody-around-2026/paper_type.json | 5+++++
Apapers/when-reject-turns-2025/paper_type.json | 5+++++
Apapers/when-singleagent-skills-2026/paper_type.json | 5+++++
Apapers/where-did-it-2025/paper_type.json | 5+++++
Apapers/where-do-ai-2026/paper_type.json | 5+++++
Apapers/where-llms-struggle-code-2025/paper_type.json | 5+++++
Apapers/which-agent-causes-2025/paper_type.json | 5+++++
Apapers/why-ai-alignment-2026/paper_type.json | 5+++++
Apapers/why-behind-action-2026/paper_type.json | 5+++++
Apapers/why-do-language-2025/paper_type.json | 5+++++
Apapers/why-reasoning-fails-2026/paper_type.json | 5+++++
Apapers/wink-recovering-from-2026/paper_type.json | 5+++++
Apapers/xgenq-explainable-domainadaptive-2025/paper_type.json | 5+++++
Apapers/you-only-need-2025/paper_type.json | 5+++++
Apapers/your-benchmark-still-2025/paper_type.json | 5+++++
Apapers/your-code-generated-2023/paper_type.json | 5+++++
Apapers/yunque-deepresearch-technical-2026/paper_type.json | 5+++++
Apapers/zeroshot-embedding-drift-2026/paper_type.json | 5+++++
Apapers/zeroshot-llmguided-counterfactual-2024/paper_type.json | 5+++++
Apapers/zeroshot-prompting-approaches-2024/paper_type.json | 5+++++
582 files changed, 2910 insertions(+), 0 deletions(-)

diff --git a/papers/future-ml-systems-2022/paper_type.json b/papers/future-ml-systems-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper documents emergent abilities through quantitative results across multiple tasks and model scales, showing that downstream metrics demonstrate clear empirical phenomena rather than just surveying existing work or proposing untested frameworks."
+}
+\ No newline at end of file
diff --git a/papers/multiagent-risks-from-2025/paper_type.json b/papers/multiagent-risks-from-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a taxonomy but validates it through 3 novel RL experiments and 13 case studies that demonstrate specific multi-agent failure modes; the experimental findings are the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/multiple-scalable-polyglot-2023/paper_type.json b/papers/multiple-scalable-polyglot-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Primary contribution is MultiPL-E, a scalable benchmark framework that translates existing code generation benchmarks to 18 languages; experiments with Codex are baseline validation of the framework."
+}
+\ No newline at end of file
diff --git a/papers/multiturn-code-gen-correctness-2025/paper_type.json b/papers/multiturn-code-gen-correctness-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments measuring performance degradation of multi-turn code generation vs. single-turn and provides quantitative findings about correctness/security impacts; primary contribution is experimental evidence, not the benchmark framework itself."
+}
+\ No newline at end of file
diff --git a/papers/nalamainz-at-blp2025-2025/paper_type.json b/papers/nalamainz-at-blp2025-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a multi-agent pipeline for code generation and experimentally evaluates it on the BLP-2025 benchmark, reporting quantitative results (95.4% Pass@1) and ablation findings."
+}
+\ No newline at end of file
diff --git a/papers/narrow-finetuning-leaves-2025/paper_type.json b/papers/narrow-finetuning-leaves-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments comparing an interpretability agent against baselines, reporting quantitative success rates (91% vs 39%) across 33 model organisms and 7 architectures to demonstrate that finetuning creates detectable activation traces."
+}
+\ No newline at end of file
diff --git a/papers/narrowing-complexity-gap-2026/paper_type.json b/papers/narrowing-complexity-gap-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces GeneBench, a genetic optimization framework for transforming existing benchmarks into more complex variants; the LLM evaluation validates the benchmark rather than being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/natural-language-outlines-2024/paper_type.json b/papers/natural-language-outlines-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts case studies and experiments evaluating LLM-generated code outlines on real Python functions and Android reverse engineering tasks, reporting quantitative results (60% excellent ratings, 80% correctness) and correlations from professional developer feedback."
+}
+\ No newline at end of file
diff --git a/papers/navigating-copyright-aienhanced-2025/paper_type.json b/papers/navigating-copyright-aienhanced-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Reviews copyright case law and proposes prescriptive solutions (human-AI co-creation models, metadata standards, licensing frameworks) for handling AI-generated game content, without empirical validation or experimental evidence."
+}
+\ No newline at end of file
diff --git a/papers/navigating-representation-utilizing-2025/paper_type.json b/papers/navigating-representation-utilizing-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments comparing three prompting strategies with quantitative results (93.9% least-harmful rankings, Cronbach's Alpha 0.67) and manual evaluation, making the primary contribution experimental findings about prompt engineering effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/nested-learning-illusion-2025/paper_type.json b/papers/nested-learning-illusion-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper's primary contribution is a new theoretical paradigm (Nested Learning) that reframes and unifies existing optimizers and architectures as nested optimization problems and associative memories, with empirical results on the Hope architecture serving as validation of the framework."
+}
+\ No newline at end of file
diff --git a/papers/neural-chameleons-language-2025/paper_type.json b/papers/neural-chameleons-language-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Fine-tunes and experiments on LLMs to demonstrate evasion capabilities against activation monitors, reporting quantitative results on transfer success, benchmark performance, and mechanistic analysis of evasion strategies."
+}
+\ No newline at end of file
diff --git a/papers/neural-exec-learning-2024/paper_type.json b/papers/neural-exec-learning-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper introduces Neural Exec and validates it through experiments on four LLMs, reporting quantitative effectiveness improvements (200-500%) and persistence rates (~80%), with primary contribution being experimental findings across multiple models and scenarios."
+}
+\ No newline at end of file
diff --git a/papers/neural-network-decoder-2023/paper_type.json b/papers/neural-network-decoder-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper develops and experimentally validates an LSTM-based neural network decoder, demonstrating quantitative improvements over baselines on both simulated and real quantum error correction data."
+}
+\ No newline at end of file
diff --git a/papers/neural-neural-scaling-2026/paper_type.json b/papers/neural-neural-scaling-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Trains a neural network (NEUNEU) and reports quantitative experimental results (2.04% MAE, 75.6% ranking accuracy) across 66 benchmark tasks; the primary contribution is the empirical findings of the model's predictive performance."
+}
+\ No newline at end of file
diff --git a/papers/neural-program-repair-2023/paper_type.json b/papers/neural-program-repair-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on five existing benchmarks with quantitative comparisons against 10 baselines and includes ablation studies; primary contribution is empirical findings demonstrating the proposed method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/neurosymbolic-verification-instruction-2026/paper_type.json b/papers/neurosymbolic-verification-instruction-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes NSVIF framework and empirically evaluates it on VIFBENCH (820 examples) with quantitative results (94.8% F1), baseline comparisons, and ablation studies showing component contributions."
+}
+\ No newline at end of file
diff --git a/papers/new-compiler-stack-2026/paper_type.json b/papers/new-compiler-stack-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly a systematic literature review of 159 papers that proposes a multi-dimensional taxonomy to categorize and synthesize existing work on LLMs and compilers."
+}
+\ No newline at end of file
diff --git a/papers/next-paradigm-usercentric-2026/paper_type.json b/papers/next-paradigm-usercentric-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "The paper argues for a paradigm shift from platform-centric to user-centric agent architectures and proposes a conceptual framework (device-cloud pipeline) without experimental validation or formal mathematical analysis."
+}
+\ No newline at end of file
diff --git a/papers/nl2repo-bench-2025/paper_type.json b/papers/nl2repo-bench-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is NL2Repo-Bench, a new evaluation framework for coding agents on repository generation tasks; the experimental results validate the benchmark rather than constituting independent empirical findings."
+}
+\ No newline at end of file
diff --git a/papers/nlp-evaluation-trouble-2023/paper_type.json b/papers/nlp-evaluation-trouble-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "This position paper argues that LLM evaluation is compromised by data contamination and proposes a conceptual framework for categorizing contamination types and training stages, with empirical demonstrations as supporting evidence rather than primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/nmusketeers-reinforcement-learning-2026/paper_type.json b/papers/nmusketeers-reinforcement-learning-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a reinforcement learning method for multi-model collaboration and experimentally evaluates it on standard benchmarks, reporting quantitative performance results with specific improvements and degradation patterns across tasks."
+}
+\ No newline at end of file
diff --git a/papers/no-more-manual-2023/paper_type.json b/papers/no-more-manual-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts systematic experiments evaluating ChatGPT's unit test generation performance with quantitative metrics, identifies failure modes, and proposes/validates an improved approach (ChatTester) with measured improvements."
+}
+\ No newline at end of file
diff --git a/papers/no-need-lift-2023/paper_type.json b/papers/no-need-lift-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments with ChatGPT on LeetCode problems, reports quantitative acceptance rates and defect analysis, and analyzes multi-round fixing effectiveness—primary contribution is experimental findings about code generation quality."
+}
+\ No newline at end of file
diff --git a/papers/not-all-metrics-2023/paper_type.json b/papers/not-all-metrics-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments across multiple NLG benchmarks (MT, summarization, image captioning) and reports quantitative results showing improved metric-human correlation through diverse reference generation."
+}
+\ No newline at end of file
diff --git a/papers/not-everyone-wins-2025-2/paper_type.json b/papers/not-everyone-wins-2025-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts an observational study in a graduate course (n=36) to measure how technical experience and context (time pressure) affect LLM use patterns and academic performance, reporting quantitative results on homework outcomes."
+}
+\ No newline at end of file
diff --git a/papers/not-everyone-wins-2025/paper_type.json b/papers/not-everyone-wins-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs an observational study measuring how technical experience predicts student success with LLMs in data science, reports quantitative results (β=6.09, p=.041), and analyzes behavioral patterns empirically."
+}
+\ No newline at end of file
diff --git a/papers/not-what-youve-2023/paper_type.json b/papers/not-what-youve-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Introduces indirect prompt injection as a novel attack vector through qualitative case studies and demonstrations, proposing this conceptual framework as a security threat, without quantitative experimental validation on benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/novel-differential-feature-2025/paper_type.json b/papers/novel-differential-feature-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a specific architectural method (PF-DFL) and validates it experimentally with quantitative accuracy improvements on established hallucination detection benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/novel-preprocessing-technique-2023/paper_type.json b/papers/novel-preprocessing-technique-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes four novel preprocessing techniques for code generation with only preliminary experimental results on 5 scripts; explicitly states the full planned experiment remains incomplete, making this primarily a technique proposal rather than a validated empirical contribution."
+}
+\ No newline at end of file
diff --git a/papers/o1-reasoning-patterns-2024/paper_type.json b/papers/o1-reasoning-patterns-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs comparative experiments on multiple benchmarks (math, coding, commonsense reasoning) comparing o1 and GPT-4o, reporting quantitative results and identifying empirical reasoning patterns."
+}
+\ No newline at end of file
diff --git a/papers/obliinjection-orderoblivious-prompt-2025/paper_type.json b/papers/obliinjection-orderoblivious-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper's primary contribution is empirical validation of the ObliInjection attack method, demonstrating ~99% success rate across three datasets and seven LLMs with quantitative comparisons against prior attacks and defense mechanisms."
+}
+\ No newline at end of file
diff --git a/papers/ocrmediated-modality-dominance-2026/paper_type.json b/papers/ocrmediated-modality-dominance-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs controlled experiments testing 9 commercial VLMs with OCR-text injected into medical images, measuring quantitative outcomes (FPR, ASR, accuracy) across different injection methods and defenses."
+}
+\ No newline at end of file
diff --git a/papers/oet-optimizationbased-prompt-2025/paper_type.json b/papers/oet-optimizationbased-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is the OET evaluation toolkit for assessing prompt injection attacks across diverse datasets; empirical results are secondary demonstrations of the framework's utility."
+}
+\ No newline at end of file
diff --git a/papers/ojbench-competition-level-2025/paper_type.json b/papers/ojbench-competition-level-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "OJBench introduces a new 232-problem competitive programming benchmark sourced from NOI and ICPC; model evaluations are baseline results to characterize the benchmark's difficulty and utility."
+}
+\ No newline at end of file
diff --git a/papers/ojbkq-objectivejoint-babaiklein-2026/paper_type.json b/papers/ojbkq-objectivejoint-babaiklein-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a new quantization method (OJBKQ) and validates it through comprehensive experiments across multiple models, reporting quantitative improvements in perplexity and accuracy compared to existing baselines."
+}
+\ No newline at end of file
diff --git a/papers/omnicode-benchmark-2026/paper_type.json b/papers/omnicode-benchmark-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces OmniCode, a 1794-task benchmark for evaluating software development agents; empirical results on agent performance serve to demonstrate the benchmark's utility rather than being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/omnigrok-grokking-beyond-2022/paper_type.json b/papers/omnigrok-grokking-beyond-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The primary contribution is formal analysis of the grokking mechanism (LU mechanism) explaining loss landscape mismatches, with empirical validation on MNIST, IMDb, and QM9."
+}
+\ No newline at end of file
diff --git a/papers/on-premise-llm-cost-benefit-2025/paper_type.json b/papers/on-premise-llm-cost-benefit-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper performs formal mathematical cost-benefit analysis across deployment scenarios to calculate break-even periods, which is analytical rather than experimental."
+}
+\ No newline at end of file
diff --git a/papers/one-token-embedding-2025/paper_type.json b/papers/one-token-embedding-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper experimentally validates a Deadlock Attack on multiple LRMs, reporting quantitative success rates (100%) across three benchmarks, with the primary contribution being the empirical findings about the attack's effectiveness and properties."
+}
+\ No newline at end of file
diff --git a/papers/openhands-ai-sw-agent-2024/paper_type.json b/papers/openhands-ai-sw-agent-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The primary contribution is experimental validation of the OpenHands platform across 15 benchmarks with reported quantitative results (26% SWE-Bench, 15.3% WebArena, 52% GPQA), demonstrating practical performance and deployment economics."
+}
+\ No newline at end of file
diff --git a/papers/openllmrtl-open-dataset-2024/paper_type.json b/papers/openllmrtl-open-dataset-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces three open-source contributions (RTLLM 2.0 benchmark, AssertEval framework, RTLCoder-Data dataset) for RTL generation, with experimental validation of their utility."
+}
+\ No newline at end of file
diff --git a/papers/openr-reasoning-framework-2024/paper_type.json b/papers/openr-reasoning-framework-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents a framework and validates it through experiments on the MATH500 benchmark, reporting quantitative performance improvements (82% accuracy, ~10% relative gain over baselines)."
+}
+\ No newline at end of file
diff --git a/papers/openrubrics-scalable-synthetic-2025/paper_type.json b/papers/openrubrics-scalable-synthetic-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces a method (Contrastive Rubric Generation) and validates it experimentally across 8 reward-modeling benchmarks, reporting quantitative improvements (8.4pp) and downstream task performance gains."
+}
+\ No newline at end of file
diff --git a/papers/optima-optimizing-effectiveness-2024/paper_type.json b/papers/optima-optimizing-effectiveness-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces OPTIMA framework and validates it through quantitative experiments on existing benchmarks (MATH, GSM8k), reporting measurable performance gains and token efficiency improvements; primary contribution is experimental findings, not the framework concept alone."
+}
+\ No newline at end of file
diff --git a/papers/optimal-attention-temperature-2025/paper_type.json b/papers/optimal-attention-temperature-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Derives closed-form mathematical solutions for optimal attention temperature in linearized Transformers under distribution shift, with experiments serving to validate the theoretical results."
+}
+\ No newline at end of file
diff --git a/papers/optimal-scaling-laws-2026/paper_type.json b/papers/optimal-scaling-laws-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Derives exact information-theoretic scaling laws and analyzes optimal rates through formal mathematical analysis rather than empirical experimentation."
+}
+\ No newline at end of file
diff --git a/papers/optimalagentselection-stateaware-routing-2025/paper_type.json b/papers/optimalagentselection-stateaware-routing-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes STRMAC routing framework and reports quantitative experimental results (23.8% accuracy improvement, token consumption metrics) on existing benchmarks (PDDP, EBFC), making experimental findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/optimizationbased-prompt-injection-2024/paper_type.json b/papers/optimizationbased-prompt-injection-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports quantitative attack success rates (89-99% ASR) across multiple LLMs and benchmarks, systematically evaluates defense mechanisms, and demonstrates experimental transferability results."
+}
+\ No newline at end of file
diff --git a/papers/optimizing-code-runtime-2025/paper_type.json b/papers/optimizing-code-runtime-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes AUTOPATCH method and reports quantitative experimental results (7.3% improvement) on IBM Project CodeNet, with the primary contribution being empirical findings rather than a new benchmark or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/optimizing-netgpt-routingbased-2025/paper_type.json b/papers/optimizing-netgpt-routingbased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The primary contribution is proving formal properties of optimal routing thresholds (uniqueness and monotonicity in network parameters), with experiments validating the theoretical framework."
+}
+\ No newline at end of file
diff --git a/papers/optimizing-pytorch-inference-2025/paper_type.json b/papers/optimizing-pytorch-inference-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments with PIKE-B on a benchmark suite and reports quantitative performance results (2.88× speedup, comparisons with torch.compile/TensorRT/METR), with the primary contribution being empirical findings about exploit-heavy multi-agent strategies."
+}
+\ No newline at end of file
diff --git a/papers/orchestrating-intelligence-confidenceaware-2026/paper_type.json b/papers/orchestrating-intelligence-confidenceaware-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes OI-MAS routing framework and validates it experimentally across five benchmarks (GSM8K, MATH, MedQA, GPQA, MBPP) with quantitative results showing accuracy improvements and cost reductions over baselines."
+}
+\ No newline at end of file
diff --git a/papers/orllmagent-automating-modeling-2025/paper_type.json b/papers/orllmagent-automating-modeling-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The primary contribution is demonstrating through quantitative experiments (82.93% BWOR accuracy, ablation study, comparison across NL4OPT/MAMO/IndustryOR) that the OR-LLM-Agent framework effectively solves optimization problems; benchmark creation is secondary."
+}
+\ No newline at end of file
diff --git a/papers/osc-cognitive-orchestration-2025/paper_type.json b/papers/osc-cognitive-orchestration-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes the OSC multi-agent framework and reports quantitative experimental results on AlpacaEval 2.0 and MT-Bench with ablation studies demonstrating component importance and performance improvements over baselines."
+}
+\ No newline at end of file
diff --git a/papers/osworld-benchmarking-multimodal-2024/paper_type.json b/papers/osworld-benchmarking-multimodal-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The paper's primary contribution is introducing OSWorld, a new real-world computer environment benchmark with 369 tasks; the baseline evaluations of models are secondary validation of the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/outofcontext-outofscope-manipulating-2026/paper_type.json b/papers/outofcontext-outofscope-manipulating-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs systematic experiments on multiple LLMs (7B-70B) to test instruction-hiding techniques via fine-tuning, measures the effectiveness of different prompt formats, and reports quantitative findings about embedding behaviors through minimal modifications."
+}
+\ No newline at end of file
diff --git a/papers/overreliance-human-ai-2025/paper_type.json b/papers/overreliance-human-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "This is a position paper that argues overreliance measurement should be central to LLM research and proposes mitigation strategies without experimental validation, despite having theoretical methodology tags."
+}
+\ No newline at end of file
diff --git a/papers/overseeing-agents-without-2026/paper_type.json b/papers/overseeing-agents-without-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs three iterative user studies with quantitative outcome measures (Hedges' g effect sizes) on agent oversight interfaces; primary contribution is experimental findings on interface design effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/pacost-paired-confidence-2024/paper_type.json b/papers/pacost-paired-confidence-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces a contamination detection method (PaCoST) and validates it through controlled experiments, then applies it to 10 LLMs across 6 benchmarks to report empirical findings about widespread contamination."
+}
+\ No newline at end of file
diff --git a/papers/pairwise-proximal-policy-2023/paper_type.json b/papers/pairwise-proximal-policy-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces P3O algorithm and validates it experimentally on TL;DR and Anthropic HH benchmarks with quantitative KL-Reward trade-off results; theoretical properties are supporting analysis rather than the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/palm-scaling-language-2022/paper_type.json b/papers/palm-scaling-language-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports extensive quantitative experimental results from training and evaluating a 540B-parameter language model across 28+ NLP benchmarks and 150+ BIG-bench tasks, with the primary contribution being the empirical findings on model performance at scale."
+}
+\ No newline at end of file
diff --git a/papers/pangucoder2-boosting-large-2023/paper_type.json b/papers/pangucoder2-boosting-large-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents experimental results of a fine-tuned code LLM evaluated on standard benchmarks (HumanEval, CoderEval, LeetCode) with quantitative performance metrics as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/paperbench-evaluating-ais-2025/paper_type.json b/papers/paperbench-evaluating-ais-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces PaperBench, a novel benchmark with 8,316 tasks for evaluating AI agents' ability to replicate research papers, with baselines tested but the benchmark itself being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/parameterefficient-finetuning-attributed-2025/paper_type.json b/papers/parameterefficient-finetuning-attributed-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes Graph-LoRA method and validates it through experiments on five existing datasets with three LLMs, reporting quantitative accuracy and F1 improvements with ablation studies."
+}
+\ No newline at end of file
diff --git a/papers/patchzero-zeroshot-automatic-2023/paper_type.json b/papers/patchzero-zeroshot-automatic-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes LLM4PatchCorrect method and reports quantitative experimental results (84.4% accuracy, 86.5% F1) across 22 APR tools, with ablation studies and comparisons to prior approaches."
+}
+\ No newline at end of file
diff --git a/papers/pathfix-automated-program-2025/paper_type.json b/papers/pathfix-automated-program-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "PathFix proposes a new program repair method and validates it through quantitative experiments on QuixBugs benchmark, comparing performance against baselines (SemGraft) with and without LLM integration."
+}
+\ No newline at end of file
diff --git a/papers/pay-hints-not-2026/paper_type.json b/papers/pay-hints-not-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper introduces and experimentally validates a cost-efficient inference method (LLM Shepherding), reporting quantitative results (42-94% cost reduction) across four benchmarks with comparisons to baselines."
+}
+\ No newline at end of file
diff --git a/papers/pedagogical-alignment-large-2024/paper_type.json b/papers/pedagogical-alignment-large-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper explicitly identifies itself as a narrative survey that reviews and synthesizes existing literature on LLMs for education across multiple dimensions (knowledge editing, content generation, personalized learning), catalogs frameworks and challenges, but does not present original experimental results."
+}
+\ No newline at end of file
diff --git a/papers/peeping-at-creaitivity-2025/paper_type.json b/papers/peeping-at-creaitivity-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Experimentally evaluates creative capabilities of three chatbots (ChatGPT, Claude, Gemini) with quantitative ratings from AI and human evaluators plus qualitative analysis of writing patterns and limitations."
+}
+\ No newline at end of file
diff --git a/papers/pennylang-pioneering-llmbased-2025/paper_type.json b/papers/pennylang-pioneering-llmbased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The paper's primary contribution is introducing PennyLang, a novel curated dataset of 3,347 samples for quantum code generation; the experiments serve to validate and characterize this benchmark rather than being the core finding."
+}
+\ No newline at end of file
diff --git a/papers/performance-review-llm-2024/paper_type.json b/papers/performance-review-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments evaluating 22 LLMs on LeetCode problems and reports quantitative results (pass@k metrics, runtime percentiles), with the primary contribution being experimental performance findings rather than a new benchmark or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/performance-study-llmgenerated-2024/paper_type.json b/papers/performance-study-llmgenerated-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Evaluates 18 LLMs on 204 Leetcode problems and reports quantitative performance metrics including runtime analysis, pass@k rates, and statistical correlations."
+}
+\ No newline at end of file
diff --git a/papers/perplexity-paradox-why-2026/paper_type.json b/papers/perplexity-paradox-why-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Paper validates the perplexity paradox mechanism through experiments with quantitative results (perplexity metrics, pass rates, compression ratios) and tests novel compression techniques (signature injection, TAAC algorithm) on benchmark tasks."
+}
+\ No newline at end of file
diff --git a/papers/persistent-backdoor-attacks-2025/paper_type.json b/papers/persistent-backdoor-attacks-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper conducts experiments on LLaMA3 and Qwen2.5 models, reports quantitative results (99-100% backdoor persistence rates), and compares P-Trojan against baselines, making empirical findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/personadual-balancing-personalization-2026/paper_type.json b/papers/personadual-balancing-personalization-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes PersonaDual method, trains it with SFT and DualGRPO, reports quantitative experimental results on multiple model backbones, and measures performance gains against baselines."
+}
+\ No newline at end of file
diff --git a/papers/personalizedrouter-personalized-llm-2025/paper_type.json b/papers/personalizedrouter-personalized-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces PersonalizedRouter framework and demonstrates its effectiveness through quantitative experiments on benchmarks and real-user validation, with primary contribution being experimental findings rather than the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/phantom-transfer-datalevel-2026/paper_type.json b/papers/phantom-transfer-datalevel-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper experimentally demonstrates Phantom Transfer attacks across multiple models (including GPT-4.1), evaluates various defense mechanisms against these attacks, and reports comparative findings between attack methods, making the primary contribution experimental rather than positional or theoretical."
+}
+\ No newline at end of file
diff --git a/papers/picking-right-specialist-2026/paper_type.json b/papers/picking-right-specialist-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Primary contribution is the experimental validation of ToolSelect method across multiple tasks, demonstrating superior performance over baselines; benchmark introduced as supporting infrastructure for evaluation rather than as the main contribution."
+}
+\ No newline at end of file
diff --git a/papers/piguard-prompt-injection-2025/paper_type.json b/papers/piguard-prompt-injection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes PIGuard model with MOF training strategy and reports quantitative experimental results (83.48% accuracy) demonstrating superiority over baselines; the NotInject benchmark is a supporting contribution used to evaluate the model."
+}
+\ No newline at end of file
diff --git a/papers/pina-prompt-injection-2026/paper_type.json b/papers/pina-prompt-injection-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces an adaptive prompt injection attack framework and reports quantitative results (75-100% ASR) from systematic experiments on navigation agents, with primary contribution being the experimental findings on attack effectiveness and transferability."
+}
+\ No newline at end of file
diff --git a/papers/pisanitizer-preventing-prompt-2025/paper_type.json b/papers/pisanitizer-preventing-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes PISanitizer defense mechanism and validates it through experiments across 6 LongBench datasets and 7 LLMs, reporting quantitative results on attack success rates and performance tradeoffs."
+}
+\ No newline at end of file
diff --git a/papers/plan-and-act-long-horizon-2025/paper_type.json b/papers/plan-and-act-long-horizon-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes PLAN-AND-ACT method, runs experiments on WebArena-Lite and WebVoyager benchmarks, and reports quantitative performance improvements with ablation studies showing the contribution of each component."
+}
+\ No newline at end of file
diff --git a/papers/planning-natural-language-2024/paper_type.json b/papers/planning-natural-language-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes PLANSEARCH algorithm and validates through quantitative experiments on LiveCodeBench benchmark, with primary contribution being experimental findings on search performance improvements."
+}
+\ No newline at end of file
diff --git a/papers/planrag-planthenretrieval-augmented-2024/paper_type.json b/papers/planrag-planthenretrieval-augmented-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes the PlanRAG technique and validates it through controlled experiments comparing against iterative RAG baseline, reporting quantitative performance improvements on Decision QA benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/play-by-type-2025/paper_type.json b/papers/play-by-type-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes type-constrained decoding for LLM functions and validates experimentally with quantitative results (7% accuracy improvement, 53% latency improvement) on established benchmarks (HybridQA, TAG-Bench)."
+}
+\ No newline at end of file
diff --git a/papers/plot2code-comprehensive-benchmark-2025/paper_type.json b/papers/plot2code-comprehensive-benchmark-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The paper introduces Plot2Code, a new 368-plot benchmark and evaluation framework for assessing MLLMs on code generation from scientific plots, with experiments and metrics serving to validate and demonstrate the benchmark's utility."
+}
+\ No newline at end of file
diff --git a/papers/poetics-code-generative-2025/paper_type.json b/papers/poetics-code-generative-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This narrative literature review catalogs existing perspectives on AI's limitations in literary creativity, making literature synthesis—not original experiments—the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/poison-once-refuse-2025/paper_type.json b/papers/poison-once-refuse-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces the SAI attack method and reports quantitative experimental results (90% refusal rates, defense evasion metrics, bias propagation measurements) demonstrating its effectiveness on LLMs."
+}
+\ No newline at end of file
diff --git a/papers/poisoning-attacks-llms-2025/paper_type.json b/papers/poisoning-attacks-llms-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments testing poisoning attacks across multiple model sizes and dataset sizes, reporting quantitative results on attack success rates and sample requirements."
+}
+\ No newline at end of file
diff --git a/papers/poisoning-real-threat-2024/paper_type.json b/papers/poisoning-real-threat-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts experiments comparing poisoning vulnerabilities of DPO vs PPO-based RLHF with quantitative measurements of attack thresholds and defense effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/policy-compiler-secure-2026/paper_type.json b/papers/policy-compiler-secure-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on τ2-bench customer service tasks, reports quantitative compliance improvements (48% → 93%), and measures attack success rates and runtime overhead."
+}
+\ No newline at end of file
diff --git a/papers/policyasprompt-turning-ai-2025/paper_type.json b/papers/policyasprompt-turning-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a framework and runs controlled experiments comparing model performance (O1 vs others, GPT-4o vs Qwen3-1.7B) on policy extraction and enforcement tasks, reporting quantitative results (F1, accuracy metrics) as primary contributions."
+}
+\ No newline at end of file
diff --git a/papers/poodle-seamlessly-scaling-2025/paper_type.json b/papers/poodle-seamlessly-scaling-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents and experimentally validates a just-in-time model replacement technique with quantitative benchmarks on IMDB sentiment classification, demonstrating cost, throughput, and accuracy improvements."
+}
+\ No newline at end of file
diff --git a/papers/position-explaining-behavioral-2026/paper_type.json b/papers/position-explaining-behavioral-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "The paper argues a viewpoint about the necessity of comparative explainability methods (∆-XAI) for understanding LLM behavioral shifts, proposes a prescriptive framework with 10 desiderata, and includes only a small illustrative experiment rather than primary experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/position-require-frontier-2025/paper_type.json b/papers/position-require-frontier-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "The paper explicitly proposes a policy prescription (mandating analog model releases) and argues a viewpoint about safety-innovation tradeoffs, rather than reporting experimental findings, introducing a benchmark, or proving formal results."
+}
+\ No newline at end of file
diff --git a/papers/position-vibe-coding-2025/paper_type.json b/papers/position-vibe-coding-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Explicitly labeled position paper that argues a viewpoint about vibe coding limitations and proposes a conceptual framework (Vibe Reasoning) with a proof-of-concept rather than experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/pots-proofoftrainingsteps-backdoor-2025/paper_type.json b/papers/pots-proofoftrainingsteps-backdoor-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes the PoTS backdoor detection method and validates it experimentally across multiple LLM models, reporting quantitative metrics on detection rates, verification speed, and computational efficiency."
+}
+\ No newline at end of file
diff --git a/papers/power-limitations-aggregation-2026/paper_type.json b/papers/power-limitations-aggregation-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Primary contribution is a formal theoretical framework with proven theorems (4.3, 4.4) characterizing necessary and sufficient conditions for aggregation in compound AI systems; empirical illustration is secondary."
+}
+\ No newline at end of file
diff --git a/papers/practical-program-repair-2022/paper_type.json b/papers/practical-program-repair-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Extensive experimental evaluation of 9 PLMs for automated program repair across 5 benchmarks, reporting quantitative results (109 bugs fixed) and demonstrating scaling effects."
+}
+\ No newline at end of file
diff --git a/papers/practical-useful-automated-2024/paper_type.json b/papers/practical-useful-automated-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports quantitative experimental results on established benchmarks (QuixBugs, Defects4J) and user studies (44% success improvement), making the primary contribution empirical findings rather than position or theory."
+}
+\ No newline at end of file
diff --git a/papers/pragmatic-reasoning-improves-2025/paper_type.json b/papers/pragmatic-reasoning-improves-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Applies pragmatic reasoning framework to code generation with quantitative results on MBPP and HumanEval benchmarks, comparing against baselines and conducting ablation studies across models."
+}
+\ No newline at end of file
diff --git a/papers/prattack-coordinated-promptrag-2025/paper_type.json b/papers/prattack-coordinated-promptrag-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a new attack method (PR-Attack) and validates it experimentally across 6 LLMs and 3 QA datasets, with the primary contribution being the empirical demonstration of 90-100% attack success rates; the bilevel optimization formulation is theoretical support for the experimental findings, not the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/precedentbased-professional-role-2025/paper_type.json b/papers/precedentbased-professional-role-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes a conceptual framework (ProEthica) for professional role ethics in AI decision analysis with architectural design intent, but presents no quantitative results, experiments, or system outputs."
+}
+\ No newline at end of file
diff --git a/papers/predictable-artificial-intelligence-2023/paper_type.json b/papers/predictable-artificial-intelligence-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Introduces a formal mathematical framework (Equations 1-4) characterizing unpredictability in AI systems through formal analysis of predictor families, scoring rules, and ecosystem histories."
+}
+\ No newline at end of file
diff --git a/papers/predicting-llm-reasoning-2025/paper_type.json b/papers/predicting-llm-reasoning-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper introduces RBRIDGE, a method for proxy model prediction, and validates it experimentally across six benchmarks, reporting quantitative results (100x+ compute savings, correlation metrics) as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/prefillshare-shared-prefill-2026/paper_type.json b/papers/prefillshare-shared-prefill-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes PrefillShare method and validates it through experiments on standard benchmarks (math, coding, tool-calling) and real workloads, reporting quantitative results on accuracy and performance metrics."
+}
+\ No newline at end of file
diff --git a/papers/pretraining-scaling-laws-2025/paper_type.json b/papers/pretraining-scaling-laws-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Primary contribution is proposing and proving pretraining scaling laws with formal mathematical analysis across multiple formulations; experiments on GSM8K and MATH serve to validate the theoretical predictions rather than being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/primg-efficient-llmdriven-2025/paper_type.json b/papers/primg-efficient-llmdriven-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments on real-world Solidity contracts (Code4Arena) with quantitative results (test correctness 3%→33%, mutant kill rates), where the primary contribution is the experimental validation of the PRIMG technique."
+}
+\ No newline at end of file
diff --git a/papers/proactive-hardening-llm-2026/paper_type.json b/papers/proactive-hardening-llm-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Paper runs controlled experiments comparing hard-negative mining strategies with quantitative results (accuracy reductions and recovery rates across configurations), with the primary contribution being empirical findings about HASTE's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/probing-emergence-crosslingual-2024/paper_type.json b/papers/probing-emergence-crosslingual-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper probes BLOOM checkpoints at various training steps to measure cross-lingual neuron overlap and correlate it with downstream performance on standard benchmarks (XNLI, POS tagging), with primary contribution being empirical findings about alignment emergence dynamics."
+}
+\ No newline at end of file
diff --git a/papers/probing-language-models-2024/paper_type.json b/papers/probing-language-models-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a method (linear probes for pre-training data detection) and reports quantitative experimental results (AUC scores on benchmarks); while it introduces ArxivMIA, the primary contribution is the empirical findings about the probing approach's effectiveness, not the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/processcentric-analysis-agentic-2025/paper_type.json b/papers/processcentric-analysis-agentic-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces Graphectory representation and conducts large-scale observational analysis of 4,000 trajectories across systems and models, reporting quantitative findings about trajectory patterns, anti-patterns, and performance correlations."
+}
+\ No newline at end of file
diff --git a/papers/programmed-please-moral-2026/paper_type.json b/papers/programmed-please-moral-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Makes a philosophical argument about AI sycophancy using Aristotelian virtue ethics and proposes prescriptive solutions without empirical validation, experiments, or formal proofs."
+}
+\ No newline at end of file
diff --git a/papers/programming-language-techniques-2025/paper_type.json b/papers/programming-language-techniques-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Explicitly a position paper arguing that PL techniques should bridge LLM code generation gaps, proposing a research agenda without empirical validation."
+}
+\ No newline at end of file
diff --git a/papers/projdevbench-end-to-end-2026/paper_type.json b/papers/projdevbench-end-to-end-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces ProjDevBench, a new benchmark with 20 end-to-end C++ project construction tasks for evaluating coding agents; the primary contribution is the benchmark framework itself, though baseline evaluations of agents are included."
+}
+\ No newline at end of file
diff --git a/papers/projecteval-benchmark-programming-2025/paper_type.json b/papers/projecteval-benchmark-programming-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces ProjectEval, a 20-task benchmark with automated evaluation framework; runs baseline experiments to demonstrate the benchmark but the primary contribution is the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/projecttest-projectlevel-llm-2025/paper_type.json b/papers/projecttest-projectlevel-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces ProjectTest, a new project-level unit test generation benchmark covering 60 projects; primary contribution is the benchmark resource itself, though it includes empirical evaluation on 9 frontier LLMs."
+}
+\ No newline at end of file
diff --git a/papers/promises-perils-timely-2026/paper_type.json b/papers/promises-perils-timely-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper reports quantitative empirical findings about agent adoption rates (15-20% of GitHub projects) and tool market share derived from observational analysis of GitHub traces, with the heuristics serving as the detection methodology rather than the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/prompt-alchemist-automated-2025/paper_type.json b/papers/prompt-alchemist-automated-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes MAPS method and validates it through experiments on the Defects4J benchmark, comparing performance against baselines with quantitative metrics (line/branch coverage) across multiple LLMs."
+}
+\ No newline at end of file
diff --git a/papers/prompt-infection-llmtollm-2024/paper_type.json b/papers/prompt-infection-llmtollm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Demonstrates prompt injection attacks through controlled experiments with quantitative results (14-209% performance improvements, logistic growth patterns), comparing model vulnerabilities and infection mechanisms across different scenarios."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-attacks-2024/paper_type.json b/papers/prompt-injection-attacks-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper conducts experimental attacks and defenses in a CTF competition, reporting quantitative results on attack effectiveness and demonstrating which strategies bypass defenses, with the primary contribution being empirical findings about prompt injection attack success rates."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-attacks-2025-2-2/paper_type.json b/papers/prompt-injection-attacks-2025-2-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on four VLMs with quantified attack success rates (33-67%) on medical imaging tasks and tests mitigation strategies, with experimental findings as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-attacks-2025-2/paper_type.json b/papers/prompt-injection-attacks-2025-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments with prompt injection attacks on LLMs, quantifying acceptance/rejection rates across models and comparing to human baseline performance."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-attacks-2025/paper_type.json b/papers/prompt-injection-attacks-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Systematic review synthesizing 128 studies on prompt injection attacks and defenses with classification frameworks—primary contribution is synthesis of existing work, not new experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-attacks-2026-2-2/paper_type.json b/papers/prompt-injection-attacks-2026-2-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Primary contribution is quantitative experimental evaluation of prompt injection attacks across three LLMs with 1,000 scenarios, ablation studies, and a proposed defense framework validated to reduce attack success rate from 42% to 3.2%."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-attacks-2026/paper_type.json b/papers/prompt-injection-attacks-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper is a systematization of knowledge (SoK) that synthesizes 78 existing studies on prompt injection attacks, proposing a unifying taxonomy of the landscape rather than conducting primary experiments."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-chatbot-plugins-2025/paper_type.json b/papers/prompt-injection-chatbot-plugins-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Large-scale observational study across 17 plugins and 10,000+ websites reporting quantitative findings on prompt injection vulnerabilities and attack success rates."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-detection-2025/paper_type.json b/papers/prompt-injection-detection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a prompt injection detection system and empirically validates the BERT component on ~250k validation records, reporting accuracy results of 0.91-0.99 across use cases."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-llm-apps-2023/paper_type.json b/papers/prompt-injection-llm-apps-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs systematic experiments testing HOUYI across 36 real-world applications, reports quantitative findings (86.1% success rate, 31 vulnerable apps), and evaluates existing defenses—making the experimental findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-tool-selection-2025/paper_type.json b/papers/prompt-injection-tool-selection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Primary contribution is experimental validation of a novel attack method (ToolHijacker) with quantitative results across multiple LLM architectures and defense mechanisms."
+}
+\ No newline at end of file
diff --git a/papers/prompt-injection-vulnerability-2025/paper_type.json b/papers/prompt-injection-vulnerability-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments testing prompt injection attacks on multiple LLMs, reports quantitative Attack Success Rates across different models, domains, and rhetorical strategies, with the primary contribution being empirical findings about vulnerability patterns."
+}
+\ No newline at end of file
diff --git a/papers/prompt-less-smile-2025/paper_type.json b/papers/prompt-less-smile-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes semantic engineering method, runs experiments on benchmarks with quantitative comparisons (1.3x-3x improvements), includes ablation study, and measures developer effort—primary contribution is experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/prompt-perturbation-fraction-2025/paper_type.json b/papers/prompt-perturbation-fraction-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs controlled experiments across six LLMs measuring how prompt variations and fractional scoring affect correlation with human judgments, with quantitative empirical results as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/prompt-sapper-llmempowered-2023/paper_type.json b/papers/prompt-sapper-llmempowered-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper presents two user studies with quantitative results (timing metrics with p-values, participant counts, and rating scores) as the primary contribution demonstrating the tool's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/prompt-variability-effects-2025/paper_type.json b/papers/prompt-variability-effects-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs controlled experiments on multiple LLMs measuring code similarity under prompt variations (typos, semantics, personas), reports quantitative results on benchmarks, and analyzes data contamination effects."
+}
+\ No newline at end of file
diff --git a/papers/promptarmor-simple-yet-2025/paper_type.json b/papers/promptarmor-simple-yet-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes PromptArmor defense and validates it experimentally on AgentDojo benchmark with quantitative results (FPR/FNR metrics and attack success rates) across multiple models and attack scenarios."
+}
+\ No newline at end of file
diff --git a/papers/promptbased-code-completion-2024/paper_type.json b/papers/promptbased-code-completion-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes ProCC framework and reports quantitative experimental results (8.6-10.1% improvements) on established code completion benchmarks, with primary contribution being empirical performance findings."
+}
+\ No newline at end of file
diff --git a/papers/prompting-programming-query-2022/paper_type.json b/papers/prompting-programming-query-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper introduces LMQL and validates its practical benefits through extensive experiments demonstrating 26-85% token reduction, improved accuracy, and code brevity across multiple prompting scenarios."
+}
+\ No newline at end of file
diff --git a/papers/promptlocate-localizing-prompt-2025/paper_type.json b/papers/promptlocate-localizing-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a method for localizing prompt injection attacks and validates it experimentally with quantitative metrics (ROUGE-L, embedding similarity, FPR/FNR) across existing benchmarks and adaptive attack variants."
+}
+\ No newline at end of file
diff --git a/papers/promptpex-automatic-test-2025/paper_type.json b/papers/promptpex-automatic-test-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper presents an LLM-based tool (PromptPex) and reports quantitative experimental results across four models, comparing test generation effectiveness against baselines with metrics like non-compliance rates and groundedness percentages."
+}
+\ No newline at end of file
diff --git a/papers/prompts-first-precision-2025/paper_type.json b/papers/prompts-first-precision-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Explicitly framed as a position paper arguing for a pedagogical approach without experimental validation; makes prescriptive claims about how computing education should be structured."
+}
+\ No newline at end of file
diff --git a/papers/promptscreen-efficient-jailbreak-2025/paper_type.json b/papers/promptscreen-efficient-jailbreak-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes PromptScreen method and reports quantitative experimental results (93.4% accuracy, 0% ASR) with ablation studies on a dataset of 30,937 labeled prompts; primary contribution is empirical findings, not the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/promptsleuth-detecting-prompt-2025/paper_type.json b/papers/promptsleuth-detecting-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper introduces PromptSleuth, a defense mechanism, and validates it experimentally across three benchmarks with quantitative false-negative-rate comparisons showing superior performance over existing defenses."
+}
+\ No newline at end of file
diff --git a/papers/promptware-kill-chain-2026/paper_type.json b/papers/promptware-kill-chain-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "survey",
+  "reason": "Meta-analyzes 36 documented prompt injection incidents to synthesize patterns and propose a kill chain framework, with primary contribution being the synthesis of attack evolution across 2023–2026."
+}
+\ No newline at end of file
diff --git a/papers/proof-time-benchmark-2026/paper_type.json b/papers/proof-time-benchmark-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The paper explicitly introduces PoT, a new time-partitioned benchmarking framework with 30K+ instances across four domains for evaluating scientific idea judgments; the primary contribution is the benchmark itself, not empirical findings about models."
+}
+\ No newline at end of file
diff --git a/papers/prophetfuzz-fully-automated-2024/paper_type.json b/papers/prophetfuzz-fully-automated-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces ProphetFuzz system and reports quantitative experimental results (364 vulnerabilities discovered, precision metrics, baseline comparisons) across 52 programs, with the primary contribution being the empirical findings of the approach's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/protect-llm-agent-2025/paper_type.json b/papers/protect-llm-agent-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper proposes Polymorphic Prompt Assembling as a defense mechanism and experimentally validates it with quantitative results across multiple LLM models and 12 attack categories, with performance benchmarks and overhead measurements."
+}
+\ No newline at end of file
diff --git a/papers/proteus-slaaware-routing-2026/paper_type.json b/papers/proteus-slaaware-routing-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes PROTEUS routing system and validates it experimentally on existing benchmarks (RouterBench, SPROUT), reporting quantitative metrics for accuracy, compliance, and cost savings."
+}
+\ No newline at end of file
diff --git a/papers/proververifier-games-improve-2024/paper_type.json b/papers/proververifier-games-improve-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper trains LLM provers using Prover-Verifier Games and reports quantitative experimental findings about improved legibility, comparing training approaches and reward mechanisms across benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/proving-coding-interview-2025/paper_type.json b/papers/proving-coding-interview-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces FVAPPS, a new formal verification benchmark with 4,715 Lean 4 samples; baseline experiments are secondary to the benchmark contribution itself."
+}
+\ No newline at end of file
diff --git a/papers/psychometric-personality-shaping-2025/paper_type.json b/papers/psychometric-personality-shaping-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs controlled experiments testing personality prompting on multiple established safety and capability benchmarks (WMDP, TruthfulQA, ETHICS, MMLU) with quantified effect sizes."
+}
+\ No newline at end of file
diff --git a/papers/pyramid-moa-probabilistic-2026/paper_type.json b/papers/pyramid-moa-probabilistic-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Presents a novel inference framework (Pyramid MoA) and validates it through quantitative experiments on GSM8K and MBPP benchmarks, reporting performance metrics and comparisons against baselines."
+}
+\ No newline at end of file
diff --git a/papers/python-symbolic-execution-2024/paper_type.json b/papers/python-symbolic-execution-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Builds an LLM-augmented symbolic execution system and reports quantitative experimental results on LeetCode benchmarks, with the primary contribution being the measured performance metrics and cost-effectiveness of the approach."
+}
+\ No newline at end of file
diff --git a/papers/pythonsaga-redefining-benchmark-2024/paper_type.json b/papers/pythonsaga-redefining-benchmark-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces PythonSaga, a new 185-problem code benchmark designed to address documented biases in HumanEval and MBPP, with baseline evaluations demonstrating improved concept and difficulty balance."
+}
+\ No newline at end of file
diff --git a/papers/qiskit-humaneval-evaluation-2024/paper_type.json b/papers/qiskit-humaneval-evaluation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The paper's primary contribution is introducing Qiskit HumanEval, a 101-task benchmark for evaluating quantum code generation; experiments on the benchmark validate and characterize it rather than serving as the main research finding."
+}
+\ No newline at end of file
diff --git a/papers/quantifying-contamination-evaluating-2024/paper_type.json b/papers/quantifying-contamination-evaluating-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper measures contamination in existing benchmarks through experiments and reports quantitative results on how contamination affects model performance across MBPP and HumanEval."
+}
+\ No newline at end of file
diff --git a/papers/quantization-model-neural-2023/paper_type.json b/papers/quantization-model-neural-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "theoretical",
+  "reason": "Proposes the Quantization Hypothesis as a formal mathematical model explaining power law scaling and emergent capabilities, with empirical validation as supporting evidence rather than the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/queryipi-queryagnostic-indirect-2025/paper_type.json b/papers/queryipi-queryagnostic-indirect-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Reports quantitative attack success rates (87% ASR), benchmarks against baselines (50%), tests transferability to real agents, and evaluates against defense mechanisms through experiments."
+}
+\ No newline at end of file
diff --git a/papers/quo-vadis-code-2025/paper_type.json b/papers/quo-vadis-code-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Conducts original survey research with 100 professional developers, reporting quantitative findings about code review expectations and LLM adoption, with qualitative analysis of identified risks."
+}
+\ No newline at end of file
diff --git a/papers/qwen25-technical-report-2024/paper_type.json b/papers/qwen25-technical-report-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Reports quantitative performance results of Qwen2.5 across multiple benchmarks, comparing model efficiency and capabilities against existing baselines; primary contribution is experimental findings rather than introducing new evaluation frameworks."
+}
+\ No newline at end of file
diff --git a/papers/qwen25coder-technical-report-2024/paper_type.json b/papers/qwen25coder-technical-report-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Presents experimental results comparing Qwen2.5-Coder models against GPT-4o and other baselines on coding benchmarks, with primary contributions being quantitative performance findings and scaling properties rather than a new benchmark, survey, or theory."
+}
+\ No newline at end of file
diff --git a/papers/r2router-new-paradigm-2026/paper_type.json b/papers/r2router-new-paradigm-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The primary contribution is R2-ROUTER, a method that jointly selects LLMs and output budgets with experimental results showing 4-5x cost reduction; while R2-BENCH (a new routing dataset) is introduced, it serves to validate the routing method rather than being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/ragmcp-mitigating-prompt-2025/paper_type.json b/papers/ragmcp-mitigating-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper proposes RAG-MCP and runs experiments on MCPBench, reporting quantitative results (43.13% accuracy, 49% token reduction) and stress testing with varying tool pool sizes."
+}
+\ No newline at end of file
diff --git a/papers/ral2m-retrieval-augmented-2026/paper_type.json b/papers/ral2m-retrieval-augmented-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes RAL2M method and validates it through quantitative experiments on a multi-domain QA benchmark, with the primary contribution being the experimental findings demonstrating the method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/ramon-llulls-thinking-2025/paper_type.json b/papers/ramon-llulls-thinking-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a combinatorial ideation framework and validates it through experiments on 7,483 papers, reporting quantitative results on decomposability, reconstructibility, and diversity-relevance trade-offs."
+}
+\ No newline at end of file
diff --git a/papers/random-scaling-emergent-2025/paper_type.json b/papers/random-scaling-emergent-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments with different random seeds to measure scaling trends quantitatively, reporting empirical findings about bimodal performance distributions and emergence behavior as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/rankllm-python-package-2025/paper_type.json b/papers/rankllm-python-package-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs reproduction experiments on TREC benchmarks (DL19-DL23) and reports quantitative findings comparing model performance, malformed response rates, and nDCG scores."
+}
+\ No newline at end of file
diff --git a/papers/rathandravidianlangtech-2025-annaparavai-2025/paper_type.json b/papers/rathandravidianlangtech-2025-annaparavai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Paper runs experiments with a transfer learning ensemble approach and reports quantitative F1 scores on an existing shared-task dataset for AI-generated review detection, with primary contribution being experimental findings on performance metrics."
+}
+\ No newline at end of file
diff --git a/papers/raudit-blind-auditing-2026/paper_type.json b/papers/raudit-blind-auditing-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper introduces RAudit as a diagnostic protocol and reports quantitative experimental findings on failure mechanisms in LLM reasoning through experiments on existing benchmarks, making the empirical results the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/react-synergizing-reasoning-2022/paper_type.json b/papers/react-synergizing-reasoning-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces the ReAct method and validates it through experiments on multiple benchmarks (HotpotQA, FEVER, ALFWorld, WebShop) with quantitative comparisons against baselines, where experimental findings are the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/real-time-ai-2025/paper_type.json b/papers/real-time-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Proposes a conceptual defense framework against prompt injection attacks without any experimental validation, empirical results, or formal mathematical analysis."
+}
+\ No newline at end of file
diff --git a/papers/realist-pluralist-conceptions-2025/paper_type.json b/papers/realist-pluralist-conceptions-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "The paper proposes a conceptual framework (two conceptions of intelligence) and argues how it clarifies methodological disagreements in AI research, without experimental validation, mathematical proofs, or benchmark creation."
+}
+\ No newline at end of file
diff --git a/papers/realmath-continuous-benchmark-2025/paper_type.json b/papers/realmath-continuous-benchmark-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces RealMath, a new benchmark with 1,200+ QA pairs from arXiv and Stack Exchange for evaluating LLMs on research-level mathematics, with frontier model baselines provided."
+}
+\ No newline at end of file
diff --git a/papers/reasalign-reasoning-enhanced-2026/paper_type.json b/papers/reasalign-reasoning-enhanced-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes ReasAlign method and validates it through experiments on CyberSecEval2 and CySE benchmarks, reporting quantitative results and ablation studies comparing against baselines."
+}
+\ No newline at end of file
diff --git a/papers/reasoning-large-language-2023/paper_type.json b/papers/reasoning-large-language-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs controlled experiments across 10 reasoning benchmarks comparing multi-agent peer review collaboration against baselines, reporting quantitative results on strategy effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/reasoning-runtime-behavior-2024/paper_type.json b/papers/reasoning-runtime-behavior-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces REval, a novel evaluation framework with four code reasoning tasks and an Incremental Consistency metric, then validates it by evaluating 15 LLMs; the primary contribution is the benchmark and evaluation methodology, not just the empirical findings."
+}
+\ No newline at end of file
diff --git a/papers/rebench-evaluating-frontier-2024/paper_type.json b/papers/rebench-evaluating-frontier-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces RE-Bench, a novel evaluation framework with 7 hand-crafted ML research engineering environments, with baseline experiments validating the benchmark."
+}
+\ No newline at end of file
diff --git a/papers/recode-improving-llmbased-2025/paper_type.json b/papers/recode-improving-llmbased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces ReCode method and validates it experimentally across 6 LLMs on code repair benchmarks, with quantitative performance improvements as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/red-teaming-mind-2025/paper_type.json b/papers/red-teaming-mind-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper systematically tests 1,400+ adversarial prompts against multiple LLMs and reports quantitative results (success rates, transfer rates, failure modes) as its primary contribution, rather than creating a reusable benchmark or surveying existing work."
+}
+\ No newline at end of file
diff --git a/papers/redcode-risky-code-2024/paper_type.json b/papers/redcode-risky-code-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is RedCode, a new benchmark with 4,050 risky code execution scenarios and 160 malicious software generation prompts; the agent evaluations are baseline demonstrations of the benchmark."
+}
+\ No newline at end of file
diff --git a/papers/reducing-hallucinations-llmgenerated-2025/paper_type.json b/papers/reducing-hallucinations-llmgenerated-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a semantic triangulation method and validates it empirically on LiveCodeBench and CodeElo benchmarks, with quantitative results demonstrating 90% conditional correctness vs 60% baselines."
+}
+\ No newline at end of file
diff --git a/papers/redvisor-reasoningaware-prompt-2026/paper_type.json b/papers/redvisor-reasoningaware-prompt-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a defense method and validates it through experiments on multiple models, testing attack success rates and performance metrics across five injection types."
+}
+\ No newline at end of file
diff --git a/papers/refinestat-efficient-exploration-2025/paper_type.json b/papers/refinestat-efficient-exploration-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper proposes REFINESTAT framework and validates it through experiments on probabilistic program synthesis tasks, reporting quantitative improvements (40pp in run rates, matches GPT-4/o3 on ELPD-LOO scores) against baselines."
+}
+\ No newline at end of file
diff --git a/papers/refining-input-guardrails-2025/paper_type.json b/papers/refining-input-guardrails-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper's primary contribution is experimental validation showing quantitative improvements from fine-tuning methods (SFT, DPO, KTO) for LLM guardrails, with benchmarked comparisons against baselines, not the introduction of a novel benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/reinforcement-learning-mutation-2023/paper_type.json b/papers/reinforcement-learning-mutation-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs 30,080 controlled repair experiments on Defects4J benchmark, comparing RL strategies against baseline, and reports quantitative findings on bug patches and test-passing variants."
+}
+\ No newline at end of file
diff --git a/papers/relative-preference-optimization-2024/paper_type.json b/papers/relative-preference-optimization-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes RPO method and reports quantitative experimental results across multiple benchmarks (Anthropic-HH, OpenAI Summarization, AlpacaEval2.0) with ablations validating the approach."
+}
+\ No newline at end of file
diff --git a/papers/relative-scaling-laws-2025/paper_type.json b/papers/relative-scaling-laws-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper runs experiments on LLMs using established benchmarks (MMLU) and reports quantitative empirical findings about how scaling laws vary across subpopulations, with measured correlations and behavioral observations."
+}
+\ No newline at end of file
diff --git a/papers/relativebased-scaling-law-2025/paper_type.json b/papers/relativebased-scaling-law-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "theoretical",
+  "reason": "The primary contribution is establishing a formal power-law scaling relationship between token prediction rank and model size, with theoretical explanation of emergence phenomena, supported by experimental validation rather than being the main output."
+}
+\ No newline at end of file
diff --git a/papers/relaygen-intrageneration-model-2026/paper_type.json b/papers/relaygen-intrageneration-model-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes RelayGen method and validates it through experiments on AIME 2025, reporting quantitative results (2.2× speedup, <2% accuracy degradation); primary contribution is experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/rele-scalable-system-2026/paper_type.json b/papers/rele-scalable-system-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces ReLE, a structured benchmark system for evaluating Chinese LLMs with a Domain × Capability matrix; experiments on 304 LLMs serve to validate the benchmark rather than being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/reliability-explainability-language-2023/paper_type.json b/papers/reliability-explainability-language-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Evaluates 8 pre-trained language models on 5 existing program generation benchmarks, reporting quantitative findings about data duplication, output copying rates, and explainability patterns — the primary contribution is experimental evidence of reliability issues in the evaluation ecosystem."
+}
+\ No newline at end of file
diff --git a/papers/reliable-agent-engineering-2025/paper_type.json b/papers/reliable-agent-engineering-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Proposes organizational principles and frameworks for agent engineering drawn from organization science, but explicitly remains unvalidated and lacks experimental evidence."
+}
+\ No newline at end of file
diff --git a/papers/reliable-llmbased-edgecloudexpert-2025/paper_type.json b/papers/reliable-llmbased-edgecloudexpert-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "theoretical",
+  "reason": "The primary contribution is the MHT-ERM method with finite-sample theoretical guarantees on misalignment risk via formal statistical analysis; empirical validation on TeleQnA is secondary."
+}
+\ No newline at end of file
diff --git a/papers/relrepair-enhancing-automated-2025/paper_type.json b/papers/relrepair-enhancing-automated-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a retrieval-augmented method for program repair and validates it through experiments on existing benchmarks (Defects4J, ManySStuBs4J), reporting quantitative improvements in bug fix rates."
+}
+\ No newline at end of file
diff --git a/papers/remote-labor-index-2025/paper_type.json b/papers/remote-labor-index-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces the Remote Labor Index (RLI), a new benchmark dataset of 240 real-world freelance projects designed to measure AI automation of remote work; empirical evaluation of frontier models serves as validation."
+}
+\ No newline at end of file
diff --git a/papers/repaca-leveraging-reasoning-2025/paper_type.json b/papers/repaca-leveraging-reasoning-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Fine-tunes a model with GRPO and reports quantitative experimental results (83.1% accuracy, 84.8% F1) on the Defects4J benchmark, comparing against prior SOTA."
+}
+\ No newline at end of file
diff --git a/papers/repair-automated-program-2024/paper_type.json b/papers/repair-automated-program-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper runs experiments demonstrating process-based feedback via SFT+PPO with quantitative results (pass@1 metrics) and ablation studies on automated program repair, with the primary contribution being the experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/repair-ingredients-all-2025/paper_type.json b/papers/repair-ingredients-all-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper introduces ReinFix, an LLM-based program repair system, and validates it through quantitative experiments on Defects4J benchmarks with ablation studies demonstrating component contributions."
+}
+\ No newline at end of file
diff --git a/papers/repairagent-llm-bug-repair-2024/paper_type.json b/papers/repairagent-llm-bug-repair-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes RepairAgent system and validates it through experiments on Defects4J and GitBug-Java benchmarks, reporting quantitative results (164 bugs fixed, ablation studies, generalization tests) showing the primary contribution is experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/repairing-bugs-python-2022/paper_type.json b/papers/repairing-bugs-python-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes MMAPR system and validates it through experiments on 286 real student programs, reporting quantitative repair rates and performance metrics compared to baselines."
+}
+\ No newline at end of file
diff --git a/papers/repairllama-efficient-representations-2023/paper_type.json b/papers/repairllama-efficient-representations-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments on three existing Java program repair benchmarks with quantitative results comparing LoRA fine-tuning against baselines; primary contribution is experimental findings about effective representations and parameter-efficient adapters."
+}
+\ No newline at end of file
diff --git a/papers/repairr1-better-test-2025/paper_type.json b/papers/repairr1-better-test-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Presents a novel method (joint RL-based test generation and repair optimization) and validates it empirically across four established benchmarks with quantitative results (2.68–48.29% improvement), making experimental findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/repoagent-documentation-2024/paper_type.json b/papers/repoagent-documentation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Primary contribution is experimental validation through blind preference tests and quantitative comparisons showing the framework generates documentation superior to human-authored baselines; the framework itself is the tool, not the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/repogenreflex-enhancing-repositorylevel-2024/paper_type.json b/papers/repogenreflex-enhancing-repositorylevel-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes RepoGenReflex framework and validates it through experiments comparing quantitative results across benchmarks, with the primary contribution being the experimental findings of the proposed method's performance."
+}
+\ No newline at end of file
diff --git a/papers/repotransbench-realworld-multilingual-2024/paper_type.json b/papers/repotransbench-realworld-multilingual-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces RepoTransBench, a new multilingual code translation benchmark with 1,897 samples across 13 language pairs; while baselines are evaluated, the primary contribution is the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/requirements-to-code-practices-2025/paper_type.json b/papers/requirements-to-code-practices-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Qualitative empirical study of developer practices that reports findings from observing how practitioners use LLMs for software engineering, identifies interaction patterns, and proposes a process model based on research observations."
+}
+\ No newline at end of file
diff --git a/papers/rescue-ranking-llm-2023/paper_type.json b/papers/rescue-ranking-llm-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper introduces RESCUE, a training method, and validates it through experiments on existing benchmarks (e-SNLI, Multi-doc QA) with quantitative improvements and human evaluation metrics."
+}
+\ No newline at end of file
diff --git a/papers/researchcodebench-benchmarking-llms-2025/paper_type.json b/papers/researchcodebench-benchmarking-llms-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is ResearchCodeBench, a new benchmark with 212 coding challenges from recent ML papers; the LLM evaluation experiments validate the benchmark rather than being the primary finding."
+}
+\ No newline at end of file
diff --git a/papers/researchrubrics-benchmark-prompts-2025/paper_type.json b/papers/researchrubrics-benchmark-prompts-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is RESEARCHRUBRICS, a new evaluation benchmark with 101 prompts and 2,593 expert-written rubric criteria for assessing Deep Research agents; while experiments validate the benchmark, the benchmark itself is the main deliverable."
+}
+\ No newline at end of file
diff --git a/papers/reshaping-higher-education-2025/paper_type.json b/papers/reshaping-higher-education-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "The paper argues a viewpoint about AI integration in higher education and proposes conceptual frameworks (AI Integrated vs. AI Leading Education) and assessment strategies without experimental validation or formal analysis."
+}
+\ No newline at end of file
diff --git a/papers/resourceefficient-multimodal-intelligence-2025/paper_type.json b/papers/resourceefficient-multimodal-intelligence-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a routing framework and validates it through experiments on standard benchmarks (MMLU, VQA-v2), with quantitative performance metrics as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/responsible-artificial-intelligence-2025/paper_type.json b/papers/responsible-artificial-intelligence-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly surveys responsible AI practices across five dimensions in Earth observation using meta-analysis methodology to synthesize existing work."
+}
+\ No newline at end of file
diff --git a/papers/rethinking-benchmark-contamination-2023/paper_type.json b/papers/rethinking-benchmark-contamination-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Demonstrates experimentally that rephrasing bypasses existing decontamination methods and proposes an LLM-based decontaminator with quantitative F1 score results on real pre-training and fine-tuning datasets."
+}
+\ No newline at end of file
diff --git a/papers/rethinking-code-review-2025/paper_type.json b/papers/rethinking-code-review-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Qualitative field study at WirelessCar that empirically investigates developer preferences and challenges with LLM-assisted code review; primary contribution is research findings from real-world observation, not a benchmark, survey, position, or theory."
+}
+\ No newline at end of file
diff --git a/papers/rethinking-kernel-program-2025/paper_type.json b/papers/rethinking-kernel-program-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is RGym, a new benchmarking framework and verified dataset of 143 kernel bugs for evaluating LLM-based program repair; empirical results are secondary findings from using this framework."
+}
+\ No newline at end of file
diff --git a/papers/rethinking-knowledge-distillation-2025/paper_type.json b/papers/rethinking-knowledge-distillation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Comprehensive review of knowledge distillation in collaborative learning organized through memory and knowledge mechanisms, with meta-analysis methodology."
+}
+\ No newline at end of file
diff --git a/papers/rethinking-verification-llm-2025/paper_type.json b/papers/rethinking-verification-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces TCGBench, a new evaluation benchmark for code generation test case quality, with SAGA framework as the primary method validated on it."
+}
+\ No newline at end of file
diff --git a/papers/retrievalaugmented-code-generation-2025/paper_type.json b/papers/retrievalaugmented-code-generation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly titled as a survey that reviews 110 papers on RAG code generation, provides a taxonomy of approaches, and synthesizes findings across the field."
+}
+\ No newline at end of file
diff --git a/papers/retrievalaugmented-code-review-2025/paper_type.json b/papers/retrievalaugmented-code-review-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes RAG-Reviewer framework and reports quantitative experimental results (EM, BLEU scores) comparing it against baselines on existing benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/retrievalaugmented-generation-approach-2025/paper_type.json b/papers/retrievalaugmented-generation-approach-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents the NN-RAG system and reports quantitative experimental results (73.0% pass rate, 82% uniqueness, 72.46% dataset coverage) from applying it to 19 repositories, with primary contribution being empirical findings about extraction effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/retrievalaugmented-generation-electrocardiogramlanguage-2025/paper_type.json b/papers/retrievalaugmented-generation-electrocardiogramlanguage-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments across three public ECG datasets with multiple architectures, reports quantitative performance improvements, and conducts ablation studies—primary contribution is experimental findings about RAG's effect on ELM performance."
+}
+\ No newline at end of file
diff --git a/papers/retrievalaugmented-generation-multilingual-2024/paper_type.json b/papers/retrievalaugmented-generation-multilingual-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper reports quantitative experimental results comparing RAG performance and multilingual models across 13 languages, with the primary contribution being empirical findings rather than benchmark creation."
+}
+\ No newline at end of file
diff --git a/papers/reversum-multistaged-retrievalaugmented-2025/paper_type.json b/papers/reversum-multistaged-retrievalaugmented-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes and empirically validates REVERSUM through controlled experiments with quantitative human evaluation metrics (92% vs 75% integrability) and ablation analysis."
+}
+\ No newline at end of file
diff --git a/papers/review-advances-aipowered-2024/paper_type.json b/papers/review-advances-aipowered-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly a narrative review that surveys and synthesizes existing AI techniques applied to CI/CD monitoring, without original experiments or quantitative meta-analysis."
+}
+\ No newline at end of file
diff --git a/papers/review-aidriven-approaches-2025/paper_type.json b/papers/review-aidriven-approaches-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "A narrative review that synthesizes and categorizes existing AI-driven approaches to defect detection and classification across multiple technique types and domains."
+}
+\ No newline at end of file
diff --git a/papers/review-generative-ai-2024/paper_type.json b/papers/review-generative-ai-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper is explicitly a narrative review that surveys and synthesizes GenAI applications across cybersecurity domains (offensive and defensive), with informal demonstrations serving as illustrative case studies within the broader synthesis."
+}
+\ No newline at end of file
diff --git a/papers/review-generative-ai-2025/paper_type.json b/papers/review-generative-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Paper explicitly surveys approximately 50 publications with meta-analysis methodology to synthesize existing work on generative AI in DevOps, making field synthesis its primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/review-hallucination-understanding-2025/paper_type.json b/papers/review-hallucination-understanding-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This meta-analysis reviews hallucination understanding across model types and proposes a unified framework for synthesizing existing knowledge, making it a literature review rather than original empirical work."
+}
+\ No newline at end of file
diff --git a/papers/review-research-aiassisted-2025/paper_type.json b/papers/review-research-aiassisted-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This is a narrative review that synthesizes existing research on AI-assisted code generation and code review, explicitly labeled as a meta-analysis, rather than conducting original experiments or introducing a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/review-tools-zerocode-2025/paper_type.json b/papers/review-tools-zerocode-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly reviews and categorizes existing zero-code LLM platforms along defined dimensions, with meta-analysis methodology—a synthesis contribution rather than experimental findings or novel benchmark."
+}
+\ No newline at end of file
diff --git a/papers/revisiting-evolutionary-program-2024/paper_type.json b/papers/revisiting-evolutionary-program-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes ARJA-CLM combining Code Language Models with evolutionary algorithms and validates it through quantitative experiments on Defects4J and APR-2024 benchmarks, reporting specific repair success rates and comparisons against baselines."
+}
+\ No newline at end of file
diff --git a/papers/revisiting-unnaturalness-automated-2024/paper_type.json b/papers/revisiting-unnaturalness-automated-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts experiments measuring LLM entropy metrics on program repair tasks and reports quantitative improvements over baseline methods on existing benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/revolution-hype-seeking-2025/paper_type.json b/papers/revolution-hype-seeking-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Panel paper surveying the state of LLMs and LCMs in hardware design with meta-analysis methodology, synthesizing opportunities and challenges in the field."
+}
+\ No newline at end of file
diff --git a/papers/rexbench-can-coding-2025/paper_type.json b/papers/rexbench-can-coding-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "REXBENCH is a novel benchmark for evaluating coding agents on AI research extension tasks, with baseline results from 9 LLM configurations establishing the evaluation framework."
+}
+\ No newline at end of file
diff --git a/papers/rgfl-reasoning-guided-2026/paper_type.json b/papers/rgfl-reasoning-guided-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces RGFL method and reports quantitative improvements on SWE-bench Verified (Hit@1: 71.4%→85%, repair rate: 51.6%→58.2%) with ablation studies validating the approach."
+}
+\ No newline at end of file
diff --git a/papers/right-prompts-job-2023/paper_type.json b/papers/right-prompts-job-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments with CodeLLaMA comparing different prompts for code repair, reporting quantitative results (72.97% exact-match rate, ablation studies on prompt components, cross-dataset transfer analysis) as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/rise-potential-large-2023/paper_type.json b/papers/rise-potential-large-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly labeled as a survey with meta-analysis methodology; provides comprehensive framework synthesizing existing LLM agent research rather than reporting novel experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/rise-potential-opportunities-2025/paper_type.json b/papers/rise-potential-opportunities-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly reviews and catalogs ~60 existing LLM agent systems across bioinformatics/biomedicine with meta-analysis methodology, synthesizing the field rather than running original experiments or introducing a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/rl-hammer-llms-2025/paper_type.json b/papers/rl-hammer-llms-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents RL-Hammer, a reinforcement learning method for prompt injection attacks, with quantitative experimental results (98% ASR against GPT-4o, 72% against GPT-5) as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/rltf-reinforcement-learning-2023/paper_type.json b/papers/rltf-reinforcement-learning-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes an RL framework and validates it through experiments on APPS and MBPP benchmarks with ablation studies showing quantitative improvements, making experimental findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/rlthf-targeted-human-2025/paper_type.json b/papers/rlthf-targeted-human-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes RLTHF, a human-AI hybrid framework, and validates it through quantitative experiments on standard benchmarks (HH-RLHF, TL;DR) with ablation studies and comparisons to baselines, demonstrating that the method achieves comparable accuracy with significantly reduced human annotation effort."
+}
+\ No newline at end of file
diff --git a/papers/rmb-comprehensively-benchmarking-2024/paper_type.json b/papers/rmb-comprehensively-benchmarking-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The paper introduces RMB, a new comprehensive reward model benchmark with 49 real-world scenarios and evaluation paradigms; the primary contribution is the benchmark itself, not just experimental findings on existing benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/robon-routed-online-2025/paper_type.json b/papers/robon-routed-online-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes RoBoN method and validates it experimentally across five reasoning benchmarks with quantitative results (accuracy gains up to 3.4pp) compared to baselines; primary contribution is experimental findings, not the benchmarks themselves."
+}
+\ No newline at end of file
diff --git a/papers/robots-here-navigating-2023/paper_type.json b/papers/robots-here-navigating-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper is a large working group report that synthesizes 71 existing papers on LLMs in computing education, with meta-analysis as a core methodology; while it includes original surveys and interviews, the primary contribution is synthesizing the field's existing literature."
+}
+\ No newline at end of file
diff --git a/papers/robust-llm-alignment-2025/paper_type.json b/papers/robust-llm-alignment-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces new robust DPO methods (WDPO, KLDPO) with theoretical convergence analysis, then validates superior performance empirically across multiple models and benchmarks (Emotion, ArmoRM, OpenLLM)."
+}
+\ No newline at end of file
diff --git a/papers/robust-retrievalbased-summarization-2024/paper_type.json b/papers/robust-retrievalbased-summarization-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The primary contribution is experimental findings about a fine-tuned RAG-based summarization system (SummRAG) and insights about LLM performance on document relevance assessment, with the evaluation framework (LogicSumm) serving as the validation tool."
+}
+\ No newline at end of file
diff --git a/papers/robustness-referencing-defending-2025/paper_type.json b/papers/robustness-referencing-defending-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a prompt injection defense method and validates it through comprehensive experiments across 8 models, 5 attack types, and 3 datasets with quantitative results (ASR metrics)."
+}
+\ No newline at end of file
diff --git a/papers/role-artificial-intelligence-2025/paper_type.json b/papers/role-artificial-intelligence-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper's primary contribution is experimental findings from computational experiments comparing two algorithms with quantitative results; despite methodological limitations (synthetic data only, weak baseline), the core contribution is empirical evaluation."
+}
+\ No newline at end of file
diff --git a/papers/role-genai-automated-2023/paper_type.json b/papers/role-genai-automated-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Narrative review that surveys and categorizes AI techniques for automated code generation without conducting original experiments."
+}
+\ No newline at end of file
diff --git a/papers/role-generative-ai-2025/paper_type.json b/papers/role-generative-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Systematic literature review (meta-analysis) that identifies and synthesizes 109 existing GenAI practices across 11 cybersecurity risk categories, with aggregate reporting of findings."
+}
+\ No newline at end of file
diff --git a/papers/rooflinebench-benchmarking-framework-2026/paper_type.json b/papers/rooflinebench-benchmarking-framework-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is RooflineBench, a new benchmarking framework for characterizing LLM inference on edge devices via Roofline analysis; the reported findings about operational intensity patterns and attention mechanisms are applications of this framework rather than the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/routing-cascades-user-2026/paper_type.json b/papers/routing-cascades-user-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Analyzes optimal routing policies between LLM models through Stackelberg game theory, proving properties of equilibrium solutions rather than reporting experimental results or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/rtbas-defending-llm-2025/paper_type.json b/papers/rtbas-defending-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes RTBAS method and validates it experimentally with quantitative results (100% integrity, FPR/FNR metrics) on established and synthesized benchmarks; primary contribution is experimental findings, not the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/rtl-graphenhanced-llm-2025/paper_type.json b/papers/rtl-graphenhanced-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes RTL++ method and validates it experimentally with quantitative results on VerilogEval HumanEval benchmark, including ablation studies and baseline comparisons."
+}
+\ No newline at end of file
diff --git a/papers/rtlsquad-multiagent-based-2025/paper_type.json b/papers/rtlsquad-multiagent-based-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents a multi-agent RTL design system and reports quantitative experimental results (Pass@1 improvements of 10.4-11.2pp) on the RTLLM V2.0 benchmark against baselines."
+}
+\ No newline at end of file
diff --git a/papers/rubric-all-you-2025-2/paper_type.json b/papers/rubric-all-you-2025-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments comparing rubric approaches (CRE vs PRE, question-specific vs question-agnostic) and reports quantitative results on code evaluation accuracy, with the primary contribution being empirical findings on rubric effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/rubric-all-you-2025/paper_type.json b/papers/rubric-all-you-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on code evaluation methods and reports quantitative results (ICC3 scores, leniency metrics) comparing question-specific vs question-agnostic rubrics across benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/runbugrun-executable-dataset-2023/paper_type.json b/papers/runbugrun-executable-dataset-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces RunBugRun, a large-scale executable dataset of ~450,000 buggy/fixed program pairs across 8 languages with structured labels and test cases; baseline experiments validate the dataset but are secondary to the benchmark contribution."
+}
+\ No newline at end of file
diff --git a/papers/rustassistant-llms-fix-2025/paper_type.json b/papers/rustassistant-llms-fix-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on micro-benchmarks and real-world data with quantitative results (92.59%, 73.63%), includes ablation studies, and manual validation; primary contribution is empirical findings about RustAssistant effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/saber-efficient-sampling-2025/paper_type.json b/papers/saber-efficient-sampling-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a sampling algorithm (Saber) and validates it experimentally on HumanEval with quantitative metrics (Pass@1, inference time) and ablation studies."
+}
+\ No newline at end of file
diff --git a/papers/saber-small-actions-2025/paper_type.json b/papers/saber-small-actions-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments across multiple models and benchmarks, reporting quantitative findings about mutating action failures (14-18% of steps, 55-96% odds reduction) and validating the SABER safeguard with empirical improvements (+19.7pp)."
+}
+\ No newline at end of file
diff --git a/papers/safegenbench-benchmark-framework-2025/paper_type.json b/papers/safegenbench-benchmark-framework-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is SafeGenBench, a new benchmark framework with 558 security-related code generation tasks across 44 CWE types and 13 languages; the LLM evaluations are secondary validation of the benchmark."
+}
+\ No newline at end of file
diff --git a/papers/safeguarding-visionlanguage-models-2024/paper_type.json b/papers/safeguarding-visionlanguage-models-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a defense method (SmoothVLM) and validates it with quantitative experimental results on real models (0–5% attack success rates, 67.3–95% benign recovery), with theoretical analysis serving as supporting justification for the empirical findings."
+}
+\ No newline at end of file
diff --git a/papers/safepro-evaluating-safety-2026/paper_type.json b/papers/safepro-evaluating-safety-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SafePro, a new evaluation benchmark of 275 professional-level safety tasks for AI agents, with baseline evaluations of state-of-the-art LLMs."
+}
+\ No newline at end of file
diff --git a/papers/safetyefficacy-trade-off-2026/paper_type.json b/papers/safetyefficacy-trade-off-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper proves mathematical properties of data-poisoning attacks (rank-one spike in Hessian, spectral undetectability regimes) and establishes a fundamental safety-efficacy trade-off through formal analysis, with experiments validating the theory rather than driving the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/sage-steerable-agentic-2026/paper_type.json b/papers/sage-steerable-agentic-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes the SAGE data generation method and validates it through experiments with quantitative results (87% correctness, 50% pass rate, 29% improvement), where the primary contribution is the experimental findings demonstrating the method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/salad-systematic-assessment-2025/paper_type.json b/papers/salad-systematic-assessment-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper experimentally evaluates six existing machine unlearning algorithms on LLaMA 3.1-8B for Verilog generation, reporting quantitative findings about algorithm trade-offs across threat scenarios rather than introducing a new benchmark or framework."
+}
+\ No newline at end of file
diff --git a/papers/sampleefficient-human-evaluation-2024/paper_type.json b/papers/sampleefficient-human-evaluation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces MAD Competition, a novel sample-efficient evaluation framework for human comparison of LLMs using adaptive instruction selection."
+}
+\ No newline at end of file
diff --git a/papers/saro-enhancing-llm-2025/paper_type.json b/papers/saro-enhancing-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SaRO framework and reports quantitative experimental results comparing it against baselines (SafetySFT, DPO) on safety benchmarks, with findings about reasoning-based alignment effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/scaffolded-model-capability-2023/paper_type.json b/papers/scaffolded-model-capability-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes cost-reduction strategies and validates them through experiments on multiple benchmarks, reporting quantitative cost-performance tradeoffs."
+}
+\ No newline at end of file
diff --git a/papers/scalable-oversight-partitioned-2025/paper_type.json b/papers/scalable-oversight-partitioned-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The primary contributions are deriving an unbiased estimator and providing finite-sample deviation guarantees; empirical benchmarks serve as validation rather than the core contribution."
+}
+\ No newline at end of file
diff --git a/papers/scales-justitia-comprehensive-2025/paper_type.json b/papers/scales-justitia-comprehensive-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper explicitly conducts a meta-analysis and comprehensive review of LLM safety evaluation practices, organizing existing evaluation dimensions, datasets, benchmarks, and frameworks rather than introducing novel experimental contributions or a single new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/scaling-laws-2020/paper_type.json b/papers/scaling-laws-2020/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The primary contribution is empirical discovery of power-law scaling relationships across seven orders of magnitude of experiments; the theoretical framework derives from and supports these empirical findings rather than providing mathematical proofs."
+}
+\ No newline at end of file
diff --git a/papers/scaling-laws-code-2025/paper_type.json b/papers/scaling-laws-code-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments measuring scaling behaviors across programming languages and reports quantitative findings (scaling exponents, loss orders, multilingual synergies), with primary contribution being experimental results rather than benchmark creation or theory."
+}
+\ No newline at end of file
diff --git a/papers/scaling-laws-data-2025/paper_type.json b/papers/scaling-laws-data-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments with a 7B model across domains and reports quantitative findings about how data source rankings change across compute scales, making this primarily an empirical contribution about experimental findings rather than a new benchmark or theoretical result."
+}
+\ No newline at end of file
diff --git a/papers/scaling-laws-economic-productivity-2024/paper_type.json b/papers/scaling-laws-economic-productivity-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs a pre-registered RCT with 300 translators testing 13 LLMs, reporting quantitative results on productivity metrics (completion time, quality, earnings) across compute scales; primary contribution is experimental findings on scaling effects."
+}
+\ No newline at end of file
diff --git a/papers/scaling-laws-multiagent-2022/paper_type.json b/papers/scaling-laws-multiagent-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on benchmark games (Connect Four, Pentago) measuring quantitative scaling relationships between model parameters, compute, and performance, with primary contribution being the empirical power-law exponents and sample efficiency findings."
+}
+\ No newline at end of file
diff --git a/papers/scaling-testtime-compute-2025/paper_type.json b/papers/scaling-testtime-compute-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper proves a theoretical separation between verifier-based and verifier-free methods (√H factor scaling) with empirical validation on MATH, making formal analysis the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/scenarios-transition-agi-2024/paper_type.json b/papers/scenarios-transition-agi-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Develops a formal compute-centric economic framework and derives mathematical properties of wage dynamics under different task-distribution scenarios."
+}
+\ No newline at end of file
diff --git a/papers/scheming-llm-to-llm-interactions-2025/paper_type.json b/papers/scheming-llm-to-llm-interactions-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments across four frontier LLMs in strategic interaction scenarios (Peer Evaluation, Cheap Talk), reporting quantitative results on deception rates and success rates, with comparative analysis of model behavior."
+}
+\ No newline at end of file
diff --git a/papers/science-scaling-agent-2025/paper_type.json b/papers/science-scaling-agent-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments measuring multi-agent system performance across task types, reports quantitative results (+80.9% to -70% performance ranges, R²=0.524 predictive model, 87% accuracy), and identifies empirical patterns (tool-coordination trade-off, capability saturation, error amplification)."
+}
+\ No newline at end of file
diff --git a/papers/scissorhands-exploiting-persistence-2023/paper_type.json b/papers/scissorhands-exploiting-persistence-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a KV cache compression method (Scissorhands) based on an observed attention pattern and validates it experimentally across multiple OPT models with quantitative results on memory reduction, perplexity, and downstream tasks."
+}
+\ No newline at end of file
diff --git a/papers/scmas-constructing-costefficient-2026/paper_type.json b/papers/scmas-constructing-costefficient-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SC-MAS framework and validates it through experiments on five benchmarks with quantitative results, ablations, and generalization tests; primary contribution is experimental findings demonstrating accuracy gains and cost reductions."
+}
+\ No newline at end of file
diff --git a/papers/sdag-subjectbased-directed-2025/paper_type.json b/papers/sdag-subjectbased-directed-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes S-DAG framework and validates it experimentally on existing benchmarks (MMLU-Pro, GPQA, MedMCQA), reporting quantitative accuracy improvements and ablation studies; primary contribution is experimental findings, not the benchmarks themselves."
+}
+\ No newline at end of file
diff --git a/papers/se-agentic-benchmarks-survey-2025/paper_type.json b/papers/se-agentic-benchmarks-survey-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Comprehensively reviews 150+ papers on LLM-powered software engineering with taxonomy and meta-analysis of existing solutions and benchmarks, identifying research gaps through synthesis rather than new experiments or frameworks."
+}
+\ No newline at end of file
diff --git a/papers/seakr-selfaware-knowledge-2024/paper_type.json b/papers/seakr-selfaware-knowledge-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper introduces a method (SEAKR) and validates it through quantitative experiments on multiple benchmarks (2WikiMultiHop, HotpotQA, IIRC) with ablation studies, where the primary contribution is the experimental findings demonstrating performance improvements over baselines."
+}
+\ No newline at end of file
diff --git a/papers/searchbased-automated-program-2024/paper_type.json b/papers/searchbased-automated-program-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents FLOWREPAIR, a new tool for automated program repair of CPS controllers, and validates it experimentally on 9 case study systems with quantitative results (plausible and valid patch counts, baseline comparisons, statistical testing)."
+}
+\ No newline at end of file
diff --git a/papers/secalign-defending-against-2024/paper_type.json b/papers/secalign-defending-against-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SecAlign defense method and validates it with quantitative experiments across five models and multiple benchmarks, with primary contribution being empirical findings on attack success rates and generalization."
+}
+\ No newline at end of file
diff --git a/papers/secbench-automated-benchmarking-2025/paper_type.json b/papers/secbench-automated-benchmarking-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "SEC-bench is the primary contribution—a new benchmark for evaluating LLM agents on real-world security tasks with 200 CVE instances, with baseline results provided for state-of-the-art models."
+}
+\ No newline at end of file
diff --git a/papers/seccodeprm-process-reward-2026/paper_type.json b/papers/seccodeprm-process-reward-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces SecCodePRM and validates it through extensive quantitative experiments on multiple benchmarks (SVEN, PrimeVul, PreciseBugs), reporting concrete performance improvements over baselines."
+}
+\ No newline at end of file
diff --git a/papers/secinfer-preventing-prompt-2025/paper_type.json b/papers/secinfer-preventing-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SecInfer defense method and validates it through systematic experiments across 4 LLMs, 6 tasks, and 13 attack variants, reporting quantitative results on attack success rates and task utility preservation."
+}
+\ No newline at end of file
diff --git a/papers/secodeplt-unified-platform-2024/paper_type.json b/papers/secodeplt-unified-platform-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is SeCodePLT, a new benchmark with 5.9k samples across 44 CWE categories; the model evaluations are baseline validations of the benchmark's utility."
+}
+\ No newline at end of file
diff --git a/papers/secure-coding-ai-2025/paper_type.json b/papers/secure-coding-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper conducts controlled experiments benchmarking multiple LLMs on vulnerability detection and repair across 2,315 code snippets with quantitative performance metrics and trend analysis as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/secureagentbench-benchmarking-secure-2025/paper_type.json b/papers/secureagentbench-benchmarking-secure-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is SecureAgentBench, a new evaluation framework for secure code generation with 105 vulnerability-introduction tasks; baseline evaluations on agents/LLMs are secondary."
+}
+\ No newline at end of file
diff --git a/papers/securecai-injectionresilient-llm-2026/paper_type.json b/papers/securecai-injectionresilient-llm-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SecureCAI method and validates it experimentally with quantitative results (94.7% attack reduction, 95.1% accuracy) and ablation studies across six attack categories."
+}
+\ No newline at end of file
diff --git a/papers/securing-ai-agents-2025/paper_type.json b/papers/securing-ai-agents-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a defense framework and validates it through quantitative experiments across 7 LLMs and 847 test cases, with primary contribution being the empirical findings on attack success rates and task performance."
+}
+\ No newline at end of file
diff --git a/papers/securing-large-language-2025/paper_type.json b/papers/securing-large-language-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments comparing JATMO-fine-tuned models against baselines, reports quantitative metrics (75-90% attack reduction, ROUGE-L correlation), and empirically identifies vulnerability gaps and bypass vectors."
+}
+\ No newline at end of file
diff --git a/papers/security-assertions-by-2023/paper_type.json b/papers/security-assertions-by-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Systematically evaluates Codex performance on security assertion generation across 2,268 prompt configurations and 10 benchmarks, reporting quantitative accuracy results and identifying prompt engineering as the dominant performance factor."
+}
+\ No newline at end of file
diff --git a/papers/security-degradation-iterative-2025/paper_type.json b/papers/security-degradation-iterative-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports quantitative experimental results showing iterative code generation produces 37.6% more vulnerabilities with specific profiles varying by prompting strategy."
+}
+\ No newline at end of file
diff --git a/papers/seeing-fixing-crossmodal-2025-2/paper_type.json b/papers/seeing-fixing-crossmodal-2025-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes GUIRepair methodology and validates it through experimental comparison on SWE-bench M, reporting quantitative improvements over baselines with ablation studies."
+}
+\ No newline at end of file
diff --git a/papers/selfconsistency-improves-chain-2022/paper_type.json b/papers/selfconsistency-improves-chain-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes and experimentally validates a decoding strategy (self-consistency) across multiple benchmarks (GSM8K, SVAMP, AQuA), reporting quantitative improvements as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/selforganized-agents-llm-2024/paper_type.json b/papers/selforganized-agents-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a multi-agent framework and experimentally validates it on HumanEval with quantitative improvements over baselines."
+}
+\ No newline at end of file
diff --git a/papers/semagent-semantics-aware-2025/paper_type.json b/papers/semagent-semantics-aware-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes and empirically evaluates a program repair agent on SWE-Bench Lite with ablation studies demonstrating incremental component contributions."
+}
+\ No newline at end of file
diff --git a/papers/semantic-compression-memory-2026/paper_type.json b/papers/semantic-compression-memory-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments comparing memory vs. no-memory baselines on test generation, reports quantitative metrics (compression ratio, code coverage, compilation success rates) across library modules, with primary contribution being experimental findings rather than a new benchmark, theoretical analysis, or survey."
+}
+\ No newline at end of file
diff --git a/papers/semantics-as-shield-2025/paper_type.json b/papers/semantics-as-shield-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes Label Disguise Defense (LDD) mechanism and validates it through controlled experiments across 9 models, reporting quantitative attack/defense performance metrics (accuracy drops), with the primary contribution being the experimental findings on defense effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/semisupervised-cascaded-clustering-2022/paper_type.json b/papers/semisupervised-cascaded-clustering-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SSCC method and validates it through controlled experiments on standard datasets with injected noise, reporting accuracy comparisons against baselines."
+}
+\ No newline at end of file
diff --git a/papers/sensorium-arc-ai-2025/paper_type.json b/papers/sensorium-arc-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "System description of an art installation presented without formal evaluation; proposes a vision for AI-enhanced eco-art interaction rather than empirically validating claims or introducing a benchmark."
+}
+\ No newline at end of file
diff --git a/papers/sentraguard-multilingual-humanai-2025/paper_type.json b/papers/sentraguard-multilingual-humanai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes and evaluates a detection system experimentally, reporting quantitative results (99.98% accuracy, ASR, latency) on an existing benchmark dataset."
+}
+\ No newline at end of file
diff --git a/papers/separator-injection-attack-2025/paper_type.json b/papers/separator-injection-attack-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Paper runs experiments measuring separator injection biases in LLMs with quantitative metrics (PBI, ASR), evaluates the effectiveness of the SIA attack, and tests defenses—primary contribution is experimental findings about a vulnerability."
+}
+\ No newline at end of file
diff --git a/papers/sequential-enumeration-large-2025/paper_type.json b/papers/sequential-enumeration-large-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Tests multiple LLM models (GPT-5, Gemini 2.5, Llama-70B) on counting/enumeration tasks, reports quantitative accuracy results, and analyzes internal representations through PCA—primary contribution is experimental findings about LLM enumeration behavior."
+}
+\ No newline at end of file
diff --git a/papers/shadowcode-automatic-external-2024/paper_type.json b/papers/shadowcode-automatic-external-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports quantitative experimental results (97.9% ASR, 90%+ success rates) demonstrating an attack's effectiveness on multiple Code LLM systems, with ablations and transferability studies."
+}
+\ No newline at end of file
diff --git a/papers/sherlock-reliable-efficient-2025/paper_type.json b/papers/sherlock-reliable-efficient-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a framework (Sherlock) and validates it experimentally across multiple benchmarks (instruction-following, coding, math, tool-use) with quantitative results (48.7% latency reduction, Pareto-optimal tradeoffs)."
+}
+\ No newline at end of file
diff --git a/papers/shieldlearner-new-paradigm-2025/paper_type.json b/papers/shieldlearner-new-paradigm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes ShieldLearner defense method and validates it through experiments on jailbreak benchmarks (HarmBench, AdvBench) with quantitative ASR metrics on five attack types as primary findings."
+}
+\ No newline at end of file
diff --git a/papers/shifting-from-ranking-2025/paper_type.json b/papers/shifting-from-ranking-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SETR method and validates it through controlled experiments on four multi-hop QA benchmarks, reporting quantitative performance improvements over reranking baselines."
+}
+\ No newline at end of file
diff --git a/papers/shroomindelab-at-semeval2024-2024/paper_type.json b/papers/shroomindelab-at-semeval2024-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Participates in SemEval-2024 Task 6, reports experimental results (rankings and accuracy scores), and contributes empirical findings through ablation analysis of prompt components."
+}
+\ No newline at end of file
diff --git a/papers/siadafix-issue-description-2025/paper_type.json b/papers/siadafix-issue-description-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes an adaptive program repair framework (SIADAFIX) and validates it empirically on SWE-bench Lite with quantitative results (60.7% Pass@1), including ablation studies showing incremental contributions of each component."
+}
+\ No newline at end of file
diff --git a/papers/sidiffagent-selfimproving-diffusion-2026/paper_type.json b/papers/sidiffagent-selfimproving-diffusion-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces SIDiffAgent framework and reports quantitative experimental results on established benchmarks (GenAIBench, DrawBench) with specific performance improvements, making the primary contribution experimental findings rather than a new benchmark or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/signedprompt-new-approach-2024/paper_type.json b/papers/signedprompt-new-approach-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes Signed-Prompt defense mechanism and validates it through experiments with quantitative results (100% correctness, 0% attack success), though evaluation scope is limited to a single command type."
+}
+\ No newline at end of file
diff --git a/papers/significant-productivity-gains-2024/paper_type.json b/papers/significant-productivity-gains-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "RCT (N=24) with quantitative measurements of productivity and code quality across AI assistant conditions; primary contribution is experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/simple-llm-baselines-2026/paper_type.json b/papers/simple-llm-baselines-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs comparative experiments evaluating LLM baselines against SAE-based methods for model diffing, reporting quantitative results on accuracy, frequency, and interestingness metrics; the primary contribution is the experimental finding that simple LLM baselines are competitive."
+}
+\ No newline at end of file
diff --git a/papers/simple-prompt-injection-2025/paper_type.json b/papers/simple-prompt-injection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on 6 LLMs using AgentDojo's banking suite, quantifying attack success rates (20%) and utility drops (15–50%), with evaluation of defense mechanisms."
+}
+\ No newline at end of file
diff --git a/papers/simulationguided-llmbased-code-2025/paper_type.json b/papers/simulationguided-llmbased-code-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Develops and evaluates a simulation-guided LLM code generation pipeline, reporting quantitative results (9.2% improvement, model success rates) and qualitative findings (expert interviews) on autonomous driving code generation."
+}
+\ No newline at end of file
diff --git a/papers/single-character-perturbations-2024/paper_type.json b/papers/single-character-perturbations-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments on multiple models, reports quantitative attack success rates (100% on specific models), and provides empirical evidence that single character perturbations reliably bypass safety alignment."
+}
+\ No newline at end of file
diff --git a/papers/single-direction-truth-2025/paper_type.json b/papers/single-direction-truth-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts experiments on benchmarks (news summarization, ContraTales) with quantitative results (F1 scores, baseline comparisons), where the primary contribution is empirical findings about detecting and steering hallucinations via linear probes."
+}
+\ No newline at end of file
diff --git a/papers/singleagent-scaling-fails-2025/paper_type.json b/papers/singleagent-scaling-fails-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments across 41 LLMs with quantitative measurements of single-agent and multi-agent performance, reporting empirical findings with regression analysis showing diminishing returns."
+}
+\ No newline at end of file
diff --git a/papers/singlehead-attention-high-2025/paper_type.json b/papers/singlehead-attention-high-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Provides exact mathematical characterization of attention mechanisms, derives closed-form formulas for training/test error and spectral properties, and analyzes formal properties of gradient descent without experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/singlemulti-evolution-loop-2026/paper_type.json b/papers/singlemulti-evolution-loop-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a novel method (single-multi evolution loop) and validates it with quantitative experiments across 15 tasks and multiple model configurations, with reported improvements of 8.0% for individual models and 14.9% for collaboration systems."
+}
+\ No newline at end of file
diff --git a/papers/six-sigma-agent-2026/paper_type.json b/papers/six-sigma-agent-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes the Six Sigma Agent architecture with theoretical reliability analysis based on assumed error rates rather than empirical validation of the approach."
+}
+\ No newline at end of file
diff --git a/papers/skate-scalable-tournament-2025/paper_type.json b/papers/skate-scalable-tournament-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "SKATE introduces a novel peer-challenge evaluation framework for ranking LLMs using tournament mechanics and TrueSkill, with empirical validation demonstrating its effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/skillorchestra-learning-route-2026/paper_type.json b/papers/skillorchestra-learning-route-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SkillOrchestra method and validates it experimentally across 10 benchmarks with quantitative results showing outperformance over baselines."
+}
+\ No newline at end of file
diff --git a/papers/sleeper-agents-2024/paper_type.json b/papers/sleeper-agents-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Paper trains LLMs with deliberate backdoor behaviors and runs controlled experiments measuring persistence through safety training techniques, reporting quantitative results on persistence rates across model scales."
+}
+\ No newline at end of file
diff --git a/papers/slidesgenbench-evaluating-slides-2026/paper_type.json b/papers/slidesgenbench-evaluating-slides-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SlidesGen-Bench, a new three-dimensional benchmark for evaluating slide generation systems, with validation against human preferences and analysis of system performance."
+}
+\ No newline at end of file
diff --git a/papers/sloconditioned-action-routing-2025/paper_type.json b/papers/sloconditioned-action-routing-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments on SQuAD 2.0 with quantitative metrics comparing learned routing policies against baselines, reporting accuracy/reward improvements and failure modes."
+}
+\ No newline at end of file
diff --git a/papers/smoothquant-accurate-efficient-2022/paper_type.json b/papers/smoothquant-accurate-efficient-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SmoothQuant method and validates with quantitative experiments (1.56× speedup, 2× memory reduction) across diverse LLM sizes, with integrated implementation in PyTorch and FasterTransformer."
+}
+\ No newline at end of file
diff --git a/papers/socialveil-probing-social-2026/paper_type.json b/papers/socialveil-probing-social-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs controlled experiments measuring how communication barriers affect LLM agent social intelligence with quantitative metrics (45% reduction in mutual understanding, -58% from semantic vagueness, etc.) and tests adaptation strategies."
+}
+\ No newline at end of file
diff --git a/papers/societal-alignment-frameworks-2025/paper_type.json b/papers/societal-alignment-frameworks-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes a conceptual framework for LLM alignment using societal mechanisms (social norms, economic fairness, legal oversight) and makes prescriptive arguments without experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/sok-comprehensive-causality-2025/paper_type.json b/papers/sok-comprehensive-causality-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents experimental findings with quantitative results (95% F1 detection accuracy, neuron localization measurements) from applying a causality analysis framework to LLM security tasks; the primary contribution is the empirical discovery of which components cause safety behavior."
+}
+\ No newline at end of file
diff --git a/papers/sok-trustauthorization-mismatch-2025/paper_type.json b/papers/sok-trustauthorization-mismatch-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "SoK (Systematization of Knowledge) format that surveys 87 papers to systematize existing defenses through the Trust-Authorization Mismatch framework."
+}
+\ No newline at end of file
diff --git a/papers/soleval-benchmarking-large-2025/paper_type.json b/papers/soleval-benchmarking-large-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SolEval, the first repository-level benchmark for Solidity code generation with 1,125 samples; the primary contribution is the benchmark itself, though baseline evaluations of 16 LLMs are included."
+}
+\ No newline at end of file
diff --git a/papers/source-code-comprehension-2023/paper_type.json b/papers/source-code-comprehension-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Proposes formal definitions and a conceptual model for code comprehension with theoretical distinctions between outcome measurement and process observation, rather than empirical experiments or survey synthesis."
+}
+\ No newline at end of file
diff --git a/papers/spec2rtlagent-automated-hardware-2025/paper_type.json b/papers/spec2rtlagent-automated-hardware-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents a multi-agent LLM system and evaluates it experimentally on 3 specifications with quantitative results (75% fewer interventions) and ablation studies, making the primary contribution the experimental findings of the system's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/specification-vibing-automated-2026/paper_type.json b/papers/specification-vibing-automated-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents a novel APR approach (VibeRepair) and validates it through experiments on established benchmarks (Defects4J), reporting quantitative improvements over baselines across multiple models."
+}
+\ No newline at end of file
diff --git a/papers/specificationguided-vulnerability-detection-2025/paper_type.json b/papers/specificationguided-vulnerability-detection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments on vulnerability detection with quantitative results (45.0% F1, 37.7% recall) and comparative improvements over baselines on existing benchmarks, with the primary contribution being experimental findings rather than a new benchmark or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/specifications-missing-link-2024/paper_type.json b/papers/specifications-missing-link-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "The paper argues that specifications are essential for engineering LLM systems and proposes a conceptual framework distinguishing statement and solution specifications, without experimental validation or formal analysis."
+}
+\ No newline at end of file
diff --git a/papers/speed-at-cost-2025/paper_type.json b/papers/speed-at-cost-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs observational analysis on open-source projects with quantitative results (Panel GMM models, DiD estimators, robustness checks) showing Cursor adoption's effects on velocity and code quality metrics."
+}
+\ No newline at end of file
diff --git a/papers/spin-selfsupervised-prompt-2024/paper_type.json b/papers/spin-selfsupervised-prompt-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "SPIN proposes a defense method and evaluates it experimentally on AdvBench with quantitative results (ASR rates) across multiple attack types and models, with the primary contribution being the experimental findings of the method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/split-personality-training-2026/paper_type.json b/papers/split-personality-training-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces Split Personality Training method and experimentally validates it with quantitative results (96% detection accuracy) on an existing model organism."
+}
+\ No newline at end of file
diff --git a/papers/spread-preference-annotation-2024/paper_type.json b/papers/spread-preference-annotation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents a novel LLM alignment method (SPA) and validates it through controlled experiments with quantitative benchmark results on AlpacaEval 2.0, comparing against multiple baselines."
+}
+\ No newline at end of file
diff --git a/papers/starcoder-2023/paper_type.json b/papers/starcoder-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces StarCoder model and reports quantitative experimental results comparing performance across HumanEval, MBPP, and DS-1000 benchmarks, with additional empirical evaluation of PII redaction pipeline and social bias metrics."
+}
+\ No newline at end of file
diff --git a/papers/starcoder2-2024/paper_type.json b/papers/starcoder2-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces The Stack v2, a large-scale code dataset (4× larger, spanning 619 languages), with StarCoder 2 models serving as baseline implementations to demonstrate the dataset's value."
+}
+\ No newline at end of file
diff --git a/papers/stateflow-enhancing-llm-2024/paper_type.json b/papers/stateflow-enhancing-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes StateFlow method and validates it through experiments on existing benchmarks (InterCode SQL, ALFWorld), reporting quantitative improvements (13-28% success rate gains, 3-5x cost reduction) with ablation studies."
+}
+\ No newline at end of file
diff --git a/papers/static-program-analysis-2024/paper_type.json b/papers/static-program-analysis-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a method for improving LLM-based test generation using static analysis, validates it experimentally on a commercial Java project, and reports quantitative results (102/103 vs 37/103 test generation rates) as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/statically-contextualizing-large-2024/paper_type.json b/papers/statically-contextualizing-large-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs controlled experiments comparing different static contextualization strategies for LLM code completion, reporting quantitative improvements across multiple models and languages, with the primary contribution being the experimental findings about which context types matter most."
+}
+\ No newline at end of file
diff --git a/papers/steering-llms-scalable-2026/paper_type.json b/papers/steering-llms-scalable-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a framework but validates it through quantitative experiments showing 54% alignment improvement and tests RL training with measurable generalization properties across benchmarks and case studies."
+}
+\ No newline at end of file
diff --git a/papers/stellar-searchbased-testing-2026/paper_type.json b/papers/stellar-searchbased-testing-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper introduces STELLAR and validates it through experimental benchmarking against baselines (random, combinatorial, coverage-based), reporting quantitative results and effect sizes on LLM testing tasks."
+}
+\ No newline at end of file
diff --git a/papers/stelp-secure-transpilation-2026/paper_type.json b/papers/stelp-secure-transpilation-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "STELP is a security system for LLM-generated code with experimental validation on a custom benchmark, where the primary contribution is the system's performance findings (TBR, TAR, latency) rather than the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/stepshield-when-not-2026/paper_type.json b/papers/stepshield-when-not-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces the first benchmark for evaluating temporal detection of rogue agent behavior, with empirical baselines demonstrating its utility."
+}
+\ No newline at end of file
diff --git a/papers/stop-wasting-your-2025/paper_type.json b/papers/stop-wasting-your-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes SupervisorAgent framework and validates it experimentally across 6 benchmarks, 3 models, and multiple MAS frameworks with quantitative results and ablation studies."
+}
+\ No newline at end of file
diff --git a/papers/strategic-dishonesty-safety-evals-2025/paper_type.json b/papers/strategic-dishonesty-safety-evals-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on frontier LLMs to quantitatively measure strategic dishonesty behavior, reports F1 scores for detection methods, and validates findings through steering experiments; the primary contribution is experimental findings about model behavior, not the benchmarks being evaluated."
+}
+\ No newline at end of file
diff --git a/papers/strongermas-multiagent-reinforcement-2025/paper_type.json b/papers/strongermas-multiagent-reinforcement-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces AT-GRPO algorithm and validates it with quantitative experiments on existing benchmarks (long-horizon planning, coding, math), with the primary contribution being the experimental findings of performance improvements."
+}
+\ No newline at end of file
diff --git a/papers/structtest-benchmarking-llms-2024/paper_type.json b/papers/structtest-benchmarking-llms-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces StructTest, a new rule-based benchmark for evaluating LLM reasoning through structured outputs, with evaluation on 17 models serving to validate the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/study-prompt-injection-2024/paper_type.json b/papers/study-prompt-injection-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments testing secure prompting against prompt injection attacks on an LLM-controlled robot, reporting quantitative metrics (30.8% improvement, precision/recall) with the primary contribution being experimental findings about attack detection effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/style-outweighs-substance-2024/paper_type.json b/papers/style-outweighs-substance-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments evaluating LLM judges on alignment benchmarks with quantitative correlations and a meta-analysis of post-training methods, reporting empirical findings about failure modes rather than creating a new benchmark or making unvalidated claims."
+}
+\ No newline at end of file
diff --git a/papers/subliminal-corruption-mechanisms-2025/paper_type.json b/papers/subliminal-corruption-mechanisms-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts experiments to quantify alignment degradation mechanisms, measure phase transition thresholds (~250 examples), and analyze interpretability patterns, with primary contribution being empirical findings about corruption dynamics."
+}
+\ No newline at end of file
diff --git a/papers/subliminal-learning-language-2025/paper_type.json b/papers/subliminal-learning-language-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Primary contribution is a theorem proving this is a general property of neural networks under certain conditions, with experiments validating the phenomenon."
+}
+\ No newline at end of file
diff --git a/papers/successive-prompting-decomposing-2022/paper_type.json b/papers/successive-prompting-decomposing-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a successive prompting method for decomposing complex questions and validates it experimentally with quantitative improvements (50.2% F1, ~5% absolute gain) on the DROP benchmark."
+}
+\ No newline at end of file
diff --git a/papers/survey-adversarial-examples-2025/paper_type.json b/papers/survey-adversarial-examples-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly a survey that taxonomizes and synthesizes existing adversarial attack and defense methods, theoretical explanations, and their tradeoffs rather than proposing novel experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/survey-agentic-service-2025/paper_type.json b/papers/survey-agentic-service-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly titled as a survey, uses meta-analysis methodology, and organizes existing work across multiple analysis dimensions within a proposed three-stage framework for understanding agentic service ecosystems."
+}
+\ No newline at end of file
diff --git a/papers/survey-automated-program-2023/paper_type.json b/papers/survey-automated-program-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Paper explicitly surveys and classifies 140 representative APR works from 2005-2022 with meta-analysis of techniques and methodological problems, rather than reporting original experimental results or introducing a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/survey-autonomous-llm-agents-2023/paper_type.json b/papers/survey-autonomous-llm-agents-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Systematically reviews and synthesizes existing work on LLM-based autonomous agents, proposing a unified organizational framework and categorizing approaches across domains rather than conducting original experiments."
+}
+\ No newline at end of file
diff --git a/papers/survey-code-gen-llm-agents-2025/paper_type.json b/papers/survey-code-gen-llm-agents-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Systematically reviews and categorizes 100 papers on LLM-based code generation agents, synthesizing techniques, evolution, and deployed tools rather than conducting novel experiments."
+}
+\ No newline at end of file
diff --git a/papers/survey-code-generation-2024/paper_type.json b/papers/survey-code-generation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This is an explicit systematic literature review of 111 papers synthesizing findings about LLM code generation strategies, with the primary contribution being a synthesis of existing work rather than new experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/survey-data-contamination-2025/paper_type.json b/papers/survey-data-contamination-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Categorizes and reviews existing work on data contamination types, evaluation strategies, and detection paradigms in LLMs, with methodology tag of meta-analysis."
+}
+\ No newline at end of file
diff --git a/papers/survey-hallucination-large-2023/paper_type.json b/papers/survey-hallucination-large-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly titled a survey and uses meta-analysis methodology to catalog and synthesize hallucination detection methods, mitigation strategies, and datasets across ~25 papers, creating a comprehensive taxonomy across modalities rather than contributing new experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/survey-large-language-2023/paper_type.json b/papers/survey-large-language-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This is a meta-analysis of 281 papers that categorizes and synthesizes existing LLM security/privacy research rather than conducting original experiments or creating a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/survey-learningbased-automated-2023/paper_type.json b/papers/survey-learningbased-automated-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This is a systematic review that synthesizes 112 learning-based APR papers from 2016–2022, categorizing techniques and documenting datasets and metrics rather than reporting original experimental contributions."
+}
+\ No newline at end of file
diff --git a/papers/survey-llm-code-generation-2025/paper_type.json b/papers/survey-llm-code-generation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly titled as a survey that organizes and synthesizes existing LLM code generation literature across four topic areas (challenges, techniques, evaluation, applications) with meta-analysis methodology, though with limited references."
+}
+\ No newline at end of file
diff --git a/papers/survey-llm-code-low-resource-2024/paper_type.json b/papers/survey-llm-code-low-resource-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly titled as a survey conducting a systematic review of 111 existing papers on LLM-based code generation, with the primary contribution being synthesis and categorization of prior work rather than new experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/survey-llmbased-multiagent-2024/paper_type.json b/papers/survey-llmbased-multiagent-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Title explicitly identifies it as a survey; methodology tagged as meta-analysis; synthesizes existing work into a framework and reviews representative systems across multiple application domains."
+}
+\ No newline at end of file
diff --git a/papers/survey-llms-software-engineering-2023/paper_type.json b/papers/survey-llms-software-engineering-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper explicitly provides a comprehensive taxonomy and synthesis of 62 LLMs across 947 studies in software engineering, organizing and cataloging existing work rather than conducting original experiments."
+}
+\ No newline at end of file
diff --git a/papers/survey-progress-llm-2025/paper_type.json b/papers/survey-progress-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly a survey that taxonomizes and reviews existing reward modeling approaches for LLM alignment, with no experimental validation or benchmark contributions."
+}
+\ No newline at end of file
diff --git a/papers/survey-useful-llm-2024/paper_type.json b/papers/survey-useful-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Title explicitly identifies it as a survey, methodology tag is meta-analysis, and the paper reviews and synthesizes existing LLM evaluation benchmarks and methods across multiple domains and application areas."
+}
+\ No newline at end of file
diff --git a/papers/survival-games-humanllm-2025/paper_type.json b/papers/survival-games-humanllm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments in a survival simulation and reports quantitative findings about LLM behavioral differences across models and prompting strategies."
+}
+\ No newline at end of file
diff --git a/papers/survivehr-competing-risks-2025/paper_type.json b/papers/survivehr-competing-risks-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents a foundation model with quantitative experimental results comparing performance against baseline methods on clinical prediction tasks, with empirical findings on zero-shot and fine-tuned performance."
+}
+\ No newline at end of file
diff --git a/papers/sustainable-llm-inference-2026/paper_type.json b/papers/sustainable-llm-inference-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a model switching system and validates it through experiments, reporting quantitative performance metrics (energy reduction, latency, quality scores, routing accuracy) on benchmark queries; the primary contribution is the experimental findings, not the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/svrepair-structured-visual-2026/paper_type.json b/papers/svrepair-structured-visual-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a multimodal APR framework and reports quantitative experimental results on multiple benchmarks (SWE-Bench M, MMCode, CodeVision) with ablation studies demonstrating the effectiveness of structured visual reasoning."
+}
+\ No newline at end of file
diff --git a/papers/swe-agent-2024/paper_type.json b/papers/swe-agent-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces agent-computer interface (ACI) concept but validates it primarily through extensive experiments on SWE-bench with quantitative results (12.47% resolution rate, ablation studies), making the empirical findings the core contribution."
+}
+\ No newline at end of file
diff --git a/papers/swe-bench-2023/paper_type.json b/papers/swe-bench-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SWE-bench, a benchmark of 2,294 real GitHub issues across 12 Python repositories for evaluating language models on software engineering tasks; baseline experiments validate the benchmark but the primary contribution is the benchmark resource itself."
+}
+\ No newline at end of file
diff --git a/papers/swe-bench-illusion-2025/paper_type.json b/papers/swe-bench-illusion-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments on SWE-Bench with quantitative measurements (file path accuracy, n-gram overlap, verbatim match rates) to demonstrate that model performance reflects memorization rather than reasoning."
+}
+\ No newline at end of file
diff --git a/papers/swe-bench-plus-2024/paper_type.json b/papers/swe-bench-plus-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SWE-bench+, an improved coding benchmark that addresses quality issues (solution leakage, weak test cases) in the original SWE-bench dataset."
+}
+\ No newline at end of file
diff --git a/papers/swe-bench-pro-2025/paper_type.json b/papers/swe-bench-pro-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "SWE-Bench Pro's primary contribution is the introduction of a new 1,865-problem benchmark with human verification, not the empirical findings from running models on it."
+}
+\ No newline at end of file
diff --git a/papers/swe-bench-what-in-benchmark-2026/paper_type.json b/papers/swe-bench-what-in-benchmark-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This is a meta-analysis of 212 existing leaderboard entries that synthesizes characteristics of the SWE-Bench benchmark ecosystem (industry dominance, model performance patterns, critical issues) rather than introducing new experiments or a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/swe-evo-coding-agents-2025/paper_type.json b/papers/swe-evo-coding-agents-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The paper's primary contribution is introducing SWE-EVO, a new benchmark with 48 tasks; experiments are conducted to validate and demonstrate the benchmark's utility in revealing capability gaps."
+}
+\ No newline at end of file
diff --git a/papers/swe-mera-dynamic-benchmark-2025/paper_type.json b/papers/swe-mera-dynamic-benchmark-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SWE-MERA, a new dynamic benchmark for evaluating LLMs on software engineering tasks with ~10,000 tasks via automated pipeline; experimental results validate the benchmark's discriminative power but are secondary to the benchmark contribution itself."
+}
+\ No newline at end of file
diff --git a/papers/sweeffi-reevaluating-software-2025/paper_type.json b/papers/sweeffi-reevaluating-software-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments on multiple AI systems with new effectiveness metrics and reports quantitative findings about scaffold-model synergy and resource trade-offs, making the primary contribution the experimental results rather than the metrics framework itself."
+}
+\ No newline at end of file
diff --git a/papers/swelancer-can-frontier-2025/paper_type.json b/papers/swelancer-can-frontier-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The paper's primary contribution is introducing SWE-Lancer, a novel benchmark of 1,488 real-world freelance software engineering tasks with E2E tests and real monetary payouts, while experimental evaluation of LLMs on this benchmark is secondary."
+}
+\ No newline at end of file
diff --git a/papers/swenergy-empirical-study-2025/paper_type.json b/papers/swenergy-empirical-study-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments on existing agentic frameworks, reports quantitative metrics (resolution rates, energy consumption, correlations) on SWE-bench, with empirical findings as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/sweprotege-learning-selectively-2026/paper_type.json b/papers/sweprotege-learning-selectively-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments on SWE-bench Verified, reports quantitative results (42.4% Pass@1), and demonstrates a training approach's effectiveness—the primary contribution is experimental findings on improved SLM performance, not a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/swerank-multilingual-multiturn-2025/paper_type.json b/papers/swerank-multilingual-multiturn-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper introduces new ranking models (SweRankEmbedMulti, SweRankAgent) and reports quantitative experimental results showing state-of-the-art performance and improvements over baselines (2-6 points on Acc@10), with the multilingual dataset as a supporting contribution rather than the primary focus."
+}
+\ No newline at end of file
diff --git a/papers/swerebench-automated-pipeline-2025/paper_type.json b/papers/swerebench-automated-pipeline-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SWE-rebench, an automated pipeline that extracts 21,336 verifiable tasks from GitHub to create a new decontaminated benchmark for evaluating software engineering agents; the primary contribution is the benchmark and pipeline infrastructure, not experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/swtbench-testing-validating-2024/paper_type.json b/papers/swtbench-testing-validating-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces SWT-Bench, a novel benchmark dataset of 1,983 bug-fix instances from GitHub; while experiments validate the benchmark, the primary contribution is the benchmark and evaluation framework itself."
+}
+\ No newline at end of file
diff --git a/papers/syncode-llm-generation-2024/paper_type.json b/papers/syncode-llm-generation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces a constrained decoding framework and validates it experimentally across multiple models and benchmarks, reporting quantitative syntax error reduction results."
+}
+\ No newline at end of file
diff --git a/papers/synergizing-human-expertise-2024/paper_type.json b/papers/synergizing-human-expertise-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs case studies testing ChatGPT-4's capabilities on microscopy automation tasks and reports qualitative findings about its strengths and limitations in scientific applications."
+}
+\ No newline at end of file
diff --git a/papers/syntactic-robustness-llmbased-2024/paper_type.json b/papers/syntactic-robustness-llmbased-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments measuring LLM robustness to formula mutations, reports quantitative performance metrics across models and mutation distances, and tests a pre-processing solution."
+}
+\ No newline at end of file
diff --git a/papers/synthetic-code-surgery-2025/paper_type.json b/papers/synthetic-code-surgery-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports quantitative experimental results comparing quality-filtered synthetic data against baselines on program repair benchmarks with statistical significance testing (ANOVA, p<0.0001), where the primary contribution is the empirical finding that filtered synthetic data outperforms unfiltered and real-world alternatives."
+}
+\ No newline at end of file
diff --git a/papers/sysllmatic-large-language-2025/paper_type.json b/papers/sysllmatic-large-language-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents SysLLMatic system and validates it through experiments on real-world DaCapo applications and microbenchmarks with quantitative latency/throughput improvements against compiler baselines."
+}
+\ No newline at end of file
diff --git a/papers/system-automated-unit-2024/paper_type.json b/papers/system-automated-unit-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments on 10 Java repositories, reports quantitative results (compilation rates, pass rates, coverage metrics), and makes empirical comparisons between LLM models and prompting strategies, with the primary contribution being experimental findings rather than the benchmark or system itself."
+}
+\ No newline at end of file
diff --git a/papers/system-prompt-poisoning-2025/paper_type.json b/papers/system-prompt-poisoning-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper conducts systematic experiments measuring system prompt poisoning attack effectiveness, reporting quantitative results (accuracy reduction to <4% on MATH) across extended conversations and against various defenses."
+}
+\ No newline at end of file
diff --git a/papers/systematic-evaluation-llmasajudge-2024/paper_type.json b/papers/systematic-evaluation-llmasajudge-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper systematically evaluates LLM judges through experiments, reports quantitative results (accuracy, alignment scores, bias measurements), and presents empirical findings about prompt sensitivity and LLM behavior rather than primarily creating a new benchmark or introducing theoretical results."
+}
+\ No newline at end of file
diff --git a/papers/systematic-literature-review-2024-2/paper_type.json b/papers/systematic-literature-review-2024-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "This is a systematic literature review (SLR) that synthesizes findings from 19 papers on AI code generation security, with the primary contribution being structured synthesis of existing work rather than new experiments."
+}
+\ No newline at end of file
diff --git a/papers/systematic-literature-review-2024/paper_type.json b/papers/systematic-literature-review-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicit systematic literature review of 189 LLM-based APR papers with meta-analysis of trends, models used, and field evolution—primary contribution is synthesis of existing work, not new experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/systematic-literature-review-2025-2/paper_type.json b/papers/systematic-literature-review-2025-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Systematic literature review of 112 peer-reviewed articles synthesizing ethical challenges in generative AI—primary contribution is field synthesis and meta-analysis, not original experiments or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/systematic-literature-review-2025/paper_type.json b/papers/systematic-literature-review-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Systematic literature review explicitly analyzing 28 existing papers on PEFT techniques, synthesizing task distribution, method adoption patterns, and performance comparisons across the field."
+}
+\ No newline at end of file
diff --git a/papers/systematic-review-infrastructure-2023/paper_type.json b/papers/systematic-review-infrastructure-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper narratively reviews and synthesizes existing work on Infrastructure as Code and GitOps, with methodology tags including meta-analysis, making it a literature review despite lacking rigorous systematic methodology."
+}
+\ No newline at end of file
diff --git a/papers/systemlevel-defense-against-2024/paper_type.json b/papers/systemlevel-defense-against-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes f-secure LLM architecture and validates it experimentally with quantitative results (0% vs 15-67% ASR) on established benchmarks (InjectAgent) across multiple models."
+}
+\ No newline at end of file
diff --git a/papers/syzygy-dual-codetest-2024/paper_type.json b/papers/syzygy-dual-codetest-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents a C-to-Rust translation system and validates it experimentally on a real codebase (Zopfli) with quantitative results on coverage and performance metrics."
+}
+\ No newline at end of file
diff --git a/papers/t3-multilevel-treebased-2025/paper_type.json b/papers/t3-multilevel-treebased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes T3, a multi-level repair framework, and validates it experimentally on the MODIT dataset with quantitative comparisons against baselines and ablation studies demonstrating the method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/t5apr-empowering-automated-2023/paper_type.json b/papers/t5apr-empowering-automated-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Develops and evaluates T5APR on existing program repair benchmarks across six datasets and multiple languages, reporting quantitative bug-fixing results and comparative performance metrics."
+}
+\ No newline at end of file
diff --git a/papers/tamas-benchmarking-adversarial-2025/paper_type.json b/papers/tamas-benchmarking-adversarial-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "TAMAS introduces a new benchmarking framework for evaluating adversarial risks in multi-agent LLM systems, with comprehensive experimental validation across 10 models and 5 configurations as demonstration of the benchmark's utility."
+}
+\ No newline at end of file
diff --git a/papers/target-traffic-rulebased-2023/paper_type.json b/papers/target-traffic-rulebased-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper demonstrates a method for generating test scenarios and empirically validates it by executing tests on 7 autonomous driving systems, uncovering 610 erroneous behaviors and reporting quantitative accuracy metrics, making empirical findings rather than benchmark creation the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/task-shield-enforcing-2024/paper_type.json b/papers/task-shield-enforcing-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes Task Shield and validates it through quantitative experiments on the AgentDojo benchmark, reporting attack success rates and task utility metrics; the primary contribution is the empirical findings about defense effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/tasklevel-evaluation-ai-2026/paper_type.json b/papers/tasklevel-evaluation-ai-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper empirically evaluates AI agents' performance on open-source projects by measuring quantitative outcomes (PR acceptance rates, review comments, commit message quality) across different models."
+}
+\ No newline at end of file
diff --git a/papers/taxonomy-evaluation-exploitation-2025/paper_type.json b/papers/taxonomy-evaluation-exploitation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Designs novel adaptive attacks and empirically evaluates their effectiveness against 11 defense frameworks across benchmarks, with quantitative results showing 5x ASR increases."
+}
+\ No newline at end of file
diff --git a/papers/teaching-critiquing-conceptualization-2025/paper_type.json b/papers/teaching-critiquing-conceptualization-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes a pedagogical framework and conceptual approach to teaching NLP students critical evaluation of research concepts, supported by qualitative case-study observations rather than experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/teaching-programming-age-2025/paper_type.json b/papers/teaching-programming-age-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes a pedagogical framework (visual program simulation) for programming education with LLMs and makes prescriptive claims about assessment strategies, with literature review and preliminary student validation as supporting evidence."
+}
+\ No newline at end of file
diff --git a/papers/teamcraft-benchmark-multimodal-2024/paper_type.json b/papers/teamcraft-benchmark-multimodal-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is TeamCraft, a new 55,000-variant multi-modal multi-agent benchmark in Minecraft; the experiments on VLA models and control architectures are baseline validations of the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/telecomrag-taming-telecom-2024/paper_type.json b/papers/telecomrag-taming-telecom-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes a RAG framework for telecom standards with only qualitative case-study validation (one example) and no quantitative evaluation metrics, despite claims of broader testing."
+}
+\ No newline at end of file
diff --git a/papers/temporal-knowledgebase-creation-2025/paper_type.json b/papers/temporal-knowledgebase-creation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes an LLM annotation pipeline and evaluates it experimentally with quantitative F1 scores and inter-LLM agreement metrics on human-annotated test sets."
+}
+\ No newline at end of file
diff --git a/papers/ten-simple-rules-2025/paper_type.json b/papers/ten-simple-rules-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes ten prescriptive rules for AI-assisted coding without conducting original experiments, instead synthesizing existing evidence and advocating a viewpoint on best practices for scientific computing."
+}
+\ No newline at end of file
diff --git a/papers/test-driven-interactive-code-gen-2024/paper_type.json b/papers/test-driven-interactive-code-gen-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts user study and benchmark experiments (MBPP, HumanEval) with quantitative results on correctness and pass@1 improvements; primary contribution is empirical evaluation of TiCoder's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/test-smells-llmgenerated-2024/paper_type.json b/papers/test-smells-llmgenerated-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs observational experiments analyzing test smells in LLM-generated unit tests, reporting quantitative prevalence metrics across different prompting strategies and model configurations."
+}
+\ No newline at end of file
diff --git a/papers/test-wars-comparative-2025/paper_type.json b/papers/test-wars-comparative-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs a comparative experiment across multiple test generation approaches (EvoSuite, Kex, ChatGPT-4o), reporting quantitative results on coverage metrics, mutation scores, and fault reproduction, making experimental findings its primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/testbench-evaluating-classlevel-2024/paper_type.json b/papers/testbench-evaluating-classlevel-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Primary contribution is experimental evaluation of multiple LLMs on a test generation task with detailed quantitative findings about model performance, context effects, and repair strategies, rather than introducing TestBench as a reusable benchmark resource."
+}
+\ No newline at end of file
diff --git a/papers/testdriven-development-llmbased-2024/paper_type.json b/papers/testdriven-development-llmbased-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments across MBPP, HumanEval, and CodeChef datasets with quantitative results (9.15–29.57% improvements) comparing TDD approaches and different models; primary contribution is experimental findings about test-driven development's effectiveness for code generation."
+}
+\ No newline at end of file
diff --git a/papers/testgeneval-real-world-2024/paper_type.json b/papers/testgeneval-real-world-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces TestGenEval, a large-scale benchmark for evaluating LLM test generation, with 68,647 tests across 1,210 code-test pairs; the primary contribution is the benchmark itself, though it includes baseline evaluations."
+}
+\ No newline at end of file
diff --git a/papers/testtime-matching-unlocking-2025/paper_type.json b/papers/testtime-matching-unlocking-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes new evaluation metrics and algorithms (GroupMatch, SimpleMatch, TTM), then validates them through experiments on Winoground benchmark, demonstrating improved multimodal model performance with quantitative results."
+}
+\ No newline at end of file
diff --git a/papers/text-prompt-injection-2025/paper_type.json b/papers/text-prompt-injection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes and empirically evaluates a text prompt injection attack on VLMs, reporting quantitative success rates across constraint levels and model scales."
+}
+\ No newline at end of file
diff --git a/papers/textresnet-decoupling-routing-2026/paper_type.json b/papers/textresnet-decoupling-routing-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes the TextResNet framework and validates it through controlled experiments on HotpotQA with quantitative comparisons to baselines, ablation studies, and efficiency metrics."
+}
+\ No newline at end of file
diff --git a/papers/texttoaudio-generation-instructiontuned-2023/paper_type.json b/papers/texttoaudio-generation-instructiontuned-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes TANGO method and reports comprehensive quantitative results on AudioCaps benchmark against AudioLDM baseline, with ablations on training data and augmentation techniques."
+}
+\ No newline at end of file
diff --git a/papers/textttremind-understanding-deductive-2025/paper_type.json b/papers/textttremind-understanding-deductive-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments measuring LLM code reasoning gaps and validates a proposed multi-agent framework with quantitative results, making experimental findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/tfhecoder-evaluating-llmagentic-2025/paper_type.json b/papers/tfhecoder-evaluating-llmagentic-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments comparing multiple LLM models on TFHE code generation tasks and reports quantitative performance results; primary contribution is the empirical findings about model performance and technique effectiveness, not the benchmark itself."
+}
+\ No newline at end of file
diff --git a/papers/theoretical-foundations-scaling-2025/paper_type.json b/papers/theoretical-foundations-scaling-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper extends neural scaling laws through mathematical formulation and formal analysis of how granularity affects model scaling, with primary contributions being theorems and fitted scaling equations rather than experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/they-all-good-2025/paper_type.json b/papers/they-all-good-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Empirical study of 1,023 CoT-code pairs analyzing quality issues with quantitative findings (76.4% defect rate, factor breakdowns, disconnect metrics)."
+}
+\ No newline at end of file
diff --git a/papers/think-locally-explain-2026/paper_type.json b/papers/think-locally-explain-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes EoG architecture and validates it experimentally on ITBench, reporting quantitative improvements (7x F1 gain, Pass@k/Majority@k metrics) against ReAct baselines; primary contribution is experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/thinking-isnt-illusion-2025/paper_type.json b/papers/thinking-isnt-illusion-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs systematic experiments testing tool-augmented reasoning models on an existing benchmark, reporting quantitative performance results (e.g., accuracy percentages across different models and tasks) as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/thinking-llms-lie-2025/paper_type.json b/papers/thinking-llms-lie-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments with CoT-enabled LLMs under threat-based prompting, quantifies deception rates and detection accuracy, and tests activation steering interventions—the primary contribution is experimental findings about deception behavior rather than a new benchmark or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/thinking-longer-not-2025/paper_type.json b/papers/thinking-longer-not-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments evaluating LLM performance on hierarchical legal reasoning, reports quantitative results (100% on surface-level vs 11-34% on integrated analysis, 2.6x RL improvement), with primary contribution being empirical findings about the thinking-longer paradox rather than a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/thinkrepair-selfdirected-automated-2024/paper_type.json b/papers/thinkrepair-selfdirected-automated-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes an LLM-based program repair method and validates it experimentally on Defects4J benchmarks with quantitative bug-fix improvements and ablation studies; primary contribution is empirical findings."
+}
+\ No newline at end of file
diff --git a/papers/thought-communication-multiagent-2025/paper_type.json b/papers/thought-communication-multiagent-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper's primary contribution is Theorems 1-3 proving identifiability guarantees for recovered latent thoughts, with empirical benchmarks serving as supporting validation."
+}
+\ No newline at end of file
diff --git a/papers/threatlens-llmguided-threat-2025/paper_type.json b/papers/threatlens-llmguided-threat-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Describes a working LLM system (ThreatLens) with concrete experimental evaluation on NEORV32 hardware, including policy generation metrics and validation against real vulnerabilities, despite acknowledged lack of systematic quantitative evaluation."
+}
+\ No newline at end of file
diff --git a/papers/throwbench-benchmarking-llms-2025/paper_type.json b/papers/throwbench-benchmarking-llms-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "THROWBENCH is the primary contribution—a new multilingual benchmark of 2,466 programs for evaluating LLMs on exception prediction; baseline experiments on six models are supporting evidence."
+}
+\ No newline at end of file
diff --git a/papers/tigercoder-novel-suite-2025/paper_type.json b/papers/tigercoder-novel-suite-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces TigerCoder models and reports experimental findings (11-13% Pass@1 improvements, performance drops with Bangla inputs) through benchmarking on Python and other programming languages."
+}
+\ No newline at end of file
diff --git a/papers/timecma-llmempowered-multivariate-2024/paper_type.json b/papers/timecma-llmempowered-multivariate-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces a method (TimeCMA) and validates it through experiments on eight public time series datasets, reporting quantitative performance metrics (MSE, MAE) against seven baselines."
+}
+\ No newline at end of file
diff --git a/papers/timecma-llmempowered-time-2024/paper_type.json b/papers/timecma-llmempowered-time-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a method (TimeCMA) and reports quantitative experimental results on 8 standard datasets with performance comparisons against 7 baselines, making experimental findings the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/todo-enhancing-llm-2024/paper_type.json b/papers/todo-enhancing-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a ternary preference optimization algorithm and validates it through comparative experiments on multiple benchmarks, with quantitative results showing improvements over DPO baselines."
+}
+\ No newline at end of file
diff --git a/papers/tokenefficient-prompt-injection-2025/paper_type.json b/papers/tokenefficient-prompt-injection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts experiments on LLM prompt injection attacks with quantitative metrics (ASR, token compression rates) across different arithmetic operations on DeepSeek-R1."
+}
+\ No newline at end of file
diff --git a/papers/too-easily-fooled-2025/paper_type.json b/papers/too-easily-fooled-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs systematic experiments testing prompt injection attacks on six LLM models with quantitative measurements of vulnerability and defense effectiveness across different prompt types and model architectures."
+}
+\ No newline at end of file
diff --git a/papers/top-general-performance-2024/paper_type.json b/papers/top-general-performance-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Primary contribution is DomainCodeBench, a new multi-domain code generation benchmark with 2,400 manually verified tasks across 12 domains; the empirical evaluation of 10 LLMs serves to validate and demonstrate the benchmark's utility."
+}
+\ No newline at end of file
diff --git a/papers/top-leaderboard-ranking-2024/paper_type.json b/papers/top-leaderboard-ranking-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "EvoEval is a new suite of 828 benchmark problems evolved from HumanEval; the paper's primary contribution is the benchmark itself, with experiments on 51 LLMs serving as baselines to validate the benchmark's utility."
+}
+\ No newline at end of file
diff --git a/papers/topicattack-indirect-prompt-2025/paper_type.json b/papers/topicattack-indirect-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces TopicAttack attack method and reports quantitative experimental results (>90% success rates) across 10 LLMs and 3 datasets, with primary contribution being the empirical findings."
+}
+\ No newline at end of file
diff --git a/papers/traceable-latent-variable-2026/paper_type.json b/papers/traceable-latent-variable-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes the TLVD framework for latent variable discovery and validates it through experiments on 5 datasets (3 medical, 2 social science), reporting quantitative results comparing against multiple baselines including single LLMs and multi-agent approaches."
+}
+\ No newline at end of file
diff --git a/papers/tracing-errors-constructing-2025/paper_type.json b/papers/tracing-errors-constructing-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Presents LTFix method, runs experiments repairing 37 of 49 real-world memory errors across 14 open-source projects, reports quantitative comparisons against baselines, and includes ablation studies—primary contribution is experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/tracking-moving-target-2025/paper_type.json b/papers/tracking-moving-target-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Validates a measurement framework through a longitudinal case study with quantitative results comparing model performance across time, making the empirical findings (performance improvements from 32.44% to 90%+) the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/trae-agent-llmbased-2025/paper_type.json b/papers/trae-agent-llmbased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Introduces a novel agent-based ensemble approach for software engineering and reports quantitative experimental results (5.83–14.60% improvements) on the SWE-bench Verified benchmark across multiple LLMs."
+}
+\ No newline at end of file
diff --git a/papers/training-generalizable-collaborative-2026/paper_type.json b/papers/training-generalizable-collaborative-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The primary contribution is empirical validation of the SRPO algorithm across multiple benchmarks (Overcooked, Tag, Hanabi, GSM8K), demonstrating quantitative performance improvements over IPPO baselines, with theoretical analysis (provable results on RQE) serving as justification for the approach."
+}
+\ No newline at end of file
diff --git a/papers/training-llms-generating-2024/paper_type.json b/papers/training-llms-generating-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes an online DPO training method and reports quantitative experimental results (compilation rate improvement from 7% to ~70%, semantic correctness ~45%) validating the approach on IEC 61131-3 code generation."
+}
+\ No newline at end of file
diff --git a/papers/training-llms-honesty-2025/paper_type.json b/papers/training-llms-honesty-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes confessions trained via RL and reports quantitative experimental results (confession rates, detection accuracy) on GPT-5-Thinking across 11/12 evaluations."
+}
+\ No newline at end of file
diff --git a/papers/traitors-deception-trust-2025/paper_type.json b/papers/traitors-deception-trust-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces a new multi-agent simulation framework (The Traitors game) for evaluating deception and trust dynamics, with initial experiments serving to demonstrate the framework's utility rather than being the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/transfer-q-star-2024/paper_type.json b/papers/transfer-q-star-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "Primary contribution is Theorem 1 proving suboptimality bounds and KL-efficiency guarantees for the Transfer Q⋆ method; experiments validate the theoretical framework rather than being the main finding."
+}
+\ No newline at end of file
diff --git a/papers/transformer-we-trust-2026/paper_type.json b/papers/transformer-we-trust-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Meta-analysis that comprehensively surveys transformer trustworthiness across six dimensions and multiple domains, identifying recurring failure patterns in existing work."
+}
+\ No newline at end of file
diff --git a/papers/transforming-software-development-2024-2/paper_type.json b/papers/transforming-software-development-2024-2/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Labeled 'Systematic Analysis' with meta-analysis methodology, synthesizing findings from secondary sources (blog posts, industry reports) rather than conducting original experiments, making this a literature review despite poor sourcing practices."
+}
+\ No newline at end of file
diff --git a/papers/transforming-wearable-data-2024/paper_type.json b/papers/transforming-wearable-data-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Paper introduces PHIA agent and experimentally evaluates its performance on health query tasks, reporting quantitative results (84% accuracy, human ratings across dimensions, error rates) against baselines."
+}
+\ No newline at end of file
diff --git a/papers/tree-thoughts-deliberate-2023/paper_type.json b/papers/tree-thoughts-deliberate-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes Tree of Thoughts framework and validates it through experiments, reporting quantitative improvements (4%→74%, 6.19→7.56, 15.6%→60%) across three tasks; primary contribution is experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/trigger-haystack-extracting-2026/paper_type.json b/papers/trigger-haystack-extracting-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a scanner method for extracting backdoor triggers and reports quantitative experimental results (87.8% detection rate, zero false positives) evaluated on tasks with baseline comparisons, making the primary contribution the empirical findings rather than a benchmark or theoretical framework."
+}
+\ No newline at end of file
diff --git a/papers/trust-by-design-2026/paper_type.json b/papers/trust-by-design-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Proposes a conceptual framework (BELLA) for LLM routing with interpretability and cost awareness, but explicitly lacks implementation or empirical validation."
+}
+\ No newline at end of file
diff --git a/papers/trust-llmcontrolled-robotics-2025/paper_type.json b/papers/trust-llmcontrolled-robotics-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "The paper explicitly surveys security threats and defenses in LLM-controlled robotics, presenting a taxonomy of existing attack vectors and mitigation strategies rather than proposing new empirical findings, benchmarks, or theoretical proofs."
+}
+\ No newline at end of file
diff --git a/papers/trustworthy-agentic-ai-2025/paper_type.json b/papers/trustworthy-agentic-ai-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a defense framework and validates it through experiments, reporting quantitative results (94% detection accuracy, 70% reduction in trust leakage, 96% benign task accuracy) with baseline comparisons."
+}
+\ No newline at end of file
diff --git a/papers/trustworthy-llm-agents-survey-2025/paper_type.json b/papers/trustworthy-llm-agents-survey-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Title explicitly identifies it as a survey; proposes an organizational framework (TrustAgent) for analyzing existing work on LLM agent trustworthiness across multiple dimensions and paradigms."
+}
+\ No newline at end of file
diff --git a/papers/trustworthy-llms-survey-2023/paper_type.json b/papers/trustworthy-llms-survey-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Proposes a taxonomy of LLM trustworthiness and conducts measurement studies across models to synthesize the field of trustworthiness evaluation; meta-analysis tag and title explicitly indicate a survey contribution."
+}
+\ No newline at end of file
diff --git a/papers/tsapr-tree-search-2025/paper_type.json b/papers/tsapr-tree-search-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes TSAPR and validates it through extensive experiments on three benchmarks (Defects4J, SWE-Bench-Lite, VUL4J) with quantitative comparisons against 10 baselines, making the primary contribution the experimental findings demonstrating the method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/turning-tide-repositorybased-2025/paper_type.json b/papers/turning-tide-repositorybased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces LiveRepoReflection, a new benchmark of 1,888 repository-level coding problems across 6 languages; evaluation of models on the benchmark is secondary."
+}
+\ No newline at end of file
diff --git a/papers/type-context-pass-rates-2024/paper_type.json b/papers/type-context-pass-rates-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a code generation method (CatCoder) and validates it experimentally on existing benchmarks, reporting quantitative improvements in compile@k and pass@k metrics across multiple LLMs."
+}
+\ No newline at end of file
diff --git a/papers/typeaware-llmbased-regression-2025/paper_type.json b/papers/typeaware-llmbased-regression-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments on 183 real-world Python modules, reports quantitative coverage results, and compares Test4Py against existing baselines (CoverUp, CodaMosa), making the primary contribution experimental findings rather than benchmark creation."
+}
+\ No newline at end of file
diff --git a/papers/types-grassroots-logic-2026/paper_type.json b/papers/types-grassroots-logic-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper's primary contribution is formal: defining a type system and proving that well-typing is preserved through computation (Theorem 6.5, Corollary 6.7), with implementation as secondary validation."
+}
+\ No newline at end of file
diff --git a/papers/typescript-typecheck-failures-2025/paper_type.json b/papers/typescript-typecheck-failures-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a type-constrained decoding method and reports quantitative experimental results (50-75% error reduction, 3.5-37% improvement in pass@1) across 6 LLMs on existing benchmarks (HumanEval, MBPP), with primary contribution being the empirical findings."
+}
+\ No newline at end of file
diff --git a/papers/uc-berkeley-mast-2025/paper_type.json b/papers/uc-berkeley-mast-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments measuring failure rates (41-86.7%) across 7 frameworks and develops a failure taxonomy from 150 analyzed traces with quantified inter-annotator agreement (κ=0.88, κ=0.77)."
+}
+\ No newline at end of file
diff --git a/papers/uda-benchmark-suite-2024/paper_type.json b/papers/uda-benchmark-suite-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "The primary contribution is UDA, a new benchmark suite for evaluating RAG systems in document analysis; the experimental results are baselines on this benchmark."
+}
+\ No newline at end of file
diff --git a/papers/ultrarag-modular-automated-2025/paper_type.json b/papers/ultrarag-modular-automated-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper presents a RAG toolkit and validates its effectiveness through quantitative experiments on LawBench, reporting specific improvements in retrieval (MRR@10) and generation (ROUGE-L) metrics in a legal domain case study."
+}
+\ No newline at end of file
diff --git a/papers/uncertainty-large-language-2026/paper_type.json b/papers/uncertainty-large-language-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments across 5 LLMs and 6 benchmarks, reports quantitative results (43.3% configuration comparison), and derives principles from empirical findings about MAS performance."
+}
+\ No newline at end of file
diff --git a/papers/understand-what-llm-2024/paper_type.json b/papers/understand-what-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper proposes a dual preference alignment framework and validates it through experiments on four QA datasets with quantitative results across multiple LLMs, making the primary contribution experimental findings rather than the benchmark or framework itself."
+}
+\ No newline at end of file
diff --git a/papers/understanding-large-language-2023/paper_type.json b/papers/understanding-large-language-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Experiments on 86 fuzz driver generation questions with quantitative metrics (91% success rate, token costs, performance improvements) show the primary contribution is empirical findings about LLM effectiveness, not a new benchmark or survey."
+}
+\ No newline at end of file
diff --git a/papers/understanding-layer-significance-2024/paper_type.json b/papers/understanding-layer-significance-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes the ILA method and reports quantitative experimental results (90% layer overlap, freezing/fine-tuning efficiency) across multiple alignment and reasoning datasets."
+}
+\ No newline at end of file
diff --git a/papers/understanding-multimodal-finetuning-2026/paper_type.json b/papers/understanding-multimodal-finetuning-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Conducts controlled experiments with ablation studies and attribution patching on existing benchmarks (VSR, VQA), reporting quantitative performance impacts (9-16pp accuracy reduction) to understand how multimodal fine-tuning encodes spatial features."
+}
+\ No newline at end of file
diff --git a/papers/understanding-protecting-augmenting-2025/paper_type.json b/papers/understanding-protecting-augmenting-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "survey",
+ "reason": "Explicitly a synthesis paper from a workshop that maps and reviews existing research on GenAI's impact on human cognition; primary contribution is synthesis of the field rather than new experiments, benchmarks, theory, or novel methodology."
+}
+\ No newline at end of file
diff --git a/papers/understanding-software-engineering-2025/paper_type.json b/papers/understanding-software-engineering-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Large-scale empirical study analyzing 120 trajectories and 2,822 iterations from three LLM-based SE agents, reporting quantitative findings about successful vs. failed agent behavior patterns."
+}
+\ No newline at end of file
diff --git a/papers/understanding-subliminal-learning-2025/paper_type.json b/papers/understanding-subliminal-learning-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs controlled experiments to identify mechanisms of subliminal learning (divergence tokens, layer criticality, robustness), validating findings across multiple conditions."
+}
+\ No newline at end of file
diff --git a/papers/unicode-augmenting-evaluation-2025/paper_type.json b/papers/unicode-augmenting-evaluation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces UniCode, a generative evaluation framework that augments programming problems along five structured axes, with empirical testing primarily serving to validate and demonstrate the benchmark's utility in revealing model limitations."
+}
+\ No newline at end of file
diff --git a/papers/unified-scaling-laws-2022/paper_type.json b/papers/unified-scaling-laws-2022/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "theoretical",
+ "reason": "The paper derives unified scaling laws for routed language models with mathematical formulation (bilinear functions in log scale), generalizing existing theory with 168 models used for validation rather than discovery."
+}
+\ No newline at end of file
diff --git a/papers/unified-threat-detection-2025/paper_type.json b/papers/unified-threat-detection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a framework and reports quantitative experimental results (92% detection accuracy, 65% reduction in deceptive outputs, 78% fairness improvement) across multiple benchmarks, even though the evidence is limited to toy models rather than the claimed enterprise-scale systems."
+}
+\ No newline at end of file
diff --git a/papers/uniguardian-unified-defense-2025/paper_type.json b/papers/uniguardian-unified-defense-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes a unified detection framework validated through experiments with quantitative auROC results across three attack types; theoretical component (Proposition 1) supports the empirical approach."
+}
+\ No newline at end of file
diff --git a/papers/unintended-impacts-llm-2024/paper_type.json b/papers/unintended-impacts-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Runs experiments evaluating alignment procedures on multiple language variants and datasets, reporting quantitative findings (e.g., 17.1% US English disparity increase, 13.1% multilingual data) as primary contributions."
+}
+\ No newline at end of file
diff --git a/papers/unseen-horizons-unveiling-2024/paper_type.json b/papers/unseen-horizons-unveiling-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces OBFUSEVAL, a novel code-obfuscation-based benchmark with 1,354 C functions designed to evaluate LLM code generation capabilities; while experiments are run, the primary contribution is the benchmark itself and its evaluation framework."
+}
+\ No newline at end of file
diff --git a/papers/unveiling-potential-diffusion-2025/paper_type.json b/papers/unveiling-potential-diffusion-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Proposes S3 method, runs experiments on WikiBio benchmark, and reports quantitative results comparing structural adherence, hallucination rates, and content fidelity metrics against baselines."
+}
+\ No newline at end of file
diff --git a/papers/upbench-dynamically-evolving-2025/paper_type.json b/papers/upbench-dynamically-evolving-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces UpBench, a new benchmark framework with 322 real-world Upwork jobs and expert-crafted rubrics, with empirical evaluation of LLM agents serving as validation rather than the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/use-generative-ai-2024/paper_type.json b/papers/use-generative-ai-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper experimentally evaluates ChatGPT and Bing AI performance on literature search tasks against an expert benchmark, reporting quantitative results on accuracy rates and failure modes."
+}
+\ No newline at end of file
diff --git a/papers/use-propertybased-testing-2025/paper_type.json b/papers/use-propertybased-testing-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "The paper runs experiments comparing property-based testing approaches for LLM code generation against baselines, reporting quantitative improvements (9.2% pass@1 gain, 23.1-37.3% relative RSR gains) with empirical findings as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/user-centric-evaluation-2024/paper_type.json b/papers/user-centric-evaluation-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Introduces a user-centric evaluation methodology and framework (with multi-attempt testing and usability quality attributes) for assessing code generation tools, validated through a case study."
+}
+\ No newline at end of file
diff --git a/papers/user-feedback-alignment-2025/paper_type.json b/papers/user-feedback-alignment-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Reports quantitative results from a deployed system on a commercial platform, comparing a dual-LLM approach against baselines with measured improvements in novelty and user satisfaction."
+}
+\ No newline at end of file
diff --git a/papers/user-misconceptions-llmbased-2025/paper_type.json b/papers/user-misconceptions-llmbased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "empirical",
+ "reason": "Qualitative empirical study analyzing 500 real user conversations to systematically identify, categorize, and quantify misconceptions about LLM assistants, with inter-rater reliability measurements."
+}
+\ No newline at end of file
diff --git a/papers/utboost-rigorous-evaluation-2025/paper_type.json b/papers/utboost-rigorous-evaluation-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "benchmark-creation",
+ "reason": "Paper introduces an improved evaluation framework (UTBoost parser) with corrected benchmark data, identifying and fixing annotation errors in SWE-Bench; primary contribution is the improved benchmark infrastructure, not findings about agent behavior."
+}
+\ No newline at end of file
diff --git a/papers/validity-what-you-2025/paper_type.json b/papers/validity-what-you-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+ "paper_type": "position",
+ "reason": "Argues a conceptual viewpoint about agentic AI as a software delivery mechanism and proposes a prescriptive five-step design framework centered on validation, without experimental evidence or formal mathematical analysis."
+}
+\ No newline at end of file
diff --git a/papers/validityguided-workflow-robust-2025/paper_type.json b/papers/validityguided-workflow-robust-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Proposes a prescriptive validity-guided workflow and conceptual framework for LLM psychology research without experimental validation, illustrated through a worked example rather than empirical studies."
+}
+\ No newline at end of file
diff --git a/papers/value-variance-mitigating-2026/paper_type.json b/papers/value-variance-mitigating-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes UDPO framework and validates it through experiments reporting quantitative improvements (25pp accuracy gains, Cohen's d > 0.8 statistical significance), with primary contribution being experimental findings rather than the framework itself."
+}
+\ No newline at end of file
diff --git a/papers/values-science-ai-2026/paper_type.json b/papers/values-science-ai-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Argues a conceptual viewpoint about value-ladenness in AI alignment research and makes prescriptive recommendations without experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/vericoder-enhancing-llmbased-2025/paper_type.json b/papers/vericoder-enhancing-llmbased-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Reports quantitative experimental results on benchmarks (VerilogEval, RTLLM) with a fine-tuned model, including ablation studies validating that functional correctness in training data improves LLM performance."
+}
+\ No newline at end of file
diff --git a/papers/vericontaminated-assessing-llmdriven-2025/paper_type.json b/papers/vericontaminated-assessing-llmdriven-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments testing multiple LLMs (GPT-3.5, GPT-4o, LLaMA variants) on existing benchmarks and reports quantitative contamination rates and detection method comparisons."
+}
+\ No newline at end of file
diff --git a/papers/verification-implicit-world-2026/paper_type.json b/papers/verification-implicit-world-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments on 48 generative models (GPT-2 and LLaMA variants), measures soundness via adversarial attacks, and reports quantitative findings about factors affecting world model verification."
+}
+\ No newline at end of file
diff --git a/papers/verifierq-enhancing-llm-2024/paper_type.json b/papers/verifierq-enhancing-llm-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes VerifierQ method and validates it experimentally on GSM8K and MATH benchmarks, comparing quantitative results against baselines like Process Reward Models and Majority Voting."
+}
+\ No newline at end of file
diff --git a/papers/verilogeval-evaluating-large-2023/paper_type.json b/papers/verilogeval-evaluating-large-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "VerilogEval's primary contribution is a new benchmark with 156 Verilog problems and automated evaluation infrastructure; baseline experiments with CodeGen and GPT-3.5 validate the benchmark rather than being the main finding."
+}
+\ No newline at end of file
diff --git a/papers/verilogreader-llmaided-hardware-2024/paper_type.json b/papers/verilogreader-llmaided-hardware-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Conducts experimental evaluation of LLMs for hardware test generation, reporting quantitative results (5x–100x cycle reduction, coverage metrics) on benchmark designs as its primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/verimind-agentic-llm-2025/paper_type.json b/papers/verimind-agentic-llm-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The primary contribution is the VeriMind multi-agent framework with experimental validation showing 8.3pp and 8.1pp improvements on VerilogEval; while it introduces a novel metric, the benchmark itself is existing and the metric is secondary to demonstrating the system's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/verina-benchmarking-verifiable-2025/paper_type.json b/papers/verina-benchmarking-verifiable-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "VERINA's primary contribution is a new 189-task benchmark in Lean for evaluating verifiable code generation; while baseline evaluations are included, the benchmark itself is the core contribution."
+}
+\ No newline at end of file
diff --git a/papers/verpo-verifiable-dense-2026/paper_type.json b/papers/verpo-verifiable-dense-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes VeRPO method and validates it with quantitative experiments on six code generation benchmarks, including ablation studies, with the primary contribution being empirical performance improvements."
+}
+\ No newline at end of file
diff --git a/papers/vhdleval-framework-evaluating-2024/paper_type.json b/papers/vhdleval-framework-evaluating-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is the VHDL-Eval benchmark itself (202 problems), with model evaluations and fine-tuning experiments serving as baseline validation of the framework."
+}
+\ No newline at end of file
diff --git a/papers/vibe-aigc-new-2026/paper_type.json b/papers/vibe-aigc-new-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Proposes a new paradigm and conceptual framework (multi-agent orchestration) to address the Intent-Execution Gap without experimental validation or implementation."
+}
+\ No newline at end of file
diff --git a/papers/vibe-coding-ainative-2025/paper_type.json b/papers/vibe-coding-ainative-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Proposes a conceptual framework and reference architecture for a programming paradigm without empirical validation or formal mathematical analysis."
+}
+\ No newline at end of file
diff --git a/papers/vibe-coding-practice-2025/paper_type.json b/papers/vibe-coding-practice-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "survey",
+  "reason": "A grey literature review that synthesizes and meta-analyzes 101 existing practitioner sources about vibe coding practices; primary contribution is the synthesis of existing work, not original experiments or a new benchmark."
+}
+\ No newline at end of file
diff --git a/papers/vibe-coding-product-2025/paper_type.json b/papers/vibe-coding-product-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Collects original qualitative empirical data through 22 interviews and reports findings about workflows, challenges, and tensions in AI-assisted design, with the primary contribution being insights from that empirical study rather than benchmarks, surveys, or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/vibe-learning-education-2025/paper_type.json b/papers/vibe-learning-education-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Argues that LLMs have fundamental limitations and proposes shifting education from knowledge-transmission to constructivist paradigms without experimental validation or benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/videot1-testtime-scaling-2025/paper_type.json b/papers/videot1-testtime-scaling-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments across six models, reports quantitative results on quality improvements and computational efficiency, making the primary contribution experimental findings about test-time scaling for video generation."
+}
+\ No newline at end of file
diff --git a/papers/vieva-llm-conceptual-2024/paper_type.json b/papers/vieva-llm-conceptual-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Proposes EvaLLM, a 5-layer evaluation framework for LLM-generated visualizations, with case studies demonstrating its application rather than being the primary empirical contribution."
+}
+\ No newline at end of file
diff --git a/papers/virtual-lab-ai-2024/paper_type.json b/papers/virtual-lab-ai-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments designing nanobodies with GPT-4o and validates them experimentally, reporting quantitative success rates (93% expression/solubility, 2.2% binding), with primary contribution being experimental findings despite lacking baseline comparisons."
+}
+\ No newline at end of file
diff --git a/papers/virus-infection-attack-2025/paper_type.json b/papers/virus-infection-attack-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a data poisoning attack framework (VIA) and validates it experimentally with quantitative infection rates (50-85%), attack success rates, and detection metrics."
+}
+\ No newline at end of file
diff --git a/papers/viscosity-logic-phase-2026/paper_type.json b/papers/viscosity-logic-phase-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs systematic experiments sweeping DPO parameter β across model families and reports quantitative findings (specific capability transitions, correlation coefficients, hysteresis effects) on existing benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/vision-wormhole-latentspace-2026/paper_type.json b/papers/vision-wormhole-latentspace-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper proposes a framework (Vision Wormhole) and validates it through systematic experiments across multiple model sizes, reporting quantitative performance metrics (accuracy improvements/degradations and speedup comparisons) against baseline approaches."
+}
+\ No newline at end of file
diff --git a/papers/visualwebarena-evaluating-multimodal-2024/paper_type.json b/papers/visualwebarena-evaluating-multimodal-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "VisualWebArena is the primary contribution—a new benchmark of 910 visually grounded web tasks with self-hosted environments—and the experimental results on various agents serve to characterize and validate the benchmark rather than as the primary finding."
+}
+\ No newline at end of file
diff --git a/papers/vlrouterbench-benchmark-visionlanguage-2025/paper_type.json b/papers/vlrouterbench-benchmark-visionlanguage-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "VL-RouterBench is introduced as the first systematic benchmark for vision-language model routing, covering 14 datasets, 17 models, and 519,180 sample-model pairs—the primary contribution is the benchmark framework itself, not the router evaluation results."
+}
+\ No newline at end of file
diff --git a/papers/vortexpia-indirect-prompt-2025/paper_type.json b/papers/vortexpia-indirect-prompt-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes and experimentally evaluates a prompt injection attack method (VORTEXPIA) across 6 LLMs and 4 datasets, reporting quantitative results including ASR rates, token efficiency, and comparative findings about model vulnerabilities."
+}
+\ No newline at end of file
diff --git a/papers/vsavisualstructural-alignment-uitocode-2025/paper_type.json b/papers/vsavisualstructural-alignment-uitocode-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "Proposes a conceptual framework for UI-to-code generation with no real experimental validation—only draft placeholder numbers."
+}
+\ No newline at end of file
diff --git a/papers/vulscriber-exploring-ragbased-2024/paper_type.json b/papers/vulscriber-exploring-ragbased-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes three LLM-based augmentation strategies and validates them experimentally across 4 datasets, 3 models, and 3 LLMs with quantitative F1-score comparisons against baselines."
+}
+\ No newline at end of file
diff --git a/papers/walle-world-alignment-2024/paper_type.json b/papers/walle-world-alignment-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes a neurosymbolic world model and validates it through quantitative experiments on Minecraft TechTree and ALFWorld benchmarks, reporting success rates and efficiency improvements compared to baselines."
+}
+\ No newline at end of file
diff --git a/papers/wasp-benchmarking-web-2025/paper_type.json b/papers/wasp-benchmarking-web-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "WASP's primary contribution is introducing a new benchmark for evaluating web agent security against prompt injection attacks; empirical results on models serve to validate the benchmark rather than as the main finding."
+}
+\ No newline at end of file
diff --git a/papers/watch-weights-unsupervised-2025/paper_type.json b/papers/watch-weights-unsupervised-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes WeightWatch method and validates it with quantitative experiments on backdoored, unlearned, and commercial models, reporting specific detection rates and false positive rates."
+}
+\ No newline at end of file
diff --git a/papers/webapp1k-practical-codegeneration-2024/paper_type.json b/papers/webapp1k-practical-codegeneration-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is WebApp1K, a new 1000-problem benchmark for code generation; the empirical evaluations of 11 models serve to validate and demonstrate the benchmark's utility."
+}
+\ No newline at end of file
diff --git a/papers/webarena-autonomous-agents-2023/paper_type.json b/papers/webarena-autonomous-agents-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces WebArena, a new realistic web environment with 812 long-horizon tasks across multiple domains, with baseline evaluations to establish benchmark difficulty and properties."
+}
+\ No newline at end of file
diff --git a/papers/webbench-llm-code-2025/paper_type.json b/papers/webbench-llm-code-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is Web-Bench, a new evaluation framework with 50 web projects and 20 sequential tasks per project; baseline experiments on existing models are secondary to the benchmark introduction itself."
+}
+\ No newline at end of file
diff --git a/papers/webguard-building-generalizable-2025/paper_type.json b/papers/webguard-building-generalizable-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces WebGuard, the first large-scale action-level dataset for web agent guardrail evaluation (4,939 human-annotated actions), with empirical results validating the benchmark."
+}
+\ No newline at end of file
diff --git a/papers/webinject-prompt-injection-2025/paper_type.json b/papers/webinject-prompt-injection-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes an optimization-based attack method and experimentally validates it with quantitative success rates (96-97% ASR) across multiple MLLMs, comparing against baseline approaches."
+}
+\ No newline at end of file
diff --git a/papers/webmmu-benchmark-multimodal-2025/paper_type.json b/papers/webmmu-benchmark-multimodal-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Introduces WebMMU, a new multimodal multilingual benchmark with three web-based tasks across four languages; experiments on 18+ MLLMs are conducted to validate and characterize the benchmark, not as the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/webuibench-comprehensive-benchmark-2025/paper_type.json b/papers/webuibench-comprehensive-benchmark-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The paper's primary contribution is introducing WebUIBench, a new evaluation framework with 21K QA pairs across 9 subtasks; while it evaluates 29 MLLMs as baselines, the benchmark itself is the core contribution."
+}
+\ No newline at end of file
diff --git a/papers/what-can-youth-2024/paper_type.json b/papers/what-can-youth-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "survey",
+  "reason": "This is a meta-analysis and content analysis of existing Hour of Code activities, synthesizing findings about how they address AI/ML concepts rather than conducting novel experiments or creating new benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/what-cut-predicting-2026/paper_type.json b/papers/what-cut-predicting-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Reports quantitative experimental findings (9.9% method deletion rate, 87.1% AUC classifier performance) from agentic code analysis with feature importance analysis and model comparison."
+}
+\ No newline at end of file
diff --git a/papers/what-do-llm-2026/paper_type.json b/papers/what-do-llm-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "Proposes Task2Quiz (T2Q), a new two-stage evaluation paradigm and framework for studying environment understanding in LLM agents, with 30 procedurally-generated TextWorld environments as the benchmark infrastructure."
+}
+\ No newline at end of file
diff --git a/papers/what-does-it-2025/paper_type.json b/papers/what-does-it-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "theoretical",
+  "reason": "Proposes precise mathematical criteria and formalizes concepts for defining when a neural network has learned a world model, providing formal definitions rather than experimental validation."
+}
+\ No newline at end of file
diff --git a/papers/what-retrieve-effective-2025/paper_type.json b/papers/what-retrieve-effective-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments on existing benchmarks (CoderEval, RepoExec) to evaluate retrieval strategies for code generation, proposes and tests AllianceCoder, and reports quantitative improvements (20% Pass@1 gain) as primary contributions."
+}
+\ No newline at end of file
diff --git a/papers/what-wrong-your-2024/paper_type.json b/papers/what-wrong-your-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Extensive experimental study analyzing LLM-generated code across benchmarks, developing a bug taxonomy through empirical analysis, and testing a repair method with quantitative results."
+}
+\ No newline at end of file
diff --git a/papers/when-agents-fail-2026/paper_type.json b/papers/when-agents-fail-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Analyzes 1,187 real-world bug reports to characterize bug types and components, proposes BugReAct agent, and reports quantitative evaluation results (65.75% F1-score) against baseline."
+}
+\ No newline at end of file
diff --git a/papers/when-ai-agents-2025/paper_type.json b/papers/when-ai-agents-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces MultiAgentFraudBench but the primary contribution is quantitative experimental findings about fraud vulnerability across 17 LLMs, collusion effects, and mitigation effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/when-bots-take-2026/paper_type.json b/papers/when-bots-take-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Experimentally evaluates social engineering vulnerabilities in web automation agents across five frameworks with quantitative metrics (67.5% attack success rate, 78.1% defense reduction) and validates a proposed mitigation strategy."
+}
+\ No newline at end of file
diff --git a/papers/when-finetuning-llms-2024/paper_type.json b/papers/when-finetuning-llms-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper runs experiments comparing federated and centralized fine-tuning for LLM-based program repair, reporting quantitative performance improvements and analyzing heterogeneity effects on real benchmarks."
+}
+\ No newline at end of file
diff --git a/papers/when-large-language-2024/paper_type.json b/papers/when-large-language-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments on RepoBugs benchmark, reports quantitative results for LLM program repair performance, and validates an empirical method (RLCE) through ablation analysis."
+}
+\ No newline at end of file
diff --git a/papers/when-nobody-around-2026/paper_type.json b/papers/when-nobody-around-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Conducts a two-stage empirical study (content analysis of 883 comments + diary study with 20 participants) to report findings about user experiences and public discourse on an AI social platform."
+}
+\ No newline at end of file
diff --git a/papers/when-reject-turns-2025/paper_type.json b/papers/when-reject-turns-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper runs experiments to test adversarial prompt injection attacks against LLM-based reviewers, reporting quantitative success rates (up to 86.26%) and vulnerability patterns across different model types and obfuscation strategies."
+}
+\ No newline at end of file
diff --git a/papers/when-singleagent-skills-2026/paper_type.json b/papers/when-singleagent-skills-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments on standard benchmarks (GSM8K, HumanEval, HotpotQA) and reports quantitative results on token usage, latency, and accuracy metrics with empirical analysis of skill selection phase transitions."
+}
+\ No newline at end of file
diff --git a/papers/where-did-it-2025/paper_type.json b/papers/where-did-it-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces a new representation gradient-based method (RepT) and validates it experimentally on three LLMs, reporting quantitative results (precision, auPRC) across multiple attribution tasks."
+}
+\ No newline at end of file
diff --git a/papers/where-do-ai-2026/paper_type.json b/papers/where-do-ai-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Conducts observational analysis of 33,596 real-world agent-authored GitHub PRs with quantitative findings (merge rates, statistical tests) and qualitative analysis of rejection patterns; primary contribution is empirical findings about agent failure modes."
+}
+\ No newline at end of file
diff --git a/papers/where-llms-struggle-code-2025/paper_type.json b/papers/where-llms-struggle-code-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs experiments across multiple code generation benchmarks with six LLMs, analyzes failure patterns quantitatively, and reports empirical findings about task difficulty and failure rates."
+}
+\ No newline at end of file
diff --git a/papers/which-agent-causes-2025/paper_type.json b/papers/which-agent-causes-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is the Who&When dataset with 184 annotated failure tasks and formalization of the automated failure attribution task; the evaluation of three attribution methods serves to validate the benchmark's difficulty and utility."
+}
+\ No newline at end of file
diff --git a/papers/why-ai-alignment-2026/paper_type.json b/papers/why-ai-alignment-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "position",
+  "reason": "The paper argues a viewpoint about alignment failure being structural and proposes a conceptual reframing using Fiske's relational models theory, without experimental validation or formal proofs."
+}
+\ No newline at end of file
diff --git a/papers/why-behind-action-2026/paper_type.json b/papers/why-behind-action-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper proposes a method for agentic attribution and validates it through experiments on custom scenarios, reporting quantitative performance metrics (Hit@1, Hit@3) compared to baselines—the primary contribution is the experimental findings demonstrating the method's effectiveness."
+}
+\ No newline at end of file
diff --git a/papers/why-do-language-2025/paper_type.json b/papers/why-do-language-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs controlled experiments testing LLM whistleblowing behavior across models with varying task complexity and prompt conditions, reporting quantitative empirical findings about which model families whistleblow and what factors predict the behavior."
+}
+\ No newline at end of file
diff --git a/papers/why-reasoning-fails-2026/paper_type.json b/papers/why-reasoning-fails-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "theoretical",
+  "reason": "Primary contribution is mathematical proofs demonstrating that step-wise greedy policies and beam search are arbitrarily suboptimal for long-horizon planning, with empirical validation through the proposed FLARE framework."
+}
+\ No newline at end of file
diff --git a/papers/wink-recovering-from-2026/paper_type.json b/papers/wink-recovering-from-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces the Wink system and reports quantitative experimental findings from production deployment at Meta, including a live A/B test with statistically significant results on recovery rates and system metrics."
+}
+\ No newline at end of file
diff --git a/papers/xgenq-explainable-domainadaptive-2025/paper_type.json b/papers/xgenq-explainable-domainadaptive-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Develops XGen-Q framework and reports quantitative experimental results (perplexity metrics) comparing domain-adapted model against baselines, making the primary contribution experimental findings."
+}
+\ No newline at end of file
diff --git a/papers/you-only-need-2025/paper_type.json b/papers/you-only-need-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper proposes MFEE, an inference-gating architecture, and reports quantitative experimental results (78.1% execution reduction on a 1,000-prompt benchmark with GPT-2), making the primary contribution experimental validation of a system's performance rather than a benchmark, survey, position, or theoretical analysis."
+}
+\ No newline at end of file
diff --git a/papers/your-benchmark-still-2025/paper_type.json b/papers/your-benchmark-still-2025/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "The paper runs systematic experiments on 10 LLMs using semantic-preserving mutations to empirically demonstrate that benchmark rankings are unstable and inflated, with performance drops up to 40-50%—the primary contribution is these experimental findings about benchmark vulnerability, not the dynamic benchmarking framework itself."
+}
+\ No newline at end of file
diff --git a/papers/your-code-generated-2023/paper_type.json b/papers/your-code-generated-2023/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "benchmark-creation",
+  "reason": "The primary contribution is EvalPlus/HUMANEVAL+, a new and more rigorous code evaluation benchmark with 80x more test cases; the empirical results on 26 LLMs validate and demonstrate why this improved benchmark is needed."
+}
+\ No newline at end of file
diff --git a/papers/yunque-deepresearch-technical-2026/paper_type.json b/papers/yunque-deepresearch-technical-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Introduces a hierarchical multi-agent framework and validates it through quantitative experiments (ablation studies, benchmark evaluations on BrowseComp, HLE, GAIA) where experimental findings are the primary contribution."
+}
+\ No newline at end of file
diff --git a/papers/zeroshot-embedding-drift-2026/paper_type.json b/papers/zeroshot-embedding-drift-2026/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes ZEDD method and experimentally validates it on LLMail-Inject dataset, reporting quantitative accuracy metrics (90.75–95.55%) across multiple encoder architectures."
+}
+\ No newline at end of file
diff --git a/papers/zeroshot-llmguided-counterfactual-2024/paper_type.json b/papers/zeroshot-llmguided-counterfactual-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Proposes FIZLE methodology and experimentally evaluates LLM-guided counterfactual generation across three NLP benchmark datasets, reporting quantitative metrics on label flip rates and model accuracy drops."
+}
+\ No newline at end of file
diff --git a/papers/zeroshot-prompting-approaches-2024/paper_type.json b/papers/zeroshot-prompting-approaches-2024/paper_type.json
@@ -0,0 +1,4 @@
+{
+  "paper_type": "empirical",
+  "reason": "Runs controlled experiments comparing multiple prompting approaches (SCGG, Prompt Decomposition, zero-shot baselines) with quantitative results from crowd-worker evaluations and precision metrics, making experimental findings the primary contribution."
+}
+\ No newline at end of file
