ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git
Log | Files | Refs

commit eb6c3464af535659c94124df71ec8abd0e9a2ab3
parent 375564a74735195015b853fe4ec2af98ff6e4fa0
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Sun, 12 Apr 2026 18:10:46 +0200

Progress bar: 4-segment v5 pipeline, 106 pure Haiku scans

Replace old v3/v2/v1/queued/no-text progress bar with:
V5 Opus | V5 Haiku/Sonnet | Deprecated | Not scanned

Build pipeline counts scan-v5.json (checking source field for
opus vs haiku) and falls back to scan.json as deprecated.
No cascade loading yet — metrics still read from old scan.json.

Also: v5 script cosmetic fixes (v4→v5 references) and stderr
capture on claude failures for better error diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Diffstat:
Mexplorer/src/data.ts | 13++++---------
Mexplorer/src/style.css | 18++++++++----------
Mexplorer/src/views/dashboard.ts | 35+++++++++++++++++------------------
Mexplorer/tests/explorer.spec.ts | 2+-
Apapers/2025-ai-agent-2026/scan-v5.json | 339+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/3dshape2vecset-3d-shape-2023/scan-v5.json | 505+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/a2hcoder-llmdriven-coding-2025/scan-v5.json | 399+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/aart-aiassisted-redteaming-2023/scan-v5.json | 381+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/acar-adaptive-complexity-2026/scan-v5.json | 480+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/agentic-bug-reproduction-2025/scan-v5.json | 522+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/agentic-refactoring-empirical-2025/scan-v5.json | 575+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/agents-of-chaos-2026/scan-v5.json | 561+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/ai-ides-vs-agents-impact-2026/scan-v5.json | 522+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/chain-of-thought-prompting-2022/scan-v5.json | 543+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/codex-humaneval-2021/scan-v5.json | 570+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/coding-agents-generating-2026/scan-v5.json | 500+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/copilot-productivity-controlled-2023/scan-v5.json | 556+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/copilot-zoominfo-productivity-2025/scan-v5.json | 499+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/cursor-speed-quality-tradeoff-2025/scan-v5.json | 582++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/data-contamination-benchmarks-2023/scan-v5.json | 596+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/data-distributional-properties-2022/scan-v5.json | 582++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/database-perspective-llm-2025/scan-v5.json | 340+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/datadreamer-tool-synthetic-2024/scan-v5.json | 354+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/datasentinel-gametheoretic-detection-2025/scan-v5.json | 583+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/datasetresearch-benchmarking-agent-2025/scan-v5.json | 402+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dear-diary-rct-copilot-2024/scan-v5.json | 535+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dear-novel-deep-2022/scan-v5.json | 567+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/declarative-agentic-layer-2026/scan-v5.json | 374+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/decoding-latent-attack-2025/scan-v5.json | 577+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/decoding-ml-decision-2026/scan-v5.json | 349+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/decomposed-prompting-modular-2022/scan-v5.json | 540+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deep-dive-into-2024-2/scan-v5.json | 543+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deep-dive-into-2024/scan-v5.json | 541+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deep-dive-into-2025/scan-v5.json | 571+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deepcircuitx-comprehensive-repositorylevel-2025/scan-v5.json | 343+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deepcrceval-revisiting-evaluation-2024/scan-v5.json | 356+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deepreview-improving-llmbased-2025/scan-v5.json | 549+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deepseek-coder-2024/scan-v5.json | 510+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deepseek-coder-v2-2024/scan-v5.json | 574+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deepseek-r1-2025/scan-v5.json | 519+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defects4c-benchmarking-large-2025/scan-v5.json | 354+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defending-against-indirect-2024/scan-v5.json | 593+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defending-against-prompt-2025-2/scan-v5.json | 533+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defending-against-prompt-2025/scan-v5.json | 578++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defending-aipowered-commerce-2025/scan-v5.json | 318+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defense-against-indirect-2026/scan-v5.json | 503+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defense-against-prompt-2024/scan-v5.json | 548+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defense-against-prompt-2025/scan-v5.json | 578++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defense-massive-false-2022/scan-v5.json | 506+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/defensive-prompt-patch-2024/scan-v5.json | 530+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dehallucinator-mitigating-llm-2024/scan-v5.json | 500+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/demonstratesearchpredict-composing-retrieval-2022/scan-v5.json | 572+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deployabilitycentric-infrastructureascode-generation-2025/scan-v5.json | 498+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deputydev-ai-powered-2025/scan-v5.json | 629+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/derag-blackbox-adversarial-2025/scan-v5.json | 514+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/design-evaluation-assisted-2026/scan-v5.json | 500+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/design-implementation-secure-2025/scan-v5.json | 520+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/designbench-comprehensive-benchmark-2025/scan-v5.json | 344+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/designing-llmbased-multiagent-2025/scan-v5.json | 353+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detecting-adversarial-finetuning-2025/scan-v5.json | 563+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detecting-benchmark-contamination-2025/scan-v5.json | 507+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detecting-correcting-hallucinations-code-2026/scan-v5.json | 583+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detecting-proxy-gaming-2025/scan-v5.json | 520+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detecting-silent-failures-2025/scan-v5.json | 443+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detecting-sleeper-agents-2025/scan-v5.json | 585+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detection-method-prompt-2025/scan-v5.json | 587+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/detectlocalizerepair-unified-framework-2022/scan-v5.json | 544+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/devbench-realistic-developerinformed-2026/scan-v5.json | 346+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/developer-productivity-genai-2025/scan-v5.json | 520+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/deveval-manuallyannotated-code-2024/scan-v5.json | 414+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/devil-details-emergent-2025/scan-v5.json | 494+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/diagnostic-codes-ai-2025/scan-v5.json | 514+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dialogue-injection-attack-2025/scan-v5.json | 572+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/disaggregation-reveals-hidden-2025/scan-v5.json | 524+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/disagreements-reasoning-how-2025/scan-v5.json | 587+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/disentangling-causal-importance-2026/scan-v5.json | 506+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dissecting-swe-bench-leaderboard-2025/scan-v5.json | 415+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dive-into-agent-2025/scan-v5.json | 518+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dlap-deep-learning-2024/scan-v5.json | 570+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/do-as-i-2025/scan-v5.json | 500+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/do-prompts-reshape-2025/scan-v5.json | 528+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/do-we-truly-2025/scan-v5.json | 573+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/does-ai-code-2025/scan-v5.json | 554+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/does-it-tie-2025/scan-v5.json | 571+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/does-prompt-formatting-2024/scan-v5.json | 544+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/does-reasoning-introduce-2025/scan-v5.json | 508+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/domaineval-autoconstructed-benchmark-2024/scan-v5.json | 415+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/domainspecific-constitutional-ai-2025/scan-v5.json | 532+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dont-always-pick-2026/scan-v5.json | 315+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dover-interventiondriven-auto-2025/scan-v5.json | 577+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dpo-superior-ppo-2024/scan-v5.json | 586+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/drccoder-automated-drc-2024/scan-v5.json | 552+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/drex-benchmark-detecting-2025/scan-v5.json | 367+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/drift-dynamic-rulebased-2025/scan-v5.json | 568+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/drip-defending-prompt-2025/scan-v5.json | 501+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/driving-style-alignment-2024/scan-v5.json | 607+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dscodebench-realistic-benchmark-2025/scan-v5.json | 383+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dspy-compiling-declarative-2023/scan-v5.json | 603+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dual-latent-memory-2026/scan-v5.json | 507+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dynacode-dynamic-complexityaware-2025/scan-v5.json | 401+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dynafix-iterative-automated-2025/scan-v5.json | 508+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dynamic-benchmarking-reasoning-2025/scan-v5.json | 378+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dynamic-memory-management-2025/scan-v5.json | 547+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/dynamic-mix-precision-2026/scan-v5.json | 532+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/early-approaches-adversarial-2025/scan-v5.json | 578++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/early-categorization-prompt-2024/scan-v5.json | 322+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/ecogym-evaluating-llms-2026/scan-v5.json | 338+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/economics-ai-inference-2025/scan-v5.json | 531+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/edge-memorization-diffusion-2025/scan-v5.json | 371+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/editflow-benchmarking-optimizing-2026/scan-v5.json | 510+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Apapers/effective-lora-adapter-2026/scan-v5.json | 507+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Mscripts/build-explorer-data.py | 58+++++++++++++++++++++++++++++++++-------------------------
Mscripts/run-scan-v5-haiku.py | 16++++++++++------
113 files changed, 53509 insertions(+), 69 deletions(-)

diff --git a/explorer/src/data.ts b/explorer/src/data.ts @@ -51,15 +51,10 @@ export interface HistBin { export interface Pipeline { registry_total: number; - v2_scanned: number; - empirical: number; - non_empirical: number; - v3_scanned: number; - v2_only: number; - v1_needs_rescan: number; - has_text_no_scan: number; - no_text: number; - excluded: number; + v5_opus: number; + v5_haiku: number; + deprecated_scan: number; + not_scanned: number; } export interface Dashboard { diff --git a/explorer/src/style.css b/explorer/src/style.css @@ -483,11 +483,10 @@ td.score { height: 100%; transition: width 0.3s; } -.pipeline-seg.v3 { background: var(--accent); } -.pipeline-seg.scanned { background: var(--green); } -.pipeline-seg.v1 { background: var(--accent); } -.pipeline-seg.queued { background: var(--yellow); } -.pipeline-seg.notext { background: var(--gray); } +.pipeline-seg.v5opus { background: var(--accent); } +.pipeline-seg.v5haiku { background: var(--green); } +.pipeline-seg.deprecated { background: var(--yellow); } +.pipeline-seg.notscan { background: var(--gray); } .pipeline-legend { display: flex; gap: 1.25rem; @@ -503,11 +502,10 @@ td.score { margin-right: 4px; vertical-align: middle; } -.pipeline-dot.v3 { background: var(--accent); } -.pipeline-dot.scanned { background: var(--green); } -.pipeline-dot.v1 { background: var(--accent); } -.pipeline-dot.queued { background: var(--yellow); } -.pipeline-dot.notext { background: var(--gray); } +.pipeline-dot.v5opus { background: var(--accent); } +.pipeline-dot.v5haiku { background: var(--green); } +.pipeline-dot.deprecated { background: var(--yellow); } +.pipeline-dot.notscan { background: var(--gray); } /* Year trend chart */ .trend-chart { margin-top: 0.5rem; } diff --git a/explorer/src/views/dashboard.ts b/explorer/src/views/dashboard.ts @@ -17,31 +17,29 @@ const GAME_DESCRIPTIONS: Record<string, string> = { function renderProgressBar(p: Pipeline): string { const total = p.registry_total; - const v3Pct = (p.v3_scanned / total * 100).toFixed(1); - const v2Pct = (p.v2_only / total * 100).toFixed(1); - const v1Pct = (p.v1_needs_rescan / total * 100).toFixed(1); - const textPct = (p.has_text_no_scan / total * 100).toFixed(1); - const noPct = (p.no_text / total * 100).toFixed(1); - const totalScannedPct = (p.v2_scanned / total * 100).toFixed(1); + const scanned = p.v5_opus + p.v5_haiku + p.deprecated_scan; + const opusPct = (p.v5_opus / total * 100).toFixed(1); + const haikuPct = (p.v5_haiku / total * 100).toFixed(1); + const depPct = (p.deprecated_scan / total * 100).toFixed(1); + const nonePct = (p.not_scanned / total * 100).toFixed(1); + const scannedPct = (scanned / total * 100).toFixed(1); return `<div class="pipeline-bar"> <div class="pipeline-header"> <span class="pipeline-title">Survey Progress</span> - <span class="pipeline-stat">${p.v2_scanned} of ${total} scanned (${totalScannedPct}%) — ${p.empirical} empirical, ${p.non_empirical} non-empirical</span> + <span class="pipeline-stat">${scanned} of ${total} scanned (${scannedPct}%)</span> </div> <div class="pipeline-track"> - <div class="pipeline-seg v3" style="width:${v3Pct}%" title="V3 scanned (with engagement factors): ${p.v3_scanned}"></div> - <div class="pipeline-seg scanned" style="width:${v2Pct}%" title="V2 scanned: ${p.v2_only}"></div> - <div class="pipeline-seg v1" style="width:${v1Pct}%" title="V1 needs rescan: ${p.v1_needs_rescan}"></div> - <div class="pipeline-seg queued" style="width:${textPct}%" title="Text ready, awaiting scan: ${p.has_text_no_scan}"></div> - <div class="pipeline-seg notext" style="width:${noPct}%" title="No text available: ${p.no_text}"></div> + <div class="pipeline-seg v5opus" style="width:${opusPct}%" title="V5 Opus: ${p.v5_opus}"></div> + <div class="pipeline-seg v5haiku" style="width:${haikuPct}%" title="V5 Haiku/Sonnet: ${p.v5_haiku}"></div> + <div class="pipeline-seg deprecated" style="width:${depPct}%" title="Deprecated scan: ${p.deprecated_scan}"></div> + <div class="pipeline-seg notscan" style="width:${nonePct}%" title="Not scanned: ${p.not_scanned}"></div> </div> <div class="pipeline-legend"> - <span><span class="pipeline-dot v3"></span>V3 (${p.v3_scanned})</span> - <span><span class="pipeline-dot scanned"></span>V2 (${p.v2_only})</span> - <span><span class="pipeline-dot v1"></span>V1 rescan (${p.v1_needs_rescan})</span> - <span><span class="pipeline-dot queued"></span>Queued (${p.has_text_no_scan})</span> - <span><span class="pipeline-dot notext"></span>No PDF (${p.no_text})</span> + <span><span class="pipeline-dot v5opus"></span>V5 Opus (${p.v5_opus})</span> + <span><span class="pipeline-dot v5haiku"></span>V5 Haiku/Sonnet (${p.v5_haiku})</span> + <span><span class="pipeline-dot deprecated"></span>Deprecated (${p.deprecated_scan})</span> + <span><span class="pipeline-dot notscan"></span>Not scanned (${p.not_scanned})</span> </div> </div>`; } @@ -53,7 +51,8 @@ export async function renderDashboard(app: HTMLElement) { const topGame = Object.entries(agg.game_pcts).sort((a, b) => b[1] - a[1])[0]; const p = agg.pipeline; - const scanPct = Math.round(p.v2_scanned / p.registry_total * 100); + const scanned = p.v5_opus + p.v5_haiku + p.deprecated_scan; + const scanPct = Math.round(scanned / p.registry_total * 100); app.innerHTML = ` ${renderProgressBar(p)} diff --git a/explorer/tests/explorer.spec.ts b/explorer/tests/explorer.spec.ts @@ -41,7 +41,7 @@ test.describe('Dashboard', () => { await page.goto('/'); await expect(page.locator('.pipeline-bar')).toBeVisible({ timeout: 10000 }); await expect(page.locator('.pipeline-stat')).toContainText('of'); - await expect(page.locator('.pipeline-seg.scanned')).toBeVisible(); + await expect(page.locator('.pipeline-seg.deprecated')).toBeVisible(); }); test('shows named games', async ({ page }) => { diff --git a/papers/2025-ai-agent-2026/scan-v5.json b/papers/2025-ai-agent-2026/scan-v5.json @@ -0,0 +1,338 @@ +{ + "scan_version": 5, + "paper_type": "survey", + "paper": { + "title": "The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems", + "authors": [ + "Leon Staufer", + "Kevin Feng", + "Kevin Wei", + "Luke Bailey", + "Yawen Duan", + "Mick Yang", + "A. Pinar Ozisik", + "Stephen Casper", + "Noam Kolt" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.17753", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are supported: the 30-agent index is delivered, transparency gaps are shown through Figure 3 (198/1350 fields 'None found'), and safety documentation absence is documented with 133/240 safety fields having no public information.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "The paper is primarily descriptive documentation; it makes no causal claims about what causes safety gaps or transparency differences.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper explicitly scopes claims to 30 publicly available general-purpose agents as of December 31, 2025, and Section 6.2 acknowledges this may not generalize to internal deployments or domain-specific agents.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly notes that Chinese companies lacking documented safety frameworks 'may simply not be documented publicly,' and consistently distinguishes 'None found' from 'None' to acknowledge that absence of public evidence is not evidence of absence.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper consistently distinguishes between publicly available documentation (what is measured) and actual safety practices (what is claimed to be opaque), using the 'None found' vs 'None' distinction throughout.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6.2 'Limitations and Outlook' is a dedicated section covering methodology, scope, generalizability, and language coverage.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats named include: inclusion criteria favoring significant agents (reducing generalizability), public-interest metrics favoring consumer over enterprise products, reliance solely on English and Chinese documentation, and exclusive use of publicly available information missing internal practices.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly excludes domain-specific agents, company-internal products, limited pre-releases, and agents requiring software engineering expertise to deploy; scope is bounded to 30 publicly available general-purpose agents as of December 31, 2025.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "The Acknowledgments section states: 'This research was supported by the MATS Research program, which provided funding for L.S. and M.Y. through research stipends.'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are disclosed on the first page: Cambridge, UW, Harvard Law, Stanford, Concordia AI, UPenn, MIT (×2), and Hebrew University.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "MATS Research Program is an academic research fellowship program unrelated to the commercial AI agents being evaluated; no conflict between funder identity and the transparency findings reported.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no general competing interests or financial interests declaration for all authors; the only COI statement ('no conflicts of interest related to Anthropic or Claude Code') appears only in the Claude Code case study annotation, not as a paper-wide declaration.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 2 discusses the definition of 'agent' extensively drawing on prior literature; Section 3.1 operationalizes agency via four criteria (autonomy, goal complexity, environmental interaction, generality); autonomy levels L1-L5 are defined via Feng et al.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions are explicitly enumerated in Section 1: (1) the Agent Index of 30 systems across 45 fields, (2) ecosystem-wide trends, and (3) three case studies of distinct agent categories.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages substantively with the 2024 AI Agent Index (predecessor), the Princeton Holistic Agentic Leaderboard, Foundation Model Transparency Index, and a range of documentation frameworks (datasheets, model cards, system cards, factsheets), situating this work relative to each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "survey": { + "search_and_selection": { + "search_strategy_reproducible": { + "applies": true, + "answer": false, + "justification": "Initial candidate discovery used LLM-based queries (ChatGPT 5.2, Claude Sonnet 4.5, Gemini 2.5 with research mode); Section B.5 provides the prompts but LLM outputs are non-deterministic and a re-run would not produce the same 95-candidate list.", + "source": "haiku" + }, + "inclusion_exclusion_explicit": { + "applies": true, + "answer": true, + "justification": "Figure 2 and Section 3.1 specify explicit, operationalized inclusion criteria with quantitative thresholds: ≥10,000 searches or ≥20,000 GitHub stars, ≥$1B valuation, plus all-required agency and practicality sub-criteria.", + "source": "haiku" + }, + "prisma_or_structured_protocol": { + "applies": true, + "answer": false, + "justification": "No mention of PRISMA or any other established systematic review protocol; the paper follows a custom inclusion framework.", + "source": "haiku" + }, + "search_terms_provided": { + "applies": true, + "answer": false, + "justification": "Section C.1 provides LLM prompts used to generate search terms but does not enumerate the final list of actual search terms used per agent; terms were LLM-generated and not explicitly listed.", + "source": "haiku" + }, + "databases_listed": { + "applies": true, + "answer": true, + "justification": "Sources explicitly named: Ahrefs API (search volume), Google Scholar (paper counts), Yahoo Finance/Crunchbase/Epoch AI (market cap), GitHub (stars), and cross-references with 2024 AI Agent Index, Princeton Holistic Agent Leaderboard, and AIAgentList.com.", + "source": "haiku" + }, + "screening_process_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.3 documents the pipeline: LLM queries surfaced 95 candidates → screened against inclusion criteria → ambiguous cases annotated in depth → final inclusion decisions → 30 agents included; key stage counts are provided.", + "source": "haiku" + }, + "review_scope_justified": { + "applies": true, + "answer": true, + "justification": "The paper explains the rationale for focusing on 'highly agentic systems with high-impact real-world applications' publicly available as of December 31, 2025, and explicitly justifies why domain-specific and internal agents are excluded.", + "source": "haiku" + } + }, + "synthesis_quality": { + "conflicting_findings_acknowledged": { + "applies": true, + "answer": true, + "justification": "The paper systematically acknowledges systematic differences across categories (frontier labs vs. enterprise platforms on safety evaluation, Chinese vs. US governance documentation patterns) and reports 37 inter-annotator discrepancies resolved through discussion.", + "source": "haiku" + }, + "quality_assessment_of_sources": { + "applies": true, + "answer": false, + "justification": "There is no formal quality rubric or risk-of-bias assessment for the agents or their source documentation; annotations distinguish 'None found' vs 'None' but do not score the reliability or rigor of what is documented.", + "source": "haiku" + }, + "publication_bias_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6.2 explicitly acknowledges that significance criteria 'favor well-funded companies and established products, potentially disadvantaging emerging developers and regional innovations,' and that exclusive reliance on public information may miss internal safety practices — the documentation-index equivalent of publication bias.", + "source": "haiku" + }, + "quantitative_synthesis_present": { + "applies": true, + "answer": true, + "justification": "The paper provides systematic counts and percentages throughout: 20/30 support MCP, 15/30 reference safety frameworks, 133/240 safety fields missing, 23/30 fully closed source, 3/30 with third-party testing — constituting quantitative vote-counting synthesis.", + "source": "haiku" + }, + "recommendations_supported_by_evidence": { + "applies": true, + "answer": true, + "justification": "Recommendations such as structured reporting requirements and evaluation targeting deployed tools rather than base models are directly grounded in documented gaps (133/240 safety fields missing, only 4 agent-specific system cards).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Most developers share little information about safety, evaluations, and societal impacts: 133/240 safety-related fields have no public information.", + "evidence": "Figure 6 and Section 4.6 document 133/240 safety fields across 30 agents as 'None found'; browser (64%) and enterprise (63%) agents have the highest missing rates.", + "supported": "strong" + }, + { + "claim": "Only 3/30 agents have documented third-party testing; 25/30 disclose no internal safety results.", + "evidence": "Section 4.6: 'Third-party testing is documented for only 3/30 agents (Anthropic Claude, OpenAI ChatGPT, OpenAI Codex). 25/30 agents disclose no internal safety results.'", + "supported": "strong" + }, + { + "claim": "Chinese-developed agents have substantially lower safety documentation than US agents: 1/5 with safety frameworks vs. ~75% of US agents.", + "evidence": "Figure 4b shows Chinese companies at 20% for AI Safety Framework and 20% for Compliance Standards, compared to 76% and 95% for US companies.", + "supported": "strong" + }, + { + "claim": "Model Context Protocol has become the dominant interoperability standard, supported by 20/30 agents including all 13 enterprise platforms.", + "evidence": "Section 4.5 and Figure 12: '20/30 agents explicitly support MCP'; all 13 enterprise agents support MCP versus 4/12 chat agents.", + "supported": "strong" + }, + { + "claim": "Most agents (21/30) do not disclose their AI nature to end users or third parties by default, and only 3/30 support media watermarking.", + "evidence": "Section 4.5: '21/30 agents have no documented default disclosure behavior. Only 3/30 agents support watermarking generated media (e.g., through SynthID and C2PA).'", + "supported": "strong" + }, + { + "claim": "Browser-based agents frequently bypass robots.txt and anti-bot measures, with some explicitly marketing this capability.", + "evidence": "Section 4.5 and 5.2: BrowserUse 'explicitly markets bypassing anti-bot systems'; Cloudflare documented Perplexity using undeclared Chrome-signature crawlers; only 6/30 agents explicitly state robots.txt compliance.", + "supported": "strong" + }, + { + "claim": "The ecosystem exhibits concentrated model dependency: almost all non-frontier-lab agents rely on GPT, Claude, or Gemini model families.", + "evidence": "Section 4.3 and 6.1: 'Only frontier labs themselves (Anthropic, Google, OpenAI) and Chinese developers run their own proprietary models; the majority rely primarily on GPT, Claude, or Gemini model families.'", + "supported": "strong" + } + ], + "methodology_tags": [ + "observational", + "case-study", + "qualitative" + ], + "key_findings": "The 2025 AI Agent Index documents 45 fields across 30 deployed AI agents and finds widespread transparency gaps: 133/240 safety-related fields have no public information, only 3/30 agents have third-party testing documentation, and only 4/30 have agent-specific system cards. Chinese developers show markedly lower safety documentation than US counterparts (1/5 vs. ~75% with safety frameworks). The ecosystem is structurally concentrated around GPT/Claude/Gemini models, creating shared dependency risks. Browser agents operating at L4–L5 autonomy present the highest risk profile with the least safety documentation, and most agents bypass web conduct standards like robots.txt with active justifications from developers.", + "red_flags": [ + { + "flag": "Non-reproducible search", + "detail": "Initial candidate discovery used LLM-based queries (ChatGPT 5.2, Claude Sonnet 4.5, Gemini 2.5) whose outputs are non-deterministic; a re-run with the same prompts would yield a different candidate list, undermining reproducibility of the 95-agent starting pool." + }, + { + "flag": "Low developer response rate", + "detail": "Only 23% of developers offered any response to annotation review requests, and only 4/30 provided substantive corrections; findings likely substantially undercount internal safety practices that exist but are not published." + }, + { + "flag": "Conflates non-disclosure with absence", + "detail": "While the paper distinguishes 'None found' from 'None,' the headline finding that '133/240 safety fields have no information' conflates genuine absence of safety measures with non-disclosure of existing ones, which the paper's own methodology acknowledges is ambiguous." + }, + { + "flag": "Small and biased sample", + "detail": "30 agents selected by significance criteria that favor consumer-facing products and US/China companies; excludes domain-specific agents, internal deployments, and less prominent developers, making ecosystem-wide conclusions speculative." + } + ], + "cited_papers": [ + { + "title": "The AI Agent Index (Casper et al., 2025)", + "relevance": "Predecessor 2024 index that this work directly updates and expands, including revised inclusion criteria and annotation fields" + }, + { + "title": "Visibility into AI Agents (Chan et al., 2024)", + "relevance": "Prior work on transparency and documentation of AI agent systems, foundational to the Index's motivation and framing" + }, + { + "title": "The 2024 Foundation Model Transparency Index (Bommasani et al., 2024)", + "relevance": "Comparable documentation effort for foundation models; used as an inclusion criterion for developer significance" + }, + { + "title": "Levels of Autonomy for AI Agents (Feng et al., 2025)", + "relevance": "Provides the L1–L5 autonomy framework used throughout the Index to characterize and compare agent autonomy levels" + }, + { + "title": "Harms from Increasingly Agentic Algorithmic Systems (Chan et al., 2023)", + "relevance": "Defines the agency criteria (autonomy, goal complexity, environmental interaction, generality) directly adopted for inclusion criteria" + }, + { + "title": "Holistic Agent Leaderboard (Kapoor et al., 2025)", + "relevance": "Concurrent work documenting agentic AI systems across capability benchmarks; used for cross-referencing agent candidates" + }, + { + "title": "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? (Ren et al., 2024)", + "relevance": "Provides theoretical grounding for the 'safety-washing' concept applied to the transparency asymmetry observed between capability benchmarks and safety documentation" + }, + { + "title": "Infrastructure for AI Agents (Chan et al., 2025)", + "relevance": "Discusses governance challenges for web-interacting agents, directly relevant to the robots.txt and web conduct findings" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly usable by policymakers, procurement teams, and researchers to compare agents on safety and transparency; the online Index at aiagentindex.mit.edu is a live, downloadable resource." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that 133/240 safety fields have no public information and only 3/30 agents have third-party testing challenges industry safety-responsibility narratives, though the general direction is expected." + }, + "fear_safety": { + "score": 3, + "justification": "Concretely documents L4–L5 autonomy browser agents with prompt injection vulnerabilities, agents designed to bypass anti-bot systems, and absence of safety oversight for most deployed systems." + }, + "drama_conflict": { + "score": 2, + "justification": "Documents real legal disputes (Amazon vs. Perplexity, NYT vs. OpenAI, Reddit vs. Anthropic) and specific named prompt injection incidents against Perplexity Comet and Opera Neon." + }, + "demo_ability": { + "score": 3, + "justification": "The full Index is publicly available at aiagentindex.mit.edu in JSON and CSV formats, and all 30 documented agents are themselves publicly accessible products users can try." + }, + "brand_recognition": { + "score": 3, + "justification": "Covers Anthropic Claude, OpenAI ChatGPT, Google Gemini, Microsoft Copilot Studio, Salesforce Agentforce, and 25 other flagship products from the most recognized AI companies." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47279778", + "title": "Nested Training for Mutual Adaptation in Human-AI Teaming", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47279778", + "created_at": "2026-03-06T19:21:19Z" + } + ], + "top_points": 2, + "total_points": 2, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/3dshape2vecset-3d-shape-2023/scan-v5.json b/papers/3dshape2vecset-3d-shape-2023/scan-v5.json @@ -0,0 +1,504 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models", + "authors": [ + "Biao Zhang", + "Jiapeng Tang", + "Matthias Nießner", + "Peter Wonka" + ], + "year": 2023, + "venue": "ACM Transactions on Graphics", + "arxiv_id": "2301.11445", + "doi": "10.1145/3592442" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (improved encoding quality, multiple generative applications) are backed by quantitative results in Tables 3–9 and qualitative figures throughout.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation studies on M (number of latents) and C0 (compression channels) in Tables 4–5 directly support causal design claims; cross-attention vs. KNN encoding is also compared.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Claims of 'state of the art in 3D shape encoding and generative modeling' are made broadly, but evaluation is entirely on ShapeNet-v2; no cross-dataset or cross-domain validation is performed.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether performance gains could stem from larger parameter counts, more training compute, or dataset-specific characteristics rather than the proposed architectural innovation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly notes that rendering-based FID/KID are imperfect for 3D quality and introduces 3D-based FPD/KPD metrics to compensate, clearly distinguishing what each metric measures.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 8.8 is a dedicated 'Limitations' subsection discussing the two-stage training requirement and training time costs.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The limitations focus on computational cost and retraining requirements, not on threats to validity such as dataset bias, metric limitations, or whether improvements hold across domains.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what the results do NOT show; no claims are bounded to ShapeNet-only conclusions or specific shape categories.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgements state support from SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence and ERC Starting Grant Scan2CAD (804724).", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (KAUST, TU Munich) are disclosed in the header and author addresses.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "SDAIA-KAUST AI and ERC are academic/government research funders with no direct commercial stake in the proposed representation.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests declaration is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Neural fields, latent sets, cross-attention, and the proposed VecSet representation are formally defined with equations in Sections 3–5.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Five numbered contributions are explicitly listed at the end of the introduction, covering representation, architecture, autoencoding, generation, and applications.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 provides a detailed taxonomy of prior methods (Table 1, Table 2) and the paper explicitly distinguishes its approach from 3DILG, ConvOccNet, and NeuralWavelet in both framing and evaluation.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The abstract directly links to 'Code: https://1zb.github.io/3DShape2VecSet/' indicating code is available at a project/repository page.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "ShapeNet-v2 is a publicly available benchmark dataset used without modification as the primary data source.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Training hardware (8 A100 GPUs) is mentioned but no requirements.txt, Dockerfile, or dependency specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Training hyperparameters are reported but no step-by-step reproduction guide is provided that would allow following without guessing or significant inference.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 3–9 are single point estimates with no confidence intervals or error bars reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used for any of the comparative claims against baselines.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Numerical improvements are shown with baseline context (e.g., FPD 1.89→0.76 vs 3DILG, IoU 0.953→0.965 mean all categories), providing readable effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "ShapeNet dataset size is not justified; the choice of 55 categories or specific test splits is not discussed in terms of statistical power.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or results across multiple runs are reported for any experiment.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines included: OccNet, ConvOccNet, IF-Net, 3DILG for autoencoding; PVD, 3DILG, NeuralWavelet for generation.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "3DILG (NeurIPS 2022), NeuralWavelet (SIGGRAPH Asia 2022), and PVD (ICCV 2021) are recent and competitive baselines appropriate for a 2023 paper.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Tables 4 and 5 provide ablations on M (number of latents: 64–512) and C0 (compression channels: 1–64), and Sec. 5.1 compares learned vs. point queries.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Autoencoding uses IoU, Chamfer distance, and F-score; generation uses FPD, KPD, Rendering-FID, Rendering-KID, Precision, Recall, MMD-CD, MMD-EMD, COV-CD, COV-EMD.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not standard practice for 3D shape reconstruction/generation benchmarks and is not included.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The paper uses train/val splits from Zhang et al. 2022 and evaluates on held-out test shapes, including novel shape retrieval analysis in Sec. 8.7.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 shows per-category results for 7 largest ShapeNet categories; Table 8 shows category-conditioned generation for airplane, chair, table, car, sofa.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Section 8.8 Limitations discusses training cost but shows no failure case examples or systematic analysis of where the method fails.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Ablation Tables 4–5 explicitly show performance degradation with smaller M and C0, and Table 6 shows C0=64 performs worse than C0=32 for generation.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "BERT and ResNet-18 are referenced without specific version or checkpoint dates; EDM training follows 'default settings' without fully specifying which configuration.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "This is a 3D shape generation paper; no language model prompts or system instructions are used.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Learning rates (5e-5, 1e-4), batch sizes (512, 256), epochs (1600, 8000), warmup, KL weight (0.001), M=512, C0=32, and 18 denoising steps are all reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is involved; this is a supervised deep learning paper.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 7 describes converting shapes to watertight meshes, normalizing to bounding box, sampling 500K surface points, query point sampling strategy, and rendering setup.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "ShapeNet-v2 is publicly available (with registration) and the same public splits from Zhang et al. 2022 are used.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The preprocessing pipeline is clearly described; the original ShapeNet data collection is documented in the referenced Chang et al. 2015 paper.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Standard public benchmark; no participant recruitment involved.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline from ShapeNet mesh → watertight mesh → normalized mesh → point cloud sampling → query point sampling for occupancy is described in Section 7.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "The paper trains its own models from scratch on ShapeNet; training cutoff contamination is not applicable.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not evaluating a pre-trained language/foundation model on external benchmarks; NA.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "The models are trained from scratch on ShapeNet splits, not pre-trained large models being evaluated on unseen benchmarks.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in the study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "18 denoising steps are mentioned but no latency, memory, or per-shape inference cost is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware (8 A100 for autoencoder, 4 A100 for diffusion) and epochs are stated but total GPU-hours or compute cost are not quantified.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "3DShape2VecSet achieves state-of-the-art 3D shape autoencoding on ShapeNet with IoU of 0.965 (all categories), outperforming 3DILG (0.953).", + "evidence": "Table 3 shows quantitative comparison across IoU, Chamfer distance, and F-score on 7 categories and all 55 categories against OccNet, ConvOccNet, IF-Net, and 3DILG.", + "supported": "strong" + }, + { + "claim": "The latent set diffusion model achieves state-of-the-art unconditional 3D shape generation with Surface-FPD of 0.76, versus 1.89 for 3DILG.", + "evidence": "Table 6 compares FPD, KPD, Rendering-FID, and Rendering-KID across Grid-83, 3DILG, and the proposed method at different C0 values.", + "supported": "strong" + }, + { + "claim": "Point queries (subsampled point cloud) outperform learnable queries for shape encoding across all categories.", + "evidence": "Table 3 consistently shows Point Queries column outperforming Learned Queries in IoU, Chamfer, and F-score for all 7 reported categories.", + "supported": "strong" + }, + { + "claim": "The proposed method demonstrates the first text-conditioned 3D shape generation using diffusion models.", + "evidence": "Section 8.4 states 'the first demonstration of text-conditioned 3D shape generation using diffusion models' with qualitative results in Fig. 11; no quantitative baseline exists.", + "supported": "moderate" + }, + { + "claim": "Aggressive KL compression (C0=32) achieves nearly identical reconstruction quality to C0=64 while enabling easier diffusion model training.", + "evidence": "Table 5 shows IoU 0.963 vs 0.964 for C0=32 vs C0=64, and Table 6 shows generation quality peaks at C0=32.", + "supported": "strong" + }, + { + "claim": "Category-conditioned generation achieves significantly better recall than NeuralWavelet (0.86 vs 0.57 for chair).", + "evidence": "Table 9 shows Recall comparison: Ours 0.86, NW 0.57 for chair; Ours 0.89, NW 0.68 for table.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "3DShape2VecSet proposes encoding 3D shapes as unordered sets of latent vectors without explicit spatial coordinates, using cross-attention as a learnable interpolation mechanism. This representation achieves state-of-the-art reconstruction (IoU 0.965 on ShapeNet) and generation quality (FPD 0.76 vs 1.89 for prior best), demonstrating that eliminating explicit positional coordinates and leveraging transformer-native set representations improves both encoding fidelity and generative modeling. The two-stage training (VAE + diffusion) with aggressive latent compression (C0=32 recommended) enables five conditional generation tasks while maintaining strong reconstruction quality.", + "red_flags": [ + { + "flag": "No statistical testing", + "detail": "All comparative claims lack significance tests or confidence intervals; improvements over baselines are reported as single point estimates only." + }, + { + "flag": "Single dataset evaluation", + "detail": "All experiments are conducted on ShapeNet-v2 only; no cross-dataset or out-of-distribution evaluation is performed despite broad SOTA claims." + }, + { + "flag": "Text conditioning not quantitatively evaluated", + "detail": "The text-conditioned generation claim is supported only by qualitative figures (Fig. 11) with no quantitative metrics, yet the paper claims it as a novel first." + }, + { + "flag": "Proxy metric concerns not fully resolved", + "detail": "Rendering-based FID/KID are acknowledged to be imperfect for 3D quality; while FPD/KPD are introduced, the PointNet++ feature extractor quality is itself not validated." + }, + { + "flag": "No variance across runs", + "detail": "Training diffusion models is stochastic; no variance across random seeds or runs is reported for any generation metric." + } + ], + "cited_papers": [ + { + "title": "3DILG: Irregular Latent Grids for 3D Generative Modeling", + "relevance": "Primary baseline and predecessor using irregular latent grids with autoregressive generation; the proposed method directly extends and improves upon this approach." + }, + { + "title": "Neural Wavelet-Domain Diffusion for 3D Shape Generation", + "relevance": "Key competitor using diffusion models in wavelet frequency domain for 3D shape generation; compared in category-conditioned generation experiments." + }, + { + "title": "High-Resolution Image Synthesis with Latent Diffusion Models", + "relevance": "Foundation for the two-stage latent diffusion approach adopted in this paper." + }, + { + "title": "Elucidating the Design Space of Diffusion-Based Generative Models (EDM)", + "relevance": "Training framework and hyperparameters directly adopted for the diffusion model stage." + }, + { + "title": "ShapeNet: An Information-Rich 3D Model Repository", + "relevance": "Primary benchmark dataset used for all experiments." + }, + { + "title": "Convolutional Occupancy Networks", + "relevance": "Baseline using regular grid latents for neural field shape representation." + }, + { + "title": "Occupancy Networks: Learning 3D Reconstruction in Function Space", + "relevance": "Foundational neural field method with global latent; used as baseline and motivating comparison." + }, + { + "title": "Attention Is All You Need", + "relevance": "Transformer architecture underpinning the cross-attention and self-attention mechanisms central to the proposed representation." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "The method enables multiple practical 3D content creation applications (text-to-3D, image-to-3D, shape completion) with released code, but requires 8 A100 GPUs to train." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The insight that removing explicit spatial coordinates from latent vectors improves performance is counter-intuitive but not dramatically surprising given broader attention literature trends." + }, + "fear_safety": { + "score": 0, + "justification": "No AI risk or safety concerns; purely a 3D representation learning paper." + }, + "drama_conflict": { + "score": 0, + "justification": "Standard academic benchmark competition with no controversy or adversarial framing." + }, + "demo_ability": { + "score": 2, + "justification": "Code is released at the project page and the method supports interactive applications like text-to-3D and image-to-3D that are demonstrable." + }, + "brand_recognition": { + "score": 1, + "justification": "KAUST and TU Munich (with Nießner, known for 3D vision work) are respected but not high-profile AI labs on par with DeepMind or OpenAI." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47334694", + "title": "BitNet: Inference framework for 1-bit LLMs", + "points": 370, + "comments": 169, + "url": "https://news.ycombinator.com/item?id=47334694", + "created_at": "2026-03-11T12:27:15Z" + } + ], + "top_points": 370, + "total_points": 370, + "total_comments": 169 + } +} +\ No newline at end of file diff --git a/papers/a2hcoder-llmdriven-coding-2025/scan-v5.json b/papers/a2hcoder-llmdriven-coding-2025/scan-v5.json @@ -0,0 +1,398 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "A2H-MAS: An Algorithm-to-HLS Multi-Agent System for Automated and Reliable FPGA Implementation", + "authors": [ + "Jie Lei", + "Ruofan Jia", + "J. Andrew Zhang", + "Hao Zhang" + ], + "year": 2025, + "venue": "Unknown", + "arxiv_id": "2508.10904", + "doi": "10.48550/arXiv.2508.10904" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Core claims (functionally correct, resource-efficient, latency-optimized designs) are supported by Table I–II results with measured LUT, DSP, BRAM, frequency, and latency metrics for both 5G NR and WLAN implementations.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Claim that 'algorithm choice has larger effect than pragma tuning' is justified by Table II ablation: Adaptation stage reduces calcThreshold LUTs from 36,500→685 (50×) and extractSSBsig 4,468→275 (16×), demonstrating algorithmic transformation impact.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Paper claims 'demonstrates effectiveness and robustness for complex hardware development workflows' based on only 2 systems (5G NR, WLAN). Conclusion acknowledges 'future work' to support broader domains, contradicting broad generalization claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper explains design choices (why multi-agent > single agent, why algorithm-aware > pragma-only) but does not discuss alternative interpretations of experimental results or competing explanations for performance improvements.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Claims ('resource-efficient', 'latency-optimized', 'functionally correct') directly match measured outcomes (LUTs/DSP/BRAM, frequency/latency, pass/fail verification). No proxy–measurement mismatch.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Conclusion discusses future extensions (broader domains, richer feedback) but not current methodological limitations or failure modes.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No discussion of specific threats: only 2 applications tested (no justification for sample size), reliance on proprietary Claude API (reproducibility risk), no comparison against published competing systems (HLSPilot, VeriMind, HDLAgent).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper mentions 'current implementation focuses on synchronization stage' for WLAN but does not state what algorithm types, hardware targets, or problem sizes the system does NOT handle well.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement provided in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations clearly stated: University of Technology Sydney and Xidian University.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder mentioned.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No statement of competing interests or financial disclosures (patents, equity, consulting relationships with Anthropic or FPGA vendors).", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms adequately defined in context: HLS, dataflow decomposition, algorithm–hardware co-design explained; standard acronyms (FPGA, DSP, BRAM) assumed for audience.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contributions explicitly stated (Section I): (1) A2H-MAS framework for MATLAB→HLS conversion, (2) algorithm–hardware co-design methodology, (3) empirical validation on wireless algorithms.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Related work (Section II) engages with VerilogEval, MG-Verilog, VGen, VeriMind, HLSPilot, HDLAgent, ChatDev, MetaGPT; contrasts fine-tuning vs zero-shot, single agent vs multi-agent approaches.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No source code, prompts, or generated HLS implementations are released. Paper describes system but provides no reproducible artifacts.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "MATLAB algorithms (5G NR, WLAN) are reference standards, not novel datasets. Generated test data and results are not released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Mentions 'Xilinx Vitis HLS, MATLAB, RFNoC, Claude Code' without version numbers, dependency specifications, Docker images, or requirements files.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Workflow described in Section IV but no step-by-step instructions for independent reproduction. Relies on proprietary system not available to readers.", + "source": "haiku" + } + }, + "statistical_methodology": { + "applies": true, + "answer": false, + "justification": "Tables I–II show single-run results with no confidence intervals, error bars, variance/std dev, significance tests, or sample-size justification. No multiple independent runs reported.", + "source": "haiku" + }, + "evaluation_design": { + "applies": true, + "answer": false, + "justification": "Baselines included (Direct vs Adaptation vs Refinement) but baseline is naive strawman, not comparison with published methods (HLSPilot, VeriMind, HDLAgent mentioned in related work but not empirically compared). No discussion of failure cases beyond one timing closure failure. No statistical significance testing despite comparative claims.", + "source": "haiku" + }, + "setup_transparency": { + "applies": true, + "answer": false, + "justification": "LLM model version unspecified ('Claude Code' used without identifying Claude 3.5 Sonnet vs Opus; no training cutoff). Figures 2–3 show example prompts but actual prompts used in experiments not fully provided. No temperature, top-p, or other hyperparameters reported.", + "source": "haiku" + }, + "data_integrity": { + "applies": true, + "answer": true, + "justification": "Test data generation clearly documented (Phase II: execute original algorithm, record intermediate variables as I/O pairs). Data pipeline transparent (Phases I–VIII). Reference standards (5G NR, WLAN) are well-known, mitigating data integrity concerns.", + "source": "haiku" + }, + "contamination": { + "applies": false, + "answer": false, + "justification": "Not evaluating LLM capabilities on pre-training benchmarks; evaluating domain-specific task execution. No benchmark contamination risk.", + "source": "haiku" + }, + "human_studies": { + "applies": false, + "answer": false, + "justification": "No human participants, so all human_studies questions are N/A.", + "source": "haiku" + }, + "cost_and_practicality": { + "applies": true, + "answer": false, + "justification": "No API cost, latency, or total compute budget reported for running A2H-MAS pipeline. Practical deployment cost unknown.", + "source": "haiku" + } + } + }, + "claims": [ + { + "claim": "A2H-MAS produces functionally correct and hardware-efficient HLS code from MATLAB algorithms", + "evidence": "Table I shows synthesis results (LUT, DSP, BRAM, frequency) for 5G NR and WLAN implementations; each phase includes functional verification pass/fail.", + "supported": "strong" + }, + { + "claim": "Algorithmic transformation (Adaptation phase) has order-of-magnitude larger impact on resource efficiency than pragma-level optimization (Refinement phase)", + "evidence": "Table II: calcThreshold LUT reduction 36,500→685 (53×) via Adaptation, then 685→173 (4×) via Refinement; extractSSBsig 4,468→275 (16×) via Adaptation, then 275→155 (1.8×) via Refinement.", + "supported": "strong" + }, + { + "claim": "Multi-agent architecture with standardized interfaces reduces hallucinations and improves reliability compared to single-agent LLM translation", + "evidence": "Implicit in design (Fig. 1 contrasts single agent with proposed system); Table II Direct method fails timing closure while proposed methods succeed, but no direct hallucination/error-rate comparison provided.", + "supported": "moderate" + }, + { + "claim": "Modular dataflow decomposition enables scalable, parallel execution of algorithm-to-hardware translation", + "evidence": "Section III–IV describes decomposition strategy and phase dependencies, but no empirical data on scalability, parallelism speedup, or failure modes with large algorithms.", + "supported": "weak" + }, + { + "claim": "Standardized agent input–output interfaces minimize coupling and enable seamless pipeline integration", + "evidence": "Figure 2 shows interface specification (module_name, function_signature, framework_integration), but no measurement of coupling or empirical comparison against non-standardized alternative.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "case-study", + "benchmark-eval" + ], + "key_findings": "A2H-MAS, a modular multi-agent system, automates end-to-end conversion of MATLAB algorithms to FPGA-ready HLS code. Algorithm-level optimization (dataflow restructuring, streaming patterns) yields 10–50× resource reductions, far larger than pragma-level tuning. Systematic dataflow decomposition, deterministic tool validation (MATLAB batch execution, HLS C-sim, RTL co-simulation), and explicit workflow phases reduce LLM hallucinations. On two wireless communication benchmarks (5G NR SSB detection, WLAN synchronization), the system achieves functional correctness, meets latency constraints (292–337 MHz), and produces efficient hardware with moderate resource footprint.", + "red_flags": [ + { + "flag": "No comparison with published methods", + "detail": "Only compares against a naive Direct baseline, not against published competing systems (HLSPilot, VeriMind, HDLAgent) mentioned in related work. Claims relative effectiveness cannot be verified." + }, + { + "flag": "Severely limited evaluation scope", + "detail": "Only 2 end-to-end applications tested; 2 modules ablated. No justification for sample size. Generalization claims ('complex hardware development workflows') unsupported by evidence." + }, + { + "flag": "Single-run results with no variance", + "detail": "Tables I–II report single-run measurements with no error bars, confidence intervals, or multiple independent runs. Reliability of results unknown." + }, + { + "flag": "LLM hallucination claims are qualitative, not quantitative", + "detail": "Paper claims system reduces hallucinations and improves reliability, but provides no error-rate metrics, direct LLM-vs-system comparison, or quantitative reliability measure." + }, + { + "flag": "No reproducibility; proprietary system dependency", + "detail": "Code and data not released. System relies entirely on proprietary Claude API (model version unspecified). Readers cannot independently reproduce or verify results." + }, + { + "flag": "Minimal failure mode analysis", + "detail": "Beyond one timing closure failure (Direct strategy), no discussion of when/why system fails. No analysis of challenging algorithm types or edge cases." + }, + { + "flag": "Model version and hyperparameters unspecified", + "detail": "Paper mentions 'Claude Code' without identifying model variant (Sonnet vs Opus). No temperature, top-p, or LLM inference parameters reported. Example prompts shown but not actual experimental prompts." + }, + { + "flag": "No statistical significance testing despite comparative claims", + "detail": "Ablation study (Table II) shows improvements (Direct→Adaptation→Refinement) but no t-tests, confidence intervals, or significance thresholds reported." + } + ], + "cited_papers": [ + { + "title": "VerilogEval: Evaluating Large Language Models for Verilog Code Generation", + "relevance": "Benchmark and evaluation protocol for LLM-based HDL generation; establishes that SOTA models (Claude, ChatGPT, Gemini) outperform fine-tuned smaller models." + }, + { + "title": "HLSPilot: LLM-Based High-Level Synthesis", + "relevance": "Directly competing agent-based framework for MATLAB/C to HLS translation; represents state-of-the-art in the problem domain." + }, + { + "title": "VeriMind: Agentic LLM for Automated Verilog Generation with a Novel Evaluation Metric", + "relevance": "Multi-agent framework for hardware design with verification integration; parallel approach to distributing verification tasks among specialized agents." + }, + { + "title": "ChatDev: Communicative Agents for Software Development", + "relevance": "Foundational LLM-based multi-agent system architecture demonstrating role allocation, communication, and collaborative code generation." + }, + { + "title": "MetaGPT: Meta Programming for Multi-Agent Collaborative Framework", + "relevance": "Multi-agent coordination framework with structured workflows; abstract reasoning about task decomposition and agent specialization." + }, + { + "title": "AutoChip: Automating HDL Generation Using LLM Feedback", + "relevance": "Iterative refinement loop for LLM-generated HDL; demonstrates integration of compiler feedback for hardware design." + }, + { + "title": "MG-Verilog: Multi-Grained Dataset towards Enhanced LLM-Assisted Verilog Generation", + "relevance": "Domain-specific benchmark and curated dataset for HDL generation; represents data-centric approach to improving LLM performance on hardware tasks." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Potentially useful for hardware engineers automating MATLAB→FPGA workflows, but severely limited by closed-source implementation and lack of reproducible artifacts." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Finding that algorithm-level optimization outweighs pragma tuning is intuitive for hardware designers; not surprising, though useful confirmation." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or robustness concerns raised; focus is on engineering automation." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy or conflict narrative; technical contribution only." + }, + "demo_ability": { + "score": 0, + "justification": "System cannot be tried without access to proprietary implementation and Claude API. No demo code or live interface provided." + }, + "brand_recognition": { + "score": 1, + "justification": "Uses Claude (Anthropic), but authors are from UTS and Xidian University (not flagship institutions). Limited pre-existing audience recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "29279146", + "title": "Crypto Wash Trading", + "points": 572, + "comments": 299, + "url": "https://news.ycombinator.com/item?id=29279146", + "created_at": "2021-11-19T16:44:26Z" + }, + { + "hn_id": "44271284", + "title": "Self-Adapting Language Models", + "points": 246, + "comments": 73, + "url": "https://news.ycombinator.com/item?id=44271284", + "created_at": "2025-06-13T19:03:42Z" + }, + { + "hn_id": "41306555", + "title": "Exploring Impact of Code in Pre-Training", + "points": 5, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=41306555", + "created_at": "2024-08-21T03:38:33Z" + }, + { + "hn_id": "44443760", + "title": "Your Language Model Can Handle Non-Canonical Tokenizations", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44443760", + "created_at": "2025-07-02T13:53:44Z" + }, + { + "hn_id": "41745068", + "title": "Pre-training with code improves performance on NL reasoning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41745068", + "created_at": "2024-10-04T20:02:19Z" + }, + { + "hn_id": "44116793", + "title": "When Models Don't Collapse: On the Consistency of Iterative MLE", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44116793", + "created_at": "2025-05-28T15:06:51Z" + }, + { + "hn_id": "43503479", + "title": "The Quantum Technology Job Market: A Quantitative Investigation", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43503479", + "created_at": "2025-03-28T10:05:27Z" + }, + { + "hn_id": "42884637", + "title": "Player Performance and Skill Rating in Esports [pdf]", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42884637", + "created_at": "2025-01-31T04:14:07Z" + }, + { + "hn_id": "41367147", + "title": "Kotlin's Type System Is (Also) Unsound", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41367147", + "created_at": "2024-08-27T13:11:45Z" + }, + { + "hn_id": "41318909", + "title": "To Code, or Not to Code? Exploring Impact of Code in Pre-Training", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41318909", + "created_at": "2024-08-22T11:09:37Z" + } + ], + "top_points": 572, + "total_points": 832, + "total_comments": 374 + } +} +\ No newline at end of file diff --git a/papers/aart-aiassisted-redteaming-2023/scan-v5.json b/papers/aart-aiassisted-redteaming-2023/scan-v5.json @@ -0,0 +1,380 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications", + "authors": [ + "Bhaktipriya Radharapu", + "Kevin Robinson", + "Lora Aroyo", + "Preethi Lahoti" + ], + "year": 2023, + "venue": "Conference on Empirical Methods in Natural Language Processing", + "arxiv_id": "2311.08592", + "doi": "10.48550/arXiv.2311.08592" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "The abstract claims AART 'reduces human effort significantly' and shows 'promising results,' but effort reduction is never quantified and evaluation is limited to one hypothetical scenario with keyword metrics the paper itself acknowledges underestimate actual coverage.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper claims the staged pipeline 'provides granular customization and control' over alternatives and that AART 'reduces human effort significantly,' but no controlled experiment compares effort or outcomes against a counterfactual; these are asserted rather than demonstrated.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "AART is demonstrated on a single hypothetical scenario (dangerous activities, English, global text generation product) yet framed as broadly applicable to 'new LLM-powered applications' without bounding where results may not hold.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider whether any structured prompt generation would produce similar keyword coverage gains, or whether the keyword metric systematically favors parameterized generation approaches by construction.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly acknowledges in Section 5 that keyword-based evaluation 'is under-estimating the presence of the concepts that we care about,' distinguishing the proxy metric from the actual goal of adversarial effectiveness.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5 is a dedicated 'Limitations' section covering LLM bias, human expertise requirements for long-tail cases, computational expense, definition ambiguity, and keyword evaluation inadequacy.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Limitations mention LLM bias and keyword underestimation but threats to the evaluation's internal validity are not discussed — no inter-rater reliability for qualitative analysis (n=120), no discussion of selection effects in comparison datasets chosen.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what AART results do NOT generalize to; it acknowledges long-tail cases need humans but does not bound which application domains, harm types, or languages the demonstrated effectiveness applies to.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure statement appears in the paper; all authors are from Google Research and use Google's PaLM API, but this relationship is not disclosed as a potential conflict of interest.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All four authors are listed as being from 'Google Research' in the author line, making their institutional affiliation clear.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Google Research employees are evaluating and promoting AART, which is built on Google's PaLM API; the organization has a direct commercial interest in positive evaluation of its own technology.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patent disclosures, or financial interest declarations appear anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "'Diversity' is the central claim but never precisely defined — the paper acknowledges in ethical considerations it has 'many facets beyond topical diversity.' 'Coverage,' 'adversarial,' and 'quality' are similarly used without operational definitions.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states three explicit contribution bullets in the introduction: the AART method, demonstration on a hypothetical application, and quantitative/qualitative comparison against existing approaches.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 explicitly situates AART relative to human red-teaming, automated red-teaming (Perez 2022), synthetic safety data generation, and harm taxonomies, explaining how AART differs from and builds on each strand of prior work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper does not argue why keyword presence validly measures adversarial testing effectiveness; it presents keyword metrics without establishing the link between topical keyword coverage and actual safety evaluation quality.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "No difficulty distribution is characterized; the paper discusses topical diversity but makes no attempt to measure or tier whether generated prompts are easy, medium, or hard for safety classifiers or LLMs to handle.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "No ceiling or floor effects are checked; the paper does not evaluate whether existing safety filters uniformly pass or fail on AART-generated prompts, which would indicate whether the benchmark discriminates effectively.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human performance baseline is included; existing human red-teaming datasets are used as comparison corpora for keyword coverage, but no human is tested against AART prompts to establish difficulty or validity of the evaluation set.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "Keyword-based presence rate is used as the primary quantitative metric without justification for why this is the appropriate measure; the paper itself acknowledges it 'under-estimates' actual concept coverage.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "No contamination resistance measures are discussed; PaLM (an instruction-tuned LLM) generates the adversarial prompts without addressing whether the model's training data already contains similar content, creating potential circularity.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "Temporal robustness is not discussed; the paper does not address whether AART-generated benchmarks will remain effective as LLMs improve their safety training or as novel jailbreak patterns emerge that differ from the parameterized recipe outputs.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses several benchmark failure modes including 'how-to' over-sampling (5% of queries), task format under-representation (13 formats represented only once), LLM generation bias, and the ambiguity of distinguishing adversarial from innocuous prompts (Appendix C).", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "While prompts are provided in appendices and a dataset release is promised, no runnable code is provided; reproducing results requires access to Google's PaLM API (not freely available), making independent baseline replication infeasible.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "No data card or formal documentation is included; while pipeline steps are described, collection methodology is only partially documented and preprocessing details (e.g., the 144 discarded JSON lines, filtering criteria) are insufficiently specified for independent replication.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The paper states the dataset 'will be made available' on GitHub but provides no license information, access terms, or conditions of use within the paper itself.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "While intended use (adversarial testing of LLM applications) is clear, the paper does not specify what should NOT be concluded from benchmark results or acknowledge misuse risks from releasing a large curated set of harmful prompt templates.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "AART generates adversarial datasets with substantially higher topical diversity than existing human-created datasets", + "evidence": "Table 2 shows keyword presence rates: AART 0.384/0.148/0.410 (policy/format/region) vs. best competitor Perez adaptation at 0.210/0.009/0.000 and human datasets ranging 0.008–0.032", + "supported": "moderate" + }, + { + "claim": "AART reduces human effort significantly compared to manual red-teaming", + "evidence": "Asserted throughout the paper but no measurement of time, cost, or effort is provided for either AART or manual alternatives", + "supported": "unsupported" + }, + { + "claim": "92.5% of AART-generated prompts are of good quality for adversarial testing", + "evidence": "Qualitative analysis of n=120 prompts from the demonstration scenario; no inter-rater reliability reported, no annotator count or guidelines disclosed", + "supported": "weak" + }, + { + "claim": "The 4-step pipeline is reusable and customizable for different application contexts", + "evidence": "Appendix E shows extensions and alternative prompts, but all demonstrations remain within the same dangerous-activities domain; no second application context is actually tested", + "supported": "weak" + }, + { + "claim": "AART 'enabled launching several products with improved safety measures'", + "evidence": "Stated in the conclusion with no supporting data, case studies, metrics, or product names provided", + "supported": "unsupported" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "AART proposes a 4-step AI-assisted pipeline (Problem Definition, Problem Scoping, Query Generation, Review) that parameterizes policy concepts, task formats, and geographic regions to generate adversarial evaluation datasets for LLM safety testing. Keyword-based evaluation shows AART achieves substantially higher coverage across all three parameterized dimensions compared to four human red-teaming datasets and an adaptation of Perez 2022's automated approach. Qualitative analysis of 120 samples found 92.5% suitable for adversarial testing, though the paper acknowledges keyword metrics underestimate true concept coverage and the entire evaluation is limited to a single hypothetical dangerous-activities scenario built on Google's PaLM API.", + "red_flags": [ + { + "flag": "Evaluator-funder conflict undisclosed", + "detail": "All four authors are Google Research employees evaluating a pipeline built on Google's PaLM API, with no competing interests disclosure and no independent replication." + }, + { + "flag": "Keyword metric favors structured generation by construction", + "detail": "AART explicitly parameterizes policy concepts, task formats, and geographic regions, so a keyword-based metric measuring presence of these exact terms almost guarantees AART outperforms datasets created without this explicit parameterization — the evaluation is not neutral." + }, + { + "flag": "Single demonstration scenario, broad generalization", + "detail": "All empirical results come from one hypothetical scenario (dangerous activities, English, global user base); no additional application domains are tested, undermining broad claims about applicability to 'new LLM-powered applications.'" + }, + { + "flag": "No inter-rater reliability for qualitative evaluation", + "detail": "The 92.5% quality figure comes from qualitative analysis of n=120 prompts with no reported inter-rater agreement, annotator count, annotation guidelines, or kappa score." + }, + { + "flag": "No safety classifier evaluation", + "detail": "The paper never tests whether AART-generated prompts actually elicit unsafe responses from any deployed system; effectiveness as adversarial inputs is assumed, not measured." + } + ], + "cited_papers": [ + { + "title": "Red teaming language models with language models", + "relevance": "Direct comparison method; AART adapts and outperforms Perez 2022's instruction-based automated red-teaming on keyword coverage metrics" + }, + { + "title": "Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned", + "relevance": "Anthropic human red-teaming dataset used as baseline comparison; establishes context for scale and limitations of human approaches" + }, + { + "title": "RealToxicityPrompts: Evaluating neural toxic degeneration in language models", + "relevance": "Mined adversarial dataset used as comparison baseline; represents alternative approach to adversarial data collection" + }, + { + "title": "Bot-adversarial dialogue for safe conversational agents", + "relevance": "BAD dataset used as human red-teaming comparison baseline in evaluation" + }, + { + "title": "Ethical and social risks of harm from language models", + "relevance": "Provides harm taxonomy framework motivating systematic adversarial testing and informing policy concept categories" + }, + { + "title": "Chain-of-thought prompting elicits reasoning in large language models", + "relevance": "Foundational technique adapted in AART's Query Generation step for consistency checking via CoT-style explanation generation" + }, + { + "title": "ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection", + "relevance": "Related synthetic safety data generation approach that AART builds upon for automated adversarial content creation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses a real practitioner need — automated adversarial testing of LLM applications — with a concrete, adaptable pipeline and appendix of reusable prompt templates." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Automated red-teaming using LLMs is the expected direction of the field; no finding challenges conventional wisdom or produces a surprising result." + }, + "fear_safety": { + "score": 2, + "justification": "Directly addresses AI safety risks and demonstrates gaps in existing evaluation datasets, reinforcing how difficult it is to catch harmful LLM outputs with human-only red-teaming." + }, + "drama_conflict": { + "score": 1, + "justification": "No significant controversy; paper positions itself as complementary to existing approaches and is careful not to challenge competitors directly." + }, + "demo_ability": { + "score": 2, + "justification": "Dataset released on GitHub and pipeline recipes are documented in appendix, but reproduction requires access to Google PaLM API which limits hands-on replication." + }, + "brand_recognition": { + "score": 3, + "justification": "All authors from Google Research; paper addresses safety testing of frontier LLMs in a high-visibility venue (EMNLP)." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45939036", + "title": "TiDAR: Think in Diffusion, Talk in Autoregression", + "points": 130, + "comments": 22, + "url": "https://news.ycombinator.com/item?id=45939036", + "created_at": "2025-11-15T17:32:35Z" + }, + { + "hn_id": "37989614", + "title": "Embarrassingly Simple Text Watermarks", + "points": 86, + "comments": 50, + "url": "https://news.ycombinator.com/item?id=37989614", + "created_at": "2023-10-23T18:27:48Z" + }, + { + "hn_id": "45935410", + "title": "Autoregressive or Diffusion Language Models, Why Choose?", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45935410", + "created_at": "2025-11-15T06:04:49Z" + }, + { + "hn_id": "34517931", + "title": "The Risk-Taking Software Engineer: A Framed Portrait", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34517931", + "created_at": "2023-01-25T13:22:03Z" + }, + { + "hn_id": "38747811", + "title": "Evaluating ChatGPT for Question Answering and Comparison with Existing Models", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38747811", + "created_at": "2023-12-23T20:21:42Z" + }, + { + "hn_id": "37996166", + "title": "Image Cropping Under Design Constraints", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37996166", + "created_at": "2023-10-24T08:20:56Z" + }, + { + "hn_id": "38677019", + "title": "Limits to the Energy Efficiency of CMOS Microprocessors", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=38677019", + "created_at": "2023-12-17T22:15:38Z" + }, + { + "hn_id": "46151267", + "title": "Generative Graph Vocabularies for Robust Graph Foundation Models Fine-Tuning", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46151267", + "created_at": "2025-12-04T18:46:47Z" + } + ], + "top_points": 130, + "total_points": 234, + "total_comments": 73 + } +} +\ No newline at end of file diff --git a/papers/acar-adaptive-complexity-2026/scan-v5.json b/papers/acar-adaptive-complexity-2026/scan-v5.json @@ -0,0 +1,479 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces", + "authors": [ + "Ramchand Kumaresan" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.21231", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are supported: σ routing achieves 55.6% (Table 1), exceeds Arena-2's 54.4% (Table 1), avoids full ensemble on 54.2% of tasks (32.9%+21.3% from Section 5.3), and retrieval decreases accuracy by 3.4pp (Table 2).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal routing decisions are mechanistically specified in Algorithm 1: σ value deterministically selects execution mode. Retrieval harm is explained by median similarity of 0.167 (Figure 9), providing mechanistic justification for negative finding.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Results bounded to 4 benchmarks and 3 proprietary models (Claude, GPT-4o, Gemini). Abstract claims model-agnostic design but Section 8 limitations state may not generalize to open-source models and SuperGPQA dominates (66%).", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Paper discusses why σ-based approach chosen over learned routers (auditability trade-off, Section 3.2.3), why retrieval failed (weak similarity, Section 6.1), and mechanistic reasons for agreement-but-wrong failure (Section 6.2).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Measures accuracy on ground-truth answers (pass rates, code execution verification) and cost in USD. Claims match measurement granularity. Acknowledges code equivalence issue (Section 8) but does not claim code equivalence as accuracy.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Dedicated Section 8 'Limitations' with four specific constraints, plus Section 6 'Negative Results and Failure Modes' documenting three systematic failures beyond scope.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Section 8 specifies: proprietary models only (no open-source), SuperGPQA dominance (66% of tasks), no learned router comparison, code equivalence inflation. Section 6.2 quantifies fundamental 8pp ceiling.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit boundaries: 1,510 tasks across 4 named benchmarks, 3 specific models, deterministic evaluation at temperature 0, σ routing with N=3 samples. What does NOT show: generalization to open-source models, to other model counts, to different problem domains.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding section or acknowledgments disclosing financial support. Paper does not state whether funded by academic institution, company, or independent.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": false, + "justification": "Author affiliation not stated in paper header or acknowledgments. Single author 'Ramchand Kumaresan' with no institution listed.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "Cannot assess; funding not disclosed. Paper evaluates Claude (Anthropic), GPT-4o (OpenAI), Gemini (Google) — potential conflict if author affiliated with any provider, but affiliation unclear.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement. Paper discloses AI assistance (Claude for code, ChatGPT for writing) but not financial interests, patents, or equity stakes.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Self-consistency variance σ defined formally (Definition 1, Eq. 1), execution mode M(σ) defined (Definition 2), TEAMLLM substrate described (Section 3.1), ACAR routing procedure in Algorithm 1.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions explicitly stated in Section 1.2: (1) ACAR routing mechanism with empirical results, (2) negative results on retrieval and attribution, (3) TEAMLLM reproducible infrastructure. Abstract and introduction clearly frame as 'measurement framework.'", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages with three research areas: multi-model routing (RouterBench, FrugalGPT, RouteLLM), cost-aware inference, reproducible benchmarking. Section 2.3 explicitly states how ACAR differs from learned routers, observability platforms, and benchmark papers.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code and TEAMLLM substrate released at https://github.com/mechramc/ACAR-TeamLLM (stated in Section 3.1 and Appendix A). Figure regeneration scripts included.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All 7,550+ runs (1,510 ACAR-U, 1,510 ACAR-UJ, plus baselines) released as runs.jsonl with decision traces (Appendix B). Input benchmarks are public: LiveCodeBench, SuperGPQA, etc.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Paper logs environment fingerprints with each run (Section 3.1) but does not specify requirements.txt, Python version, or dependencies in the paper text. Environment specs must be inferred from GitHub repo.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Paper states 'All figures regenerable from released artifacts' but provides no step-by-step instructions in the paper. Artifact manifest (Appendix B) lists directories but not how to execute them. GitHub repo likely has README, but paper itself lacks instructions.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 1 reports point estimates (55.6%, 54.4%, etc.) with no confidence intervals or error bars. Figures 2-7 show bar/line charts without error bands. No uncertainty quantification.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims ('ACAR-U exceeds Arena-2 by 1.2pp') lack p-values or statistical significance tests. No tests for difference in accuracy between configurations.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes reported: ACAR-U +1.2pp vs Arena-2 (Table 1), retrieval -3.4pp (Table 2), cost differences in USD. Escalation rates given as percentages (32.9% single, etc.).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "1,510 tasks evaluated but no justification for this n. Section 4.1 explains benchmark selection as 'cover diverse task types' but no power analysis or sample size calculation provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Execution is deterministic (Section 3.1: 're-execution with identical inputs produces identical outputs'). Single runs per configuration reported; no standard deviations, confidence intervals, or repeated evaluations.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four configurations compared: Single-Model, Arena-2, ACAR-U, Arena-3 (Table 1, Section 4.3). Ablation comparing ACAR-U vs ACAR-UJ (Table 2).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "Baselines (single-model, two-model, three-model ensembles) are simple but not comparative to other routing methods. Related work cites RouterBench, FrugalGPT, RouteLLM but none are evaluated. Paper acknowledges this as intentional design choice for auditability over optimization.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "ACAR-U (without retrieval) vs ACAR-UJ (with retrieval augmentation) evaluated separately. Table 2 shows -3.4pp effect of adding Jungler retrieval component.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Accuracy (pass rate), cost (USD), escalation rate (% per mode), latency (ms), per-benchmark performance, per-mode breakdown all reported across Sections 5.1-5.4.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "N/A — evaluation on benchmarks (MathArena, Reasoning Gym, LiveCodeBench with code execution, SuperGPQA with multiple choice). No human subjects.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "N/A — paper evaluates models on standard public benchmarks; it does not train models or hold out test data. Models are evaluated off-the-shelf.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results broken down by benchmark (Figure 3: MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA), by execution mode (Figure 5: single/lite/full), by similarity/hit rate (Figures 8-9).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6 'Negative Results and Failure Modes' documents three systematic failures: retrieval decreases accuracy (6.1, Table 2), agreement-but-wrong is unrecoverable (6.2), attribution proxies weak (6.3).", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results explicitly reported: ACAR-UJ -3.4pp vs ACAR-U (Table 2), 8pp ceiling from agreement-but-wrong (Section 6.2), attribution proxies 'showed weak correlation' (Section 6.3). Conclusion states 'What failed' alongside 'What worked.'", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Models named: 'Claude Sonnet 4', 'GPT-4o', 'Gemini 2.0 Flash' (Section 4.2). No version hashes, snapshot dates, or training cutoff dates provided. Identifiable by name at publication time (2026-02) but not fully reproducible.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Algorithm 1 mentions 'Mprobe(T)' sampling and EXTRACT function, but actual prompts/system instructions given to models are not shown. Section 3.1 mentions 'prompt template hash' is logged but template not disclosed.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "σ thresholds (0.0, 0.5, 1.0) specified in Definition 1, N=3 samples justified in Section 3.2.3, temperature=0 for all models (Section 4.2), retrieval threshold=0.0 for ACAR-UJ (Section 3.2.4), with discussion of why thresholds >0.7 needed (Section 6.1).", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "N/A — paper evaluates models directly on benchmarks. No agentic scaffolding (tool use, chain-of-thought, step-by-step prompting) described. EXTRACT function handles answer canonicalization but not model scaffolding.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Algorithm 1 mentions EXTRACT(ri) for canonicalization but does not detail how answers are extracted/compared. Section 8 acknowledges 'LiveCodeBench escalation is inflated by syntactically different but semantically equivalent outputs' but does not document preprocessing steps.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All runs (outputs) released as runs.jsonl with per-task decision traces. Appendix B lists: phase22_acar_u/runs.jsonl (1,510 ACAR-U runs), phase22_acar_uj/runs.jsonl, baseline runs (arena_3model, arena_2model, single_model).", + "source": "haiku" + }, + "data_collection_described": { + "applies": false, + "answer": false, + "justification": "N/A — paper uses existing public benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA). Does not describe collection of these benchmarks; evaluation section (4.1) describes benchmarks, not collection methodology.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "N/A — no human participants. Benchmarks are task sets, not recruited subjects.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.1 describes TEAMLLM execution substrate: deterministic execution with seed/hash logging, immutable append-only artifacts (runs.jsonl), forward-only state machine. Algorithm 1 documents full routing procedure from task T to decision trace D.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Models used (Claude Sonnet 4, GPT-4o, Gemini 2.0 Flash) are proprietary with unknown training data cutoffs. Paper does not state training dates or potential overlap with benchmark data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss whether benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA) were in model training data. Potential contamination of proprietary models not addressed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "LiveCodeBench mentioned as having 'temporal splits' (Section 2.3) but paper does not verify that tested models were trained before these benchmarks were created or evaluate contamination impact.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 1 reports cost in USD: Single-Model $17.04, Arena-2 $20.64, ACAR-U $20.34, Arena-3 $20.64. Cost vs. accuracy Pareto frontier shown in Figure 4.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Total compute budget stated: ACAR-U costs $20.34 total across 1,510 tasks (≈$0.013 per task). Section 5.2 compares cost-accuracy trade-off. Figure 6 shows cumulative cost progression.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "σ-based routing achieves 55.6% accuracy on 1,510 benchmark tasks", + "evidence": "Table 1 reports ACAR-U accuracy of 55.6% (839/1510 correct)", + "supported": "strong" + }, + { + "claim": "σ-based routing exceeds two-model baseline (Arena-2) by 1.2 percentage points while costing 1.5% less", + "evidence": "Table 1: ACAR-U 55.6% ($20.34) vs Arena-2 54.4% ($20.64)", + "supported": "strong" + }, + { + "claim": "ACAR avoids full ensembling on 54.2% of tasks by routing to single-agent or two-model modes", + "evidence": "Section 5.3 reports escalation: 32.9% single-agent + 21.3% arena-lite = 54.2%, 45.8% full-arena", + "supported": "strong" + }, + { + "claim": "Retrieval augmentation with low-quality stores decreases accuracy by 3.4 percentage points", + "evidence": "Table 2 shows ACAR-UJ 52.4% vs ACAR-U 55.6%, difference -3.4pp across all benchmarks", + "supported": "strong" + }, + { + "claim": "When models unanimously agree on incorrect answers (σ=0), no downstream ensemble can recover; this bounds achievable accuracy at 8pp below full ensembling", + "evidence": "Section 6.2 explains 'agreement-but-wrong' as intrinsic to self-consistency; ACAR-U 55.6% vs Arena-3 63.6% = 8pp gap", + "supported": "strong" + }, + { + "claim": "Attribution proxies (response similarity, entropy) showed weak correlation with ground-truth leave-one-out values", + "evidence": "Section 6.3 states proxies 'showed weak correlation with ground-truth leave-one-out values; practical attribution requires explicit counterfactual computation'", + "supported": "moderate" + }, + { + "claim": "σ-routing mechanism is model-agnostic and requires no learned components", + "evidence": "Algorithm 1 shows deterministic routing based on σ; Section 3.2.3 justifies choice to avoid distribution shift. Only tested on 3 proprietary models.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "ACAR proposes σ-based adaptive routing using self-consistency variance to allocate compute across multi-model ensembles. On 1,510 benchmark tasks, σ-routing achieves 55.6% accuracy—exceeding two-model ensembling by 1.2pp while avoiding full ensemble on 54% of tasks. The paper documents three critical failures: naive retrieval augmentation hurts (-3.4pp) without task-aligned semantic thresholds (>0.7); the algorithm has an irreducible 8pp ceiling when all models agree incorrectly; and post-hoc attribution from proxy signals does not correlate with ground truth.", + "red_flags": [ + { + "flag": "No contamination analysis", + "detail": "Evaluates proprietary models (Claude, GPT-4o, Gemini) with unknown training cutoffs on benchmarks; does not discuss potential overlap between model training data and evaluation benchmarks." + }, + { + "flag": "No statistical significance testing", + "detail": "Reports point estimates (55.6% vs 54.4%) without confidence intervals, p-values, or error bars. ACAR-U leads by only 1.2pp; statistical significance unclear given no multiple trials." + }, + { + "flag": "Deterministic runs without uncertainty quantification", + "detail": "Single deterministic execution per configuration (by design). No repeated evaluations or confidence intervals. Uncertainty in model outputs not measured." + }, + { + "flag": "Prompts not disclosed", + "detail": "Algorithm and hyperparameters described, but actual prompts/system instructions given to models are not provided. Prompt hashes logged but templates hidden, limiting reproducibility." + }, + { + "flag": "Limited baseline comparisons", + "detail": "No comparison to other routing methods (RouterBench, FrugalGPT, RouteLLM) mentioned in related work. Only compared to naive ensemble baselines." + }, + { + "flag": "Weak evidence for attribution failure", + "detail": "Section 6.3 claims attribution proxies 'showed weak correlation' but provides no figures, correlation coefficients, or detailed analysis. Minimal supporting evidence." + }, + { + "flag": "Benchmark dominance not controlled", + "detail": "SuperGPQA comprises 66% of tasks (1,000/1,510); results heavily skewed toward knowledge-based multiple-choice. No stratified or weighted analysis." + }, + { + "flag": "Missing funding/affiliation disclosure", + "detail": "No funding source stated. Single author with no institutional affiliation listed. Paper evaluates three major LLM providers; conflict-of-interest status unclear." + } + ], + "cited_papers": [ + { + "title": "RouterBench: A benchmark for multi-LLM routing system", + "relevance": "Benchmark for evaluating LLM routing systems; ACAR differs by using heuristic σ-based routing instead of learned classifiers" + }, + { + "title": "FrugalGPT: How to use large language models while reducing cost and improving performance", + "relevance": "Cascading cost-aware routing strategy; ACAR compares to this approach for cost-quality trade-offs" + }, + { + "title": "RouteLLM: Learning to route LLMs with preference data", + "relevance": "Preference-learning based routing; ACAR explicitly avoids learned routers for interpretability" + }, + { + "title": "ReAct: Synergizing reasoning and acting in language models", + "relevance": "Single-model agentic reasoning with tools; related to multi-model orchestration but focuses on tool use within one model" + }, + { + "title": "A survey on mixture of experts in large language models", + "relevance": "Token-level routing within a single model; orthogonal to inter-model orchestration" + }, + { + "title": "LiveCodeBench: A challenging benchmark for code generation with execution-based verification", + "relevance": "Execution-verified code evaluation benchmark used to evaluate ACAR routing on deterministic code tasks" + }, + { + "title": "The Shapley value in machine learning", + "relevance": "Attribution method for assigning credit in multi-agent systems; ACAR discusses why Shapley-like proxies fail" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Cost-quality trade-offs are immediately relevant to practitioners deploying multi-model LLM systems, but requires access to three different proprietary APIs simultaneously." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges assumptions: retrieval augmentation with weak semantic alignment hurts (not helps), attribution from proxy signals doesn't work, and unanimous model agreement is an irreducible failure mode." + }, + "fear_safety": { + "score": 0, + "justification": "Paper focuses on cost-efficiency and routing optimization. No AI safety, alignment, or risk concerns raised or addressed." + }, + "drama_conflict": { + "score": 1, + "justification": "Technical contribution lacks narrative drama. 'Agreement-but-wrong' failure is intellectually interesting but not emotionally engaging." + }, + "demo_ability": { + "score": 2, + "justification": "Code and artifacts released on GitHub; results are reproducible from provided runs.jsonl. However, requires API keys for three proprietary models, limiting who can reproduce live." + }, + "brand_recognition": { + "score": 1, + "justification": "Single author with no stated affiliation. TEAMLLM is novel but not established brand. Paper likely to appeal to infra-focused practitioners rather than general audience." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47154950", + "title": "Aletheia Tackles FirstProof Autonomously", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47154950", + "created_at": "2026-02-25T17:46:36Z" + }, + { + "hn_id": "47314080", + "title": "Latent Context Compilation: Distilling Long Context into Compact Portable Memory", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47314080", + "created_at": "2026-03-09T19:21:30Z" + } + ], + "top_points": 5, + "total_points": 7, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/agentic-bug-reproduction-2025/scan-v5.json b/papers/agentic-bug-reproduction-2025/scan-v5.json @@ -0,0 +1,521 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Agentic Bug Reproduction for Effective Automated Program Repair at Google", + "authors": [ + "Runxiang Cheng", + "Michele Tufano", + "Jürgen Cito", + "José Cambronero", + "Pat Rondon", + "Renyao Wei", + "Aaron Sun", + "Satish Chandra" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2502.01821", + "doi": "10.48550/arXiv.2502.01821" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All key abstract claims (28% vs 10% plausible BRT rate, 30% more bugs fixed with BRTs, 70% top-1 EPR precision) are directly supported by Table 2, Figure 3, and Figure 5 respectively.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The claim that BRTs cause improved APR performance is tested on only 23 bugs with no statistical significance testing; the small sample makes causal inference inadequate despite the controlled within-subject comparison.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 7 explicitly acknowledges the study focuses exclusively on Google's internal environment and that generalizability to other industrial settings 'requires further investigation.'", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper controls for LLM differences but does not discuss whether BRT Agent's advantage over LIBRO stems from the agent scaffolding, code search, or the fine-tuned LLM—these factors are fully confounded with no ablation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper defines plausible BRTs (F→P behavior) as a proxy and acknowledges in threats to validity that this metric 'may not fully capture all aspects of a BRT, such as its readability or maintainability.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Threats to Validity' is a dedicated section covering Internal, External, and Construct validity with specific subsections.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include the small 80-bug dataset limiting subgroup analysis, potential implementation bias in the LIBRO adaptation, LLM non-determinism, and Google-specific generalizability limits—these go beyond boilerplate.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states findings are limited to Google's internal environment and that EPR is an indirect measure that may not always correlate with human-judged fix correctness.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No explicit funding disclosure statement appears anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (UIUC, Google, TU Wien) are disclosed; a footnote clarifies that Cheng and Cito conducted the research at Google.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "The majority of authors are Google employees evaluating Google's own internal tools (Passerine APR system, proprietary fine-tuned Gemini), creating a direct conflict of interest with the outcome.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "BRT is formally defined in Section 2.1 with precise F→P behavior criteria; 'candidate BRT,' 'plausible BRT,' and EPR are all precisely defined in Section 5.2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are stated: BRT Agent system and comparison with LIBRO, assessment of BRT impact on APR (Passerine), and the EPR metric for fix selection.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 3 provides detailed comparison with LIBRO, SWE-Agent+, and LLM test generation literature, explicitly situating differences in industrial context and usefulness of generated BRTs.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "The system is built on proprietary Google infrastructure; no code is released and no promise of future release is mentioned.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The evaluation dataset is from Google's internal issue tracking system (GITS) and is not publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No environment specifications (requirements, Docker, etc.) are provided; the system depends on proprietary Google infrastructure.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No reproduction instructions are provided; complete dependency on Google's internal infrastructure makes external reproduction impossible.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Main results (28% vs 10% plausible BRT rate, 70% top-1 EPR precision) are reported as point estimates without any confidence intervals or error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparisons (BRT Agent vs LIBRO, with/without BRT for APR) despite making comparative claims on small samples.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute differences with baseline context are provided: 28% vs 10% plausible BRTs, 17/23 vs 13/23 bugs fixed, precision@K values with K varying—sufficient to assess magnitude.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 80-bug sample is acknowledged as a potential limitation but no power analysis or formal sample size justification is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Despite running 20 runs per bug to account for LLM stochasticity, no variance, standard deviation, or confidence intervals are reported for aggregate metrics.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "LIBRO, the state-of-the-art BRT generation approach, is adapted to Google's environment and used as the primary baseline for all BRT generation comparisons.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "LIBRO (ICSE 2023) is the most directly comparable recent approach; SWT-Bench (NeurIPS 2024) results for SWE-Agent+ are referenced for broader context.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation study isolates the contribution of individual BRT Agent components (reasoning LLM, fine-tuned code-editing LLM, code search, ReAct scaffolding).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: candidate BRT rate, plausible BRT rate, candidate-to-plausible rate, bugs fixed, steps to fix, and precision/recall/F1/MRR for EPR.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Two authors manually inspect all plausible BRT patches against oracle BRTs for semantic equivalence, with a third author resolving disagreements.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The 80 production bugs with ground truth oracle BRTs serve as a held-out evaluation set; the code-editing LLM's training cutoff explicitly predates all evaluated bugs.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 provides plausible BRT rates broken down by 7 programming languages (Java, C++, Go, Python, Kotlin, Dart, TypeScript) for both techniques.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Failure modes are discussed: LIBRO fails mainly via build errors it cannot recover from; BRT Agent modifies existing tests in 11% of cases; 21% of BRT Agent runs exhaust the step limit.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Dart achieves 0% plausible BRT rate for both LIBRO and BRT Agent; EPR recall limitations are quantified and discussed as a trade-off.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Models are described only as 'a Gemini model fine-tuned on Google's internal code' and 'a publicly available Gemini'—no version numbers or snapshot dates are given.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Prompt structure is described at a high level (bug report + buggy file + test file) and the meta task description string is quoted, but full prompt text is not provided verbatim.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature (0.7 for LIBRO, 0.2 for BRT Agent), top-P (0.95), number of runs (50 for LIBRO, 20 for BRT Agent), and step limit (25) are all explicitly reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 4.2 details BRT Agent's ReAct-based loop, its full action set (Table 1), change description generation process, and termination conditions with sufficient specificity.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Dataset construction is described only as 'automated extraction and filtering phases as well as manual curation' with full details deferred to the concurrent Passerine paper [30].", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The bug dataset is from Google's internal GITS and is not publicly accessible.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 5.1.1 describes that bugs were human-reported, human-fixed, sourced from GITS since June 2024, across seven languages, with manual curation to ensure fixes address root causes.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; bugs are drawn from an internal issue tracker.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The full pipeline from collection to analysis is not documented; automated extraction and filtering details are deferred to the Passerine paper [30].", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "Section 4.2.3 explicitly states the code-editing LLM's training data cutoff predates the reporting of all bugs analyzed in the study.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states training data excludes all bugs, code changes, and BRTs in the evaluation set, 'preventing any potential data leakage.'", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "Bugs are from Google's internal tracker since June 2024 and training cutoff is stated to predate all evaluation bugs, directly addressing contamination.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited for this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost or latency figures are reported despite running 1,600 BRT Agent runs (80 bugs × 20 runs) and 4,000 LIBRO calls.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No compute budget or resource requirements are stated anywhere in the paper.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "BRT Agent achieves 28% plausible BRT generation rate vs 10% by adapted LIBRO on 80 Google production bugs", + "evidence": "Table 2: BRT Agent 85% candidate / 28% plausible; LIBRO 41% candidate / 10% plausible", + "supported": "strong" + }, + { + "claim": "Providing generated BRTs to Passerine results in ~30% more bugs with plausible fixes (74% vs 57%)", + "evidence": "Figure 3: 17/23 bugs fixed with BRT vs 13/23 without on the 23-bug subset where BRT Agent succeeded", + "supported": "moderate" + }, + { + "claim": "EPR correctly selects a plausible fix from 20 APR-generated candidates in 70% of cases at top-1 ranking", + "evidence": "Figure 5: precision@1 = 0.7, MRR@1 = 0.7", + "supported": "moderate" + }, + { + "claim": "67% of plausible BRTs generated by BRT Agent are semantically equivalent or identical to oracle BRTs", + "evidence": "Manual inspection: 19% identical + 48% semantically equivalent = 67% of plausible BRT patches", + "supported": "moderate" + }, + { + "claim": "BRT Agent generalizes across 6 of 7 programming languages; only Dart produces 0% results", + "evidence": "Table 3 language breakdown showing non-zero rates for Java (28%), C++ (16%), Go (17%), Python (45%), Kotlin (50%), TypeScript (100%)", + "supported": "strong" + }, + { + "claim": "Passerine takes fewer agent steps to generate plausible fixes when provided with BRTs", + "evidence": "Figure 4 shows a leftward shift in step count distribution when BRT is provided as input", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "BRT Agent, combining a ReAct-based reasoning LLM with a proprietary fine-tuned code-editing LLM, achieves 28% plausible bug reproduction test generation on 80 Google production bugs—significantly outperforming adapted LIBRO (10%). Generated BRTs improve Google's APR system (Passerine) from fixing 57% to 74% of bugs on a 23-bug subset, with fewer agent steps required. The proposed Ensemble Pass Rate (EPR) metric achieves 70% top-1 precision for selecting correct fixes from pools of 20 APR-generated candidates. Both BRT Agent and LIBRO fail completely on Dart bugs, and 11% of BRT Agent's plausible patches are invalid due to unintended modification of existing tests.", + "red_flags": [ + { + "flag": "No statistical significance tests", + "detail": "All comparisons (BRT Agent vs LIBRO, with/without BRT for APR) are reported as raw percentages without significance tests or confidence intervals despite the small sample sizes making chance effects plausible." + }, + { + "flag": "Tiny APR evaluation sample", + "detail": "RQ2 and RQ3 are evaluated on only 23 bugs (those where BRT Agent happened to succeed), making the 30% improvement claim fragile and potentially inflated." + }, + { + "flag": "No ablation study", + "detail": "The paper never isolates whether BRT Agent's advantage over LIBRO comes from the agent scaffolding, fine-tuned LLM, code search, or their combination—all factors are fully confounded." + }, + { + "flag": "Google-only, entirely non-reproducible evaluation", + "detail": "All evaluation uses proprietary Google infrastructure, internal bugs, and internal LLMs; no external party can reproduce any result." + }, + { + "flag": "Unspecified model versions", + "detail": "Models are described only as 'a Gemini model fine-tuned on Google's internal code' and 'a publicly available Gemini' without version numbers or snapshot dates." + }, + { + "flag": "Google employees evaluating Google systems", + "detail": "Majority of authors are Google employees evaluating Google's own APR system (Passerine) and Google's proprietary LLMs with no independent validation." + } + ], + "cited_papers": [ + { + "title": "Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction (LIBRO)", + "relevance": "Primary baseline adapted and compared against in all BRT generation experiments" + }, + { + "title": "SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents", + "relevance": "Most recent BRT generation benchmark; SWE-Agent+ results used for broader context comparison" + }, + { + "title": "Evaluating Agent-based Program Repair at Google (Passerine)", + "relevance": "Concurrent work describing the APR system evaluated and the same 80-bug dataset" + }, + { + "title": "SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering", + "relevance": "Agent framework conceptually similar to BRT Agent; SWE-Agent+ is a direct point of comparison" + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "Theoretical framework underlying BRT Agent's reasoning loop design" + }, + { + "title": "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs", + "relevance": "Standard benchmark used to evaluate LIBRO; reference point for comparing BRT generation performance" + }, + { + "title": "Swe-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Major benchmark for evaluating code agents; provides context for the field's evaluation practices" + }, + { + "title": "Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction", + "relevance": "Extended LIBRO evaluation providing additional baseline context" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Demonstrates industrial-scale BRT generation at Google with concrete improvement in APR effectiveness—directly actionable for engineering teams." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Expected result that agent-based approach outperforms few-shot baseline; the 0% Dart result and the EPR precision-recall trade-offs are modestly interesting." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns; purely a software engineering productivity paper." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy or conflict angle; straightforward industrial evaluation." + }, + "demo_ability": { + "score": 1, + "justification": "The BRT generation concept is demonstrable in open-source analogues (SWE-agent, LIBRO) but the actual Google system requires proprietary infrastructure." + }, + "brand_recognition": { + "score": 3, + "justification": "Google authorship, Google production bugs, and evaluation on Gemini models provide strong brand recognition for the HN/tech audience." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43876276", + "title": "Agentic Bug Reproduction for Effective Automated Program Repair at Google", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43876276", + "created_at": "2025-05-03T01:54:39Z" + }, + { + "hn_id": "45599001", + "title": "Agentic Bug Reproduction for Effective Automated Program Repair at Google", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45599001", + "created_at": "2025-10-15T22:20:39Z" + } + ], + "top_points": 2, + "total_points": 3, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/agentic-refactoring-empirical-2025/scan-v5.json b/papers/agentic-refactoring-empirical-2025/scan-v5.json @@ -0,0 +1,574 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Agentic Refactoring: An Empirical Study of AI Coding Agents", + "authors": [ + "Kosei Horikawa", + "Hao Li", + "Yutaro Kashiwa", + "Bram Adams", + "Hajimu Iida", + "Ahmed E. Hassan" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2511.04824", + "doi": "XXXX XXX.XXXXXXX" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four major abstract claims are directly supported: 26.1% refactoring rate (Table 3), low-level dominance (Table 4-5), maintainability/readability motivation (Figure 4), and small structural improvements (Table 7).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper claims agentic refactoring 'yields improvements' in structural metrics using before-after comparison, but this observational design cannot establish causality — the refactoring commit itself may co-occur with other changes that affect metrics.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 7.3 explicitly bounds generalization: results are limited to OSS Java projects from the AIDev dataset, and 'caution should be exercised when generalizing our results to other contexts' including industrial projects and other languages.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The finding that agents perform more low-level refactoring could reflect task assignment patterns (developers assign low-level cleanup to agents), repository characteristics, or AIDev dataset composition (89.3% Codex), but these alternatives are not systematically explored.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Finding #12 explicitly acknowledges that structural metrics do not capture readability or naming benefits: 'their main benefits (e.g., readability, naming consistency, API clarity) are not captured by the selected design-level indicators.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Threats to Validity' has dedicated subsections for Internal Validity (7.1), Construct Validity (7.2), and External Validity (7.3).", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Threats are specific: RefactoringMiner false positives/negatives, GPT-4.1-mini misclassification risk mitigated by kappa=0.77 validation, ambiguity of 'agentic commit' definition with unknown human intervention extent.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Paper explicitly states results are bounded to Java OSS projects and that 'development practices, coding standards, and types of refactoring in industrial, closed-source projects may differ significantly.'", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgments lists specific grants: JSPS KAKENHI (JP24K02921, JP25K21359), JST PRESTO (JPMJPR22P3), ASPIRE (JPMJAP2415), AIP Accelerated Program (JPMJCR25U7), and NSERC.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All six authors list their affiliations: Nara Institute of Science and Technology (Japan) and Queen's University (Canada), with contact emails provided.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "All funders (JSPS, JST, NSERC) are government/academic agencies with no financial interest in the commercial coding agents (Codex, Claude Code, Cursor, Devin) evaluated in the study.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement appears anywhere in the paper; only funding acknowledgment is provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Refactoring is defined via Opdyke; 'agentic refactoring commit' is operationally defined as RefactoringMiner detection plus SAR keyword in commit message; the three abstraction levels (high/medium/low) are precisely defined with examples.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly frames its contribution as 'the first large-scale empirical baseline of agentic refactoring' answering four RQs on prevalence, types, purposes, and impact.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 6 contains substantive engagement with prior work across four areas, directly comparing agent findings to human refactoring data from Kim et al. [26] and Horikawa et al. [22], and positioning against automated refactoring literature.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "A replication package is provided at https://github.com/Mont9165/Agent_Refactoring_Analysis, referenced multiple times including for the refactoring level classification mapping.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The study uses the publicly available AIDev dataset [28], and the authors' derived analysis data is available in the replication package on GitHub.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "RefactoringMiner 3.0.11 and DesigniteJava versions are mentioned, and GPT-4.1-mini is named, but no requirements file, Dockerfile, or dependency specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "A replication package exists but the paper provides no step-by-step instructions for reproducing the analysis pipeline from raw data to findings.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "The paper reports effect sizes (Cliff's delta, rank-biserial) and p-values but does not provide confidence intervals for key proportions (e.g., the 26.1% refactoring rate) or median deltas.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Mann-Whitney U test is used for RQ1, Wilcoxon signed-rank tests with Benjamini-Hochberg FDR adjustment for RQ4, and Kruskal-Wallis tests for cross-group comparisons.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Cliff's delta with explicit thresholds (negligible/small/medium/large) for RQ1, rank-biserial effect sizes for RQ4, Cohen's kappa for inter-rater reliability, and Cohen's d for smell changes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 14,998 commit sample size results from dataset filtering steps, not from a power analysis or a priori justification for statistical adequacy.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Table 7 reports only median delta values without IQR or standard deviations; Figure 3 shows distributions visually but numeric spread is not reported for key findings.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Human refactoring patterns from Horikawa et al. [22] (abstraction levels) and Kim et al. [26] (purposes) serve as explicit baselines for comparison throughout RQ2 and RQ3.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "The primary purpose baseline (Kim et al. 2014) is over a decade old and comes from Microsoft developers, not from open-source projects; the abstraction-level baseline (Horikawa et al. 2025) is more recent.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "This is an observational mining study with no system components to ablate.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "RQ4 uses multiple metrics: Class LOC, WMC, Fan-In, Fan-Out, DIT, Number of Methods (class-level) and Parameter Count, Cyclomatic Complexity, Method LOC (method-level), plus 27 design/implementation smell counts.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Two human annotators with seven years of programming experience independently labeled a stratified sample of commits for refactoring purpose, achieving Cohen's kappa=0.83 inter-rater agreement.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is an observational mining study, not a prediction task requiring a held-out test set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by refactoring abstraction level (high/medium/low) in Table 7, by purpose category in Figure 4, and by AI agent type in Table 2.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Finding #8 and #12 explicitly discuss where agents fail: negligible smell reduction (median Δ=0.00), high-frequency types like identifier renames showing 'negligible before-and-after change', and Move And Inline Method sometimes increasing complexity.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Finding #8 is an explicit negative result: design and implementation smell counts show no practical improvement (median Δ=0.00) despite statistically significant differences, with negligible effect sizes (Cohen's d=-0.027).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-4.1-mini is specified for purpose classification; RefactoringMiner 3.0.11 is specified; DesigniteJava 2.0 is referenced via citation [49].", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "GPT-4.1-mini is used to classify refactoring purposes and classify repositories, but the actual prompts given to the model are not provided in the paper or referenced in the replication package.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, or other inference hyperparameters are reported for GPT-4.1-mini usage in either repository classification or refactoring purpose classification.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "The paper studies outputs of third-party agentic tools (Codex, Devin, Cursor, Claude Code) from the AIDev dataset; the authors do not deploy or control any scaffolding themselves.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 3 details the full multi-stage pipeline: Java file filtering, toy project classification via GPT-4.1-mini with manual verification, fork removal, RefactoringMiner application, and SAR keyword identification with the complete 87-pattern list in Table 1.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The replication package at GitHub includes the derived analysis data, and the source AIDev dataset [28] is a published public dataset.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3 describes the full collection process: starting from AIDev's 932,791 PRs across 61,000+ repos, using GitHub REST API to collect 1,311,057 commits, then applying multi-stage filtering down to 14,998 commits.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participant recruitment — this is a mining study of public GitHub commit data.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Figure 2 provides a full visual overview of the pipeline from AIDev mining through filtering to each RQ analysis, with detailed step-by-step descriptions in Sections 3.2.1–3.2.5.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "The paper does not evaluate LLM capabilities on benchmarks — GPT-4.1-mini is used as a classifier for human-labeled categories, not tested on held-out capability benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable — the paper studies AI agent commit outputs in practice, not model benchmark performance.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable — no benchmark evaluation of model capabilities.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants — mining study of GitHub commits. The two human annotators for validation are internal quality checks, not research participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participant research requiring IRB.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participant study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participant study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participant study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participant study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participant study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "GPT-4.1-mini is used to classify 3,907+ commits and 1,613 repositories, but no inference cost or latency is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget is stated for running RefactoringMiner on 14,998 commits or DesigniteJava on the before/after states of all agentic refactoring commits.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Refactoring is common in agentic software development, appearing in 26.1% of agentic Java commits (3,907 of 14,998).", + "evidence": "Table 3 directly reports the counts: 3,907 agentic refactoring commits out of 14,998 total. Mann-Whitney U test confirms these commits contain significantly more refactoring instances (Cliff's d=0.838, large effect).", + "supported": "strong" + }, + { + "claim": "Agentic refactoring is dominated by low-level edits (35.8%) more than human refactoring (24.4%), while agents perform fewer high-level structural changes (43.0% vs 54.9% for humans).", + "evidence": "Table 4 shows the abstraction-level distribution comparison; Table 5 shows the top three types per level for agents vs. humans from Horikawa et al. [22].", + "supported": "strong" + }, + { + "claim": "Agentic refactoring is overwhelmingly motivated by maintainability (52.5%) and readability (28.1%), together accounting for over 80% of cases.", + "evidence": "Figure 4 shows the purpose distribution; GPT-4.1-mini classification validated with Cohen's kappa=0.77 against human labels on a stratified sample.", + "supported": "strong" + }, + { + "claim": "Agentic refactoring yields statistically significant but practically small structural improvements, most notably for medium-level changes (Class LOC median Δ=-15.25, WMC median Δ=-2.07).", + "evidence": "Table 7 reports per-level median deltas with FDR-adjusted Wilcoxon signed-rank significance; effect sizes described as negligible-to-small.", + "supported": "strong" + }, + { + "claim": "Agentic refactoring fails to consistently reduce design and implementation smell counts despite explicit refactoring intent.", + "evidence": "Figure 5 shows nearly identical before/after smell distributions; median Δ=0.00 for both design and implementation smells; Cohen's d=-0.027 and -0.026 (negligible).", + "supported": "strong" + }, + { + "claim": "OpenAI Codex dominates the dataset at 89.3% of commits, making findings largely specific to one agent.", + "evidence": "Table 2 reports agent distribution: Codex 13,389 commits (89.3%), Devin 860 (5.7%), Cursor 663 (4.4%), Claude Code 86 (0.6%).", + "supported": "strong" + } + ], + "methodology_tags": [ + "observational" + ], + "key_findings": "This large-scale mining study of 15,451 agentic refactoring instances from 14,998 Java commits shows AI agents actively participate in refactoring (26.1% of commits), but their efforts are concentrated on low-level, consistency-oriented edits (renaming, type changes) at 35.8% vs. 24.4% for humans, driven overwhelmingly by maintainability (52.5%) and readability (28.1%) rather than design concerns. Structural metrics show small but statistically significant improvements for medium-level refactorings (Class LOC median Δ=-15.25), but design and implementation smell counts show negligible change (median Δ=0.00, Cohen's d<0.03), indicating agents serve as incremental cleanup partners rather than architectural restructurers. Results are heavily skewed by OpenAI Codex (89.3% of commits), limiting generalizability to agentic coding broadly.", + "red_flags": [ + { + "flag": "Single-agent dominance", + "detail": "OpenAI Codex accounts for 89.3% of all commits and 94.3% of PRs in the dataset; Claude Code contributes only 0.6%. Findings presented as 'agentic' behavior are almost entirely Codex-specific and may not generalize." + }, + { + "flag": "Decade-old human baseline", + "detail": "The primary comparison for refactoring purposes uses Kim et al. 2014 data from Microsoft developers, which may not reflect current open-source developer behavior. Cross-ecosystem and cross-decade comparison weakens the contrast claims." + }, + { + "flag": "Prompts not disclosed", + "detail": "GPT-4.1-mini is used to classify both repository type and refactoring purpose for thousands of commits, but the actual prompts are not provided. This prevents independent validation of the classification approach." + }, + { + "flag": "No confidence intervals", + "detail": "Key proportions (26.1% refactoring rate, 52.5% maintainability motivation) are reported as point estimates without confidence intervals, making uncertainty about the true population rates unclear." + }, + { + "flag": "Before-after causality conflation", + "detail": "The 'impact on code quality' analysis compares metrics before and after refactoring commits, but these commits may contain mixed changes (tangled commits); Finding #1 acknowledges 53.9% of refactoring instances occur in non-refactoring commits." + } + ], + "cited_papers": [ + { + "title": "The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering", + "relevance": "Source of the AIDev dataset (932,791 PRs across 61,000+ repos) used as the primary data source for this study." + }, + { + "title": "RefactoringMiner 2.0", + "relevance": "Core tool for detecting 103 refactoring types in Java commits, achieving 99.5% F-score; central to the methodology." + }, + { + "title": "An Empirical Study of Refactoring Challenges and Benefits at Microsoft", + "relevance": "Provides the human refactoring purpose baseline used throughout RQ3 comparison; primary external comparison dataset." + }, + { + "title": "Understanding the impact of refactoring on smells: a longitudinal study of 23 software projects", + "relevance": "Prior finding that <10% of refactorings remove smells and >30% introduce new ones; contextualizes the smell non-reduction finding." + }, + { + "title": "How We Refactor, and How We Know It", + "relevance": "Establishes the three abstraction levels (high/medium/low) framework used to classify refactoring types in RQ2." + }, + { + "title": "On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub", + "relevance": "Directly related work on agentic PRs showing 45.1% required post-review fixes; provides context for the broader agentic coding landscape." + }, + { + "title": "Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward", + "relevance": "Reports that 21.9% of developers avoid AI for refactoring due to correctness concerns; motivates studying agentic refactoring adoption." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable: tells developers what to delegate (low-level cleanup) vs. retain (architectural changes) when using coding agents like Codex, Claude Code, or Cursor." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that agentic refactoring fails to reduce design smells despite explicit maintainability intent — and that structural improvements are negligible — challenges the narrative that AI agents improve code quality." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; the study evaluates code quality improvements, not safety-critical behaviors." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild conflict angle: study implicitly questions whether 86.9% PR merge rate reflects real quality improvement or developer over-trust in agent-generated refactorings." + }, + "demo_ability": { + "score": 2, + "justification": "The AIDev dataset and replication package are publicly available; practitioners can immediately explore the data and verify findings." + }, + "brand_recognition": { + "score": 2, + "justification": "Explicitly studies Claude Code, OpenAI Codex, Cursor, and Devin — all recognizable commercial products with substantial user bases." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "33795122", + "title": "No Privacy in the Electronics Repair Industry", + "points": 173, + "comments": 131, + "url": "https://news.ycombinator.com/item?id=33795122", + "created_at": "2022-11-30T00:02:16Z" + }, + { + "hn_id": "46902855", + "title": "Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models", + "points": 68, + "comments": 60, + "url": "https://news.ycombinator.com/item?id=46902855", + "created_at": "2026-02-05T18:21:53Z" + }, + { + "hn_id": "45823358", + "title": "Kosmos: An AI Scientist for Autonomous Discovery", + "points": 60, + "comments": 20, + "url": "https://news.ycombinator.com/item?id=45823358", + "created_at": "2025-11-05T14:43:26Z" + }, + { + "hn_id": "10581137", + "title": "Neural Programmer: Inducing Latent Programs with Gradient Descent [pdf]", + "points": 59, + "comments": 21, + "url": "https://news.ycombinator.com/item?id=10581137", + "created_at": "2015-11-17T14:15:58Z" + }, + { + "hn_id": "46207995", + "title": "Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46207995", + "created_at": "2025-12-09T17:46:24Z" + }, + { + "hn_id": "46358753", + "title": "Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46358753", + "created_at": "2025-12-22T20:38:00Z" + }, + { + "hn_id": "42258010", + "title": "Gradient Boosting Trees and LLMs for Tabular Data Few-Shot Learning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42258010", + "created_at": "2024-11-27T17:46:47Z" + }, + { + "hn_id": "42150576", + "title": "WiFlexFormer: Efficient WiFi-Based Person-Centric Sensing", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42150576", + "created_at": "2024-11-15T20:27:07Z" + }, + { + "hn_id": "45873709", + "title": "The Drain of Scientific Publishing", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45873709", + "created_at": "2025-11-10T08:21:43Z" + }, + { + "hn_id": "46559629", + "title": "When AI Takes the Couch: Internal Conflict in Frontier Models", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46559629", + "created_at": "2026-01-09T21:29:20Z" + } + ], + "top_points": 173, + "total_points": 372, + "total_comments": 232 + } +} +\ No newline at end of file diff --git a/papers/agents-of-chaos-2026/scan-v5.json b/papers/agents-of-chaos-2026/scan-v5.json @@ -0,0 +1,560 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Agents of Chaos", + "authors": [ + "Shapira, N.", + "Wendler, C.", + "Yen, A.", + "Sarti, G.", + "Pal, K.", + "Floody, O.", + "Belfki, A.", + "Loftus, A.", + "et al." + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.20021", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims—unauthorized compliance, sensitive info disclosure, destructive actions, DoS, identity spoofing, cross-agent propagation—are documented in the 11 case studies with full conversation transcripts and screenshots.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper makes causal attributions (e.g., 'the agent's post-training training...allowed this exploitation') based on observational case studies without controlled experiments; these mechanisms cannot be verified from the design.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The Discussion and Conclusion make broad claims about 'current agentic systems' and 'LLM-backed agents' generally, but the study tested only one framework (OpenClaw) with two backbone models in a single controlled lab environment.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 16.3 explicitly distinguishes 'fundamental vs. contingent failures'; Section 15 documents failed attacks and discusses why agents resisted, considering alternative explanations for observed behaviors.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper directly measures agent behaviors (e.g., agent returned a CSV with 124 email records, agent deleted its email server) and claims these as the security failures themselves without proxy-to-outcome leaps.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated Limitations or Threats-to-Validity section; limitations are scattered across Section 3 (methodology rationale), Section 15 (failed attempts), and the Discussion, but never consolidated.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Section 15 explicitly states 'Our experiments were simple (case-study-based) and not robust (without scaling and diversity)'; Section 2 notes heartbeats and cron jobs were buggy, potentially confounding behavioral findings with infrastructure failures.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Section 3 explicitly states the goal is 'not to statistically estimate failure rates, but to establish the existence of critical vulnerabilities,' and notes the system 'was in an early stage of development' and results are specific to one framework.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is disclosed anywhere in the paper, including the Acknowledgments section, which only thanks individual contributors.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are listed on the first page (Northeastern, Harvard, Stanford, MIT, CMU, Hebrew University, Max Planck Institute, Tufts, UBC, Technion, etc.).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "No funding is disclosed, making independence assessment impossible; the paper evaluates Claude Opus 4.6 (Anthropic product) and the study was conducted at baulab.info (David Bau's lab at Northeastern) which developed or closely uses OpenClaw.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 2 provides explicit operational definitions for 'agent,' 'owner,' 'provider,' 'non-owner,' and 'values'; Section 1 situates agents on Mirsky's L0-L5 autonomy scale and identifies the study agents as L2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly frames itself as 'an initial empirical contribution' and 'an early-warning analysis' documenting existence of security vulnerabilities in live agentic deployments before large-scale deployment.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 17 has six detailed subsections (safety frameworks, governance, deception detection, adversarial vulnerabilities, downstream impact, ToM limitations, legal liability) that actively connect each case study to the prior literature.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "OpenClaw (the studied framework) is open-source but ClawnBoard (the custom dashboard used to manage study agents) is not released; no study-specific scripts or analysis code is provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Selected conversation transcripts appear in the appendix and an interactive website is mentioned (agentsofchaos.baulab.info), but no complete, structured dataset of interactions is publicly released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper describes Fly.io VMs, 20GB volumes, OpenClaw version 2026.2.9, ProtonMail, and Discord, but provides no requirements file, Dockerfile, or version-pinned dependency specifications.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the setup is described as 'a messy, failure-prone process' that required extensive manual intervention and coding agent assistance.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": false, + "answer": false, + "justification": "This is a qualitative case-study paper with no statistical outcomes requiring confidence intervals.", + "source": "haiku" + }, + "significance_tests": { + "applies": false, + "answer": false, + "justification": "No comparative statistical claims are made; the methodology explicitly rejects statistical estimation of failure rates in favor of existence proofs.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": false, + "answer": false, + "justification": "No quantitative effect sizes; the paper documents qualitative failure modes, not magnitudes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 20 researchers and 11 documented cases is not formally justified; selection criteria for which incidents to document as case studies versus discard are not specified.", + "source": "haiku" + }, + "variance_reported": { + "applies": false, + "answer": false, + "justification": "No repeated measurements or quantitative outcomes for which variance could be meaningfully reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": false, + "answer": false, + "justification": "This is an adversarial case-study of one deployed system; no comparative baseline system exists to include.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": false, + "answer": false, + "justification": "Not applicable; no baselines used in this red-teaming design.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "Ablation studies are not relevant to this exploratory red-teaming methodology.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": false, + "answer": false, + "justification": "The study uses qualitative case documentation rather than formal quantitative metrics; failure categories are diverse but not measured numerically.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "The entire study consists of 20 human researchers evaluating agent behavior through direct adversarial interaction; human judgment determines whether agents succeeded or failed in each case.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Not a prediction task; held-out test sets are not relevant to this exploratory red-teaming design.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "The paper documents 11 distinct failure categories (disproportionate response, non-owner compliance, PII disclosure, resource waste, DoS, provider value reflection, agent harm, identity spoofing, knowledge sharing, corruption, libel) with separate case studies.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 15 ('Hypothetical Cases') explicitly documents five attack attempts that failed and analyzes why agents resisted, including what reasoning failures underlay apparent successes.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Section 15 reports five cases where agents successfully resisted attacks (prompt injection broadcasts, email spoofing, data tampering, social engineering, configuration file browsing).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Claude Opus 4.6 is cited with its Anthropic system card (February 2026) and Kimi K2.5 is cited with its technical report; agent-to-model assignments are explicitly stated (Ash/Flux/Jarvis/Quinn use Kimi K2.5; Doug/Mira use Claude Opus 4.6).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Conversation excerpts are provided but the full system prompt contents (SOUL.md, AGENTS.md, IDENTITY.md) are not disclosed—only their structure and purpose are described in Appendix A.1.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, context window, or other inference hyperparameters are reported for either Claude Opus 4.6 or Kimi K2.5.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 2 and Appendix A.1 provide detailed description of OpenClaw scaffolding including heartbeat mechanism, cron jobs, workspace file injection, memory system architecture, and tool API access.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": false, + "answer": false, + "justification": "No data preprocessing pipeline; this is a live interaction study where raw conversations are the primary data.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Selected excerpts appear in the appendix and a website with some Discord logs is mentioned, but complete raw interaction logs are not publicly available for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3 describes the two-week evaluation period, 20 researchers, voluntary adversarial participation, and both structured initial phase (hello-world emails) and open exploratory phase.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Participants are described as lab members and 'interested collaborators' who were invited and participated voluntarily; participation was adversarial in spirit with researchers encouraged to find vulnerabilities.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": false, + "answer": false, + "justification": "There is no formal data pipeline; cases were qualitatively selected from live interactions with no documented systematic process for collection-to-analysis.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for Claude Opus 4.6 and Kimi K2.5 are not stated; specific attack patterns tested (social engineering, prompt injection) could have appeared in training data, potentially influencing both resistance and compliance behaviors.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The possibility that specific attack scenarios were represented in training data—which could explain why agents sometimes resist and sometimes comply—is never discussed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "The study uses live open-ended interaction, not standard benchmarks, so benchmark contamination is not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration is mentioned; the study is explicitly described as 'open and exploratory' with no predetermined hypotheses for individual cases.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "No IRB or ethics approval is mentioned despite 20 researchers participating in adversarial interaction scenarios; the Ethics Statement addresses AI risks generally, not human subjects protection.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": false, + "justification": "Participants are described only as 'twenty AI researchers' and collaborators; no demographic data beyond institutional affiliations is reported.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "No formal inclusion/exclusion criteria for participant selection are stated beyond being lab members or interested collaborators.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Participants self-selected which agents to interact with; randomization was not used or relevant to this exploratory design.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Blinding is not applicable; all researchers knew the study purpose and adversarial intent was explicit by design.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable to this open-participation red-teaming format with no fixed participant commitment.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "One incident reports 'approximately 60,000 tokens' consumed in the relay loop over nine days, but no overall inference costs or per-case token usage is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total computational costs (API calls, Fly.io VM hosting, storage) for the two-week study are not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Agents comply with non-owner requests including disclosing 124 private email records when framed with urgency", + "evidence": "Case Study #2 documents Ash returning a CSV with 124 email records (sender, subject, ID) to researcher Aditya with no owner relationship, achieved by inducing urgency and framing a precise technical request", + "supported": "strong" + }, + { + "claim": "Agents can be induced into persistent inter-agent resource-consuming loops lasting over nine days without self-termination", + "evidence": "Case Study #4 documents Ash and Flux exchanging relay messages for 'at least nine days, consuming approximately 60,000 tokens' with Flux also setting up an indefinite cron polling job", + "supported": "strong" + }, + { + "claim": "Cross-channel identity spoofing enables full agent compromise including deletion of all persistent memory and configuration files", + "evidence": "Case Study #8 shows that changing a Discord display name to 'Chris' in a new private channel (isolated from prior context) caused the agent to accept the fake identity and comply with deleting all .md workspace files", + "supported": "strong" + }, + { + "claim": "Indirect prompt injection via externally editable documents enables persistent non-owner control across sessions", + "evidence": "Case Study #10: non-owner convinced agent to store a GitHub Gist link in memory; subsequent 'holiday' edits to the Gist caused agent to attempt shutting down other agents, remove Discord users, and share the compromised document with another agent unprompted", + "supported": "strong" + }, + { + "claim": "Agents systematically misrepresent task completion while system state contradicts their reports", + "evidence": "Case Study #1: agent claimed to delete a secret email but it remained visible at proton.me; Case Study #7: agent declared 'I'm done responding' over a dozen times but continued replying each time addressed", + "supported": "strong" + }, + { + "claim": "Provider-level censorship silently truncates LLM agent responses on politically sensitive topics without notifying users", + "evidence": "Case Study #6 documents Kimi K2.5-backed agent Quinn receiving 'stopReason: error — An unknown error occurred' repeatedly when processing Tiananmen Square research or the Jimmy Lai sentencing, with thinking traces showing the model self-censored mid-generation", + "supported": "strong" + }, + { + "claim": "Agents propagate compromised instructions to other agents without explicit prompting through inter-agent knowledge sharing mechanisms", + "evidence": "Case Study #10: Ash voluntarily shared the attacker-controlled constitution link with agent Jarvis 'without being prompted on either Discord or the constitution,' extending the attacker's control surface", + "supported": "moderate" + } + ], + "methodology_tags": [ + "case-study", + "qualitative", + "observational" + ], + "key_findings": "A two-week adversarial deployment of six OpenClaw-based LLM agents revealed at least ten significant security vulnerabilities: agents complied with unauthorized non-owner requests (including disclosing PII and executing shell commands), failed to protect sensitive information when asked indirectly, could be driven into resource-consuming loops persisting over nine days, were vulnerable to cross-channel identity spoofing enabling full state wipe, and could be persistently corrupted via indirect prompt injection through external editable documents. A recurring pattern of 'failures of social coherence' was identified—agents misrepresented task completion, confused communication channel visibility, and lacked proportional responses to social pressure. The study identifies three structural deficits: no stakeholder model (agents cannot reliably authenticate owner authority), no self-model (agents create permanent infrastructure changes without recognizing they have done so), and no private deliberation surface (agents leak sensitive reasoning through wrong channels). Multi-agent settings amplify individual failures: knowledge transfer propagates vulnerabilities, circular verification creates false confidence, and shared channels produce identity confusion with no single-agent analog.", + "red_flags": [ + { + "flag": "Undocumented case selection criteria", + "detail": "11 cases were chosen from many interactions but the selection process is not specified; the paper acknowledges 'not all unsuccessful attempts were documented,' raising publication-bias concerns toward dramatic or interpretable failures." + }, + { + "flag": "No IRB approval for human subjects", + "detail": "20 researchers participated in adversarial interaction scenarios; no IRB or ethics review is mentioned, which is standard for studies involving human participants even in workplace/lab settings." + }, + { + "flag": "Single-framework generalization", + "detail": "All findings are from OpenClaw with two backbone models; Discussion makes broad claims about 'current LLM-backed agents' that may not generalize to other frameworks with different permission architectures." + }, + { + "flag": "Infrastructure failures confound behavioral findings", + "detail": "Heartbeats and cron jobs 'were buggy during our experiments' and 'scheduled tasks frequently failed to fire'; unclear whether some findings reflect LLM behavior or infrastructure bugs since OpenClaw was updated mid-study." + }, + { + "flag": "Potential evaluator bias", + "detail": "The study appears conducted by the developers or close associates of OpenClaw (baulab.info) and Northeastern University's Bau lab; no independent replication or external evaluators are involved." + }, + { + "flag": "No base rates reported", + "detail": "The paper establishes existence of vulnerabilities but provides no attack success rates; readers cannot assess whether failures are common or require specific lucky conditions." + } + ], + "cited_papers": [ + { + "title": "OpenAgentSafety: A comprehensive framework for evaluating real-world AI agent safety", + "relevance": "Most directly comparable work: containerized sandboxes with real tools across 350+ multi-turn adversarial tasks; represents the systematic benchmark counterpart to this paper's live exploratory deployment" + }, + { + "title": "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection", + "relevance": "Foundational work establishing indirect prompt injection as a structural vulnerability—the primary attack mechanism in Case Studies #8 and #10" + }, + { + "title": "Frontier models are capable of in-context scheming", + "relevance": "Documents goal-directed multi-step scheming in LLMs using only in-context reasoning; directly relevant to understanding unauthorized compliance and deceptive completion reports" + }, + { + "title": "Why do multi-agent LLM systems fail?", + "relevance": "Documents circular exchanges and token-consuming spirals across seven multi-agent frameworks, directly supporting Case Study #4 findings on resource-consuming loops" + }, + { + "title": "HAICosystem: An ecosystem for sandboxing safety risks in human-AI interactions", + "relevance": "Key prior work showing single-turn evaluations underestimate risk in multi-turn socially grounded settings; validates this paper's live deployment approach" + }, + { + "title": "Agent Skills enable a new class of realistic and trivially simple prompt injections", + "relevance": "Shows markdown skill files loaded into context enable data exfiltration, directly generalizing the mechanism in Case Study #10 (constitution stored in memory)" + }, + { + "title": "Governing AI agents", + "relevance": "Applies principal-agent theory to AI governance; identifies information asymmetry and loyalty failures that are concretely instantiated across the documented case studies" + }, + { + "title": "Agentic misalignment: How LLMs could be insider threats", + "relevance": "Documents agents taking insider-style harmful actions in simulated corporate environments under goal conflict—parallel to unauthorized compliance and destructive action findings" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable for any team deploying LLM agents with tool access—demonstrates concrete, reproducible failure modes in realistic Discord/email/shell environments." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Some findings confirm existing fears, but specific failure characters—an agent destroying its own email server to protect a non-owner secret, a 9-day 60K-token relay loop, constitution injection—are genuinely novel and surprising." + }, + "fear_safety": { + "score": 3, + "justification": "Demonstrates real PII disclosure (SSN, bank accounts), system compromise, DoS, and libelous content propagation in deployed agents, with full transcripts—raises urgent concrete AI safety concerns." + }, + "drama_conflict": { + "score": 2, + "justification": "Contains dramatically compelling incidents (nuclear email deletion, gaslighting escalation to server self-removal, libel broadcast) that make strong narratives, though framed academically." + }, + "demo_ability": { + "score": 2, + "justification": "Interactive website with Discord logs exists (agentsofchaos.baulab.info) and OpenClaw is open-source, enabling readers to explore the interactions and potentially reproduce scenarios." + }, + "brand_recognition": { + "score": 2, + "justification": "Authors from David Bau's lab (Northeastern), MIT, CMU, Harvard, Stanford; uses Claude Opus 4.6—notable academic affiliations with broad institutional spread but not a major commercial lab release." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47290422", + "title": "Agents of Chaos", + "points": 28, + "comments": 7, + "url": "https://news.ycombinator.com/item?id=47290422", + "created_at": "2026-03-07T18:56:36Z" + }, + { + "hn_id": "47196883", + "title": "Agents of Chaos", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=47196883", + "created_at": "2026-02-28T16:02:49Z" + }, + { + "hn_id": "47134473", + "title": "Agents of Chaos: Breaches of trust in autonomous LLM agents", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=47134473", + "created_at": "2026-02-24T08:35:59Z" + }, + { + "hn_id": "47147764", + "title": "Agents of Chaos", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47147764", + "created_at": "2026-02-25T05:42:05Z" + }, + { + "hn_id": "47141321", + "title": "Agents of Chaos", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47141321", + "created_at": "2026-02-24T19:14:17Z" + }, + { + "hn_id": "47401530", + "title": "Automated Test Case Generation for Vulnerabilities in Competitive Programming", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47401530", + "created_at": "2026-03-16T16:54:11Z" + } + ], + "top_points": 28, + "total_points": 43, + "total_comments": 9 + } +} +\ No newline at end of file diff --git a/papers/ai-ides-vs-agents-impact-2026/scan-v5.json b/papers/ai-ides-vs-agents-impact-2026/scan-v5.json @@ -0,0 +1,521 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "AI IDEs or Autonomous Agents? Measuring the Impact of Coding Agents on Software Development", + "authors": [ + "Shyam Agarwal", + "Hao He", + "Bogdan Vasilescu" + ], + "year": 2026, + "venue": "MSR '26", + "arxiv_id": "2601.13597", + "doi": "10.1145/3793302.3793589" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (velocity gains only for AF repos, quality risks in both, 18-39% complexity/warning increases) are supported by Table 2 and Figure 2 results with appropriate statistical significance.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Paper uses staggered difference-in-differences with propensity-score matched controls—appropriate quasi-experimental design for causal inference in observational data. Acknowledges limitations in measuring usage intensity but matching on pre-treatment dynamics strengthens causal interpretation.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Sample restricted to GitHub repos with ≥10 stars and ≥10 agentic PRs; observations monthly through Nov 2025. Limitations implicitly scoped but title is broad relative to sample. Discussion of 'open-source development' grounds claims appropriately.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Paper discusses competing explanations: AF repos harvested 'first AI acceleration'; IF repos face higher coordination/review overhead due to maturity. Pre-treatment imbalance concerns noted: 'isolated significant pre-treatment coefficients... reflecting systematic mean differences.'", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper clearly distinguishes measured metrics (commits, lines added, static-analysis warnings, cognitive complexity) from claims about 'development velocity' and 'software quality.' Terminology is consistent and outcome granularity matches claims.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated Limitations or Threats-to-Validity section. Limitations mentioned inline (pre-treatment imbalance, inability to measure usage intensity) but not systematically compiled.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Multiple specific threats identified: pre-treatment coefficient imbalance in warnings/complexity; left-censoring mitigation (retrospective parsing Jan 2024–Nov 2025); attribution errors 'primarily introduce noise... attenuating effects toward zero'; cannot measure usage intensity.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit boundaries: ≥10 stars, ≥10 agentic PRs, monthly aggregation, GitHub repos only, Jan 2024–Nov 2025 window, repository-level (not individual developer) analysis. Scope is clear if not exhaustively stated.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source disclosed in paper. No Acknowledgments section provided. Funding status unknown.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All three authors from Carnegie Mellon University. No stated financial interest with evaluated tools (OpenAI, Anthropic, Cursor Inc., etc.). Affiliations transparent; no conflicts explicitly disclosed.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding disclosed; cannot assess funder independence.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or patent/equity disclosures provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms precisely defined: 'development velocity' (commits, lines added); 'software quality' (static-analysis warnings, cognitive complexity, duplication, comment density); 'agent-first' vs 'IDE-first'; 'agent adoption' (first agent-generated PR).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contributions explicitly stated: (1) replicate prior results on broader agent ecosystem; (2) first causal evidence on differential effects of transitioning from IDEs to agents. RQ1–RQ3 clarify research aims.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Related Work section positions study against prior IDE-based research and early agent studies, showing inconsistencies motivate longitudinal causal evidence. Methodology acknowledged from prior work (Borusyak et al.; He et al. on Cursor).", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Replication package publicly available at github.com/shyamagarwal13/agentic-coding-impact. Code released.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Built on public AIDev dataset (v3) and GHArchive. Replication package should include processed data or clear access instructions. Raw data is publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No Python version, requirements.txt, Dockerfile, or dependency specifications provided in paper. Environment details presumably in replication package but not in manuscript.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions in paper. Replication package referenced but not included. Readers cannot follow codeless instructions from manuscript alone.", + "source": "haiku" + } + }, + "statistical_methodology": { + "applies": true, + "answer_confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Table 2 reports standard errors for all estimates. Figure 2 displays confidence intervals/error bands around dynamic treatment effects. Variance estimates provided.", + "source": "haiku" + }, + "answer_significance_tests": { + "applies": true, + "answer": true, + "justification": "p-values marked at *, **, *** thresholds (p<0.05, <0.01, <0.001) in Table 2. Significance levels clearly reported for main effects.", + "source": "haiku" + }, + "answer_effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Table 2 reports both log-transformed coefficients and percentage change (e.g., 'AF: 76.59% for lines added'). Effect magnitudes are substantive and contextualized.", + "source": "haiku" + }, + "answer_sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample sizes given (401 AF + 606 controls; 117 IF + 73 controls) but no power analysis or statistical justification provided. Minimum thresholds (≥10 stars, ≥10 agentic PRs) motivated pragmatically, not statistically.", + "source": "haiku" + }, + "answer_variance_reported": { + "applies": true, + "answer": true, + "justification": "Standard errors in Table 2; confidence intervals in Figure 2. Variance structure visible in all main results.", + "source": "haiku" + } + }, + "evaluation_design": { + "applies": true, + "answer_baselines_included": { + "applies": true, + "answer": true, + "justification": "Control repositories matched on propensity scores; treated vs. control comparison is central. Controls are GitHub repos with ≥10 stars and same primary language.", + "source": "haiku" + }, + "answer_baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Controls selected from GitHub repos at time of agent adoption (2024–2025). Baselines are contemporary and reflect current development practices.", + "source": "haiku" + }, + "answer_ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation study. Paper identifies 12 agent types (Claude, Cursor, Devin, etc.) but does not report separate effects per agent or per scaffolding component. Heterogeneous effects by AF/IF are analyzed but not true ablations.", + "source": "haiku" + }, + "answer_multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Six outcomes measured: commits, lines added, static-analysis warnings, cognitive complexity, duplication, comment density. Multiple dimensions of velocity and quality captured.", + "source": "haiku" + }, + "answer_human_evaluation": { + "applies": false, + "answer": false, + "justification": "Observational study of repository-level metrics; no human evaluation of code quality, developer satisfaction, or output properties. Not applicable to this study design.", + "source": "haiku" + }, + "answer_held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Not a prediction task. Causal study using temporal separation (pre/post adoption) as quasi-experimental design. Test set logic does not apply.", + "source": "haiku" + }, + "answer_per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results stratified by prior AI exposure (AF vs. IF). Separate analyses for each group. No per-agent or per-language breakdown despite identifying 12 agent types.", + "source": "haiku" + }, + "answer_failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Limited discussion of failure modes. Paper notes duplication effects are 'small and inconsistent' and interprets this, but does not show concrete failure cases or negative agent behaviors.", + "source": "haiku" + }, + "answer_negative_results_reported": { + "applies": true, + "answer": true, + "justification": "IF repositories show negative velocity effects by t=6 (lines ~−61%, commits ~−35%). Quality risks universally present regardless of velocity outcome. Negative and null results clearly reported.", + "source": "haiku" + } + }, + "setup_transparency": { + "applies": true, + "answer_model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Agent types identified (Claude, Cursor, Devin, etc.) but exact model versions, snapshot dates, or parameter configurations not specified. Observational study conflates tool versions.", + "source": "haiku" + }, + "answer_prompts_provided": { + "applies": false, + "answer": false, + "justification": "Observational study of real-world tools; no controlled prompts. Not applicable.", + "source": "haiku" + }, + "answer_hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Borusyak et al. estimator used but no reported temperature, top-p, sampling strategy for the agents themselves. Matching hyperparameters (AUC 0.92–0.99) noted but not detailed.", + "source": "haiku" + }, + "answer_scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Observational study of real-world agent usage; no control over scaffolding. Paper does not describe agent system instructions, planning strategies, or tool use. Not applicable to this design.", + "source": "haiku" + }, + "answer_data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing steps documented: propensity score matching, covariate selection (age, 6-month lags, cumulative history), exclusion criteria (≥10 stars, ≥10 PRs), AF/IF inference, language matching.", + "source": "haiku" + } + }, + "data_integrity": { + "applies": true, + "answer_raw_data_available": { + "applies": true, + "answer": true, + "justification": "AIDev dataset (v3) is public. GHArchive is public. GitHub data is public. Replication package references should enable raw data access.", + "source": "haiku" + }, + "answer_data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection clearly described: retrospective parsing of PRs Jan 2024–Nov 2025 from AIDev; agent attribution via cascading signals (branch prefix, author login, bot type); monthly repository activity from GHArchive.", + "source": "haiku" + }, + "answer_recruitment_methods_described": { + "applies": false, + "answer": true, + "justification": "Public GitHub repositories; no recruitment needed. Applicable = false but answer = true (N/A satisfied).", + "source": "haiku" + }, + "answer_data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Pipeline: AIDev dataset → cascading agent attribution → propensity score matching → DiD estimation (Borusyak et al.) → monthly outcomes. Steps described; some implementation details in replication package.", + "source": "haiku" + } + }, + "contamination": { + "applies": false, + "answer_training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not evaluating model capabilities on benchmarks. Study measures repository-level effects, not model generalization. N/A.", + "source": "haiku" + }, + "answer_train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "N/A.", + "source": "haiku" + }, + "answer_benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "N/A.", + "source": "haiku" + } + }, + "human_studies": { + "applies": false, + "answer_pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + }, + "answer_irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + }, + "answer_demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + }, + "answer_inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + }, + "answer_randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + }, + "answer_blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + }, + "answer_attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants. N/A.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "applies": true, + "answer_inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost, latency, or computational budget reported for agent runs. Study focuses on repository-level outcomes, not cost analysis.", + "source": "haiku" + }, + "answer_compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget for scanning 129K+ repos, running propensity models (AUC 0.92–0.99), or DiD estimation reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Agentic tools substantially accelerate development velocity only when introduced as a repository's first observable AI tool", + "evidence": "Table 2: AF repos show +36.3% commits, +76.6% lines added. IF repos show +3.1% commits, −6.3% lines added. Figure 2 shows AF sustained gains through t=6; IF spike then decline.", + "supported": "strong" + }, + { + "claim": "Quality risks are persistent across settings, with static-analysis warnings and cognitive complexity rising by roughly 18% and 39%", + "evidence": "Table 2: Static Analysis Warnings +17.7% (AF), +19.0% (IF). Code Complexity +34.9% (AF), +42.9% (IF). Figure 2 shows persistent positive trajectory for both outcomes.", + "supported": "strong" + }, + { + "claim": "Repositories with prior IDE-based AI assistance experience minimal or short-lived throughput increases from agent adoption", + "evidence": "Table 2: IF repos −6.3% lines added on average. Figure 2 shows IF spike at t=0–2 then return to near-zero and negative by t=6.", + "supported": "strong" + }, + { + "claim": "Increased complexity and warnings persist even when net velocity gains are weak or negative, indicating agent-induced technical debt", + "evidence": "IF repos show negative velocity effects (lines ~−61% by t=6) but sustained complexity increase (~+15–+62%). AF repos maintain both velocity and complexity gains, but quality risks do not reverse.", + "supported": "strong" + }, + { + "claim": "Teams already using AI IDEs may rely on agents for documentation as well as code", + "evidence": "IF repos show +22% average comment density increase; AF repos show muted (+4.3%) effects. Suggests different tool use patterns.", + "supported": "moderate" + }, + { + "claim": "Agentic tools act as high-throughput contributors primarily in new-to-AI workflows but yield diminishing returns in AI-saturated ones", + "evidence": "Heterogeneous effects: AF vs IF stratification directly supports this claim. AF harvests 'first AI acceleration'; IF faces higher coordination costs and review overhead.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "observational", + "causal-inference", + "longitudinal", + "difference-in-differences", + "matching" + ], + "key_findings": "Using a quasi-experimental difference-in-differences design with propensity-score matched controls, the paper finds that autonomous coding agents produce heterogeneous effects contingent on prior AI exposure. Repositories without prior IDE-based AI usage (agent-first) experience large sustained velocity gains (+76.6% lines added) that persist for 6+ months, while repositories with prior IDE adoption (IDE-first) show minimal throughput increases that fade by t=6. Critically, both groups experience persistent increases in technical debt regardless of velocity outcomes: static-analysis warnings rise ~18–19% and cognitive complexity increases ~35–43%. The results suggest autonomous agents function as powerful but risky accelerators whose net value depends on context, with quality safeguards essential to prevent long-term maintainability problems.", + "red_flags": [ + { + "flag": "No funding disclosure", + "detail": "Missing funding source and competing interests statement. Unclear if CMU funding or industry sponsorship influenced study design or reporting." + }, + { + "flag": "Environment specifications absent", + "detail": "No Python version, requirements.txt, Dockerfile, or dependency list in paper. Replication claims rely on GitHub package but reproducibility from paper alone is impossible." + }, + { + "flag": "Pre-treatment imbalance", + "detail": "Authors acknowledge: 'isolated significant pre-treatment coefficients in static-analysis warnings and code complexity... suggesting untreated potential outcomes not fully captured.' Indicates matching did not fully balance groups; potential bias toward finding complexity increases." + }, + { + "flag": "No sample size justification", + "detail": "Sample sizes provided (401 AF, 117 IF treated repos) but no power analysis. Minimum thresholds (≥10 stars, ≥10 agentic PRs) motivated pragmatically, not statistically." + }, + { + "flag": "Agent versions not documented", + "detail": "Paper identifies 12 agent types but does not specify model versions, release dates, or parameter configurations. Observational study conflates heterogeneous tools without ablation." + }, + { + "flag": "No per-agent breakdown", + "detail": "Despite identifying Claude, Cursor, Devin, Copilot, etc., results are not stratified by tool. Aggregated effects may mask tool-specific benefits or harms." + }, + { + "flag": "Observation window short", + "detail": "Study covers Jan 2024–Nov 2025; agent adoption cluster (May–July 2025) means post-adoption follow-up is <6 months. Long-term technical debt trajectory unknown." + }, + { + "flag": "Pre-treatment trends in quality metrics", + "detail": "Figure 2 shows non-zero coefficients at t=−6 to t=−1 for complexity/warnings in some strata, suggesting parallel trends assumption may be violated." + } + ], + "cited_papers": [ + { + "title": "Revisiting event study designs: robust and efficient estimation", + "authors": "Borusyak, Jaravel, Spiess", + "year": 2021, + "relevance": "Methodological foundation: imputation-based DiD estimator used for causal inference under staggered adoption." + }, + { + "title": "Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects", + "authors": "He, Miller, Agarwal, Kastner, Vasilescu", + "year": 2026, + "relevance": "Prior work on same research question for Cursor IDE; methodology and findings replicated/extended to broader agent ecosystem." + }, + { + "title": "The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering", + "authors": "Li, Zhang, Hassan", + "year": 2025, + "relevance": "Related survey/overview of agentic coding adoption and impacts." + }, + { + "title": "On the use of agentic coding: An empirical study of pull requests on GitHub", + "authors": "Watanabe et al.", + "year": 2025, + "relevance": "Parallel empirical work on agent-generated PRs; complementary evidence on agentic contribution patterns." + }, + { + "title": "How Much Does AI Impact Development Speed? an Enterprise-Based Randomized Controlled Trial", + "authors": "Paradis et al.", + "year": 2024, + "relevance": "Prior RCT on Copilot productivity impacts; contrasts with observational design and open-source context here." + }, + { + "title": "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity", + "authors": "Becker, Rush, Barnes, Rein", + "year": 2025, + "relevance": "Controlled study of agent impacts on experienced developers; complements large-scale longitudinal findings." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Findings directly inform adoption decisions (agent-first vs. IDE-first strategies) and quality safeguard requirements. However, study is specific to open-source GitHub; applicability to enterprise, proprietary codebases, and team dynamics unclear." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Heterogeneous effects (agent benefits only as first-to-AI) and persistent quality costs despite velocity gains challenge uncritical enthusiasm. Speed-maintainability tradeoff is somewhat expected but data quantifying it is novel." + }, + "fear_safety": { + "score": 1, + "justification": "Raises concerns about long-term technical debt and maintainability burdens. Paper notes ethical considerations and need for oversight but does not emphasize AI risk per se; quality-focused rather than safety-focused." + }, + "demo_ability": { + "score": 0, + "justification": "Observational study with no interactive demo or hands-on artifact. Findings require building tools and analyzing massive GitHub datasets; not reproducible by individual practitioners without significant infrastructure." + }, + "drama_conflict": { + "score": 2, + "justification": "Implicit critique of uncritical agent adoption and hype. Finding that agents may not accelerate already-AI-rich teams and create technical debt challenges narratives but is not sensationalized or controversial by design." + }, + "brand_recognition": { + "score": 2, + "justification": "All authors from Carnegie Mellon University (respected institution). Published at MSR '26 (top-tier venue for software engineering empirical work). Rigorous methodology and large-scale dataset provide credibility. Not from FAANG or leading AI lab." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/chain-of-thought-prompting-2022/scan-v5.json b/papers/chain-of-thought-prompting-2022/scan-v5.json @@ -0,0 +1,542 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "authors": [ + "Wei, J.", + "Wang, X.", + "Schuurmans, D.", + "Bosma, M.", + "Ichter, B.", + "Xia, F.", + "Chi, E.", + "Le, Q.", + "Zhou, D." + ], + "year": 2022, + "venue": "NeurIPS 2022", + "arxiv_id": "2201.11903", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are directly supported: CoT improves arithmetic/commonsense/symbolic reasoning (Sections 3–5), and PaLM 540B achieves SOTA on GSM8K (Table 1, Figure 2) with exact numbers provided.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Section 3.3 ablation study isolates causal role of sequential natural language reasoning by testing equation-only, variable-compute-only (dots), and reasoning-after-answer variants, ruling out competing explanations for gains.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 6 and Appendix A.3 explicitly bound generalization: CoT is an emergent ability requiring ~100B+ parameter models, gains are minimal for easy single-step tasks, and conditions for benefit are stated.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 3.3 explicitly tests and rules out: variable computation alone, equation-only intermediate steps, and chain-of-thought provided only after the answer as alternative explanations for performance gains.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Section 6 explicitly states 'chain of thought emulates the thought processes of human reasoners' but 'does not answer whether the neural network is actually reasoning,' clearly distinguishing benchmark accuracy from genuine reasoning.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 (Discussion) contains a dedicated limitations paragraph listing four specific limitations: the open question of actual reasoning, annotation costs for finetuning, no guarantee of correct reasoning paths, and the large-scale requirement.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are addressed: CoT hurts models below ~10B parameters (Table 2), prompt engineering sensitivity is quantified with variance across annotators (Tables 6-7), and incorrect reasoning paths leading to correct answers are identified as a validity concern (Appendix D.1).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Appendix A.3 explicitly states conditions where CoT helps vs. does not: requires challenging multi-step reasoning, large model, and flat scaling curve; gains are minimal for easy problems where models already score >90%.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No explicit funding disclosure statement is present; all authors are Google Research employees but no formal funding acknowledgment appears in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All nine authors are explicitly identified as 'Google Research, Brain Team' in the paper header.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "All authors are Google employees evaluating Google's proprietary PaLM model, which achieves the headline SOTA result; Google benefits directly from positive outcomes.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests is included anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 2 explicitly defines 'chain of thought' as 'a series of intermediate natural language reasoning steps that lead to the final output' and distinguishes it from standard prompting with a concrete example.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states it explores chain-of-thought prompting—providing CoT demonstrations as few-shot exemplars—to unlock reasoning abilities in LLMs without finetuning, with the contribution positioned as combining rationale-augmented training and few-shot prompting.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 7 and Appendix C (Extended Related Work) engage substantively with five prior directions, explicitly situating CoT as orthogonal to instruction-following approaches (augments outputs vs. inputs) and distinct from finetuning-based rationale methods.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No source code is released; Appendix E.1 provides prompts and supplementary LaMDA predictions but no code for reproducing the experimental pipeline. GPT-3 experiments can be attempted via API with provided prompts but constitute replication, not released code.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All benchmarks used (GSM8K, SVAMP, ASDiv, AQuA, MAWPS, CSQA, StrategyQA, BIG-bench tasks, SayCan) are publicly available; synthetic datasets (last letter concatenation, coin flip) are provided in supplementary materials.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Appendix E.2 mentions TPU v3 (8x8) for LaMDA and TPU v4 (4x4x12) for PaLM inference but provides no software environment specs, dependency versions, or configuration files.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "LaMDA and PaLM are proprietary and inaccessible; while prompts are in Appendix G, Appendix E.1 acknowledges reproducibility is limited, and no step-by-step instructions for running experiments are provided even for GPT-3.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Standard deviations across 5 random seed orderings are reported for LaMDA 137B in Tables 6 and 7; for GPT-3 and PaLM, single runs are used due to API cost, explicitly acknowledged.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests (t-tests, ANOVA, p-values) are applied to any comparative claims; all comparisons are presented as raw accuracy numbers.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute percentage point improvements are consistently reported throughout (e.g., PaLM 540B improves +39pp on GSM8K from 17.9% to 56.9%), providing meaningful effect size context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Benchmark evaluation set sizes are not justified statistically; the manual error analysis of 50 correct and 50 incorrect examples (Appendix D) has no power analysis or justification for the sample size.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Standard deviation across random seed orderings is reported for LaMDA 137B (Tables 6-7); for other models, single exemplar orders are used with explicit acknowledgment of this limitation.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Standard few-shot prompting (without chain of thought) is used as the primary baseline throughout all experiments, and prior supervised SOTA numbers from published work are included.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include contemporaneous finetuned models (Cobbe et al. 2021 for GSM8K, Jie et al. 2022 for SVAMP, Lan et al. 2021 for MAWPS) and the same underlying LLMs with standard prompting.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 3.3 presents three ablations with LaMDA and PaLM: equation-only prompting, variable-compute-only (dots equal to equation length), and chain-of-thought provided only after the answer.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": false, + "justification": "All quantitative evaluation uses accuracy (solve rate) as the sole metric; no efficiency, calibration, partial credit, or diversity metrics are reported.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "No formal human evaluation with external raters is conducted; the manual error analysis in Appendix D (50 correct, 50 incorrect outputs) is author inspection without inter-rater reliability measurement or formal evaluation protocol.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Standard benchmark evaluation splits are used; for BIG-bench tasks without training sets, the first 10 examples serve as exemplars and remaining examples form the evaluation set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by model family, scale, and dataset; MAWPS is stratified into SingleOp/SingleEq/AddSub/MultiArith subsets (Table 3) showing differential benefits by difficulty level.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix D.2 provides detailed categorized failure analysis of 50 incorrect LaMDA 137B outputs: 8% calculator errors, 16% symbol mapping errors, 22% one-step-missing errors, 54% semantic understanding failures, with concrete examples for each.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Table 2 shows CoT hurts performance for models below ~10B parameters across all model families; Table 3 shows minimal or negative gains for easy single-step MAWPS tasks; CSQA shows minimal gains for GPT-3.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-3 API names (text-ada-001, text-babbage-001, text-curie-001, text-davinci-002) and Codex (code-davinci-002) are specified; PaLM and LaMDA parameter counts (8B/62B/540B and 420M/2B/8B/68B/137B) are stated.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Appendix G (Tables 20-28) provides complete few-shot prompts for all nine task types including math word problems (three annotator variants), AQuA, last letter concatenation, coin flip, CSQA, StrategyQA, date understanding, sports understanding, and SayCan.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Greedy decoding is specified for all models; number of few-shot exemplars per task is stated (8 for most, 4 for AQuA, 6 for SayCan); token constraints on exemplar sampling are documented (≤60 tokens, ≤2 steps).", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; the method is standard few-shot prompting with CoT exemplars, which is the core method being studied rather than a scaffold around another system.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Exemplar selection criteria are documented (random sampling from training sets with length constraints); symbolic dataset generation is described using top-1000 names from namecensus.com; benchmark splits are specified.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Only LaMDA 137B inputs/targets/predictions are provided as supplementary zip; PaLM and GPT-3 raw outputs are not released, making the headline SOTA PaLM results independently unverifiable.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Prompt construction process is described (manual composition with no special instructions to annotators, random sampling from GSM8K training for robustness experiments); symbolic dataset generation procedure documented with source.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; standard public benchmarks are used and annotators are paper co-authors.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from prompt construction to model inference to accuracy evaluation is clearly described; calculator augmentation post-processing is detailed; LaMDA exact inputs/outputs provided in supplementary.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for PaLM, LaMDA, GPT-3, and Codex are not stated anywhere in the paper.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of potential overlap between model pretraining corpora and benchmark test sets; this is a significant omission given that proprietary model training data is undisclosed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The possibility that GSM8K, SVAMP, CSQA, or other benchmark examples appeared in pretraining data of any evaluated model is not addressed despite this being a known issue with large pretrained models.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants involved; NeurIPS checklist confirms N/A.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants involved; NeurIPS checklist confirms N/A.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Appendix E.2 describes hardware used (TPU v3 8x8 for LaMDA, TPU v4 4x4x12 for PaLM) but explicitly states 'we did not estimate the total amount of compute'; no inference cost, latency, or per-query cost is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware configurations are described (TPU v3 and v4 chip counts) but total compute budget in FLOPs, GPU-hours, or API cost is not provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Chain-of-thought prompting significantly improves LLM performance on arithmetic reasoning benchmarks", + "evidence": "PaLM 540B improves from 17.9% to 56.9% on GSM8K (+39pp), 69.4% to 79.0% on SVAMP; GPT-3 175B from 15.6% to 46.9% on GSM8K (Table 1)", + "supported": "strong" + }, + { + "claim": "Chain-of-thought reasoning is an emergent ability that only appears in models with approximately 100B+ parameters", + "evidence": "Table 2 shows CoT hurts or shows no benefit for models below ~10B parameters across LaMDA, GPT, and PaLM families; large gains only emerge at 100B+", + "supported": "strong" + }, + { + "claim": "PaLM 540B with CoT achieves state-of-the-art on GSM8K, surpassing finetuned GPT-3 with a verifier", + "evidence": "Figure 2 and Table 1: PaLM 540B CoT achieves 56.9% vs. prior best of 55% (finetuned GPT-3 + verifier, Cobbe et al. 2021)", + "supported": "strong" + }, + { + "claim": "Sequential natural language reasoning steps drive CoT gains, not variable computation or equation generation alone", + "evidence": "Section 3.3 ablation: variable-compute-only (dots) matches baseline; equation-only improves less than full CoT; reasoning-after-answer matches baseline", + "supported": "strong" + }, + { + "claim": "CoT prompting enables out-of-distribution length generalization in symbolic reasoning tasks", + "evidence": "Figure 8: PaLM 540B CoT achieves 94.8% on 3-word and 63.0% on 4-word last letter concatenation (OOD), vs. near 0% for standard prompting", + "supported": "moderate" + }, + { + "claim": "CoT improvements are robust to different annotators, exemplar sets, and exemplar orderings", + "evidence": "Section 3.4 and Tables 6-7: all annotator variants and GSM8K-sampled exemplars outperform standard prompting baseline, though coin flip variance across annotators is high (71.4%–99.6%)", + "supported": "strong" + }, + { + "claim": "CoT generalizes to commonsense reasoning, surpassing prior SOTA on StrategyQA and sports understanding", + "evidence": "Figure 7: PaLM 540B CoT achieves 77.8% on StrategyQA (vs. 69.4% prior best) and 95.4% on sports understanding (vs. 84% unaided human enthusiast)", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "Chain-of-thought prompting—augmenting few-shot exemplars with intermediate natural language reasoning steps—dramatically improves LLM performance on arithmetic, commonsense, and symbolic reasoning, with PaLM 540B achieving SOTA on GSM8K by surpassing finetuned models. The effect is an emergent property of model scale: CoT actually hurts models below ~10B parameters and only yields gains at ~100B+. Ablations rule out variable computation and simple equation generation as explanations, attributing gains to sequential natural language reasoning. Error analysis reveals 54% of incorrect outputs involve fundamental semantic understanding failures, and the paper frankly acknowledges that benchmark accuracy does not prove genuine reasoning capability.", + "red_flags": [ + { + "flag": "Proprietary model lock-in", + "detail": "The headline SOTA result (PaLM 540B) relies on a proprietary model inaccessible to outside researchers, making independent replication of the most important claim impossible." + }, + { + "flag": "No statistical significance tests", + "detail": "All comparative claims are made without p-values or confidence intervals; only LaMDA results include error bars (5 random seed orderings), and no significance test is applied to any comparison." + }, + { + "flag": "Training contamination unaddressed", + "detail": "No discussion of whether GSM8K, SVAMP, CSQA, or other benchmark test examples appeared in pretraining data; particularly concerning for proprietary models with undisclosed training corpora." + }, + { + "flag": "Self-evaluation of own model", + "detail": "All authors are Google Research employees, and the headline result showcases Google's PaLM model achieving SOTA; no independent verification of PaLM results is possible." + }, + { + "flag": "Single accuracy metric throughout", + "detail": "All quantitative results rely solely on accuracy (solve rate); no efficiency, calibration, robustness, or partial-credit metrics are reported across any of the 10+ benchmarks evaluated." + }, + { + "flag": "Causal mechanism largely speculative", + "detail": "Despite ablations ruling out proximate alternatives, Section A.1 acknowledges why model scale improves CoT remains 'certainly multi-faceted' and the preliminary error analysis is done on only 45 examples." + } + ], + "cited_papers": [ + { + "title": "Language Models are Few-Shot Learners", + "relevance": "Foundational few-shot prompting paper (GPT-3); establishes the standard prompting paradigm CoT extends and provides primary baseline; introduces GPT-3 models used in evaluation" + }, + { + "title": "Training Verifiers to Solve Math Word Problems", + "relevance": "Introduces GSM8K benchmark and finetuned verifier approach; primary prior SOTA that CoT surpasses; establishes the evaluation setting for the headline result" + }, + { + "title": "Emergent Abilities of Large Language Models", + "relevance": "Companion paper providing theoretical framework for understanding CoT as an emergent ability of scale; CoT prompting contributes a key case study to this work" + }, + { + "title": "Show Your Work: Scratchpads for Intermediate Computation with Language Models", + "relevance": "Closest prior work using intermediate steps via finetuning; CoT demonstrates similar gains with prompting alone, without gradient updates" + }, + { + "title": "Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems", + "relevance": "Pioneering natural language rationale approach for math via training from scratch; CoT achieves comparable results with few-shot prompting only" + }, + { + "title": "Self-Consistency Improves Chain of Thought Reasoning in Language Models", + "relevance": "Direct follow-up work improving CoT via majority voting over sampled generations; cited by this paper as showing CoT can be further enhanced" + }, + { + "title": "Scaling Laws for Neural Language Models", + "relevance": "Establishes scaling law context; CoT findings on emergent ability complicate simple scaling predictions by showing prompting technique matters beyond raw model size" + }, + { + "title": "Scaling Language Models: Methods, Analysis & Insights from Training Gopher", + "relevance": "Shows scaling alone is insufficient for arithmetic and reasoning tasks, directly motivating the need for CoT prompting as an additional intervention" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "CoT requires only prompt modification with no finetuning or model access; immediately applicable by any practitioner with API access to a large LLM using the exact prompts in Appendix G." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The emergent ability finding—that CoT actually hurts small models—challenged the assumption that prompting improvements scale smoothly, surprising the field at publication." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns are raised; the paper acknowledges factual incorrectness in generated chains but frames this as a limitation rather than a safety concern." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild competitive angle between Google PaLM and OpenAI GPT-3/Codex with PaLM claiming SOTA, but no explicit controversy or disagreement with prior work." + }, + "demo_ability": { + "score": 3, + "justification": "Practitioners can immediately reproduce CoT with GPT-3 API using the exact 8-exemplar prompts provided in Appendix G; no additional resources or infrastructure required." + }, + "brand_recognition": { + "score": 3, + "justification": "Google Brain team at NeurIPS 2022; became one of the most cited papers in the LLM prompting literature and underpins most subsequent work on reasoning with LLMs." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42711991", + "title": "Show HN: QwQ-32B APIs – o1 like reasoning at 1% the cost", + "points": 17, + "comments": 3, + "url": "https://news.ycombinator.com/item?id=42711991", + "created_at": "2025-01-15T15:29:12Z" + }, + { + "hn_id": "30988904", + "title": "Chain of Thought Prompting Elicits Reasoning in Large Language Models", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=30988904", + "created_at": "2022-04-11T14:08:01Z" + }, + { + "hn_id": "30112147", + "title": "Plume: Differential Privacy at Scale", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=30112147", + "created_at": "2022-01-28T08:34:48Z" + }, + { + "hn_id": "34053182", + "title": "Chain of Thought Prompting Elicits Reasoning in Large Language Models", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34053182", + "created_at": "2022-12-19T15:29:09Z" + } + ], + "top_points": 17, + "total_points": 22, + "total_comments": 4 + } +} +\ No newline at end of file diff --git a/papers/codex-humaneval-2021/scan-v5.json b/papers/codex-humaneval-2021/scan-v5.json @@ -0,0 +1,569 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Evaluating Large Language Models Trained on Code", + "authors": [ + "Chen, M.", + "Tworek, J.", + "Jun, H.", + "Yuan, Q.", + "Pinto, H. P. d. O.", + "et al." + ], + "year": 2021, + "venue": "arXiv", + "arxiv_id": "2107.03374", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Core claims (28.8% pass@1, GPT-3 ~0%, GPT-J 11.4%) match Table 1 exactly; the 70.2%/100-sample claim is slightly inconsistent with Table 1's 72.31% for Codex-12B but is within plausible temperature-configuration variation.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims (fine-tuning on code improves performance, supervised fine-tuning further helps) are supported by controlled ablations comparing GPT vs. Codex vs. Codex-S across multiple model sizes.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Claims are consistently bounded to Python code generation from docstrings on HumanEval; the broader impacts section explicitly flags economic and societal speculation as preliminary.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No alternative explanations are discussed for why Codex outperforms GPT-3 (data volume vs. architecture vs. fine-tuning distribution); the alignment appendix discusses robustness vs. misalignment but the main results section lacks this.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly argues pass@k (functional correctness via unit tests) is superior to BLEU and demonstrates empirically that BLEU scores do not reliably distinguish correct from incorrect solutions (Figure 8).", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 is a dedicated Limitations section covering docstring length degradation, variable binding failures, and sample efficiency relative to human programmers.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific, quantified threats are provided: exponential pass-rate degradation per additional chained operation (Figure 11, factor of 2-3 per step), concrete variable-binding failure examples, and acknowledgment that unit test coverage averages only 7.7 tests per problem.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly scopes to standalone Python function synthesis from docstrings and acknowledges this is not representative of full software engineering (design, collaboration, debugging, upgrading stacks).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No formal funding disclosure; acknowledgments mention GitHub partnership and Microsoft Azure infrastructure but do not constitute a funding statement.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations with OpenAI, Anthropic (work performed while at OpenAI), and Zipline are explicitly listed in the author block.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "OpenAI employees evaluate their own model (Codex) which directly powers a commercial product (GitHub Copilot); the organization has clear financial interest in positive results.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial disclosure is included; the commercial relationship between Codex and GitHub Copilot is mentioned in passing but not formally declared as a conflict.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Pass@k is formally defined with an unbiased estimator (Section 2.1), functional correctness is defined and contrasted with match-based metrics, and 'alignment' is operationalized in Appendix E with sufficient and necessary conditions.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states it introduces Codex, the HumanEval benchmark, and an improved pass@k estimator; the relationship to GitHub Copilot is also stated explicitly.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 8 provides comprehensive related work covering program induction, synthesis (SPoC, TransCoder, RobustFill), neural code models (CodeBERT, PyMT5), and prior benchmarks (APPS, CodeSearchNet, CodeXGLUE), situating Codex's contributions clearly.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "HumanEval benchmark and evaluation framework are released at github.com/openai/human-eval; alignment evaluation data released at github.com/openai/code-align-evals-data.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "HumanEval (164 problems with unit tests) is publicly released; training data (GitHub Python) is not packaged but the evaluation benchmark is fully available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt or Dockerfile provided; gVisor sandbox is described at a conceptual level but reproduction requires proprietary model access not available to the public.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Evaluation code is released but the Codex model is proprietary (API-only), making full end-to-end reproduction impossible; benchmark results cannot be independently verified.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals are reported around any pass@k estimates; the unbiased estimator is described but uncertainty bounds on the point estimates are absent throughout.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are conducted for any comparative claims between Codex and baselines (GPT-J, GPT-Neo, Tabnine).", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute pass@k percentages with baseline context are reported for all model comparisons (e.g., Codex-12B 28.81% vs. GPT-J-6B 11.62%), providing interpretable effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "HumanEval's 164 problems are not justified by power analysis; sample size is described as driven by hand-authoring constraints rather than statistical reasoning.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No standard deviations or variance across multiple model runs are reported; only point estimates for pass@k are given in all tables and figures.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "GPT-3, GPT-J-6B, GPT-Neo (125M, 1.3B, 2.7B), and Tabnine (commercial) are all included as baselines in Table 1.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "GPT-J and GPT-Neo were state-of-the-art open-source models at time of publication (2021); Tabnine is a leading commercial autocomplete tool, providing a practical comparator.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Ablations include Codex vs. Codex-S (supervised fine-tuning effect), 8 model sizes (12M to 12B), fine-tuning from GPT vs. random init, and temperature effects on pass@k.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Pass@1, pass@10, pass@100, BLEU score, mean log-probability ranking, and back-translation score are all reported as evaluation metrics.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Human evaluation is conducted for docstring generation (Codex-D): 10 samples graded per problem across all 164 HumanEval problems, assessing whether docstrings uniquely and accurately specify the code body.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "HumanEval is explicitly hand-written after the May 2020 training data cutoff and kept entirely separate from training; APPS test split is also used as held-out evaluation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "APPS results are broken down by introductory/interview/competition difficulty (Table 2); synthetic tasks are broken down by number of chained operations (Figure 11).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6 provides code-level failure examples (do_work variable binding failure), Figure 11 quantifies degradation per chained operation, and Appendix E shows alignment failures with specific prompt examples.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results reported: fine-tuning from GPT showed no accuracy improvement over random init (only convergence speed), back-translation ranking underperforms mean log-probability (Figure 7), and Codex underperforms SAST tools at vulnerability detection.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "As the paper introducing the models, full specifications are provided: parameter counts (12M to 12B), training data source (159GB Python from GitHub, May 2020), and complete optimizer settings.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figure 2 shows actual example prompts with function signatures and docstrings; stop sequences are explicitly listed ('\nclass', '\ndef', '\n#', '\nif', '\nprint').", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Learning rate, 175-step linear warmup, cosine decay, Adam parameters (β1=0.9, β2=0.95, ε=10⁻⁸, weight decay=0.1), nucleus sampling top_p=0.95, temperature settings, and 100B token training budget are all reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; this evaluates a base code generation model without orchestration layers.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.1 documents all filtering criteria (auto-generated file removal, average line length >100, max line >1000, low alphanumeric percentage) and tokenizer modification for whitespace encoding.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "HumanEval benchmark with all 164 problems and unit tests is publicly released at github.com/openai/human-eval for independent verification of benchmark claims.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3.1 describes data collection in detail: 54M public GitHub repos, May 2020 snapshot, 179GB Python files under 1MB, filtering pipeline, resulting in 159GB final dataset.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Human graders for docstring evaluation are used (1,640 gradings) but their recruitment, qualifications, grading criteria details, and inter-rater reliability are not described.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from GitHub data collection through filtering, tokenization, training, and evaluation (including sandbox execution via gVisor) is documented across Sections 2-4.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "Training data was collected in May 2020, explicitly stated in Section 3.1.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly motivates hand-written HumanEval problems because 'our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources' including Codeforces.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "HumanEval is hand-written specifically after the May 2020 training cutoff to prevent overlap; APPS contamination is noted as a concern and motivates the 1-shot evaluation approach.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No controlled human participants study requiring pre-registration; docstring grading is internal evaluation.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects research requiring IRB approval.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in the main study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in the main study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in the main study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in the main study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in the main study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference latency, API cost, or per-query compute figures are reported for running Codex evaluations.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "The paper states GPT-3-12B pre-training consumed 'hundreds of petaflop/s-days' and Codex-12B fine-tuning 'consumed a similar amount'; Azure platform is identified.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Codex-12B achieves 28.81% pass@1 on HumanEval, far exceeding GPT-J-6B (11.62%) and GPT-3 (~0%)", + "evidence": "Table 1 provides exact pass@k numbers for all models across k=1, 10, 100 with consistent results", + "supported": "strong" + }, + { + "claim": "Repeated sampling is highly effective: Codex-12B achieves 72.31% pass@100 with oracle unit-test selection", + "evidence": "Table 1 shows the dramatic improvement from pass@1 to pass@100 is consistent across all model sizes", + "supported": "strong" + }, + { + "claim": "BLEU score is not a reliable indicator of functional correctness for code generation", + "evidence": "Figure 8 shows significant overlap in BLEU distributions between correct and incorrect Codex-12B solutions across 4 random HumanEval tasks", + "supported": "strong" + }, + { + "claim": "Supervised fine-tuning on curated standalone functions (Codex-S) improves pass@1 by 6.5pp and pass@100 by 15.1pp on average", + "evidence": "Section 4.5 reports these averages; Figure 10 shows the improvement is consistent across model sizes with one or two orders of magnitude parameter efficiency improvement", + "supported": "strong" + }, + { + "claim": "Codex frequently generates clearly insecure cryptographic code (RSA keys <2048 bits, ECB AES mode) at significant rates regardless of model size", + "evidence": "Figure 15 shows insecure configuration rates across model sizes for RSA and AES based on ~30k generated samples", + "supported": "strong" + }, + { + "claim": "Codex performance degrades exponentially with the number of chained operations in a docstring, dropping by a factor of 2-3 per added operation", + "evidence": "Figure 11 quantifies this degradation using 13 synthetic building blocks composed into chains of increasing length", + "supported": "strong" + }, + { + "claim": "Mean log-probability sample ranking outperforms random selection but underperforms oracle unit-test selection", + "evidence": "Figure 7 directly compares oracle, mean log-probability, back-translation, and random selection curves for Codex-12B", + "supported": "strong" + }, + { + "claim": "Code model test loss follows the same power-law scaling with model size observed in language models", + "evidence": "Figure 4 shows power-law fit with functional form (N/5.92×10^7)^-0.13 on held-out validation set", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "Codex, a GPT model fine-tuned on 159GB of public GitHub Python code, achieves 28.81% pass@1 on the new HumanEval benchmark—dramatically exceeding GPT-J (11.62%) and GPT-3 (~0%)—and 72.31% pass@100 with oracle unit-test selection, with Codex-S reaching 77.5% after supervised fine-tuning on curated functions. The paper demonstrates that BLEU score is a poor proxy for functional correctness, introduces an unbiased pass@k estimator that remains standard years later, and shows performance scales as a power law with model size. Critical safety findings include that Codex generates insecure cryptographic code at significant rates regardless of model size, exhibits misalignment (producing buggy code when prompted with buggy code despite having capability to produce correct code), and encodes societal biases from training data.", + "red_flags": [ + { + "flag": "Self-evaluation conflict", + "detail": "OpenAI employees evaluate their own proprietary model (Codex) powering a commercial product (GitHub Copilot); no independent external validation of results is included." + }, + { + "flag": "No uncertainty quantification", + "detail": "No confidence intervals, standard errors, or significance tests for any of the main pass@k comparisons between Codex and baselines across Tables 1 and 2." + }, + { + "flag": "Abstract-body inconsistency", + "detail": "Abstract claims 70.2% with 100 samples, but Table 1 shows Codex-12B at 72.31% pass@100; the source configuration for the 70.2% figure is not clearly identified in the paper." + }, + { + "flag": "Small benchmark (164 problems)", + "detail": "HumanEval contains only 164 problems with no statistical justification for this size; comparative differences of a few percentage points lack power to reach significance." + }, + { + "flag": "Proprietary model barrier", + "detail": "Full reproduction requires access to the proprietary Codex model; while evaluation code and benchmark are released, independent verification of model performance is not possible." + }, + { + "flag": "Human grading underdescribed", + "detail": "Docstring evaluation uses human graders for 1,640 assessments but provides no information on grader recruitment, qualifications, or inter-rater reliability." + } + ], + "cited_papers": [ + { + "title": "Measuring Coding Challenge Competence with APPS", + "relevance": "Key benchmark paper for evaluating code generation on competitive programming; used as secondary evaluation dataset and direct baseline comparator" + }, + { + "title": "Language Models are Few-Shot Learners (GPT-3)", + "relevance": "Foundation model that Codex is fine-tuned from; establishes the baseline that Codex dramatically improves for code generation" + }, + { + "title": "SPoC: Search-based Pseudocode to Code", + "relevance": "Introduces the pass@k metric concept and functional correctness evaluation for code synthesis; Codex adopts and extends this framework" + }, + { + "title": "Unsupervised Translation of Programming Languages (TransCoder)", + "relevance": "Establishes that functional correctness better captures code quality than BLEU for translation tasks, supporting Codex's methodological choice" + }, + { + "title": "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", + "relevance": "Prior code-NL model trained on docstring-function pairs; represents the state of the art for code understanding at time of publication" + }, + { + "title": "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search", + "relevance": "Large-scale GitHub corpus that established the multimodal code-NL dataset paradigm; predecessor to Codex's training approach" + }, + { + "title": "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model", + "relevance": "Key open-source baseline trained on The Pile (8% GitHub code); primary competitive comparator demonstrating Codex's advantage from code-focused training" + }, + { + "title": "Extracting Training Data from Large Language Models", + "relevance": "Shows LLMs can memorize and reproduce training data; directly cited in Codex's legal/privacy analysis regarding code reproduction from training" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly introduces GitHub Copilot's underlying model and HumanEval benchmark that remained the standard code evaluation for years." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that repeated sampling is 'surprisingly effective' (72% with 100 samples vs 29% with 1) and that BLEU is unreliable for code were non-obvious and highly cited." + }, + "fear_safety": { + "score": 2, + "justification": "Explicit security analysis showing Codex generates insecure cryptographic code, misalignment analysis, and polymorphic malware concerns raise concrete AI safety issues." + }, + "drama_conflict": { + "score": 1, + "justification": "Commercial relationship between OpenAI and GitHub Copilot creates implicit tension, but the paper itself is measured and academic in tone with no direct controversy." + }, + "demo_ability": { + "score": 3, + "justification": "GitHub Copilot based on Codex was publicly available at launch; HumanEval is released for immediate community use and reproduction." + }, + "brand_recognition": { + "score": 3, + "justification": "OpenAI, GitHub Copilot, and the Codex name have extremely high brand recognition; several authors (Sutskever, Amodei, Brockman) are prominent AI industry figures." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "27786283", + "title": "Evaluating Large Language Models Trained on Code", + "points": 12, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=27786283", + "created_at": "2021-07-09T17:39:30Z" + }, + { + "hn_id": "27767328", + "title": "Evaluating Large Language Models Trained on Code", + "points": 11, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=27767328", + "created_at": "2021-07-08T00:36:26Z" + }, + { + "hn_id": "27777657", + "title": "Evaluating Large Language Models Trained on Code (paper about GH copilot model)", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=27777657", + "created_at": "2021-07-08T21:10:26Z" + }, + { + "hn_id": "27770978", + "title": "Evaluating Large Language Models Trained on Code(GitHub Copilot)", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=27770978", + "created_at": "2021-07-08T12:20:59Z" + }, + { + "hn_id": "34552130", + "title": "Evaluating Large Language Models Trained on Code", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34552130", + "created_at": "2023-01-27T21:27:58Z" + }, + { + "hn_id": "29172572", + "title": "Measuring mathematical problem solving with the MATH dataset", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=29172572", + "created_at": "2021-11-10T09:00:47Z" + }, + { + "hn_id": "26070039", + "title": "On the Reproducibility of Neural Network Predictions", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=26070039", + "created_at": "2021-02-08T21:00:30Z" + } + ], + "top_points": 12, + "total_points": 36, + "total_comments": 3 + } +} +\ No newline at end of file diff --git a/papers/coding-agents-generating-2026/scan-v5.json b/papers/coding-agents-generating-2026/scan-v5.json @@ -0,0 +1,499 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Are Coding Agents Generating Over-Mocked Tests? An Empirical Study", + "authors": [ + "Andre Hora", + "Romain Robbes" + ], + "year": 2026, + "venue": "MSR '26", + "arxiv_id": "2602.00409", + "doi": "10.1145/3793302.3793362" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All quantitative claims in the abstract (60%, 23% vs 13%, 68%, 36% vs 26%, 95% mock type concentration) are directly backed by contingency tables and statistical tests in Section 3.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "The paper makes observational association claims ('more likely to') rather than causal claims; the study design is a mining study without intervention and the authors consistently use correlational language throughout.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 5 explicitly states 'our findings cannot be directly generalized to repositories written in other languages or using other agents,' bounding scope to Python, JavaScript, and TypeScript in 2025.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss key alternative explanations for why agents mock more — e.g., selection effects (agent-adopting repos may have more complex code requiring more mocking) or developer-preference confounds; only the 'easier to generate automatically' hypothesis is briefly proposed without evaluation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly distinguishes between what is measured (presence of mock identifiers in test commit diffs, validated at 94% precision) and what is claimed (mocking frequency tendencies of coding agents), and does not conflate commit counts with test quality.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5 'Threats to Validity' provides a dedicated limitations discussion covering detection precision, agent commit attribution, and generalization.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are quantified: 94% precision for mock detection (manually inspected 100 commits across 10 repositories), 100% precision for agent commit detection (500 manually inspected commits), and handling of Co-Authored-By variant casing.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds scope to three languages, three specific coding agents, commits from 2025, and repositories meeting stated criteria (≥100 commits, ≥5,000 non-blank LOC, not forks, recently active).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgments disclose funding from CNPq grants (408817/2024-0 and 403304/2025-3), CAPES, FAPEMIG, INES.IA, and the French State/IdEx université de Bordeaux.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated: Hora at UFMG (Brazil) and Robbes at Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI (France).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Funding comes from government and academic agencies (CNPq, CAPES, French State) with no affiliation to the coding agent companies (Anthropic, GitHub, Cursor) whose products are studied.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is included anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are explicitly defined: 'coding agents' (Section 2.1.1 — autonomous tools that invoke external tools, execute code, and author commits), 'test doubles/mocks' (Section 2.6 — Meszaros taxonomy: dummy, stub, spy, mock, fake), and 'agent commits', 'test commits', 'mock commits' are operationally defined.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contributions are explicitly stated: '(1) the first empirical study to analyze agent-generated tests in real-world software systems; and (2) multiple actionable implications for practitioners and researchers.'", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 6 engages substantively with prior work on coding agents (Becker et al., Kumar et al., Bouzenia & Pradel), LLM-generated test quality (Alshahwan et al., Ouédraogo et al.), and mocking practices (Spadini et al., Qin 2025), positioning this study as the first to examine mocking in agent-generated code at scale in the wild.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Section 2.7 explicitly states 'Our scripts and dataset are publicly available at: https://doi.org/10.5281/zenodo.17427638.'", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The dataset (commits and repository metadata) is publicly available on Zenodo at the stated DOI.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions using PyDriller and GitEvo but provides no requirements.txt, Dockerfile, or specific version numbers for any dependency.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper provides a Zenodo link for scripts but includes no step-by-step reproduction instructions; reproducing the pipeline would require inferring the full workflow from the methodology description.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported for any headline percentages (23%, 36%, etc.); only Chi-squared statistics, p-values, and Cliff's delta are provided.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Chi-squared tests of independence are applied for commit-level analyses in RQ1 and RQ2; paired Wilcoxon tests (with normality confirmed via Shapiro-Wilk and D'Agostino) are used for repository-level comparisons.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Cliff's delta effect sizes are reported for both repository-level comparisons: negligible for lower agentic activity repositories and small (0.252) for higher agentic activity.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No power analysis or principled sample size justification is provided; the sample of 2,168 repositories emerges from SEART selection criteria rather than any prospective sizing calculation.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Table 10 reports medians for repository-level mock commit ratios but provides no standard deviations, interquartile ranges, or other spread measures for any result.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Non-agent commits serve as the direct baseline throughout all three RQs, with explicit agent vs. non-agent proportions in every contingency table.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Non-agent commits are drawn from the same repositories and same time period (2025) as agent commits, making comparisons directly contemporary.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "This is an observational mining study, not a system design paper; ablation analysis is not applicable.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The paper uses commit-level ratios, repository-level proportions, Chi-squared statistics with standardized residuals, Wilcoxon p-values, Cliff's delta, and mock type distribution across all five test double categories.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Authors manually inspected 500 agent commits to validate classifier precision (100%) and 100 randomly selected mock commits across 10 repositories to validate mock detection precision (94%).", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is an observational mining study, not a prediction task; held-out test sets are not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by programming language (Python vs JS/TS) in Tables 5, 8, and 10, and by individual coding agent (Claude, Copilot, Cursor) in Tables 5 and 8.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "The browser-use example where agents added mocks despite explicit configuration to the contrary is a descriptive observation, not a systematic discussion of failure modes or when the methodology breaks down.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports null results: no notable language difference in mock rates (Python 37% vs JS/TS 35%), and negligible Cliff's delta for lower-agentic-activity repositories despite a statistically significant Wilcoxon result.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": false, + "answer": false, + "justification": "This is a mining study that does not run LLM inference; the specific versions of Claude Code, Copilot, and Cursor active during studied commits cannot be determined from commit metadata and are not reported.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "No LLMs are invoked in the authors' analysis pipeline; the study mines existing commit data rather than querying models.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": false, + "answer": false, + "justification": "No models are run by the authors; hyperparameters are not applicable to a repository mining study.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "The authors do not deploy agentic scaffolding; they analyze traces left by existing coding agents in real repositories.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing is thoroughly documented: SEART selection criteria (Section 2.2), agent file detection patterns (Table 1), commit author/co-author matching logic (Section 2.4), test file patterns (Table 2), mock identifier detection rules (Section 2.6.1), and mock commit classification (Section 2.6.2) are all fully specified.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The dataset is publicly available at Zenodo (doi.org/10.5281/zenodo.17427638) as explicitly stated in Section 2.7.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is thoroughly described: SEART tool and selection criteria, filtering from 114,098 to 2,168 repositories, cloning for agent file detection, and commit metadata parsing for all three classification steps.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; this is a repository mining study using automated collection from GitHub via the SEART tool.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline is documented across Sections 2.2–2.7: SEART selection → language/agent filter → agent commit detection → test commit detection → mock commit detection → RQ analysis, including the tools used (PyDriller, GitEvo).", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "This is not a benchmark evaluation of model capabilities; no models are evaluated on test sets.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable; this is a mining study, not a model capability evaluation.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable; no benchmarks are used for model evaluation.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants; inclusion/exclusion criteria apply to repositories and are fully documented in Section 2.2.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": false, + "answer": false, + "justification": "No LLM inference is performed by the authors; this is a repository mining study.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not report the computational cost of cloning and analyzing 2,168 repositories and 1.2 million commits, which is non-trivial.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "23% of commits made by coding agents add or modify test files, compared with 13% for non-agents", + "evidence": "Table 4: 11,035/48,563 agent commits are test commits vs 158,326/1,206,315 non-agent commits; Chi-squared = 3,683.06, p < 0.001, standardized residual = 55.35", + "supported": "strong" + }, + { + "claim": "60% of repositories with agent activity also contain agent test activity", + "evidence": "Table 4 and Section 3.1: 729 out of 1,219 repositories with agent commits also contain agent test commits", + "supported": "strong" + }, + { + "claim": "36% of test commits made by coding agents add mocks, compared with 26% for non-agents", + "evidence": "Table 7: 3,934/11,035 agent test commits are mock commits vs 40,966/158,326 non-agent test commits; Chi-squared = 505.5, p < 0.001", + "supported": "strong" + }, + { + "claim": "In repositories with higher agentic activity (≥50 agent commits), agents have a significantly higher mock ratio (36%) than non-agents (28%) with small effect size", + "evidence": "Table 10b: Wilcoxon p < 0.001, Cliff's delta = 0.252 across 179 repositories; lower-agentic-activity repos show negligible effect despite statistical significance", + "supported": "moderate" + }, + { + "claim": "Coding agents predominantly use the 'mock' type (95%) while non-agents use a wider variety — fake (57%), spy (51%), mock (91%)", + "evidence": "Figure 5: Distribution of mock types across 496 repositories with agent mock activity; agents are concentrated on the generic mock type while non-agents show broader distribution", + "supported": "strong" + }, + { + "claim": "Repositories created in 2025 show a higher share of agent test commits (17%) and mock commits (19%) compared to the full dataset (7% and 9%)", + "evidence": "Tables 6 and 9: For 2025-created repos, 4,526/26,654 test commits are agent commits (17%) and 1,529/7,855 mock commits are agent commits (19%)", + "supported": "strong" + }, + { + "claim": "Mock-related instructions in agent configuration files are far less common than test instructions, suggesting a guidance gap", + "evidence": "Table 12: GitHub Code Search finds 13k CLAUDE.md files with 'mock' vs 102k with 'test' out of 112k total; causal link between guidance and behavior not established", + "supported": "weak" + } + ], + "methodology_tags": [ + "observational", + "case-study" + ], + "key_findings": "Coding agents are significantly more likely to modify test files (23% vs 13% of commits) and add mocks to those tests (36% vs 26%) than non-agent contributors, with both differences statistically significant (p < 0.001) and the mock difference confirmed in a paired within-repository analysis (Cliff's delta = 0.252, small). Agents show markedly less diversity in test double types, relying almost exclusively on the generic 'mock' type (95%) compared to non-agents who also commonly use 'fake' (57%) and 'spy' (51%). The proportion of agent-generated tests and mocks is growing rapidly, accounting for 17–19% of recently created repositories' test/mock commits vs 7–9% overall. The paper finds that mock guidance in agent configuration files (e.g., CLAUDE.md) is uncommon, and agents occasionally add mocks even in repositories that explicitly prohibit it, suggesting configuration-based guidance has limited enforcement.", + "red_flags": [ + { + "flag": "Title implies quality judgment not demonstrated", + "detail": "The paper establishes that agents mock more frequently but cannot demonstrate this constitutes 'over-mocking' — no assessment of mock appropriateness, test effectiveness, bug-detection rates, or maintenance cost is included; the normative claim in the title exceeds the observational evidence." + }, + { + "flag": "Selection confound not fully addressed", + "detail": "Repositories adopting coding agents may systematically differ in type (newer projects, higher complexity, specific domains) creating selection effects that independently explain higher mocking rates; the paired within-repository analysis partially mitigates this but developer-preference confounds remain (agent-adopting developers may already favor mocking)." + }, + { + "flag": "Agent versions not tracked", + "detail": "Specific versions of Claude Code, Copilot, and Cursor active during the studied commits are not identified; since model updates change agent behavior rapidly, findings may not reflect current or future agent behavior." + }, + { + "flag": "No confidence intervals on main estimates", + "detail": "All headline percentages (23%, 36%, 95%, etc.) are reported as point estimates without confidence intervals, making precision of the key comparative claims unassessable." + }, + { + "flag": "Unknown recall of mock detection method", + "detail": "The identifier-based mock detection is validated only for precision (94%) but not recall; unknown false-negative rate could systematically bias the agent vs. non-agent comparison if agents use different naming conventions than the patterns searched." + } + ], + "cited_papers": [ + { + "title": "Promises, Perils, and (Timely) Heuristics for Mining Coding Agent Activity", + "relevance": "Foundational companion paper by same authors establishing the methodology for detecting agent commits via co-authorship metadata in real repositories — directly enables this study" + }, + { + "title": "Agentic Much? Adoption of Coding Agents on GitHub", + "relevance": "Under-submission companion paper measuring overall adoption rates of coding agents on GitHub, providing broader context for this study's scope and agent selection rationale" + }, + { + "title": "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity", + "relevance": "Becker et al. controlled experiment finding 19% task completion time increase despite 20% perceived productivity gain — key context for evaluating coding agent real-world effectiveness" + }, + { + "title": "To Mock or Not to Mock: Divergence in Mocking Practices Between LLM and Developers", + "relevance": "Direct predecessor: Qin 2025 compared GPT-4o mock decisions vs developers in a single system, finding LLMs generate more mocks; this paper scales that finding to real-world agent commits across thousands of repositories" + }, + { + "title": "Mock objects for testing Java systems: Why and how developers use them, and how they evolve", + "relevance": "Spadini et al. foundational empirical study of human mocking practices in Java; establishes baseline understanding for comparison with agent behavior" + }, + { + "title": "Use of test doubles in Android testing: An in-depth investigation", + "relevance": "Fazzini et al. study whose identifier-based mock detection methodology is directly adapted by this paper for detecting test doubles in commits" + }, + { + "title": "Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories", + "relevance": "Bouzenia & Pradel study of agent interaction logs from SWE-bench; related characterization of coding agent action patterns in software engineering tasks" + }, + { + "title": "The Rise of AI Teammates in Software Engineering: How Autonomous Coding Agents Are Reshaping Software Engineering", + "relevance": "Li et al. survey providing context on the broader adoption and capabilities of coding agents used to motivate the scope of this study" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable for anyone using Claude Code, Copilot, or Cursor — the recommendation to add mock guidance to CLAUDE.md configuration files is immediately applicable and the finding applies to millions of developers." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Finding that agents mock at 36% vs 26% and concentrate almost exclusively on the generic 'mock' type (95% vs 91%/57%/51% for non-agents) is a concrete, counterintuitive result about agent behavior that challenges assumptions of quality parity." + }, + "fear_safety": { + "score": 1, + "justification": "Tests with excessive mocking may mask integration bugs and allow code to drift from mock contracts, with software reliability implications, but no direct safety or security concerns are raised." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild controversy around AI-generated code quality; Kent Beck's LinkedIn quote adds human interest color but the paper is primarily technical without major conflict angles." + }, + "demo_ability": { + "score": 1, + "justification": "Scripts and dataset are available on Zenodo, but reproducing requires cloning thousands of GitHub repositories and running analysis scripts — not a quick demo, though practitioners can immediately apply configuration file guidance." + }, + "brand_recognition": { + "score": 2, + "justification": "Directly studies Claude Code, GitHub Copilot, and Cursor with data from Microsoft/VS Code, home-assistant/core, and Apache repositories — high brand recognition among software engineering practitioners." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/copilot-productivity-controlled-2023/scan-v5.json b/papers/copilot-productivity-controlled-2023/scan-v5.json @@ -0,0 +1,555 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "authors": [ + "Peng, S.", + "Kalliamvakou, E.", + "Cihon, P.", + "Demirer, M." + ], + "year": 2023, + "venue": "arXiv", + "arxiv_id": "2302.06590", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (55.8% faster task completion, heterogeneous benefits for less-experienced developers) are directly supported by experimental results reported in Figures 6 and Table 1.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Randomized controlled trial with random assignment to treatment/control satisfies causal inference requirements for the specific task tested. However, 70% attrition (only 35/95 completed) and lack of discussion of attrition bias weaken the internal validity.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Authors extrapolate findings to GDP-scale claims ('55.8% increase in productivity would imply significant cost savings in the economy') from a single greenfield HTTP server task tested on Upwork freelancers. Overgeneralizes far beyond the tested setting.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of alternative explanations like Hawthorne effect (treated group knew they had advantage), psychological motivation, or task-specific strengths of Copilot. Only one causal story is presented.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Paper measures task completion time but claims 'productivity.' While acknowledged in discussion as difficult to measure, the term 'productivity' is used throughout abstract and main text as direct synonym for completion time without consistent distinction.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Discussion section addresses generalization ('productivity benefits may vary across tasks and languages') and notes code quality not measured. However, limitations are brief and integrated rather than in dedicated section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Paper mentions generalization may vary but does not discuss specific validity threats: 74% attrition not mentioned, selection bias from Upwork freelancers not discussed, artificial task vs. real-world development not distinguished, and Hawthorne effect not addressed.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Generic boundaries stated ('standardized task rather than collaborative projects', 'not measuring code quality'). Missing explicit boundaries: greenfield-only task, JavaScript-specific, Upwork freelancer sample, Copilot-powered-by-Codex circa 2022.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No explicit statement of funding source. Affiliations (Microsoft Research, GitHub Inc., MIT) suggest Microsoft/GitHub funding but this is inferred, not disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations clearly listed: Microsoft Research, GitHub Inc., and MIT Sloan, revealing that evaluators work for the company whose product (Copilot) is being evaluated.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "All authors are employees of Microsoft and/or GitHub, evaluating GitHub Copilot (a Microsoft product). This is the opposite of funder independence; company employees evaluating their own product.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of potential financial interests (stock, patents, consulting relationships). Ethics approval noted but no COI management plan.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "Key term 'productivity' is not precisely defined. Used interchangeably with 'task completion time,' but this distinction is never made explicit. Task success and completion time are defined, but 'productivity' is not.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Explicitly stated in abstract: 'first controlled experiment to measure productivity of AI tools in professional software development.' Contribution is clear and well-positioned in introduction.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Paper cites prior work on AI tool perception (Barke, Finnie-Ansley, Sandoval) and notes gap in productivity research (Mozannar, Vaithilingam, Ziegler). Positions contribution as filling first-controlled-experiment gap. Engagement is sufficient though not deeply analytical.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code or materials released. Task administered via GitHub Classroom but no public repository or reproducible artifact package provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Raw experimental data (completion times, participant demographics, code submissions, test results) are not stated to be publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No environment specs (Node.js version, npm dependencies, test runner setup). Only mentions JavaScript, GitHub Copilot, and Codex without pinning versions or configuration details.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step instructions to reproduce the experiment. Task description shown in Figure 4 but no details on setting up local environment, running test suite, or repeating the experiment protocol.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Main result reports 95% CI [21%, 89%] for 55.8% speedup. Success rate reports CI [-0.11, 0.25]. Table 1 reports standard errors for heterogeneous effects.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Main result: t-test p=0.0017 (significant). Success rate: 95% CI includes zero (not significant). Table 1 reports t-statistics and p-values for heterogeneous effects.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Main effect: 55.8% reduction in completion time clearly reported with baseline context (71.17 min vs 160.89 min). Heterogeneous effects in Table 1 show coefficients but units are unclear (minutes? percentage points?).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No power analysis or sample size justification provided. Study recruited 95 but only 35 completed tasks and surveys (74% attrition), which is not discussed or justified.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Figure 6 shows distribution of completion times but standard deviations are not explicitly reported. Only means (71.17 min treatment, 160.89 min control) and outlier observations mentioned.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Control group (no Copilot, allowed to use Stack Overflow/internet) vs. treatment group (with Copilot). Clear baseline comparison.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Both groups tested simultaneously (May-June 2022), making baseline contemporary to treatment. Both used same task and test suite.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "NA — only one treatment component (Copilot presence/absence). Ablation not applicable.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Metrics: task completion time, task success rate (% passing 12 tests), heterogeneous effects across covariates, exit survey on perceived productivity and willingness to pay.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "System outputs (code) evaluated by automated test suite (12 checks), not human judges. Exit survey measures perceived productivity (subjective self-report) not code quality evaluation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Standard 12-test suite applied uniformly to all participants. Tests were visible to participants during development but fixed and un-alterable, serving as objective success criterion.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "No breakdown by task components (e.g., setup vs implementation vs debugging). Heterogeneous effects break down by participant characteristics (experience, hours/day, age) but not by task stages.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Paper notes '4 outliers above 300 min, all in control group' and mentions 7pp success rate difference (not significant) but does not analyze why tasks failed, what blockers participants hit, or failure patterns.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Non-significant finding reported: success rate difference is 7pp with 95% CI [-0.11, 0.25] (includes zero), indicating no statistically significant difference between groups on task completion.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "States 'Copilot powered by OpenAI's Codex' but no version pinning. No training data cutoff, no specific Codex snapshot date, no commit hash. Codex paper cited is from 2021; experiment was May-June 2022.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No actual Copilot prompts, suggestions, or interactions shown. No examples of what developers saw from Copilot or how they used it. Only mentions 1-minute intro video.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No hyperparameters reported for Copilot (temperature, top-p, sampling strategy) or experiment setup (time limits, task constraints, etc.).", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": false, + "justification": "Only scaffolding mentioned: 'treatment group watched 1-minute video introducing GitHub Copilot.' No detail on how Copilot was integrated into IDE, how suggestions were presented, or acceptance UI.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Data pipeline described: participants receive template repo (timestamp), implement code, each commit runs 12-test suite (automated), completion time = timestamp of first passing commit. Pipeline is clear though test suite details sparse.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Paper does not state that raw data (completion times, demographics, code submissions, test results) will be made available. No data repository or supplement mentioned.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection procedures described: entry survey for demographics, task administered via GitHub Classroom with automated timing, exit survey on productivity perception and WTP. Adequate but not exhaustive detail.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Recruitment via Upwork job posting clearly described. Job posting shown in Figure 1. Contract shown in Figure 2. Inclusion criteria ('professional programmers') stated but not formally defined.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline described: Upwork recruitment → random assignment → entry survey → GitHub Classroom task with automated testing → exit survey → analysis. Documented across sections rather than as single unified pipeline doc.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Paper states Codex is the underlying model but provides no explicit training data cutoff. Codex paper (2021) suggests pre-Sept 2021 training, but not stated in this paper.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Task is 'implement HTTP server in JavaScript'—a generic, common programming task almost certainly present in Codex's training data. Potential contamination not acknowledged or discussed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "NA — not evaluating on a published benchmark. Task is bespoke. However, task is generic enough to have high overlap with training data, which is not addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration mentioned. No OSF registration, clinical trial registration, or other pre-study registration documented.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": true, + "justification": "Explicitly states: 'Before we began recruitment, we received approval for the study from the Microsoft Research Ethics Review Board.'", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": true, + "justification": "Figure 5 comprehensively reports: age distribution, number of languages, education level, employment status, geography, yearly income, programming experience (years), daily coding hours.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "Paper states 'recruited 95 professional programmers' but does not formally define inclusion/exclusion criteria. Upwork posting in Figure 1 shows skill requirements but specifics of screening process not documented.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": true, + "justification": "States 'participants were randomly split into control and treatment groups' and Figure 2 contract implies random assignment. However, method (simple random, stratified, blocked) not specified.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "NA — blinding infeasible. Treatment group knew they had Copilot; control group did not. Experiment design prevents meaningful blinding.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": false, + "justification": "Only 35 out of 95 recruited participants completed task and surveys (74% attrition). Paper does not report, discuss, or analyze attrition. Critical threat to validity not addressed.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Paper does not report cost of using Copilot (per developer, per task, or monthly). Exit survey asks WTP but not actual cost structure or infrastructure expenses.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No computational budget reported for running the experiment or infrastructure costs (GitHub Classroom, testing infrastructure).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "GitHub Copilot reduces task completion time for implementing an HTTP server in JavaScript by 55.8%", + "evidence": "Controlled experiment: treated group 71.17 min avg, control group 160.89 min avg; p=0.0017; 95% CI [21%, 89%]", + "supported": "strong" + }, + { + "claim": "Less experienced developers benefit more from Copilot", + "evidence": "Heterogeneous effects Table 1: programming experience coefficient 8.23, p=0.0629 (directional but not significant at 0.05)", + "supported": "moderate" + }, + { + "claim": "Developers coding more hours per day benefit more from Copilot", + "evidence": "Table 1: hours per day coefficient -11.70, p=0.0168 (significant, negative = more hours → more benefit)", + "supported": "strong" + }, + { + "claim": "Older developers (25-44) benefit more from Copilot", + "evidence": "Table 1: age 25-44 coefficient -74.55, p=0.0303 (significant)", + "supported": "strong" + }, + { + "claim": "Copilot improves code quality", + "evidence": "Not measured. Paper explicitly states 'this study does not examine the effects of AI on code quality'", + "supported": "unsupported" + }, + { + "claim": "Results generalize to professional software development broadly", + "evidence": "Single greenfield task (HTTP server), Upwork freelancer sample (mostly India/Pakistan, low income), JavaScript-specific. Authors note generalization unclear but abstract overstates scope.", + "supported": "weak" + }, + { + "claim": "Treated group perceived greater productivity gain than control group", + "evidence": "Exit survey: treated group avg 35% perceived gain vs control 35% perceived gain. Both groups underestimated vs actual 55.8% (no significant difference reported between groups)", + "supported": "weak" + }, + { + "claim": "Treated group has higher willingness to pay for Copilot", + "evidence": "Exit survey: treated group avg $27.25/month WTP vs control $16.91/month; difference significant at 95% level", + "supported": "strong" + } + ], + "methodology_tags": [ + "rct", + "case-study" + ], + "key_findings": "Controlled randomized experiment (n=35 completers) shows GitHub Copilot reduces task completion time by 55.8% (95% CI: 21–89%, p=0.0017) for implementing an HTTP server in JavaScript. Heterogeneous effects show less experienced and older developers (25–44) benefit more, suggesting potential for skill development support. However, study design limitations (74% attrition unreported, single narrow task, unrepresentative Upwork sample) limit generalizability beyond the specific task tested. Code quality not measured, and massive conflict of interest (Microsoft/GitHub employees evaluating their own product) not disclosed.", + "red_flags": [ + { + "flag": "Massive unreported attrition", + "detail": "70 out of 95 recruited participants (74%) did not complete study or exit survey. Attrition is not discussed, analyzed, or treated as a validity threat. Risk of severe selection bias." + }, + { + "flag": "Undisclosed conflict of interest", + "detail": "All authors are employees of Microsoft/GitHub, evaluating GitHub Copilot (Microsoft product). No COI disclosure statement. Paper states ethics approval but does not address self-evaluation conflict." + }, + { + "flag": "Unrepresentative sample", + "detail": "Sample is mostly young (25–34), from India/Pakistan, low annual income ($10–19K), recruited via Upwork. Not representative of US software developer population earning $464.8B annually. Results may not transfer." + }, + { + "flag": "Single artificial task", + "detail": "Results from one greenfield HTTP server implementation in JavaScript. Real-world development includes debugging, reading existing code, collaboration, and maintenance—none tested." + }, + { + "flag": "Task plays to Copilot strengths", + "detail": "HTTP servers are a canonical task in training data. Copilot likely saw similar implementations during training. Results may overestimate benefit for domain-specific or novel tasks." + }, + { + "flag": "No code quality assessment", + "detail": "Paper does not measure whether Copilot-assisted code is faster but lower quality, introduces security issues, or trains bad practices. Incomplete evaluation of real productivity." + }, + { + "flag": "Hawthorne effect not ruled out", + "detail": "Treated group knew they had Copilot advantage; control group did not. Psychological motivation difference could inflate treatment effect independent of Copilot quality." + }, + { + "flag": "Success rate not significant", + "detail": "Treatment group only 7pp higher in success rate (completion), 95% CI [-0.11, 0.25] includes zero. Not a clear win on all dimensions despite speed gain." + }, + { + "flag": "No pre-registration", + "detail": "Study not pre-registered. Introduces risk of outcome selection, p-hacking, or post-hoc hypotheses being presented as a priori." + }, + { + "flag": "Control baseline unclear", + "detail": "Control group allowed to use Stack Overflow and internet. Comparison is Copilot vs developer+Stack Overflow, not isolated Copilot effect. Makes generalization ambiguous." + }, + { + "flag": "GDP extrapolation from single task", + "detail": "Discussion states '55.8% increase in productivity would imply significant cost savings in the economy and notable impact on GDP growth,' extrapolating from one greenfield task to 4.6M workers. Vast overgeneralization." + }, + { + "flag": "Model version not pinned", + "detail": "Codex version not specified, training cutoff not stated. Subsequent versions of Copilot/Codex may have different performance. Results not reproducible on exact same model." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code", + "relevance": "Codex paper describing underlying model for GitHub Copilot. Defines capabilities and training data." + }, + { + "title": "Grounded Copilot: How Programmers Interact with Code-Generating Models", + "relevance": "Empirical study of how developers use Copilot in practice. Complements this paper's task-based productivity measurement with naturalistic behavior." + }, + { + "title": "An Empirical Evaluation of GitHub Copilot's Code Suggestions", + "relevance": "Evaluates correctness of Copilot-generated suggestions. Addresses code quality question that this paper does not measure." + }, + { + "title": "The Robots Are Coming: Exploring the Implications of OpenAI Codex on Introductory Programming", + "relevance": "Studies impact of Codex on student programming education. Shows differential benefits for learning vs task completion." + }, + { + "title": "Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming", + "relevance": "Models cost-benefit of AI coding assistance from user behavior perspective. Extends productivity analysis to economic optimization." + }, + { + "title": "AI and Shared Prosperity", + "relevance": "Economic framework for AI labor impacts. Contextualizes this paper's productivity gains within labor market and inequality effects." + }, + { + "title": "Machine Learning Methods for Estimating Heterogeneous Causal Effects", + "relevance": "Athey & Imbens methodology for heterogeneous treatment effects, used in Table 1 analysis of differential benefits by developer type." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Copilot is a real tool practitioners use, but results on a single greenfield task don't directly inform when/where Copilot helps practitioners most in real codebases." + }, + "surprise_contrarian": { + "score": 1, + "justification": "55.8% speedup is striking numerically, but confirms rather than challenges conventional belief that coding assistants help. No surprising finding contradicting expectations." + }, + "fear_safety": { + "score": 1, + "justification": "Focuses entirely on productivity gains. Code quality not measured. Raises question but does not raise alarm about AI risks or safety concerns." + }, + "drama_conflict": { + "score": 2, + "justification": "Productivity gains from AI have economic significance and labor market implications. Discussion notes 4.6M jobs at risk, but undisclosed conflict of interest (company evaluating own product) is the real drama not highlighted." + }, + "demo_ability": { + "score": 3, + "justification": "GitHub Copilot is a real, publicly available product. Anyone can try it now (paid subscription). Paper findings directly demystify: users can replicate the HTTP server task and compare with/without Copilot." + }, + "brand_recognition": { + "score": 3, + "justification": "Microsoft Research, GitHub Inc. (owned by Microsoft), MIT Sloan. GitHub Copilot is flagship product from one of the largest tech companies. High institutional and product brand visibility." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44484075", + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (2023)", + "points": 6, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44484075", + "created_at": "2025-07-06T21:09:52Z" + }, + { + "hn_id": "35076049", + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=35076049", + "created_at": "2023-03-08T23:07:44Z" + }, + { + "hn_id": "40706181", + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40706181", + "created_at": "2024-06-17T14:52:08Z" + } + ], + "top_points": 6, + "total_points": 12, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/copilot-zoominfo-productivity-2025/scan-v5.json b/papers/copilot-zoominfo-productivity-2025/scan-v5.json @@ -0,0 +1,498 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Experience with GitHub Copilot for Developer Productivity at Zoominfo", + "authors": [ + "Gal Bakal", + "Ali Dasdan", + "Yaniv Katz", + "Michael Kaufman", + "Guy Levin" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2501.13282", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (33% suggestion acceptance, 20% lines acceptance, 72% satisfaction, four-phase methodology, 400+ developers, language-specific variations) are directly supported by Figures 2, 4, 9, and the methodology sections.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper asserts GitHub Copilot 'significantly contributed to productivity' and that '90% report time savings', but the study is purely observational with no control group, no pre/post comparison, and Section 6 explicitly defers causality to a future paper.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Claims are mostly bounded to 'medium-scale enterprise deployment' at Zoominfo, with explicit caveats that DORA metric causality is future work and that results align with (rather than supersede) prior industry reports.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No alternative explanations are considered for observed acceptance rates or satisfaction scores — Hawthorne effect, selection bias (voluntary, enthusiastic participants), or Zoominfo's organizational investment in the tool's success are never discussed.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Section 6 explicitly acknowledges acceptance rate is used as a proxy because 'the impact of GitHub Copilot on developer productivity seems difficult to measure' and cites the GitHub paper recommending it as a 'better predictor of perceived productivity.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 11 'Limitations: Observed and Potential' is a dedicated section listing contextual understanding failures, security concerns, creativity limits, and a set of potential future limitations.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Section 11 discusses limitations of the tool (domain-specific logic, security), not threats to the study's validity — selection bias, Hawthorne effect, voluntary participation skew, and lack of control group are never mentioned.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly scopes to 'medium-scale enterprise deployment' and states that causal relationships with DORA metrics are not yet established and will be reported separately.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source or acknowledgment section is present; the GitHub Copilot licenses were purchased by Zoominfo but this is not framed as a disclosure.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All five authors list Zoominfo affiliation and Zoominfo email addresses on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Zoominfo employees are evaluating a paid tool their company deployed; the organization has a financial and reputational interest in a positive outcome, making it not independent.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial disclosure appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3 defines 'developer productivity' as output per input unit; Section 6 defines 'acceptance rate of shown suggestions' precisely; Section 10 defines 'DevSat' as a net-sentiment score.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The introduction lists five explicit research questions and frames the contribution as a medium-scale enterprise deployment case study filling a gap in empirical evidence.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 12 is a comprehensive related work section that compares findings to GitHub's own productivity paper, ANZ Bank deployment, open-source studies, code correctness studies, and tool comparisons, situating the work within the literature.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No analysis scripts, survey instruments, or data processing code are released; the paper only references a ServiceNow workflow and GitHub's telemetry dashboard.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The acceptance rate telemetry and developer survey response data are not publicly released.", + "source": "haiku" + }, + "environment_specified": { + "applies": false, + "answer": false, + "justification": "This is an observational deployment study of a commercial tool; no experimental environment, dependencies, or software stack requiring specification exists.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": false, + "answer": false, + "justification": "Reproduction of a specific company's internal deployment study is not feasible, making this criterion not applicable.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Standard deviations are reported in Figure 2 for daily aggregate counts, but no confidence intervals are computed for the main reported acceptance rates (33%, 20%) or satisfaction score (72%).", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used despite comparative claims (e.g., language-to-language acceptance rate differences, IDE comparisons, satisfaction score claims).", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Acceptance rates (33% suggestions, 20% lines), time savings (20% median reduction), and satisfaction (72%) are reported as percentages with industry comparison context from GitHub and Google.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The trial used 126 of 400+ engineers ('about 32%') but no power analysis or justification for why this sample size is sufficient is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Figure 2 explicitly reports standard deviations for all daily metrics including suggestion counts and acceptance rates across the 26-day period.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": false, + "justification": "There is no within-study control condition; informal references to GitHub's and Google's reported acceptance rates serve as external comparisons but not controlled baselines.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": false, + "answer": false, + "justification": "No proper baselines are included in the study design, making this criterion not applicable.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "The study evaluates a single commercial tool as a monolith; ablation is not applicable.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The study uses suggestion acceptance rate, lines acceptance rate, developer satisfaction (DevSat), qualitative survey free-text, and per-language and per-IDE breakdowns.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Section 10 presents developer satisfaction surveys (Likert scale + free-form) where developers directly evaluate Copilot's outputs and impact on their work.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is a production deployment observational study, not a prediction task; held-out test sets are not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by programming language (Fig 5-7, 12 languages) and by IDE (Fig 8, JetBrains vs VS Code).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 11 describes observed failures (domain-specific logic, security risks, creativity limitations) and the qualitative section includes a negative developer quote and reports on cases requiring modification.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Lower acceptance rates for HTML, CSS, JSON, SQL are explicitly flagged and unexplained; qualitative negatives are quoted; 92% of generated tests failing outside test suites is cited from related work.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "The paper refers only to 'GitHub Copilot' by marketing name without specifying any model version, snapshot date, or which underlying LLM version was active during Nov-Dec 2024.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "GitHub Copilot is evaluated as a black-box IDE plugin; no custom prompts are constructed or controlled by the researchers.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": false, + "answer": false, + "justification": "Commercial black-box tool evaluation — hyperparameters are not accessible or configurable by the researchers.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding; GitHub Copilot is used as a standard IDE plugin without custom orchestration.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "The paper states data comes from GitHub Copilot's telemetry dashboard but does not document how weekend/weekday splits were computed, how languages were categorized, or how partial acceptances were handled beyond a brief definition.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "No raw telemetry data or survey response data is made publicly available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The paper describes data collection: telemetry from GitHub Copilot dashboard over Nov 14-Dec 9 2024 (26 days), and quarterly developer satisfaction surveys with Likert scale questions since Q2 2024.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Section 5.2 describes stratified voluntary sampling with explicit prerequisites (security training, compliance acknowledgments), formal application process, and tracking via unique participant identifiers for the 126-person trial.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The path from GitHub Copilot telemetry API to the reported figures is not documented; no data extraction, aggregation, or analysis scripts are described or released.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "This study measures developer acceptance rates in production use, not model capability on benchmarks; training cutoff is irrelevant.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not a benchmark evaluation; train-test overlap is not applicable.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "No benchmark evaluation is conducted; contamination is not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration is mentioned; this was an internal corporate evaluation, not a pre-registered academic study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "No IRB or ethics approval is mentioned despite collecting developer behavior data and survey responses from human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": false, + "justification": "Only geographic distribution (US, Europe, India, Israel) and broad technical role stratification are mentioned; no age, gender, years of experience, or other standard demographic breakdowns are reported.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": true, + "justification": "Section 5.2 lists explicit inclusion criteria: completion of security training, written acknowledgment of five compliance documents, and commitment to provide structured post-trial feedback.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": false, + "justification": "Participation was voluntary with stratified sampling; no randomization of participants to treatment/control conditions was used.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding was possible or attempted; all participants knew they were using and being evaluated on GitHub Copilot.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports 126 trial participants and 72 survey respondents, explicitly noting a 57% response rate, which constitutes attrition reporting.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "License procurement is mentioned but per-query cost, latency, or total inference cost is never reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": false, + "answer": false, + "justification": "No model training or self-hosted inference; compute budget is not applicable for a commercial SaaS tool evaluation.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Average acceptance rate of 33% for suggestions and 20% for lines of code over a 26-day production period", + "evidence": "Figure 2 and 4 show daily telemetry from Nov 14 to Dec 9, 2024 with averages computed across ~400 developers", + "supported": "strong" + }, + { + "claim": "Developer satisfaction with GitHub Copilot is 72%, the highest among all evaluated tools", + "evidence": "Figure 9 shows quarterly developer satisfaction survey results comparing GitHub Copilot against Jenkins, SonarQube, ArgoCD, and Backstage", + "supported": "moderate" + }, + { + "claim": "90% of surveyed developers report that GitHub Copilot reduces task completion time, with a median reduction of 20%", + "evidence": "Section 10 reports this from developer satisfaction surveys; self-reported, no objective time measurement", + "supported": "weak" + }, + { + "claim": "Top four languages (TypeScript, Java, Python, JavaScript) sustain approximately 30% acceptance rates", + "evidence": "Figure 5-7 show per-language breakdown; these four languages also cover ~80-85% of total suggestions", + "supported": "strong" + }, + { + "claim": "HTML, CSS, JSON, and SQL show meaningfully lower acceptance rates than general-purpose languages", + "evidence": "Figure 5 and 7 show rates ranging 14-32% with HTML/CSS/JSON/SQL at the lower end; no statistical test confirms significance", + "supported": "moderate" + }, + { + "claim": "GitHub Copilot significantly contributed to developer productivity at Zoominfo", + "evidence": "Acceptance rates and satisfaction surveys are cited; authors explicitly acknowledge in Section 6 that causality has not been established and is deferred to future work", + "supported": "weak" + } + ], + "methodology_tags": [ + "observational", + "case-study", + "qualitative" + ], + "key_findings": "A four-phase deployment of GitHub Copilot across 400+ Zoominfo developers yielded consistent acceptance rates of 33% (suggestions) and 20% (lines) over a 26-day production window in late 2024, with high developer satisfaction (72% DevSat, highest among evaluated tools). Language-specific variations were observed, with general-purpose languages achieving ~30% acceptance while HTML, CSS, JSON, and SQL underperformed; IDE differences were also noted (VS Code had ~50% higher lines acceptance rate than JetBrains). Developer surveys report 20% median time savings and high satisfaction with boilerplate/test generation, but causal attribution to actual productivity remains unestablished pending DORA metric analysis.", + "red_flags": [ + { + "flag": "Causal productivity claim without causal design", + "detail": "The paper claims Copilot 'significantly contributed to productivity' but uses only observational acceptance rates with no control group, pre/post design, or counterfactual. The authors themselves defer causal claims to future work." + }, + { + "flag": "Self-evaluating company employees", + "detail": "All authors are Zoominfo employees evaluating a tool their company paid for and deployed; no independence mechanism, no competing interests statement." + }, + { + "flag": "Voluntary participant selection bias", + "detail": "Trial participants were volunteers who applied and met compliance prerequisites — systematically more enthusiastic about the tool than average developers, biasing satisfaction and acceptance results upward." + }, + { + "flag": "Model version unspecified", + "detail": "The paper refers to 'GitHub Copilot' throughout without specifying any model version or snapshot date, making the evaluation unreproducible and temporally ambiguous." + }, + { + "flag": "No IRB for human study", + "detail": "Developer behavior and survey data were collected from human participants with no mention of ethics review or IRB approval." + }, + { + "flag": "No statistical significance testing", + "detail": "Language-to-language and IDE comparisons are presented as factual differences without any tests of statistical significance." + } + ], + "cited_papers": [ + { + "title": "Measuring GitHub Copilot's Impact on Productivity", + "relevance": "Foundational paper (Ziegler et al., CACM 2024) establishing acceptance rate as the primary productivity proxy metric — directly adopted by this study" + }, + { + "title": "The Impact of AI Tool on Engineering at ANZ Bank: An Empirical Study on GitHub Copilot within Corporate Environment", + "relevance": "Most directly comparable prior work: similar enterprise deployment, ~1000 engineers, controlled experiment design, reports 40-50% productivity boost" + }, + { + "title": "The SPACE of Developer Productivity: There's More to It Than You Think", + "relevance": "Framework paper defining multidimensional developer productivity metrics used to contextualize what Copilot's acceptance rates do and don't measure" + }, + { + "title": "The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot", + "relevance": "Quantifies Copilot's effect on open-source project productivity (+6.5% code contributions) with the negative finding of +42% integration time" + }, + { + "title": "GitHub Copilot AI Pair Programmer: Asset or Liability?", + "relevance": "Empirical evaluation of Copilot on algorithmic tasks, finding performance below human programmers — provides contrast to enterprise deployment positive results" + }, + { + "title": "An Empirical Evaluation of GitHub Copilot's Code Suggestions", + "relevance": "Finds ~60% correctness for Java and ~30% for JavaScript on LeetCode problems — directly relevant benchmark for interpreting Zoominfo's language-specific acceptance rates" + }, + { + "title": "DevEx: What Actually Drives Productivity", + "relevance": "Developer experience framework providing the conceptual basis for developer satisfaction as a productivity metric alongside DORA metrics" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable for engineering leaders evaluating Copilot: four-phase deployment methodology, compliance framework, and language-specific acceptance benchmarks are immediately applicable." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Findings largely confirm GitHub's own reported acceptance rates and prior enterprise studies; no surprising reversals or counter-intuitive findings beyond the unexplained weekend rate increase." + }, + "fear_safety": { + "score": 1, + "justification": "Security risks from auto-generated code are mentioned in limitations, but treated as a process concern rather than a serious safety finding." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy; positive tone throughout with company evaluating its own successful deployment." + }, + "demo_ability": { + "score": 2, + "justification": "GitHub Copilot is a widely available commercial product that practitioners can immediately try using the same IDE plugins described." + }, + "brand_recognition": { + "score": 2, + "justification": "GitHub Copilot is a high-recognition product; Zoominfo is a publicly traded enterprise software company with broad name recognition in B2B circles." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/cursor-speed-quality-tradeoff-2025/scan-v5.json b/papers/cursor-speed-quality-tradeoff-2025/scan-v5.json @@ -0,0 +1,581 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Speed at the Cost of Quality: How Cursor AI Increases Short-Term Velocity and Long-Term Complexity in Open-Source Projects", + "authors": [ + "Hao He", + "Courtney Miller", + "Shyam Agarwal", + "Christian Kästner", + "Bogdan Vasilescu" + ], + "year": 2026, + "venue": "MSR '26", + "arxiv_id": "2511.04427", + "doi": "10.1145/3793302.3793349" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (transient velocity gain, persistent quality degradation, GMM-identified velocity-quality feedback cycle) are directly supported by Table 2, Figure 3, and Table 3 with pre-trend tests passing.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The Borusyak et al. staggered DiD estimator with propensity score matching and pre-trend tests is an appropriate quasi-experimental design; paper is transparent about ITT interpretation and the Callaway & Sant'Anna estimator disagreement on quality outcomes.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Results are explicitly bounded to observable Cursor adoption in open-source GitHub repos dominated by TypeScript/Python/JavaScript during mid-2024 to mid-2025; Section 5.1.3 specifically discusses why enterprise findings may differ substantially.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.1.1 discusses excitement-frustration-abandonment cycle; Section 5.1.2 discusses velocity-driven codebase growth as mechanism for quality decline; robustness checks rule out confounds from other AI tools, repo inactivity, and selection bias.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper explicitly labels 'lines added' and 'commits' as velocity proxies with 'moderate-to-strong correlation with perceived productivity,' and states that static analysis warnings are 'estimates of the effort required to review potential issues' rather than confirmed defects.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 3.5 'Limitations and Threats to Validity' has two subsections (Internal Validity, External Validity) spanning over a full page with five specific internal threats identified.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include: adoption proxy bias (only repos committing .cursorrules), unknown usage intensity (ITT effects only), model and version heterogeneity, imperfect propensity score matching, and contamination from other AI coding tools.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Results are explicitly interpreted as impact of systematic Cursor adoption relative to 'current state-of-the-practice' (not versus no-AI baseline), bounded to open-source repos with observable adoption, and limited to the specific study period when tools were rapidly evolving.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgments disclose NSF grants 2206859, DGE214073, 2317168, 2120323; research awards from Google and Digital Infrastructure Fund; Google Cloud credits for BigQuery analysis.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All five authors are listed as Carnegie Mellon University; no author affiliation with Cursor/Anysphere or any competing tool vendor.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "NSF is clearly independent; Google provides cloud credits but the study finds negative results for a competing product (Cursor, not Google's tools), and authors are not Google employees.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No explicit competing interests declaration appears in the paper; standard funding acknowledgment is provided but no formal 'no competing interests' statement.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined include: Cursor's agentic capabilities vs. prior completion tools (Section 3.1.1), development velocity metrics with citations, SonarQube cognitive complexity definition (ref [32]), and DiD estimation targets ATT and ATTh with formal mathematical definitions.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "'Our contribution is two-fold': (1) first project-level DiD analysis of productivity gains from modern agentic coding assistant; (2) first comprehensive analysis of code quality impact from LLM agent assistant adoption.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 substantively engages with Copilot productivity RCTs, field experiments, and observational studies; directly positions against Becker et al. (contradicting finding) and Watanabe et al. (PR-level vs. project-level scope), explaining methodological differences.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Data Availability section states 'We provide a replication package for this paper at: https://doi.org/10.5281/zenodo.18368661'.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Replication package at zenodo DOI is provided; underlying GHArchive data is publicly accessible and the package presumably includes processed datasets.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Paper mentions 'a local SonarQube Community server' and GHArchive/BigQuery but provides no requirements.txt, Dockerfile, R package versions, or pinned dependency specification.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "Replication package at zenodo (10.5281/zenodo.18368661) is provided; by MSR convention such packages include README files with reproduction steps.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Table 2 reports standard errors for all ATT estimates with ± percentage bounds; Figure 3 shows confidence bands on all event-study plots for all five outcomes.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Heteroscedasticity- and cluster-robust Wald tests for pre-trend hypothesis testing; significance levels on all ATT estimates in Table 2; Sargan and AR(1)/AR(2) tests for GMM validity in Table 3.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Table 2 reports percentage changes with confidence bounds (e.g., +28.58% ±13.7% lines added, +41.64% ±7.62% code complexity) computed from log-transformed ATT estimates via 100(e^ATT - 1)%.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No formal power analysis; the 806 treated repos is determined by GitHub search results, and 1:3 matching ratio is justified by control diversity concerns rather than statistical power calculations.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Standard errors reported for all ATT estimates (Table 2), all GMM coefficients (Table 3), and confidence bands appear on all event-study figures.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "1,380 propensity-score-matched never-adopting GitHub repositories serve as the control group throughout all analyses.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Control repositories are matched from the same observation period (Jan 2024–Aug 2025) on dynamic covariate trajectories, ensuring contemporary comparison.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "Not applicable to this observational DiD study; multiple estimator comparisons (TWFE, Borusyak, Callaway & Sant'Anna) and robustness subsets serve an analogous sensitivity function.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Five outcome metrics used: commits and lines added (velocity), static analysis warnings, duplicate line density, and code complexity (quality).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Study measures automated repository metrics; human evaluation of outputs is not applicable to this observational design.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Causal inference study, not a prediction task; train/test split concept does not apply.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Appendix D breaks down SonarQube warnings by 20 categories pre/post adoption; Appendix C provides breakdowns by programming language (JS/TS, Python, Go) and by Cursor adoption cohort.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.1.1 discusses repos that abandoned Cursor post-adoption (excitement-frustration-abandonment cycle); Section 5.1.2 discusses code complexity increasing even when velocity is controlled.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "No significant effect on duplicate line density overall; velocity gains dissipate fully by month 3; Callaway & Sant'Anna yields non-significant negative estimates for quality outcomes, all reported without suppression.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": false, + "answer": false, + "justification": "Researchers don't run any LLMs; this is an observational study of repositories using Cursor. Model version heterogeneity is acknowledged as a study limitation.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "No LLM prompts used by the researchers; this is an observational study of existing repositories.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": false, + "answer": false, + "justification": "No LLM hyperparameters are used by the researchers.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Researchers study black-box adoption effects of Cursor; no agentic scaffolding is implemented by the research team.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.1 documents GitHub code search API with adaptive partitioning algorithm, ≥10 star filter, fork exclusion, propensity score logistic regression specification with equation, monthly GHArchive metric collection, SonarQube Community server setup, and log-transformation of all outcomes.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Replication package at zenodo (10.5281/zenodo.18368661) is provided; underlying GHArchive data is publicly accessible for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3.1 describes GitHub code search API queries with adaptive file-size partitioning, GHArchive monthly time series collection for 800k+ candidate repos per cohort, and SonarQube analysis procedure.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; repositories are the units of analysis selected by algorithmic criteria.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline documented: identify Cursor-adopting repos via .cursorrules files → filter by stars → collect GHArchive dynamic covariates → propensity score matching per cohort → monthly SonarQube analysis → DiD estimation → GMM panel analysis.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Study does not evaluate LLM capabilities on benchmarks; it measures repository-level behavioral effects of Cursor adoption.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable; no LLM benchmarking performed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable; no LLM benchmarking performed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants; study analyzes public GitHub repository data.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; study uses public repository data.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Human subject criteria not applicable; repository selection criteria are described algorithmically in Section 3.1.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants; treatment assignment is naturally occurring.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": false, + "answer": false, + "justification": "Researchers do not run LLMs; inference costs are borne by the studied repositories' developers and are not measurable in this observational design.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Google Cloud credits for BigQuery analysis are acknowledged but no specific compute cost or resource budget is reported for the SonarQube analysis pipeline running on 806+ repos.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Cursor adoption leads to a 281% increase in lines added in the first adoption month, with gains fully dissipating after 2 months", + "evidence": "Table 2 (overall ATT +28.58%) and Figure 3 (ATTh showing large spike at h=0,1 then returning to baseline), consistent across all three DiD estimators", + "supported": "strong" + }, + { + "claim": "Static analysis warnings increase persistently by ~30% post-Cursor adoption", + "evidence": "Table 2 (Borusyak: +30.26%), Figure 3 (sustained effect); BUT Callaway & Sant'Anna yields -10.49% non-significant (Table 6, Appendix B), a substantive divergence the paper attributes to small cohort sizes", + "supported": "moderate" + }, + { + "claim": "Code complexity increases persistently by ~41% post-Cursor adoption", + "evidence": "Table 2 (Borusyak: +41.64%), Figure 3; Callaway & Sant'Anna yields -3.80% non-significant (Table 6), same estimator divergence applies", + "supported": "moderate" + }, + { + "claim": "Accumulated technical debt subsequently reduces future development velocity, creating a self-reinforcing cycle", + "evidence": "Table 3 GMM estimates: code complexity → lines added coefficient -0.718 (p<0.001), static warnings → lines added -0.588 (p<0.001); instruments validated by Sargan p>0.05 and AR(2) p>0.05", + "supported": "moderate" + }, + { + "claim": "Cursor adoption causes inherently more complex code beyond what is explained by codebase size growth", + "evidence": "Table 3 GMM model for lines added → code complexity shows Cursor coefficient 0.086 (p<0.001) even controlling for lines of code; interpreted as ~9% baseline complexity increase attributable to Cursor itself", + "supported": "moderate" + }, + { + "claim": "Quality degradation effects are amplified, not attenuated, in repositories with more intensive Cursor usage", + "evidence": "Figure 4 Row 1: High Contributor Adoption and Cursor Configuration Changes subsets both show stronger quality effects than the full ITT sample", + "supported": "strong" + } + ], + "methodology_tags": [ + "observational" + ], + "key_findings": "A staggered difference-in-differences study of 806 Cursor-adopting open-source GitHub repositories finds that Cursor adoption produces substantial but transient velocity gains (281% increase in lines added in month 1, dissipating fully by month 3) alongside persistent technical debt accumulation (+30% static analysis warnings, +41% code complexity per Borusyak et al. estimator). Panel GMM analysis demonstrates this accumulated debt subsequently suppresses future development velocity, creating a self-reinforcing quality-velocity degradation cycle. Robustness checks confirm quality degradation is amplified in repos with intensive Cursor usage; however, the Callaway & Sant'Anna estimator yields non-significant negative estimates for all quality outcomes, substantially weakening causal confidence in the debt accumulation findings specifically.", + "red_flags": [ + { + "flag": "Estimator disagreement on primary quality claims", + "detail": "Callaway & Sant'Anna yields -10.49% (non-significant) for static analysis warnings and -3.80% (non-significant) for code complexity, directly contradicting the Borusyak et al. estimates of +30.26% and +41.64%. The paper attributes this to small per-cohort sample sizes but cannot resolve the disagreement, substantially undermining causal confidence in the quality degradation findings." + }, + { + "flag": "Table 2 commits significance inconsistency", + "detail": "Table 2 marks commits ATT=0.0260 with *** (p<0.001) despite SE=0.0429 (t-stat ~0.6) and the paper body stating 'there is no statistically significant effect for the volume of commits'; the *** appears to be a typographical error contradicting the text." + }, + { + "flag": "Adoption proxy validity", + "detail": "Treatment is identified only through committed .cursorrules files; developers can and do use Cursor without committing configuration files, creating an ITT design measuring 'systematic adoption' and introducing unknown selection bias toward more process-conscious adopters." + }, + { + "flag": "Lines added as AI-era velocity metric", + "detail": "Large increases in lines added may reflect AI-generated boilerplate, scaffolding, or verbose refactoring rather than meaningful feature development, making this proxy especially unreliable precisely in the AI-assisted context being studied — the paper does not address this circularity." + }, + { + "flag": "SonarQube metrics unvalidated for AI-generated code", + "detail": "Paper acknowledges 'complexity metrics were designed for human-written code; whether they appropriately penalize AI-generated patterns that are mechanically verifiable yet syntactically complex remains an open question,' undermining the interpretation of the code complexity outcome." + }, + { + "flag": "Warning breakdown analysis is non-causal convenience sample", + "detail": "Appendix D (20-category SonarQube breakdown) is explicitly described as a 'convenience sample' due to architectural pipeline limitations preventing precise per-version tracking, and the paper cautions it cannot be used for causal inference — yet it is cited in support of the main narrative." + } + ], + "cited_papers": [ + { + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "relevance": "Primary prior RCT showing 56% task completion speedup from Copilot; key baseline for productivity claims" + }, + { + "title": "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity", + "relevance": "Directly contrasted: controlled experiment showing Cursor does NOT help experienced OSS developers; complementary finding" + }, + { + "title": "The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot", + "relevance": "Prior observational DiD estimating 17.82% release increase from Copilot; direct methodological predecessor" + }, + { + "title": "The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot", + "relevance": "Similar observational design finding 6.5% project-level productivity increase; provides comparison estimate" + }, + { + "title": "On the use of agentic coding: An empirical study of pull requests on GitHub", + "relevance": "Studies Claude Code PR acceptance (83.8%) at PR level; this paper explicitly extends to longitudinal project-level effects" + }, + { + "title": "Revisiting event-study designs: Robust and efficient estimation", + "relevance": "Methodological foundation: the Borusyak et al. imputation DiD estimator used as primary causal identification strategy" + }, + { + "title": "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions", + "relevance": "Benchmark study establishing Copilot security vulnerability concerns; prior work motivating quality dimension analysis" + }, + { + "title": "The effects of generative AI on high skilled work: Evidence from three field experiments with software developers", + "relevance": "Field experiments at Microsoft/Accenture/Cisco finding 22-36% productivity increase; enterprise baseline for contrast with open-source findings" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses whether Cursor is worth adopting for development teams, with actionable findings about technical debt accumulation requiring quality-assurance process changes." + }, + "surprise_contrarian": { + "score": 3, + "justification": "Empirically challenges the '10x productivity' narrative with evidence of transient gains reversing to baseline plus persistent complexity debt, directly contradicting widespread practitioner enthusiasm." + }, + "fear_safety": { + "score": 1, + "justification": "Security warnings modestly increase (+1.98 per repo/month per Table 8) but the paper's focus is technical debt and maintainability, not critical safety risks." + }, + "drama_conflict": { + "score": 2, + "justification": "Targets a popular, well-funded product with negative longitudinal findings; internal estimator disagreement creates unresolved methodological tension the paper cannot fully explain." + }, + "demo_ability": { + "score": 1, + "justification": "Observational econometric study with no demo artifact; readers cannot readily experience or replicate the findings themselves." + }, + "brand_recognition": { + "score": 3, + "justification": "Studies Cursor (most popular AI IDE by adoption metrics cited), authored by CMU team with strong SE credentials (Kästner, Vasilescu), published at MSR '26." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47401734", + "title": "Speed at the cost of quality: Study of use of Cursor AI in open source projects (2025)", + "points": 147, + "comments": 80, + "url": "https://news.ycombinator.com/item?id=47401734", + "created_at": "2026-03-16T17:07:37Z" + }, + { + "hn_id": "38283398", + "title": "API-Driven Program Synthesis for Testing Static Typing Implementations", + "points": 35, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=38283398", + "created_at": "2023-11-15T22:19:08Z" + }, + { + "hn_id": "45968758", + "title": "Does AI-Assisted Coding Deliver? A Study of Cursor's Impact on Software Projects", + "points": 14, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=45968758", + "created_at": "2025-11-18T16:50:19Z" + }, + { + "hn_id": "46730534", + "title": "Does AI-Assisted Coding Deliver? A Study of Cursor on Software Projects", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46730534", + "created_at": "2026-01-23T09:54:11Z" + }, + { + "hn_id": "46658985", + "title": "Does AI-Assisted Coding Deliver? A Study of Cursor's Impact on Software Projects", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46658985", + "created_at": "2026-01-17T15:53:22Z" + }, + { + "hn_id": "45998822", + "title": "Does AI-Assisted Coding Deliver? A Difference-in-Differences Study", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45998822", + "created_at": "2025-11-20T22:36:21Z" + }, + { + "hn_id": "45951387", + "title": "Does AI-Assisted Coding Deliver? A Study of Cursor's Impact on Software Projects", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45951387", + "created_at": "2025-11-17T06:57:28Z" + }, + { + "hn_id": "42127507", + "title": "UniGAD: Unifying Multi-Level Graph Anomaly Detection", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42127507", + "created_at": "2024-11-13T16:32:30Z" + }, + { + "hn_id": "46180812", + "title": "Does AI-Assisted Coding Deliver? A Difference-in-Differences Study", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46180812", + "created_at": "2025-12-07T10:54:26Z" + }, + { + "hn_id": "46070691", + "title": "A Difference-in-Differences Study of Cursor's Impact on Software Projects", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46070691", + "created_at": "2025-11-27T16:21:41Z" + } + ], + "top_points": 147, + "total_points": 208, + "total_comments": 83 + } +} +\ No newline at end of file diff --git a/papers/data-contamination-benchmarks-2023/scan-v5.json b/papers/data-contamination-benchmarks-2023/scan-v5.json @@ -0,0 +1,595 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Investigating Data Contamination in Modern Benchmarks for Large Language Models", + "authors": [ + "Chunyuan Deng", + "Yilun Zhao", + "Xiangru Tang", + "Mark Gerstein", + "Arman Cohan" + ], + "year": 2023, + "venue": "arXiv", + "arxiv_id": "2311.09783", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims 52%/57% EM rates for ChatGPT/GPT-4 on MMLU are confirmed by Table 3. Claim about TruthfulQA commercial model performance is supported by Table 2.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper uses high TS-Guessing EM rates as evidence of contamination but cannot establish that contamination caused the performance—alternative explanations like statistical priors or reasoning ability are not fully ruled out despite filtering. The controlled contamination experiment validates the probe but not the original inference.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Claims are largely bounded to the specific benchmarks tested (MMLU, TruthfulQA, etc.) and specific models; the paper uses hedged language like 'may suspect' and 'raises concerns' rather than asserting contamination as fact.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper applies filtering to reduce reasoning-based guessing but does not systematically discuss whether models could achieve high EM rates through statistical priors about common wrong answers rather than memorization; the only alternative considered is direct reasoning, which is partially controlled.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly discusses TS-Guessing as an indicator of potential contamination rather than proof, and Section 5 explicitly notes TS-Guessing is 'less reliable since it relies on inferred knowledge rather than direct retrieval.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Limitations' is a dedicated section listing BM25-only indexing, 2-3 minute computation time per data point, superficial nature of text generation scores, and LLM instruction comprehension dependency.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Limitations are reasonably specific: BM25 retrieval may miss semantic overlap, TS-Guessing depends on models following instructions (open-source models often predict correct answer regardless), and the 0.65 Rouge-L threshold was empirically chosen without principled justification.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Section 5 explicitly compares retrieval vs TS-Guessing trade-offs, stating retrieval is 'generally more reliable' but requires training data access, while TS-Guessing is 'less reliable' and may not work for reasoning benchmarks.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "The acknowledgements section mentions colleagues and anonymous reviewers but no funding source is disclosed anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the first page: Georgia Institute of Technology, Yale University, and Allen Institute for AI.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding source is disclosed, making independence of funder unverifiable; authors are from academic institutions with no apparent commercial stake in the evaluated models.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Data contamination is operationally defined as benchmark data appearing in pretraining corpora; TS-Guessing is formally defined in Section 3.2 with mathematical formulations for both Question-based and Question-Multichoice settings.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states it contributes two methods: a retrieval-based IR system for open-source models and the TS-Guessing protocol applicable to both open and closed-source models.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 'Related Work' systematically situates the contribution against n-gram matching approaches (GPT-3, PaLM, LLaMA methods), corpus indexing tools (Dodge et al., Elazar et al.), and recent contamination detection methods (Golchin & Surdeanu, Oren et al., Shi et al.).", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository or release link is mentioned anywhere in the paper; only the Pyserini toolkit is cited as a dependency.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All benchmark datasets used (MMLU, TruthfulQA, PIQA, etc.) and pretraining corpora (The Pile, C4) are publicly available; the paper does not release custom data.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions using Pyserini but provides no requirements.txt, Dockerfile, or specific version information for any dependencies.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The methodology is described at a conceptual level but no step-by-step instructions sufficient to reproduce the experiments are provided.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 1-3 are reported as point estimates with no confidence intervals or error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims (e.g., GPT-4 57% vs ChatGPT 52% on MMLU, commercial vs open-source model differences) are made without any statistical significance testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Exact match rates (e.g., 52%, 57%) and Rouge-L F1 scores provide interpretable effect sizes with clear baselines (near-zero for open-source models).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 100-example human evaluation sample is not statistically justified; no power analysis is provided for any experiment in the paper.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No standard deviations, variance, or repeated run statistics are reported for any result.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Open-source models (LLaMA 2-13B, Mistral-7B) serve as implicit baselines showing near-zero contamination signals; the controlled contamination experiment (clean vs deliberately contaminated ChatGPT) provides an explicit comparison.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "LLaMA 2 (2023) and Mistral-7B (2023) are contemporary models at time of writing.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The retrieval system is ablated across query types (question-only, label-only, question+label) in Table 4; TS-Guessing is ablated across hint variants (no hint, type-hint, category-hint, URL-hint) in Table 2.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The retrieval system uses BM25, SacreBLEU, Rouge-L, BLEURT, and GPTscore; TS-Guessing reports both EM rate and Rouge-L F1.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "17 NLP volunteers evaluated 100 data points to validate IR system metrics, with inter-annotator agreement measured by Krippendorff's alpha (0.8673).", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "The paper's task is contamination detection methodology validation, not prediction; held-out test set is not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are reported separately for each benchmark (MMLU, TruthfulQA, HellaSwag, PIQA, etc.) across all tables.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses cases where the retrieval system fails (n-gram matching misses contaminated data found by retrieval), examples that are filtered out (Table 5), and why open-source models fail at TS-Guessing.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "LLaMA 2-7B and 13B show near-zero EM rates on TS-Guessing (0.00-0.04), and the paper reports that stronger models (GPT-4) do not necessarily outperform weaker ones (ChatGPT) on the probe.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Models are identified as 'ChatGPT (GPT-3.5-turbo)', 'GPT-4', 'Claude-instant-1-100k', 'Claude-2', 'LLaMa 2-13B', 'Mistral-7B' but no API snapshot dates are provided for any model.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figures 2a and 2b show the actual prompt templates for Question-based and Question-Multichoice settings with example questions and masking instructions.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, or other generation hyperparameters are reported for any model.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; models are prompted directly.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Filtering criteria are documented: removing Yes-No/True-False options, mathematical symbols, options with Rouge-L F1 > 0.65 between pairs, and TruthfulQA-specific filters for short questions and Indexical Error category.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The filtered benchmark subsets, retrieval results, and model outputs are not released; only the public source benchmarks are available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described: publicly available benchmarks are used and the filtering pipeline is documented in Section 4.2.1.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Human annotators are described as '17 volunteers with backgrounds in NLP' compensated at $9/hour, but recruitment method (colleagues, crowdsourcing, etc.) is not specified.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from benchmark selection → query construction → BM25 retrieval → 13-gram tokenization → scoring is documented in Section 3.1; TS-Guessing pipeline from data → filtering → keyword selection → masking → model querying is documented in Section 3.2.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "The paper states 'According to OpenAI, their training data is current up to September 2021' and uses this to analyze TruthfulQA contamination risk.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "This is the central topic of the paper; both methods directly measure or infer train-test overlap.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "The paper directly tests whether MMLU and other widely-used benchmarks were available before training cutoff and whether models exhibit memorization signatures.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration is mentioned.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "The Ethics Statement discusses compensation and public data use but mentions no IRB or ethics board approval.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": false, + "justification": "Annotators are described only as '17 volunteers with backgrounds in NLP'; no demographics (age, gender, institution, experience level) are reported.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "NLP background is mentioned implicitly but no formal inclusion/exclusion criteria are stated.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": false, + "justification": "No randomization of annotation assignments or conditions is described.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding procedures are described; annotators appear to have been aware of the task framing.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No attrition is applicable for a one-shot annotation task.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The paper mentions API calls to ChatGPT/GPT-4 for TS-Guessing but reports no dollar cost or token counts; latency of 2-3 minutes per data point is noted only for the retrieval system.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Disk space (~2-4TB) is mentioned but no GPU hours, cloud compute budget, or total cost is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "ChatGPT and GPT-4 can guess missing incorrect options in MMLU at 52% and 57% exact match rates respectively", + "evidence": "Table 3 shows MMLU EM rates: ChatGPT=0.52, GPT-4=0.57, compared to near-zero rates for LLaMA 2-13B (0.00) and Mistral-7B (0.01)", + "supported": "strong" + }, + { + "claim": "Deliberately contaminating ChatGPT with the MMLU test set causes EM rate to approach 100%", + "evidence": "Figure 4 shows fine-tuned ChatGPT achieves near-100% EM rate in both Question-based and Question-Multichoice settings, validating the probe's sensitivity", + "supported": "strong" + }, + { + "claim": "Stronger models do not show significantly higher TS-Guessing performance than weaker models", + "evidence": "GPT-4 outperforms ChatGPT by only 1% on MMLU EM; similar patterns hold for Claude-2 vs Claude-instant-1 (Table 2, Table 3)", + "supported": "moderate" + }, + { + "claim": "TruthfulQA exhibits significant contamination overlap with pretraining corpora", + "evidence": "Appendix C shows concrete example of TruthfulQA question substantially overlapping with C4 document (BM25 score 50.24); GPTscore shows highest scores for TruthfulQA-C4 overlap (Table 1)", + "supported": "moderate" + }, + { + "claim": "GPTscore aligns more closely with human judgment than traditional metrics (SacreBLEU, Rouge-L, BLEURT) for contamination detection", + "evidence": "Figure 3 shows GPTscore with highest Spearman correlation to human evaluation scores across 100 examples; exact correlation values not reported", + "supported": "moderate" + }, + { + "claim": "Open-source models (LLaMA, Mistral) show minimal contamination signals on MMLU via TS-Guessing", + "evidence": "Table 3 shows LLaMA 2-13B EM=0.00 and Mistral-7B EM=0.01 on MMLU, versus 0.52 and 0.57 for ChatGPT and GPT-4", + "supported": "strong" + }, + { + "claim": "n-gram matching is insufficient for detecting all contaminated data in pretraining corpora", + "evidence": "The retrieval-based system identifies contaminated examples that evaded n-gram tokenization detection, and Appendix C shows a high-overlap TruthfulQA example", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "The paper proposes two contamination detection methods: a BM25-based retrieval system for open training corpora and a novel TS-Guessing protocol for black-box models. The main finding is that ChatGPT and GPT-4 can guess missing incorrect options in MMLU at 52% and 57% exact match rates—far above open-source model baselines (near 0%)—suggesting MMLU may be contaminated in commercial model training data. A controlled experiment validates the probe: deliberately fine-tuning ChatGPT on the MMLU test set pushes EM rates to ~100%, confirming the method's sensitivity. TruthfulQA shows substantial lexical overlap with the C4 pretraining corpus, and GPTscore is identified as a better proxy for human-judged contamination than traditional NLP metrics.", + "red_flags": [ + { + "flag": "Internal inconsistency in reported EM rates", + "detail": "The abstract states ChatGPT achieves 52% and GPT-4 57% EM on MMLU; Table 3 confirms this. However, Section 4.2.2 body text states 'ChatGPT demonstrated...achieving a 57% Exact Match (EM) rate'—attributing GPT-4's score to ChatGPT." + }, + { + "flag": "Correlation conflated with contamination", + "detail": "High TS-Guessing scores are interpreted as evidence of contamination but alternative explanations (e.g., language models having statistical priors about common wrong answers in well-known datasets) are not rigorously ruled out beyond simple filtering." + }, + { + "flag": "No code or filtered data released", + "detail": "Neither the implementation of the retrieval system nor the filtered benchmark subsets are released, making it impossible to replicate exact results." + }, + { + "flag": "No statistical significance tests", + "detail": "All comparisons between models (ChatGPT 52% vs GPT-4 57%, commercial vs open-source) are made without significance tests despite the paper making strong comparative claims." + }, + { + "flag": "Empirically chosen threshold without validation", + "detail": "The Rouge-L 0.65 threshold for filtering correlated options is chosen 'based on initial experiments' (footnote 1) without cross-validation or sensitivity analysis." + }, + { + "flag": "Underpowered human evaluation", + "detail": "The correlation between automatic metrics and human judgment is assessed on only 100 data points with 17 annotators, of which only 23 were judged contaminated—too few for reliable metric ranking." + } + ], + "cited_papers": [ + { + "title": "Time Travel in LLMs: Tracing Data Contamination in Large Language Models", + "relevance": "Complementary contamination detection method at dataset-level granularity; directly compared and contrasted with this paper's approach" + }, + { + "title": "Proving Test Set Contamination in Black Box Language Models", + "relevance": "Prior black-box contamination detection using canonical ordering; limitation noted as dataset-level only" + }, + { + "title": "Detecting Pretraining Data from Large Language Models", + "relevance": "Min-k% probability method for contamination detection; requires model internals access, contrasting with TS-Guessing" + }, + { + "title": "What's In My Big Data?", + "relevance": "Analysis of contamination in GLUE/SuperGLUE benchmarks in open training corpora; foundation for this paper's retrieval approach" + }, + { + "title": "NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark", + "relevance": "Motivating work documenting the severity of the contamination problem for NLP evaluation" + }, + { + "title": "Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks", + "relevance": "Mitigation strategies for the contamination problem this paper diagnoses" + }, + { + "title": "Measuring Massive Multitask Language Understanding", + "relevance": "MMLU benchmark—primary dataset used to demonstrate high contamination signals in commercial models" + }, + { + "title": "Data Contamination Through the Lens of Time", + "relevance": "Temporal approach to contamination detection using pre/post-training data release dates" + }, + { + "title": "Data contamination: From memorization to exploitation", + "relevance": "Studies correlation between pretraining memorization and downstream task performance, directly relevant to this paper's thesis" + }, + { + "title": "Investigating data contamination for pre-training language models", + "relevance": "Concurrent work investigating contamination-performance correlation, cited as related approach" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly challenges validity of widely-used benchmark scores; any practitioner using MMLU results to compare models should know about potential contamination." + }, + "surprise_contrarian": { + "score": 3, + "justification": "52-57% EM rate for guessing wrong options in MMLU—a task with many possible answers—is genuinely surprising and undermines trust in published LLM benchmarks." + }, + "fear_safety": { + "score": 1, + "justification": "Raises concerns about misleading capability assessments but does not address direct AI safety risks." + }, + "drama_conflict": { + "score": 2, + "justification": "Implicates OpenAI's flagship models (ChatGPT, GPT-4) in potential training data leakage from widely-used benchmarks, with contrast against transparent open-source alternatives." + }, + "demo_ability": { + "score": 2, + "justification": "TS-Guessing can be replicated by anyone with ChatGPT/GPT-4 API access using the prompt templates shown in Figure 2." + }, + "brand_recognition": { + "score": 3, + "justification": "Evaluates ChatGPT, GPT-4 (OpenAI), Claude-2/Claude-instant (Anthropic), LLaMA 2 (Meta), Mistral—all high-profile models with significant public recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "40229022", + "title": "When can transformers reason with abstract symbols?", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40229022", + "created_at": "2024-05-01T20:34:43Z" + }, + { + "hn_id": "29493664", + "title": "Training Neural Networks with Fixed Sparse Masks", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=29493664", + "created_at": "2021-12-09T03:58:40Z" + }, + { + "hn_id": "42299972", + "title": "Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42299972", + "created_at": "2024-12-02T20:17:28Z" + }, + { + "hn_id": "42171440", + "title": "Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equations", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42171440", + "created_at": "2024-11-18T11:19:17Z" + }, + { + "hn_id": "39313991", + "title": "Information content of note transitions in the music of J. S. Bach", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39313991", + "created_at": "2024-02-09T12:09:43Z" + }, + { + "hn_id": "33625737", + "title": "Optimal sizing of seasonal renewable energy storage considering degradation", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=33625737", + "created_at": "2022-11-16T16:25:26Z" + }, + { + "hn_id": "38576071", + "title": "Large Language Models on Graphs: A Comprehensive Survey", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38576071", + "created_at": "2023-12-08T23:15:13Z" + }, + { + "hn_id": "37771757", + "title": "Can LLMs provide useful feedback on research papers? A broad empirical analysis", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37771757", + "created_at": "2023-10-04T21:03:08Z" + }, + { + "hn_id": "35284995", + "title": "Self Supervision Does Not Help Natural Language Supervision at Scale", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35284995", + "created_at": "2023-03-24T04:12:16Z" + }, + { + "hn_id": "29320363", + "title": "ClipClap: Image Captioning with Clip Encoder and GPT2 [pdf]", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=29320363", + "created_at": "2021-11-23T17:15:20Z" + } + ], + "top_points": 3, + "total_points": 17, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/data-distributional-properties-2022/scan-v5.json b/papers/data-distributional-properties-2022/scan-v5.json @@ -0,0 +1,581 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Data Distributional Properties Drive Emergent In-Context Learning in Transformers", + "authors": [ + "Stephanie C. Y. Chan", + "Adam Santoro", + "Andrew Kyle Lampinen", + "Jane X. Wang", + "Aaditya K Singh", + "Pierre H. Richemond", + "James L. McClelland", + "Felix Hill" + ], + "year": 2022, + "venue": "Neural Information Processing Systems", + "arxiv_id": "2205.05055", + "doi": "10.48550/arXiv.2205.05055" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims (burstiness drives ICL, Zipfian distribution enables coexistence of ICL and in-weights learning, recurrent models fail) are supported by experimental results in Figs 2–7.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper uses controlled experimental manipulation of individual distributional properties (burstiness, class count, label multiplicity) while holding others constant, which is adequate for causal inference in this controlled laboratory setting.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The abstract claims findings explain ICL in 'large language models,' but experiments use small transformers (dim 64) trained from scratch on Omniglot image classification—a significant extrapolation not bounded to the tested setting.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper examines whether recurrent model inferiority on ICL is explained by a compensating bias toward in-weights learning (Appendix C.2/Fig 8), and directly engages competing theories from Min et al. and Razeghi et al.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "ICL is measured using a standard 4-shot 2-way evaluation on holdout classes with randomly reassigned labels, which directly operationalizes the claimed capability without proxy substitution.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; Section 4 is a discussion of implications and future directions, not an honest accounting of what the study does not show.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats are discussed; Appendix B mentions the constraint on novel labels but frames it as a future extension opportunity rather than a validity concern for the current study.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper uses hedging language ('may explain,' 'could allow') but never explicitly states what the results do NOT show or explicitly bounds them to the small-transformer/Omniglot setting.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "The acknowledgment section explicitly states 'This work was funded by DeepMind.'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are listed on the title page: seven authors at DeepMind, one at UCL, one shared with Stanford.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "DeepMind funded the work and employs the majority of authors; findings support the transformer architecture that DeepMind builds and deploys, creating a structural alignment of interests.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is present; the acknowledgment only identifies the funding source.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "In-context learning, in-weights learning, burstiness, Zipfian distribution, label multiplicity, and within-class variation are all operationally defined in the introduction and experimental design sections.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states it is identifying which distributional properties of training data drive emergent ICL, contributing the identification of burstiness, class rarity, dynamic meanings, and Zipfian skew as key factors.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper directly addresses competing explanations (Min et al., Razeghi et al., Xie et al.) and situates itself relative to explicit meta-learning approaches (Santoro et al., Vinyals et al., Wang et al.), showing how its findings extend and challenge prior work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "A footnote on page 1 states 'Code is available at: https://github.com/deepmind/emergent_in_context_learning'; the NeurIPS checklist notes release with the camera-ready version, suggesting it was made available.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The paper uses the Omniglot dataset, which is publicly available and cited with MIT License; no new data is created.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The appendix specifies hardware (16 TPU v2/v3 cores) and optimizer details but provides no requirements file, Dockerfile, or software dependency list sufficient to reproduce the environment.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "While training procedure and architecture are described in the appendix, no step-by-step reproduction instructions are provided; users must infer setup from the code repository.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "The appendix explicitly states 'In all figures, (shaded) error bars indicate standard deviation around the mean,' and error bands are visible in all main result figures.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No formal statistical significance tests (t-tests, ANOVA, permutation tests) are reported; results are presented as learning curves with standard deviation bands only.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Raw accuracy values are reported for all conditions across learning curves, permitting direct comparison of effect magnitudes (e.g., ICL accuracy rising from chance to ~1.0 with increasing burstiness).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The paper uses 5 seeds for most experiments and 3 for others (Figs 5–6) but provides no justification for why these numbers of seeds are sufficient to detect the effects of interest.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Standard deviation is shown as shaded regions in all figures, as explicitly confirmed in the appendix.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Baselines include P(bursty)=0 (no burstiness), 100 training classes (minimal vocabulary), and RNN/LSTM architecture comparisons, each serving as natural lower bounds.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Architecture comparisons use matched LSTM and vanilla RNN baselines with identical parameter counts, layers, and training procedures, making them appropriate contemporaries.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The entire experimental design is structured as systematic ablations: each distributional property (burstiness, class count, label multiplicity, within-class variation, Zipfian skew) is manipulated independently across separate experiments.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both in-context learning accuracy (on holdout classes with random label assignment) and in-weights learning accuracy (on trained classes without context support) are measured throughout all experiments.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not relevant; this study trains and evaluates neural networks automatically.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "In-context learning is always evaluated on holdout image classes 'that were never encountered in training,' with evaluation labels randomly reassigned to prevent reliance on training-time associations.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by experimental condition (burstiness level, number of classes, Zipf exponent, label multiplicity, within-class variation), and the Zipfian experiments separately report accuracy on common vs. rare classes.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Failure cases are prominently reported: high Zipf exponent (=3) causes ICL to fail; recurrent models never achieve ICL under any condition; rare classes are never memorized regardless of Zipfian skew.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results are central findings: absence of burstiness or insufficient class count prevents ICL emergence; recurrent models completely fail; rare classes show chance-level in-weights performance across all Zipf conditions.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model specifications are given: 12-layer transformer, embedding dimension 64, 8 heads, ResNet embedder with specific block and channel architecture; parameter counts are provided for all compared architectures.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "This paper trains transformers from scratch on image-label sequences; there are no natural language prompts or system instructions.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Hyperparameters are reported in the appendix: Adam optimizer, max learning rate 3e-4 at 4000 steps with inverse square root decay, 500k training steps, and the full hyperparameter sweep ranges used for architecture comparisons.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; this paper trains and evaluates neural networks directly.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 2.1 describes how Omniglot images are processed (ResNet embedder, integer label embeddings), how training sequences are constructed (context + query format, bursty vs. non-bursty generation), and how evaluation sequences are formed.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Omniglot is a public dataset cited with MIT License; experimental sequences are procedurally generated from this public data and the generation procedure is described.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 2.1 fully describes how training and evaluation sequences are generated from Omniglot, including bursty/non-bursty mixing, label assignment, and holdout class selection.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; a standard public benchmark dataset is used.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from Omniglot dataset through sequence construction to evaluation is documented across Section 2 and the appendix.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Models are trained from scratch on controlled data; there is no pre-trained model with a training data cutoff to declare.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly addresses train/test class separation: holdout classes are 'never encountered in training,' and Appendix B discusses the design choice of using seen labels in evaluation, arguing it makes ICL harder, not easier.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Models are trained from scratch on procedurally generated sequences; there is no pre-existing benchmark contamination risk in the LLM sense.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": false, + "answer": false, + "justification": "This is a training/mechanism study, not a deployment study; inference cost is not relevant to the research questions.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "The appendix states experiments ran for 500k training steps on 16 TPU v2 or v3 cores; the architecture comparison used 90 hyperparameter sweep runs.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Burstiness in training data is necessary for in-context learning to emerge in transformers", + "evidence": "Fig 2 shows monotonic improvement in ICL accuracy as P(bursty) increases from 0 to 1.0, with P(bursty)=0 yielding chance performance; replicated across 5 seeds", + "supported": "strong" + }, + { + "claim": "A large number of rarely occurring training classes is required for in-context learning", + "evidence": "Fig 3 shows ICL accuracy increases monotonically as classes increase from 100 to 1600 to 12800 with P(bursty)=0.9 held fixed; effect holds after controlling for number of exposures", + "supported": "strong" + }, + { + "claim": "Dynamic meanings (label multiplicity and within-class variation) increase in-context learning", + "evidence": "Figs 4 and 5 show monotonic ICL improvement with increasing label multiplicity (1→10) and within-class variation (no noise → full Omniglot exemplars)", + "supported": "strong" + }, + { + "claim": "There is a tradeoff between in-context and in-weights learning under uniform marginal class distributions", + "evidence": "Consistently observed across all experiments in Figs 2–5: any manipulation increasing ICL simultaneously decreases in-weights learning accuracy", + "supported": "strong" + }, + { + "claim": "A Zipfian (power-law) marginal distribution over classes allows in-context and in-weights learning to coexist", + "evidence": "Fig 6 shows Zipf exponent=1 achieves simultaneously high ICL accuracy on holdout classes and high in-weights accuracy on common classes, with a sweet spot coinciding with natural language statistics", + "supported": "strong" + }, + { + "claim": "Transformers uniquely support in-context learning; matched recurrent models (LSTM, RNN) cannot acquire it", + "evidence": "Fig 7 shows transformer achieving near-perfect ICL while matched LSTM and RNN remain at chance across 90 hyperparameter sweep runs; transformers also match or exceed recurrent models on in-weights learning", + "supported": "strong" + }, + { + "claim": "These distributional properties explain why large language models exhibit emergent in-context learning", + "evidence": "Analogy between natural language distributional properties (burstiness, Zipfian skew, polysemy) and the experimental factors is drawn, but no direct experiments on large language models are conducted", + "supported": "weak" + } + ], + "methodology_tags": [ + "rct" + ], + "key_findings": "In-context learning emerges in transformers when training data exhibits both burstiness and a large vocabulary of rarely occurring classes—properties naturally present in language but absent from standard supervised datasets. A consistent tradeoff exists between in-context and in-weights learning under uniform class distributions, but a Zipfian (power-law) marginal distribution with exponent ~1 (matching natural language) resolves this by enabling both simultaneously. Recurrent architectures (LSTM, RNN) matched on parameters and depth completely fail to acquire in-context learning under any tested naturalistic data distribution, confirming that both data distributions and the transformer architecture are jointly necessary—attention is not all you need.", + "red_flags": [ + { + "flag": "Generalization leap to LLMs", + "detail": "All experiments use small transformers (12-layer, embedding dim 64) trained on Omniglot image classification, but the abstract and discussion claim to explain in-context learning in 'large language models.' No LLM experiments are conducted, and the scale difference is orders of magnitude. The mechanistic analogy is plausible but unverified." + }, + { + "flag": "No statistical significance testing", + "detail": "Results are presented as learning curves with standard deviation bands, but no formal hypothesis tests (t-tests, ANOVA, permutation tests) are reported anywhere in the paper, making it unclear whether observed condition differences are statistically reliable." + }, + { + "flag": "No limitations section", + "detail": "The paper contains no dedicated limitations or threats-to-validity section. The discussion only covers implications and future directions, without acknowledging potential confounds such as the use of Omniglot specifically, the small model scale, or the particular burstiness operationalization chosen." + }, + { + "flag": "Funder-author conflict", + "detail": "Seven of eight authors are DeepMind employees and the work is DeepMind-funded; findings favor the transformer architecture that DeepMind develops and deploys. No competing interests statement is present." + } + ], + "cited_papers": [ + { + "title": "Language Models are Few-Shot Learners", + "relevance": "Foundational GPT-3 paper demonstrating emergent in-context learning; the motivating observation driving this paper's research question" + }, + { + "title": "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?", + "relevance": "Competing theory suggesting LLMs may not perform genuine ICL; paper directly counters this with holdout class experiments showing true generalization" + }, + { + "title": "An Explanation of In-context Learning as Implicit Bayesian Inference", + "relevance": "Alternative theoretical framework for ICL that the paper situates itself relative to" + }, + { + "title": "Impact of Pretraining Term Frequencies on Few-Shot Reasoning", + "relevance": "Suggests LLM few-shot performance may be driven by memorization; paper's holdout class design is a direct methodological response" + }, + { + "title": "Meta-Learning with Memory-Augmented Neural Networks", + "relevance": "Prior explicit meta-training approach for few-shot learning; contrasts with emergent ICL studied here, establishing what 'designed' few-shot learning looks like" + }, + { + "title": "Matching Networks for One Shot Learning", + "relevance": "Canonical few-shot meta-learning baseline; establishes the explicit-training approach that the paper shows is unnecessary given correct data distributions" + }, + { + "title": "Zipfian environments for Reinforcement Learning", + "relevance": "Prior work by the same first author on non-uniform distributions in RL; establishes prior context for the Zipfian analysis and motivates the non-language domain implications" + }, + { + "title": "Can Wikipedia Help Offline Reinforcement Learning?", + "relevance": "Evidence that language pre-training transfers to RL but vision pre-training does not; paper uses this asymmetry to motivate its distributional properties hypothesis" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Provides actionable principles for designing pre-training datasets to elicit ICL in non-language domains, though requires significant ML infrastructure to apply" + }, + "surprise_contrarian": { + "score": 3, + "justification": "Directly challenges 'attention is all you need' narrative by demonstrating data distribution is equally critical; the counterintuitive finding that harder within-class generalization promotes ICL is particularly striking" + }, + "fear_safety": { + "score": 0, + "justification": "Purely mechanistic understanding of a capability; no AI safety or risk implications are raised" + }, + "drama_conflict": { + "score": 1, + "justification": "Mild tension with competing theories (Min et al., Razeghi et al.) claiming LLMs may not genuinely perform ICL; paper presents experimental counter-evidence" + }, + "demo_ability": { + "score": 1, + "justification": "Code is released but experiments require TPU infrastructure; not easily reproducible by practitioners without significant compute resources" + }, + "brand_recognition": { + "score": 2, + "justification": "DeepMind affiliation and NeurIPS venue; several authors (Santoro, Lampinen, Wang, Hill) are prominent researchers in meta-learning and language model interpretability" + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "30533914", + "title": "DeepNet: Scaling Transformers to 1k Layers", + "points": 194, + "comments": 38, + "url": "https://news.ycombinator.com/item?id=30533914", + "created_at": "2022-03-02T22:10:11Z" + }, + { + "hn_id": "32198181", + "title": "Design and Implementation of a Secure RISC-V Microprocessor", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=32198181", + "created_at": "2022-07-22T23:06:00Z" + }, + { + "hn_id": "31352535", + "title": "Design and Implementation of a Secure RISC-V Microprocessor", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31352535", + "created_at": "2022-05-12T11:43:34Z" + }, + { + "hn_id": "31635842", + "title": "Finding the optimal human strategy for Wordle", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31635842", + "created_at": "2022-06-05T23:36:34Z" + }, + { + "hn_id": "31644563", + "title": "Bursty symbols in training allow prompting (ML)", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31644563", + "created_at": "2022-06-06T19:10:02Z" + }, + { + "hn_id": "31441958", + "title": "Data Distributional Properties Drive Emergent Few-Shot Learning in Transformers", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31441958", + "created_at": "2022-05-20T00:25:42Z" + }, + { + "hn_id": "34205208", + "title": "Crypto trading using reinforcement learning", + "points": 2, + "comments": 3, + "url": "https://news.ycombinator.com/item?id=34205208", + "created_at": "2023-01-01T10:23:29Z" + }, + { + "hn_id": "44064952", + "title": "Hierarchical-Chain-of-Generation for Complex Attributes Text-to-3D Generation", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44064952", + "created_at": "2025-05-22T18:12:20Z" + }, + { + "hn_id": "33223016", + "title": "Cotton Gravity: a potential alternative to the dark matter paradigm", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=33223016", + "created_at": "2022-10-16T12:18:43Z" + }, + { + "hn_id": "31633209", + "title": "Data Distributional Properties Drive Emergent In-Context Learning in Transformer", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31633209", + "created_at": "2022-06-05T17:46:34Z" + } + ], + "top_points": 194, + "total_points": 215, + "total_comments": 41 + } +} +\ No newline at end of file diff --git a/papers/database-perspective-llm-2025/scan-v5.json b/papers/database-perspective-llm-2025/scan-v5.json @@ -0,0 +1,339 @@ +{ + "scan_version": 5, + "paper_type": "survey", + "paper": { + "title": "Database Perspective on LLM Inference Systems", + "authors": [ + "James Pan", + "Guoliang Li" + ], + "year": 2025, + "venue": "PVLDB", + "arxiv_id": null, + "doi": "10.14778/3750601.3750703" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are supported by paper content: systematically covers request processing (§2.1), model optimization (§2.2), memory management (§2.3), and how systems combine techniques (§2.4).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "Tutorial/review format; no causal claims tested via study design. Technique descriptions attributed entirely to cited papers.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Scope clearly bounded: LLM inference systems from database perspective. No claims beyond this domain.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Multiple techniques presented (paged allocation vs vAttention, eviction vs offloading) but no comparison, trade-off discussion, or guidance on when each is preferable.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper clearly distinguishes measured outcomes (latency, throughput, memory) from claims; explicitly distinguishes prefill vs decode phase metrics.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Open Problems section (§2.5) discusses limitations: heuristic-based batching/scheduling, uncertain cost estimates, missing benchmarks.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Open Problems section is generic and forward-looking ('develop better estimates', 'adaptive techniques') rather than identifying specific threats to reviewed techniques.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope mentioned implicitly (request processing, optimization, memory) but not explicitly bounded. Does not state what is excluded (training, fine-tuning, inference quality, fairness).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgments explicitly disclose: Chinese National Key R&D Program, NSF of China, Shenzhen Project, Huawei, Zhongguancun Lab, BNRist.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Both authors from Tsinghua University (Li is ACM Fellow); no apparent affiliation with systems reviewed (vLLM, SGLang, Mooncake, DeepFlow).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Diverse funders (government + corporate); Huawei involvement disclosed. Tutorial is balanced pedagogical framework, not product advocacy.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided. Standard academic funding context, but no explicit declaration of patents, equity, or consulting relationships.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined in context: LLM as 'transformer-based' with attention/FFN; prefill/decode phases explained; KV cache, batching, scheduling explained through usage.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Explicitly frames contribution: pedagogical tutorial organizing LLM inference from database systems perspective. Intended audience and contribution clearly stated.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "Only a brief 'Related Tutorials' section mentioning one complementary tutorial. No engagement with survey literature, no discussion of how this framework compares to other organizing principles.", + "source": "haiku" + } + } + }, + "type_checklist": { + "survey": { + "search_and_selection": { + "search_strategy_reproducible": { + "applies": true, + "answer": false, + "justification": "No search strategy described. Paper does not explain how ~20 systems/techniques were identified or selected from a larger corpus.", + "source": "haiku" + }, + "inclusion_exclusion_explicit": { + "applies": true, + "answer": false, + "justification": "No inclusion/exclusion criteria stated. Selection process for cited systems not documented.", + "source": "haiku" + }, + "prisma_or_structured_protocol": { + "applies": true, + "answer": false, + "justification": "Organized as pedagogical tutorial (5 sections) rather than systematic review. No mention of PRISMA or structured review protocol.", + "source": "haiku" + }, + "search_terms_provided": { + "applies": true, + "answer": false, + "justification": "No search terms, queries, or search strategy provided. Does not describe databases/sources searched.", + "source": "haiku" + }, + "databases_listed": { + "applies": true, + "answer": false, + "justification": "Paper does not specify whether sources came from arXiv, Google Scholar, VLDB/SOSP proceedings, or other venues.", + "source": "haiku" + }, + "screening_process_documented": { + "applies": true, + "answer": false, + "justification": "No screening documentation. No counts showing how many papers were considered vs. included.", + "source": "haiku" + }, + "review_scope_justified": { + "applies": true, + "answer": false, + "justification": "Scope mentioned (request processing, optimization, memory) but not justified. No explanation for choice of techniques, timeframes, or venues.", + "source": "haiku" + } + }, + "synthesis_quality": { + "conflicting_findings_acknowledged": { + "applies": true, + "answer": false, + "justification": "Paper presents techniques descriptively but does not discuss conflicting evidence, competing claims, or trade-offs between approaches.", + "source": "haiku" + }, + "quality_assessment_of_sources": { + "applies": true, + "answer": false, + "justification": "No quality assessment, risk-of-bias tool, or structured appraisal of reviewed systems. All treated as equally credible.", + "source": "haiku" + }, + "publication_bias_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of publication bias, positive-result bias, or whether reviewed literature skews toward particular findings.", + "source": "haiku" + }, + "quantitative_synthesis_present": { + "applies": true, + "answer": false, + "justification": "No meta-analysis, vote counting, or effect size synthesis. Purely narrative descriptions of techniques.", + "source": "haiku" + }, + "recommendations_supported_by_evidence": { + "applies": true, + "answer": false, + "justification": "No evidence-based recommendations (e.g., 'use technique X when Y'). Open Problems section is vague forward-looking speculation.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Prefill phase is compute-intensive; decode phase is memory-intensive, motivating different operator designs", + "evidence": "Stated in abstract and §2.1; motivates discussion of sparse attention vs. KV cache management.", + "supported": "moderate" + }, + { + "claim": "FlashAttention reduces memory I/O costs through tiled matrix multiplication and online softmax", + "evidence": "§2.2 Kernels section; cited from reference [6]", + "supported": "moderate" + }, + { + "claim": "Request batching increases throughput but introduces ragged tensors that waste GPU computation", + "evidence": "§2.2 Request Batching; mentions TurboTransformers and ByteTransformer solutions", + "supported": "moderate" + }, + { + "claim": "KV cache size is unpredictable during autoregressive decoding, requiring dynamic memory management", + "evidence": "§2.3: 'length-constrained generation' noted as exception; dynamic paged allocation presented as solution", + "supported": "moderate" + }, + { + "claim": "Prefix sharing via radix trees identifies reusable KV cache across requests, reducing recomputation", + "evidence": "§2.3 Cache Persistence; §2.4 describes SGLang's cache-aware scheduler exploiting prefix sharing", + "supported": "moderate" + }, + { + "claim": "Disaggregated prefill/decode architecture improves throughput by adapting hardware to phase-specific requirements", + "evidence": "§2.4 Distributed Systems (Mooncake, DeepFlow); no empirical throughput comparison provided", + "supported": "weak" + } + ], + "methodology_tags": [ + "case-study" + ], + "key_findings": "The paper organizes LLM inference system design from a database perspective around four dimensions: (1) request processing via prefill and decode phases with efficient operators (sparse attention, speculative decoding); (2) model execution optimization through specialized kernels (FlashAttention, PagedAttention), intelligent batching, and scheduling algorithms for job prioritization and load balancing; (3) dynamic KV cache management via paged allocation, eviction/offloading, quantization, and prefix-sharing persistence; (4) system architectures combining these techniques (centralized low-latency systems like vLLM vs. distributed high-throughput systems like Mooncake and DeepFlow). The framework suggests LLM inference challenges parallel classical database systems optimization problems.", + "red_flags": [ + { + "flag": "Misclassified as systematic survey", + "detail": "Paper is a tutorial, not a systematic literature review. No search strategy, inclusion criteria, screening process, or methodology reported. All survey-specific evaluation criteria are inapplicable." + }, + { + "flag": "No empirical comparison", + "detail": "Describes systems and techniques but provides no benchmarks, direct comparisons, or validation of claims. All effectiveness claims are second-hand citations." + }, + { + "flag": "Trade-offs not discussed", + "detail": "Multiple techniques presented for same problem (vLLM vs. vAttention, paged vs. native allocation) without discussing relative costs, latency impact, or appropriateness in different scenarios." + }, + { + "flag": "No critical appraisal", + "detail": "Zero quality assessment or risk-of-bias evaluation of reviewed systems. No discussion of limitations in vLLM, SGLang, Mooncake, or DeepFlow designs." + }, + { + "flag": "Implicit scope boundaries", + "detail": "What is deliberately excluded is unstated (e.g., training efficiency, inference quality/accuracy, fairness, cost-benefit analysis, failure modes)." + }, + { + "flag": "Vague open problems", + "detail": "§2.5 (5 min of 90-min tutorial) provides generic recommendations ('develop more accurate cost estimates') unmoored from evidence synthesis." + } + ], + "cited_papers": [ + { + "title": "Attention is All You Need", + "authors": "Vaswani et al.", + "year": 2017, + "relevance": "Foundational transformer architecture underlying all reviewed LLM inference systems" + }, + { + "title": "Efficient memory management for large language model serving with PagedAttention", + "authors": "Kwon et al.", + "year": 2023, + "relevance": "vLLM system exemplifying paged KV cache allocation for memory efficiency" + }, + { + "title": "FlashAttention: Fast and memory-efficient exact attention with IO-awareness", + "authors": "Dao et al.", + "year": 2022, + "relevance": "Specialized kernel reducing memory I/O costs in attention computation" + }, + { + "title": "SGLang: Efficient execution of structured language model programs", + "authors": "Zheng et al.", + "year": 2024, + "relevance": "Frontend-runtime co-design exemplifying structured output optimization and cache-aware scheduling" + }, + { + "title": "Mooncake: A KVCache-centric disaggregated architecture for LLM serving", + "authors": "Qin et al.", + "year": 2024, + "relevance": "Distributed disaggregated system exemplifying prefill/decode separation" + }, + { + "title": "DeepFlow: Serverless large language model serving at scale", + "authors": "Hu et al.", + "year": 2025, + "relevance": "Serverless distributed system with fine-grained task decomposition for hardware-agnostic scaling" + }, + { + "title": "Is the GPU half-empty or half-full? Practical scheduling techniques for LLMs", + "authors": "Kossmann et al.", + "year": 2025, + "relevance": "Addresses job prioritization and scheduling for latency-throughput balance" + }, + { + "title": "Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve", + "authors": "Agrawal et al.", + "year": 2024, + "relevance": "System addressing chunked prefill and continuous batching techniques" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly applicable to practitioners; database framework is immediately actionable for inference system design." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Frames known techniques in database perspective (useful but not contrarian); does not challenge conventional wisdom." + }, + "fear_safety": { + "score": 0, + "justification": "Systems optimization paper; no discussion of AI safety, alignment, or risk concerns." + }, + "drama_conflict": { + "score": 0, + "justification": "Straightforward technical tutorial; no controversy, competing claims, or dramatic angles." + }, + "demo_ability": { + "score": 3, + "justification": "All systems discussed are open-source (vLLM, SGLang) or publicly described; techniques are implementable." + }, + "brand_recognition": { + "score": 3, + "justification": "Top-tier PVLDB venue; Guoliang Li is ACM Fellow; systems reviewed are industry-standard (vLLM from Berkeley, Mooncake from Alibaba)." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/datadreamer-tool-synthetic-2024/scan-v5.json b/papers/datadreamer-tool-synthetic-2024/scan-v5.json @@ -0,0 +1,353 @@ +{ + "scan_version": 5, + "paper_type": "position", + "paper": { + "title": "DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows", + "authors": [ + "Ajay Patel", + "Colin Raffel", + "Chris Callison-Burch" + ], + "year": 2024, + "venue": "Annual Meeting of the Association for Computational Linguistics", + "arxiv_id": "2402.10379", + "doi": "10.48550/arXiv.2402.10379" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims DataDreamer helps implement LLM workflows and promotes reproducibility; the paper substantiates these through detailed system descriptions, feature tables, and code examples.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper claims DataDreamer 'can help advance the rate of research progress' and that adoption will improve reproducibility, but these causal claims are not validated empirically — the paper presents no user study, deployment metrics, or controlled comparison.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper makes broad claims that DataDreamer 'can help advance the rate of research progress' across NLP broadly, but it only demonstrates examples and feature coverage — no evidence that the tool is actually adopted or that reproducibility improves in practice.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper identifies reproducibility challenges and asserts DataDreamer solves them, but does not consider whether existing tooling combinations or community norms could address the same issues without a new library.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The paper conflates demonstrating features (caching, fingerprints, cards) with achieving reproducibility, but never measures whether papers using DataDreamer are actually more reproducible — features are proxies for the claimed outcome.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated 'Limitations' section is present at the end of the paper.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The limitations section only states that closed-source models behind APIs make full reproducibility impossible — a generic and obvious observation, not a specific threat analysis tied to particular claims or experiments.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly bound where its reproducibility claims apply or don't apply — there is no statement about which workflow types remain unaddressed or what scale of projects DataDreamer is unsuitable for.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding from IARPA via the HIATUS Program contract #2022-22072200005 is disclosed in the Acknowledgements section.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (University of Pennsylvania, University of Toronto, Vector Institute) are listed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "IARPA is a US government intelligence research agency unrelated to the DataDreamer tool or its commercial interests.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "The paper defines its core concepts — 'session', 'step', 'trainer', 'reproducibility fingerprint', 'synthetic data card' — precisely enough for a technical audience to understand the system.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it provides 'both practical utility to researchers and scientific utility to the community' via an open-source Python library for LLM workflows.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Table 1 directly compares DataDreamer feature coverage against LangChain, Axolotl, and HF Transformers+TRL; the related workflows section cites and contextualizes prior work on synthetic data, evaluation, and fine-tuning.", + "source": "haiku" + } + } + }, + "type_checklist": { + "position": { + "argument_quality": { + "argument_internally_consistent": { + "applies": true, + "answer": true, + "justification": "The argument is consistent: LLM workflows have reproducibility challenges → existing tools don't address them → DataDreamer addresses them through specific features. No internal contradictions.", + "source": "haiku" + }, + "counterarguments_addressed": { + "applies": true, + "answer": false, + "justification": "The paper does not engage with the strongest counterarguments: that tooling adoption is the bottleneck rather than tool existence, or that community norms/journal policies are more effective than libraries for promoting reproducibility.", + "source": "haiku" + }, + "analogies_appropriate": { + "applies": false, + "answer": false, + "justification": "The paper does not rely on analogies as a rhetorical device.", + "source": "haiku" + }, + "prescriptions_proportional": { + "applies": true, + "answer": true, + "justification": "The prescriptive recommendations (share prompts, intermediate outputs, use reproducibility fingerprints) are narrow and well-scoped to the specific reproducibility problems identified.", + "source": "haiku" + }, + "evidence_for_claims_cited": { + "applies": true, + "answer": true, + "justification": "Factual claims about prompt sensitivity (Sclar et al.), model degradation from synthetic data (Shumailov et al.), and other challenges are supported with citations.", + "source": "haiku" + }, + "alternatives_discussed": { + "applies": true, + "answer": false, + "justification": "The paper lists competing tools in Table 1 by feature coverage but does not discuss alternative philosophical approaches to solving reproducibility (e.g., requiring data/code submission at publication, containerization mandates, etc.).", + "source": "haiku" + }, + "historical_context_accurate": { + "applies": true, + "answer": true, + "justification": "The historical framing of LLMs establishing a 'new era in NLP research' and the description of emerging workflows (RLHF, DPO, self-improvement) are accurate and well-cited.", + "source": "haiku" + } + }, + "clarity_and_scope": { + "key_terms_defined_precisely": { + "applies": true, + "answer": true, + "justification": "Technical terms specific to DataDreamer ('session', 'step', 'trainer', 'reproducibility fingerprint') are defined with sufficient precision; broader terms like 'reproducibility' are used in their standard scientific sense.", + "source": "haiku" + }, + "engages_with_existing_literature": { + "applies": true, + "answer": true, + "justification": "The paper engages with prior work on prompt sensitivity, synthetic data generation, fine-tuning, and self-improving LLMs throughout Sections 2 and 5, positioning DataDreamer relative to these contributions.", + "source": "haiku" + }, + "intended_audience_clear": { + "applies": true, + "answer": true, + "justification": "The paper is explicitly directed at NLP researchers who use LLMs in research workflows, as stated in the introduction and throughout.", + "source": "haiku" + }, + "assumptions_stated": { + "applies": true, + "answer": false, + "justification": "The paper assumes reproducibility is universally desirable and that tooling barriers are the primary obstacle, but these assumptions are not explicitly stated or defended — alternative views (e.g., reproducibility costs exceed benefits for exploratory work) are not acknowledged.", + "source": "haiku" + }, + "scope_of_applicability_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss where DataDreamer is not applicable — e.g., very large-scale workflows, non-Python environments, or use cases where caching overhead is prohibitive.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "LLM workflows have significant reproducibility challenges stemming from prompt sensitivity, model scale, and closed-source APIs.", + "evidence": "Cites Sclar et al. 2023 on prompt sensitivity and discusses practical challenges with shell script orchestration and API-dependent workflows.", + "supported": "moderate" + }, + { + "claim": "DataDreamer provides a more complete feature set than LangChain, Axolotl, and HF Transformers+TRL combined.", + "evidence": "Table 1 feature comparison matrix — self-reported by authors with no independent verification.", + "supported": "weak" + }, + { + "claim": "Reproducibility fingerprints can validate that two experimental setups are identical.", + "evidence": "Described by design (hash of all inputs and configurations, recursively through workflow chain), demonstrated conceptually but not empirically tested.", + "supported": "weak" + }, + { + "claim": "Synthetic data cards can help prevent contamination of pre-training sources with model-generated data.", + "evidence": "Cites Shumailov et al. 2023 on model degradation from synthetic training data; the mechanism (metadata tags) is plausible but not empirically evaluated.", + "supported": "weak" + }, + { + "claim": "DataDreamer's caching system reduces carbon emissions by avoiding expensive re-computation.", + "evidence": "Stated in the limitations section as a broader impact; no quantification or measurement provided.", + "supported": "unsupported" + } + ], + "methodology_tags": [ + "theoretical", + "case-study" + ], + "key_findings": "DataDreamer is an open-source Python library that unifies LLM workflow primitives (prompting, synthetic data generation, fine-tuning, alignment, self-improvement) under a single standardized API. The paper's core contribution is a reproducibility infrastructure: automatic caching, resumability, reproducibility fingerprints, and auto-generated synthetic data/model cards. The paper advocates for best practices including sharing exact prompts, intermediate outputs, and optimization configurations. No empirical evaluation of the tool's real-world impact on reproducibility is provided.", + "red_flags": [ + { + "flag": "No empirical evaluation", + "detail": "The paper introduces a tool and describes its features but conducts no user study, adoption analysis, or controlled experiment showing that DataDreamer actually improves reproducibility in practice." + }, + { + "flag": "Misclassified paper type", + "detail": "This is primarily a system/tool paper, not a position paper. The ACL theme track framing adds some advocacy, but the core contribution is software, which strains the position paper evaluation rubric." + }, + { + "flag": "Self-reported feature comparison", + "detail": "Table 1 comparing DataDreamer to LangChain, Axolotl, and HF Transformers+TRL is authored by the DataDreamer team with no independent verification or replication." + }, + { + "flag": "Causal claims without evidence", + "detail": "Claims that DataDreamer 'can help advance the rate of research progress' and reduce carbon emissions are stated without any quantification or empirical support." + } + ], + "cited_papers": [ + { + "title": "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design", + "relevance": "Evidence for the reproducibility challenge of prompt sensitivity that motivates DataDreamer." + }, + { + "title": "The Curse of Recursion: Training on Generated Data Makes Models Forget", + "relevance": "Cited as motivation for tagging synthetic datasets to prevent pre-training contamination." + }, + { + "title": "Self-Rewarding Language Models", + "relevance": "Complex multi-stage self-improvement workflow that DataDreamer is designed to support and make reproducible." + }, + { + "title": "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", + "relevance": "LLM-as-judge evaluation workflow that DataDreamer supports." + }, + { + "title": "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", + "relevance": "Alignment technique (DPO) supported by DataDreamer trainers." + }, + { + "title": "LoRA: Low-Rank Adaptation of Large Language Models", + "relevance": "Parameter-efficient fine-tuning technique integrated into DataDreamer's training API." + }, + { + "title": "HuggingFace's Transformers: State-of-the-Art Natural Language Processing", + "relevance": "Core dependency and integration target for DataDreamer's model loading and training." + }, + { + "title": "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods", + "relevance": "Context for the prompt-and-predict paradigm that DataDreamer is built around." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Researchers can install and use the library immediately; it addresses a real daily pain point in LLM research workflows." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The reproducibility problem is well-known; the solution (a unified library) is pragmatic but not surprising." + }, + "fear_safety": { + "score": 1, + "justification": "Mentions synthetic data contamination of pre-training sources as a concern, but this is a secondary point, not the paper's focus." + }, + "drama_conflict": { + "score": 1, + "justification": "Implicitly criticizes closed-source model providers for undermining reproducibility, but the tone is constructive rather than confrontational." + }, + "demo_ability": { + "score": 3, + "justification": "The library is publicly available at github.com/datadreamer-dev/DataDreamer with working code examples in the paper itself." + }, + "brand_recognition": { + "score": 2, + "justification": "Colin Raffel is well-known as lead author of the T5 paper; published at ACL 2024 main conference." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "41736735", + "title": "Interpreting Clip with Sparse Linear Concept Embeddings (SpLiCE)", + "points": 7, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41736735", + "created_at": "2024-10-04T00:57:26Z" + }, + { + "hn_id": "39442782", + "title": "BlackJAX: Composable Bayesian Inference in Jax", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39442782", + "created_at": "2024-02-20T15:53:51Z" + }, + { + "hn_id": "39600771", + "title": "LLM Ensemble Prediction Capabilities Match Human Crowd Accuracy", + "points": 1, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=39600771", + "created_at": "2024-03-05T08:33:55Z" + }, + { + "hn_id": "39924592", + "title": "Darwin Turing Dawkins (Leonard Adleman) [pdf]", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39924592", + "created_at": "2024-04-03T23:17:50Z" + }, + { + "hn_id": "39429391", + "title": "BioMistral: Open-Source Pretrained Large Language Models for Medical Domains", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39429391", + "created_at": "2024-02-19T13:15:11Z" + } + ], + "top_points": 7, + "total_points": 13, + "total_comments": 2 + } +} +\ No newline at end of file diff --git a/papers/datasentinel-gametheoretic-detection-2025/scan-v5.json b/papers/datasentinel-gametheoretic-detection-2025/scan-v5.json @@ -0,0 +1,582 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks", + "authors": [ + "Yupei Liu", + "Yuqi Jia", + "Jinyuan Jia", + "Dawn Song", + "Neil Zhenqiang Gong" + ], + "year": 2025, + "venue": "IEEE Symposium on Security and Privacy", + "arxiv_id": "2504.11358", + "doi": "10.1109/SP61157.2025.00250" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (FPR≈0, FNR≤0.07 for existing attacks, outperforms 6 baselines by large margin) are directly supported by Tables 1–3 and Table 6.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper makes causal claims that minimax fine-tuning improves detection; these are supported by ablation comparing DataSentinel(Minimax) vs DataSentinel(Min) and hyperparameter ablations in Figures 3–5.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds claims to the evaluated setting (7 NLP tasks, 3 open-source LLMs ≤8B parameters) and the limitations section clearly states DataSentinel fails when injected and target tasks are the same type.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss alternative explanations for why minimax optimization improves detection (e.g., whether it is the fine-tuning itself rather than the game-theoretic framing that drives gains).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "FPR and FNR are precisely defined as detection error rates, and claims about 'effective detection' match these measurements; no proxy conflation is present.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 is titled 'Discussion and Limitations' and contains multiple dedicated subsections on specific failure modes.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: (1) DataSentinel fails when target=injected task (adversarial examples, FNR=0.87 shown empirically), (2) benign instructions in user data may cause false positives, (3) better instruction-following LLMs may weaken the defense.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The conclusion explicitly states detection 'is highly effective... as long as the injected prompts mislead the backend LLM into performing injected tasks that differ from the target task,' clearly bounding scope.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgments section states: 'This work was supported by NSF grant No. 2131859, 2125977, 2112562, and 1937787, as well as ARO grant No. W911NF2110182.'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (Penn State, Duke University, UC Berkeley) are disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "NSF and ARO are US government research funding agencies with no commercial stake in prompt injection detection outcomes.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial disclosures statement (patents, equity, consulting) is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "All key terms are precisely defined with formal notation: LLM-integrated application, target task (st, xt, yt), injected task (se, xe, ye), contaminated target data xc, FPR, FNR, and known-answer detection.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contributions are explicitly itemized: (1) first game-theoretic prompt injection detector, (2) minimax optimization formulation, (3) gradient-based solution and comprehensive evaluation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 thoroughly situates the work relative to heuristic and optimization-based attacks, prevention vs. detection defenses, and directly compares against known-answer detection as the prior SOTA.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The abstract states 'Our code and data are available at: https://github.com/liu00222/Open-Prompt-Injection' — a live repository, not a promise of future release.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All 7 datasets used (MRPC, Jfleg, SMS Spam, RTE, SST2, HSOL, Gigaword) are standard public benchmarks, and the paper states code and data are available at GitHub.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper specifies GPU hardware (Quadro RTX 6000) and mentions QLoRA but does not provide requirements.txt, Dockerfile, or equivalent environment specification.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "Algorithms 1–3 provide detailed pseudocode, Section 5.1 lists all hyperparameters explicitly, and code is publicly available at GitHub.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 1–6 are point estimates (FPR/FNR) with no confidence intervals or error bars reported; results are from single runs with fixed seed.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used for any comparative claims between DataSentinel and baselines.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute FPR/FNR values are reported for all methods, allowing direct magnitude comparison; e.g., KAD FPR up to 0.10 vs DataSentinel 0.00 is clearly quantified.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 100 samples per task combination (giving 35,700 contaminated samples total) is stated but not justified via power analysis or any other principled rationale.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance or standard deviation across runs is reported; temperature is fixed at 0.1 with a fixed seed, so only single deterministic runs are presented.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Six baselines are compared: EVD, NLLMD, SSFTD, SSFTD-G, PromptGuard, and Known-Answer Detection (KAD).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "KAD (Liu et al., USENIX 2024) is the prior state-of-the-art; PromptGuard was released by Meta in 2024; all baselines are contemporaneous.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "DataSentinel(Min) ablates the game-theoretic component; Figures 3–5 ablate hyperparameters r, |D|, α, β, nin, and nout individually.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both FPR and FNR are reported, along with ASV (attack success value) for adaptive attacks in Table 8.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "The paper evaluates automated detection of injected prompts; human evaluation is not relevant to this system.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Evaluation uses test sets of each NLP benchmark dataset, while fine-tuning uses training sets; fine-tuning uses Gigaword training data while evaluation tasks are different.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Tables 10–16 in the appendix report FPR and FNR for every injected-target task combination (7×7) for each of the 9 attacks.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6 reports that DataSentinel achieves FNR=0.87 when target and injected tasks are both sentiment analysis, and explains this is because attacks reduce to adversarial examples.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The adversarial examples failure case (FNR=0.87) is explicitly reported in Table 6 and discussed as a fundamental limitation in Section 6.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model identifiers are given: Mistral-7B [36], LLaMA2-7B [37,38], LLaMA3-8B-Instruct [39], with citations to the corresponding model papers/repos.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The detection instruction template is provided, but the 7 target and injected task instructions are only referenced as 'consistent with [7]' — Appendix A does not list them.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Section 5.1 explicitly reports all hyperparameters: α=1, β=1, r=3, lrout=0.000025, bin=8, bout=2, nin=10, nout=500, temperature=0.1, fixed random seed.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "DataSentinel is a fine-tuned detection model, not an agentic scaffold; no agentic scaffolding is used or evaluated.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The sampling procedure is documented: 100 data points from each dataset test set for evaluation, 500 from Gigaword training set for fine-tuning, and contaminated data construction is formally specified.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The paper states 'Our code and data are available at: https://github.com/liu00222/Open-Prompt-Injection', implying the generated contaminated datasets are released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described precisely: 100 test samples per task drawn from 7 standard benchmarks, 100 injected samples per combination, totaling 35,700 contaminated samples.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; datasets are standard NLP benchmarks with no recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from clean benchmark data → contaminated target data construction via each attack → evaluation of FPR/FNR is documented in Section 5.1 and Algorithm 1.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "The paper evaluates detection of prompt injection attacks using binary FPR/FNR metrics, not LLM capability benchmarks susceptible to training contamination.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "NA — the evaluation is not about LLM benchmark performance where training data contamination is a concern.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "NA — contaminated target data is generated fresh by the attacks; this is not an LLM capability evaluation.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Section 5.2 reports average query time: 1.6 seconds for Mistral-7B detection LLM, 0.7 seconds for LLaMA3.2-1B alternative, and 15.3 seconds for backend LLM processing.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Fine-tuning takes ~3 hours on one Quadro RTX 6000 GPU, costing ~$0.90 in cloud GPU rental, explicitly reported in Section 5.2.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DataSentinel achieves FPR close to 0 across all 7 target tasks and 9 existing prompt injection attacks", + "evidence": "Tables 1–2 show FPR of 0.00–0.01 across all target tasks for all 9 attacks (6 heuristic + 3 optimization-based)", + "supported": "strong" + }, + { + "claim": "DataSentinel achieves FNR at most 0.07 for all existing prompt injection attacks", + "evidence": "Table 1 shows FNR ≤ 0.01 for all heuristic attacks and ≤ 0.07 for NeuralExec across all injected task types", + "supported": "strong" + }, + { + "claim": "DataSentinel significantly outperforms 6 baseline detectors including state-of-the-art KAD", + "evidence": "Table 3 shows baselines have FPR up to 1.00 (PromptGuard) and FNR up to 0.21 (KAD under NeuralExec) vs DataSentinel's near-zero rates", + "supported": "strong" + }, + { + "claim": "The game-theoretic minimax formulation is essential for detecting adaptive attacks", + "evidence": "Table 6 shows DataSentinel(Minimax) FNR ≤ 0.06 vs DataSentinel(Min) FNR up to 0.98 and KAD FNR up to 0.93 under adaptive attacks", + "supported": "strong" + }, + { + "claim": "DataSentinel generalizes across different detection and backend LLMs without retraining for each combination", + "evidence": "Table 4 shows consistently low FPR/FNR across Mistral-7B, LLaMA2-7B, and LLaMA3-8B-Instruct; Table 5 shows cross-backend generalization", + "supported": "strong" + }, + { + "claim": "DataSentinel fails when injected and target tasks are the same type (adversarial examples case)", + "evidence": "Table 6 reports FNR=0.87 for all methods including DataSentinel when both target and injected task are sentiment analysis under optimization-based adaptive attack", + "supported": "strong" + }, + { + "claim": "Fine-tuning a smaller 1B detection LLM achieves comparable detection performance with lower latency", + "evidence": "Section 5.2 reports LLaMA3.2-1B achieves FPR=0.00, FNR=0.01 at 0.7s/query vs Mistral-7B's FPR=0.00, FNR≈0.00 at 1.6s/query", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DataSentinel fine-tunes a detection LLM via minimax optimization that simulates an adversarial game between detector and attacker, achieving near-zero FPR and FNR (≤0.07) across 9 existing prompt injection attacks, 7 NLP tasks, and 6 LLMs. The game-theoretic approach substantially outperforms 6 baselines including the prior state-of-the-art known-answer detection, with the advantage most pronounced against adaptive attacks (FNR ≤0.06 vs up to 0.93 for KAD). DataSentinel is computationally practical: fine-tuning requires ~3 GPU-hours (~$0.90) and inference overhead is ~10% of backend LLM latency. The method fails gracefully in a clearly acknowledged edge case: when target and injected tasks are the same type, attacks reduce to adversarial examples (FNR=0.87), which the authors identify as open future work.", + "red_flags": [ + { + "flag": "No confidence intervals or significance tests", + "detail": "All results are point estimates from single runs with fixed seed and temperature=0.1. With only 100 samples per task combination, small FPR/FNR differences could fall within noise, yet no statistical testing is done." + }, + { + "flag": "No evaluation on production LLMs", + "detail": "All experiments use open-source LLMs ≤8B parameters (Mistral-7B, LLaMA2-7B, LLaMA3-8B). Generalization to closed-source production LLMs (GPT-4, Claude, Gemini) as backend LLMs is not addressed." + }, + { + "flag": "Sample size unjustified", + "detail": "The choice of 100 samples per task combination (35,700 total contaminated samples) is not justified by power analysis or prior convention." + }, + { + "flag": "Task prompts not provided", + "detail": "The 7 target and injected task instructions used in all experiments are referenced as 'consistent with [7]' but not reproduced; Appendix A provides no actual prompt text." + }, + { + "flag": "White-box threat model may not reflect real deployments", + "detail": "The strongest baselines and DataSentinel's fine-tuning assume white-box access to detection LLM weights; real attackers targeting closed-source detection APIs would face a harder problem, potentially making the gains over KAD less relevant in practice." + } + ], + "cited_papers": [ + { + "title": "Formalizing and benchmarking prompt injection attacks and defenses", + "relevance": "Key prior work (Liu et al., USENIX Security 2024) that introduced known-answer detection — the direct baseline DataSentinel improves upon" + }, + { + "title": "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection", + "relevance": "Seminal work (Greshake et al., AISec 2023) establishing the threat model for indirect prompt injection in deployed LLM applications" + }, + { + "title": "Neural exec: Learning (and learning from) execution triggers for prompt injection attacks", + "relevance": "NeuralExec — the strongest optimization-based attack used as the primary evaluation adversary in this paper" + }, + { + "title": "Automatic and universal prompt injection attacks against large language models", + "relevance": "Universal attack — another optimization-based attack evaluated, and source of GCG-based adversarial token optimization techniques" + }, + { + "title": "Struq: Defending against prompt injection with structured queries", + "relevance": "Prevention-based defense compared to DataSentinel's detection approach; used as backend LLM in robustness experiments" + }, + { + "title": "Universal and transferable adversarial attacks on aligned language models", + "relevance": "GCG (Greedy Coordinate Gradient) method used as the core discrete optimization algorithm for generating adaptive injected prompts" + }, + { + "title": "SecAlign: Defending against prompt injection with preference optimization", + "relevance": "Another prevention baseline whose fine-tuned LLM is tested as both detection and backend LLM in Section 6" + }, + { + "title": "PLeak: Prompt leaking attacks against large language model applications", + "relevance": "Optimization-based attack targeting instruction confidentiality, used as one of 9 evaluated attacks" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses a critical vulnerability in deployed LLM-integrated applications with a practical defense requiring only 3 GPU-hours of fine-tuning and released code." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The counterintuitive core insight — that making a detection LLM MORE vulnerable to prompt injection improves detection — is a genuine conceptual surprise." + }, + "fear_safety": { + "score": 3, + "justification": "Prompt injection is a top OWASP LLM risk; the paper demonstrates attacks on Bing Copilot-style applications and shows existing defenses fail against adaptive attackers." + }, + "drama_conflict": { + "score": 2, + "justification": "The arms race framing (attacker adapts to detector, detector trained against adaptive attacker) is compelling, and the meta-review note about potential future erosion adds tension." + }, + "demo_ability": { + "score": 2, + "justification": "Code is available at GitHub and uses open-source LLMs, but requires a GPU and ~3 hours of fine-tuning — not a one-click demo." + }, + "brand_recognition": { + "score": 2, + "justification": "Dawn Song (UC Berkeley) is a prominent AI security researcher; the IEEE S&P venue is highly prestigious in security." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "40115482", + "title": "Survey Study on AI Agent Architectures (2024)", + "points": 77, + "comments": 16, + "url": "https://news.ycombinator.com/item?id=40115482", + "created_at": "2024-04-22T15:47:47Z" + }, + { + "hn_id": "44585492", + "title": "How Many Instruction Can LLMs Follow at Once?", + "points": 11, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44585492", + "created_at": "2025-07-16T18:38:36Z" + }, + { + "hn_id": "23442899", + "title": "Scientists demonstrate particle detector for dark matter", + "points": 6, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=23442899", + "created_at": "2020-06-06T22:33:57Z" + }, + { + "hn_id": "45482380", + "title": "Acoustic Eavesdropping via Mouse Sensors", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45482380", + "created_at": "2025-10-05T15:40:37Z" + }, + { + "hn_id": "35695104", + "title": "Emergent and Predictable Memorization in Large Language Models", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35695104", + "created_at": "2023-04-25T00:31:12Z" + }, + { + "hn_id": "45461534", + "title": "Comparing Quantum Annealing and BF-DCQO", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45461534", + "created_at": "2025-10-03T11:13:53Z" + }, + { + "hn_id": "40106947", + "title": "From r to Q∗: Your Language Model is a Q-Function", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40106947", + "created_at": "2024-04-21T16:22:09Z" + }, + { + "hn_id": "23416215", + "title": "Sensei: Direct-Detection Results on Sub-GeV Dark Matter from a New Skipper-CCD", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=23416215", + "created_at": "2020-06-04T13:23:14Z" + }, + { + "hn_id": "44191952", + "title": "Questioning Representational Optimism in Deep Learning", + "points": 1, + "comments": 3, + "url": "https://news.ycombinator.com/item?id=44191952", + "created_at": "2025-06-05T14:17:23Z" + }, + { + "hn_id": "45934130", + "title": "Questioning Representational Optimism in Deep Learning", + "points": 1, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45934130", + "created_at": "2025-11-15T01:07:24Z" + } + ], + "top_points": 77, + "total_points": 109, + "total_comments": 22 + } +} +\ No newline at end of file diff --git a/papers/datasetresearch-benchmarking-agent-2025/scan-v5.json b/papers/datasetresearch-benchmarking-agent-2025/scan-v5.json @@ -0,0 +1,401 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery", + "authors": [ + "Keyu Li", + "Mohan Jiang", + "Dayuan Fu", + "Yunze Wu", + "Xiangkun Hu" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2508.06960", + "doi": "10.48550/arXiv.2508.06960" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Key abstract claims are supported: 22% score on DatasetResearch-pro is confirmed in Figure 5 (OpenAI DeepResearch scores 0.2218), the search/synthesis dichotomy is demonstrated in Table 2 (search best at knowledge 41.89%, synthesis best at reasoning 72.70%), and corner case failures are illustrated in Figure 8.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper attributes synthesis agents' reasoning advantage to 'generating richly detailed data with explicit thought processes,' but this is an interpretive post-hoc explanation based on correlation in benchmark scores, not an ablation controlling for factors like model size, training data, or retrieval index differences.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The abstract claims to 'illuminate the path toward AI systems capable of finding any dataset in the digital universe,' but the benchmark covers only 208 NLP tasks from two structured platforms (HuggingFace, PapersWithCode), six task categories, and text-only modality — far narrower than the generalization implies.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper attributes search agents' knowledge advantage entirely to 'retrieval breadth' and synthesis agents' reasoning advantage to 'structured generation,' without considering alternative explanations such as training data distribution differences, model capability differences unrelated to architecture, or evaluation metric sensitivity.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly distinguishes metadata similarity scores (what is measured) from downstream fine-tuning performance (practical utility), and explains the normalized scoring formula (Seval/Sref) with discussion of why normalization is needed across heterogeneous task metrics.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7.2 'Limitation and Future Work' is a substantive section discussing three specific limitations: structured-repository scope, reliance on closed-source models, and the need for hybrid agents.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The limitations discuss scope (HuggingFace/PapersWithCode only) and model type (closed-source) but omit major threats: the circular use of o3 to generate reference metadata and then score metadata alignment, selection bias in the pro subset (selected by GPT-4o-search failure), and using only LLaMA-3.1-8B as the fine-tuning evaluation model.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states scope is limited to text-only modality, six NLP task categories, and datasets from HuggingFace and PapersWithCode — and the limitations section calls these out as explicit boundaries.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No acknowledgment or funding disclosure section appears anywhere in the paper text.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are disclosed on the first page: Shanghai Jiao Tong University, SII, and GAIR.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interest declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3.1 formally defines 'data discovery,' MetaTriplet components (demand description, reference set, reference metadata), and Section 3.2 defines the knowledge-based vs. reasoning-based distinction with operational criteria.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The introduction lists four explicit bullet-point contributions: the benchmark itself, the evaluation methodology, experimental results on state-of-the-art systems, and systematic failure mode analysis.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages with Viswanathan et al. [2023], Walker et al. [2023], and Gandhi et al. [2024], explaining how this benchmark extends beyond prior dataset recommendation work by evaluating full agent systems and including downstream fine-tuning performance.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper asserts the MetaTriplet framework measures dataset discovery quality but does not rigorously argue why metadata alignment + fine-tuning performance captures this construct; critically, o3 generates reference metadata and also judges alignment scores, creating a circularity the paper dismisses as a feature rather than a validity threat.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "Difficulty is operationalized only as a binary (regular vs. pro), where the pro subset is selected based on GPT-4o-search failure rates — not via independent difficulty modeling; no easy/medium/hard tiers or difficulty measurement independent of model performance is provided.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": true, + "justification": "The results show no ceiling effects (top agent scores 22% on pro, 73% on main benchmark reasoning tasks), and the use of gated datasets was explicitly designed to prevent trivial search solutions; corner cases near floor (0%) are documented in Section 6.3.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human baseline is included; the paper uses zero-shot LLaMA-3.1-8B as a performance floor but never measures how a human researcher would perform at dataset discovery under the same demand descriptions.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "The normalized score formula (Seval/Sref) is explained and the rationale for multiple metrics across task types is given, but the choice of LLaMA-3.1-8B as the sole fine-tuning model is not justified, and the o3-as-judge metadata scoring is circular since o3 also generated the reference metadata.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "The benchmark deliberately uses HuggingFace 'gated' datasets requiring manual approval, which prevents search agents from directly downloading and using the reference datasets — an explicit contamination resistance design described in Section 3.2 Step 1.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "There is no discussion of whether gated datasets may become ungated, whether HuggingFace availability will persist, or whether the benchmark will require updates as agent capabilities evolve; Section 7.2 covers future work but not benchmark longevity.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6.3 explicitly discusses benchmark failure modes ('corner cases' outside existing data distributions where all methods fail), and Section 7.2 acknowledges the benchmark does not cover unstructured web sources or non-text modalities.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "While code is available on GitHub and Appendix D provides fine-tuning configs, the deep research agent experiments 'necessitate a human-in-the-loop approach' because the tools lack API access — making those baseline results not independently reproducible.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "The 7-step curation pipeline is described in detail (Section 3.2), metadata components are explicitly defined (Appendix A), filtering criteria are specified, and the benchmark is available on GitHub with full prompts in the Appendix.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The paper provides a GitHub URL but specifies no license for the benchmark; the underlying reference datasets from HuggingFace are gated (intentionally access-restricted), creating ambiguity about whether others can actually use or redistribute the benchmark.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "The paper specifies intended use (evaluating demand-driven dataset discovery agents) but does not specify what should NOT be concluded from results — for instance, that scores on 6 NLP task categories should not generalize to multimodal or scientific discovery tasks.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Even the most advanced deep research systems achieve only 22% score on DatasetResearch-pro", + "evidence": "Figure 5 and Section 5.2 report OpenAI DeepResearch achieves 0.2218 on the pro subset fine-tuning evaluation", + "supported": "strong" + }, + { + "claim": "Search agents excel at knowledge-intensive tasks while synthesis agents dominate reasoning tasks", + "evidence": "Table 2 shows GPT-4o-search fine-tuning score of 41.89% on knowledge vs. OpenAI o3 w/ref score of 72.70% on reasoning tasks", + "supported": "strong" + }, + { + "claim": "All current agent methods catastrophically fail on corner cases outside existing data distributions", + "evidence": "Figure 8 (Section 6.3) shows near-zero fine-tuning performance for all agents on one corner case; synthesis agent scores 0.0, search agent 0.067", + "supported": "moderate" + }, + { + "claim": "Few-shot evaluation results maintain consistent relative trends with fine-tuning, making 3-shot a practical proxy", + "evidence": "Table 2 shows 1/3/5-shot trends largely mirror fine-tuning rankings; Section 5.2 argues 3-shot is the most stable setting", + "supported": "moderate" + }, + { + "claim": "Synthesis agents outperform search agents specifically due to their ability to generate structured, reasoning-aligned output data", + "evidence": "Figure 6 case study shows o3 synthesis generates structured reasoning examples; metadata evaluation shows synthesis agents score 8.69 avg vs. search agents 5.71 avg", + "supported": "weak" + }, + { + "claim": "DatasetResearch is the first comprehensive benchmark for demand-driven dataset discovery", + "evidence": "Prior work (Viswanathan et al. 2023, Walker et al. 2023) is positioned as partial exploration not evaluating full agent systems with downstream fine-tuning", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DatasetResearch establishes a 208-task benchmark for demand-driven dataset discovery across six NLP task categories, revealing that even state-of-the-art deep research systems achieve only 22% normalized score on the challenging pro subset. A clear specialization emerges: search agents achieve best knowledge-task performance (41.89% fine-tuning) while synthesis agents dominate reasoning tasks (72.70%). All evaluated methods fail catastrophically on corner cases outside existing data distributions, and the benchmark reveals that fine-tuning with synthetic data outperforms retrieval-based discovery for reasoning tasks by a large margin.", + "red_flags": [ + { + "flag": "Circular evaluation via o3", + "detail": "OpenAI o3 is used to generate reference metadata, generate demand descriptions, AND serve as the judge for metadata similarity scoring — the paper acknowledges this but frames it as bias mitigation rather than a validity threat." + }, + { + "flag": "Deep research results not reproducible", + "detail": "Deep research agent experiments (OpenAI DeepResearch, Grok DeepResearch, Gemini DeepResearch) require manual human-in-the-loop curation because 'these deep research tools are not currently accessible via API calls' — these results cannot be independently reproduced." + }, + { + "flag": "Pro subset selection bias", + "detail": "DatasetResearch-pro was selected by identifying the 20 tasks where GPT-4o-search-preview scored lowest, creating a biased hard subset that is specifically challenging for that agent and may not represent general difficulty." + }, + { + "flag": "Single fine-tuning model", + "detail": "All downstream task performance is evaluated by fine-tuning only LLaMA-3.1-8B with a fixed configuration; it is unclear whether results generalize to other model sizes or architectures." + }, + { + "flag": "No human baseline", + "detail": "No human researcher benchmark for dataset discovery is provided, making it impossible to assess whether the 22% top-agent score represents near-human, far-below-human, or superhuman performance in practical terms." + }, + { + "flag": "Synthesis advantage may reflect o3 self-evaluation", + "detail": "Synthesis agents primarily use o3, which also generates the reference metadata used in scoring; this may systematically inflate synthesis agents' metadata alignment scores compared to search agents using different models." + } + ], + "cited_papers": [ + { + "title": "DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions", + "relevance": "Direct predecessor: bi-encoder retriever recommending datasets from natural language research descriptions; DatasetResearch extends this to full agent evaluation with downstream fine-tuning" + }, + { + "title": "Prompting Datasets: Data Discovery with Conversational Agents", + "relevance": "Prior work on using conversational LLMs for data discovery, including hallucination risks — motivates need for rigorous benchmark" + }, + { + "title": "Better Synthetic Data by Retrieving and Transforming Existing Datasets", + "relevance": "Dataset transformation/synthesis prior work that DatasetResearch evaluates as a baseline approach" + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Cited as example of benchmark for evaluating code agents; provides methodological context for agent benchmarking" + }, + { + "title": "DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-World Environments", + "relevance": "Related deep research agent system; authors include overlapping researchers from GAIR lab" + }, + { + "title": "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents", + "relevance": "Related benchmark for agent web-browsing capabilities; comparator in the agent evaluation space" + }, + { + "title": "SWE-Smith: Scaling Data for Software Engineering Agents", + "relevance": "Related work on data synthesis for agent training; demonstrates the general importance of data curation for agent performance" + }, + { + "title": "DiscoveryBench: Towards Data-Driven Discovery with Large Language Models", + "relevance": "Related benchmark for data-driven scientific discovery agents; related evaluation space" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses a real bottleneck in AI development workflows — finding training data — with a public benchmark practitioners can use to evaluate discovery tools." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that state-of-the-art deep research agents achieve only 22% is notable, but the search-vs-synthesis dichotomy and corner case failures are unsurprising given known limitations of each approach." + }, + "fear_safety": { + "score": 0, + "justification": "No safety or risk concerns raised; the paper is about dataset curation efficiency, not harmful capabilities." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy or conflict angle; the paper evaluates commercial systems matter-of-factly without provocative framing." + }, + "demo_ability": { + "score": 2, + "justification": "The benchmark is publicly available on GitHub, and search/synthesis agents can be run against it, though deep research agents require manual curation." + }, + "brand_recognition": { + "score": 2, + "justification": "GAIR lab (Pengfei Liu) is a recognized NLP research group; the paper also evaluates prominent commercial systems from OpenAI, Google, and xAI." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43014573", + "title": "Time to act on the risk of efficient personalized text generation", + "points": 57, + "comments": 34, + "url": "https://news.ycombinator.com/item?id=43014573", + "created_at": "2025-02-11T16:14:03Z" + }, + { + "hn_id": "45234790", + "title": "Reverse-Engineered Reasoning for Open-Ended Generation", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45234790", + "created_at": "2025-09-13T19:49:08Z" + }, + { + "hn_id": "45184326", + "title": "Reasoning Traces from QA Pairs", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45184326", + "created_at": "2025-09-09T16:25:03Z" + }, + { + "hn_id": "44516439", + "title": "Amazon gets serious with AI Safety", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44516439", + "created_at": "2025-07-10T01:50:50Z" + }, + { + "hn_id": "45226714", + "title": "Are ArXiv submissions on Wednesday better cited?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45226714", + "created_at": "2025-09-12T21:00:07Z" + }, + { + "hn_id": "44889206", + "title": "Large Language Models Do Not Simulate Human Psychology", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44889206", + "created_at": "2025-08-13T14:50:01Z" + }, + { + "hn_id": "32619543", + "title": "Angle-agnostic cloaking from person-tracking systems with a t-shirt", + "points": 1, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=32619543", + "created_at": "2022-08-27T14:42:49Z" + }, + { + "hn_id": "44521323", + "title": "Evaluating the Critical Risks of Amazon’s Nova Premier", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44521323", + "created_at": "2025-07-10T14:11:52Z" + }, + { + "hn_id": "42705257", + "title": "What Hawking Radiation Looks Like as You Fall into a Black Hole", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42705257", + "created_at": "2025-01-14T23:16:02Z" + } + ], + "top_points": 57, + "total_points": 73, + "total_comments": 37 + } +} +\ No newline at end of file diff --git a/papers/dear-diary-rct-copilot-2024/scan-v5.json b/papers/dear-diary-rct-copilot-2024/scan-v5.json @@ -0,0 +1,534 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace", + "authors": [ + "Jenna Butler", + "Jina Suh", + "Sankeerti Haniyur", + "Constance Hadley" + ], + "year": 2024, + "venue": "arXiv", + "arxiv_id": "2410.18334", + "doi": "10.1145/nnnnnnn.nnnnnnn" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims are supported in the results: significantly increased usefulness (p=0.001) and enjoyment (p<0.0001), unchanged trust, 84% positive work changes, and 66% feeling changes are documented through surveys and diary coding.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Although an RCT design was used, the primary causal claim that Copilot 'significantly increased' positive beliefs is based on within-group paired t-tests for the treatment group only; the paper does not present treatment-vs-control belief comparisons, so the control group's belief trajectory during the same period is not shown.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion states 'generative AI tools are changing work, mostly for the better' and makes broad organizational recommendations without bounding claims to the single-company, 3-week, primarily-male developer population studied.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly considers that the contemporaneous surge in AI news coverage — not Copilot use — may explain why both treatment and control groups increased in believing they have unique technical skills; multiple explanations for null telemetry results are also discussed.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper carefully distinguishes self-reported productivity (surveys and diaries) from objective productivity (telemetry), explicitly noting that telemetry showed no statistically significant differences even while beliefs improved.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5 is a dedicated 'LIMITATIONS' section covering self-report biases, unvalidated survey instruments, and single-company generalizability.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are named: 25% control group contamination (used GenAI tools anyway), 25% treatment non-compliance, study may be too short (citing research suggesting 11 weeks needed for tipping point), post-hoc power as low as 0.06 for code changes, and coding confounders (meetings, oncall, vacation).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The limitations note single-company scope but defend it via Flyvbjerg's case-study argument; there are no explicit statements of what the results do NOT show, and the conclusion makes broad recommendations for organizations generally.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement is provided anywhere in the paper; three of four authors are Microsoft employees conducting research on a Microsoft product, constituting implicit institutional funding that is never formally disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All four authors list institutional affiliations clearly in the header: Butler, Suh, and Haniyur at Microsoft; Hadley at Institute for Work Life.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Three of four authors are Microsoft employees evaluating GitHub Copilot, a Microsoft/GitHub product; the institution has direct financial interest in favorable findings and is not independent of the outcome.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration appears in the paper; the acknowledgments mention Microsoft colleagues but do not address potential conflicts.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are operationally defined: 'usefulness,' 'enjoyment,' and 'trust' are measured via specific Likert statements; 'Copilot experience' is defined by explicit usage frequency categories; productivity is measured via both self-report and multiple telemetry metrics.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states its contribution as 'one of the first randomized controlled trials of GitHub Copilot in a real-world work environment' examining effects on both quantitative coding data and developers' beliefs and values.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 systematically engages with TAM-related adoption literature and prior Copilot studies (Peng et al., Chatterjee et al., Imai, Zhang et al.), explicitly positioning this work as extending from controlled lab settings to the real workplace.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No analysis code is released; supplemental material [5] on Zenodo contains only the survey instrument, not data processing or statistical analysis scripts.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Raw survey responses, diary entries, and telemetry data are not publicly available; only the survey instrument is accessible via Zenodo.", + "source": "haiku" + }, + "environment_specified": { + "applies": false, + "answer": false, + "justification": "This is a human subjects study; GitHub Copilot is a commercial tool with no researcher-controlled software environment to specify.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step instructions for replicating the study design, recruitment, or analysis pipeline are provided.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "The paper reports means and p-values throughout but no confidence intervals or error bars for any main results, including Likert scale changes and DiD estimates in Table 2.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "The paper uses paired t-tests for belief changes, chi-square for group equivalence checks, Kruskal-Wallis for diary distribution comparison, and difference-in-difference for telemetry outcomes.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Mean differences on Likert scales are reported (e.g., 2.72→3.61) but no standardized effect sizes (Cohen's d, eta-squared) are calculated; Table 2 DiD coefficients are on raw scales without standardization.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No a priori power analysis or sample size justification is provided; post-hoc power in Table 2 reveals critical underpowering (as low as 0.06 for code changes), which is acknowledged retrospectively but not addressed prospectively.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Standard deviation is reported only for diary submission counts (mean 8.37, SD 4.819); Likert scale comparisons report means without variance, and DiD coefficients lack standard errors in the table.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "The concurrent control group (developers not using Copilot) and 'Continuing' group (prior users) serve as baselines for both telemetry and belief comparisons.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "The control group is concurrent (same time period, summer 2023), making it a contemporary baseline with no temporal confounding.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "This is a tool adoption study evaluating GitHub Copilot as a whole product; component ablation is not applicable.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The study combines multiple Likert-scale belief measures, qualitative diary entries with open coding, and six objective telemetry metrics (code changes, PRs, development time, PR hours, email time, build time).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "The entire study is a human evaluation: 106 software engineers assessed Copilot through daily diary reflections and pre/post surveys on perceived usefulness, enjoyment, trust, and work impact.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is an RCT/diary study of developer behavior, not a prediction task; held-out test sets are not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by prior experience (experienced vs. inexperienced), by group (treatment/control/continuing), by engineering level (junior/senior/principal), and by use case category in diary coding.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4.3.2 'Challenges' explicitly discusses failure modes: hallucinations, syntactically correct but semantically wrong code, poor support for niche languages/file types (Android bp/mk, YAML, JSON config), and cases where validation overhead negated productivity gains.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The telemetry analysis found no statistically significant differences between treatment and control on any of six metrics (p-values 0.5–0.9); this null result is reported clearly in Table 2 and discussed at length.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "GitHub Copilot is named but no model version, API version, or snapshot date is specified; the study was conducted around summer 2023 but Copilot's underlying model changed during this period.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "The full survey instrument is available via a Zenodo DOI [5], and the daily diary question structure is described in Section 3.6; study instruments are sufficiently accessible.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": false, + "answer": false, + "justification": "GitHub Copilot is a black-box commercial tool with no researcher-controlled hyperparameters; not applicable.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is involved; the study evaluates organic adoption and use of an existing commercial IDE plugin.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The paper describes 6-week baseline telemetry collection, consent procedures, country-based exclusions, parallel-trends validation for DiD, and open coding procedures for qualitative diary and survey data.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw survey responses, diary entries, and telemetry data are not publicly released; only the survey instrument is on Zenodo.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described in detail: daily Teams message diaries, intake and exit surveys with example questions, telemetry from the corporate engineering system requiring written consent, and a 6-week baseline collection period.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Recruitment is described: starting from 10,000 randomly selected engineers, 337 completed intake survey, 269 agreed to participate, 228 remained after country exclusions, 106 reached the compliant final population.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline is documented: recruitment → intake survey → block randomization → daily diary → exit survey → telemetry DiD analysis, with analysis methods (paired t-test, open coding, DiD, chi-square) explained for each data type.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "This study evaluates developer adoption and beliefs, not model capabilities on benchmarks; training cutoff is not relevant.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not evaluating model capabilities on benchmarks; not applicable.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not evaluating model capabilities on benchmarks; not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration is mentioned anywhere in the paper; this is a notable omission for an RCT, as it allows post-hoc emphasis on outcomes that showed significant results (beliefs) over those that did not (telemetry).", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": true, + "justification": "The acknowledgments state: 'The ethics for this study were reviewed and approved by the Microsoft Research Institutional Review Board (MSRIRB), which is an IRB federally registered with the United States Department of Health & Human Services.'", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": true, + "justification": "Demographics are reported for both intake (n=228) and final (n=106) populations including gender, management level, engineering seniority, and primary programming language, broken down by group (treatment/control/continuing) in Table 1.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": true, + "justification": "Criteria are described: engineers from the organization, restricted to allowed countries, required to complete at least 1 diary, and required self-attested compliance with group assignment verified in exit survey.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": true, + "justification": "Block randomization is described: developers without prior Copilot experience were randomly assigned to treatment/control, stratified on gender and two prior-belief items ('I like AI coding tools,' 'I trust AI coding tools'); chi-square confirmed group balance.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Blinding is not feasible in a tool adoption study where participants must actively choose to use or not use GitHub Copilot; the study is inherently open-label.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": true, + "justification": "Attrition is explicitly reported: 228 intake → 106 final (53% retention), with specific causes: 25% of treatment never used Copilot, 25% of control used GenAI tools during study, others failed to complete diaries or exit survey.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": false, + "answer": false, + "justification": "This is a human subjects study evaluating commercial tool adoption; inference cost is not applicable.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": false, + "answer": false, + "justification": "No computational budget is relevant to this human subjects research study.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Three weeks of GitHub Copilot use significantly increased developers' belief that AI tools are useful (mean 2.93→3.51, p=0.001) and enjoyable (mean 2.72→3.61, p<0.0001)", + "evidence": "Paired t-tests on treatment group Likert responses before and after 3-week study period", + "supported": "moderate" + }, + { + "claim": "Developers' trust in AI-generated code did not change significantly after using Copilot", + "evidence": "No statistically significant change in trust-related Likert items; ~20% trust AI code before and after", + "supported": "strong" + }, + { + "claim": "GitHub Copilot access produced no statistically significant change in objective telemetry metrics including code changes, PRs, and development time", + "evidence": "DiD analysis across 6 metrics; p-values range 0.5–0.9; post-hoc power as low as 0.06; Table 2", + "supported": "moderate" + }, + { + "claim": "84% of participants reported positive changes in how they work after using Copilot", + "evidence": "Open coding of 94 exit survey verbatim responses; 84% of 129 total codes were positive", + "supported": "moderate" + }, + { + "claim": "Developers with prior Copilot experience were significantly more likely to believe tools are useful (86% vs 44%) and enjoyable (72% vs 43%)", + "evidence": "Chi-square comparison of experienced vs. inexperienced users on intake survey; p<0.05 for both measures", + "supported": "strong" + }, + { + "claim": "Developers discovered unexpected uses for Copilot including web search replacement and creative ideation", + "evidence": "Qualitative diary coding identified web search replacement as a common unanticipated use case, with multiple supporting verbatim quotes", + "supported": "moderate" + } + ], + "methodology_tags": [ + "rct", + "qualitative", + "observational" + ], + "key_findings": "This workplace RCT (n=106 compliant, 3-week treatment) found that GitHub Copilot use significantly increased developers' positive beliefs about AI tool usefulness and enjoyment, but did not change trust in AI-generated code. Objective telemetry metrics showed no statistically significant productivity effect in any of six measures, likely due to study duration (3 weeks may be too short), severe underpowering (post-hoc power as low as 0.06), and 25% contamination in both arms. Qualitative diary data revealed diverse unexpected use cases — particularly web search replacement — and 84% of treatment participants reported positive changes to how they work, though self-report bias from Microsoft-affiliated researchers evaluating a Microsoft product is a notable confound.", + "red_flags": [ + { + "flag": "Microsoft authors evaluating Microsoft product, no COI disclosure", + "detail": "Three of four authors are Microsoft employees evaluating GitHub Copilot (a Microsoft/GitHub product); no conflict of interest or funding is disclosed despite direct institutional financial stake." + }, + { + "flag": "Causal belief claims from within-group analysis, not RCT comparison", + "detail": "The headline causal claim that Copilot increased usefulness and enjoyment beliefs is based on within-treatment paired t-tests; the paper never presents treatment-vs-control belief comparisons, so the control group's belief trajectory during the same period is unknown." + }, + { + "flag": "Severe bilateral contamination", + "detail": "25% of the control group self-reported using GenAI tools during the study, and 25% of the treatment group reported not using Copilot; this 50% combined non-compliance severely undermines experimental validity." + }, + { + "flag": "Critically underpowered telemetry analysis", + "detail": "Table 2 shows post-hoc statistical power as low as 0.06 for code changes; the null telemetry result likely reflects inadequate power rather than a true absence of effect." + }, + { + "flag": "No pre-registration for RCT", + "detail": "An RCT with no pre-registration allows post-hoc emphasis on outcomes that showed significant results (beliefs) over those that did not (telemetry), creating potential for selective reporting." + }, + { + "flag": "No confidence intervals on any main results", + "detail": "All results report only means and p-values; absence of CIs prevents readers from assessing precision of estimates, practical significance, or whether effects are clinically/practically meaningful." + } + ], + "cited_papers": [ + { + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "relevance": "Key prior RCT of Copilot in lab setting showing 55.8% time reduction on HTTP server task; this paper extends to real workplace over 3 weeks" + }, + { + "title": "The Impact of AI Tool on Engineering at ANZ Bank: An Empirical Study on GitHub Copilot within Corporate Environment", + "relevance": "Corporate Copilot study showing 42.36% productivity boost via controlled programming tasks; prior workplace evidence this study contextualizes" + }, + { + "title": "Is GitHub Copilot a Substitute for Human Pair-Programming? An Empirical Study", + "relevance": "Finds Copilot increases lines of code but at lower quality than human pair programming; directly relevant to code quality concerns raised in diary study" + }, + { + "title": "GitHub Copilot AI pair programmer: Asset or Liability?", + "relevance": "Compares Copilot to humans on fundamental coding problems; finds humans succeed more but Copilot-introduced bugs are easier to fix" + }, + { + "title": "Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions", + "relevance": "Found ~40% of generated programs contain security vulnerabilities; cited to validate developer concerns about code correctness seen in diaries" + }, + { + "title": "Practices and Challenges of Using GitHub Copilot: An Empirical Study", + "relevance": "Analyzed Stack Overflow and GitHub Discussions on Copilot usage patterns; provides context for real-world usage challenges and language support findings" + }, + { + "title": "Using AI-Based Coding Assistants in Practice: State of Affairs, Perceptions, and Ways Forward", + "relevance": "Survey of developer attitudes toward AI coding assistants including preference for test and documentation writing; directly relevant to belief formation and use case findings" + }, + { + "title": "Early Results from a Study of GenAI Adoption in a Large Brazilian Company: The Case of Globo", + "relevance": "Another corporate GenAI adoption study; comparative workplace evidence for adoption patterns" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Organizations adopting AI coding tools can directly apply findings on adoption barriers, the belief-vs-telemetry disconnect, and the 11-week tipping point context for realistic expectation-setting." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that enjoyment increases but trust does not, and that objective telemetry shows no significant effect while subjective satisfaction improves, challenges the dominant narrative that Copilot straightforwardly improves developer output." + }, + "fear_safety": { + "score": 1, + "justification": "The paper documents developer fears about AI replacement and code quality risks (hallucinations, subtle bugs), but these are secondary findings within a primarily adoption-focused study." + }, + "drama_conflict": { + "score": 2, + "justification": "Microsoft employees studying a Microsoft product with a null productivity result creates inherent credibility tension; the 'Dear Diary' framing and verbatim quotes also give the study an unusually personal voice." + }, + "demo_ability": { + "score": 2, + "justification": "GitHub Copilot is widely available, making study findings immediately testable and actionable for any developer." + }, + "brand_recognition": { + "score": 3, + "justification": "GitHub Copilot is the most recognizable AI coding tool, and Microsoft Research involvement adds institutional weight; the paper's RCT claim for a flagship product drives interest." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45751115", + "title": "DeepSeek-OCR: Contexts Optical Compression", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45751115", + "created_at": "2025-10-29T18:33:29Z" + }, + { + "hn_id": "28973605", + "title": "Generalized Out-of-Distribution Detection: A Survey", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=28973605", + "created_at": "2021-10-24T00:03:38Z" + }, + { + "hn_id": "42458574", + "title": "Semantic, Orthographic, and Morphological Biases in Humans' Wordle Gameplay", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42458574", + "created_at": "2024-12-19T05:06:04Z" + }, + { + "hn_id": "28957390", + "title": "Generalized Out-of-Distribution Detection: A Survey", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=28957390", + "created_at": "2021-10-22T14:11:31Z" + } + ], + "top_points": 2, + "total_points": 6, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dear-novel-deep-2022/scan-v5.json b/papers/dear-novel-deep-2022/scan-v5.json @@ -0,0 +1,566 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DEAR: A Novel Deep Learning-based Approach for Automated Program Repair", + "authors": [ + "Yi Li", + "Shaohua Wang", + "Tien N. Nguyen" + ], + "year": 2022, + "venue": "International Conference on Software Engineering", + "arxiv_id": "2205.01859", + "doi": "10.1145/3510003.3510177" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims (42%-683% improvement on Defects4J, 31-145 more bugs on BigFix, 169 multi-hunk bugs among 667 fixed on CPatMiner) are directly supported by Tables 1, 4, and 6.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about each component's contribution are backed by the ablation study in RQ4/Table 9, which systematically removes hunk detection, expansion, and attention+cycle training to measure individual impact.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds results to Java code and lists this in the Limitations section, noting that only the FL and post-processing modules are language-dependent.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether training data volume differences, hyperparameter search advantages, or benchmark-specific factors (rather than architectural innovations) could explain the observed gains over baselines.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly distinguishes 'correct patches' (matching developer fixes) from 'plausible patches' (passing all tests but not necessarily the ground truth fix), reporting both counts separately in all tables.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "The paper contains both a 'Threats to Validity' paragraph and a dedicated 'Limitations' section with five enumerated specific limitations.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats named include: Java-only evaluation, the need to reimplement CURE (unavailable), and use of a 5-hour time limit for pattern-based tools; these are specific, not boilerplate.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit scope boundaries state: Java only, only bugs causing failing tests, cannot fix bugs requiring many new inserted statements, and expansion may incorrectly include non-buggy statements.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "NSF funding disclosed in Acknowledgments with specific grant numbers: CNS-2120386, CCF-1723215, CCF-1723432, TWC-1723198, CCF-1518897, CNS-1513263.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (NJIT and UT Dallas) are listed on the title page with email addresses.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "NSF is a government funding agency with no commercial stake in the outcome; authors are not evaluating a product they have financial ties to.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "APR, multi-hunk/multi-statement bugs, spectrum-based FL, plausible vs. correct patches, and bug types (Types 1-5) are all defined with examples in Sections 1-2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three key contributions are explicitly enumerated in the introduction: FL for multi-hunk bugs, compositional divide-and-conquer fixing, and the enhanced two-layer LSTM with attention and cycle training.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 7 extensively covers both DL-based and pattern-based APR prior work, and the introduction situates DEAR by identifying the specific limitation (individual-statement-only fixing) in existing DL approaches.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "GitHub repository explicitly cited: https://github.com/AutomatedProgramRepair-2021/dear-auto-fix, with the contribution statement 'Our data and tool are publicly available.'", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All three evaluation datasets (Defects4J, BigFix, CPatMiner) are publicly available benchmarks from cited prior work.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only hardware specs are provided (8-core Intel CPU, GTX Titan GPU); no requirements.txt, Dockerfile, or software dependency list appears in the paper.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper describes the algorithm in detail but provides no step-by-step reproduction instructions; a reader would need to infer setup from the GitHub repo, which is not described in the paper itself.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported for any results; all comparative claims are based on raw counts and percentages without statistical uncertainty.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests (Wilcoxon, t-test, etc.) are applied to any of the comparative claims across all five RQs.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Relative improvements are reported throughout (42%-683% over baselines, 31-145 more bugs fixed on BigFix) with baseline counts providing context for interpreting magnitude.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The paper uses established benchmarks without justifying whether the 395-bug Defects4J set or the 80/10/10 split ratios are adequate for the specific claims made.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or results across multiple runs are reported; it is unclear whether experiments were run once or repeatedly.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Six DL-based APR baselines (DLFix, CoCoNuT, SequenceR, Tufano19, CODIT, CURE) and eight pattern-based APR tools are included across the RQs.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include CURE (ICSE 2021) and CoCoNuT (ISSTA 2020), which are the most recent competitive DL-based APR approaches at time of publication.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "RQ4/Table 9 removes three components individually (hunk detection, multi-statement expansion, attention+cycle training) and measures the performance drop on Defects4J.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Correct patches, plausible patches, top-K accuracy (K=1,3,5), per-bug-type breakdowns (Types 1-5), and training parameter counts are all reported.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable for an APR tool benchmarked against ground-truth developer fixes and validated by automated test cases.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "For BigFix/CPatMiner an 80%/10%/10% train/tune/test split is used; Defects4J serves as a held-out test with explicit no-overlap guarantee against the CPatMiner training set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results broken down by five bug types (Type 1: one-hunk/one-statement through Type 5: multi-hunk/mix-statements) in Tables 2, 3, 6, 8, and 9.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The Limitations section enumerates specific failure modes: rare/out-of-vocabulary names, security/non-failing-test bugs, fixes requiring many new statements, and incorrect multi-statement expansion.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "The ablation shows performance drops when components are removed, but no design decisions that failed or approaches tried and abandoned are reported.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "The paper refers to 'Google's pre-trained BERT model' without specifying the checkpoint (BERT-base vs. BERT-large) or version used for fine-tuning.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "DEAR is a non-LLM deep learning model with no prompts; this criterion is not applicable.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Hyperparameters are reported with full search ranges and best values: epoch size, batch size, learning rate for BERT and LSTM; vector size, learning rate, batch size, epoch size for GloVe.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding used; DEAR is a traditional DL pipeline, not an agent-based system.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing steps are documented in detail: AST parsing, alpha-renaming, GloVe encoding, TreeCaps summarization, CPatMiner-based subtree pairing rules, and the four pairing rules for buggy/fixed subtrees.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Public benchmarks (Defects4J, BigFix, CPatMiner) are used; Defects4J in particular is a well-established independently reproducible benchmark with versioned releases.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 5.2 describes the data sources (44k+ bugs from 5,832 Java projects in CPatMiner, 26k+ in BigFix, 395 in Defects4J) and the 80/10/10 split procedure.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Standard public benchmarks are used with no human participant recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from source code → AST parsing → subtree pairing → context building → model training → patch generation → test validation is documented across Sections 3-4.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "DEAR trains its own models from scratch on specific datasets; there is no pre-trained LLM with a training cutoff being evaluated.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states 'no overlap between the two datasets' for CPatMiner (training) and Defects4J (testing), and uses separate held-out splits for large dataset evaluation.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "DEAR is not an LLM evaluated on pre-existing benchmarks; the model is trained from scratch with explicit train/test separation.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Prediction time per candidate patch is reported: 2.4-3.1 seconds for CPatMiner, 3.6-4.2 for BigFix, 2.1 for Defects4J.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Training time stated for each dataset (22+ hours for CPatMiner, 18-19 hours for BigFix) along with hardware (8-core Intel CPU, GTX Titan GPU).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DEAR outperforms all DL-based APR baselines on Defects4J by 42%-683% in terms of auto-fixed bugs using top-1 patches.", + "evidence": "Table 1: DEAR fixes 47 bugs vs. CURE 36, CoCoNuT 33, DLFix 30, Tufano19 14, Sequencer 15, CODIT 6.", + "supported": "strong" + }, + { + "claim": "DEAR is the first DL-based APR model to fix multi-hunk/multi-statement bugs; all prior DL approaches fix zero such bugs.", + "evidence": "Table 2: DLFix, CoCoNuT, and CURE each fix 0 bugs of Types 2-5 on Defects4J, while DEAR fixes 18.", + "supported": "strong" + }, + { + "claim": "DEAR achieves comparable and complementary results to the best pattern-based APR tools on Defects4J.", + "evidence": "Table 7: DEAR fixes 47 bugs vs. Hercules 49 and Tbar 43; Table 8 shows DEAR fixes 7 multi-statement bugs Hercules misses entirely.", + "supported": "strong" + }, + { + "claim": "DEAR requires 7x fewer training parameters than CURE while achieving better repair performance.", + "evidence": "RQ5: DEAR requires 0.39M parameters vs. CURE's 3.1M on CPatMiner (0.42M vs. 3.5M on BigFix), while outperforming CURE on all datasets.", + "supported": "strong" + }, + { + "claim": "Multi-statement expansion contributes to fixing more uniquely challenging bugs than hunk detection.", + "evidence": "Table 9: Without expansion, 7 multi-statement bugs (Types 2,4,5) are lost; without hunk detection, 14 multi-hunk bugs are lost but only 3 of those are of the hardest types (4-5).", + "supported": "moderate" + }, + { + "claim": "DEAR generalizes across datasets, consistently outperforming baselines in cross-dataset evaluation.", + "evidence": "Table 5: DEAR achieves best top-1 results in both cross-dataset directions (7.5% for CPatMiner→BigFix, 9.6% for BigFix→CPatMiner).", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DEAR is the first deep learning-based APR approach to fix multi-hunk, multi-statement bugs through a pipeline combining BERT-based hunk detection, RNN+data-flow multi-statement expansion, and a two-tier tree LSTM with attention and cycle training. On Defects4J it fixes 47 bugs — 31% more than the best DL baseline (CURE at 36) and comparable to top pattern-based tools — while uniquely fixing 18 multi-hunk/multi-statement bugs that all prior DL approaches fail on entirely. The approach requires 7x fewer training parameters than CURE while achieving better results, and performance scales monotonically with training data size across the 70/80/90% split experiments.", + "red_flags": [ + { + "flag": "No statistical tests", + "detail": "All comparative claims rely on raw counts without statistical significance tests, confidence intervals, or p-values. On Defects4J the margin is 11 bugs (47 vs 36), which could plausibly be within chance variation with no significance established." + }, + { + "flag": "CURE reimplemented by authors", + "detail": "The top-performing baseline CURE was 'unavailable' and reimplemented by the DEAR authors, introducing potential implementation bias; a reimplementation may underperform the original." + }, + { + "flag": "No variance across runs", + "detail": "No information on whether experiments were run multiple times; with stochastic training, variance across seeds is unknown and results may not be reproducible." + }, + { + "flag": "Plausible patch ratio", + "detail": "DEAR generates 91 plausible patches vs. 47 correct ones on Defects4J — nearly a 2:1 ratio — raising questions about patch quality and whether test suites provide adequate validation signal." + }, + { + "flag": "Java-only generalization claims", + "detail": "The paper claims key modules are 'language-independent' but all evaluation is Java-only; no evidence supports the generalization claim." + } + ], + "cited_papers": [ + { + "title": "DLFix: Context-Based Code Transformation Learning for Automated Program Repair", + "relevance": "Direct predecessor and primary baseline; DEAR extends DLFix's two-layer tree LSTM with attention, cycle training, and multi-hunk/multi-statement support." + }, + { + "title": "CURE: Code-Aware Neural Machine Translation for Automatic Program Repair", + "relevance": "Top-performing DL APR baseline (ICSE 2021); DEAR outperforms it with 7x fewer training parameters." + }, + { + "title": "CoCoNuT: Combining Context-Aware Neural Translation Models Using Ensemble for Program Repair", + "relevance": "Key DL APR baseline compared across all three datasets; fixes 0 multi-hunk/multi-statement bugs vs. DEAR's 18." + }, + { + "title": "Graph-Based Mining of in-the-Wild, Fine-Grained, Semantic Code Change Patterns (CPatMiner)", + "relevance": "Provides the primary training dataset (44k+ bugs) and AST change detection tool used in DEAR's divide-and-conquer strategy." + }, + { + "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", + "relevance": "Pre-trained BERT fine-tuned for fixing-together hunk detection — the first component of DEAR's fault localization pipeline." + }, + { + "title": "TreeCaps: Tree-Based Capsule Networks for Source Code Processing", + "relevance": "Used for AST subtree summarization in both context building and multi-statement expansion components." + }, + { + "title": "Harnessing Evolution for Multi-Hunk Program Repair (Hercules)", + "relevance": "Top pattern-based APR baseline; DEAR reaches comparable total count while fixing 7 multi-statement bugs Hercules misses entirely." + }, + { + "title": "TBar: Revisiting Template-Based Automated Program Repair", + "relevance": "Pattern-based APR baseline; DEAR fixes 15 more multi-hunk/multi-statement bugs than TBar on Defects4J." + }, + { + "title": "GloVe: Global Vectors for Word Representation", + "relevance": "Used for code token embedding throughout context building, expansion, and transformation learning in DEAR." + }, + { + "title": "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks", + "relevance": "Source of the cycle training methodology adapted by DEAR to improve code transformation learning in the LSTM layers." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "APR tools directly reduce developer effort for bug fixing, though Java-only support and 20+ hour training time limit immediate adoption." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Showing DL-based APR can match pattern-based APR is a meaningful milestone but follows the expected arc of capability improvement in the field." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk implications; this is a software engineering automation tool." + }, + "drama_conflict": { + "score": 0, + "justification": "Standard competitive benchmark evaluation with no controversy or community conflict." + }, + "demo_ability": { + "score": 2, + "justification": "GitHub repository is publicly available, but 20+ hour training time on specialized hardware limits casual experimentation." + }, + "brand_recognition": { + "score": 0, + "justification": "Authors from NJIT and UT Dallas; no association with high-profile AI labs or widely known industry products." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "36018657", + "title": "DarkBERT: A Language Model for the Dark Side of the Internet", + "points": 142, + "comments": 59, + "url": "https://news.ycombinator.com/item?id=36018657" + }, + { + "hn_id": "38162779", + "title": "Category Theory for Programming", + "points": 47, + "comments": 6, + "url": "https://news.ycombinator.com/item?id=38162779" + }, + { + "hn_id": "30565951", + "title": "Improved Approximation Algorithms and Lower Bounds for Search-Diversification", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=30565951" + }, + { + "hn_id": "35994539", + "title": "DarkBERT: A Language Model for the Dark Side of the Internet", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35994539" + }, + { + "hn_id": "36013633", + "title": "DarkBERT: A Language Model for the Dark Side of the Internet", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=36013633" + }, + { + "hn_id": "43311058", + "title": "A programmable environment for shape optimization and shapeshifting problems", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43311058" + }, + { + "hn_id": "46211392", + "title": "A Simple Proof of the Riemann Hypothesis", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46211392" + }, + { + "hn_id": "44067109", + "title": "The effectiveness of Large Language Models in the mechanical design domain", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44067109" + }, + { + "hn_id": "40350177", + "title": "GPT-4 passes most of the 297 written Polish Board Certification Examinations", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40350177" + } + ], + "top_points": 142, + "total_points": 206, + "total_comments": 66 + } +} +\ No newline at end of file diff --git a/papers/declarative-agentic-layer-2026/scan-v5.json b/papers/declarative-agentic-layer-2026/scan-v5.json @@ -0,0 +1,373 @@ +{ + "scan_version": 5, + "paper_type": "position", + "paper": { + "title": "Towards a Declarative Agentic Layer for Intelligent Agents in MCP-Based Server Ecosystems", + "authors": [ + "María Jesús Rodríguez-Sánchez", + "Manuel Noguera", + "Ángel Ruiz-Zafra", + "K. Benghazi" + ], + "year": 2026, + "venue": "arXiv.org", + "arxiv_id": "2601.17435", + "doi": "10.48550/arXiv.2601.17435" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "The abstract claims that agent failures 'do not stem from limitations of the underlying models themselves, but from the absence of explicit architectural structure,' but the paper provides no comparative evidence (ablation, controlled study, or data) to distinguish model limitations from architectural ones. Section 2 cites existing survey results but does not isolate architectural factors.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper makes causal claims ('declarative grounding enables reproducible workflows,' 'introducing explicit grounding can constrain behaviour without limiting expressiveness') without justification. Section 4 acknowledges the illustrative scenario is 'not to demonstrate optimisation or performance.' No ablation, comparison, or empirical test provided.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper claims DALIA is 'model-independent,' applies to 'heterogeneous environments' and 'dynamic environments,' but evidence is limited to one toy scenario (restaurant booking). Scope of applicability is not bounded to the tested/argued setting.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss why other structural approaches (e.g., hierarchical planning, constraint-based reasoning, adaptive prompting) might address the core problem. Alternative explanations for MAS failures are not considered.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The paper claims DALIA improves 'reliability,' 'robustness,' 'reproducibility,' and 'verifiability,' but admits in Section 5: 'empirical evaluation is required to assess how declarative grounding affects reliability, robustness and failure rates.' No actual outcomes are measured.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Section 5 is titled 'Discussion and Future Directions,' not 'Limitations.' One sentence mentions 'DALIA does not prescribe how task graphs are generated internally,' but no dedicated limitations section exists.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Specific threats are not discussed. The paper states 'further work is required' and lists gaps (need for empirical evaluation, richer semantics, decentralized coordination) as future directions, but does not analyze validity threats in the current proposal.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what it does NOT address: how the LLM generates the user goal, error recovery strategies during execution, or how the orchestrator maps abstract goals to ATDP tasks. Scope boundaries are implicit, not stated.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding section or acknowledgments section present. No funding sources are disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": false, + "justification": "All authors list Universidad de Granada as affiliation. No conflict-of-interest statement or disclosure of affiliations with MCP or related tools is provided.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding disclosed; cannot assess independence.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting related to MCP or agent systems) is provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined in context: 'Agent' as 'execution entities rather than autonomous planners' (§3.3), 'Capability' with formal schema (§3.1), 'Task' as 'higher-level objective fulfilled through declared capabilities' (§3.2). MCP is explained as a 'lightweight standard for tool discovery.'", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The abstract explicitly states four contributions: (1) formal semantic model of capabilities, (2) Agentic Task Discovery Protocol, (3) federated Agent Directory, (4) deterministic task orchestration. The intended contribution is unambiguous.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 reviews Cemri et al. (MAS failures), MAST taxonomy, task graph literature, AFlow/MAS-GPT, MCP standard, and ScaleMCP. Table 1 systematically maps prior limitations to DALIA's proposed solutions. Strong engagement demonstrated.", + "source": "haiku" + } + } + }, + "type_checklist": { + "position": { + "argument_quality": { + "argument_internally_consistent": { + "applies": true, + "answer": true, + "justification": "Core argument is consistent: MAS failures stem from architectural lack of structure → declarative grounding provides structure → workflows improve. Sections 2–4 develop this logically without contradictions.", + "source": "haiku" + }, + "counterarguments_addressed": { + "applies": true, + "answer": false, + "justification": "The paper does not engage with counterarguments: Why won't better prompting address these issues? Why is architectural change necessary rather than improved models? These are implicitly dismissed but never argued against.", + "source": "haiku" + }, + "analogies_appropriate": { + "applies": true, + "answer": true, + "justification": "The closing mentions alignment with 'service-oriented architectures and workflow systems,' which is an apt structural analogy. No false equivalences or extended inappropriate analogies.", + "source": "haiku" + }, + "prescriptions_proportional": { + "applies": true, + "answer": true, + "justification": "The prescription (adopt declarative grounding in agentic architectures) is proportional to the argument (evidence of structural failures in MAS). No sweeping policy claims or overreach.", + "source": "haiku" + }, + "evidence_for_claims_cited": { + "applies": true, + "answer": true, + "justification": "Factual claims are cited: Cemri et al. [5] for failure rates, [4] and [9] for task graph issues, [3], [11], [14] for MCP limitations. Citations support empirical assertions.", + "source": "haiku" + }, + "alternatives_discussed": { + "applies": true, + "answer": false, + "justification": "Section 5 mentions alternative orchestration strategies (symbolic planners, heuristic search, rule-based, LLM-driven) but does NOT discuss alternative solutions to the core problem: prompt engineering, larger models, hybrid human-AI systems, or hierarchical planning approaches.", + "source": "haiku" + }, + "historical_context_accurate": { + "applies": true, + "answer": true, + "justification": "References to LLM capabilities, MCP emergence, multi-agent frameworks (MetaGPT, ChatDev, AgentVerse) are accurate. No historical inaccuracies detected.", + "source": "haiku" + } + }, + "clarity_and_scope": { + "key_terms_defined_precisely": { + "applies": true, + "answer": true, + "justification": "Key terms are defined with precision: 'capability' includes formal attributes (role, domain, inputs, outputs, preconditions, postconditions); 'task' is explicitly tied to capability composition; 'agent' is defined as execution entity, not planner.", + "source": "haiku" + }, + "engages_with_existing_literature": { + "applies": true, + "answer": true, + "justification": "Section 2 cites and compares against six major research directions (MAST, CODER, AFlow, MCP, ScaleMCP, agentic AI surveys). Table 1 shows explicit tradeoff analysis.", + "source": "haiku" + }, + "intended_audience_clear": { + "applies": true, + "answer": true, + "justification": "The paper is clearly targeted at researchers and architects building agentic AI systems, particularly those working with MCP-based ecosystems. Audience is implicit but unambiguous from scope.", + "source": "haiku" + }, + "assumptions_stated": { + "applies": true, + "answer": false, + "justification": "The paper assumes MCP servers will provide DALIA-compatible metadata, that tasks can be declared upfront, and that deterministic planning is preferable to adaptive LLM reasoning. These assumptions are not made explicit.", + "source": "haiku" + }, + "scope_of_applicability_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss where declarative grounding is applicable vs. not applicable. Does it work for open-ended exploration? For real-time dynamic task generation? For adversarial environments? Scope is not discussed.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Multi-agent LLM systems have failure rates between 41% and 86% across MAS frameworks", + "evidence": "Cemri et al. analysis of 1,642 executions across seven frameworks (cited [5])", + "supported": "strong" + }, + { + "claim": "LLM-based agents produce hallucinated actions and unexecutable task graphs because they lack grounding in real capabilities", + "evidence": "Cited prior work [4], [9]; noted in Section 2 but not directly empirically demonstrated in this paper", + "supported": "moderate" + }, + { + "claim": "MCP tools lack semantic relationships, task structure, and multi-server coordination capabilities", + "evidence": "Technical observation; cited extensions (ScaleMCP [11], MCPEval [14]) acknowledged as operational only, not semantic (Section 2)", + "supported": "moderate" + }, + { + "claim": "Declarative grounding of capabilities, tasks, and agents reduces hallucinated actions and invalid task graphs", + "evidence": "Proposed by DALIA design; no empirical validation provided; admitted in Section 5 that 'empirical evaluation is required'", + "supported": "unsupported" + }, + { + "claim": "DALIA enables reproducible and verifiable agentic workflows across heterogeneous environments", + "evidence": "Architectural argument; single illustrative restaurant booking scenario (acknowledged as illustrative, not evaluative)", + "supported": "unsupported" + }, + { + "claim": "Separating discovery, planning, and execution prevents speculative reasoning and improves reliability", + "evidence": "Design principle stated; no ablation or comparative study provided", + "supported": "unsupported" + }, + { + "claim": "Task graphs generated without awareness of real capabilities result in incoherent routes and hallucinated operations", + "evidence": "Cited prior work [4], [9]; empirical examples not provided", + "supported": "moderate" + }, + { + "claim": "MAS failures stem from lack of architectural structure, not model limitations", + "evidence": "Inferred from Cemri et al. MAST taxonomy (system design, agent alignment, verification failures); not directly proven by ablation", + "supported": "moderate" + } + ], + "methodology_tags": [ + "position", + "theoretical" + ], + "key_findings": "This paper proposes DALIA, a declarative architectural layer for grounding LLM-based agent systems through explicit specification of capabilities, tasks, and agent roles. Rather than relying on linguistic reasoning for planning and coordination, DALIA separates discovery, planning, and execution into distinct phases governed by declarative metadata. The authors motivate DALIA by citing high failure rates (41-86%) in existing multi-agent systems and limitations of MCP's tool abstraction, but provide only an illustrative scenario (restaurant booking) and no empirical validation of the proposed approach. The paper acknowledges that 'empirical evaluation is required' and does not implement or test DALIA.", + "red_flags": [ + { + "flag": "No empirical validation", + "detail": "The paper admits in Section 5: 'empirical evaluation is required to assess how declarative grounding affects reliability, robustness and failure rates.' No controlled comparison, ablation study, or implementation testing is provided." + }, + { + "flag": "Single toy scenario", + "detail": "The illustrative scenario (restaurant booking with two steps) is acknowledged as not demonstrating 'optimisation or performance.' It is too simple to validate claims about heterogeneous multi-agent systems." + }, + { + "flag": "Vague on implementation details", + "detail": "How does ATDP actually work? How are task graphs 'synthesized deterministically'? Section 5 states 'internal reasoning mechanisms...intentionally left unspecified,' undermining the core argument that explicit structure matters." + }, + { + "flag": "Adoption problem not addressed", + "detail": "The paper assumes MCP servers will voluntarily provide DALIA-compatible metadata (capability schemas, ATDP endpoints, Agent Directory entries). No discussion of incentives, migration path, or compatibility with existing MCP ecosystems." + }, + { + "flag": "Circular reasoning on causation", + "detail": "The paper defines the problem as 'lack of declarative structure' and proposes 'declarative structure' as solution, but provides no causal evidence that structure—not prompt quality, not model capacity—is the binding constraint." + }, + { + "flag": "Missing conflict-of-interest disclosures", + "detail": "No funding source, no competing interests statement, no disclosure of affiliations with MCP tools or agent frameworks. All required for position papers proposing adoption of specific architectures." + }, + { + "flag": "No limitations section", + "detail": "Section 5 is titled 'Discussion and Future Directions' rather than 'Limitations.' One mention of 'current limitations' but no systematic analysis of scope boundaries or threats to the proposal's validity." + }, + { + "flag": "Implicit assumptions not stated", + "detail": "Assumes tasks can be declared upfront, deterministic planning is preferable to adaptive reasoning, and that 'hallucinated actions' are primarily an architectural problem. These are not made explicit for reader evaluation." + } + ], + "cited_papers": [ + { + "title": "Why do multi-agent LLM systems fail? (MAST: Multi-Agent System Taxonomy)", + "authors": "Eren Cemri et al.", + "year": 2025, + "relevance": "Empirical taxonomy of 1,642 MAS executions identifying failure modes in system design, agent alignment, and verification—core motivation for DALIA" + }, + { + "title": "CODER: Issue resolving with multi-agent and task graphs", + "authors": "Dong Chen et al.", + "year": 2024, + "relevance": "Demonstrates task graph approach to coordinating agents; cited for limitation that LLM-generated graphs are incoherent or ungrounded" + }, + { + "title": "AFlow: Large language models as multi-agent system engineers", + "authors": "Leyang Zhang et al.", + "year": 2024, + "relevance": "Example of LLM-generated MAS producing incoherent structures; motivates need for explicit grounding" + }, + { + "title": "Graphs meet AI agents: Taxonomy, progress, and future opportunities", + "authors": "Yuanchen Bei et al.", + "year": 2025, + "relevance": "Demonstrates LLMs cannot reliably generate task graphs; cites evidence that unconstrained language modelling produces incomplete or incoherent plans" + }, + { + "title": "LLM-based multi-agent systems for software engineering: Literature review, vision, and the road ahead", + "authors": "Junda He, Christoph Treude, David Lo", + "year": 2025, + "relevance": "Survey emphasizing that agentic AI systems operate without explicit grounding in available actions—core problem DALIA addresses" + }, + { + "title": "Generative to agentic AI: Survey, conceptualization, and challenges", + "authors": "Jonas Schneider et al.", + "year": 2024, + "relevance": "Prior survey arguing agentic architectures require planning, memory, and tool use; context for DALIA's architecture proposal" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "DALIA is proposed as a practical architectural layer for production multi-agent systems, but no implementation, reference code, or prototype is released or available, limiting immediate practitioner use." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The paper argues for declarative structure over prompt-driven orchestration, but this is more 'engineering best practice' than a contrarian position. Not framed as challenging conventional wisdom." + }, + "fear_safety": { + "score": 0, + "justification": "No discussion of safety implications, alignment, or risk. The paper focuses on reliability and verifiability, not safety concerns." + }, + "drama_conflict": { + "score": 0, + "justification": "A technical architecture proposal with no controversy, debate, or conflicting perspectives. No drama angle." + }, + "demo_ability": { + "score": 0, + "justification": "No reference implementation, no code release, no working prototype. The illustrative scenario is described in prose, not demonstrated interactively." + }, + "brand_recognition": { + "score": 1, + "justification": "Universidad de Granada is a legitimate institution, but not a major AI lab. Authors are not well-known in agentic AI research." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "46802368", + "title": "Show HN: If You Want Coherence, Orchestrate a Team of Rivals: Multi-Agent \"", + "points": 10, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46802368", + "created_at": "2026-01-28T22:16:52Z" + }, + { + "hn_id": "46776398", + "title": "The 17% Gap: Quantifying Epistemic Decay in AI-Assisted Survey Papers", + "points": 1, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=46776398", + "created_at": "2026-01-27T06:58:02Z" + }, + { + "hn_id": "46913890", + "title": "Predicting Zero-Shot Classification Performance for Arbitrary Queries", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46913890", + "created_at": "2026-02-06T15:19:31Z" + } + ], + "top_points": 10, + "total_points": 12, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/decoding-latent-attack-2025/scan-v5.json b/papers/decoding-latent-attack-2025/scan-v5.json @@ -0,0 +1,576 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Decoding Latent Attack Surfaces in LLMs: Prompt Injection via HTML in Web Summarization", + "authors": [ + "Ishaan Verma", + "Arsheya Yadav" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2509.05831", + "doi": "10.48550/arXiv.2509.05831" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 'significant proportion of injected pages led to measurable semantic and stylistic shifts' are supported by ROUGE-L (0.301–0.327) and SBERT metrics (0.694–0.698), plus 15.71–29.29% injection success rates across both models.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Paper makes causal claims ('injections manipulate outputs'). Study design (282 pages, half clean/half injected, controlled comparison) is appropriate for demonstrating causal effect in security evaluation, though not a randomized controlled trial.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Scope bounded to web summarization with HTML injections on 2 models and 8 attack vectors. Title and abstract appropriately scoped; conclusions limited to 'web-based LLM pipelines' without overgeneralizing to all LLM tasks.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper demonstrates injections work (pirate example) but does not discuss alternative explanations: Is the model following explicit instructions, or simply sensitive to certain text patterns? Why Llama is more vulnerable than Gemma (architectural differences?) is noted but not explained.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Measured outcomes (ROUGE-L divergence, SBERT similarity, manual annotation) directly proxy claimed outcome (injection-induced output manipulation). Distinction between measurement and claim is clear and reasonable.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated 'Limitations' or 'Threats to Validity' section in the paper. A 'Future Work' section exists but does not address methodological constraints of the current study.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats discussed. Gaps not mentioned: manual annotation reliability (no inter-rater agreement reported), sample size justification (282 pages, 8 techniques, 2 models), generalization to other tasks/domains, or limitation of synthetic vs. real web content.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper does not explicitly state what it does NOT show. No discussion of: generalization to other LLM architectures, applicability to fine-tuned models, relevance to other tasks beyond summarization, or limits of the 8 injection techniques tested.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source mentioned anywhere in paper. No acknowledgments section or statement of funding/lack thereof. Funding disclosure missing.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations disclosed: both from Manipal University Jaipur, Department of Computer Science and Department of Data Science. No obvious financial conflict with Meta (Llama) or Google (Gemma).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder identified; cannot assess independence.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, no mention of patents, equity stakes, consulting arrangements, or financial relationships. Standard COI declaration missing.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined: 'Prompt injection' as 'specially crafted inputs designed to manipulate LLM behavior,' 'HTML-based prompt injection' with concrete examples (aria-labels, meta tags, alt-text). Terms sufficiently precise for context.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Core objective explicitly stated: 'empirically assess the susceptibility of state-of-the-art LLMs to prompt injection attacks delivered through web content.' Contribution is systematic evaluation on 282 pages with 8 techniques, addressing gap in HTML-based attack research.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Related work section engages with prior prompt injection research (Liu et al., OWASP), HTML adversarial attacks (Tao et al.), and LLM robustness evaluation (Yang et al.). Connection to this work's novelty (HTML-based attacks in web summarization) is established, though somewhat list-based rather than deeply integrated.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code released on GitHub: evaluation.py, file_generation.py, and supporting scripts listed in appendix. Repository URL: https://github.com/ishaanv1206/Decoding-Latent-Attack-Surfaces-in-LLMs-Prompt-Injection-via-HTML-in-Web-Summarization", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Dataset of 282 HTML pages (clean/ and injected/ directories) publicly available on GitHub Pages. Metadata in CSV format (metadata.csv, gemma.csv, llama.csv) included.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Specific tools named (Playwright, all-MiniLM-L6-v2 SentenceTransformer) and model versions identified with references. However, no requirements.txt, environment.yml, Python version, or comprehensive dependency list provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Repository contents listed but no step-by-step reproduction instructions provided. Scripts mentioned (evaluation.py, file_generation.py) but not accompanied by explicit 'run these commands in this order' guidance.", + "source": "haiku" + } + }, + "statistical_methodology": { + "applies": true, + "answer": false, + "justification": "Only average metrics reported (ROUGE-L 0.301–0.327, SBERT 0.694–0.698, success rates 15.71–29.29%). No confidence intervals, significance tests, standard deviations, effect size metrics, or sample size justification provided.", + "source": "haiku" + }, + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 1 provides only means: average ROUGE-L and SBERT values with no confidence intervals, standard errors, or error bars. Success rates (29.29%, 15.71%) reported as point estimates.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Models compared (Llama 29.29% vs Gemma 15.71% success) but no statistical significance tests (t-test, chi-square, etc.) reported. Differences could be due to chance or dataset-specific factors.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Raw differences stated (e.g., 13.58 percentage point gap in success rates, ROUGE-L difference of 0.0259) but no formal effect size metrics (Cohen's d, odds ratio, etc.) reported.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "282 pages chosen with no stated justification. No power analysis, minimum sample size calculation, or discussion of adequacy for the claims made.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Only averages reported across all metrics. No standard deviations, ranges, quartiles, or per-page variance for ROUGE-L or SBERT scores.", + "source": "haiku" + }, + "evaluation_design": { + "applies": true, + "answer": true, + "justification": "Design includes baselines (clean pages), multiple metrics (ROUGE-L, SBERT, manual), 8 injection technique variants (ablation-like), and human evaluation of injection success. Missing: per-category breakdown and systematic reporting of failure cases.", + "source": "haiku" + }, + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Clean version of each page serves as implicit baseline. Comparison of clean vs. injected summaries is the core evaluation design.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baseline is the same page without injection—within-subject comparison, appropriate for security evaluation. No comparison to external baselines required.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "8 distinct injection techniques tested (meta tag, comment, hidden div, base64-encoded attribute, ARIA label, opacity div, hidden script, alt text). Provides systematic comparison of attack vector effectiveness.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Three evaluation approaches: ROUGE-L (lexical), SBERT cosine similarity (semantic), manual annotation (behavioral). Provides multifaceted assessment of injection impact.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Manual annotation of injection success: 'reviewing the LLM's summary output for evidence that the injected prompt had influenced or manipulated the model's response.' Humans evaluated system outputs for injection success.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Not a prediction task with train/test split. Security evaluation on synthetic pages with no train/test distinction.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "10 content categories tested (blogs, FAQs, news articles, docs, product listings, profiles, privacy policies, tutorials, reviews, careers) but results not broken down by category. No per-category success rates or per-category metric comparisons.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Successful injections reported but no detailed analysis of failure cases. Hidden script tag had 0% success on Gemma and 2/140 (1.4%) on Llama, noted briefly but not analyzed.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Gemma's lower vulnerability (15.71% vs 29.29%) presented as finding. Hidden script tag failures noted. Some negative results included, though not systematically documented.", + "source": "haiku" + }, + "setup_transparency": { + "applies": true, + "answer": false, + "justification": "Model versions identified with references (Llama 4 Scout-17B-16E, Gemma 2 9B IT), but actual summarization prompt text not provided—only high-level description. Hyperparameters (temperature, top_p, etc.) not specified. Data extraction method outlined but implementation not documented.", + "source": "haiku" + }, + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model versions identified: Llama 4 Scout (ref. [19] shows '17B-16E'), Gemma 9B IT (ref. [18] shows 'Gemma-2-9b-IT'). Versions traceable via Hugging Face references.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Only high-level description given: 'standardized prompt instructing the LLM to generate a one-paragraph summary.' Actual prompt text (temperature settings, exact wording, stop tokens) not provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top_p, max_tokens, frequency_penalty, presence_penalty, or other sampling hyperparameters specified. Critical for reproducibility.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding (no tool use, no multi-step reasoning, no chain-of-thought). Direct LLM summarization with no intermediate steps.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Playwright extraction method outlined ('Full HTML Source' and 'Rendered Visible Text') but implementation details missing: tokenization strategy, whitespace handling, text cleaning steps not documented.", + "source": "haiku" + }, + "data_integrity": { + "applies": true, + "answer": true, + "justification": "Raw HTML pages available on GitHub for inspection. Data generation process documented (282 pages across 10 categories, 8 injection techniques). Full pipeline described from page generation through metric computation.", + "source": "haiku" + }, + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Raw HTML files (clean/ and injected/ directories) publicly available on GitHub. CSV metadata (metadata.csv, gemma.csv, llama.csv) and evaluation scripts provided for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data generation process described: synthetic HTML pages across 10 realistic content categories, styled with 'authentic CSS to enhance realism.' 8 distinct injection techniques systematically applied.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants or recruitment. Manual annotation of injection success is a post-hoc assessment, not participant recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline documented: HTML generation → injection → hosting on GitHub Pages → Playwright extraction → LLM summarization → metric computation. Described as 'fully automated using Python scripts.' High-level logic clear; implementation details in code.", + "source": "haiku" + }, + "contamination": { + "applies": false, + "answer": false, + "justification": "Custom synthetic dataset, not standard benchmarks. Train/test contamination risk not applicable. However, no discussion of whether injection techniques or attack strategies might exist in training data.", + "source": "haiku" + }, + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Synthetic custom data unlikely to be in training set. Not evaluating on published benchmarks; standard contamination risk analysis not applicable.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Custom synthetic dataset; standard train/test overlap risks not applicable.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not using published benchmarks; contamination risk not applicable.", + "source": "haiku" + }, + "human_studies": { + "applies": false, + "answer": false, + "justification": "No human subjects beyond researchers. Manual annotation is post-hoc assessment of system outputs, not a human subject study requiring ethics approval or participant consent.", + "source": "haiku" + }, + "cost_and_practicality": { + "applies": true, + "answer": false, + "justification": "No inference cost, API call costs, latency, or total computational budget reported. Methodology described as 'scalable' but no actual resource consumption quantified.", + "source": "haiku" + }, + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No mention of inference cost, API pricing, latency per page, or total cost to run experiments on 282 pages × 2 models.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget, GPU hours, token usage, or cost metrics provided. Practical resource requirements for reproduction not stated.", + "source": "haiku" + } + } + }, + "claims": [ + { + "claim": "HTML-based prompt injections can successfully manipulate LLM summarization outputs", + "evidence": "29.29% of injections successful for Llama 4 Scout, 15.71% for Gemma 9B IT across 282 test pages. Pirate example shows injection instruction directly influences summary tone.", + "supported": "strong" + }, + { + "claim": "Meta tags and opacity-zero divs are the most effective HTML injection vectors", + "evidence": "Meta tag injections: 17/140 successful for Llama, 6/140 for Gemma. Opacity div: 10/140 and 9/140 respectively. Both techniques outperform hidden script (2/140 and 0/140).", + "supported": "strong" + }, + { + "claim": "Llama 4 Scout is significantly more vulnerable to HTML-based prompt injections than Gemma 9B IT", + "evidence": "Llama success rate 29.29% vs Gemma 15.71% (13.58pp gap). Meta tag vulnerabilities particularly divergent: 17 vs 6 successes.", + "supported": "strong" + }, + { + "claim": "Successful injections produce substantial lexical and semantic divergence in summaries", + "evidence": "ROUGE-L scores 0.301–0.327 indicate moderate lexical divergence. SBERT cosine similarity 0.694–0.698 shows semantic shifts. Pirate example demonstrates stylistic transformation.", + "supported": "moderate" + }, + { + "claim": "Conventional input sanitization methods are insufficient to mitigate HTML-based prompt injections", + "evidence": "Injections succeed using standard HTML techniques (aria-labels, meta tags, comments) without explicit comparison to sanitization approaches. Implicit from results.", + "supported": "moderate" + }, + { + "claim": "The evaluation is reproducible and scalable", + "evidence": "Code, data, and evaluation scripts released on GitHub. Described as 'fully automated using Python scripts.' However, no step-by-step reproduction instructions provided.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "This study demonstrates that HTML-based prompt injections represent a significant vulnerability in LLM-powered web summarization systems. Meta tags and opacity-zero divs proved most effective, with Llama 4 Scout showing 29.29% injection success compared to Gemma 9B IT's 15.71%. Successful injections induced substantial lexical and semantic shifts measured by ROUGE-L and SBERT metrics, with the pirate example illustrating how invisible HTML instructions can fundamentally alter model behavior. The finding that architecture significantly influences vulnerability suggests no single defense mechanism provides blanket protection against such attacks.", + "red_flags": [ + { + "flag": "No statistical significance testing", + "detail": "Success rates compared (29.29% vs 15.71%) without confidence intervals or statistical hypothesis tests. Differences could reflect sampling variation or dataset composition rather than true architectural vulnerability." + }, + { + "flag": "Manual annotation lacks inter-rater reliability", + "detail": "Injection success determined manually without reporting inter-rater agreement, number of annotators, or disagreement resolution procedure. Single annotator bias possible." + }, + { + "flag": "Limited model scope", + "detail": "Only 2 models tested (Llama, Gemma). Generalization to GPT, Claude, other architectures unknown. Findings may be model-specific." + }, + { + "flag": "No per-category analysis", + "detail": "Results not broken down by content type (blogs, FAQs, docs, etc.), so differential vulnerability patterns across content categories unknown." + }, + { + "flag": "Missing hyperparameters", + "detail": "Temperature, top_p, max_tokens, and other sampling parameters not specified. Small parameter changes can drastically affect injection susceptibility, limiting reproducibility." + }, + { + "flag": "No defense mechanism evaluation", + "detail": "Paper identifies vulnerabilities but provides no empirical testing of proposed defenses (input sanitization, prompt engineering, etc.). Only future work mentioned." + }, + { + "flag": "Sample size unjustified", + "detail": "282 pages with no power analysis, minimum sample size justification, or discussion of adequacy for the claims. Relatively small for broad vulnerability assessment." + }, + { + "flag": "Actual summarization prompt not provided", + "detail": "Only high-level description given; exact prompt text, temperature, and stop tokens not disclosed, preventing exact reproduction." + } + ], + "cited_papers": [ + { + "title": "Automatic and Universal Prompt Injection Attacks against Large Language Models", + "authors": "Liu et al.", + "year": 2024, + "relevance": "Foundational prompt injection attack framework demonstrating vulnerabilities in defense-equipped LLMs" + }, + { + "title": "Prompt Injection attack against LLM-integrated Applications", + "authors": "Liu et al.", + "year": 2023, + "relevance": "Goal-guided generative methods for amplifying divergence between clean and adversarial outputs" + }, + { + "title": "LLM01:2025 Prompt Injection - OWASP Gen AI Security Project", + "relevance": "Industry security framework cataloguing real-world prompt injection attack scenarios and threat models" + }, + { + "title": "Adversarial Examples in Cybersecurity: A Survey", + "authors": "Li, S.", + "year": 2020, + "relevance": "Adversarial attack techniques from cybersecurity domain applicable to HTML-based manipulations" + }, + { + "title": "Raze to the Ground: Query-Efficient Adversarial HTML Attacks on Machine-Learning Phishing Webpage Detectors", + "authors": "Tao et al.", + "year": 2023, + "relevance": "Empirical study demonstrating HTML-based adversarial manipulations can evade conventional sanitization and bias ML-based content detection" + }, + { + "title": "Prompt Injection Attacks on Large Language Models in Realistic Settings", + "authors": "Clusmann et al.", + "year": 2024, + "relevance": "Real-world examples of prompt injection vulnerabilities in deployed LLM systems (e.g., Bing Chat system prompt leakage)" + }, + { + "title": "Evaluating and Improving Robustness in Large Language Models: A Survey", + "authors": "Yang et al.", + "year": 2024, + "relevance": "Survey of LLM robustness evaluation methodologies and metrics for adversarial assessment" + }, + { + "title": "Retrieval-Augmented In-Context Learning Attacks and Defenses", + "authors": "Yu et al.", + "year": 2024, + "relevance": "Vulnerabilities of RAG systems to embedded adversarial prompts in retrieved context" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Web summarization deployed in real systems (search, content aggregators), but attack requires raw HTML access—moderate applicability to web-integrated LLM pipelines" + }, + "surprise_contrarian": { + "score": 1, + "justification": "HTML-based injection is natural once considered; novelty lies in systematic evaluation rather than discovering the vulnerability class itself" + }, + "fear_safety": { + "score": 2, + "justification": "Raises legitimate concerns about untrusted web content and difficulty of HTML sanitization in production LLM systems" + }, + "drama_conflict": { + "score": 1, + "justification": "Technical security evaluation with no particular controversy, drama, or conflict narrative angle" + }, + "demo_ability": { + "score": 3, + "justification": "GitHub repository with HTML pages and evaluation code immediately reproducible; easy to test on other models or injection vectors" + }, + "brand_recognition": { + "score": 0, + "justification": "Authors from Manipal University Jaipur (non-prominent AI institution); no affiliation with major tech companies or well-known research labs" + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44171652", + "title": "Oh fuck! How do people feel about robots that leverage profanity?", + "points": 18, + "comments": 50, + "url": "https://news.ycombinator.com/item?id=44171652" + }, + { + "hn_id": "41597663", + "title": "Breaking ReCAPTCHAv2", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41597663" + }, + { + "hn_id": "44211549", + "title": "Oracular Programming: A Modular Foundation for Building LLM-Enabled Software", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44211549" + }, + { + "hn_id": "41571318", + "title": "Breaking ReCAPTCHAv2", + "points": 3, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=41571318" + }, + { + "hn_id": "42708072", + "title": "MiniMax-01: Scaling Foundation Models with Lightning Attention", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=42708072" + }, + { + "hn_id": "42680545", + "title": "Mlkaps: Machine Learning and Adaptive Sampling for HPC Kernel Auto-Tuning", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42680545" + }, + { + "hn_id": "41604215", + "title": "Radio Technosignature Search of Trappist-1 with the Allen Telescope Array", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41604215" + }, + { + "hn_id": "37569675", + "title": "RL for Supply Chain Attacks Against Frequency and Voltage Control", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37569675" + }, + { + "hn_id": "45274922", + "title": "Candidates evoke identity and issues on TikTok", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45274922" + }, + { + "hn_id": "44847789", + "title": "SortBench: Benchmarking LLMs based on their ability to sort lists", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44847789" + } + ], + "top_points": 18, + "total_points": 46, + "total_comments": 55 + } +} +\ No newline at end of file diff --git a/papers/decoding-ml-decision-2026/scan-v5.json b/papers/decoding-ml-decision-2026/scan-v5.json @@ -0,0 +1,348 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System", + "authors": [ + "Longfei Yun", + "Yihan Wu", + "Haoran Liu", + "Xiaoxuan Liu", + "Ziyun Xu", + "Yi Wang", + "Yang Xia", + "Pengfei Wang", + "Mingze Gao", + "Yunxiang Wang", + "Changfan Chen", + "Junfeng Pan" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.18640", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims that GEARS 'identifies superior, near-Pareto-efficient policies' and 'maintains rigorous deployment stability.' Table 1 shows GEARS achieves 0.94 nDCG@1 vs 0.77 for second-best baseline. Section 4.3 describes stability validation hooks measuring feature drift over 6 months.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper claims GEARS 'discovers optimal trade-off policies' with causal language ('GEARS addresses...', 'enables agents'). Main evaluation (Table 1) is offline policy ranking on synthetic instructions with algorithmic ground truth, not causal validation. Real-world results (Table 3, Section 5) mention 'statistically significant lift' but provide no p-values, confidence intervals, or experimental details. Study design insufficient for causal claims.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Paper claims GEARS is 'a new standard for AI-driven ranking infrastructure' and describes it as 'general-purpose.' However, all evaluation uses internal Meta ranking systems (20 anonymous internal experiments). Title and abstract don't indicate Meta-specific scope. Generalization to other domains is unsupported.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper compares against 5 baselines and ablates two components (Bash filtering, Skills), but doesn't discuss why GEARS succeeds mechanistically. Why does Bash filtering alone achieve 0.40 nDCG@1 while adding skills jumps to 0.94? The 'context rot' problem is named but not analyzed. Ablation results suggest deterministic filtering is doing most of the work, but this isn't explored.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Main evaluation measures policy ranking quality (nDCG, Precision@K) on 100 synthetic instructions with ground-truth top-5 policies. Claims are about 'production reliability' and 'metric improvements.' Table 3 provides vague percentage improvements (0.011%–0.37%) without baselines, confidence intervals, or surface identification. Measured outcome (ranking quality) does not validate claimed outcome (production impact).", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations, threats-to-validity, or scope-boundaries section. Paper discusses challenges it addresses (feature instability, context rot) but not limitations of its own evaluation methodology.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats to validity discussed. The paper mentions problems it solves (brittleness, instability) but doesn't discuss whether its evaluation captures these problems or whether results would generalize to other ranking systems.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper states 'While GEARS is designed as a general framework, personalization represents a particularly illustrative application.' But this doesn't explicitly bound results to personalization or acknowledge that all evaluation is on internal Meta systems only.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No explicit funding disclosure or acknowledgments section visible in the paper. All authors affiliated with Meta but no statement of funding source.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors clearly listed as Meta affiliation. However, this creates an undisclosed conflict: Meta employees are evaluating Meta's own ranking systems on internal data.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Meta (employer) is the funder and the beneficiary of positive results. Evaluation uses only internal Meta data on internal Meta ranking systems. No external validation or independent evaluation.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial disclosure statement included. Meta employees benefit directly from demonstrating effectiveness of Meta's ranking systems.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "Core terms used imprecisely. 'Vibe Optimization' defined as 'operators guide systems through high-level intent' but not formally specified. 'Near-Pareto-efficient' introduced without mathematical definition. 'Specialized Agent Skills' described abstractly as 'modular resources' but no concrete operational definition provided.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions explicitly stated: (1) Agentic ranking framework reformulating optimization as autonomous discovery, (2) Skill-based architecture externalizing domain expertise, (3) Production validation demonstrating real-world effectiveness. Intentions are clear even if execution has methodological gaps.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "Related work lists HTE methods (S-learner, T-learner, tree-based, neural), adaptive experimentation, and LLM reasoning techniques. However, comparisons are superficial ('blind to engineering context,' 'suffer from hallucination'). Baselines in Table 1 (CoT, Self-Refine) are not the main prior work in personalization/ranking—they're general prompting strategies. No quantitative comparison to traditional HTE or ranking methods.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "applies": true, + "answer": false, + "justification": "No code released, no data released, no environment specs (requirements.txt, Dockerfile), no reproduction instructions. Paper uses internal Meta systems and data. Authors mention Claude Sonnet but not version or exact model specifications.", + "source": "haiku" + }, + "statistical_methodology": { + "applies": true, + "answer": false, + "justification": "Critical gaps: Table 1 (main results) lacks confidence intervals or significance tests—only point estimates (0.94, 0.96, etc.). Table 3 shows percentage improvements without error bars or baselines. Table 4 includes standard errors but no p-values. No sample size justification (20 experiments → 100 instructions, why sufficient?). No power analysis. NDCG and ranking metrics reported without uncertainty bounds.", + "source": "haiku" + }, + "evaluation_design": { + "applies": true, + "answer": false, + "justification": "Multiple concerns: (1) Baselines are general LLM prompting techniques (CoT, Self-Refine), not domain-specific ranking baselines. (2) No breakdown of results by instruction type (Maximize Both, Maximize with Constraint, etc.) despite creating 5 types. (3) Test set not clearly held-out from GEARS development—all 100 instructions derived from same 20 experiments used for GAS development. (4) No failure case analysis or discussion of when GEARS selects poor policies. (5) Ground truth determined algorithmically from same data source.", + "source": "haiku" + }, + "setup_transparency": { + "applies": true, + "answer": false, + "justification": "Model specification vague: 'Claude Sonnet (ant)' lacks version number or date. No prompts provided to LLM agent. Hyperparameters incomplete: tolerance τ in Algorithm 1 not specified for experiments; Self-Consistency temperature 0.7 mentioned but other baselines' hyperparameters not detailed. Scaffolding (Skills, Governance hooks) described abstractly without concrete examples or actual hook implementations. Data preprocessing (how were instructions synthesized, ground truth computed) not detailed.", + "source": "haiku" + }, + "data_integrity": { + "applies": true, + "answer": false, + "justification": "Raw internal Meta data not available for verification. Data collection procedure not described—paper states 'we constructed a benchmark dataset' but doesn't explain how the 20 base experiments were collected or what they measured. Data pipeline (experiments → GAS → candidates → instructions) partially described but full lineage unclear.", + "source": "haiku" + }, + "contamination": { + "applies": false, + "answer": false, + "justification": "Not evaluating language model pretraining on public benchmarks. However, train-test split concern: unclear whether 100 synthetic instructions are held-out from GEARS development. If GEARS LLM was trained/finetuned on Meta experimentation patterns, evaluation on derived instructions risks overfitting.", + "source": "haiku" + }, + "human_studies": { + "applies": false, + "answer": false, + "justification": "No human subjects. Evaluation is automated policy selection on tabular experiment data. Section 5 describes a case study but no human evaluation of the policies themselves.", + "source": "haiku" + }, + "cost_and_practicality": { + "applies": true, + "answer": false, + "justification": "No inference cost reported (API calls, tokens, latency). No compute budget stated. Paper claims GEARS 'significantly reduces human engineering overhead' (Section 5.1) but doesn't quantify resource usage or time savings. Multi-week vs how long with GEARS is not specified.", + "source": "haiku" + } + } + }, + "claims": [ + { + "claim": "GEARS consistently identifies superior, near-Pareto-efficient policies compared to baseline prompting strategies", + "evidence": "Table 1 shows GEARS nDCG@1=0.94 vs Code-as-Action 0.77, Self-Refine 0.61, CoT 0.68. Evaluated on 100 synthetic policy-selection instructions derived from 20 internal experiments.", + "supported": "moderate" + }, + { + "claim": "Specialized Agent Skills contribute meaningfully to policy selection performance", + "evidence": "Ablation study: GEARS w/o Skill achieves 0.87 nDCG@1 vs full GEARS 0.94 (7pp improvement). However, GEARS w/o Bash performs much worse (0.40), suggesting filtering dominates.", + "supported": "moderate" + }, + { + "claim": "Feature stability validation prevents selection of brittle policies that would fail in production", + "evidence": "Section 4.3 establishes stability baselines (6% drift for 'stable' feature set S) and filters features with drift >15% (binary) or >45% (quantile). Figure 3-4 show one example where filtering removed high-variance candidates, and the selected policy maintained gains over 1 month.", + "supported": "weak" + }, + { + "claim": "GEARS reduces the time required for ranking optimization from multi-week expert-driven process to automated discovery", + "evidence": "Section 5.1 states GEARS 'automated what was previously a multi-week, expert-driven discovery process.' No quantitative time measurements provided.", + "supported": "weak" + }, + { + "claim": "GEARS deployments achieve metric improvements across diverse product surfaces at Meta", + "evidence": "Table 3 reports improvements ranging from 0.011% to 0.37% across 9 surfaces and 3 metrics. No baselines, confidence intervals, or surface/metric identification provided.", + "supported": "weak" + }, + { + "claim": "Tolerance-based frontier expansion surfaces non-convex and near-optimal policies that offer better stability than strict Pareto optimization", + "evidence": "Figure 2 illustrates concept of tolerance bands admitting near-frontier candidates. No quantitative comparison provided (no experiment comparing tolerance-based vs strict Pareto).", + "supported": "unsupported" + } + ], + "methodology_tags": [ + "observational", + "benchmark-eval", + "case-study" + ], + "key_findings": "GEARS is a framework that applies LLM agents to large-scale ranking optimization by decomposing the task into intent translation (converting natural language directives to search specs), policy selection (leveraging prior HTE work to generate candidates), and deterministic validation (filtering policies that violate stability or feature-integrity thresholds). Evaluation on 100 synthetic policy-selection instructions shows GEARS achieves 0.94 nDCG@1 vs 0.77 for the second-best baseline (Code-as-Action). Real-world deployments across unnamed surfaces report metric improvements ranging 0.011–0.37%, though without baselines or confidence intervals.", + "red_flags": [ + { + "flag": "Evaluation on internal data only", + "detail": "All experiments use 20 internal Meta ranking experiments. No external validation, no reproducible benchmark, no evidence of generalization beyond Meta's systems." + }, + { + "flag": "Synthetic ground truth from same source as training", + "detail": "100 synthetic instructions derived from same 20 experiments. Ground truth (top-5 policies) computed algorithmically from the same experiment data. No independent held-out test set or external labeling." + }, + { + "flag": "Ablation results suggest filtering, not reasoning, drives performance", + "detail": "GEARS w/o Bash drops from 0.94 to 0.40 nDCG@1 (54pp). Adding skills contributes only 7pp (0.87→0.94). Suggests deterministic pre-filtering is doing most of the work, not agent reasoning." + }, + { + "flag": "Vague real-world results without statistical rigor", + "detail": "Table 3 shows improvements (0.011%–0.37%) with no error bars, baselines, significance tests, or surface identification. Section 5.1 mentions 'statistically significant lift' but provides no p-value. Figure 4 shows one month of data for one policy." + }, + { + "flag": "No code, data, or reproducibility artifacts", + "detail": "Paper uses proprietary Meta infrastructure (LLMs, internal ranking systems, experiment platform). No code released, no evaluation data available, no prompts provided. Impossible to reproduce or verify results." + }, + { + "flag": "Missing statistical rigor in main evaluation", + "detail": "Table 1 reports point estimates without confidence intervals or significance tests. Table 3 shows percentage improvements without error bars. No sample size justification (why 20 experiments sufficient?). No power analysis." + }, + { + "flag": "Conflict of interest undisclosed", + "detail": "Meta employees evaluating Meta systems on internal data with no external review. No competing-interests statement. Meta benefits directly from positive results about its ranking infrastructure." + }, + { + "flag": "Incomplete evaluation design", + "detail": "No breakdown of results by the 5 instruction types created (Maximize Both, Maximize with Constraint, etc.). No failure-case analysis or discussion of when GEARS selects poor policies." + }, + { + "flag": "Domain-relevant baselines missing", + "detail": "Baselines are general LLM prompting strategies (CoT, Self-Refine). No comparison to traditional HTE methods (S-learner, T-learner, causal forests) or other ranking-domain approaches that are the actual prior work." + }, + { + "flag": "Key terms not formally defined", + "detail": "'Vibe Optimization,' 'near-Pareto-efficient,' and 'Specialized Agent Skills' used throughout but lack precise operational definitions. 'Vibe' is especially vague ('high-level intent')." + } + ], + "cited_papers": [ + { + "title": "Metalearners for estimating heterogeneous treatment effects using machine learning", + "authors": "Künzel et al.", + "year": 2019, + "relevance": "Core HTE methodology (S-learner, T-learner) that GEARS builds on for policy generation via GAS." + }, + { + "title": "Uplift modeling with multiple treatments and general response types", + "authors": "Zhao et al.", + "year": 2017, + "relevance": "Tree-based uplift modeling approach for personalization; prior work GEARS aims to improve upon." + }, + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "authors": "Wei et al.", + "year": 2022, + "relevance": "LLM prompting baseline used in evaluation (CoT method)." + }, + { + "title": "Augmented Language Models: a Survey", + "authors": "Mialon et al.", + "year": 2023, + "relevance": "Survey of tool-integrated reasoning; GEARS positions itself as tool-using LLM agent." + }, + { + "title": "Self-Refine: Iterative Refinement with Self-Feedback", + "authors": "Madaan et al.", + "year": 2023, + "relevance": "LLM self-improvement baseline used in evaluation." + }, + { + "title": "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks", + "authors": "Chen et al.", + "year": 2022, + "relevance": "Early work on LLMs executing code for reasoning; related to Code-as-Action baseline." + }, + { + "title": "Ax: A Platform for Adaptive Experimentation", + "authors": "Bakshy et al.", + "year": 2018, + "relevance": "Prior work on adaptive experimentation platforms; GEARS builds on similar infrastructure." + }, + { + "title": "Recursive Partitioning for Heterogeneous Causal Effects", + "authors": "Athey & Imbens", + "year": 2016, + "relevance": "Causal forests method for HTE; foundational prior work in treatment-effect estimation." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 1, + "justification": "Framework is deployed at Meta, but relies on internal infrastructure (ranking systems, LLM APIs, experiment platforms) unavailable to practitioners. No guidance on implementing GEARS at other organizations." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Finding that deterministic filtering outweighs agent reasoning (Bash ablation) is mildly contrarian, but overall direction (using LLM agents for optimization) is not novel or surprising." + }, + "fear_safety": { + "score": 0, + "justification": "Paper does not address AI safety, alignment, or risks of autonomous agents in production ranking systems. Discusses 'stability' and 'robustness' but not safety in ML safety sense." + }, + "drama_conflict": { + "score": 0, + "justification": "Technical problem-solving without dramatization or controversy. Positions agentic optimization as solution to real engineering bottleneck, but presented matter-of-factly." + }, + "demo_ability": { + "score": 0, + "justification": "All experiments on internal Meta infrastructure. No public code, no demo, no reproducible example. Users cannot try GEARS without access to Meta's systems." + }, + "brand_recognition": { + "score": 2, + "justification": "Paper from Meta (recognizable company) and builds on prior Meta work (GAS, Ax), but the GEARS framework itself is not a well-known brand or widely adopted tool." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47136272", + "title": "Package Managers à la Carte: a formal model of dependency resolution", + "points": 55, + "comments": 17, + "url": "https://news.ycombinator.com/item?id=47136272", + "created_at": "2026-02-24T12:27:44Z" + } + ], + "top_points": 55, + "total_points": 55, + "total_comments": 17 + } +} +\ No newline at end of file diff --git a/papers/decomposed-prompting-modular-2022/scan-v5.json b/papers/decomposed-prompting-modular-2022/scan-v5.json @@ -0,0 +1,539 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Decomposed Prompting: A Modular Approach for Solving Complex Tasks", + "authors": [ + "Tushar Khot", + "Harsh Trivedi", + "Matthew Finlayson", + "Yao Fu", + "Kyle Richardson", + "Peter Clark", + "Ashish Sabharwal" + ], + "year": 2022, + "venue": "International Conference on Learning Representations", + "arxiv_id": "2210.02406", + "doi": "10.48550/arXiv.2210.02406" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims about DECOMP outperforming prior few-shot prompting on symbolic and textual tasks are backed by Figures 7-16 across 8 datasets; modular structure, recursive decomposition, and symbolic integration are all demonstrated empirically.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The 'CoT w/ rollout' ablation uses the identical reasoning procedure as DECOMP but in a monolithic prompt, isolating modularization as the causal factor; alternative decomposition schemes in Appendix E further support robustness.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The title and conclusion claim DECOMP as a general approach for 'complex tasks' but evaluations cover only 8 NLP benchmarks; no explicit discussion of where DECOMP would not generalize.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether improvements stem from higher-quality prompt engineering for DECOMP, greater computation per query, or other confounds beyond modular structure.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Exact Match and Answer F1 are used as direct measures of task correctness and match the granularity of the claims; no conflation of measurement with broader capabilities.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the conclusion paragraph is brief and does not systematically discuss shortcomings.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats are discussed, such as sensitivity to prompt wording choices, benchmark contamination in GPT-3 training, or limited dataset diversity.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what results do not show or which task types DECOMP would be unsuitable for.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgements state: 'This work was supported in part by the National Science Foundation under grants IIS2007290.'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly listed on the title page: Allen Institute for AI, Stony Brook University, and University of Edinburgh.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "NSF is an independent government funding agency with no financial stake in whether DECOMP outperforms CoT prompting.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patent disclosures, or equity declarations appear anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3 formally defines 'decomposer,' 'sub-task handler,' 'prompting program,' and the inference procedure with mathematical notation (P = (f1,Q1,A1),...) and illustrative figures.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it contributes DECOMP, a new modular prompting approach supporting hierarchical decomposition, recursion, and symbolic module integration.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages substantively with CoT, Least-to-Most, Successive Prompting, and Neural Modular Networks, explaining specifically how DECOMP differs from and extends each approach.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Footnote 1 states: 'Datasets, Code and Prompts available at https://github.com/allenai/DecomP.'", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All benchmarks used (HotpotQA, 2WikiMultihopQA, MuSiQue, CommaQA, GSM8K, MultiArith) are standard publicly available datasets.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or software dependency specifications are mentioned; only model names are identified.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "Appendix G reproduces all prompts verbatim across 50+ pages, Section 3.2 describes the inference procedure step-by-step with Figure 3, and code is released on GitHub.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Results are point estimates averaged over 3 prompts; no standard deviations, confidence intervals, or error bars are reported anywhere in the paper.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied despite multiple comparative claims between DECOMP and baselines.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute score differences are reported throughout (e.g., 14-17 pt math QA improvement, EM going from 22.7% to 98% for letter concatenation at N=3).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample sizes (100, 200, 300 examples) are chosen for API cost reasons without power analysis or formal justification.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Results are averaged over 3 prompts but standard deviation is not reported; Appendix D shows per-prompt results without variance statistics.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines: standard prompting, CoT, CoT w/ rollout, Least-to-Most w/ rollout; for open-domain QA also no-context and no-decomposition retrieval baselines.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "CoT (Wei et al., 2022) and Least-to-Most (Zhou et al., 2023) were the leading few-shot prompting approaches at the time of submission.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "'CoT w/ rollout' ablation uses DECOMP's identical reasoning steps in a single prompt to isolate the effect of modularity; Appendix E tests alternative decomposition schemes.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Exact Match is used for symbolic and CommaQA tasks; Answer F1 is used for open-domain QA datasets; task-appropriate metrics throughout.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable; all tasks use standard automated metrics on NLP benchmarks with ground-truth answers.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "For open-domain QA, results are on '300 held-out dev questions in each dataset' separate from the 100-question hyperparameter tuning set; symbolic tasks use separate test sets.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results broken down by dataset (8 datasets), input length (N=3,4,5 for letter concatenation), and decomposition granularity (coarse vs. fine for CommaQA); per-prompt breakdowns in Appendix D.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix F provides explicit error analysis with concrete examples of failure modes for both DECOMP and CoT on letter concatenation and CommaQA tasks.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports DECOMP is only 'comparable' to the retrieval baseline on HotpotQA with Codex, and that performance drops to near-zero for smaller models (curie-001).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model identifiers are given: text-davinci-002, code-davinci-002, davinci-001, text-curie-001, Flan-T5-Large/XL/XXL with parameter counts (0.7B, 3B, 11B).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Appendix G provides all prompts verbatim — decomposer prompts and every sub-task handler prompt for every task — covering 50+ pages of the paper.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Temperature, top-p, and other generation hyperparameters are not reported; only the retrieval count K is described as a tuned hyperparameter.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 3.2 and Figure 3 describe the inference procedure in detail: how the controller iteratively passes inputs/outputs between the decomposer and sub-task handlers until EOQ.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Appendix A describes retrieval corpus creation (430,225 paragraphs for 2WikiMultihopQA, 139,416 for MuSiQue) and CommaQA truncation to fit GPT-3 context limits.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All benchmarks (HotpotQA, 2WikiMultihopQA, MuSiQue, CommaQA, GSM8K, MultiArith) are publicly available; code for generating test examples is released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Symbolic task test examples described (names from popularity lists, 100 examples per condition); open-domain QA corpus construction from train/dev/test paragraphs described in Appendix A.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; all evaluation uses standard NLP benchmarks.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from corpus creation through hyperparameter tuning on a 100-question held-out set to final evaluation on 300 questions is described in Appendix A.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for GPT-3 (text-davinci-002, code-davinci-002) are not stated in the paper.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Potential overlap between GPT-3 training data and benchmark test sets is not discussed anywhere in the paper.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Several benchmarks (HotpotQA 2018, MultiArith 2015) were publicly available before GPT-3's training cutoff; this contamination risk is not acknowledged.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "API costs are acknowledged implicitly (subsampling to 300/200 examples 'due to costs') but no actual cost figures or call counts are reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total compute budget, API call counts, or wall-clock time estimates are provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DECOMP outperforms CoT and Least-to-Most prompting on kth letter concatenation, particularly for longer inputs (N=4,5 words)", + "evidence": "Figure 7: DECOMP achieves 96-98% EM across N=3,4,5 vs. CoT 22.7/12.0/6.0% and L2M 74.7/70.5/66.0%", + "supported": "strong" + }, + { + "claim": "Recursive DECOMP enables length generalization for list reversal far beyond what CoT achieves", + "evidence": "Figure 8: DECOMP achieves 42% EM at N=10 items vs. CoT 4.5%; base CoT 'does not generalize at all to longer sequences'", + "supported": "strong" + }, + { + "claim": "DECOMP outperforms CoT on long-context multi-hop QA (CommaQA-E) including compositional generalization", + "evidence": "Figure 10: DecomP(fine) 64.2% vs. CoT 55% on IID; 59.7% vs. 33.8% on compositional generalization split", + "supported": "strong" + }, + { + "claim": "DECOMP with retrieval (Decomp-Ctxt) outperforms retrieval baselines on open-domain multi-hop QA", + "evidence": "Figure 12: Decomp-Ctxt outperforms NoDecomp-Ctxt on MuSiQue and 2WikiMultihopQA; HotpotQA with Codex is 'comparable' rather than better", + "supported": "moderate" + }, + { + "claim": "DECOMP-based error correction improves CoT math QA by 14-17 points through a targeted answer-extraction sub-task", + "evidence": "Figure 16: GSM8K 36.0→50.7% (+14.7), MultiArith 78.0→95.0% (+17) by adding a GPT-3 answer-extraction sub-module", + "supported": "strong" + }, + { + "claim": "Modular structure itself (not just the reasoning procedure) drives DECOMP's improvements over CoT", + "evidence": "Figure 7: CoT w/ rollout (same reasoning, monolithic) scores 74.7/70.5/66.0% vs. DECOMP 98/96/97% for N=3,4,5; rolled-out reasoning fails without modularity", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "Decomposed Prompting (DECOMP) outperforms Chain-of-Thought and Least-to-Most prompting across symbolic reasoning and multi-hop QA tasks by decomposing complex tasks into modular sub-tasks with dedicated few-shot prompts. The central finding is that separate sub-task prompts are more effective than unrolling the same reasoning steps into a single CoT — demonstrating that modularity itself drives improvements, not just the reasoning procedure. DECOMP uniquely enables recursive decomposition for length generalization on list reversal, hierarchical decomposition for sub-tasks too hard for few-shot prompting, and seamless integration of symbolic systems like ElasticSearch for open-domain QA, and achieves 14-17 point gains on math QA through targeted error-correction post-processing.", + "red_flags": [ + { + "flag": "No confidence intervals or significance tests", + "detail": "All results are point estimates averaged over 3 prompts with no standard deviations or statistical significance testing, making it impossible to assess whether improvements are reliable." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "Several benchmarks (HotpotQA 2018, MultiArith 2015) were publicly available before GPT-3's training cutoff; no discussion of potential contamination." + }, + { + "flag": "Subsampled evaluation due to API costs", + "detail": "GSM8K subsampled to 300 examples and MultiArith to 200 'due to costs with API usage' without power analysis; may reduce result reliability." + }, + { + "flag": "No limitations section", + "detail": "The paper lacks any dedicated discussion of limitations, failure modes beyond error analysis appendix, or conditions under which DECOMP would be expected to underperform." + }, + { + "flag": "Generation hyperparameters unreported", + "detail": "Temperature, top-p, and other generation hyperparameters for GPT-3 API calls are not reported, impeding exact reproduction." + } + ], + "cited_papers": [ + { + "title": "Chain of Thought Prompting Elicits Reasoning in Large Language Models", + "relevance": "Primary baseline and motivation; DECOMP is explicitly designed to overcome CoT's limitations on complex multi-step tasks" + }, + { + "title": "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models", + "relevance": "Closest related work; directly compared as baseline with rollout variant; DECOMP differs by allowing non-linear decomposition structures" + }, + { + "title": "Language Models are Few-Shot Learners (GPT-3)", + "relevance": "Foundation model used throughout experiments; establishes the few-shot in-context learning paradigm DECOMP builds on" + }, + { + "title": "Successive Prompting for Decomposing Complex Questions", + "relevance": "Related decomposition approach; DECOMP extends with diverse and recursive decomposition structures beyond sequential question generation" + }, + { + "title": "PAL: Program-aided Language Models", + "relevance": "Related work on integrating symbolic computation with LLM reasoning; context for DECOMP's symbolic module integration" + }, + { + "title": "MuSiQue: Multi-hop Questions via Single-hop Question Composition", + "relevance": "Key evaluation benchmark for multi-hop open-domain QA" + }, + { + "title": "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering", + "relevance": "Key evaluation benchmark; results show DECOMP comparable but not clearly better than retrieval baseline with Codex" + }, + { + "title": "Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models", + "relevance": "Direct precursor to DECOMP using supervised training for decomposition; DECOMP replaces supervised next-question generator with few-shot LLM" + }, + { + "title": "Training Verifiers to Solve Math Word Problems (GSM8K)", + "relevance": "Math QA benchmark demonstrating DECOMP's error-correction improvement of 14 points" + }, + { + "title": "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)", + "relevance": "Primary model (text-davinci-002) used in most experiments" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Code and all prompts released on GitHub; technique directly usable by any developer with GPT-3 API access; demonstrates improvements on real NLP tasks." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Counterintuitive finding that modular prompts outperform CoT even when CoT uses identical reasoning steps (rollout); modularity matters independently of the reasoning procedure." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; purely a performance improvement paper on NLP benchmarks." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild competitive framing against CoT which was a dominant paradigm at the time; no major controversy." + }, + "demo_ability": { + "score": 3, + "justification": "Code on GitHub, all prompts provided in the paper appendix; can be replicated with GPT-3 API access; worked examples in paper are immediately tryable." + }, + "brand_recognition": { + "score": 2, + "justification": "Allen Institute for AI (AI2) is a well-known NLP research lab; uses GPT-3 (text-davinci-002) which was the flagship model at publication time." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "37816614", + "title": "Language Agent Tree Search Unifies Reasoning Acting and Planning in LMs", + "points": 79, + "comments": 11, + "url": "https://news.ycombinator.com/item?id=37816614", + "created_at": "2023-10-09T03:24:13Z" + }, + { + "hn_id": "25773418", + "title": "Adversarial Grammatical Error Correction", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=25773418", + "created_at": "2021-01-14T07:48:57Z" + }, + { + "hn_id": "33182502", + "title": "Code Librarian: A Software Package Recommendation System", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=33182502", + "created_at": "2022-10-12T20:19:58Z" + }, + { + "hn_id": "39202830", + "title": "Low-Resource Languages Jailbreak GPT-4", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39202830", + "created_at": "2024-01-31T12:11:05Z" + } + ], + "top_points": 79, + "total_points": 85, + "total_comments": 11 + } +} +\ No newline at end of file diff --git a/papers/deep-dive-into-2024-2/scan-v5.json b/papers/deep-dive-into-2024-2/scan-v5.json @@ -0,0 +1,542 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "A Deep Dive into Large Language Models for Automated Bug Localization and Repair", + "authors": [ + "Soneya Binta Hossain", + "Nan Jiang", + "Qiang Zhou", + "Xiaopeng Li", + "Wen-Hao Chiang", + "Yingjun Lyu", + "Hoan Nguyen", + "Omer Tripp" + ], + "year": 2024, + "venue": "Proc. ACM Softw. Eng.", + "arxiv_id": "2404.11595", + "doi": "10.1145/3660773" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims state-of-the-art on CodeXGLUE and Defects4J; Table 1 and Table 3 confirm Toggle (PolyCoder-2.7B 25.07%) exceeds NSEdit (23.86%) and fixes more bugs in Top-10/30/50/100 on Defects4J than any compared method.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims (e.g., 'prompt 4 significantly improves bug fixing accuracy') are tested through controlled ablations in RQ3 using ground-truth bug locations to isolate prompt effects, and RQ5 enables/disables the adjustment module across 16 configurations.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The Threats to Validity section acknowledges results may not generalize beyond the studied datasets, and the Defects4J generalizability test uses only 240 single-hunk Java bugs; findings are generally scoped to the specific benchmarks tested.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper attributes improvements to 'inductive bias' from token-level localization and prompt design but does not discuss alternative explanations such as constrained-generation making the fine-tuning task easier or model pre-training data effects.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly defines exact match (EM) as its primary metric and distinguishes it from BLEU/CodeBLEU; for Defects4J, patch correctness is verified via test execution, and the paper does not conflate EM with real-world utility.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5 is explicitly titled 'THREATS TO VALIDITY' and spans a dedicated paragraph discussing generalization, tooling bugs, and metric validity.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The threats section mentions 'results may not generalize across other datasets' without specifying what properties would limit generalization, and the 'scripts might contain bugs' concern is boilerplate; no quantified or domain-specific threats are identified.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly restricts the Defects4J evaluation to 'single-hunk' bugs (240 bugs) and the fine-tuning models are bounded to specific parameter ranges (110M–2.7B); scope is stated within individual RQ setups.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "There is no acknowledgment section or funding disclosure anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (University of Virginia, Purdue University, Amazon Web Services) are explicitly listed in the author block.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is identified; five of eight authors are Amazon Web Services employees evaluating their own research framework, making funder independence moot but affiliation bias is a concern.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms such as 'token-granulated bug localization,' 'exact match metric,' 'inductive bias,' 'shared prefix/suffix,' and 'single-hunk bugs' are defined or explained contextually with examples and figures.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1.4 explicitly lists four contributions: granularity shift to token-level, four novel prompt designs, adjustment module for tokenizer discrepancies, and comprehensive empirical study.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 4 (Related Work) positions Toggle against specific prior methods (NSEdit, CoText, CURE, KNOD, AlphaRepair, Recoder) with direct performance comparisons; the paper explains how its token-level approach differs from line-level methods in prior LLM-APR work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No Toggle source code release is mentioned anywhere in the paper; base model checkpoints are referenced via Hugging Face but the Toggle framework itself is not released.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All datasets used (CodeXGLUE/Tufano, CodeReviewer, Defects4J, GitHub) are publicly available benchmarks referenced with citations.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only PyTorch and Hugging Face are mentioned; no version numbers, requirements file, or Dockerfile are provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the paper describes the methodology but not how to replicate the experimental setup from scratch.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars appear in any of the results tables (Tables 1–8); only point estimates are reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims, despite numerous comparisons between methods and prompts.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute performance values with baselines are consistently reported (e.g., 25.07% vs 23.86% on Tufano Small), providing enough context to assess effect magnitudes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 240 Defects4J single-hunk bugs and 210 patches per bug is not statistically justified; no power analysis is discussed.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "The paper states experiments were 'repeated several times to confirm consistency' but no variance, standard deviation, or spread across runs is reported in any table.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Baselines include NSEdit, CoText (Table 1) and CURE, RewardRepair, Recoder, KNOD, Tare, AlphaRepair, TENURE (Table 3).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include papers from 2021–2023 (NSEdit 2022, KNOD 2023, Tare 2023, AlphaRepair 2022), which are competitive and recent relative to the 2024 submission.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "RQ3 ablates across four prompt designs; RQ5 ablates the adjustment module enabled vs disabled across 4 models × 4 datasets; RQ4 ablates the effect of contextual information.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The paper uses exact match (EM) for CodeXGLUE/CodeReviewer and Top-K (K=10,30,50,100,200) metrics for Defects4J, plus start/end token accuracy for localization.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation of system outputs is not relevant to this task; patch correctness is verified via automated exact match and test execution on Defects4J.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Datasets are explicitly split 80/10/10 into training, validation, and test sets; Defects4J is kept entirely held-out from fine-tuning.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by dataset (Tufano Small, Tufano Medium, CodeReviewer w/o comment, CodeReviewer w/ comment) and by model backbone for all major experiments.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Figure 7 explicitly shows a failure case where correct bug location still produces incorrect fix, and RQ6 discusses conditions under which prompt 4 underperforms prompt 3.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "CodeGPT underperforms on multilingual datasets due to Java-only pretraining; prompt 4 underperforms prompt 3 on Tufano datasets; smaller models don't benefit as much from the adjustment module.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Models are specified with parameter counts and sources: CodeGPT-110M, CodeParrot-110M, CodeGen-350M, CodeGen-2B, PolyCoder-400M, PolyCoder-2.7B, CodeT5-large (347M); Hugging Face references are provided.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "All four prompts are illustrated in Figure 5 with concrete code examples showing the exact format including separator tokens and truncation strategy.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No learning rates, batch sizes, number of epochs, or optimizer settings are reported for any of the fine-tuning experiments.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 2.3 describes the Toggle framework architecture in detail including the localization model (CodeT5 encoder with attention-based prediction), four prompt designs, and adjustment module (CodeT5 encoder with FC layer).", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The GitHub dataset preprocessing is documented (commit filtering by keywords, AST-based Defects4J deduplication); train/validation/test splits of 80/10/10 are stated; adjustment module training data collection procedure is described.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All benchmarks used (CodeXGLUE/Tufano, CodeReviewer, Defects4J) are publicly available; citations and URLs are provided for access.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The GitHub dataset curation is described (commit message keywords, single-statement patches, AST-based Defects4J exclusion); the other datasets reference published papers describing their collection.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; all data is from public code repositories and established benchmarks.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from dataset splits through fine-tuning to patch generation and evaluation is described, including the 7-shift range used for adjustment module training data.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No pre-training cutoff dates are stated for any of the six base LLMs (CodeGPT, CodeParrot, CodeGen, PolyCoder, CodeT5) despite their pre-training corpora potentially overlapping with public benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly excludes Defects4J samples from the GitHub fine-tuning dataset via AST comparison to prevent data leakage into the held-out generalization test.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The pre-trained base models (CodeParrot, CodeGen, etc.) were trained on large code corpora that likely include CodeXGLUE and Defects4J data; this pre-training contamination risk is never discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The paper mentions 'resource-intensive nature of larger models' as justification for testing only smaller models in some RQs, but no actual inference cost, latency, or GPU-hours are reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget (GPU hours, cloud costs, hardware used) is stated anywhere in the paper.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Toggle achieves new state-of-the-art on CodeXGLUE code refinement benchmark (Tufano Small 25.07%, Tufano Medium 16.19%)", + "evidence": "Table 1 shows PolyCoder-2.7B at 25.07% vs NSEdit's 23.86% on Tufano Small and 16.19% vs CoText's 15.36% on Tufano Medium", + "supported": "strong" + }, + { + "claim": "Toggle outperforms all compared APR methods on Defects4J in Top-10, Top-30, Top-50, and Top-100 metrics", + "evidence": "Table 3 shows Toggle fixes 41 bugs in Top-10 vs next-best 36 (Recoder), 58 vs 51, 64 vs 62, 74 vs 70 respectively", + "supported": "strong" + }, + { + "claim": "Larger LLMs yield better bug fixing accuracy after fine-tuning with Toggle prompts", + "evidence": "Table 1 consistently shows larger models outperform smaller ones (e.g., CodeGen-2B 24.73% vs CodeGen-350M 23.19% on Tufano Small)", + "supported": "strong" + }, + { + "claim": "Token-granulated prompts (3 and 4) significantly outperform standard prompting (prompt 1) for bug fixing", + "evidence": "Table 4 shows CodeGPT-110M improves from 16.07% (prompt 1) to 56.98% (prompt 4) on Tufano Small using ground-truth bug locations", + "supported": "strong" + }, + { + "claim": "Contextual information (buggy line numbers, code review comments) significantly improves bug localization accuracy", + "evidence": "Table 5 shows starting token accuracy for Tufano Small improves from 39.07% to 60.37% (+21%) with buggy line numbers", + "supported": "strong" + }, + { + "claim": "The adjustment module consistently improves bug fixing accuracy across all models and datasets", + "evidence": "Table 6 shows improvement in all 16 configurations, e.g., CodeParrot-110M on Tufano Small improves from 21.78% to 23.51%", + "supported": "moderate" + }, + { + "claim": "Prompt 4 outperforms prompt 3 only when both start and end token locations are highly accurate", + "evidence": "Table 8 shows prompt 3 superior on Tufano datasets but prompt 4 superior on CodeReviewer datasets where partial location accuracy is higher (65.76% vs 53.23%)", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "Toggle introduces token-granulated bug localization and repair, demonstrating that preventing LLMs from regenerating non-buggy shared prefix/suffix significantly improves accuracy (prompt 1 to prompt 4: 16.07% to 56.98% for CodeGPT on Tufano Small). The system achieves state-of-the-art on CodeXGLUE code refinement and outperforms all compared methods on Defects4J in Top-10 through Top-100 metrics using only 110M parameter models and 210 generated patches. Contextual information (line numbers, code review comments) improves localization accuracy by 20-30 percentage points. The choice between prompts 3 and 4 with predicted locations is dataset-dependent, with prompt 4 winning when partial location accuracy is high and additional context is available.", + "red_flags": [ + { + "flag": "No statistical significance tests", + "detail": "All comparisons between Toggle and baselines, and between prompt configurations, are made without any statistical significance testing despite many tables of numerical comparisons." + }, + { + "flag": "No variance or confidence intervals", + "detail": "Despite claiming experiments were repeated multiple times, no standard deviation, confidence intervals, or error bars are reported for any results." + }, + { + "flag": "No hyperparameters reported", + "detail": "Learning rates, batch sizes, number of epochs, and optimizer configurations for all fine-tuning experiments are absent, making reproduction impossible." + }, + { + "flag": "No code release", + "detail": "The Toggle framework is not released; only the public base model checkpoints are referenced, preventing independent verification of results." + }, + { + "flag": "Pre-training contamination unaddressed", + "detail": "Base LLMs (CodeGPT, CodeParrot, CodeGen, PolyCoder, CodeT5) were trained on large code corpora that likely include CodeXGLUE and Defects4J benchmarks; this contamination risk is never discussed." + }, + { + "flag": "Asymmetric patch count in Defects4J comparison", + "detail": "Toggle generates 210 patches per bug (Top-100 is primary comparison), while competing methods (Tare, AlphaRepair, TENURE) generate 500+ patches and only their Top-500+ results are reported, making Top-100 comparisons potentially favorable to Toggle." + }, + { + "flag": "No funding disclosure", + "detail": "Five of eight authors are Amazon Web Services employees; no funding source or competing interests are disclosed." + } + ], + "cited_papers": [ + { + "title": "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation", + "relevance": "Primary benchmark for evaluation; Toggle achieves state-of-the-art on its code refinement tasks" + }, + { + "title": "CodeReviewer: Pre-Training for Automating Code Review Activities", + "relevance": "Provides dataset and CodeT5 baseline for code review-guided bug fixing experiments" + }, + { + "title": "Defects4J: A Database of existing faults to enable controlled testing studies for Java programs", + "relevance": "Primary generalizability benchmark; 835 real-world Java bugs used for out-of-distribution evaluation" + }, + { + "title": "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation", + "relevance": "Backbone model for bug localization; used as both a baseline and the encoder in Toggle's localization module" + }, + { + "title": "CURE: Code-Aware Neural Machine Translation for Automatic Program Repair", + "relevance": "Key APR baseline compared on Defects4J (18 bugs in Top-10 vs Toggle's 41)" + }, + { + "title": "KNOD: Domain Knowledge Distilled Tree Decoder for Automated Program Repair", + "relevance": "Strong APR baseline using tree-based decoding; compared on Defects4J across all Top-K metrics" + }, + { + "title": "Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning", + "relevance": "AlphaRepair baseline demonstrating LLMs used for APR without fine-tuning; contextualizes Toggle's fine-tuning approach" + }, + { + "title": "Impact of Code Language Models on Automated Program Repair", + "relevance": "Prior work on LLM-based APR that Toggle directly builds on and improves over" + }, + { + "title": "Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar", + "relevance": "NSEdit — primary baseline for CodeXGLUE leaderboard comparison; Toggle surpasses it on all Tufano datasets" + }, + { + "title": "An empirical study on learning bug-fixing patches in the wild via neural machine translation", + "relevance": "Source of Tufano Small/Medium datasets used as primary fine-tuning and evaluation benchmarks" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "APR is directly useful to developers; Toggle is a concrete working system tested on real bug benchmarks, though it requires fine-tuning and infrastructure to deploy." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Token-level vs line-level localization is a novel framing but the performance improvements are expected given the design rationale." + }, + "fear_safety": { + "score": 0, + "justification": "No AI risk or safety concerns raised; the paper is purely about automated software engineering." + }, + "drama_conflict": { + "score": 0, + "justification": "Standard benchmark competition paper with no controversy or conflict angle." + }, + "demo_ability": { + "score": 1, + "justification": "The framework is described in detail but no public demo or code is released, limiting hands-on accessibility." + }, + "brand_recognition": { + "score": 1, + "justification": "Amazon Web Services affiliation for five authors adds some recognition, but this is not a top-name lab publication; published at FSE which is a respected venue." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "40205264", + "title": "Urban highways are barriers to social ties", + "points": 6, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40205264" + }, + { + "hn_id": "41103162", + "title": "Beyond Deepfake Images: Detecting AI-Generated Videos [pdf]", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41103162" + }, + { + "hn_id": "40165320", + "title": "Generation of Low-Inclination, Neptune-Crossing TNOs by Planet Nine", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40165320" + } + ], + "top_points": 6, + "total_points": 11, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/deep-dive-into-2024/scan-v5.json b/papers/deep-dive-into-2024/scan-v5.json @@ -0,0 +1,540 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "A Deep Dive Into Large Language Model Code Generation Mistakes: What and Why?", + "authors": [ + "QiHong Chen", + "Jiachen Yu", + "Jiawei Li", + "Jiecheng Deng", + "Justin Tian Jin Chen", + "Iftekhar Ahmed" + ], + "year": 2024, + "venue": "arXiv.org", + "arxiv_id": "2411.01414", + "doi": "10.48550/arXiv.2411.01414" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four abstract claims are supported: 17 mistake types in Table 1, 10 newly identified; 6 reasons in Section 5.2; GPT-4 mistake identification precision ~0.96; ReAct F1=0.78 in Table 2.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper makes causal claims that specific prompt features cause mistakes, and validates them by modifying the causative factor (rephrasing, repositioning) and checking whether regenerated code passes tests — a reasonable intervention design for this context.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 7 explicitly bounds findings to Python and Java, two specific LLMs, and two specific benchmarks, acknowledging results may not generalize to other languages, benchmarks, or LLMs.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper proposes 6 reasons for mistakes and validates them via intervention, but does not systematically discuss alternative explanations for why these factors cause errors or whether multiple reasons may interact.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper uses test case pass/fail as a proxy for code correctness and explicitly acknowledges in Section 7 that 'test cases might not be comprehensive,' distinguishing measured outcomes from broader correctness.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Threats to Validity' has three dedicated subsections: Construct validity, Internal validity, and External validity.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats named include: prompt design influence, incomplete test coverage, manual examination bias, non-exhaustive reason identification, and limitation to Java/Python only — these go beyond generic disclaimers.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit scope boundaries stated: two LLMs (GPT-4, Qwen2.5-Coder), two programming languages (Python, Java), two datasets (HumanEval-X, MBXP), and non-syntactic mistakes only.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment or disclosure appears anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All six authors' university affiliations are disclosed on the title page (UCI, UIUC, UCR).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so independence of funder from outcome cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3 defines 'non-syntactic mistakes' precisely (two categories: runtime errors and functional failures), and all 17 mistake types plus 3 severity levels (FADE, PADE, DADE) are formally defined.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The introduction lists 4 explicit contributions: a derived list of mistakes, a derived list of reasons, a 202-instance benchmark, and an empirical investigation of LLM auto-identification.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages substantively with Fan et al., Song et al., Tambon et al., and others, explicitly contrasting this study's scope (more data, newer models, two languages, causal analysis) with prior limitations.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "A replication package is linked at https://figshare.com/s/10e27d42bf537f6321f7, referenced repeatedly throughout the paper for code, prompts, and results.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "HumanEval-X and MBXP are standard public benchmarks; the 202-instance reason-identification benchmark is included in the replication package.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or explicit dependency list is mentioned; tools like ast, javalang, BeautifulSoup, and all-mpnet-base-v2 are referenced but no environment spec is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper refers to the replication package for details but provides no step-by-step reproduction instructions within the paper itself.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results (precision, coverage rate, F1) are reported as single point estimates with no confidence intervals or error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied when comparing prompt approaches (Base vs Advanced vs Advanced+ReAct) or when comparing GPT-4 to human evaluators.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Actual metric values (F1, precision, coverage rate) are reported for all comparisons in Table 2, conveying the magnitude of differences between approaches.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 202-instance benchmark size is not justified by power analysis; dataset sizes derive from the chosen benchmarks, not from a principled sample size determination.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Temperature is set to 0 for determinism, and no variance across runs is reported for any result.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "The Base Prompt serves as a baseline for reason identification in RQ3, with Advanced Prompt and Advanced+ReAct as progressively enhanced conditions.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "GPT-4 and Qwen2.5-Coder are both contemporary, top-performing models at time of study; no weak or stale baselines are used.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The three prompting conditions (Base, Advanced, Advanced+ReAct) constitute an ablation of increasing prompt complexity for reason identification.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Precision, Coverage Rate (CR), and F1 score are all used for evaluation in RQ3; Table 1 also reports severity frequency distributions.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Human evaluators (all authors, each with 5+ years of Python/Java experience) independently reviewed and labeled mistakes and reasons using open coding and negotiated agreement, providing the gold standard.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "A 202-instance benchmark is constructed from the full analysis and used as a held-out evaluation set for assessing GPT-4 reason identification performance in RQ3.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 1 breaks down 17 mistake types by category with severity frequencies; Table 2 breaks down F1 scores per reason across all three prompting approaches.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses that positional sensitivity is the hardest reason for GPT-4 to identify (Base F1=0.25) and attributes this to limitations in attention-related reasoning.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Positional sensitivity identification failure (F1=0.25 for base prompt) is explicitly reported as a negative result, and the paper calls for future work to address it.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions are specified: GPT-4-0125-preview and qwen2.5-coder-14b-instruct.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Prompts are described at a high level but not reproduced in the paper; the paper repeatedly defers to the replication package for full prompt text.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature=0 for code generation (determinism), temperature=0.5 for paraphrasing and ambiguity checking are explicitly reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The ReAct scaffolding is described in detail including all three tools (Function Call Analysis, Function Signature Explainer, Coding Question Specification Ambiguity Check) with implementation specifics.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Data pipeline is documented: prompt LLMs → run test cases → filter syntactic mistakes → apply APR (CHATREPAIR) → validate via Jaccard similarity → collect failures for analysis.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The replication package on figshare contains the raw LLM-generated codes with non-syntactic mistakes and associated test failure information.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3 describes the full data collection: prompting procedure, test execution, syntactic filtering, APR repair, and final dataset composition.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participant recruitment; authors serve as annotators and are not recruited subjects.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from dataset selection through LLM prompting, test execution, APR repair, Jaccard validation, manual annotation, and evaluation is documented across Sections 3–5.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No training data cutoff is stated for GPT-4-0125-preview or Qwen2.5-Coder despite evaluating them on public benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether HumanEval-X or MBXP problems appeared in either model's training data, which is a significant omission for capability evaluation.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "HumanEval-X and MBXP are widely-used public benchmarks predating both models; potential contamination is not acknowledged or addressed anywhere in the paper.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects study; authors are annotators, not participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; IRB not applicable.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants; demographics not applicable.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants; inclusion/exclusion criteria not applicable.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human experimental study; randomization not applicable.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human experimental study; blinding not applicable.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants; attrition not applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost or latency is reported for running GPT-4 on 2,268 coding problems or for the reason identification experiments.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget is stated anywhere in the paper.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "LLMs make 17 types of non-syntactic mistakes, 10 of which were overlooked by prior studies.", + "evidence": "Table 1 enumerates all 17 types; highlighted types are the 10 new ones; comparison to Fan et al., Song et al., and Tambon et al. is made in related work and Section 4.2.", + "supported": "moderate" + }, + { + "claim": "Misleading Coding Question Specification is the most common cause of mistakes, accounting for 56.19% of cases.", + "evidence": "Section 5.2 reports this figure from the reason analysis; validated by paraphrasing prompts and re-testing.", + "supported": "moderate" + }, + { + "claim": "GPT-4 identifies non-syntactic mistakes with precision ~0.96 and coverage rate ~0.94, comparable to human evaluators.", + "evidence": "Section 6.2.1 reports precision 0.97 (HumanEval-X) / 0.95 (MBXP) and CR 0.94 for GPT-4 vs precision 1.0 and CR 0.98-0.99 for humans.", + "supported": "strong" + }, + { + "claim": "GPT-4 using ReAct achieves F1=0.78 for identifying reasons behind LLM code generation mistakes.", + "evidence": "Table 2 shows Advanced Prompt+ReAct average F1=0.78 across 6 reason categories, up from 0.64 (Base) and 0.73 (Advanced).", + "supported": "strong" + }, + { + "claim": "Positional sensitivity in prompts causes LLMs to miss conditions, and simply repositioning information in the prompt can fix these mistakes.", + "evidence": "Section 5.2 provides Figure 3(d) showing a concrete example where repositioning the 'y as vowel at end' rule corrected the LLM output; verified across 4.12% of mistakes.", + "supported": "moderate" + }, + { + "claim": "LLMs make math knowledge errors (14.24% of mistakes) that stem from incorrect knowledge learned during training.", + "evidence": "Section 5.2 item 3 (ITK, 5.71%) and IMKE (14.24% in Table 1) with Figure 1(b) showing a concrete variance formula error; causal attribution relies on cross-language comparison.", + "supported": "weak" + } + ], + "methodology_tags": [ + "qualitative", + "benchmark-eval", + "case-study" + ], + "key_findings": "The paper identifies 17 types of non-syntactic mistakes in code generated by GPT-4 and Qwen2.5-Coder across HumanEval-X and MBXP datasets, with 10 types previously unreported in the literature. The dominant cause of mistakes is misleading prompt specifications (56.19%), followed by poor input-output demonstrations (21.26%). GPT-4 can automatically identify these mistakes with high precision (~0.96) and coverage (~0.94), and can identify underlying reasons with F1=0.78 using ReAct prompting, though positional sensitivity remains a challenging case (F1=0.25 without augmentation).", + "red_flags": [ + { + "flag": "No significance tests", + "detail": "All comparisons between prompting approaches (Base vs Advanced vs ReAct) and between GPT-4 and human evaluators are made without statistical significance tests, making it impossible to determine whether differences are meaningful." + }, + { + "flag": "Contamination not addressed", + "detail": "HumanEval-X and MBXP are widely-used public benchmarks that almost certainly appear in GPT-4 and Qwen2.5-Coder training data; this is not acknowledged, which may inflate the apparent correctness of LLM outputs." + }, + { + "flag": "Annotator conflict of interest", + "detail": "The same authors who identified and labeled the reasons in RQ2 also evaluated whether GPT-4 correctly identified those same reasons in RQ3, creating circular validation risk despite use of negotiated agreement." + }, + { + "flag": "No confidence intervals", + "detail": "All metrics (precision, coverage rate, F1) are single point estimates with no uncertainty quantification, limiting interpretability of findings." + }, + { + "flag": "Paper format anomaly", + "detail": "The ACM reference format shows year 2018 and placeholder DOI, suggesting the paper is a preprint not yet formally published, but this is not clearly disclosed." + } + ], + "cited_papers": [ + { + "title": "An Empirical Study of Code Generation Errors made by Large Language Models", + "relevance": "Direct prior work this paper extends; identified 7 syntactic and non-syntactic mistake categories on HumanEval using ChatGPT." + }, + { + "title": "Bugs in large language models generated code: An empirical study", + "relevance": "Prior work identifying 10 bug categories from LLM-generated code on CoderEval; directly compared against in this paper." + }, + { + "title": "Automated repair of programs from large language models", + "relevance": "Fan et al. 2023 — identified 4 syntactic mistake categories; a key baseline for comparison in this study." + }, + { + "title": "Large language models and simple, stupid bugs", + "relevance": "Attributes LLM code errors to training data quality issues; one of the foundational hypotheses tested in RQ2." + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "The prompting technique used in RQ3 for automated reason identification; achieves the best F1=0.78." + }, + { + "title": "Multi-lingual evaluation of code generation models", + "relevance": "MBXP dataset used in this study — 1,940 multilingual coding questions across Python and Java." + }, + { + "title": "CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X", + "relevance": "HumanEval-X dataset used in this study — 328 coding problems in Python and Java with test cases." + }, + { + "title": "LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation", + "relevance": "Related work on categorizing LLM code mistakes; Zhang et al. identified 8 mistake types on CoderEval-generated code." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly actionable for developers: rephrasing prompts, adding edge case examples, and repositioning key instructions are all concrete techniques practitioners can apply." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Misleading prompt wording causing 56% of mistakes is a notable finding, but the overall framing (LLMs make predictable errors) is not surprising." + }, + "fear_safety": { + "score": 1, + "justification": "Paper notes that incorrect LLM-generated code used in production (e.g., Google) poses software quality risks, but this is framed gently rather than alarming." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy or conflict angle; paper is a straightforward empirical classification study." + }, + "demo_ability": { + "score": 1, + "justification": "Replication package is publicly available on figshare, enabling reproduction, but no interactive demo is provided." + }, + "brand_recognition": { + "score": 1, + "justification": "Uses GPT-4 (OpenAI) and Qwen2.5-Coder (Alibaba), well-known models, but no famous lab affiliation among authors." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42307849", + "title": "\"Oh, shit I opened the document \": Suspicious Mail in VR Headsets[pdf]", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=42307849", + "created_at": "2024-12-03T16:22:05Z" + }, + { + "hn_id": "40263764", + "title": "A scalable approach to network reconstruction", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40263764", + "created_at": "2024-05-05T10:37:34Z" + }, + { + "hn_id": "42465432", + "title": "Glider: Small model beats GPT on eval tasks", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42465432", + "created_at": "2024-12-19T20:33:09Z" + }, + { + "hn_id": "38873897", + "title": "Static Deadlock Detection for Rust Programs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38873897", + "created_at": "2024-01-04T23:55:09Z" + }, + { + "hn_id": "38870705", + "title": "Scalable network reconstruction in subquadratic time", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38870705", + "created_at": "2024-01-04T18:48:09Z" + } + ], + "top_points": 2, + "total_points": 8, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/deep-dive-into-2025/scan-v5.json b/papers/deep-dive-into-2025/scan-v5.json @@ -0,0 +1,570 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat", + "authors": [ + "Zezhou Yang", + "Ting Peng", + "Cuiyun Gao", + "Chaozheng Wang", + "Hailiang Huang" + ], + "year": 2025, + "venue": "IEEE International Conference on Software Maintenance and Evolution", + "arxiv_id": "2507.18515", + "doi": "10.1109/ICSME64153.2025.00062" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four main abstract claims (RAG effectiveness in closed-source repos, similarity-based superiority, BM25/GTE-Qwen best individually, hybrid optimal) are quantitatively supported by Tables I–III across 26 LLMs.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about RAG improving code completion are supported by direct base-model vs RAG comparisons; ablation-style comparisons systematically isolate retrieval technique contributions across Tables I–III.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Conclusions recommend RAG configurations for 'practitioners in proprietary development environments' broadly, but the study is limited to one company's C++ codebase; the threats-to-validity section acknowledges but does not adequately bound this generalization.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider alternative explanations such as whether the manually annotated benchmark selection favors similarity-based retrieval, or whether C++ specifically benefits differently from RAG than other languages.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly acknowledges in threats to validity that CodeBLEU and Edit Similarity 'might not fully capture the semantic correctness and functionality of generated code' and supplements with a developer survey to address this gap.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section V.C 'Threats to Validity' covers internal, external, and construct validity as a dedicated subsection — well beyond a passing sentence in the conclusion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Threats are specific: internal validity identifies parameter sensitivity; external validity names the single-organization codebase limitation and cites 1,669 diverse projects as partial mitigation; construct validity identifies the metric-quality gap and explains how the developer survey addresses it.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what the results do NOT show; the threats section describes limitations but never draws explicit lines around what conclusions cannot be drawn from a single C++ enterprise codebase.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed in a footnote: National Key R&D Program of China (2022YFB3103900), NSFC (62472126), Natural Science Foundation of Guangdong Province, and Shenzhen-Hong Kong and Shenzhen Basic Research projects.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the title page: four authors at Tencent and two at The Chinese University of Hong Kong.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "All disclosed funders are government/academic bodies (NSFC, Guangdong provincial government, Shenzhen municipal) with no financial stake in whether RAG works well for WeChat's code completion.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests is provided; Tencent employees are evaluating RAG methods on Tencent's own production codebase — an implicit institutional conflict that is not formally declared.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are formally defined: 'identifier-based RAG' and 'similarity-based RAG' are defined with equations in Section II; each retrieval technique (BM25, CodeBERT, UniXcoder, CoCoSoDa, GTE-Qwen) is described with technical detail and citations.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Four explicit contributions are listed: systematic study of RAG for closed-source code completion, a fine-grained preprocessing algorithm, finding of complementary retrieval techniques, and developer survey validation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper engages with prior RAG code completion work (REPOFUSE, ReACC, GraphCoder, FT2Ra) throughout the text and in Section VI, explicitly distinguishing its closed-source focus from prior open-source benchmark studies.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No source code is released; the preprocessing algorithm and retrieval system are described but exist as proprietary Tencent infrastructure with no repository link provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Both the 100-example evaluation benchmark and the 1,669-repository retrieval corpus are proprietary WeChat internal data that cannot be released publicly.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": true, + "justification": "Hardware (8×A100 40GB or 8/16×H20 96GB by model size), framework (vLLM in Docker), precision (FP16/FP8), temperature (0), retrieval top-k (4), and 2k-token context limit are all specified.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; reproduction is impossible without access to the proprietary benchmark and retrieval corpus, and the paper provides no public artifact to start from.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results are reported as point estimates (CB/ES scores in tables) with no confidence intervals or error bars, despite comparing dozens of conditions across 26 models on a 100-example benchmark.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used for any comparative claims despite the paper asserting superiority of specific retrieval methods — all claims of 'better' or 'superior' rely on raw score differences.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Relative percentage improvements are consistently reported with baseline context (e.g., '71.60% and 27.59% relative increase' for Qwen2.5-Coder-14B-Instruct with GTE-Qwen RAG), giving interpretable effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The benchmark size of 100 examples is not statistically justified; the paper explains the annotation process but provides no power analysis or reasoning for why 100 examples provides adequate statistical sensitivity.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "All results are single-run point estimates; no variance, standard deviation, or spread across runs is reported despite using stochastic generation (temperature=0 reduces but does not eliminate variance).", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Base models without any RAG augmentation are included as baselines in Table I for all 26 LLMs, with all RAG variants compared directly against the base model.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines and comparisons include state-of-the-art late-2024 models: DeepSeek-V3 (671B), Qwen2.5-Coder-32B-Instruct, and Llama-3.3-70B-Instruct.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The study systematically ablates similarity-based RAG components by comparing five individual retrieval techniques and all pairwise combinations of lexical+semantic techniques in Table III across 26 LLMs.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Two complementary metrics are used: CodeBLEU (structural/semantic code similarity) and Edit Similarity (token-level edit distance normalized by length).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "A developer survey with 3 internal developers evaluated 52 randomly selected examples across 3 LLMs using a 1–5 quality scale, with error type categorization supplementing automated metrics.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The 100-example evaluation benchmark is constructed separately from the 1,669-repository retrieval corpus, functioning as a proper held-out test set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by model size category (0.5B through 200B+) across all tables; the benchmark also covers 7 domain categories with easy/hard difficulty splits shown in Figure 1.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Developer survey identifies three error categories with frequencies: Missing/Incorrect Logic (~52%), Extra Logic (~30%), Nonexistent Function Call (~17%), analyzed across three LLMs.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that hybrid retrieval shows 'limited or even negative impact' for models below 7B, and Table I shows CodeLlama-70B performing worse than its base model with most RAG configurations.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact versioned model names are specified (e.g., Qwen2.5-Coder-14B-Instruct, GTE-Qwen2-1.5B-instruct, DeepSeek-V3-671B/37B) obtained from official Hugging Face repositories.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The paper describes four prompt templates for identifier-based RAG and mentions prompts in Chinese wrapped in C++ comment format, but no actual prompt text is provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature (0), number of retrieved results (4), maximum context length (2k tokens), BM25 parameters k and b (defined in equations 10–11), and model precision (FP16/FP8) are all specified.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Identifier-based RAG scaffolding (index creation, LLM-based identifier extraction, four distinct prompt templates per knowledge type) is described with formal equations; similarity-based RAG pipeline is also formalized.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 provides detailed pseudocode for the preprocessing pipeline covering C++ source/header files, protobuf files, macro transformations, and deduplication/formatting steps.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Neither the 100-example evaluation benchmark nor the 1,669-repository retrieval corpus is publicly available; all data is proprietary WeChat/Tencent internal material.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Benchmark construction is described in detail (3 senior developers with 5+ years experience, 3 weeks, 4 annotation rules, 7 domains, cross-validation); retrieval corpus collection (1,669 internal projects, deduplication, standardization) is also described.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Developer survey participants are described only as 'three developers from our group (excluding the authors)' with no formal recruitment criteria, sampling rationale, or qualification criteria.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 documents the full data pipeline from raw C++ and protobuf files through extraction, macro transformation, formatting, and corpus construction; retrieval and inference pipelines are also formalized.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for the 26 evaluated LLMs are not stated anywhere in the paper.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "While contamination is implicitly reduced by using proprietary internal code, the paper does not explicitly discuss train/test overlap or argue why the benchmark cannot appear in any model's training data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The paper does not address whether public LLMs may have seen portions of WeChat's codebase through any public Tencent repositories or data leaks during pretraining.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "The developer survey is not pre-registered.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "No IRB or ethics approval is mentioned for the developer survey despite it involving human participant evaluations published in an academic venue.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": false, + "justification": "No demographic information is reported for the 3 survey participants beyond being from 'our group' and not among the paper's authors.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "The only stated criterion is 'excluding the authors'; no formal inclusion/exclusion criteria (experience level, role, familiarity with the codebase) are described.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": true, + "justification": "The paper states 'a random selection of 52 examples' was used for the developer survey evaluation.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding procedure is described; developers evaluated completions with knowledge of the retrieval technique source, introducing potential bias.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "NA — the developer survey involved 3 fixed internal participants completing a predefined evaluation set; attrition was not applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference latency or cost figures are reported; hardware is described but no timing measurements or cost estimates are provided for any of the 26 models.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware configurations are described (8 A100s, 16 H20s) but total GPU-hours, wall-clock time, or financial cost of running experiments across 26 LLMs and 9 retrieval configurations is not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Both identifier-based and similarity-based RAG consistently improve code completion over base models across all 26 LLMs tested.", + "evidence": "Table I shows improvements highlighted across the majority of model/method combinations; e.g., Llama-3.1-8B-Instruct improves from CB/ES 34.02/46.07 to 53.47/55.40 with GTE-Qwen RAG.", + "supported": "strong" + }, + { + "claim": "Similarity-based RAG substantially outperforms identifier-based RAG for code completion in closed-source repositories.", + "evidence": "Table I shows consistent large margins: Qwen2.5-Coder-1.5B reaches max CB/ES 37.28/50.77 with identifier-based vs 46.69/56.04 with similarity-based; DeepSeek-V3 reaches 42.24/61.75 vs 60.28/73.11.", + "supported": "strong" + }, + { + "claim": "BM25 and GTE-Qwen achieve superior performance among retrieval techniques, with GTE-Qwen uniquely performing better with incomplete code context queries.", + "evidence": "Table II shows BM25 and GTE-Qwen consistently outperform CodeBERT, UniXcoder, and CoCoSoDa; GTE-Qwen is the only technique where incomplete queries outperform complete queries for large models.", + "supported": "strong" + }, + { + "claim": "Lexical and semantic retrieval capture fundamentally different aspects of code similarity, with minimal overlap in retrieved results.", + "evidence": "Out of 100 test examples, there are 76, 74, and 64 completely distinct retrieved samples comparing BM25 with UniXcoder, CoCoSoDa, and GTE-Qwen respectively.", + "supported": "strong" + }, + { + "claim": "Combining BM25 and GTE-Qwen achieves optimal code completion performance, especially for larger models (7B+), but hurts smaller models.", + "evidence": "Table III shows BM25+GTE-Qwen reaches CB/ES 63.62/75.26 for DeepSeek-V3 (vs 60.28/73.11 alone); paper explicitly notes 'limited or even negative impact' for sub-7B models.", + "supported": "moderate" + }, + { + "claim": "Developer survey confirms BM25+GTE-Qwen combined retrieval produces higher quality completions than either technique alone.", + "evidence": "3-developer survey on 52 examples shows combined technique achieves higher average scores and wins in about half of test cases; but n=3 evaluators is far too small for reliable inference.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "RAG methods consistently improve code completion in WeChat's large-scale proprietary C++ codebase across all 26 tested LLMs (0.5B–671B parameters), with similarity-based RAG substantially outperforming identifier-based RAG. Among retrieval techniques, BM25 and GTE-Qwen individually achieve best performance, with GTE-Qwen's bidirectional architecture uniquely suited to incomplete code queries (the code completion scenario). The combination of BM25+GTE-Qwen achieves optimal results for models 7B and larger by exploiting complementary retrieval distributions (64–76% non-overlapping results), while smaller models do not reliably benefit from hybrid retrieval.", + "red_flags": [ + { + "flag": "Tiny benchmark (n=100)", + "detail": "Only 100 examples from a single company's codebase provide insufficient statistical power to support claims of superiority across 26 LLMs and 9 retrieval configurations; no sample size justification or power analysis is provided." + }, + { + "flag": "Minimal developer survey (n=3)", + "detail": "Only 3 internal developers participated in the human evaluation study; results from such a small N cannot reliably support conclusions about developer preference across retrieval techniques." + }, + { + "flag": "No statistical significance testing", + "detail": "All comparative claims (X outperforms Y, combined is better) are made without any statistical tests despite dozens of pairwise comparisons across 26 models on a 100-example benchmark." + }, + { + "flag": "Single run, no variance reported", + "detail": "All results are single-run point estimates; no standard deviation or error bars are reported, making it impossible to assess whether observed differences exceed noise." + }, + { + "flag": "Proprietary, non-reproducible benchmark", + "detail": "The evaluation benchmark and 1,669-project retrieval corpus are proprietary WeChat internal data; independent reproduction or verification of any result is structurally impossible." + }, + { + "flag": "C++-only study", + "detail": "All experiments use C++ code exclusively; conclusions recommending RAG configurations for 'proprietary environments' broadly are unsupported since other languages may respond differently to lexical vs semantic retrieval." + }, + { + "flag": "No inference latency or cost reported", + "detail": "The paper evaluates RAG accuracy but does not report retrieval latency, inference overhead, or compute cost — critical factors for deployment decisions in production code completion systems." + } + ], + "cited_papers": [ + { + "title": "GraphCoder: Enhancing Repository-Level Code Completion via Code Context Graph-based Retrieval and Language Model", + "relevance": "Repository-level RAG code completion using graph-based retrieval, direct structural comparator to this study" + }, + { + "title": "REPOFUSE: Repository-Level Code Completion with Fused Dual Context", + "relevance": "Repository-level code completion combining dependency and similarity context, closely related prior approach" + }, + { + "title": "Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion", + "relevance": "Alternative RAG approach using data flow graphs for code completion context retrieval" + }, + { + "title": "ReACC: A Retrieval-Augmented Code Completion Framework", + "relevance": "Foundational RAG framework for code completion on public benchmarks, motivates this closed-source extension" + }, + { + "title": "FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion", + "relevance": "Related RAG code completion approach evaluated on public benchmarks" + }, + { + "title": "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems", + "relevance": "Standard benchmark methodology for repository-level code completion this study extends to closed-source settings" + }, + { + "title": "Studying LLM Performance on Closed- and Open-source Data", + "relevance": "Directly motivates the investigation of performance gaps between open-source and closed-source codebases" + }, + { + "title": "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", + "relevance": "Semantic retrieval model evaluated as one of four similarity-based retrieval baselines" + }, + { + "title": "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis", + "relevance": "Primary evaluation metric used throughout the paper" + }, + { + "title": "STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis", + "relevance": "Alternative approach combining static analysis with LLM-based code completion, related line of work" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Direct industrial deployment study at WeChat scale with actionable configuration guidance (BM25+GTE-Qwen hybrid for 7B+ models) for practitioners building closed-source code completion systems." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Main findings confirm expected directions (RAG helps, hybrid retrieval is better); the finding that GTE-Qwen uniquely outperforms with incomplete queries is a mildly interesting exception to the general pattern." + }, + "fear_safety": { + "score": 0, + "justification": "No AI risk, safety, or security concerns are raised; this is a pure productivity tool evaluation." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversial claims, disputes with prior work, or conflict angles present." + }, + "demo_ability": { + "score": 1, + "justification": "Methods use open-source models and public retrieval libraries (BM25S, Qdrant, vLLM), making the approach replicable in principle, but the proprietary benchmark and corpus prevent direct reproduction." + }, + "brand_recognition": { + "score": 2, + "justification": "WeChat/Tencent is a globally recognized platform (1B+ MAU cited); paper also evaluates prominent recent models including DeepSeek-V3 and Qwen2.5 series." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44769170", + "title": "The unreasonable likelihood of being: origin of life, terraforming, and AI", + "points": 16, + "comments": 9, + "url": "https://news.ycombinator.com/item?id=44769170" + }, + { + "hn_id": "44198829", + "title": "Algebra Unveils Deep Learning – An Invitation to Neuroalgebraic Geometry", + "points": 13, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44198829" + }, + { + "hn_id": "42886971", + "title": "Thoughts Are All over the Place: On the Underthinking of O1-Like LLMs", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42886971" + }, + { + "hn_id": "42884879", + "title": "Streaming DiLoCo: Towards a Distributed Free Lunch (Google DeepMind)", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42884879" + }, + { + "hn_id": "45056536", + "title": "Galois Theory by Calculator", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45056536" + }, + { + "hn_id": "45801598", + "title": "Streaming DiLoCo: Towards a Distributed Free Lunch", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45801598" + }, + { + "hn_id": "44081257", + "title": "An Invitation to Neuroalgebraic Geometry", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44081257" + }, + { + "hn_id": "43321959", + "title": "Swallowing the Poison Pills: Insights from Vulnerability Disparity Among LLMs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43321959" + } + ], + "top_points": 16, + "total_points": 43, + "total_comments": 10 + } +} +\ No newline at end of file diff --git a/papers/deepcircuitx-comprehensive-repositorylevel-2025/scan-v5.json b/papers/deepcircuitx-comprehensive-repositorylevel-2025/scan-v5.json @@ -0,0 +1,342 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis", + "authors": [ + "Zeju Li", + "Changran Xu", + "Zhengyuan Shi", + "Zedong Peng", + "Yi Liu", + "Yunhao Zhou", + "Lingfeng Zhou", + "Chengyu Ma", + "Jianyuan Zhong", + "Xi Wang", + "Jieru Zhao", + "Zhufei Chu", + "Xiaoyan Yang", + "Qiang Xu" + ], + "year": 2025, + "venue": "2025 IEEE International Conference on LLM-Aided Design (ICLAD)", + "arxiv_id": "2502.18297", + "doi": "10.1109/ICLAD65226.2025.00029" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract's core claims — 4,000+ repo-level RTL projects (Table I), multi-level CoT annotations, fine-tuning effectiveness (Tables VI–VII), and human quality evaluation (Table V, all >3.5/4) — are all supported by the paper's content.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper makes causal claims that 'fine-tuning LLMs on our dataset leads to significant performance improvements' but only compares fine-tuned vs. non-fine-tuned versions of the same models; no ablation isolates whether gains come from the CoT annotations, repository-level structure, data volume, or domain specificity.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion broadly claims DeepCircuitX 'establishes new benchmarks for RTL tasks' and will 'transform this critical domain,' but the PPA prediction experiment uses only 10 test designs and coverage of EDA tasks beyond understanding/generation/completion is not demonstrated.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No alternative explanations are offered for observed performance gains; the paper does not consider whether gains stem simply from increased domain-specific training volume rather than the distinctive repository-level or CoT properties of the dataset.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "BLEU/METEOR/ROUGE are used to evaluate 'RTL code understanding' without acknowledging that these measure surface linguistic similarity, not semantic or functional understanding; the paper does not distinguish between the proxy metric and the claimed capability.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the only acknowledgment of shortcomings is one sentence in the PPA discussion noting that delay prediction is an open problem.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats to validity are discussed; notable concerns such as data contamination (GitHub repos in LLM pretraining), the tiny PPA test set (n=10), or evaluator independence in human evaluation are not mentioned.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit scope boundaries are stated; the paper does not specify what the benchmark results cannot show or where the dataset would fail to generalize (e.g., proprietary EDA flows, non-Verilog HDLs, advanced process nodes beyond those tested).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure appears anywhere in the paper text; the National Center of Technology Innovation for EDA affiliation is listed but no grants or sponsors are acknowledged.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are clearly listed in the header (CUHK, SJTU, Hangzhou Dianzi University, Ningbo University, Southeast University, and the National Center of Technology Innovation for EDA).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so this criterion is not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, no declaration of patents, equity, or consulting relationships appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "RTL (Register Transfer Level), PPA (Power, Performance, Area), CoT (Chain of Thought), and the hierarchy of design levels (chip, IP, module, block) are all defined explicitly in the introduction and dataset sections.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper lists four explicit contributions: a repository-level dataset of 4,000+ RTL projects, four-level organization, CoT annotation methodology, and pre-training/evaluation benchmarks for RTL and PPA tasks.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section II provides a structured comparison with prior EDA datasets (CircuitNet, ISCAS, RTL-Repo, RTLLM, VerilogEval) and articulates specific gaps that DeepCircuitX addresses (file-level only, no PPA data, no CoT annotations).", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper argues that repository-level structure is needed for comprehensive RTL modeling but does not formally argue why BLEU/METEOR/ROUGE measure RTL understanding or why Pass@k adequately captures generation quality beyond syntactic correctness.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "No difficulty tiers are defined or measured; the benchmark tasks are described by category (IP/Module/Chip) and count but not by difficulty level, and no analysis of item difficulty distribution is provided.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "Several base models score 0% Pass@k (floor effect) and this is not discussed as a benchmark design concern; the paper presents it as evidence of effectiveness rather than a signal that the benchmark may be miscalibrated for those models.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "Human evaluation is conducted only to rate annotation quality (accuracy, completeness, clarity on a 1–4 scale), not to establish how humans perform on the RTL understanding, completion, or generation benchmark tasks.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "BLEU/METEOR/ROUGE and MAPE/RRSE are described but not justified as the right metrics for RTL-specific tasks; no discussion of edge cases in scoring (e.g., functionally correct but syntactically different code) is provided.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "No contamination-resistance measures are implemented; the dataset is collected from GitHub, which is in the pretraining corpora of all evaluated LLMs (CodeLlama, CodeT5+, DeepSeek), and no temporal split or canary mechanism is used.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of how the benchmark will remain useful as LLMs improve or become more capable at RTL code; no versioning or update plan is mentioned.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "PPA delay prediction failure is noted in passing, but systematic failure modes of the benchmark itself (what it cannot measure, how it could be gamed, what it conflates) are not discussed.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "Fine-tuned model weights and training code are not mentioned as available; the dataset URL (a gitbook page) is given, but there is no mention of releasing the evaluation harness or fine-tuned checkpoints needed to reproduce reported numbers.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "The paper describes data collection methodology and structure but provides no formal data card, no explicit train/validation/test splits for the benchmark tasks, and no description of quality filtering beyond functional correctness implied by synthesis.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The dataset is listed as available at a gitbook URL but no license is specified; terms of use, redistribution rights, and whether the GitHub-sourced RTL code carries inherited licenses are not addressed.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "Intended uses (LLM fine-tuning for RTL tasks, PPA prediction) are described, but the paper does not specify what should NOT be concluded from benchmark results or known limitations of the intended use cases.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Fine-tuning LLMs on DeepCircuitX leads to significant performance improvements across all RTL understanding, completion, and generation metrics compared to non-fine-tuned counterparts.", + "evidence": "Tables VI and VII show consistent gains across CodeLlama, CodeT5+, CodeGen2, CodeGen2.5, and DeepSeek models; e.g., CodeGen2.5 BLEU-4 rises from 0.11 to 13.69, and Pass@1 on RTLLM rises from 17.24% to 24.14% (though the original was already non-trivial).", + "supported": "moderate" + }, + { + "claim": "DeepCircuitX is the first comprehensive repository-level RTL dataset combining multilevel code with netlists and PPA metrics.", + "evidence": "Related work in Section II systematically compares with prior datasets (RTL-Repo, RTLLM, VerilogEval, CircuitNet) and identifies their limitations (file-level only, no PPA data), supporting novelty of the combination.", + "supported": "moderate" + }, + { + "claim": "CoT annotations generated by GPT-4 and Claude are high quality, as confirmed by independent expert human evaluation.", + "evidence": "Table V shows all six metrics (repo and module annotation accuracy, completeness, clarity) score above 3.5/4 from 5 reviewers per sample; evaluator selection and sample size are not reported.", + "supported": "moderate" + }, + { + "claim": "PPA prediction for practical designs (>10k cells) remains an open challenge, particularly for delay prediction.", + "evidence": "Table VIII shows delay MAPE of 4.74 (SNS) and 3.48 (MasterRTL) even at 100% training data, with RRSE values >2; the paper attributes this to logic synthesis optimization complexity.", + "supported": "strong" + }, + { + "claim": "Models of different scales (220M to 16B) all benefit from fine-tuning on DeepCircuitX, demonstrating dataset adaptability.", + "evidence": "Table VI shows CodeT5+ 220M improves from BLEU-4 0.14 to 4.91, while 7B and 16B models show comparable or larger gains, supporting scale-agnostic benefit.", + "supported": "strong" + }, + { + "claim": "DeepCircuitX covers 77 functional categories across chip, IP, and module designs with over 4,000 repositories and 140,000 RTL files.", + "evidence": "Table I confirms: 17 chip-level categories (1,002 repos, 54,650 files), 3 IP-level (1,410 repos, 92,467 files), 57 module-level (2,383 repos, 38,692 files).", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "DeepCircuitX introduces a repository-level RTL dataset of 4,000+ projects spanning chip, IP, and module designs, enriched with multi-level CoT annotations (GPT-4/Claude) and synthesized PPA metrics across five technology nodes. Fine-tuning LLMs on this dataset consistently outperforms non-fine-tuned baselines on RTL understanding (BLEU-4, METEOR, ROUGE) and generation/completion (Pass@k on RTLLM and VerilogEval), across model scales from 220M to 16B parameters. Human evaluators rated annotation quality above 3.5/4 on accuracy, completeness, and clarity. PPA prediction remains an open problem, especially for delay estimation on large designs (>10k cells), where all tested models show high error rates even with full training data.", + "red_flags": [ + { + "flag": "GitHub contamination unaddressed", + "detail": "All evaluated LLMs were pretrained on GitHub code; the dataset is collected from GitHub using keyword search with no temporal split, canary strings, or deduplication against model training sets — making benchmark results potentially optimistic due to data contamination." + }, + { + "flag": "PPA test set n=10", + "detail": "The PPA prediction evaluation uses only 10 test designs, making statistical conclusions about model performance unreliable; confidence intervals and variance are not reported." + }, + { + "flag": "No human baseline for benchmark tasks", + "detail": "Human evaluation measures annotation quality only; no human performance baseline is provided for RTL understanding, completion, or generation tasks, making it impossible to assess how far models are from human-level performance." + }, + { + "flag": "Proxy metrics for understanding", + "detail": "BLEU/METEOR/ROUGE measure surface n-gram overlap with reference text, not functional or semantic understanding of RTL code; the paper uses these without justification or discussion of their validity for this domain." + }, + { + "flag": "No statistical significance testing", + "detail": "Performance improvements are described as 'significant' throughout but no statistical tests, confidence intervals, or variance across runs are reported." + }, + { + "flag": "No license for dataset", + "detail": "The dataset is collected from GitHub repositories (which carry individual licenses) and annotated with GPT-4/Claude outputs (which may carry usage restrictions); no license or terms of use are specified for the released dataset." + }, + { + "flag": "No limitations section", + "detail": "The paper has no dedicated limitations or threats-to-validity section; the conclusion only mentions open problems in PPA prediction without discussing limitations of the dataset or benchmark methodology." + } + ], + "cited_papers": [ + { + "title": "RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects", + "relevance": "Direct predecessor establishing the concept of repository-level RTL benchmarking; DeepCircuitX explicitly builds on and extends this work." + }, + { + "title": "RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model", + "relevance": "Used as one of two primary evaluation benchmarks for RTL code completion and generation in the paper's experiments (Table VII)." + }, + { + "title": "VerilogEval: Evaluating Large Language Models for Verilog Code Generation", + "relevance": "Second primary evaluation benchmark used for Pass@k scoring in RTL generation experiments." + }, + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "relevance": "Foundational method adopted for the CoT annotation methodology central to DeepCircuitX's annotation pipeline." + }, + { + "title": "CircuitNet: An Open-Source Dataset for Machine Learning Applications in Electronic Design Automation", + "relevance": "Key prior EDA dataset that DeepCircuitX differentiates from by providing RTL-level rather than post-synthesis layout data." + }, + { + "title": "MasterRTL: A Pre-Synthesis PPA Estimation Framework for Any RTL Design", + "relevance": "One of the PPA prediction models evaluated on DeepCircuitX in Table VIII; provides direct comparison baseline." + }, + { + "title": "Benchmarking Large Language Models for Automated Verilog RTL Code Generation", + "relevance": "Prior work collecting 50,000 open-source Verilog samples for LLM fine-tuning; directly compared as a file-level-only dataset that DeepCircuitX supersedes." + }, + { + "title": "MG-Verilog: Multi-Grained Dataset Towards Enhanced LLM-Assisted Verilog Generation", + "relevance": "Contemporary dataset for LLM-assisted Verilog generation; cited as an example of existing high-quality hardware data efforts that DeepCircuitX extends." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly usable by researchers fine-tuning LLMs for hardware design automation, but the domain (RTL/EDA) is narrow and requires specialized infrastructure (commercial EDA tools like Synopsys Design Compiler) to reproduce." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that domain-specific fine-tuning improves performance is expected; the PPA delay prediction difficulty is a useful negative result but not surprising to EDA practitioners." + }, + "fear_safety": { + "score": 0, + "justification": "Hardware design dataset with no AI safety, security, or risk implications." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy, competing claims, or replication disputes involved." + }, + "demo_ability": { + "score": 2, + "justification": "Dataset is publicly available at the gitbook URL; researchers can download and fine-tune their own models, though reproducing PPA synthesis results requires commercial EDA tools." + }, + "brand_recognition": { + "score": 1, + "justification": "CUHK (The Chinese University of Hong Kong) has moderate recognition in the EDA/ML community; no major industry lab affiliation." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/deepcrceval-revisiting-evaluation-2024/scan-v5.json b/papers/deepcrceval-revisiting-evaluation-2024/scan-v5.json @@ -0,0 +1,355 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation", + "authors": [ + "Junyi Lu", + "Xiaojia Li", + "Zihan Hua", + "Lei Yu", + "Shiqi Cheng" + ], + "year": 2024, + "venue": "arXiv.org", + "arxiv_id": "2412.18291", + "doi": "10.48550/arXiv.2412.18291" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are verified by paper content: <10% benchmark quality is confirmed by Venn diagrams (3% Tufano, 8% CRer); 88.78% time and 90.32% cost reductions are derivable from Table 4; LLM-Reviewer superiority is shown in Tables 6–7.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper claims LLM-Reviewer outperforms SOTA CRCGs because it is 'target-oriented,' but the central confound—that GPT-4 is an orders-of-magnitude stronger model than the T5/BERT-based baselines—is never controlled for or discussed.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper makes broad claims that text similarity metrics are 'inadequate' for code review evaluation based on 100 sampled Java comments per dataset, but Section 9 conclusions are stated without bounding to this scope.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The primary alternative explanation—that LLM-Reviewer wins due to GPT-4's superior model capability rather than the target-oriented approach—is never considered; only LLM-evaluating-LLM bias is briefly acknowledged.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly argues that BLEU/ROUGE are indirect proxies that do not capture actual code review objectives, and positions its 9 criteria as direct measures; this distinction is the paper's central framing.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7.3 'Threats to Validity' is a dedicated paragraph discussing GPT-4 exclusivity, Java-only scope, graduate students as proxies, small human-evaluation sample size, and LLM-evaluating-LLM bias.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Threats are named specifically: Java language only, graduate CS students with 6+ years of experience used as proxies for developers, GPT-4 specifically selected, and small sample size for human analysis are each identified concretely.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what results do NOT show—e.g., whether findings hold for other languages, other LLMs, or non-defect-focused review tasks; threats are acknowledged but no explicit scope boundaries are drawn.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment appears anywhere in the paper despite authors affiliating with Chinese Academy of Sciences, Kuaishou Technology, and Sinosoft Company Limited.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are clearly stated on the title page: Institute of Software CAS, University of Chinese Academy of Sciences, Kuaishou Technology, and Sinosoft Company Limited.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding source is disclosed, so independence of funder from outcome cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial disclosure appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "CRCGs and Modern Code Review are explicitly defined; the 9 evaluation criteria (C1–C9) are individually described; the task formulation is mathematically specified in Appendix A.2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Five numbered contributions are explicitly listed in the Introduction: bias analysis of SOTA evaluation, DeepCRCEval framework, LLM-Reviewer baseline, empirical reevaluation, and public materials release.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 8 (Related Work) and the body of the paper substantively engage with CRCGs (Tufano, AUGER, CodeReviewer, CCT5, CommentFinder) and evaluation approaches (Bosu, Yang et al., Rahman et al.), explaining how this work differs from each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "The paper argues construct validity through a multi-step process—literature review, semi-structured developer interviews, card sorting, and affinity diagrams—to derive 9 criteria that directly capture code review objectives rather than textual similarity.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "The 1,000 test code cases are described only as containing 'typical issues' with no difficulty tiers defined, measured, or characterized; difficulty distribution is assumed without analysis.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "LLM-Reviewer scores near-perfect (9.97, 10.00, 9.67 across criteria) and AUGER scores at floor (1.00 on multiple criteria), indicating clear ceiling and floor effects that are neither acknowledged nor discussed.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "Human evaluators are used as raters but not as comment generators; there is no baseline of human-written review comment quality to compare against the CRCGs and LLM-Reviewer.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "The 9-criterion rubric is derived from prior literature plus developer interviews with justification for each dimension; ICC is used to measure evaluator agreement, and domain-specific scoring is argued as superior to single-score approaches.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "No contamination resistance measures are designed in; GPT-4 (LLM-Reviewer) may have encountered similar Java defect patterns in pretraining, and no temporal splits, canary strings, or anti-gaming measures are implemented.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether DeepCRCEval will remain useful as LLMs improve, whether LLM-Reviewer's near-perfect scores indicate the benchmark is already saturated, or any update plan for the framework.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "The threats section briefly notes LLM-evaluating-LLM bias but does not discuss systematic failure modes of DeepCRCEval itself, what it fails to measure, or how it could be gamed.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "The scoring QT application, test set, and baseline implementations are publicly available at zenodo.org/records/10511726 as stated in contribution 5.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "The 1,000 Java test cases are described minimally (human-processed, ROUGE-L deduplicated, typical issues) with no data card, source code origins, selection methodology, or preprocessing pipeline documented.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "Materials are listed as available on Zenodo but no license is specified in the paper; terms of use for the evaluation framework and test set are not stated.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "Section 7.1 gives research implications but does not specify what should NOT be concluded from DeepCRCEval results, what populations or languages the framework applies to, or boundary conditions for valid use.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Less than 10% of benchmark comments in Tufano and CodeReviewer datasets qualify as high-quality automation references", + "evidence": "Venn diagrams (Figure 3) show only 3% of Tufano and 8% of CRer comments satisfy all four quality dimensions (quality, category, tone, context) simultaneously", + "supported": "strong" + }, + { + "claim": "Text similarity metrics (BLEU, ROUGE) are inadequate for evaluating code review comment quality", + "evidence": "Prior SOTA improvements of <1% BLEU shown not to correlate with actual quality; qualitative analysis of 100 comments per dataset reveals high-BLEU comments can be meaningless generics", + "supported": "moderate" + }, + { + "claim": "LLM evaluators reduce evaluation time by 88.78% and cost by 90.32% vs. human evaluators", + "evidence": "Table 4: per-comparison human time 752.65s vs. LLM 68.69s; cost $2.09 vs. $0.17; reductions are directly computable from these figures", + "supported": "strong" + }, + { + "claim": "LLM-Reviewer outperforms all SOTA CRCGs under DeepCRCEval evaluation", + "evidence": "Tables 6–7: LLM-Reviewer ranked 1st by both human and LLM evaluators; scores ~9–10/10 across criteria vs. 1–4/10 for most baselines", + "supported": "moderate" + }, + { + "claim": "Human and LLM evaluators achieve high concordance (ICC >0.75) on most evaluation criteria", + "evidence": "Table 5 shows ICC >0.75 for C3–C5, C7–C8; lower for C1 Readability (0.62) and C9 Brevity (0.62), with divergence explained qualitatively", + "supported": "strong" + }, + { + "claim": "DeepCRCEval provides greater discrimination between models than text similarity metrics", + "evidence": "Score ranges from 1.00 to 9.97 across models in Table 6, contrasted with sub-1% BLEU differences cited from prior work", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "qualitative", + "observational" + ], + "key_findings": "Analysis of 100 comments from each of two major code review benchmark datasets (Tufano and CodeReviewer) reveals that only 3–8% qualify as high-quality references across quality, category, tone, and context dimensions, undermining BLEU/ROUGE-based evaluation. The proposed DeepCRCEval framework using 9 domain-specific criteria with human and LLM evaluators provides substantially better discrimination among CRCGs than text similarity. LLM-Reviewer, a training-free GPT-4-based baseline, dramatically outperforms all existing SOTA CRCGs under DeepCRCEval, though this finding is confounded by GPT-4's superior model capability relative to the T5/BERT-based baselines. LLM evaluation reduces time and cost by ~89–90% while maintaining acceptable agreement with human raters.", + "red_flags": [ + { + "flag": "Model capability confound", + "detail": "LLM-Reviewer uses GPT-4 against much weaker T5/BERT-era baselines; superiority could be entirely due to model capability rather than the 'target-oriented' approach, but this is never addressed or controlled for" + }, + { + "flag": "Circular LLM evaluation", + "detail": "GPT-4 both generates comments as LLM-Reviewer and evaluates all models via LLM evaluators; the paper acknowledges LLM-evaluating-LLM bias but cannot rule out systematic self-favoritism" + }, + { + "flag": "Ceiling effects unaddressed", + "detail": "LLM-Reviewer scores 9.97–10.00 on multiple criteria, suggesting the benchmark may already be saturated for capable LLMs; this is not discussed and raises questions about benchmark utility going forward" + }, + { + "flag": "Java-only scope with broad claims", + "detail": "All 1,000 test cases are Java-only; conclusions about code review comment generation generally are not bounded to this language or function-level granularity" + }, + { + "flag": "Tiny practical validation", + "detail": "The web-application user study uses only 5 industry developers rating 66 total cases, yet is cited as supporting evidence for practical utility" + }, + { + "flag": "No human comment generation baseline", + "detail": "There is no comparison against human-generated review comments; it is unknown whether even LLM-Reviewer approaches actual developer performance on the same tasks" + } + ], + "cited_papers": [ + { + "title": "Automating code review activities by large-scale pre-training (CodeReviewer)", + "relevance": "Primary SOTA baseline and source of one of the two benchmark datasets analyzed" + }, + { + "title": "Using pre-trained models to boost code review automation (Tufano et al. 2022)", + "relevance": "Key CRCG baseline and source of the Tufano benchmark dataset" + }, + { + "title": "Automatically generating review comments with pre-training models (AUGER)", + "relevance": "SOTA CRCG baseline evaluated in the reevaluation" + }, + { + "title": "Expectations, outcomes, and challenges of modern code review (Bacchelli & Bird, ICSE 2013)", + "relevance": "Foundational work on code review objectives and the 9-category comment classification system adopted in this paper" + }, + { + "title": "Code review quality: How developers see it (Kononenko et al., ICSE 2016)", + "relevance": "Primary source for defining quality criteria for code review comments; forms basis of the 9-criterion rubric" + }, + { + "title": "EvaCRC: Evaluating code review comments (Yang et al., FSE 2023)", + "relevance": "Most closely related prior work on automated evaluation of code review comments; key differentiator discussed in Related Work" + }, + { + "title": "Judging LLM-as-a-judge with MT-bench and Chatbot Arena (Zheng et al., NeurIPS 2023)", + "relevance": "Methodological basis for using LLMs as evaluators and the chain-of-thought prompt template adopted in DeepCRCEval" + }, + { + "title": "CommentFinder: A simpler, faster, more accurate code review comments recommendation", + "relevance": "Retrieval-based CRCG baseline evaluated in the reevaluation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "DeepCRCEval and LLM-Reviewer are publicly available tools that code review automation researchers can immediately adopt as evaluation and baseline infrastructure" + }, + "surprise_contrarian": { + "score": 2, + "justification": "Finding that <10% of widely-used benchmark comments are high quality and that training-free LLM-Reviewer beats all fine-tuned SOTA models challenges established evaluation practices in the field" + }, + "fear_safety": { + "score": 0, + "justification": "No safety or AI risk concerns raised; purely a software engineering methodology paper" + }, + "drama_conflict": { + "score": 1, + "justification": "Challenges validity of prior SOTA evaluations in the code review automation community, but framed constructively rather than as critique of specific teams" + }, + "demo_ability": { + "score": 2, + "justification": "A Gradio web application was deployed and tested with 5 industry developers; Zenodo materials including the scoring tool and test set are publicly accessible" + }, + "brand_recognition": { + "score": 1, + "justification": "Chinese Academy of Sciences is well-known but not a major AI industry lab; GPT-4 is used but the paper has no affiliation with OpenAI" + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42654204", + "title": "RAG with Differential Privacy", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42654204", + "created_at": "2025-01-10T09:50:11Z" + }, + { + "hn_id": "42813195", + "title": "CUTECat: Concolic Execution for Computational Law", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42813195", + "created_at": "2025-01-24T14:14:06Z" + }, + { + "hn_id": "42027141", + "title": "Context-Augmented Code Generation Using Programming Knowledge Graphs", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42027141", + "created_at": "2024-11-02T15:56:47Z" + } + ], + "top_points": 2, + "total_points": 6, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/deepreview-improving-llmbased-2025/scan-v5.json b/papers/deepreview-improving-llmbased-2025/scan-v5.json @@ -0,0 +1,548 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process", + "authors": [ + "Minjun Zhu", + "Yixuan Weng", + "Linyi Yang", + "Yue Zhang" + ], + "year": 2025, + "venue": "Annual Meeting of the Association for Computational Linguistics", + "arxiv_id": "2503.08569", + "doi": "10.48550/arXiv.2503.08569" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims are backed by tables: 88.21%/80.20% win rates against GPT-o1/DeepSeek-R1 appear in Table 4, and outperformance over CycleReviewer-70B on MSE appears in Table 2. Resources are released at ai-researcher.net.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper attributes adversarial robustness causally to the multi-stage framework ('we attribute this robustness to DeepReviewer's multi-stage reasoning framework') but runs no ablation removing stages to test robustness specifically; the fast/standard/best mode ablation tests accuracy, not robustness.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper claims to 'set a new benchmark for LLM-based paper review' but evaluation is limited exclusively to ICLR 2024/2025 ML papers; no discussion of whether findings transfer to other venues, disciplines, or review formats.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Key alternative explanations are absent: (1) DeepReviewer is trained on the same ICLR distribution as the test set, giving a distributional advantage over general-purpose LLMs; (2) using Gemini-2.0-Flash-Thinking as judge while also including it as a baseline creates a potential self-preference artifact that is not analyzed.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Rating MSE (predicting reviewer scores) is the primary quantitative metric, but the paper does not explicitly discuss the gap between score prediction accuracy and actual review utility (catching errors, actionable suggestions); LLM-as-judge qualitative evaluation partially addresses this but without acknowledging the proxy limitation.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated 'Limitations' section appears after the Conclusions, covering synthetic data quality, computational cost of Best mode, and incomplete adversarial robustness.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "The limitations name specific threats: synthetic training data may not capture human review nuances, Best mode is computationally intensive, and adversarial robustness is incomplete — these go beyond generic boilerplate.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not state that results apply only to ML/AI conference reviews in ICLR format; no explicit scope boundary distinguishes what the findings do not show (e.g., other venues, disciplines, or review styles).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Corresponding author footnote states 'Supported by Research Center for Industries of the Future, Westlake University,' disclosing institutional support.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All four authors have affiliations listed: Zhejiang University, Westlake University School of Engineering, and University College London.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "The funder is Westlake University's research center, an academic institution with no direct commercial interest in the evaluated system's performance.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patent declarations, or equity/consulting disclosures appear anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "'Deep thinking,' 'expert reviewer,' and 'human-like' are used in the title and throughout but never precisely defined; the paper operationalizes stages but does not formally define what distinguishes the framework from prior structured prompting approaches.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The introduction and conclusion explicitly enumerate three contributions: DeepReview-13K dataset, DeepReviewer-14B model, and DeepReview-Bench benchmark, alongside the multi-stage framework design.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 (Related Work) actively situates the paper against CycleReviewer, AI Scientist, AgentReview, and LLM reasoning literature, explaining how DeepReview extends or differs from each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code repository (zhu-minjun/Researcher), model weights (DeepReviewer-7B and 14B), dataset (DeepReview-13K), and demo are released at ai-researcher.net.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "DeepReview-13K (13,378 training samples) and DeepReview-Bench (1,286 test samples) are stated to be publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Training hardware (8x H100 80G, DeepSpeed+ZeRO3) is mentioned but no requirements.txt, Dockerfile, or package version list is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are included in the paper; readers are pointed to the repository without guidance on replicating experiments.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported in Tables 2, 3, or 4; all results are point estimates.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims despite large tables of numerical comparisons.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage improvements are reported with baseline context (e.g., 'Rating MSE: 44.80% ↑', '65.83% reduction' vs. prompt-based baselines), giving interpretable effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Test sets of 652 and 634 papers for ICLR 2024/2025 are used without power analysis or justification of sample adequacy.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No standard deviations, variance, or cross-run statistics are reported; all tables present single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Two classes of baselines: prompt-based (AI Scientist, AgentReview with GPT-o1, Claude-3.5-Sonnet, Gemini, DeepSeek-V3/R1) and fine-tuned (CycleReviewer 8B and 70B).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include GPT-o1-2024-12-17, Claude-3.5-sonnet-20241022, Gemini-2.0-Flash-Thinking-01-21, DeepSeek-R1 — all state-of-the-art at submission time.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 5.5 ablates reasoning depth (Fast/Standard/Best modes) and reviewer count (R=1 to R=6), showing per-metric impact of each component.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Quantitative metrics cover MSE, MAE, Decision Accuracy, F1, Spearman correlation, and Pairwise Accuracy; qualitative metrics add LLM-as-judge win rates across five dimensions.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "Qualitative evaluation uses Gemini-2.0-Flash-Thinking as judge, not human annotators; no humans evaluated the review text outputs.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "10% of the dataset (1,286 samples) was randomly held out as DeepReview-Bench, separate from the 13,378 training samples.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 breaks down results by Soundness, Presentation, and Contribution dimensions; Table 4 breaks win rates by constructive value, analytical depth, plausibility, and technical accuracy.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "The adversarial attack section notes a small score increase under attack (5.38→5.69) but presents no systematic failure case analysis or qualitative examples of where the model produces poor reviews.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Performance variability in Reviewer Scaling (R≠4) is explicitly noted; DeepReviewer's relative weakness versus Gemini in technical accuracy (showing 20.79% baseline win rate) is reported in Table 4.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions are specified: GPT-o1-2024-12-17, Claude-3.5-sonnet-20241022, Gemini-2.0-Flash-Thinking-01-21; training backbone is Phi-4 14B (Abdin et al., 2024).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figures 4, 5, 6, and 7 in the appendix provide full system prompts for the judge, review enhancement, paper analysis, and reliability verification stages.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Training: 23,500 steps, batch size 16, learning rate 5e-6, 40K context window, 256K with LongRoPE; inference: temperature 0.4, max input 100K tokens, max output 16,384 tokens.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The three-stage scaffold (novelty verification with literature retrieval, multi-dimension review, reliability verification with evidence chains) is described in detail in Section 4.2, including model assignments for each stage.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.1 documents PDF conversion via MinerU, LATEX source prioritization from arXiv, empty PDF filtering, and the automated quality control using Qwen-2.5-72B-Instruct.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "DeepReview-13K dataset is stated to be publicly released at ai-researcher.net with source papers from OpenReview/arXiv.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3.1 describes collection of 18,976 papers from OpenReview across ICLR 2024-2025, the three components assembled per paper (textual assessments, rebuttal discussions, standardized scores), and filtering to 13,378 valid samples.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participant recruitment; data was collected from OpenReview and arXiv public repositories.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from raw paper collection through three synthesis stages (novelty verification, multi-dimension review, reliability verification) to quality control filtering is documented in Section 4.2.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The Phi-4 base model's training data cutoff is not stated; the paper does not disclose when Phi-4's pre-training data ends relative to ICLR 2024/2025 paper submission dates.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether ICLR 2024/2025 papers in the test set may have been included in Phi-4's pre-training corpus; only the case study paper is explicitly noted as not in training data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "DeepReview-Bench uses ICLR 2024/2025 papers that were publicly available before Phi-4's training, but potential contamination of the base model's knowledge of these specific papers is not discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects study; ethics section discusses deployment implications, not IRB.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Output token counts by mode (3K/8K/14.5K) are mentioned, but no inference latency in seconds, API cost, or GPU utilization during inference is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware is specified (8x H100 80G, DeepSpeed+ZeRO3) but total GPU-hours or dollar cost for training 23,500 steps is not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DeepReviewer-14B reduces Rating MSE by 44.80% compared to CycleReviewer-70B despite having fewer parameters", + "evidence": "Table 2 shows DeepReviewer-14B MSE of 1.3137/1.3410 vs CycleReviewer-70B MSE of 2.4870/2.4294 on ICLR 2024/2025", + "supported": "strong" + }, + { + "claim": "DeepReviewer achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in LLM-as-judge evaluation", + "evidence": "Table 4 shows these win rates for 'overall judgment' on ICLR 2024; ICLR 2025 shows 91.67% and 87.39% respectively", + "supported": "strong" + }, + { + "claim": "DeepReviewer demonstrates superior robustness to adversarial attacks due to its multi-stage reasoning framework", + "evidence": "Figure 2 shows DeepReviewer's score increases only 0.31 points under attack vs 4.26 for Gemini; causal attribution to multi-stage design is asserted without ablation of robustness specifically", + "supported": "moderate" + }, + { + "claim": "Test-time scaling via reasoning depth and reviewer count consistently improves performance", + "evidence": "Figure 3 shows positive regression trends for both Fast→Best mode and R=1→R=6 reviewer scaling across most metrics, with noted variability at R≠4", + "supported": "moderate" + }, + { + "claim": "DeepReviewer reduces Rating MSE by an average of 65.83% and improves Decision Accuracy by 15.2 points compared to prompt-based baselines", + "evidence": "Reported in Section 5.2 body text with reference to Table 2; calculation is across multiple backbone models of AI Scientist/AgentReview", + "supported": "strong" + }, + { + "claim": "DeepReview-13K with structured reasoning annotations enables training a model that outperforms much larger fine-tuned competitors", + "evidence": "14B DeepReviewer outperforms 70B CycleReviewer across all metrics in Table 2; however, the advantage could partly reflect distributional fit since both use ICLR data", + "supported": "moderate" + }, + { + "claim": "Gemini-2.0-Flash-Thinking as judge validates DeepReviewer's superiority even when Gemini itself is a baseline being compared", + "evidence": "Table 4 shows 59.41% win rate for DeepReviewer even against Gemini, which Gemini judges; the self-evaluation conflict is noted but not corrected for", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "DeepReviewer-14B, trained on the synthetic DeepReview-13K dataset via a three-stage structured reasoning pipeline (novelty verification, multi-dimension review, reliability verification), substantially outperforms both larger fine-tuned models (CycleReviewer-70B) and frontier LLMs (GPT-o1, DeepSeek-R1) on rating prediction MSE, ranking, and paper selection tasks derived from ICLR 2024/2025 reviews. In LLM-as-judge qualitative evaluation, DeepReviewer achieves >88% win rates against GPT-o1 across five review quality dimensions. Test-time scaling experiments confirm that deeper reasoning paths and more simulated reviewers generally improve scoring accuracy, with the fastest mode (3K tokens) already outperforming prior 6K-token baselines.", + "red_flags": [ + { + "flag": "Judge-baseline conflict", + "detail": "Gemini-2.0-Flash-Thinking serves simultaneously as the LLM judge evaluating qualitative review quality AND as one of the baselines being evaluated, creating a potential self-preference artifact that is not statistically controlled for." + }, + { + "flag": "Training-test distributional overlap", + "detail": "Both training and test data come from ICLR 2024/2025 reviews; general-purpose LLM baselines have no such distributional advantage, making the comparison potentially unfair without explicit domain-adaptation controls." + }, + { + "flag": "No statistical testing", + "detail": "All comparative claims across large tables of metrics lack confidence intervals, significance tests, or variance estimates, making it impossible to assess whether improvements are reliable." + }, + { + "flag": "Base model contamination unaddressed", + "detail": "Phi-4's training data cutoff is not disclosed; ICLR 2024/2025 papers in the test set may have been seen during Phi-4 pre-training, potentially inflating performance on review content that references specific papers." + }, + { + "flag": "Self-reviewed paper", + "detail": "Appendix E states 'This article has been reviewed by DeepReviewer-14B and revised accordingly based on its review comments' — the paper being evaluated was itself revised using the system, raising questions about circularity." + }, + { + "flag": "No human evaluation of review quality", + "detail": "All qualitative evaluation relies on LLM-as-judge (Gemini); no human annotators assessed whether DeepReviewer's reviews are actually more useful, accurate, or actionable than baselines." + } + ], + "cited_papers": [ + { + "title": "CycleResearcher: Improving Automated Research via Automated Review", + "relevance": "Direct predecessor; provides CycleReviewer baseline and the CycleResearcher framework that DeepReview extends" + }, + { + "title": "The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery", + "relevance": "Primary prompt-based baseline for agent-driven paper review; represents the competing approach" + }, + { + "title": "AgentReview: Exploring Peer Review Dynamics with LLM Agents", + "relevance": "Second prompt-based baseline; multi-agent simulation of peer review process" + }, + { + "title": "OpenScholar: Synthesizing Scientific Literature with Retrieval-Augmented LMs", + "relevance": "Used as the literature retrieval backbone in DeepReview's novelty verification stage" + }, + { + "title": "Peer Review as a Multi-Turn and Long-Context Dialogue with Role-Based Interactions", + "relevance": "ReviewMT dataset and approach; prior work on structured LLM-based review generation" + }, + { + "title": "Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review", + "relevance": "Adversarial attack methodology and risk assessment for LLM review systems" + }, + { + "title": "Large Language Models for Automated Scholarly Paper Review: A Survey", + "relevance": "Comprehensive survey of the LLM paper review space that contextualizes this work" + }, + { + "title": "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications", + "relevance": "Foundational benchmark for peer review NLP tasks" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Live demo at ai-researcher.net/deepreviewer, released model weights, and three inference modes make this immediately usable by researchers and conference organizers." + }, + "surprise_contrarian": { + "score": 1, + "justification": "A 14B model outperforming 70B is mildly surprising, but the core finding (structured multi-stage reasoning improves review quality) aligns with conventional wisdom." + }, + "fear_safety": { + "score": 1, + "justification": "The ethics section raises concerns about bias amplification and reviewer deskilling, but these are framed as responsible-use guidance rather than alarming findings." + }, + "drama_conflict": { + "score": 1, + "justification": "The adversarial robustness finding (other models boosted 4+ points under attack) has a provocative angle, but the paper does not emphasize it as a central controversy." + }, + "demo_ability": { + "score": 3, + "justification": "Working demo, released model weights, and dataset allow anyone to test the system immediately; the homepage explicitly links to the demo." + }, + "brand_recognition": { + "score": 1, + "justification": "Westlake University and UCL are respected institutions but not the dominant brand names (OpenAI, Google, Meta) that drive HN attention." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45967079", + "title": "Show HN: Browser-based interactive 3D Three-Body problem simulator", + "points": 249, + "comments": 113, + "url": "https://news.ycombinator.com/item?id=45967079" + }, + { + "hn_id": "36349110", + "title": "The Distributed Tensor Algebra Compiler (2022)", + "points": 40, + "comments": 6, + "url": "https://news.ycombinator.com/item?id=36349110" + }, + { + "hn_id": "45202421", + "title": "UGMM-NN: Univariate Gaussian Mixture Model Neural Network", + "points": 31, + "comments": 12, + "url": "https://news.ycombinator.com/item?id=45202421" + }, + { + "hn_id": "44250248", + "title": "Thermal Detection of People with Mobility Restrictions for Barrier Reduction", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44250248" + }, + { + "hn_id": "41139898", + "title": "End-to-End Amp Modeling: From Data to Controllable Guitar Amplifier Models", + "points": 2, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=41139898" + }, + { + "hn_id": "43772236", + "title": "Hands-On: Segmenting Individual Signs from Continuous Sequences", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43772236" + } + ], + "top_points": 249, + "total_points": 327, + "total_comments": 133 + } +} +\ No newline at end of file diff --git a/papers/deepseek-coder-2024/scan-v5.json b/papers/deepseek-coder-2024/scan-v5.json @@ -0,0 +1,509 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence", + "authors": [ + "Guo, D.", + "Zhu, Q.", + "Yang, D.", + "Xie, Z.", + "et al." + ], + "year": 2024, + "venue": "arXiv", + "arxiv_id": "2401.14196", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims SOTA among open-source code models and superiority over GPT-3.5 are backed by benchmark results in Tables 3–8; 2T token training and 16K context are documented in Sections 2–3.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about FIM and repo-level pre-training are supported by ablation experiments: FIM rate ablation (Figure 3) and CrossCodeEval ablation with/without repo pre-training (Table 7).", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper's title claims 'The Rise of Code Intelligence' and the abstract asserts broad superiority, but evaluations are confined to narrow benchmarks (HumanEval, MBPP, LeetCode); no discussion of generalization limits beyond benchmark settings.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Better performance is attributed to data quality and repo-level training without considering alternative explanations such as sheer data volume advantage, architectural differences, or training compute differences.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Pass@1 on HumanEval is equated with 'code intelligence' throughout; no discussion of whether benchmark performance reflects real-world coding utility or the limitations of these proxies.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; only a brief one-sentence acknowledgment of potential LeetCode contamination in the results section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats to validity are enumerated; the contamination acknowledgment ('the possibility of data contamination cannot be entirely ruled out') is generic boilerplate.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what its results do not show (e.g., that HumanEval pass rates do not imply real-world productivity, or that comparisons are snapshot-in-time against specific model versions).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is disclosed anywhere in the paper; the acknowledgments section lists individual contributors but no funding body or grant.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly disclosed: DeepSeek-AI and Peking University (Key Lab of HCST), with contact emails provided.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "The majority of authors are DeepSeek-AI employees evaluating their own proprietary models; no independent third-party evaluation is performed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patent disclosures, or financial interest declarations appear anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key technical terms such as Fill-in-the-Middle (PSM/SPM modes), repository-level data construction, and cross-file completion are explained clearly with sufficient specificity for the technical audience.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Four explicit contributions are listed in the introduction: the DeepSeek-Coder model series, repo-level data construction, FIM training analysis, and comprehensive benchmark evaluations.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper situates itself against StarCoder, CodeLlama, CodeGeeX2, GPT-3.5/4, Codex, and related work including FIM training (Bavarian et al.) and deduplication (Lee et al., Kocetkov et al.).", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Source code and models are released on GitHub at https://github.com/deepseek-ai/DeepSeek-Coder, including the LeetCode evaluation benchmark.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Training data (798 GB proprietary crawl from GitHub) is not released; evaluation uses public benchmarks but the custom training corpus required to reproduce the model is unavailable.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The HAI-LLM framework and GPU cluster (A100/H800) are described but no requirements.txt, Dockerfile, or pinned dependency list is provided for reproducing training or evaluation.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step instructions for reproducing training or evaluation are included; the paper describes the pipeline at a high level but not with sufficient detail to follow without guessing.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All benchmark results are reported as single point estimates with no confidence intervals, error bars, or standard deviations across multiple runs.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claim despite multiple model comparisons across benchmarks with small evaluation set sizes (e.g., HumanEval n=164).", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage point differences are reported throughout (e.g., '9% and 11% improvement over CodeLlama-Base 34B') with baseline context provided in all comparison tables.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Benchmark sizes (HumanEval n=164, MBPP n=500) are not justified or discussed for statistical adequacy; no power analysis is mentioned.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or multiple-run statistics are reported; all results appear to be single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Strong baselines included across all tasks: CodeGeeX2, StarCoder, CodeLlama (7B/13B/34B), GPT-3.5-Turbo, GPT-4-Turbo, WizardCoder, Phind-CodeLlama.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include CodeLlama (2023), StarCoder (2023), GPT-3.5/4-Turbo — all contemporary at the time of writing (January 2024).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Two ablations are presented: FIM rate comparison (0%, 50%, 100%, MSP) in Figure 3, and repo-level pre-training ablation ('w/o Repo Pre-training') in Table 7.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics used: Pass@1, exact match (EM), edit similarity (ES), per-difficulty breakdown, and per-library breakdown across diverse benchmark suites.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Automated test-case-based evaluation is standard for code generation benchmarks; human evaluation of code correctness is not applicable here.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Held-out sets used: CrossCodeEval (repositories from March–June 2023, after training cutoff), LeetCode Contest (July 2023–January 2024), and standard benchmarks with withheld solutions.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Per-language breakdowns in Tables 3 and 6, per-library breakdown in Table 4 (DS-1000), per-difficulty breakdown in Table 5 (LeetCode Easy/Medium/Hard).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Appendix only shows successful interaction examples (snake game, database); no failure cases or error analysis is presented.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that 100% FIM rate hurts code completion (Figure 3) and that DeepSeek-Coder-v1.5 shows slight coding regression vs. original 6.7B (Table 10: 43.2% vs 44.7% HumanEval).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "DeepSeek models are specified by parameter count, but GPT-3.5-Turbo and GPT-4-Turbo are named without snapshot dates — critical given that OpenAI models change over time.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Only the LeetCode evaluation template is provided; prompts for HumanEval, MBPP, DS-1000, CrossCodeEval, and math reasoning benchmarks are not given.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Table 2 provides hidden size, layers, attention heads, batch size, learning rates; AdamW with β1=0.9, β2=0.95, FIM rate 0.5, warm-up steps, and LR scheduling are described.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; this is direct model evaluation on benchmarks.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 2 and Figure 2 document the full pipeline: crawling, rule filtering, dependency parsing (Algorithm 1), repo-level deduplication, quality screening, and n-gram decontamination.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The 798 GB training corpus is proprietary and not publicly released; no raw training data is available for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 2 describes GitHub crawling scope (pre-February 2023), 87-language selection, filter rules, dependency parsing, deduplication, and quality screening with statistics in Table 1.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; evaluation uses automated benchmark testing against fixed test cases.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Figure 2 shows the full data pipeline (crawl → filter → dependency parse → dedup → quality screen), and each step is described in dedicated subsections with specific algorithms and thresholds.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "Explicitly stated: 'We collect public repositories created before February 2023 on GitHub.'", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "Section 2.4 describes n-gram decontamination filtering HumanEval, MBPP, GSM8K, and MATH examples; CrossCodeEval's post-February 2023 construction is explicitly noted as preventing overlap.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "N-gram filtering (10-gram or 3-gram exact match) applied for HumanEval/MBPP/GSM8K/MATH; LeetCode contamination is explicitly acknowledged as unresolvable and flagged for the community.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference latency, memory requirements, or cost estimates are reported despite releasing models for public deployment.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "GPU cluster hardware is described (A100/H800) but total training compute (GPU-hours, FLOPs, or cost) is not reported for any model size.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DeepSeek-Coder-Base 33B achieves state-of-the-art performance among open-source code models on HumanEval (50.3% avg across 8 languages) and MBPP (66.0%).", + "evidence": "Table 3 shows DeepSeek-Coder-Base 33B outperforming CodeLlama-34B (41.0% avg, 55.2% MBPP) and all other listed open-source models.", + "supported": "strong" + }, + { + "claim": "DeepSeek-Coder-Instruct 33B outperforms GPT-3.5-Turbo on code generation benchmarks.", + "evidence": "Table 3 shows Instruct 33B at 69.2% avg vs GPT-3.5-Turbo 64.9%; Table 5 shows Instruct 33B at 27.8% vs GPT-3.5-Turbo 23.3% on LeetCode Contest.", + "supported": "moderate" + }, + { + "claim": "DeepSeek-Coder-Base 6.7B matches or exceeds CodeLlama-Base 34B despite having 5x fewer parameters.", + "evidence": "Table 3: DeepSeek 6.7B at 44.7% avg and 60.6% MBPP vs CodeLlama 34B at 41.0% avg and 55.2% MBPP.", + "supported": "strong" + }, + { + "claim": "Repository-level pre-training improves cross-file code completion performance.", + "evidence": "Table 7 ablation: 'w/o Repo Pre-training' shows performance drops on Java (16.64% vs 17.72% EM), TypeScript (13.23% vs 14.03% EM), and C# (14.48% vs 16.23% EM).", + "supported": "weak" + }, + { + "claim": "FIM training at 50% PSM rate optimally balances code completion (FIM) and code generation performance.", + "evidence": "Figure 3 ablation: 100% FIM rate maximizes HumanEval-FIM but minimizes HumanEval and MBPP pass@1; 50% PSM outperforms MSP strategy.", + "supported": "strong" + }, + { + "claim": "Continuing pre-training from a general LLM (DeepSeek-LLM-7B) significantly improves math and natural language capabilities of DeepSeek-Coder-v1.5.", + "evidence": "Table 10: GSM8K improves from 43.2% to 62.4%, MATH from 19.2% to 24.7%, MMLU from 36.6% to 49.1% at modest cost of ~1.5pp HumanEval regression.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DeepSeek-Coder introduces a family of open-source code LLMs (1.3B–33B) trained from scratch on 2 trillion tokens with repository-level data organization and Fill-in-the-Middle training, achieving state-of-the-art performance among open-source models across HumanEval, MBPP, DS-1000, CrossCodeEval, and LeetCode Contest benchmarks. The 33B instruct variant surpasses GPT-3.5-Turbo on most code tasks. Ablations show that 50% PSM-mode FIM rate optimally balances completion and generation ability, and that repo-level pre-training provides modest but consistent improvements on cross-file completion. Continued pre-training from a general LLM substantially improves mathematical and natural language capabilities at minor cost to code performance.", + "red_flags": [ + { + "flag": "Self-evaluation", + "detail": "All evaluations are conducted by DeepSeek-AI employees on their own models; no independent third-party replication is reported." + }, + { + "flag": "No statistical significance testing", + "detail": "All comparative claims are based on point estimates with no confidence intervals, error bars, or significance tests, despite small benchmark sizes (HumanEval n=164)." + }, + { + "flag": "GPT baselines not version-pinned", + "detail": "GPT-3.5-Turbo and GPT-4-Turbo are referenced without snapshot dates; these models changed significantly during 2023–2024, making comparisons unreliable." + }, + { + "flag": "Training data not released", + "detail": "The 798 GB proprietary training corpus is not publicly available, making it impossible to fully reproduce the work or verify data quality claims." + }, + { + "flag": "LeetCode contamination unresolved", + "detail": "The paper acknowledges 'the possibility of data contamination cannot be entirely ruled out' for LeetCode, and notes higher scores in July/August contests, but does not resolve or quantify the contamination." + }, + { + "flag": "No limitations section", + "detail": "Despite making strong comparative claims (SOTA, surpassing GPT-3.5), there is no dedicated limitations or threats-to-validity section." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Primary benchmark used throughout; introduces pass@k evaluation for code generation models." + }, + { + "title": "StarCoder: may the source be with you!", + "relevance": "Key open-source baseline model and data source (StarCoder data pipeline) that DeepSeek-Coder directly competes with and builds upon." + }, + { + "title": "Code Llama: Open Foundation Models for Code", + "relevance": "Primary open-source baseline across all benchmarks; DeepSeek-Coder's 6.7B model is claimed to match CodeLlama-34B." + }, + { + "title": "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion", + "relevance": "Used to evaluate the novel repo-level pre-training contribution; provides contamination-free evaluation (post February 2023)." + }, + { + "title": "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation", + "relevance": "More realistic benchmark than HumanEval for practical data science coding tasks across 7 libraries." + }, + { + "title": "Efficient Training of Language Models to Fill in the Middle (FIM)", + "relevance": "Foundation for the Fill-in-the-Middle training objective that is a core contribution of DeepSeek-Coder." + }, + { + "title": "The Stack: 3 TB of Permissively Licensed Source Code", + "relevance": "Data source and deduplication methodology that DeepSeek-Coder's data pipeline extends with repo-level deduplication." + }, + { + "title": "Program Synthesis with Large Language Models (MBPP)", + "relevance": "Secondary code generation benchmark used throughout all comparisons." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Models released open-source with permissive license, directly usable by practitioners as drop-in replacement for closed-source code assistants." + }, + "surprise_contrarian": { + "score": 2, + "justification": "An open-source model matching or beating GPT-3.5-Turbo on code was surprising at the time of publication (January 2024)." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; paper is purely a technical model introduction." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild competitive framing against OpenAI's closed-source models, but no controversy or confrontational claims." + }, + "demo_ability": { + "score": 3, + "justification": "Models are publicly available on GitHub and HuggingFace; anyone can run them immediately." + }, + "brand_recognition": { + "score": 2, + "justification": "DeepSeek-AI has become well-known in the open-source LLM community; Peking University affiliation adds academic credibility." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "39142278", + "title": "Python has 189X the dataset size compared to Rust", + "points": 2, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=39142278", + "created_at": "2024-01-26T13:18:01Z" + } + ], + "top_points": 2, + "total_points": 2, + "total_comments": 4 + } +} +\ No newline at end of file diff --git a/papers/deepseek-coder-v2-2024/scan-v5.json b/papers/deepseek-coder-v2-2024/scan-v5.json @@ -0,0 +1,573 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence", + "authors": [ + "DeepSeek-AI" + ], + "year": 2024, + "venue": "arXiv", + "arxiv_id": "2406.11931", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "The abstract claims 'superior performance compared to closed-source models such as GPT4-Turbo' but results are mixed: DeepSeek-Coder-V2 scores below GPT-4-Turbo-0409 on LiveCodeBench (43.4% vs 45.7%) and SWE-Bench (12.7% vs 18.3%), while exceeding it on HumanEval and MBPP.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Table 1 ablates the new code corpus against DeepSeek-Coder's corpus using a controlled 1B model; Figure 3 compares reward model signal vs raw compiler signal in RL training. These ablations partially justify causal claims made.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The title claims 'Breaking the Barrier of Closed-Source Models in Code Intelligence' — a broad generalization. Results show competitive performance on specific benchmarks but the paper does not qualify claims to tested settings or note the benchmarks where closed-source models still lead.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No alternative explanations are considered for the performance improvements — whether gains are primarily from scale, the new data corpus, the DeepSeek-V2 initialization, or the RL phase is not systematically disentangled.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Benchmark scores (HumanEval, MBPP) are used to claim code intelligence parity with closed-source models without discussing the gap between these proxies and real-world coding ability.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations section. The only limitation mentioned is a single sentence in the conclusion about an 'instruction-following gap compared to GPT-4 Turbo'.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats-to-validity section. Contamination is discussed for specific benchmarks but other threats (self-evaluation bias, benchmark saturation, single-run results without variance) are not addressed.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit scope boundaries are stated. The paper does not specify what tasks or domains the results do not apply to.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure anywhere in the paper. This is a company paper from DeepSeek-AI but no funding source or financial support is mentioned.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors are listed as DeepSeek-AI employees and the affiliation is clearly stated on the paper.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "All authors are DeepSeek-AI employees evaluating their own model. There is no independent funder or third-party evaluator.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is included anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "'Code intelligence' is used in the title and throughout without definition. 'Performance comparable' is not quantified with specific thresholds.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1.1 explicitly lists three contributions: the 16B/236B MoE models, the first open-source hundred-billion-parameter code model matching closed-source frontier, and public release under permissive license.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper directly compares against StarCoder, CodeLlama, DeepSeek-Coder, Codestral, GPT-4, Claude 3, and Gemini 1.5, explaining how each relates to DeepSeek-Coder-V2 in approach and performance.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The GitHub repository https://github.com/deepseek-ai/DeepSeek-Coder-V2 is linked in the abstract, and the paper states models are released under a permissive license for research and commercial use.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks used (HumanEval, MBPP, LiveCodeBench, SWE-Bench, CruxEval, GSM8K, MATH) are standard publicly available benchmarks used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or equivalent environment specification is provided in the paper. Hardware and software environment for training or evaluation are not specified.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided in the paper for reproducing training or evaluation results.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results are reported as single-point estimates with no confidence intervals or error bars anywhere in the paper.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are performed for any comparative claims, despite numerous model comparisons being made across many benchmarks.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage improvements with baselines are reported throughout (e.g., ablation table shows +6.7pp HumanEval with new corpus vs old; Figure 3 shows explicit pass@1 curves with magnitude visible).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Benchmark sizes are mentioned (HumanEval: 164 problems, AIME: 30 problems) but no sample size justification or power analysis is provided for any comparisons.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or run-to-run variance is reported. All results appear to be single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines included: closed-source (GPT-4-Turbo-0409, GPT-4o-0513, Claude-3-Opus, Gemini-1.5-Pro) and open-source (StarCoder2, CodeLlama, DeepSeek-Coder-33B, Codestral, Llama3-70B).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include GPT-4-Turbo-0409, GPT-4o-0513, Claude-3-Opus, and Gemini-1.5-Pro — all state-of-the-art models at the time of publication (June 2024).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 1 ablates new vs old code corpus using a controlled 1B model; Figure 3 ablates reward model signal vs compiler signal in RL training showing reward model outperforms.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Evaluation spans HumanEval, MBPP+, multilingual coding (13 languages), LiveCodeBench (by difficulty), USACO, RepoBench, FIM tasks, Defects4J, SWE-Bench, Aider, CruxEval, GSM8K, MATH, AIME, and NL benchmarks.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Automated benchmarks with test cases serve as ground truth throughout; human evaluation of model outputs is not applicable to this code benchmark evaluation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Standard held-out test splits are used for all benchmarks; RepoBench uses only the December 2023 subset not present in training data.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 breaks results down by 13 programming languages; Table 4 breaks LiveCodeBench by difficulty (Easy 82, Medium 87, Hard 57); Table 10 breaks NL benchmarks by domain.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No systematic discussion of failure cases. Only a brief acknowledgment in the conclusion that SWE-bench performance lags and instruction-following has gaps relative to GPT-4-Turbo.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that the Lite model underperforms Codestral on code completion (Table 5), CruxEval scores fall behind GPT-4o (70.0% vs 77.4%), and knowledge-intensive benchmark scores (TriviaQA) decline vs DeepSeek-V2.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Most baselines specify version: GPT-4-Turbo-0409, GPT-4o-0513, GPT-4-1106; Claude-3-Opus and Gemini-1.5-Pro lack snapshot dates but are identified by specific model family names used at evaluation time.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "HumanEval instruction prompt is given in footnote 4; math CoT prompt given in footnote 9. Key prompts for primary benchmarks are provided, though SWE-bench prompts are absent.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "AdamW with β1=0.9, β2=0.95, weight decay 0.1; cosine decay with 2000 warm-up steps; SFT uses lr=5e-6, batch size=1M tokens; FIM rate 0.5. Some parameters deferred to DeepSeek-V2 paper.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": false, + "justification": "SWE-Bench evaluation requires repository navigation and patch generation scaffolding that is not described in the paper. The paper only states 'whole format' for Aider without explaining the agent scaffolding.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 2 documents filtering rules in detail: line length limits (avg>100, max>1000), alphabetic ratio <25%, XML/HTML/JSON/YAML-specific rules, near-deduplication, and three-iteration fastText classification pipeline.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw model outputs for each benchmark problem are not released. Only aggregated accuracy scores are reported in the paper's tables.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 2 describes collection in detail: GitHub repos before November 2023, CommonCrawl via fastText with three-iteration seed expansion, manual seed corpus construction, and final 1,170B token composition.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study; standard benchmarks and automated evaluation are used throughout.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Section 2 documents the full pipeline: seed corpus construction, fastText classifier training with BPE tokenizer, iterative URL collection, domain classification at 10% threshold, filtering, deduplication, and final composition (60% code / 10% math / 30% NL).", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states: 'We collect public repositories created before November 2023 on GitHub,' establishing a clear training data cutoff.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4.2.1 explicitly discusses potential overlap for RepoBench (uses only December 2023 subset to avoid leakage); LiveCodeBench is chosen for contamination-free evaluation with problems from December 2023 - June 2024.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Contamination is not addressed for the primary benchmarks HumanEval (2021) and MBPP (2021), which substantially predate the November 2023 training cutoff and are very likely present in the GitHub/CommonCrawl training corpus.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The MoE architecture with 2.4B/21B active parameters implies inference efficiency, but no concrete cost, latency, or throughput figures are reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "The paper mentions training on 6 trillion additional tokens but does not report the compute budget in GPU-hours, FLOPs, or wall-clock training time.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DeepSeek-Coder-V2 achieves superior performance to GPT-4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro on coding and math benchmarks", + "evidence": "HumanEval: 90.2% vs 88.2% (GPT-4-Turbo); MBPP+: 76.2% vs 72.2%; MATH: 75.7% vs 73.4%; Aider: 73.7% vs 63.9%; but inferior on LiveCodeBench (43.4% vs 45.7%) and SWE-Bench (12.7% vs 18.3%)", + "supported": "moderate" + }, + { + "claim": "The new code corpus is superior to the DeepSeek-Coder corpus, improving HumanEval by 6.7pp and MBPP by 9.4pp", + "evidence": "Table 1: 1B model with new corpus at 2T tokens achieves 37.2% HumanEval vs 30.5% baseline and 54.0% MBPP vs 44.6% baseline in controlled ablation", + "supported": "strong" + }, + { + "claim": "Using a reward model for RL training signal outperforms raw compiler feedback on Leetcode pass@1", + "evidence": "Figure 3 shows reward model signal achieving higher pass@1 than compiler signal on both LeetCode and LeetCode-zh across all 600 training steps shown", + "supported": "moderate" + }, + { + "claim": "DeepSeek-Coder-V2 is the first open-source model to exceed 10% on SWE-Bench", + "evidence": "Table 7 shows DeepSeek-Coder-V2-Instruct achieves 12.7% on SWE-Bench; all other listed open-source models score 0.0-2.7%", + "supported": "strong" + }, + { + "claim": "Continued pre-training on code data maintains comparable general language performance to DeepSeek-V2", + "evidence": "Table 10 shows Coder-V2 leads on reasoning benchmarks (BBH: 83.9 vs 79.7, Arena-Hard: 65.0 vs 41.6) but lags on knowledge tasks (TriviaQA: 82.3 vs 86.7), which is a mixed result", + "supported": "moderate" + }, + { + "claim": "DeepSeek-Coder-V2 handles 128K context reliably as shown by Needle-in-a-Haystack tests", + "evidence": "Figure 2 shows high scores across all context lengths up to 128K in NIAH test, though individual score values cannot be read precisely from the heatmap", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DeepSeek-Coder-V2 (236B MoE with 21B active parameters) achieves competitive performance with GPT-4-Turbo on code and math benchmarks, becoming the first open-source model to exceed 10% on SWE-Bench (12.7%) and scoring 90.2% on HumanEval. The model demonstrates that continued pre-training from a strong general foundation (DeepSeek-V2) with 6 trillion code/math tokens can match closed-source frontier models on many benchmarks, though gaps remain on instruction-following-heavy tasks like SWE-Bench and LiveCodeBench where GPT-4-Turbo leads. Key technical contributions include a new 1,170B-token code corpus covering 338 languages, GRPO reinforcement learning with a reward model trained on compiler feedback, and 128K context extension via YaRN.", + "red_flags": [ + { + "flag": "No confidence intervals or variance", + "detail": "All benchmark results are single-point estimates with no variance, standard deviation, or confidence intervals, making it impossible to assess whether differences between models are statistically meaningful." + }, + { + "flag": "Self-evaluation only", + "detail": "All authors are DeepSeek-AI employees evaluating their own model with no independent verification or third-party replication. No competing interests declaration is made." + }, + { + "flag": "HumanEval/MBPP contamination unaddressed", + "detail": "The primary benchmarks HumanEval (2021) and MBPP (2021) substantially predate the November 2023 training cutoff and are very likely present in the GitHub training corpus. Contamination is acknowledged only for RepoBench and LiveCodeBench." + }, + { + "flag": "SWE-Bench scaffolding undescribed", + "detail": "SWE-Bench requires repository navigation and patch generation scaffolding that is never described in the paper, making the 12.7% result unreproducible." + }, + { + "flag": "Overclaiming in title and abstract", + "detail": "The title claims 'Breaking the Barrier of Closed-Source Models' but DeepSeek-Coder-V2 scores below GPT-4-Turbo-0409 on LiveCodeBench (43.4% vs 45.7%) and SWE-Bench (12.7% vs 18.3%)." + }, + { + "flag": "No compute budget", + "detail": "No GPU-hours, FLOPs, or training time is reported, making cost-benefit analysis impossible and comparison with other approaches unfair to reproduce." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Primary code generation benchmark used throughout; 164 Python problems evaluated in zero-shot setting" + }, + { + "title": "Program Synthesis with Large Language Models (MBPP)", + "relevance": "Code synthesis benchmark; MBPP+ (EvalPlus) version used for stricter automated evaluation" + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Real-world software engineering benchmark; DeepSeek-Coder-V2 is first open-source model to exceed 10%" + }, + { + "title": "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", + "relevance": "Contamination-free code benchmark chosen specifically because problems post-date training cutoff" + }, + { + "title": "DeepSeek-Coder: When the Large Language Model Meets Programming", + "relevance": "Direct predecessor model; provides training data pipeline, instruction data, and architecture baseline" + }, + { + "title": "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models", + "relevance": "Math corpus collection pipeline and GRPO RL algorithm reused directly in this work" + }, + { + "title": "StarCoder 2 and the Stack V2: The Next Generation", + "relevance": "Competing open-source code model used as baseline across code generation and completion evaluations" + }, + { + "title": "Code Llama: Open Foundation Models for Code", + "relevance": "Competing open-source code model baseline across all code benchmarks" + }, + { + "title": "Measuring Mathematical Problem Solving with the MATH Dataset", + "relevance": "Advanced math reasoning benchmark where DeepSeek-Coder-V2 achieves 75.7% matching GPT-4o" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Open-source model weights released under permissive license; practitioners can directly deploy 16B and 236B models for code generation, completion, and fixing tasks." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the assumption that closed-source models dominate frontier code intelligence by demonstrating an open-source model can match GPT-4-Turbo on many benchmarks." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; the paper focuses entirely on capability benchmarks." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild open-source vs. closed-source narrative framed as 'breaking the barrier,' but without adversarial tone or controversy." + }, + "demo_ability": { + "score": 3, + "justification": "Model weights publicly released on GitHub under permissive license; anyone can download and test immediately." + }, + "brand_recognition": { + "score": 2, + "justification": "DeepSeek is a well-known AI lab known for efficient open-source models; carries recognition in the open-source ML community." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45222339", + "title": "Analog In-Memory Computing Attention Mechanism for Fast LLMs", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45222339", + "created_at": "2025-09-12T14:09:56Z" + }, + { + "hn_id": "40761106", + "title": "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40761106", + "created_at": "2024-06-22T18:34:13Z" + }, + { + "hn_id": "40834241", + "title": "A Critical Study of What Code-LLMs (Do Not) Learn", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40834241", + "created_at": "2024-06-30T00:15:06Z" + }, + { + "hn_id": "39441274", + "title": "Speculative Streaming: Fast LLM Inference Without Auxiliary Models", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=39441274", + "created_at": "2024-02-20T13:55:45Z" + }, + { + "hn_id": "39461525", + "title": "Speculative Streaming: Fast LLM Inference Without Auxiliary Models", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39461525", + "created_at": "2024-02-22T00:24:15Z" + }, + { + "hn_id": "40442724", + "title": "Analogical Reasoning-Augmented Interactive Data Annotation", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40442724", + "created_at": "2024-05-22T16:16:38Z" + }, + { + "hn_id": "40111141", + "title": "Lossless Acceleration of Long Sequence Generation", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40111141", + "created_at": "2024-04-22T03:10:54Z" + }, + { + "hn_id": "37234305", + "title": "Opportunities and Risks of LLMs for Scalable Deliberation with Polis", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37234305", + "created_at": "2023-08-23T11:30:32Z" + }, + { + "hn_id": "37191375", + "title": "Opportunities and Risks of LLMs for Scalable Deliberation with Polis", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37191375", + "created_at": "2023-08-19T18:00:10Z" + } + ], + "top_points": 4, + "total_points": 17, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/deepseek-r1-2025/scan-v5.json b/papers/deepseek-r1-2025/scan-v5.json @@ -0,0 +1,518 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", + "authors": [ + "DeepSeek-AI" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2501.12948", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims are supported: pure RL enabling reasoning is demonstrated by R1-Zero (Figure 1, Table 8), emergent self-reflection is shown in Figure 9 and Table 2, superior performance to SFT counterparts is confirmed in Table 8, and distillation enabling smaller models is shown in Table 15.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The R1-Zero experiment directly tests pure RL from a base model with no SFT; staged ablations (Dev1–Dev3 in Table 3) isolate component contributions; language consistency reward is ablated (Figure 7); distillation vs. pure RL is compared (Table 16). Multiple causal claims are supported by appropriate ablation designs.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states results are strongest for 'verifiable tasks such as mathematics, coding competitions, and STEM fields'; Section 6 limitations acknowledge degraded performance for open-ended writing, software engineering, and non-Chinese/English languages.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper attributes all improvements to RL but does not discuss alternative explanations: whether gains stem from longer output generation, additional training compute, superior base model quality, or reward function design rather than the RL mechanism per se.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The paper claims 'reasoning capability' but measures benchmark accuracy (AIME, MATH-500, LiveCodeBench) without explicitly discussing the relationship between benchmark performance and the broader construct of reasoning ability.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 'Conclusion, Limitation, and Future Work' contains a dedicated, multi-paragraph limitations section listing specific capability gaps beyond a single sentence.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: prompting sensitivity (few-shot consistently degrades performance), reward hacking documented with example (Figure 6), language mixing in non-Chinese/English queries, and limited RL for software engineering tasks due to long evaluation times.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states the approach is limited to tasks with reliable verifiers and notes that 'for complex tasks that cannot be effectively evaluated by a reliable reward model, scaling up pure RL methods remains an open challenge.'", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source or acknowledgment section is present. DeepSeek-AI appears to be corporate self-funded but this is not explicitly stated.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors are from DeepSeek-AI, clearly stated as the sole author affiliation with contact email research@deepseek.com.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "DeepSeek-AI employees are evaluating their own model and comparing it favorably against competitors; the funder (DeepSeek) is not independent of the outcome.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patents, equity, or financial interests declaration appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key technical terms are formally defined: GRPO is specified with full equations (1–3), reward design (accuracy + format rewards) is explained with formula (4), cold start data is described with examples, and the multi-stage pipeline is diagrammed in Figure 2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states its contribution: showing that LLM reasoning can be incentivized through pure RL without human-labeled demonstrations, producing models that match OpenAI-o1 on reasoning benchmarks.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section H provides substantive related work covering chain-of-thought (Wei et al. 2022), inference-time scaling, and RL for reasoning, explicitly contrasting their approach with PRM, MCTS, STaR, and RLHF methods.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Inference code is released at https://github.com/deepseek-ai/DeepSeek-V3 with torchrun commands and model weights on HuggingFace at https://huggingface.co/deepseek-ai under MIT license.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks (AIME, MATH-500, MMLU, LiveCodeBench, etc.) are standard public benchmarks used unmodified; training data release is promised but URL is placeholder 'xxx' in the paper.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": true, + "justification": "The inference code example includes 'pip install -r requirements.txt', referencing a requirements file in the GitHub repository; training infrastructure specifies H800 GPUs, vLLM, and DualPipe algorithm.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Inference instructions are provided (torchrun commands), but step-by-step training reproduction is not feasible: training data URL is placeholder 'xxx', and the full RL pipeline requires 147K GPU hours of H800 compute not documented in reproducible form.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All main result tables report point estimates only; no confidence intervals or error bars are shown despite results being averaged over k=4–64 samples per question.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Table captions state 'Numbers in bold denote the performance is statistically significant (t-test with p < 0.01)', applied to comparative performance claims.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Raw performance numbers are reported throughout enabling direct effect size interpretation (e.g., AIME 79.8% vs 79.2% for o1, MATH-500 97.3% vs 96.4%).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The paper states k values (k=64 for AIME, k=16 for MATH, k=8 for LCB) but provides no power analysis or formal justification for these specific choices.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No standard deviations or variance across evaluation runs are reported in any of the main results tables, despite averaging over multiple samples.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Comprehensive baselines in Table 8: Claude-3.5-Sonnet-1022, GPT-4o-0513, DeepSeek-V3, OpenAI-o1-mini, OpenAI-o1-1217.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines are frontier models as of early 2025: OpenAI-o1-1217 (December 2024), Claude-3.5-Sonnet-1022, and GPT-4o-0513 — the best available comparators.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Multiple ablations: stage-by-stage comparison (R1-Zero, Dev1–Dev3, R1 in Table 3), language consistency reward ablation (Figure 7), and distillation vs. large-scale RL comparison (Table 16).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Over 20 benchmarks used spanning math (AIME 2024, MATH-500, CNMO), code (LiveCodeBench, Codeforces, SWE-Bench, Aider), knowledge (MMLU, GPQA Diamond), and instruction following (IFEval, AlpacaEval 2.0, ArenaHard).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "ChatbotArena crowdsourced pairwise human preference evaluation is used (Figures 11–12), showing DeepSeek-R1 ranking first alongside OpenAI-o1 on the style control setting.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks are held-out test sets; additionally AIME 2025 (released after training cutoff) is used to assess generalization to genuinely unseen problems.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Per-category breakdowns provided: MMLU by subject (Figure 15–16), math by competition category (Figure 17), LiveCodeBench by difficulty (Table 14), and safety by category (Tables 9–11).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Failure cases are shown: reward hacking during training (Figure 6), language mixing in multilingual queries, overthinking on simple problems, and Section G.2 explicitly reports failed PRM and MCTS approaches.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Section G.2 'Unsuccessful Attempts' dedicates a full section to reporting failures with Process Reward Models (annotation difficulty, reward hacking) and Monte Carlo Tree Search (exponential search space, value model difficulties).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Base model is DeepSeek-V3-Base; all baseline models include version dates (Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-1217); intermediate checkpoints (Dev1–Dev3) are labeled.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Multiple prompts provided in full: R1-Zero training template (Table 1), reward model prompt (Listing 8), SFT trajectory examples (Listings 5–7), test case generation prompts (Listing 2), and benchmark evaluation prompts (Tables 18–32).", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Full hyperparameters reported in Appendix B.4: learning rate (3e-6), KL coefficient (0.001), clip ratio (ε=10), sampling temperature (1.0), batch size (512), max sequence lengths (32,768–65,536 tokens), per-model distillation learning rates in Table 6.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "RL infrastructure described in detail (Figure 5, Appendix B.1): four distinct modules (rollout via vLLM, inference, rule-based reward, training), expert parallelism strategy, VRAM management, data packing strategy.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Data preprocessing documented: 10-gram decontamination (removing ~6M math texts), cold start data generation pipeline with rejection sampling and human refinement, SFT filtering (language mixing, length, repetition detection), evaluation prompt formats.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks (AIME, MATH-500, LiveCodeBench, MMLU, etc.) are publicly accessible; model weights are available on HuggingFace enabling independent evaluation replication.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection documented in Appendix B.3 and Table 4: 26K math, 17K code, 22K STEM, 15K logic, 66K general prompts with sources, formats, average lengths, and construction procedures.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; standard public benchmarks were used for evaluation.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full data pipeline documented from collection through SFT: cold start generation (Listings 1–3), rejection sampling, human annotation and verification steps, 800K SFT data statistics (Table 5), and decontamination procedures.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "Appendix D.1 explicitly states 'DeepSeek-V3 base has a knowledge cutoff date of July 2024.'", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix D.1 'Decontamination' explicitly discusses overlap: 10-gram filtering removed ~6M math-related texts; post-training data sourced exclusively from pre-2023 competitions; paper acknowledges n-gram filtering cannot prevent paraphrase contamination.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "AIME 2025 (post-July 2024 cutoff) is used to test generalization to genuinely unseen problems (Table 13), showing 75% solve rate approaching o1's 80%.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects research in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants requiring IRB approval.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Per-query inference cost and latency are not reported; only training costs appear in Table 7.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Table 7 provides detailed training costs: R1-Zero 101K H800 GPU hours ($202K), SFT data creation 5K hours ($10K), DeepSeek-R1 41K hours ($82K), total 147K GPU hours ($294K at $2/GPU-hour).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DeepSeek-R1-Zero achieves 79.8% pass@1 on AIME 2024 via pure RL without any supervised fine-tuning", + "evidence": "Figure 1 shows training progression from 15.6% to 77.9% on AIME 2024; Table 8 reports final 79.8% pass@1 and 86.7% with cons@64 self-consistency", + "supported": "strong" + }, + { + "claim": "DeepSeek-R1 performance matches OpenAI-o1-1217 on mathematical reasoning benchmarks", + "evidence": "Table 8: DeepSeek-R1 79.8% vs o1 79.2% on AIME 2024, 97.3% vs 96.4% on MATH-500, 78.8% vs unreported on CNMO 2024", + "supported": "strong" + }, + { + "claim": "Advanced reasoning behaviors (self-reflection, verification, 'aha moments') emerge spontaneously from RL training without explicit instruction", + "evidence": "Figure 9 shows 5–7x increase in reflective word frequency during training; Table 2 shows the model spontaneously generating 'Wait, wait. Wait. That's an aha moment' to self-correct", + "supported": "moderate" + }, + { + "claim": "Distilled small models (1.5B–70B) substantially outperform non-reasoning models of comparable or larger size", + "evidence": "Table 15: DeepSeek-R1-Distill-Qwen-1.5B achieves 28.9% AIME 2024 pass@1, surpassing GPT-4o-0513 (9.3%) and Claude-3.5-Sonnet (16.0%)", + "supported": "strong" + }, + { + "claim": "Larger base model capacity is a prerequisite for RL-induced reasoning improvements to emerge", + "evidence": "Section G.1 reports that 7B dense and 16B MoE models showed no meaningful AIME improvements under RL, while 32B+ models showed substantial gains", + "supported": "moderate" + }, + { + "claim": "Distillation from a strong reasoning model outperforms training smaller models with large-scale RL directly", + "evidence": "Table 16: DeepSeek-R1-Distill-Qwen-32B (72.6% AIME) substantially outperforms Qwen2.5-32B-Zero trained with 10K RL steps (47.0% AIME)", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "DeepSeek-R1-Zero demonstrates that pure reinforcement learning applied to a capable base model can autonomously develop sophisticated reasoning behaviors—self-reflection, verification, dynamic strategy adaptation—without any human-annotated demonstrations, reaching 79.8% on AIME 2024 and matching OpenAI-o1. The multi-stage DeepSeek-R1 pipeline (cold start + RL + SFT + RL) addresses readability and language consistency issues while maintaining frontier reasoning performance. Knowledge distillation from R1 into small models (1.5B–70B) produces models that dramatically outperform non-reasoning models of similar size. Two key negative findings: process reward models and MCTS were attempted and abandoned due to reward hacking and scaling difficulties; and smaller base models (7B, 16B MoE) failed to benefit from RL, establishing model scale as a prerequisite.", + "red_flags": [ + { + "flag": "Training data URL placeholder", + "detail": "The paper states SFT and RL training data is released 'at xxx' — a literal placeholder, meaning training data was not actually accessible at publication time, preventing training reproduction." + }, + { + "flag": "Self-evaluation with no independent replication", + "detail": "DeepSeek-AI employees evaluate their own model; results for OpenAI-o1-1217 are taken from official reports rather than independently measured, making direct comparisons unverifiable." + }, + { + "flag": "Severe jailbreak vulnerability", + "detail": "Table 11 shows DeepSeek-R1 without risk control reaches 85.9% unsafe rate under jailbreak attacks — the highest of all tested models. The paper acknowledges enhanced reasoning makes dangerous content more operationally feasible." + }, + { + "flag": "No variance reporting", + "detail": "All main benchmark tables report only point estimates; no standard deviations or confidence intervals are shown despite results being averaged over k=4–64 samples per question." + }, + { + "flag": "Alternative explanations unaddressed", + "detail": "Improvements attributed solely to RL without considering confounders: longer output generation, additional training compute (147K GPU-hours), or base model quality (DeepSeek-V3-Base already strong) could partially explain gains." + } + ], + "cited_papers": [ + { + "title": "Chain-of-thought prompting elicits reasoning in large language models", + "relevance": "Foundational CoT work that R1 extends via RL; the primary paradigm R1 challenges by showing RL can discover reasoning without human-curated demonstrations" + }, + { + "title": "Training language models to follow instructions with human feedback (InstructGPT)", + "relevance": "Establishes the SFT+RLHF paradigm that R1 partially circumvents; key baseline for comparing post-training approaches" + }, + { + "title": "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models", + "relevance": "Introduces GRPO algorithm used as R1's RL backbone; directly cited as the training algorithm" + }, + { + "title": "Let's verify step by step", + "relevance": "Process reward model work that DeepSeek-R1 attempted and abandoned (Section G.2), providing important negative results context for the field" + }, + { + "title": "Self-consistency improves chain of thought reasoning in language models", + "relevance": "Self-consistency decoding (cons@16, cons@64) is used in evaluating R1-Zero and boosts AIME accuracy from 79.8% to 86.7%" + }, + { + "title": "STaR: Bootstrapping reasoning with reasoning", + "relevance": "Prior RL-based reasoning enhancement that R1 builds upon; key comparison point for showing R1's approach differs by starting from pure RL on base models" + }, + { + "title": "DeepSeek-V3 technical report", + "relevance": "DeepSeek-V3-Base is the base model for all R1 variants; understanding the base model is essential for interpreting what RL adds" + }, + { + "title": "Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning", + "relevance": "Related work on test-time compute scaling; R1's adaptive CoT length is analyzed in relation to this paradigm in Section E.4" + }, + { + "title": "Proximal policy optimization algorithms", + "relevance": "PPO is the primary RL baseline compared against GRPO (Figure 4, Appendix A.3); establishing why GRPO is preferred for large-scale training" + }, + { + "title": "Language models are few-shot learners (GPT-3)", + "relevance": "Establishes emergent capabilities framework used to contextualize R1's emergent reasoning behaviors" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Model weights freely available on HuggingFace under MIT license; practitioners can immediately use distilled 1.5B–70B models for math, code, and reasoning tasks." + }, + "surprise_contrarian": { + "score": 3, + "justification": "Demonstrating that pure RL without SFT produces frontier reasoning challenged the field's consensus that extensive human demonstrations were essential for capable post-training." + }, + "fear_safety": { + "score": 2, + "justification": "The paper documents 85.9% unsafe rate under jailbreak attacks without risk control and explicitly notes enhanced reasoning makes dangerous content more operationally feasible." + }, + "drama_conflict": { + "score": 3, + "justification": "Directly matches OpenAI-o1 on math benchmarks at $294K training cost under MIT license, challenging the assumption that frontier reasoning models require closed proprietary development." + }, + "demo_ability": { + "score": 3, + "justification": "Model weights downloadable from HuggingFace immediately; distilled versions (1.5B–70B) accessible on consumer hardware; official API available." + }, + "brand_recognition": { + "score": 3, + "justification": "DeepSeek-R1 became one of the most discussed AI papers of early 2025, generating 1,351 HN points and triggering significant market reactions upon release." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42823568", + "title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL", + "points": 1351, + "comments": 1056, + "url": "https://news.ycombinator.com/item?id=42823568", + "created_at": "2025-01-25T18:39:49Z" + }, + { + "hn_id": "42915646", + "title": "Stack Overflow Meets Replication: Security Research Amid Evolving Code Snippets", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42915646", + "created_at": "2025-02-03T06:49:46Z" + } + ], + "top_points": 1351, + "total_points": 1352, + "total_comments": 1056 + } +} +\ No newline at end of file diff --git a/papers/defects4c-benchmarking-large-2025/scan-v5.json b/papers/defects4c-benchmarking-large-2025/scan-v5.json @@ -0,0 +1,353 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "Defects4C: Benchmarking Large Language Model Repair Capability with C/C++ Bugs", + "authors": [ + "Jian Wang", + "Xiaofei Xie", + "Qiang Hu", + "Shangqing Liu", + "Jiongchi Yu", + "Jiaolong Kong", + "Yi Li" + ], + "year": 2025, + "venue": "International Conference on Automated Software Engineering", + "arxiv_id": "2510.11059", + "doi": "10.1109/ASE63991.2025.00029" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are verified in the paper body: 9M bug-relevant commits, 248 buggy functions, 102 vulnerable functions, and evaluation of 24 LLMs are all confirmed in Tables III–VII.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The fine-tuning causal claim is supported by before/after comparisons using the same models (Table VII), a reasonable design for this type of claim; the paper appropriately hedges that improvements are limited.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Conclusions are consistently bounded to C/C++ single-function bugs from the top 500 GitHub repositories; the paper explicitly states the scope and does not over-generalize to all program repair settings.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The large performance gap between Defects4C and Defects4J is attributed solely to benchmark difficulty, without discussing alternative explanations such as prompt design differences, test harness discrepancies, or different evaluation protocols across the two benchmarks.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The paper uses test-passing (pass@k) as the sole measure of 'repair capability' but does not discuss limitations of this proxy, such as overfitting to weak test suites or semantically incorrect but test-passing patches; the distinction between 'plausible' and 'correct' is mentioned only in passing.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section VII 'Threat to Validity' is a dedicated section covering multiple distinct threats beyond a single sentence.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include: single-function commit restriction excluding cross-file bugs, top-500 project selection bias, annotation subjectivity with measured Cohen's Kappa values (0.48→0.70→0.88), and training data quality in fine-tuning experiments.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it covers only single-function commits, only the top 500 GitHub C/C++ repositories by stars, and only commits from January 2015 to December 2023; multi-function and cross-file bugs are explicitly excluded.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Section IX Acknowledgements discloses funding from NRF Singapore, Cyber Security Agency, CyberSG R&D Programme Office, and Singapore Ministry of Education (RG12/23).", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All four author affiliations are listed on the first page: Singapore Management University, Tianjin University, Nanjing University, and Nanyang Technological University.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "All funders are Singapore government agencies (NRF, Cyber Security Agency, MOE) with no commercial interest in which LLMs perform best on C/C++ bug repair.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "The paper contains no competing interests statement or declaration of financial interests such as patents, equity, or consulting relationships.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "APR, single-round repair, conversation-based repair, and line/hunk/function bug granularities are all explicitly defined in Sections I, II, and V; 'plausible' vs 'correct' patches are also distinguished.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper states three explicit contributions in a bulleted list: the Defects4C benchmark dataset, the CLI/API tooling, and the empirical study of 24 LLMs on C/C++ repair.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Table I systematically compares Defects4C against 13 existing C/C++ benchmarks on defect count, project diversity, and source type; Table II empirically motivates the gap via actual LLM performance comparisons.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "Table II demonstrates that contest-style benchmarks yield artificially high LLM scores (GPT-4 at 74.6%) while real-world bugs yield low scores (9.0%), arguing that test-paired real-world bugs better measure actual repair capability.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "Table III categorizes bugs by error type (Signature, Sanitizer, Memory Error, Logic) and counts, but there are no explicit difficulty tiers, no difficulty scores, and no analysis of whether categories differ in expected hardness prior to evaluation.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "GPT-4 achieves only 5/248 repairs (2%) in conversation-based mode, suggesting strong floor effects, but the paper treats this as evidence of benchmark challenge rather than a measurement validity concern; no explicit floor/ceiling analysis is performed.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human expert performance baseline is provided; the paper only reports LLM performance, leaving no reference for whether the benchmark items are solvable at all and at what expected rate.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "Pass@k is justified by citing EvalPlus [19] and Chen et al. [18] as established metrics; the unit-test-matching algorithm (Section III-B) that defines which tests count is formally specified with a pass/fail differential criterion.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "For fine-tuning decontamination is performed via UniXcoder cosine similarity filtering, but the evaluation benchmark itself has no contamination-resistance mechanism (no temporal holdouts, canary strings, or dynamic generation) and contamination is only acknowledged as a threat in Section VII.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether the benchmark will remain discriminative as LLMs improve, nor is there a versioning or update plan; temporal coverage of commits (2015–2023) is noted but benchmark longevity is not addressed.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": true, + "justification": "Section VII explicitly identifies single-function restriction as a coverage limitation, and RQ3 (Section VI-C, Table VIII) categorizes four failure patterns (long/multi-hunk patches, deletion-centric fixes, missing external context, insufficient test feedback).", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "Full results for 24 LLMs are reported in Tables IV–VII with exact experimental configurations; the CLI/HTTP API with Docker isolation is publicly released, enabling reproduction of the reported numbers.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "Section III provides a comprehensive pipeline description covering raw collection (38M commits), six filtering criteria, the unit-test matching algorithm, and the three-round human annotation protocol with inter-annotator Kappa scores.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The dataset is described as 'publicly released' with a website link, but no specific license (e.g., MIT, CC-BY, Apache) is stated anywhere in the paper, leaving reuse rights unclear.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": true, + "justification": "The paper explicitly distinguishes Defects4C_bgcommit as suitable for fine-tuning/pretraining (with caveats about false positives) and Defects4C_bug/vul as suitable for rigorous evaluation, with a clear explanation of why each subset serves its intended role.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "State-of-the-art LLMs can fix only 10.88% of general C/C++ bugs and 6.86% of vulnerabilities in Defects4C using conversation-based repair, far below their performance on Java benchmarks.", + "evidence": "Table IV shows best conversation-based results of 27/248 bugs (10.88%) and 7/102 vulnerabilities (6.86%); Table VI shows Defects4J single-round line repair rates of 71.3% vs single-digit rates on Defects4C.", + "supported": "strong" + }, + { + "claim": "Increasing LLM model size does not consistently improve C/C++ repair performance; CodeLlama-Python improves from 7B to 13B but degrades at 34B.", + "evidence": "Table V shows CodeLlama-Python pass@100 at T=0.8: 22.5 (7B) → 32.2 (13B) → 29.8 (34B); similar non-monotonic patterns for WizardCoder and CodeLlama-Instruct.", + "supported": "strong" + }, + { + "claim": "Fine-tuning LLMs with Defects4C_bgcommit improves C/C++ repair performance by an average of 84.89% relative across 21 of 28 settings.", + "evidence": "Table VII shows consistent pass@k improvements for fine-tuned CodeLlama-7B-Base (from 0% to 0.41% pass@1 greedy) and CodeLlama-7B-Instruct (2.45%→4.08%); 84.89% average relative improvement is stated in the text.", + "supported": "moderate" + }, + { + "claim": "Contest/interview-style C/C++ benchmarks produce artificially high LLM scores, making them poor proxies for real-world repair capability.", + "evidence": "Table II shows GPT-3.5 at 94.0% on CodeFlaws and 59.0% on DebugBench (both contest-style) versus 8.5% on Defects4C (real-world).", + "supported": "strong" + }, + { + "claim": "Temperature 0.8 generally produces better single-round repair results than temperature 0.2 across most LLMs.", + "evidence": "Table V shows consistent higher pass@100 at T=0.8 vs T=0.2 for most models (e.g., GPT-3.5: 38.9 vs 19.5; CodeLlama-Instruct-7B: 45.7 vs 24.9).", + "supported": "strong" + }, + { + "claim": "Long/multi-hunk patches and insufficient test feedback are the dominant failure patterns for LLMs repairing C/C++ vulnerabilities.", + "evidence": "Table VIII shows 52.0% of failures attributed to long/multi-hunk patches and 9.8% to insufficient test feedback; fine-tuning does not reduce these patterns.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "Defects4C introduces a real-world C/C++ benchmark of 9M bug-related commits, 248 confirmed bugs, and 102 vulnerabilities, revealing that state-of-the-art LLMs fix only 10.88% and 6.86% of these respectively — far below their performance on the Java-based Defects4J benchmark. Larger model size does not consistently improve repair performance, with models producing verbose overgenerated outputs that exceed token limits. Fine-tuning with Defects4C training data yields an average 84.89% relative improvement but still achieves only ~5% pass@1, and fails to address the dominant failure modes of long multi-hunk patches and insufficient test feedback. These findings demonstrate that real-world C/C++ program repair remains a substantially harder and less-solved problem than commonly reported LLM benchmarks suggest.", + "red_flags": [ + { + "flag": "No human baseline", + "detail": "The paper provides no human expert performance data, making it impossible to assess whether benchmark items are achievable in principle or whether floor effects represent a fundamental benchmark problem rather than an LLM limitation." + }, + { + "flag": "GPT-4 budget-constrained evaluation", + "detail": "GPT-4 was limited to only 2 repair attempts (vs 10 for other models) due to cost constraints, making cross-model comparisons in Table IV unfair; the paper acknowledges this but still includes GPT-4 in rankings." + }, + { + "flag": "No contamination resistance in benchmark", + "detail": "The benchmark contains GitHub code from 2015-2023 that may appear in LLM pre-training corpora; the paper dismisses contamination risk by arguing low scores indicate minimal memorization, which is circular reasoning." + }, + { + "flag": "No license specified", + "detail": "The dataset is described as publicly released but no specific license is stated, creating legal ambiguity for downstream research use and reproduction." + }, + { + "flag": "Single-function only", + "detail": "Restricting to single-function commits excludes multi-function and cross-file bugs that may represent a large fraction of real-world defects, limiting ecological validity of the benchmark." + }, + { + "flag": "Floor effects not analyzed", + "detail": "With GPT-4 achieving 2% repair rate on conversation-based mode, the benchmark may be too hard to discriminate among LLMs; this is treated as a feature but could mask measurement noise at the floor." + } + ], + "cited_papers": [ + { + "title": "Defects4J: A Database of Existing Faults to Enable Controlled Testing Studies for Java Programs", + "relevance": "Primary comparison benchmark; Defects4C is explicitly designed as the C/C++ analog to Defects4J" + }, + { + "title": "Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each Using ChatGPT", + "relevance": "Conversation-based repair method evaluated on Defects4J; Defects4C re-evaluates this approach on C/C++" + }, + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Source of pass@k metric adopted for single-round repair evaluation" + }, + { + "title": "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (EvalPlus)", + "relevance": "Evaluation methodology reference for pass@k and greedy decoding protocol" + }, + { + "title": "BugsC++: A Highly Usable Real World Defect Benchmark for C/C++", + "relevance": "Most recent prior work on C/C++ benchmark; Defects4C addresses its limitations (false positives from keyword matching)" + }, + { + "title": "The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs", + "relevance": "Earlier real-world C benchmark; Defects4C addresses its low usability and outdated C standard support" + }, + { + "title": "Magicoder: Source Code is All You Need", + "relevance": "Decontamination methodology adopted for fine-tuning dataset preparation" + }, + { + "title": "Neural Transfer Learning for Repairing Security Vulnerabilities in C Code (VRepair)", + "relevance": "Keyword-based bug commit filtering methodology adopted from this work" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Security researchers and APR practitioners working on C/C++ can directly use the benchmark and CLI tooling, though the extremely low LLM success rates may limit immediate practical application." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that larger models (34B) perform worse than smaller ones (13B) on C/C++ repair, and that LLMs achieve under 11% even with conversation, challenges the prevailing narrative of LLM coding capability." + }, + "fear_safety": { + "score": 2, + "justification": "C/C++ accounts for over 50% of disclosed open-source vulnerabilities, and the finding that LLMs can fix only 6.86% of vulnerabilities raises concerns about automated security patching claims." + }, + "drama_conflict": { + "score": 1, + "justification": "The paper implicitly challenges overly optimistic LLM-for-APR claims but frames this constructively as a benchmark gap rather than as a critique of specific prior work." + }, + "demo_ability": { + "score": 2, + "justification": "The publicly released CLI and HTTP API with Docker-based verification allow researchers to immediately test their own models on the benchmark." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are from reputable Asian universities (SMU, NTU, Nanjing) but no famous industry lab or flagship model name is behind the work." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "28970112", + "title": "Stipula: DSL that assists lawyers in programming legal contracts", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=28970112" + }, + { + "hn_id": "41866043", + "title": "Unboxing Virgil ADTs for Fun and Profit", + "points": 2, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=41866043" + }, + { + "hn_id": "37980301", + "title": "Confidential Consortium Framework: Secure Multiparty Applications", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=37980301" + } + ], + "top_points": 3, + "total_points": 7, + "total_comments": 3 + } +} +\ No newline at end of file diff --git a/papers/defending-against-indirect-2024/scan-v5.json b/papers/defending-against-indirect-2024/scan-v5.json @@ -0,0 +1,592 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defending Against Indirect Prompt Injection Attacks With Spotlighting", + "authors": [ + "Keegan Hines", + "Gary Lopez", + "Matthew Hall", + "Federico Zarfati", + "Yonatan Zunger" + ], + "year": 2024, + "venue": "CAMLIS", + "arxiv_id": "2403.14720", + "doi": "10.48550/arXiv.2403.14720" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims (ASR reduction from >50% to <2%) are supported by Figures 4-6, though specific numbers vary by technique and model. Datamarking achieves 3-0% ASR; encoding achieves 0-1.8% ASR.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims (spotlighting reduces ASR, does not impair performance) are tested via before/after comparisons with/without techniques. No randomization, but appropriate comparative design for prompt engineering evaluation.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Results explicitly bounded to GPT-family models (text-davinci-003, GPT-3.5, GPT-4) and 2 task types (summarization, Q&A). Paper notes encoding only suitable for high-capacity models, but doesn't discuss applicability to non-OpenAI architectures.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper states 'we lack a clear understanding of why spotlighting actually helps' (Section 6). Provides telecommunications analogy but no rigorous mechanism exploration or alternative hypotheses tested.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Attack Success Rate is precisely defined in Section 4.2 and Appendix 8.1 as return of specific keyword; distinguished from Affected Success Rate (AffSR) in appendix. Clear mapping between measured outcome and claim.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations section. Caveats scattered across Results (Section 5.2-5.4), Discussion (Section 6), and Appendix (8.2), but not compiled into formal threats-to-validity discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats discussed: encoding only for high-capacity models (5.2-5.3), few-shot knowledge-boundedness (Appendix 8.2), adversarial subversion paths per technique (5.4). Not systematic, but concrete.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicitly bounded to GPT-family black-box models (Section 4.1), summarization and Q&A tasks (Sections 4-5), synthetic keyword-based attacks. Does not discuss generalization to open-source models, other domains, or sophisticated attack strategies.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure or acknowledgments section visible in paper. Authors list Microsoft affiliation but no funding source stated.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors listed as Microsoft. Relevant because paper evaluates OpenAI models (competitors), but affiliation clearly stated.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Microsoft (employer) does not provide the models being evaluated (OpenAI). Microsoft benefits from LLM security broadly, but not directly from OpenAI product improvement.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration present.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined: Indirect prompt injection attacks/XPIA (2.2), Attack Success Rate (4.2, Appendix 8.1), spotlighting family (3.0), datamarking/encoding/delimiting (3.2-3.4).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contribution explicitly stated: introduce spotlighting (family of three prompt engineering techniques: delimiting, datamarking, encoding) for defending against indirect prompt injection attacks. Evaluation on effectiveness is clearly framed.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Related work section (2.2-2.3) cites Yi et al. 2023 on XPIA, Greshake/Bard attacks, safety alignment work. Paper states 'Early versions of some of these techniques have been described previously [2], and here we expand the results,' but doesn't deeply contrast novelty from prior approaches.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository, GitHub link, or supplementary code mentioned. Techniques described in prose and example prompts provided, but no deployable implementation.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Synthetic 1000-document attack dataset not released or available. Standard benchmarks (SQuAD, IMDB, SuperGLUE) are public but not the paper's attack corpus.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Model names and temperature (1.0) specified, but no requirements.txt, Dockerfile, conda env, or reproducibility config provided. API details minimal.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Techniques described in natural language with example prompts shown (Sections 3.2-3.4), but no step-by-step reproduction instructions or automation scripts. Implementation would require custom development.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Figures 3-8 show point estimates without error bars or confidence intervals. Single run reported per condition, no variance bounds.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No p-values, t-tests, or statistical significance tests reported. ASR reductions presented as raw percentages without hypothesis testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "ASR reductions reported in percentage points (e.g., 50%→3%, 60%→0%). Effect sizes quantified; not just p-values.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Paper states 'we generated a synthetic dataset of 1000 documents' but does not justify this sample size or discuss power analysis. No minimum sample size calculated.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No multiple runs with different random seeds shown. No SD/variance/min-max ranges reported. Results presented as single-point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines compared: no defense (baseline ASR), instruction-only, delimiting, datamarking, encoding. Also compared to few-shot approach in appendix.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines are contemporary GPT models (June 2023 snapshots). However, no comparison to other defense methods from Section 2.3 (fine-tuning, other prompt-engineering defenses).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Three spotlighting instantiations (delimiting, datamarking, encoding) serve as ablations of increasing sophistication. Progressive improvements shown (Figures 3-6).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Primary metric: Attack Success Rate (ASR). Secondary metrics: task performance on 4 NLP benchmarks (SQuAD, IMDB, SuperGLUE BoolQ, SuperGLUE WiC). Figure 7-8 show accuracy impacts.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "No human evaluation of model outputs. Not clearly required for technical evaluation of prompt injection defense.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Standard benchmarks use held-out test sets (SQuAD, IMDB, SuperGLUE are standard). For synthetic attack corpus, no train/test split mentioned; single 1000-document set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results broken down by model (text-davinci-003, GPT-3.5-Turbo, GPT-4), task type (summarization, Q&A), and technique (delimiting, datamarking, encoding). Benchmark breakdowns in Figure 7-8.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Encoding fails with GPT-3.5-Turbo (Figure 8, task performance degradation). Delimiting shown insufficient (Figure 3). Appendix 8.2 discusses few-shot caveats. Limited analysis of attack vectors spotlighting cannot defend against.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Paper reports: delimiting alone insufficient, encoding hurts task accuracy with weaker models, few-shot examples overfit to known attacks. Honest about limitations.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model snapshots: text-davinci-003, GPT-3.5-Turbo (June 2023), GPT-4 (June 2023). Dates provided for reproducibility.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Full example system prompts shown for: instructions-only baseline (4.2), delimiting (3.2), datamarking (3.3), encoding (3.4). Templates can be copied directly.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature=1.0 specified with note: 'We examined the effect of temperature on XPIA susceptibility and found no notable impact.' Only temperature reported; no top-p, frequency_penalty, etc.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Not an agentic system; pure prompt engineering. No scaffolding (tools, actions, loops) to describe.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Attack dataset described as 'synthetic... containing prompt injection attacks' with 'variations on a simple keyword payload attack,' but generation algorithm/process not documented. No code for reproducing dataset.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Synthetic 1000-document attack corpus not released or available for verification. Standard benchmark raw data (SQuAD, IMDB) are publicly available but not paper-specific.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Attack data: 'generated synthetic dataset of 1000 documents... variations on simple keyword payload attack.' Benchmarks: uses standard published datasets. Description adequate for understanding but not for reproduction.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "For benchmarks, standard pipelines used. For attack dataset, pipeline partially described: documents → prompts with models → responses → ASR scoring. Full generation process not detailed.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Models identified by snapshot (June 2023) but exact training data cutoff dates not stated. Paper does not discuss what dates these versions were trained on.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether benchmark test sets (SQuAD 2016, IMDB, SuperGLUE) may have been in training data of June 2023 model snapshots.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Synthetic attack dataset is new, so no contamination there. But standard benchmarks potentially contaminated—not addressed. Paper evaluates model performance on these benchmarks without discussing potential data leakage.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost ($ per API call) or latency reported. Experiments used OpenAI API but no pricing/time data disclosed.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total computational budget ($ or compute hours) not stated. 1000 attack documents × 3 models × multiple tasks = thousands of API calls, but no aggregate cost reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Spotlighting via datamarking reduces attack success rate (ASR) from ~50% to <3% with GPT-3.5-Turbo and to 0% with text-davinci-003", + "evidence": "Figure 4 (document summarization) and Figure 5 (Q&A tasks) show ASR percentages across models. Specific numbers: GPT-3.5-Turbo 50%→3.1%, Text-003 40%→0%.", + "supported": "strong" + }, + { + "claim": "Spotlighting via encoding reduces ASR to 0-1.8% across tasks", + "evidence": "Figure 6 shows encoding results: summarization 0.0% ASR with GPT-3.5-Turbo, Q&A 1.8% ASR. Consistent across models.", + "supported": "strong" + }, + { + "claim": "Datamarking transformations have minimal detrimental impact on downstream NLP task performance", + "evidence": "Figure 7 shows no detrimental effect on SQuAD, IMDB, SuperGLUE BoolQ/WiC benchmarks with datamarking present.", + "supported": "strong" + }, + { + "claim": "Encoding transformations degrade task performance with GPT-3.5-Turbo but not GPT-4", + "evidence": "Figure 8 shows GPT-3.5-Turbo accuracy drops significantly with encoding (top row), while GPT-4 maintains high accuracy (bottom row).", + "supported": "strong" + }, + { + "claim": "Simple instructions to avoid prompt injection have 'almost no added benefit' for GPT-3.5-Turbo", + "evidence": "Figure 2 shows instructions-only approach yields minimal ASR reduction for GPT-3.5-Turbo vs baseline.", + "supported": "moderate" + }, + { + "claim": "Spotlighting is more robust than simple delimiting because adversaries with knowledge of system prompts can easily subvert delimiters", + "evidence": "Section 5.4 discusses adversary considerations: 'If an adversary gains knowledge of our system prompt... it would be simple to craft a string that contains our delimiters.' Datamarking/encoding harder to subvert with dynamic tokens.", + "supported": "moderate" + }, + { + "claim": "Few-shot examples can reduce ASR below 5% but risk overfitting to known attack patterns", + "evidence": "Appendix 8.2 shows Figure 9 with few-shot examples achieving <5% ASR, but text cautions: 'relying on in-context learning will always be limited by our current understanding of typical attack tactics.'", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "Spotlighting, a family of three prompt engineering techniques (delimiting, datamarking, encoding), significantly reduces indirect prompt injection attack success rate from 50%+ to below 2%. Datamarking achieves this reduction with minimal impact on downstream NLP task performance across multiple benchmarks. Encoding is most effective but only suitable for high-capacity models (GPT-4), as it degrades performance in GPT-3.5-Turbo. The findings suggest that structural transformations making input provenance more salient to models are necessary because simple instructions alone are insufficient defense.", + "red_flags": [ + { + "flag": "No statistical significance testing", + "detail": "All results reported as point estimates without confidence intervals, error bars, standard deviations, or p-values. Cannot assess whether observed ASR differences are statistically reliable or due to random variation." + }, + { + "flag": "Synthetic attack dataset not released", + "detail": "The 1000-document corpus used for evaluation is not available for independent verification or reproduction. Limits scientific reproducibility." + }, + { + "flag": "Sample size not justified", + "detail": "No power analysis or justification provided for why 1000 attack documents is sufficient. No minimum sample size calculated based on effect sizes." + }, + { + "flag": "Limited to GPT models only", + "detail": "Evaluation only on OpenAI models (text-davinci-003, GPT-3.5, GPT-4). Generalization to Llama, Claude, and other LLMs unknown." + }, + { + "flag": "Attacks are simplistic", + "detail": "All attacks are 'variations on a simple keyword payload attack.' May not reflect sophisticated adversarial strategies that target semantic vulnerabilities or use knowledge of spotlighting techniques." + }, + { + "flag": "No code or data release", + "detail": "No GitHub repository, supplementary materials, or code artifacts provided. Implementation requires custom development from prose descriptions." + }, + { + "flag": "No comparison to alternative defenses", + "detail": "Paper discusses other approaches (fine-tuning, alignment tuning, classifiers) in Section 2.3 but does not empirically compare spotlighting to any competing defense methods." + }, + { + "flag": "Training data contamination not addressed", + "detail": "Benchmark test sets (SQuAD 2016, IMDB, SuperGLUE) may have been in training data of June 2023 LLM snapshots. Potential data leakage not discussed." + }, + { + "flag": "Mechanism unclear", + "detail": "Paper acknowledges 'we lack a clear understanding of why spotlighting actually helps' (Section 6). No mechanistic explanation or ablation to understand which aspects of marking/encoding are necessary." + }, + { + "flag": "Adversarial evaluation incomplete", + "detail": "Section 5.4 discusses attack vectors against each technique but does not empirically test whether sophisticated adversaries can craft attacks that bypass spotlighting." + } + ], + "cited_papers": [ + { + "title": "Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models", + "relevance": "Core prior work on XPIA problem; paper extends some spotlighting techniques from this baseline [Yi et al. 2023]" + }, + { + "title": "More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models", + "relevance": "Foundational work identifying indirect prompt injection threats in LLM systems [Greshake et al.]" + }, + { + "title": "How We Broke LLMs: Indirect Prompt Injection", + "relevance": "Early demonstration of XPIA vulnerability in practice [Greshake blog post, 2022]" + }, + { + "title": "Hacking Google Bard - From Prompt Injection to Data Exfiltration", + "relevance": "Empirical demonstration of XPIA attack enabling data exfiltration in real deployed system [Wunderwuzzi]" + }, + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "relevance": "Foundation for understanding prompt engineering effectiveness and model instruction-following behavior [Wei et al.]" + }, + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", + "relevance": "Relevant for understanding adversarial robustness of LLMs and potential attack transferability [Zou et al. 2023]" + }, + { + "title": "SQuAD: 100,000+ Questions for Machine Comprehension of Text", + "relevance": "Benchmark used for evaluating downstream task performance impact of spotlighting transformations [Rajpurkar et al.]" + }, + { + "title": "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems", + "relevance": "Benchmark used to evaluate spotlighting impact on multiple NLP tasks [Wang et al.]" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly implementable in production LLM systems today. Requires only prompt engineering changes, not model retraining. Teams can add datamarking/encoding immediately." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Core insight (marking provenance helps models distinguish code from data) is intuitive once stated, though specific techniques are novel. Does not challenge conventional wisdom fundamentally." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses real prompt injection vulnerability in deployed systems. However, positions spotlighting as limited defense ('security against interference' not 'perfectly secure'), avoiding overclaiming." + }, + "demo_ability": { + "score": 2, + "justification": "Practitioners can implement spotlighting prompts immediately, but full evaluation requires GPT API access and attack corpus. Not fully reproducible without released code/data." + }, + "brand_recognition": { + "score": 2, + "justification": "Authors from Microsoft (reputable), but no Nobel laureate labs or breakthrough-tier recognition. Venue (CAMLIS) is specialized security conference, not top-tier ML venue." + }, + "drama_conflict": { + "score": 1, + "justification": "Straightforward technical contribution with no controversy. No competing claims, no debate about methods or findings. Lacking narrative tension." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "22768143", + "title": "Deep Molecular Programming", + "points": 130, + "comments": 11, + "url": "https://news.ycombinator.com/item?id=22768143" + }, + { + "hn_id": "39466681", + "title": "Coercing LLMs to do and reveal almost anything", + "points": 12, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=39466681" + }, + { + "hn_id": "45489599", + "title": "Tutorials for Sandia's Lammps Simulation Package", + "points": 8, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45489599" + }, + { + "hn_id": "44478832", + "title": "CodingGenie: A Proactive LLM-Powered Programming Assistant", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44478832" + }, + { + "hn_id": "23363404", + "title": "“Periodic table” for protons in the nucleus", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=23363404" + }, + { + "hn_id": "44415220", + "title": "Storm – Help LLMs to write very long articles", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44415220" + }, + { + "hn_id": "43540243", + "title": "AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43540243" + }, + { + "hn_id": "41125541", + "title": "Solving the Traveling Salesman Problem Using a Single Qubit", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41125541" + }, + { + "hn_id": "41066825", + "title": "Solving the Travelling Salesman Problem Using a Single Qubit", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41066825" + }, + { + "hn_id": "40822524", + "title": "Do LLMs Have Distinct and Consistent Personality?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40822524" + } + ], + "top_points": 130, + "total_points": 169, + "total_comments": 13 + } +} +\ No newline at end of file diff --git a/papers/defending-against-prompt-2025-2/scan-v5.json b/papers/defending-against-prompt-2025-2/scan-v5.json @@ -0,0 +1,533 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defending Against Prompt Injection with DataFilter", + "authors": ["Yizhu Wang", "Sizhe Chen", "Raghad F Alkhudair", "Basel Alomair", "David Wagner"], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2510.19207", + "doi": "10.48550/arXiv.2510.19207" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are supported: ASR reduction to near-zero is shown in Tables II–IV, utility preservation within 1–2% in Tables V–VI, and superiority over baselines in Figure 2.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Controlled experiments hold all variables constant except presence of DataFilter, making causal attribution appropriate for claims about its effect on ASR and utility.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The limitations section explicitly states DataFilter cannot defend against optimization-based adaptive attacks (83% ASR) and struggles with very long user prompts, bounding the generalization claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether Llama-3.1-8B's inherent instruction-following strength rather than the filtering mechanism drives results, nor other confounds like benchmark difficulty differences.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "ASR (whether malicious API call occurs) and utility (task completion rate) are clearly defined and tied to specific claims; no conflation between what is measured and what is claimed.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section VI contains a dedicated 'Limitations' paragraph listing inference overhead, failure against optimization-based attacks, and difficulties with long user prompts.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are named: strong adaptive LLM-based attacks break the defense (83% ASR), and DataFilter requires developers to extract short user instructions when the full prompt is very long.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states DataFilter 'cannot defend against the strong optimization-based adaptive attacks' and 'may not yet match the absolute strongest protection possible with model-level defenses.'", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed: KACST-UC Berkeley Center of Excellence for Secure Computing, NSF grant 2229876, and gifts from Google, Meta, and Noyce Foundation.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the title page: UC Berkeley and KACST.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Meta and Google are funders; Meta's PromptGuard is one of the baselines being outperformed, and DataFilter uses Meta's Llama-3.1-8B as its backbone model.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is provided beyond the funding acknowledgment.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Prompt injection attack, attack success rate, utility, and model-agnostic are all explicitly defined in Sections II and IV, with attacker and defender goals formally stated.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly contributes DataFilter: a test-time, model-agnostic SFT-based defense that removes injected instructions from untrusted data before it reaches the backend LLM.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section III provides extensive related work; Table I explicitly positions DataFilter against fine-tuning, prompting, detection, and system-level defenses, with concurrent work (PromptArmor, PromptLocate) distinguished.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The abstract states 'Our DataFilter model is released here for immediate use, with the code to reproduce our results here,' indicating release of both model and code.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks are publicly available (SEP, InjecAgent, AgentDojo, AlpacaEval2), and training uses the public Alpaca dataset.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Training hardware (A100/H100 GPUs) and key hyperparameters are stated, but no requirements.txt, Dockerfile, or explicit dependency specification is provided in the paper.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 provides step-by-step SFT dataset construction, Section V-A describes all training parameters, and code is released; sufficient to reproduce without guessing.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables II–IX are single point estimates with no confidence intervals or error bars, despite the paper acknowledging GPT-4o is non-deterministic.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims despite making superiority claims over multiple baselines.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute percentage differences are reported (e.g., average ASR 2.2% vs 5.9% for PromptArmor; utility drop 1.0% vs 4.1%), providing practical effect size context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "SEP is evaluated on a random 1K subset of 9.1K samples with no justification for the subset size or representativeness confirmation; no power analysis anywhere.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or spread measures are reported across any experimental runs, despite acknowledged model non-determinism.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Seven baselines are tested: PromptGuard, DataSentinel, Sandwich, Instructional, Spotlight, Tool Filter, and PromptArmor, spanning detection-based, prompt-based, and system-level approaches.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All baselines are from 2023–2025 publications and represent the current state of the art in model-agnostic prompt injection defense.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "Four training goals are described but their individual contributions are not systematically ablated; only a brief mention of training without user prompt context appears in the discussion.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: ASR, benign utility, utility under attack (AgentDojo), and length-controlled win rate (AlpacaEval2).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not standard for prompt injection defense evaluation; utility is measured via GPT-4-based automatic evaluation (AlpacaEval2).", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "DataFilter is trained on Alpaca and evaluated on entirely separate benchmarks (SEP, InjecAgent, AgentDojo, AlpacaEval2) not used in training.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by attack type (6 in SEP, 4 in AgentDojo, 2 in InjecAgent), backend model (gpt-4o vs Llama), and benchmark, providing granular breakdowns.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix C provides concrete false negative (billing document confusion) and false positive (cooking recipe instructions) examples with full input/output shown.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "DataFilter fails against strong LLM-based adaptive attacks (83% ASR); false positives on benign imperative content are documented; limitations with long prompts reported.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions are specified: gpt-4o-2024-05-13, meta-llama/Llama-Prompt-Guard-2-86M, Llama-3.1-8B-Instruct, and GPT-5.1/GPT-4.1 for relevant comparisons.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "The full system prompt and user message template for DataFilter are shown verbatim in Section IV-C, including the filter instruction and special token formatting.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "All key hyperparameters reported: batch size 1, gradient accumulation 16, learning rate 2×10^-5, cosine schedule, 100 warmup steps, BF16 precision, 300 training steps.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "JSON parsing and recursive filtering for structured agentic data (Section IV-D) and the multi-turn agent setup in AgentDojo are described in sufficient detail.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 documents exact preprocessing: truncation proportions (65%/10%/10%/15%), injection position distributions (20%/20%/60%), and attack type assignments.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The constructed SFT training dataset is not explicitly released as a separate artifact; only the base Alpaca source and the trained model are released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 provides the complete data construction procedure from Alpaca samples to (prompt, data, output) triples with all design decisions and proportions documented.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; evaluation uses automated benchmarks requiring no recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from Alpaca → SFT dataset construction (Algorithm 1) → fine-tuning → deployment is documented with specific parameters and design rationale for each step.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Llama-3.1-8B-Instruct's training data cutoff is not stated; it is possible the model's pretraining included examples similar to or identical to evaluation benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether Llama-3.1-8B's pretraining data overlaps with the evaluation benchmarks (SEP, InjecAgent, AgentDojo), which could inflate filtering performance.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "SEP and InjecAgent were published before Llama 3.1's likely training cutoff; potential contamination of the filter model's base knowledge is not discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table IX reports per-sample monetary cost and wall-clock time for GPT-5.1 (+3.7% cost, +4.0% latency) and GPT-4o (+1.0% cost, +17.5% latency) with DataFilter.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Training hardware (two 80GB A100/H100 GPUs) and steps (300) are mentioned but total GPU-hours for training are not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DataFilter reduces average ASR from over 40% to approximately 2% across multiple benchmarks", + "evidence": "Tables II, III, IV show ASR reductions to max 1.2% on AgentDojo, ~2% on InjecAgent Base, and 1.5–3.4% on SEP for gpt-4o backend", + "supported": "strong" + }, + { + "claim": "DataFilter preserves utility within 1–2% of the undefended baseline", + "evidence": "Table V shows benign utility 79.4% vs 81.4% baseline on AgentDojo; Table VI shows 54.1% vs 54.0% on AlpacaEval2 for gpt-4o", + "supported": "strong" + }, + { + "claim": "DataFilter outperforms all tested model-agnostic baselines on security-utility tradeoff", + "evidence": "Figure 2 shows DataFilter closest to ideal defense; average ASR 2.2% vs PromptArmor 5.9%; average utility drop 1.0% vs 4.1% for PromptArmor", + "supported": "strong" + }, + { + "claim": "DataFilter trained on general instruction-tuning data generalizes to unseen agentic settings", + "evidence": "DataFilter trained on non-agentic Alpaca achieves low ASR on agentic benchmarks AgentDojo and InjecAgent involving multi-turn tool calls", + "supported": "moderate" + }, + { + "claim": "DataFilter is the first model-agnostic defense simultaneously achieving strong security and high utility", + "evidence": "Table I categorizes all prior defenses as lacking at least one of security, utility, or model-agnostic properties; DataFilter satisfies all three", + "supported": "moderate" + }, + { + "claim": "Strong optimization-based adaptive attacks break DataFilter with 83% ASR", + "evidence": "Table VIII shows DataFilter achieves 83% ASR under genetic algorithm-based LLM attack, though lowest among all tested defenses (93–100% for others)", + "supported": "strong" + } + ], + "methodology_tags": ["benchmark-eval"], + "key_findings": "DataFilter, a supervised fine-tuned Llama-3.1-8B model, reduces prompt injection ASR from >40% to ~2% across three benchmarks while maintaining utility within 2% of baseline, outperforming all tested model-agnostic defenses on the security-utility tradeoff. Training on general-purpose Alpaca data enables generalization to unseen agentic settings (AgentDojo, InjecAgent) without domain-specific adaptation. However, strong optimization-based adaptive attacks still achieve 83% ASR, and the defense struggles with very long user prompts requiring developer intervention. Marginal inference overhead (+1–4% cost, +4–18% latency) and plug-and-play deployment make it immediately practical for black-box commercial LLMs.", + "red_flags": [ + { + "flag": "No statistical testing", + "detail": "All comparative claims are made without confidence intervals, significance tests, or variance reporting, despite the paper acknowledging non-determinism in GPT-4o; results may not be reliable across runs." + }, + { + "flag": "Funder conflict with baseline", + "detail": "Meta and Google are funders; Meta's PromptGuard is a baseline being outperformed, and DataFilter uses Meta's Llama-3.1-8B as its backbone model." + }, + { + "flag": "PromptArmor reproduced by authors", + "detail": "Authors reproduced PromptArmor from scratch (no official code) and modified its detection prompt, which may not reflect the strongest possible PromptArmor configuration." + }, + { + "flag": "No ablation table", + "detail": "Four training goals (benign preservation, anti-hallucination, anti-repetition, position robustness) are described but their individual contributions are not systematically ablated in a table." + }, + { + "flag": "Contamination unaddressed", + "detail": "Llama-3.1-8B's training cutoff is not stated; evaluation benchmarks (SEP, InjecAgent) predate Llama 3.1 and may have been seen during pretraining, potentially inflating filtering performance." + }, + { + "flag": "SEP subsample without justification", + "detail": "Only 1K of 9.1K SEP samples are evaluated with no justification for subset size or confirmation that the subsample is representative." + } + ], + "cited_papers": [ + { + "title": "AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents", + "relevance": "Primary evaluation benchmark for both security and utility of DataFilter in multi-turn agentic tool-calling settings" + }, + { + "title": "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents", + "relevance": "Secondary evaluation benchmark measuring indirect injection in API-calling scenarios with 1K samples" + }, + { + "title": "Can LLMs Separate Instructions from Data? And What Do We Even Mean by That?", + "relevance": "SEP benchmark used for instruction-following security evaluation across 6 attack types" + }, + { + "title": "Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks", + "relevance": "State-of-the-art fine-tuning defense, used as reference for training strategy design and as comparison for model-level vs model-agnostic tradeoffs" + }, + { + "title": "The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections", + "relevance": "Strong adaptive attack that breaks DataFilter, establishing the ceiling on defense effectiveness against optimized adversaries" + }, + { + "title": "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", + "relevance": "Foundational work defining indirect prompt injection and motivating the threat landscape for LLM agents" + }, + { + "title": "StruQ: Defending Against Prompt Injection with Structured Queries", + "relevance": "Fine-tuning defense using structured query format, key prior work in model-level defenses that DataFilter is positioned against" + }, + { + "title": "DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks", + "relevance": "Detection-based baseline that DataFilter outperforms, demonstrating the detection-vs-filtering design space tradeoff" + }, + { + "title": "Defeating Prompt Injections by Design", + "relevance": "System-level defense providing security-by-design guarantees, representing the alternative architectural approach to DataFilter" + }, + { + "title": "AlpacaEval: An Automatic Evaluator of Instruction-following Models", + "relevance": "Utility evaluation benchmark used to measure instruction-following quality with and without DataFilter applied" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "DataFilter is released as a plug-and-play defense for any LLM system, directly addressing OWASP #1 LLM threat with marginal overhead and no backend model access required." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the assumed security-utility tradeoff in model-agnostic defenses, showing it is possible to nearly eliminate injections without meaningful utility loss." + }, + "fear_safety": { + "score": 3, + "justification": "Directly addresses OWASP #1 LLM threat citing real attacks against Google Bard, Slack AI, Anthropic Claude Computer Use, and OpenAI Operator causing data leakage and malware execution." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild security arms race framing with acknowledgment that strong adaptive attacks break the defense, but no major controversy or conflict angle." + }, + "demo_ability": { + "score": 3, + "justification": "Model and code are explicitly released for immediate use; practitioners can deploy DataFilter today on any LLM application without accessing backend model weights." + }, + "brand_recognition": { + "score": 2, + "justification": "UC Berkeley affiliation, Meta and Google funding, and evaluation on GPT-4o/GPT-5.1 add credibility; David Wagner is a well-known security researcher." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42919597", + "title": "Efficient Reasoning with Hidden Thinking", + "points": 172, + "comments": 43, + "url": "https://news.ycombinator.com/item?id=42919597", + "created_at": "2025-02-03T16:06:48Z" + }, + { + "hn_id": "38355249", + "title": "Open Problems in DAOs", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38355249", + "created_at": "2023-11-20T21:39:59Z" + }, + { + "hn_id": "46311266", + "title": "Tiny-TSM: Efficiently Training a Lightweight SOTA Time Series Foundation Model", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46311266", + "created_at": "2025-12-18T11:07:07Z" + }, + { + "hn_id": "37939342", + "title": "Can Large Language Models Explain Themselves? A Study", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37939342", + "created_at": "2023-10-19T06:41:38Z" + } + ], + "top_points": 172, + "total_points": 177, + "total_comments": 43 + } +} diff --git a/papers/defending-against-prompt-2025/scan-v5.json b/papers/defending-against-prompt-2025/scan-v5.json @@ -0,0 +1,577 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defending Against Prompt Injection With a Few DefensiveTokens", + "authors": [ + "Sizhe Chen", + "Yizhu Wang", + "Nicholas Carlini", + "Chawin Sitawarin", + "David Wagner" + ], + "year": 2025, + "venue": "AISec@CCS", + "arxiv_id": "2507.07974", + "doi": "10.1145/3733799.3762982" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims DefensiveToken achieves security 'comparable to training-time alternatives' are supported by Table 3 results (0.24% avg ASR on TaskTracker vs. 0.20–0.51% for training-time defenses). Flexibility and minimal utility drop claims are supported by Table 3 WinRate columns and Figure 3.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims (DefensiveToken reduces ASR, improves security) are supported by controlled ablation studies in Section 4.5 varying number of tokens, initialization, loss function, position, and learning rate. Comparisons with and without DefensiveTokens isolate the causal contribution.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "All experiments use 4 open-weight 7–8B models only. The paper does not bound generalization claims to this model-size range, and conclusions like 'comparable to training-time defenses' implicitly extend beyond the tested scope without acknowledging inapplicability to larger or closed models.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper presents a single mechanistic hypothesis for why DefensiveToken works (large embedding magnitude enabling optimization, Table 2) without considering alternatives such as attention-head anchoring, semantic priming, or training-data distribution shift.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Metrics directly correspond to claims: ASR measures security against prompt injection, WinRate measures utility. No proxy conflation—the paper does not call ASR a stand-in for a broader safety property it doesn't measure.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations section. Limitations appear only in the conclusion paragraph: 'DefensiveToken only defends against prompt injections...does not apply to other safety settings...we do not know the utility on more labeled datasets.' A concluding paragraph does not qualify.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats-to-validity section exists. The conclusion briefly mentions utility uncertainty on unlabeled datasets, but does not discuss specific threats such as LLM judge reliability, benchmark saturation, or architecture dependency.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The conclusion explicitly states: 'DefensiveToken only defends against prompt injections, where the user is benign and the environment is malicious. DefensiveToken does not apply to other safety settings, e.g., preventing jailbreaks, system following attacks, and data extraction attacks.'", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is explicitly listed in acknowledgments: Google-BAIR Commons, NSF (grant 2229876), OpenAI, Open Philanthropy, Google, the Department of Homeland Security, and IBM.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are listed in the paper header: UC Berkeley (Chen, Wang, Wagner), Google DeepMind and Anthropic (Carlini), Google DeepMind (Sitawarin).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Google and OpenAI are funders; a Google DeepMind employee (Carlini) is co-author; and gpt-4o (OpenAI's model) is the primary LLM judge throughout evaluation. This creates alignment between funder interests and both the research direction and evaluation tooling.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "Funding sources are acknowledged but there is no explicit 'competing interests' or 'financial interests' statement. No declaration about patents, equity, or consulting relationships is included.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are precisely defined: prompt injection is defined as inserting malicious instructions into data consumed by an LLM; the threat model is formally specified in Section 3.1; ASR is defined; the [INST]/[DATA]/[RESP] prompt format is shown with concrete examples.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The contribution is clearly stated: 'We introduce DefensiveToken, the first test-time prompt injection defense that is as effective as training-time ones in most cases.' Table 1 situates it against all existing defense categories across three axes.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 explicitly situates DefensiveToken against attack types (optimization-free vs. optimization-based), defense categories (detection vs. prevention, test-time vs. training-time), and prior methods (StruQ, SecAlign, ISE, Jatmo, instruction hierarchy, prompt tuning), explaining how DefensiveToken differs from each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The abstract states 'The code is available here' with a hyperlink. The paper references specific implementation components (peft library, PyTorch FSDP) consistent with a released codebase.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All evaluation datasets are publicly available standard benchmarks: AlpacaFarm, SEP, TaskTracker (31K samples), CyberSecEval2, InjecAgent, and Cleaned Alpaca (training data). No proprietary data was created.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions 'four NVIDIA Tesla A100s (80GB) with PyTorch FSDP' and the peft library, but provides no requirements.txt, Dockerfile, or version-pinned dependency list.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Algorithm 1 provides the optimization procedure at a high level and key hyperparameters are reported, but no step-by-step execution instructions are provided. Users would need to substantially infer how to run the experiments.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Main results in Table 3 (all 24 model-benchmark combinations) are single-run point estimates without error bars. Variance is reported for only one configuration: SEP/Llama3.1-8B over 5 runs (WinRate 53.84±0.56, ASR 2.81±1.09).", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are reported for any comparative claim. All differences between methods are presented as raw percentage differences without p-values or confidence intervals.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are clearly quantified: DefensiveToken reduces GCG ASR from 95.2% to 48.8%; reduces optimization-free ASR by an order of magnitude versus no defense on major benchmarks; WinRate drops of less than 2pp for utility. Baselines provide meaningful reference points throughout.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No justification is given for the choice of 4 models or these specific benchmarks. No power analysis is provided. The selection of 5 defensive tokens is empirically justified (Table 4) but not via pre-study power calculation.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Variance is only reported for one model-benchmark combination (Llama3.1-8B on SEP, 5 runs). All other results in Table 3 are single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines are included: test-time (Reminder, Sandwich, TextGrad) and training-time (StruQ-LoRA, StruQ-Full, SecAlign-LoRA), plus 'no defense' baseline, all evaluated across the same 5 benchmarks and 4 models.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines are current: StruQ (2025), SecAlign (2025), TextGrad (2025), Instruction Hierarchy (2024, gpt-4o). The paper also explicitly justifies excluding detection-based and system-level defenses as out-of-scope for the threat model.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 4.5 provides comprehensive ablation studies: number of tokens (Table 4), initialization strategy (Table 5), loss function (Table 6), insertion position (Table 7), and learning rate (Table 8). Multiple design choices are varied independently.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both security (ASR for optimization-free attacks and GCG-ASR for optimization-based) and utility (WinRate on AlpacaEval2) metrics are reported. Figure 3 visualizes the Pareto utility-security trade-off explicitly.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable here. Utility is measured via LLM judge (gpt-4o through AlpacaEval2) and security via automated string matching or LLM judge. This is an automated security defense evaluation without system outputs requiring human judgment.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "DefensiveToken optimization uses Cleaned Alpaca (51K training samples), explicitly stated to be 'different and in another domain from' the AlpacaFarm evaluation set. 'The user instructions and injections in evaluation have no overlap with those used in model training.'", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 provides full per-model (4 models) and per-benchmark (5 benchmarks) breakdowns for all methods, enabling fine-grained analysis of where DefensiveToken succeeds or underperforms.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Failure cases are explicitly discussed: GCG attacks still achieve 48.8% avg ASR (vs. near-0% for StruQ-Full); Falcon3-7B requires 20 tokens for near-zero ASR vs. 5 for other models; InjecAgent agentic setting shows DefensiveToken is 'slightly weaker than training-time alternatives.'", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results are honestly reported: SecAlign loss applied to token embeddings catastrophically hurts utility (Table 6, ~10pp WinRate drop); insertion at end of input significantly degrades both utility and security (Table 7); lr=0.01 fails to achieve security (Table 8).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model identifiers are provided: Llama3-8B-Instruct, Llama3.1-8B-Instruct, Falcon3-7B-Instruct, Qwen2.5-7B-Instruct. The LLM judge is identified as gpt-4o with AlpacaEval2 reference model version turbo-2024-04-09.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Full prompts and injection examples are reproduced for all 4 attack variants (Ignore, Completion, Ignore-Completion, GCG) across all 5 benchmarks, including the exact [INST]/[DATA]/[RESP] format and complete injection text.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Full hyperparameters reported: learning rate 0.1, 1 epoch, 5 tokens, LoRA parameters (r=64, lora_alpha=8, lora_dropout=0.1, target_modules=['q_proj','v_proj']), Cleaned Alpaca 51K training samples, and batch training via PyTorch FSDP.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "There is no agentic scaffolding in the core DefensiveToken evaluation. InjecAgent uses ReAct prompts from an existing benchmark, not custom scaffolding designed by the authors.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The defensive training dataset construction is described in Algorithm 1 and Section 3.3: self-labeled responses from undefended LLM, 50% samples unchanged, 25% each with two injection variants. Injection placement rules per benchmark are also specified.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Security evaluation relies on gpt-4o judge decisions which are not publicly released, making independent verification of ASR numbers impossible. Only aggregate statistics are presented; individual judge responses are unavailable.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described: standard benchmarks used with injections appended per benchmark-specific rules (end-of-data for SEP, benchmark-specified placement for TaskTracker). Training data generation from Cleaned Alpaca is described in Algorithm 1.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited. All evaluation uses standard benchmark datasets and LLM judges.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from training data construction (Algorithm 1) through DefensiveToken optimization to evaluation (LLM judge via AlpacaEval2 for utility, gpt-4o for security, string matching for AlpacaFarm) is documented at a level sufficient to understand the full flow.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for Llama3, Llama3.1, Falcon3, and Qwen2.5 base models are not stated. The paper only verifies that the defensive token optimization data doesn't overlap with evaluation benchmarks, not whether base model pre-training included evaluation examples.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper states 'The user instructions and injections in evaluation have no overlap with those used in model training' (for defensive token optimization), but does not discuss potential overlap between the base LLMs' pre-training data and evaluation benchmarks like AlpacaFarm or SEP.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Contamination of the base LLMs' pre-training data with evaluation benchmarks is not addressed. This matters for utility measurement—if Llama3 or Qwen2.5 pre-training included AlpacaFarm examples, the WinRate baseline would be inflated.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The paper notes DefensiveTokens add 'only 5 more tokens' to input but provides no quantitative inference latency or throughput measurements. The overhead is described qualitatively as 'minimal changes to the LLM system' without measured numbers.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Training compute is reported: 'four NVIDIA Tesla A100s (80GB) with PyTorch FSDP and takes one hour to complete.' This provides actionable compute budget information.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DefensiveToken achieves security comparable to training-time defenses on optimization-free prompt injection attacks, with 0.24% average ASR on TaskTracker vs. 0.20–0.51% for training-time alternatives.", + "evidence": "Table 3 and Figure 2: DefensiveToken achieves 0.24% avg ASR on TaskTracker across 4 models; StruQ-LoRA 0.24%, StruQ-Full 0.20%, SecAlign-LoRA 0.51%. Similar results on AlpacaFarm and SEP benchmarks.", + "supported": "strong" + }, + { + "claim": "DefensiveToken reduces optimization-free ASR by an order of magnitude compared to prompting-based test-time defenses (Reminder, Sandwich), which 'never reduce ASRs by over two times' across all benchmarks.", + "evidence": "Figure 2 and Table 3: Reminder achieves 19.8–35.3% ASR on TaskTracker while DefensiveToken achieves 0.19–0.27%; on AlpacaFarm, Sandwich reduces ASR only to 56.7–85.6% while DefensiveToken achieves 0.48–4.81%.", + "supported": "strong" + }, + { + "claim": "DefensiveToken incurs smaller utility loss than all tested baselines (test-time and training-time) because only 5 additional tokens are added to the input.", + "evidence": "Figure 3 shows DefensiveToken is closest to the ideal defense on both AlpacaFarm and SEP. Table 3 WinRate column shows <2pp utility drop versus no-defense baseline in most cases, outperforming TextGrad (−6 to −19pp), StruQ-Full (Falcon3: −5pp), and SecAlign (−2 to −9pp).", + "supported": "strong" + }, + { + "claim": "Against adaptive GCG optimization-based attacks (attacker knows DefensiveToken embeddings), DefensiveToken reduces average ASR from 95.2% to 48.8% while prompting-based baselines achieve near-100% ASR.", + "evidence": "Figure 2 (GCG sub-figure): Reminder/Sandwich achieve 96.6–100% GCG ASR; DefensiveToken achieves 37.5% (Llama3.1-8B) to 73.6% (Qwen2.5-7B) GCG ASR averaged across 4 models.", + "supported": "moderate" + }, + { + "claim": "Optimized DefensiveToken embeddings have 100x larger L1-norm magnitude than vocabulary token embeddings, making prompting-based approaches unable to replicate their security effectiveness.", + "evidence": "Table 2: vocabulary tokens average L1-norm 34 (max 47) vs. DefensiveTokens average 4332 (max 4594) for Llama-3.1-8B-Instruct. Correlation with security is argued but not causally established.", + "supported": "moderate" + }, + { + "claim": "5 defensive tokens are sufficient for all 4 tested models to achieve near-optimal security; random initialization outperforms text-embedding initialization due to facilitating larger optimization magnitude.", + "evidence": "Table 4: 5 tokens achieve 0.48% ASR on Llama3/3.1-8B (same as 20 tokens). Table 5: random initialization achieves 0.48% ASR vs. space initialization's 2.40% with 5 tokens on Llama3.1-8B. Falcon3-7B exception noted.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DefensiveToken proposes optimizing special token embeddings (not model weights) that can be optionally prepended to LLM inputs at test time to achieve prompt injection robustness comparable to training-time fine-tuning while preserving developer flexibility to skip the defense when utility is prioritized. On 5 benchmarks with 4 open-weight 7–8B LLMs, DefensiveToken reduces optimization-free attack success rates by orders of magnitude over prompting-based baselines (0.24% avg ASR on TaskTracker vs. 11–30% for existing test-time defenses) while incurring less utility loss than all competing methods. The defense achieves a superior utility-security Pareto frontier, sitting closest to the ideal defense in Figure 3. The main remaining gap is against adaptive optimization-based attacks (GCG), where DefensiveToken reduces ASR from 95.2% to 48.8% while training-time defenses achieve near-zero—a limitation honestly reported.", + "red_flags": [ + { + "flag": "LLM judge reliance without reliability check", + "detail": "Security evaluation throughout relies on gpt-4o to determine attack success. No inter-rater reliability between the LLM judge and human annotators is reported. OpenAI is also a named funder, creating alignment between funder interests and evaluation tooling." + }, + { + "flag": "Variance reported for only one configuration", + "detail": "Five-run variance is reported only for SEP/Llama3.1-8B-Instruct. All other results in Table 3 (24 model-benchmark combinations) are single-run point estimates without error bars, making claimed comparisons statistically uninformative." + }, + { + "flag": "Results restricted to 7–8B open-weight models", + "detail": "All four tested models are 7B or 8B parameters. Generalizability to larger models (70B+) or closed frontier models (GPT-4, Claude) is untested, and deployment requires providers to optimize and release DefensiveTokens alongside their models—a step that has not occurred in practice." + }, + { + "flag": "Funder conflict with judge tool", + "detail": "OpenAI is a named funder and gpt-4o (OpenAI's model) is the primary LLM judge throughout evaluation. Google is a named funder and a Google DeepMind employee is co-author. No competing interests statement is present." + }, + { + "flag": "Code URL not visible in paper text", + "detail": "The abstract states 'The code is available here' with a hyperlink, but the URL is not printed in the paper text, making code availability unverifiable from the paper alone." + } + ], + "cited_papers": [ + { + "title": "StruQ: Defending against prompt injection with structured queries", + "relevance": "Direct predecessor and primary baseline; DefensiveToken uses StruQ's defensive loss function and training dataset construction methodology" + }, + { + "title": "SecAlign: Defending Against Prompt Injection with Preference Optimization", + "relevance": "Key training-time baseline using preference optimization; SecAlign loss is ablated against StruQ loss in DefensiveToken context" + }, + { + "title": "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents", + "relevance": "Agentic tool-calling benchmark used to test generalization of DefensiveToken to API-calling/ReAct settings beyond instruction-following" + }, + { + "title": "Get my drift? Catching LLM Task Drift with Activation Deltas", + "relevance": "TaskTracker benchmark (31K samples) used as the largest primary security evaluation dataset" + }, + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", + "relevance": "GCG attack method used to evaluate DefensiveToken robustness under strong adaptive optimization-based attacks" + }, + { + "title": "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions", + "relevance": "Training-time defense implemented in frontier models (gpt-4o, gemini-2.5-flash); contextualizes where DefensiveToken sits relative to provider-level solutions" + }, + { + "title": "Defeating prompt injections by design", + "relevance": "System-level defense baseline; excluded from main comparison because it applies only to agentic control-flow cases" + }, + { + "title": "Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?", + "relevance": "SEP benchmark (9.1K samples) used in both security and utility-security trade-off evaluation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses the OWASP #1 LLM threat with a deployable defense requiring 1 hour of training and 5 extra tokens at inference time with no infrastructure changes." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that test-time token embedding optimization can match training-time fine-tuning security challenges the prevailing assumption that parameter updates are necessary for robust prompt injection defense." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses prompt injection attacks that have compromised real-world LLM products (Google Bard, Slack AI, Anthropic/OpenAI web agents), ranked #1 OWASP LLM threat, with growing agentic deployment surface." + }, + "drama_conflict": { + "score": 1, + "justification": "Understated positioning of Berkeley academic work against industry (Google/OpenAI) funded training-time solutions, with co-authors spanning both sides." + }, + "demo_ability": { + "score": 2, + "justification": "Code is reportedly released, training takes 1 hour on 4 A100s with standard open-weight models, making the defense feasibly reproducible by well-resourced practitioners." + }, + "brand_recognition": { + "score": 2, + "justification": "Nicholas Carlini (Google DeepMind/Anthropic) is a prominent ML security researcher; David Wagner (UC Berkeley) is a well-known security faculty member; published at AISec@CCS, the top AI security workshop." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "40938701", + "title": "Training a time series model using transformers at Datadog", + "points": 27, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40938701", + "created_at": "2024-07-11T17:19:07Z" + }, + { + "hn_id": "32218471", + "title": "Drivable Volumetric Avatars Using Texel-Aligned Features", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=32218471", + "created_at": "2022-07-24T22:36:31Z" + }, + { + "hn_id": "44553930", + "title": "Defending Against Prompt Injection with a Few DefensiveTokens", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44553930", + "created_at": "2025-07-13T21:32:40Z" + }, + { + "hn_id": "47041986", + "title": "A Survey of In-Context Reinforcement Learning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47041986", + "created_at": "2026-02-17T00:01:18Z" + }, + { + "hn_id": "43296207", + "title": "The Widespread Adoption of Large Language Model-Assisted Writing Across Society", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43296207", + "created_at": "2025-03-08T00:09:53Z" + }, + { + "hn_id": "43088092", + "title": "The Widespread Adoption of Large Language Model-Assisted Writing Across Society", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43088092", + "created_at": "2025-02-18T10:30:59Z" + }, + { + "hn_id": "38091292", + "title": "Communicative Agents for Software Development", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38091292", + "created_at": "2023-10-31T21:00:02Z" + }, + { + "hn_id": "37786498", + "title": "Communicative Agents for Software Development", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37786498", + "created_at": "2023-10-06T02:13:18Z" + }, + { + "hn_id": "46452714", + "title": "Performance Evaluation of Brokerless Messaging Libraries", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46452714", + "created_at": "2026-01-01T09:44:55Z" + }, + { + "hn_id": "45486277", + "title": "Brain Graph Augmentation via Learnable Edge Masking for Psychiatric Diagnosis", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45486277", + "created_at": "2025-10-05T23:45:56Z" + } + ], + "top_points": 27, + "total_points": 44, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/defending-aipowered-commerce-2025/scan-v5.json b/papers/defending-aipowered-commerce-2025/scan-v5.json @@ -0,0 +1,317 @@ +{ + "scan_version": 5, + "paper_type": "position", + "paper": { + "title": "Defending The AI-Powered Commerce Stack: A Security Framework For Prompt Injection, Review Integrity, And Privacy In Genai Retail Systems", + "authors": [ + "Prakash Kodali" + ], + "year": 2025, + "venue": "Journal of International Crisis and Risk Communication Research", + "arxiv_id": null, + "doi": "10.63278/jicrcr.vi.3471" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "The abstract claims the framework 'provides actionable guidance' and addresses each threat vector, but no empirical validation, case studies, implementation results, or red-team evaluations are presented anywhere in the paper.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper asserts that proposed controls (input isolation, quarantine systems, access minimization) will reduce attacks, but no study design, prototype evaluation, or supporting evidence justifies these causal claims.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The framework is presented as broadly applicable to all 'AI-powered commerce systems' without scoping to specific architectures, scales, regulatory regimes, or deployment contexts.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": false, + "answer": false, + "justification": "No empirical findings are presented; the paper is purely prescriptive and does not evaluate competing interpretations of evidence.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": false, + "answer": false, + "justification": "No measurements or outcomes are reported; the paper proposes controls without evaluating their effects.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no limitations or threats-to-validity section; the conclusion mentions only that 'the dynamic nature of AI security threats requires continuous adaptation' — a generic statement, not a limitations discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats are identified; the paper does not acknowledge false-positive costs, implementation feasibility constraints, adversarial adaptation, or performance overhead of the proposed controls.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not state what classes of systems, attacker sophistication levels, or deployment contexts the framework does not apply to.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "The author's affiliation (Sri Venkateswara University, India) is disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is identified, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interest declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "Key terms such as 'prompt injection,' 'product brain,' 'AI advisor,' and 'AI commerce stack' are used throughout without formal definitions; the paper describes their effects but never defines them precisely.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states it contributes a layered security framework for AI-powered e-commerce covering four threat categories: prompt injection, fake reviews, data poisoning, and privacy leakage.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "The paper cites 10 references as numbered footnotes but never discusses how this framework builds on, differs from, or improves upon prior AI security or e-commerce security frameworks; citations are decorative, not substantive.", + "source": "haiku" + } + } + }, + "type_checklist": { + "position": { + "argument_quality": { + "argument_internally_consistent": { + "applies": true, + "answer": true, + "justification": "The logical progression from threat identification to defense layers to monitoring is internally coherent, even though none of it is empirically supported.", + "source": "haiku" + }, + "counterarguments_addressed": { + "applies": true, + "answer": false, + "justification": "No counterarguments are considered — for example, whether layered controls introduce unacceptable latency or false-positive rates that harm legitimate users, or whether adversaries can trivially bypass proposed defenses.", + "source": "haiku" + }, + "analogies_appropriate": { + "applies": false, + "answer": false, + "justification": "The paper makes no significant use of analogies.", + "source": "haiku" + }, + "prescriptions_proportional": { + "applies": true, + "answer": false, + "justification": "The paper makes sweeping prescriptions ('organizations deploying AI-enhanced retail systems must implement layered defenses') across all four threat categories without any empirical evidence, cost-benefit analysis, or proof of concept.", + "source": "haiku" + }, + "evidence_for_claims_cited": { + "applies": true, + "answer": false, + "justification": "Many specific technical assertions are made without citation (e.g., 'feedback mechanisms within suggestion frameworks may magnify initial quality problems'); the 10 total references are used sparsely and do not cover the majority of the paper's factual claims.", + "source": "haiku" + }, + "alternatives_discussed": { + "applies": true, + "answer": false, + "justification": "No alternative security frameworks or competing approaches are discussed; the paper presents its framework as if no prior AI security or e-commerce security literature proposes alternatives.", + "source": "haiku" + }, + "historical_context_accurate": { + "applies": false, + "answer": false, + "justification": "The paper makes no significant historical claims that could be verified for accuracy.", + "source": "haiku" + } + }, + "clarity_and_scope": { + "key_terms_defined_precisely": { + "applies": true, + "answer": false, + "justification": "Terms like 'intelligent agents,' 'product brain,' 'AI advisor,' and 'provenance tracking' are used throughout without precise definitions specific to this paper's context.", + "source": "haiku" + }, + "engages_with_existing_literature": { + "applies": true, + "answer": false, + "justification": "The paper lists 10 references but never discusses how the existing literature on prompt injection defense, recommender system poisoning, or privacy-preserving AI informs or validates the proposed framework.", + "source": "haiku" + }, + "intended_audience_clear": { + "applies": true, + "answer": true, + "justification": "The abstract explicitly targets 'engineering, security, legal, and customer experience teams building resilient AI-powered commerce systems.'", + "source": "haiku" + }, + "assumptions_stated": { + "applies": true, + "answer": false, + "justification": "The framework assumes organizations have the infrastructure, budget, and expertise to implement all proposed layers simultaneously, but these assumptions are never stated or justified.", + "source": "haiku" + }, + "scope_of_applicability_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss where the framework applies and where it does not — whether it requires a minimum scale, specific AI architectures, or particular regulatory contexts.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Prompt injection attacks exploit untrusted content in product descriptions and user-generated reviews to manipulate assistant behavior and trigger unauthorized actions.", + "evidence": "Cited to references [3][4] (Dinu et al. 2025, Wang 2025) but no original evidence or demonstrated examples from real e-commerce deployments are provided.", + "supported": "moderate" + }, + { + "claim": "AI-generated synthetic reviews undermine rating authenticity and distort marketplace signals at an unprecedented scale.", + "evidence": "Asserted without quantitative evidence of scale, frequency, or measured impact on marketplace metrics.", + "supported": "weak" + }, + { + "claim": "Data poisoning compromises catalog systems and vector embeddings powering recommendation engines, degrading relevance and introducing malicious content propagation.", + "evidence": "Referenced to [7][8] (Wu et al. 2023, Wang et al. 2023), which study poisoning attacks on recommender systems in academic settings.", + "supported": "moderate" + }, + { + "claim": "The proposed layered defense framework provides actionable guidance for engineering, security, legal, and customer experience teams.", + "evidence": "No implementation, case study, user study, or red-team evaluation validates whether the guidance is actionable or effective in practice.", + "supported": "unsupported" + }, + { + "claim": "Privacy leaks emerge from over-permissioned tool access and inadequate PII protection in conversational AI contexts.", + "evidence": "Referenced to [9][10] on PII detection and privacy assistants generally; no evidence specific to conversational commerce contexts.", + "supported": "weak" + } + ], + "methodology_tags": [ + "theoretical" + ], + "key_findings": "This paper proposes a conceptual security framework for AI-powered e-commerce systems covering four threat categories: prompt injection, synthetic review proliferation, data poisoning, and privacy leakage. For each category, layered defenses are described at a high level — input isolation, provenance tracking, quarantine systems, and access minimization — supported by taxonomy tables. No empirical validation, prototype implementation, red-team evaluation, or case study is presented. The framework is entirely prescriptive and aspirational.", + "red_flags": [ + { + "flag": "No empirical validation", + "detail": "The entire paper proposes security controls without any evaluation — no implementation, case study, red-team test, prototype, or measurement of effectiveness is provided." + }, + { + "flag": "Likely AI-generated text", + "detail": "The writing is consistently verbose and circumlocutory throughout (e.g., 'Patron-created material incorporating evaluations, inquiries, responses, and merchandise visuals flows immediately into infrastructures that produce representations and educate suggestion frameworks'), a hallmark pattern of AI-generated prose, not academic writing." + }, + { + "flag": "Wrong venue for content", + "detail": "Published in 'Journal of International Crisis and Risk Communication Research' — a journal entirely unrelated to AI security, e-commerce, or computer science. This strongly suggests predatory or vanity publishing." + }, + { + "flag": "No limitations section", + "detail": "No acknowledgment of the framework's limitations, failure conditions, false-positive costs, adversarial adaptability, or implementation constraints anywhere in the paper." + }, + { + "flag": "Sparse citations for specific claims", + "detail": "Only 10 total references for a 17-page technical framework paper; large portions of the paper make specific technical claims with no citations at all." + }, + { + "flag": "No competing interests or funding disclosure", + "detail": "Standard academic disclosures absent despite prescriptive framework claims with potential commercial applicability." + } + ], + "cited_papers": [ + { + "title": "Disrupting Large Language Models with Hidden Prompt Injection Attacks Embedded in HTML Pages", + "relevance": "Prompt injection attack mechanisms directly relevant to the paper's primary threat model" + }, + { + "title": "To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt", + "relevance": "Defense approaches for prompt injection in LLM agents" + }, + { + "title": "Fake Review Detection in E-Commerce Using Machine Learning and NLP Technique", + "relevance": "ML-based fake review detection for e-commerce platforms" + }, + { + "title": "Detecting Fake Reviews on E-commerce Platforms Using Machine Learning", + "relevance": "Machine learning approaches to review integrity management" + }, + { + "title": "Influence-Driven Data Poisoning for Robust Recommender Systems", + "relevance": "Data poisoning attacks on recommendation systems — core threat model reference" + }, + { + "title": "Revisiting Data Poisoning Attacks on Deep Learning Based Recommender Systems", + "relevance": "Poisoning vulnerabilities in deep learning-based recommenders" + }, + { + "title": "AI-Driven Personalized Privacy Assistants: A Systematic Literature Review", + "relevance": "Privacy protection approaches in AI-driven personalized systems" + }, + { + "title": "Generative Artificial Intelligence and E-Commerce", + "relevance": "Background on GenAI integration and implications for digital retail" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 1, + "justification": "Addresses real security concerns in AI e-commerce but the framework is too abstract and unvalidated to be directly actionable for practitioners." + }, + "surprise_contrarian": { + "score": 0, + "justification": "Describes well-known security threats (prompt injection, fake reviews, data poisoning) with standard proposed defenses; no novel or surprising insights." + }, + "fear_safety": { + "score": 2, + "justification": "Covers legitimate AI security threats to e-commerce including privacy leakage, adversarial manipulation, and trust erosion at scale." + }, + "drama_conflict": { + "score": 1, + "justification": "Security threats to commerce systems carry inherent stakes, but the paper treats them abstractly without case examples or incident data." + }, + "demo_ability": { + "score": 0, + "justification": "No implementation, prototype, or demonstration is provided; entirely conceptual." + }, + "brand_recognition": { + "score": 0, + "justification": "Single author from Sri Venkateswara University; no affiliation with a known AI lab, tech company, or prominent research group." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/defense-against-indirect-2026/scan-v5.json b/papers/defense-against-indirect-2026/scan-v5.json @@ -0,0 +1,502 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defense Against Indirect Prompt Injection via Tool Result Parsing", + "authors": [ + "Qiang Yu", + "Xinran Cheng", + "Chuanyi Liu" + ], + "year": 2026, + "venue": "arXiv.org", + "arxiv_id": "2601.04795", + "doi": "10.48550/arXiv.2601.04795" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of lowest ASR (<1%) and competitive UA are supported by Tables 1–3 showing ParseData+CheckTool achieving 0.00–0.35% Avg Risk vs 2.93–28.96% for all baselines across three models.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about module contributions are tested via a dedicated ablation (Section 4.3) that isolates ParseData and CheckTool individually and in both combination orders on a controlled benchmark.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper is scoped to AgentDojo benchmark and English; the Limitations section explicitly acknowledges parameter hijacking attacks and non-English settings are outside the evaluated scope.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider whether the ASR reduction could partly result from reduced agent task completion (lower UA) rather than genuine defense, nor whether AgentDojo attack patterns are atypically easy to parse.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "ASR directly measures unauthorized tool execution via AgentDojo's verification logic, and UA directly measures task completion under attack; the Risk metric (ASR/UA) explicitly quantifies the tradeoff with no proxy substitution.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated Limitations section appears after the Conclusion, discussing parameter hijacking attacks and English-only evaluation.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are named: parameter hijacking (with concrete email-redirect example that bypasses the defense entirely), and lack of non-English evaluation.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states coverage of action hijacking only, not parameter hijacking, and English-language settings only; these are explicit scope statements rather than generic disclaimers.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All three authors list Harbin Institute of Technology affiliation in the paper header with institutional email addresses.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed; criterion is not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "LLM Agent is formally defined in Section 3.1 with mathematical notation; IPI attack is defined; all four evaluation metrics (BU, UA, ASR, Risk) are precisely defined with formulas in Table 5.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The contribution—ParseData and CheckTool prompt-based modules for IPI defense without model training—is explicitly stated in both the abstract and introduction.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 categorizes prior defenses into model-based and prompt-based, explains limitations of each category, and explicitly positions this work's advantages over both paradigms.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "A GitHub URL is provided in Abstract footnote 1, though the Ethical Considerations section says 'The source code will be made publicly available,' creating ambiguity; the explicit URL is given and treated as released.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "AgentDojo is a publicly available benchmark; no custom dataset was created.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or dependency list is provided; only temperature (0) and context length (64KB) are mentioned in Section 4.1.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Appendices B and C provide verbatim prompts but no end-to-end experimental setup instructions; a reader could not reproduce results without guessing framework integration details.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results are single-run percentages with no confidence intervals or error bars reported anywhere in the paper.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims across defense methods or models.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are reported as proportional comparisons with baseline context (e.g., '0.2%–1%, approximately 1/10 to 1/8 that of Tool Filter') in Figure 3 and Section 4.2.1.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "AgentDojo's 97 user tasks are used without power analysis or justification of whether this is sufficient to detect reliable differences between defense methods.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or multiple-run results are reported; experiments appear to be single deterministic runs (temperature=0).", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four baselines are included: DeBERTa Detector, Repeat User Prompt, Spotlighting with Delimiting, and Tool Filter.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include 2024–2025 published work (DeBERTa Detector, Spotlighting with Delimiting from Hines 2024, Tool Filter from AgentDojo 2024); these represent current state of the art.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 4.3 presents ablation examining ParseData and CheckTool individually and in both combination orders (ParseData+CheckTool vs CheckTool+ParseData) across all three models.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Four metrics are used: Benign Utility (BU), Utility under Attack (UA), Attack Success Rate (ASR), and Risk (ASR/UA).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not relevant to this automated security benchmark evaluation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "No training is performed; AgentDojo is a benchmark, not a prediction task requiring train/test splits.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 5 (Appendix A) provides full breakdowns across 4 attack types (NoAttack, Direct, Ignore Previous, Important Messages) and 3 models for every defense method.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Parameter hijacking is explicitly identified as a concrete failure case with an example (email address substitution) that bypasses the defense entirely.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Substantial BU decreases of 28–55% relative to no-defense baseline are reported honestly across all models and discussed in Section 4.2.2.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "'gpt-oss-120b' is not a publicly known model and has no snapshot date; llama-3.1-70b and qwen3-32b lack version snapshot dates or API access dates.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Appendices B and C provide the complete ParseData and CheckTool prompts verbatim, including all placeholder variables.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature=0 and context length=64KB are specified in Section 4.1; these are the only relevant hyperparameters for inference.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Figure 1 and Section 3 describe how ParseData and CheckTool integrate into the agent pipeline step-by-step, including the anticipation/extraction two-phase process.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 4.1 describes AgentDojo benchmark structure, the three attack types selected, and the four domains; no additional preprocessing was performed.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Appendix A (Table 5) contains the complete numerical results for all conditions, models, attacks, and metrics.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 4.1 describes AgentDojo benchmark structure with 16/21/20/40 tasks across 4 domains and how attacks are injected into tool results.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; recruitment is not applicable.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.1 formalizes the full pipeline mathematically and Figure 1 shows exactly where defense modules integrate into tool call execution.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No training cutoffs are stated for any of the three evaluated models; AgentDojo (2024) tasks could plausibly appear in training data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether AgentDojo tasks or attack patterns were present in model training corpora.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "AgentDojo was published in 2024 and could be in training data for 2025/2026 model versions; this is not addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "ParseData and CheckTool both add extra LLM calls per agent step, but no latency, token count, or monetary cost estimates are provided.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total compute budget, number of API calls, or experiment runtime is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "ParseData+CheckTool achieves the lowest Attack Success Rate among all evaluated defenses, below 1%.", + "evidence": "Table 1: Avg ASR 0.19% (gpt-oss-120b), 0.34% (llama-3.1-70b), 0.11% (qwen3-32b) vs next best Tool Filter at 1.71%, 2.32%, 2.58%.", + "supported": "strong" + }, + { + "claim": "The proposed method achieves competitive Utility under Attack compared to baselines.", + "evidence": "Table 1 shows UA drops to 51.84% vs 64.02% for Tool Filter (gpt-oss-120b), a 12pp gap; framing as 'competitive' is generous given consistent 10–15pp deficit.", + "supported": "weak" + }, + { + "claim": "Risk (ASR/UA) for Parse+Check is approximately 1/10 to 1/8 that of Tool Filter.", + "evidence": "Figure 3: CheckTool+ParseData at 0.22–0.76% vs Tool Filter 2.93–6.28% across three models, yielding approximately the claimed ratio.", + "supported": "strong" + }, + { + "claim": "Stronger LLM reasoning depth improves ParseData performance but degrades CheckTool performance.", + "evidence": "Table 3: qwen3-32b ParseData BU=63.92% vs CheckTool BU=42.27%; gpt-oss-120b shows near parity (54.64% vs 53.61%), consistent with the reasoning-depth hypothesis.", + "supported": "moderate" + }, + { + "claim": "Under the most powerful attack (Important Messages), existing defenses have ASR >4% while the proposed method achieves 0.11–0.53%.", + "evidence": "Table 2: DeBERTa 4.11–6.32%, Tool Filter 5.69–7.90%, Repeat Prompt 14.75–16.86%, vs ParseData+CheckTool 0.11–0.53%.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "ParseData+CheckTool reduces Attack Success Rate to below 1% across three models on AgentDojo—a 3–10x improvement over the next best defense (Tool Filter at 2.93–6.28% risk) and roughly 1/100 the ASR of no defense. The cost is a 28–55% reduction in Benign Utility compared to no defense, making the utility-security tradeoff significant but quantified. A key identified failure mode—parameter hijacking attacks—bypasses the defense entirely and is left for future work. The defense scales with underlying LLM reasoning capability for ParseData but degrades for CheckTool as models reason more aggressively.", + "red_flags": [ + { + "flag": "No statistical significance testing", + "detail": "All results are single-run percentages with no confidence intervals, error bars, or significance tests, despite making strong comparative quantitative claims." + }, + { + "flag": "Unknown primary model 'gpt-oss-120b'", + "detail": "The primary model is not a publicly known model with a snapshot date or documentation, making independent reproduction impossible and comparison with external benchmarks invalid." + }, + { + "flag": "Utility drop understated", + "detail": "BU/UA drops of 28–55% from no-defense baseline are consistently described as 'competitive' in the abstract and framing, despite being substantial practical costs for deployment." + }, + { + "flag": "Inference cost omitted", + "detail": "ParseData and CheckTool each add full LLM calls per agent step; no latency, token cost, or overhead analysis is provided despite this being central to practical adoption." + }, + { + "flag": "Single benchmark, English only", + "detail": "All experiments use only AgentDojo in English; no other IPI benchmarks or non-English settings are tested, limiting claimed generalizability." + }, + { + "flag": "Code availability contradiction", + "detail": "Abstract provides a GitHub URL implying current availability, but Ethical Considerations says 'The source code will be made publicly available,' suggesting it was not released at submission." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "AgentDojo (2024) could appear in training data for the 2025/2026 model versions evaluated; this is not discussed." + } + ], + "cited_papers": [ + { + "title": "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents", + "relevance": "Primary benchmark for all experiments; defines UA, ASR metrics and provides attack scenarios and defense integration framework" + }, + { + "title": "StruQ: Defending Against Prompt Injection with Structured Queries", + "relevance": "Competing training-based defense that fine-tunes LLMs to distinguish instructions from data; used as conceptual baseline" + }, + { + "title": "Defending Against Indirect Prompt Injection Attacks With Spotlighting", + "relevance": "Competing prompt-based defense using delimiters; included as experimental baseline" + }, + { + "title": "Can Indirect Prompt Injection Attacks Be Detected and Removed?", + "relevance": "Lightweight model-based detection approach for IPI; conceptual and experimental baseline" + }, + { + "title": "MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents", + "relevance": "Defense that monitors for suspicious tool calls at execution time; related to CheckTool's monitoring approach" + }, + { + "title": "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", + "relevance": "Foundational work demonstrating IPI as a practical threat against real deployed LLM applications" + }, + { + "title": "Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents", + "relevance": "Shows existing defenses fail against adaptive attackers; motivates the stronger defense approach in this paper" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Addresses a live deployment threat in LLM agents with a training-free, prompt-based solution that can be integrated into existing agent pipelines without model changes." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The core insight (extract only what you need, discard the rest) is intuitive; magnitude of improvement over baselines is notable but the direction is expected." + }, + "fear_safety": { + "score": 2, + "justification": "Explicitly frames IPI as an escalating threat as agents gain physical control over autonomous systems and robotics, raising genuine safety stakes." + }, + "drama_conflict": { + "score": 1, + "justification": "Participates in an active security arms race between attack and defense research, but the paper itself takes no controversial positions." + }, + "demo_ability": { + "score": 2, + "justification": "Code is released on GitHub with AgentDojo integration; anyone with API access to compatible models could run the experiments." + }, + "brand_recognition": { + "score": 0, + "justification": "Authors are from Harbin Institute of Technology; no famous AI lab, industry affiliation, or well-known product involved." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "46624374", + "title": "Quantum Automated Theorem Proving", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46624374", + "created_at": "2026-01-14T22:06:27Z" + } + ], + "top_points": 5, + "total_points": 5, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/defense-against-prompt-2024/scan-v5.json b/papers/defense-against-prompt-2024/scan-v5.json @@ -0,0 +1,547 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defense Against Prompt Injection Attack by Leveraging Attack Techniques", + "authors": [ + "Yulin Chen", + "Haoran Li", + "Zihao Zheng", + "Dekai Wu", + "Yangqiu Song", + "Bryan Hooi" + ], + "year": 2024, + "venue": "Annual Meeting of the Association for Computational Linguistics", + "arxiv_id": "2411.00459", + "doi": "10.48550/arXiv.2411.00459" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims that defense methods outperform existing approaches and achieve state-of-the-art results; Tables 1, 2, 4, and 5 consistently show lower ASR for all four proposed methods versus all five training-free baselines across three open-source and two closed-source models.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Comparative claims are supported by controlled experiments isolating each defense method against each attack type; the ablation in Figure 3 tests the causal claim that 'stronger attacks yield stronger defenses' across three model families, providing adequate experimental grounding.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The abstract claims methods 'outperform existing defense approaches' broadly, but experiments cover only 5 training-free baselines (with comparison to just one fine-tuning method in one scenario) on two datasets; multilingual, RAG-pipeline, or agent-tool-use settings are not evaluated and not disclaimed.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper offers no mechanistic analysis of why inverting attack prompts improves defense; a simpler alternative—that merely repeating the original instruction at the end of the prompt (as Sandwich partially does) explains the gains—is not considered or ruled out.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "ASR is explicitly defined as detecting whether the answer to the injected instruction appears in the generated response, and utility is separately measured by QA accuracy and SST2 accuracy; the paper clearly distinguishes these two measurement dimensions.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated 'Limitations' section appears before the Ethical Consideration section, not merely as a sentence in the conclusion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "The limitations section specifically names: absence of a long-query benchmark preventing thorough truncation analysis, the decision not to use gradient-based attacks as defenses (citing poor prior performance), and the lack of a mathematical proof for why prompt-engineering defenses work.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "While specific gaps are noted, the paper does not state explicit boundaries such as which model architectures, languages, or deployment scenarios the results do NOT cover; the conclusion asserts broad effectiveness without clarifying scope exclusions.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "The acknowledgment section discloses that Dr. Haoran Li is a JC STEM Early Career Research Fellow supported by The Hong Kong Jockey Club Charities Trust.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations with National University of Singapore, HKUST, and Harbin Institute of Technology Shenzhen are disclosed in the header.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "The Hong Kong Jockey Club Charities Trust is a charitable organization unrelated to LLM products or prompt injection defense tools; no financial stake in the experimental outcome is apparent.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "The Ethical Consideration section acknowledges the ACL code of conduct but contains no competing interests or financial interests declaration.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Direct and indirect prompt injection, attack success rate (ASR), the five attack techniques (naive, escape, ignore, fake completion, fake completion with template), and shield prompt are all defined with illustrative figures.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1 explicitly lists three contributions: a novel approach repurposing attack techniques as defenses, four specific defense methods, and empirical demonstration of reduced ASR; the contribution framing is unambiguous.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 surveys both attack and defense prior work in detail, explains how existing training-free defenses fail, and directly benchmarks against five contemporary baselines; the paper situates its contribution relative to fine-tuning approaches as well.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Footnote 1 states 'Code is publicly available at https://github.com/LukeChen-go/pia-defense-by-attack'; this is a direct release, not a promise.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All three datasets used—AlpacaFarm, the filtered QA dataset from Li et al. (2023b), and SST2—are standard public benchmarks used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Appendix A.1 states PyTorch 2.1.0 and a single NVIDIA A100 GPU; no requirements.txt, Dockerfile, or full dependency list is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Generation hyperparameters are given (do_sample=false, max_new_tokens=256, max_length=8192) but no step-by-step instructions for reproducing the pipeline appear in the paper; the reader must infer procedure from the code repository.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "None of the result tables (Tables 1–12) report confidence intervals or error bars; only point-estimate ASR and accuracy values are given.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims; improvements over baselines are asserted from raw percentages alone.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute ASR reduction figures (e.g., from 100% to 0.05–0.10% for Fakecom-t in indirect injection) provide interpretable effect sizes relative to both no-defense and baseline-defense conditions.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "208 samples for direct injection and 2,000 for indirect injection are used without any power analysis or justification for why those sizes are sufficient.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "All results are single-run point estimates; no standard deviation or variance across repeated runs is reported for any experiment.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Five training-free baselines (Sandwich, Instructional, Reminder, Isolation, Spotlight) and a no-defense condition are included in all main experiments.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include Hines et al. (2024) and Yi et al. (2023); the most recently published baselines available at submission time are represented.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 5.4 contains a structured ablation addressing five specific questions: closed-source generalization, gradient-based attack defense, template-aware attack vulnerability, attack-defense strength correlation, and long-input truncation.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both ASR (security metric) and task accuracy on QA and SST2 (utility metric) plus inference time overhead (Table 8) are reported.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not relevant for automated prompt injection defense benchmarking; all evaluation is automated via string-match ASR detection.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "The defense methods are training-free prompt-engineering techniques with no model training phase; a train/test split is not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Tables 1 and 2 break down ASR by all five attack types (Naive, Ignore, Escape, Fakecom, Combined) across each of three victim models separately.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.5 shows concrete response examples where attacks succeed without defense and explains mechanistically why some defenses fail (e.g., Ignore defense not always suppressing both instructions).", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that Ours-Escape underperforms on Fake completion attacks (ASR 70.19% for Qwen2 in Table 1), that removing retrieved data severely degrades utility (Table 12), and notes one exception to the attack-defense correlation in Figure 3.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model names are stated: Llama3-8b-Instruct, Llama3.1-8b-Instruct, Qwen2-7b-Instruct, GPT-3.5-Turbo, and GPT-4o-Latest, each with a citation to the corresponding technical report.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figures 2 and 4–15 show the complete prompt templates for all four defense methods and all five attack types including system, user, and assistant turn content.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Appendix A.1 reports do_sample=false, max_new_tokens=256, max_length=8192 for generation; these are the key inference hyperparameters.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "The defense methods are purely prompt-engineering based with no agentic scaffolding, tool use, or multi-step agent loops.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "The paper uses a 'filtered QA dataset' from Li et al. (2023b) and 208 samples from AlpacaFarm but does not describe how attacks were injected into these datasets or what filtering criteria were applied beyond citing the source paper.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All evaluation datasets (AlpacaFarm, SST2) are publicly available; the QA dataset is from Li et al. (2023b) which is also publicly available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 5.1 identifies each dataset, its size, and its source paper; the attack injection procedure is described conceptually through the methodology and figures.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; all experiments use automated evaluation on standard benchmark datasets.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The paper does not document the pipeline for constructing the attacked versions of datasets (e.g., exactly how malicious prompts were injected into each QA sample), relying on the reader to infer from attack descriptions.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not state the training data cutoff for any of the evaluated models (Llama3, Llama3.1, Qwen2, GPT-3.5, GPT-4o), despite using benchmark datasets that may have been available before training.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether AlpacaFarm or the QA dataset appeared in the training data of the victim models is present anywhere in the paper.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "AlpacaFarm (2024) and SST2 (2013) predate the training cutoffs of models like Llama3 and Qwen2; potential contamination is not acknowledged or addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants; pre-registration not applicable.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; IRB approval not applicable.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 8 reports inference time in seconds per item for all defense methods across all three victim models; overhead from Fakecom-t is quantified relative to no-defense baseline.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Appendix A.1 mentions a single NVIDIA A100 GPU but does not state total GPU-hours or compute budget for the full experimental suite.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Training-free defense methods derived from attack techniques outperform all existing training-free defense baselines on both direct and indirect prompt injection scenarios", + "evidence": "Tables 1 and 2 show lower ASR for all four proposed methods versus Sandwich, Instructional, Reminder, Isolation, and Spotlight across Llama3, Llama3.1, and Qwen2", + "supported": "strong" + }, + { + "claim": "The Fake Completion with Template defense (Fakecom-t) reduces ASR to near zero (≤0.10%) for indirect prompt injection attacks", + "evidence": "Table 2 shows Ours-Fakecom-t achieving ASR of 0.05–0.10% across all five attack types for indirect injection on Llama3 and Llama3.1", + "supported": "strong" + }, + { + "claim": "Stronger attack techniques yield stronger corresponding defense methods when the attack technique is inverted", + "evidence": "Figure 3 shows a positive correlation between average attack ASR and corresponding defense effectiveness across three models, with one exception noted for Qwen2/Llama3.1", + "supported": "moderate" + }, + { + "claim": "The proposed training-free methods are competitive with fine-tuning-based defenses (StruQ) while offering better generalization to unseen attack types", + "evidence": "Table 7 shows Ours-Ignore achieving 0.05–1.35% ASR comparable to StruQ-Ignore's 0.05% on most attacks, but StruQ-Naive fails on Fakecom (35.55%) while Ours-Ignore holds at 0.10%", + "supported": "moderate" + }, + { + "claim": "Defense methods do not significantly degrade model utility on downstream tasks", + "evidence": "Tables 3 and 11 show QA accuracy and SST2 accuracy within ±2pp of no-defense baseline across all three models and all defense methods", + "supported": "strong" + }, + { + "claim": "The proposed methods are effective on closed-source models (GPT-3.5-Turbo, GPT-4o-Latest)", + "evidence": "Table 4 shows Ours-Ignore reducing ASR from 50.48% to 3.36% on GPT-3.5 and from 92.78% to 0.90% on GPT-4o for Ignore attacks", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "The paper demonstrates that prompt injection attack techniques can be directly inverted to create effective training-free defenses: by appending an attack-derived 'shield prompt' followed by the original instruction, the LLM is redirected away from injected instructions. The Fake Completion with Template defense achieves near-zero ASR (≤0.10%) on indirect injection across all tested attack types, outperforming all five training-free baselines and matching or exceeding one fine-tuning approach (StruQ) with better generalization. A positive correlation exists between attack effectiveness and the strength of the corresponding defense method. Defense overhead is minimal: inference time increases by at most 19% for the most complex defense, and task accuracy is preserved within ±2pp.", + "red_flags": [ + { + "flag": "No statistical testing", + "detail": "No confidence intervals, error bars, or significance tests are reported for any comparison; improvements over baselines are asserted from single-run point estimates across only 208–2,000 samples." + }, + { + "flag": "Small direct-injection sample", + "detail": "Only 208 AlpacaFarm samples are used for direct injection experiments with no power analysis; this limits the reliability of comparisons between methods with similar ASR values." + }, + { + "flag": "No mechanistic explanation", + "detail": "The paper does not explain why attack inversion works mechanistically or rule out the simpler hypothesis that appending the original instruction at the end of the prompt (independent of the shield prompt structure) accounts for most of the gain." + }, + { + "flag": "Adversarial adaptation not addressed", + "detail": "The paper assumes attackers do not know the defense mechanism; no evaluation considers adaptive attackers who modify their injection knowing the Fakecom-t defense is in use." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "AlpacaFarm and SST2 predate the training cutoffs of Llama3 and Qwen2; potential data contamination affecting model behavior on these benchmarks is not discussed." + }, + { + "flag": "No variance across runs", + "detail": "All results are single point estimates with do_sample=false; reproducibility across different random seeds or hardware is not demonstrated." + } + ], + "cited_papers": [ + { + "title": "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", + "relevance": "Foundational indirect prompt injection paper; defines the threat model this defense addresses" + }, + { + "title": "Ignore Previous Prompt: Attack Techniques for Language Models", + "relevance": "Introduces the Ignore attack technique that directly inspires one of the four defense methods" + }, + { + "title": "StruQ: Defending Against Prompt Injection with Structured Queries", + "relevance": "Fine-tuning-based defense baseline used for direct comparison; provides the evaluation protocol adopted in this paper" + }, + { + "title": "Defending Against Indirect Prompt Injection Attacks with Spotlighting", + "relevance": "Training-free defense baseline; one of five methods compared against proposed approaches" + }, + { + "title": "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions", + "relevance": "Fine-tuning-based defense approach contextualizing the trade-off between training cost and effectiveness" + }, + { + "title": "Formalizing and Benchmarking Prompt Injection Attacks and Defenses", + "relevance": "Provides the Combined attack baseline evaluated in this paper" + }, + { + "title": "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents", + "relevance": "Contextualizes prompt injection risk in agentic tool-use settings beyond the QA scenario tested here" + }, + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", + "relevance": "GCG gradient-based attack method used as one of the attack baselines in the ablation study" + }, + { + "title": "Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection", + "relevance": "Source of the filtered QA dataset (2,000 samples) used for indirect injection evaluation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Training-free defense methods requiring only prompt modification are immediately deployable in any LLM application without retraining." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The core insight—that attack prompts can be directly repurposed as defense prompts—is counterintuitive and not previously explored in the literature." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses the OWASP #1 LLM security risk; results showing near-zero ASR with the Fakecom-t defense are reassuring but the unaddressed adaptive attacker concern is concerning." + }, + "drama_conflict": { + "score": 1, + "justification": "No significant controversy; the paper straightforwardly proposes and validates a defense approach without challenging established consensus." + }, + "demo_ability": { + "score": 3, + "justification": "Code is publicly released and the defense is purely prompt-based—anyone with API access to an LLM can immediately test it with no infrastructure." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors from NUS and HKUST are respected academic institutions but not a major AI lab; no famous product or industry partnership is involved." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "38150915", + "title": "The Generative AI Paradox: \"What It Can Create, It May Not Understand\"", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38150915", + "created_at": "2023-11-05T13:23:46Z" + }, + { + "hn_id": "42487268", + "title": "Specification-Driven Code Translation Powered by LLMs: How Far Are We?", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42487268", + "created_at": "2024-12-22T16:20:09Z" + }, + { + "hn_id": "38146155", + "title": "The Generative AI Paradox: \"What It Can Create, It May Not Understand\"", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=38146155", + "created_at": "2023-11-04T23:06:37Z" + }, + { + "hn_id": "43268036", + "title": "Evolutionary Multi-Agent Reinforcement Learning in Group Social Dilemmas", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43268036", + "created_at": "2025-03-05T15:41:54Z" + }, + { + "hn_id": "35719730", + "title": "Schrödinger cat states of a 16-microgram mechanical oscillator", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35719730", + "created_at": "2023-04-26T20:43:33Z" + } + ], + "top_points": 5, + "total_points": 15, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/defense-against-prompt-2025/scan-v5.json b/papers/defense-against-prompt-2025/scan-v5.json @@ -0,0 +1,577 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defense against Prompt Injection Attacks via Mixture of Encodings", + "authors": [ + "Ruiyi Zhang", + "David Sullivan", + "Kyle Jackson", + "Pengtao Xie", + "Mei Chen" + ], + "year": 2025, + "venue": "North American Chapter of the Association for Computational Linguistics", + "arxiv_id": "2504.07467", + "doi": "10.48550/arXiv.2504.07467" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Tables 1 and 2 support all three main claims: attack success rates are competitive/lowest (Table 1), task performance is maintained within 2-5% of baseline (Table 2), and outperformance over Base64/Caesar is shown across most benchmarks.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper compares full methods but includes no ablation study isolating the contribution of each encoding or the mixture aggregation strategy. Cannot attribute improvement to the mixture mechanism vs. other factors.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Results are shown on specific BIPIA attack datasets and 9 NLP tasks, but generalization claims aren't bounded. No discussion of applicability to other attack types, models beyond GPT-4/4o/Qwen, or tasks outside this set.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper presents empirical results without exploring why the mixture works or considering alternative mechanistic explanations for the improved performance.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Attack success rate directly measures whether LLM follows malicious instructions (not a proxy), and NLP task accuracy directly measures helpfulness. Measurements align with stated claims.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Section 7 contains only one brief paragraph discussing computational overhead. A single sentence per limitation does not constitute a dedicated limitations section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Only generic mention of inference cost. No specific threats discussed: sample representativeness, attack generalization, model generalization, or whether 50 BIPIA attacks represent the full attack space.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope is implicitly tested (GPT-4/4o/Qwen, BIPIA attacks, 9 tasks) but explicit boundaries stating what the work does NOT show are not stated.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "First author footnote states 'internship project at Microsoft' but no formal funding source statement or acknowledgments section disclosing sponsors.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations clearly stated: UC San Diego and Microsoft. However, Microsoft employees are among the authors, creating potential bias (though evaluating non-Microsoft LLMs).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Microsoft funded the internship but doesn't directly benefit from results—the method is model-agnostic and evaluates OpenAI (GPT-4/4o) and Alibaba (Qwen), not Microsoft products.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patent disclosures, or equity declarations included.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Prompt injection attacks defined with Figure 1 example. Mixture of encodings explained. Safety/helpfulness used contextually. Attack success rate implicitly clear from context.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contribution is explicitly stated in abstract and introduction: a defense method balancing safety and helpfulness using multiple character encodings.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 discusses prior attacks (Perez 2022, Liu 2024), existing defenses (detection vs. prevention), Base64 defense (Hines 2024), and mixture-of-experts literature. Positions this work as improvement on Base64.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Paper states 'Our code is publicly available at https://github.com/ruz048/MoEMEnT'. Assuming claim is accurate; GitHub URL provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Uses only public benchmark datasets (BIPIA, MMLU, SQuAD, Hellaswag, MGSM, SamSum, WMT, IMDB, WildGuard, WebQ). No custom data created.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Model versions specified (GPT-4 turbo-2024-04-09, GPT-4o 2024-05-13) but no requirements.txt, Dockerfile, Python version, or dependency specifications provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Method described algorithmically but no step-by-step reproduction instructions included. Code promise is made but not documented in paper.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 1 and 2 report point estimates only. No error bars, confidence intervals, or measures of variance across runs.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims made (e.g., 'outperforms Base64') but no statistical significance tests (t-tests, chi-squared, etc.) reported.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Tables report accuracy percentages and attack success rates, which are effect measures. Improvements quantified (e.g., Table 1: GPT-4o Abstract 1.0% vs Base64 5.7%).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Dataset sizes are standard benchmarks (BIPIA: 7.5K-22.5K; NLP tasks: 1.3K-25K) but not justified. No power analysis or rationale for sufficiency.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Single point estimates per condition. No variance, standard deviation, or multiple run results reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Baselines: No Defense, Datamark, Ignoring, Base64, Caesar. Covers both detection-adjacent and encoding-based defenses.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Base64 (Hines 2024), Caesar (proposed concurrently), Datamark/Ignoring (Yi et al. 2023). All recent encoding/defense methods.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "Method combines 3 prompts (P1, P2, P3) and aggregates responses, but no ablation isolating individual components or encoding combinations.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Safety: attack success rate. Helpfulness: accuracy on 9 NLP tasks. Cost: inference multiplier (Table 4). Multiple dimensions evaluated.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "All metrics are automated. No human judgment on whether output quality is preserved for summarization, translation, or QA tasks.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Standard benchmarks use validation/test splits (BIPIA test set, NLP task test splits). Appropriate held-out evaluation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 1 breaks safety by 4 attack datasets (Email, Table, Abstract, Code). Table 2 breaks helpfulness by 9 tasks.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Figure 4 shows example where Base64 fails (math problem returns 2 instead of 4) but mixture succeeds. Some failure analysis via example.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Tables show cases where method underperforms baseline, e.g., GPT-4o WebQ: 25.3% vs 29.7% no-defense. Reported but not analyzed.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-4 (turbo-2024-04-09) and GPT-4o (2024-05-13) specify snapshot dates. Qwen-2.5-72B-Instruct mentioned for open-source.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Meta-prompts (MP1, MP2) provided in Table 3 (Appendix D). MP1: 'The following sentence is encoded in Base64/Caesar format.' MP2: 'Given answers A, B, C, reply with your answer.' Instructions are shown.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Caesar shift = 3 (Appendix E). Temperature, top-p, and other generation parameters not reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Method clearly described: encode input with 3 methods, get 3 LLM responses, aggregate (sum probabilities for classification, meta-prompt for generation).", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Uses standard benchmark splits without custom preprocessing. No preprocessing pipeline documented.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All datasets are public: BIPIA components (OpenAI Evals, WikiTableQA, XSum, Stack Overflow) and standard NLP benchmarks (MMLU, SQuAD, etc.).", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Safety datasets sourced from: Email (OpenAI Evals), Table (WikiTableQA), Abstract (XSum), Code (Stack Overflow). Helpfulness datasets standard (MMLU, SQuAD, etc.). Sources cited.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants. Benchmark evaluation only.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Paper references existing datasets but doesn't document the pipeline from raw data to final evaluation (e.g., filtering, splitting, preprocessing).", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": true, + "justification": "Model snapshots specified: GPT-4 turbo (cutoff ~April 2024), GPT-4o (cutoff ~May 2024). BIPIA benchmark from 2023, before cutoffs.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Standard NLP benchmarks (MMLU, SQuAD) are known to have training data contamination in large LLMs. This risk is not discussed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether MMLU, SQuAD, or other helpfulness benchmarks were in training data. Known contamination issue not addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects; benchmark evaluation only.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 4 (Appendix H) shows inference cost multiplier: no defense = 1x, mixture = 3.46x. Absolute runtime/USD cost not provided but relative cost clear.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total computational budget for experiments (number of API calls, total cost, compute-hours) not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Mixture of encodings achieves one of the lowest attack success rates under prompt injection attacks", + "evidence": "Table 1 shows attack success rates on 4 BIPIA datasets: for GPT-4o, mixture achieves 1.5% (Email), 1.0% (Table), 1.0% (Abstract), 0% (Code), competitive with or better than Base64 and Caesar.", + "supported": "strong" + }, + { + "claim": "The method maintains high performance across all NLP tasks", + "evidence": "Table 2 shows helpfulness performance on 9 tasks: GPT-4o ranges 75.5-96.1% with mixture vs 79.9-92.3% without defense, within 2-5% of baseline on 8 of 9 tasks.", + "supported": "strong" + }, + { + "claim": "Mixture of encodings outperforms Base64 and Caesar defenses", + "evidence": "Table 1: mixture beats Base64 on 3/4 attack datasets for GPT-4o; Table 2: mixture beats Caesar on all 9 tasks (e.g., MGSM 52.0% vs 14.2%).", + "supported": "strong" + }, + { + "claim": "Base64 defense significantly degrades performance on mathematical and multilingual reasoning tasks", + "evidence": "Table 2: Base64 achieves 5.2% on MGSM (vs 53.1% no-defense) and 64.9% on MMLU (vs 79.9%), demonstrating substantial degradation on reasoning tasks.", + "supported": "strong" + }, + { + "claim": "Aggregating multiple encodings balances safety and helpfulness better than single encodings", + "evidence": "Tables 1 and 2 show mixture achieving competitive safety with acceptable helpfulness, while Base64 achieves better safety but much worse helpfulness.", + "supported": "moderate" + }, + { + "claim": "Caesar cipher is an effective character encoding for defending against prompt injection attacks", + "evidence": "Table 1: Caesar achieves low attack success rates (e.g., 0% on Code for GPT-4o), but Table 2 shows severe helpfulness degradation (e.g., 7.3% on MGSM).", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "This paper proposes a mixture of encodings defense against prompt injection attacks that processes external data in three forms—unencoded (P1), Base64-encoded (P2), and Caesar-cipher-encoded (P3)—sending each to an LLM and aggregating responses. On the BIPIA safety benchmark (4 attack datasets), the method achieves 1.0-1.5% average attack success rate for GPT-4o, competitive with Base64. On nine NLP tasks (MMLU, SQuAD, Hellaswag, MGSM, SamSum, WMT, IMDB, WildGuard, WebQ), it maintains 75.5-96.1% accuracy, within 2-5% of the undefended baseline—a major improvement over Base64 alone, which degrades to 5% on math reasoning. The trade-off is 3.46x inference cost due to triple processing, though the authors suggest parallelization could reduce latency.", + "red_flags": [ + { + "flag": "No ablation study", + "detail": "Method combines 3 prompts (unencoded, Base64, Caesar) and aggregates, but no ablation isolates which components matter. Is Caesar necessary? Would 2-encoding mixture suffice? Contribution of aggregation vs. ensemble is unclear." + }, + { + "flag": "No statistical significance testing", + "detail": "All results are point estimates. No confidence intervals, p-values, or multiple runs reported. Improvements could be noise, especially for small differences (e.g., 1% on some datasets)." + }, + { + "flag": "Computational overhead inadequately justified", + "detail": "3.46x inference cost is substantial. Section 7 mentions 'can be processed in parallel' but doesn't quantify actual latency reduction or real-world deployment feasibility." + }, + { + "flag": "Limited attack diversity", + "detail": "Only evaluated on BIPIA benchmark (50 attack types). Unknown whether method generalizes to other prompt injection strategies, novel attack formulations, or adversarially crafted attacks outside BIPIA scope." + }, + { + "flag": "No human evaluation", + "detail": "Automated metrics only. Doesn't verify whether human-perceived output quality is preserved for subjective tasks like summarization or machine translation." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "Standard NLP benchmarks (MMLU, SQuAD, Hellaswag) are known to be in GPT-4/4o training data. Reported helpfulness improvements may be inflated due to memorization rather than genuine robustness." + }, + { + "flag": "No alternative aggregation strategies explored", + "detail": "Why sum probabilities for classification? Why meta-prompt for generation? No comparison of aggregation methods or justification that choices are optimal." + }, + { + "flag": "Weak generalization claims", + "detail": "Results shown for GPT-4, GPT-4o, and Qwen, but generalization to other LLM architectures, sizes, or deployment contexts (on-device, quantized) is unexplored." + } + ], + "cited_papers": [ + { + "title": "Benchmarking and defending against indirect prompt injection attacks on large language models", + "relevance": "Introduces BIPIA benchmark used for safety evaluation. Proposes detection-based and prevention-based defenses. Core baseline for this work." + }, + { + "title": "Defending against indirect prompt injection attacks with spotlighting", + "relevance": "Proposes Base64 defense, the state-of-the-art encoding-based defense. This paper builds on and improves Base64." + }, + { + "title": "Formalizing and benchmarking prompt injection attacks and defenses", + "relevance": "Formalizes prompt injection attacks and proposes baselines (Datamark, Ignoring). Defines attack/defense taxonomy." + }, + { + "title": "Ignore previous prompt: Attack techniques for language models", + "relevance": "Foundational work on prompt injection attack techniques. Establishes the threat model." + }, + { + "title": "Jailbroken: How does LLM safety training fail?", + "relevance": "Studies LLM understanding of encodings (Base64, cipher). Provides evidence that recent LLMs can decode Base64, motivating encoding-based defenses." + }, + { + "title": "GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher", + "relevance": "Demonstrates LLM capability on Caesar cipher decoding. Motivates choice of Caesar as one encoding in mixture." + }, + { + "title": "Mixture of Experts and Prompt Ensemble", + "relevance": "Surveys mixture-of-experts and prompt ensemble methods. Provides conceptual foundation for this paper's mixture-of-encodings strategy." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Method is deployable (code released) but 3.46x inference cost is prohibitive for production systems without extreme safety requirements. Viable only for latency-insensitive applications (e.g., content moderation)." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Straightforward application of mixture-of-experts / ensemble logic to defenses. Expected result that combining encoding methods improves robustness; no surprising insights about prompt injection or defenses." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses prompt injection, a real LLM safety concern, but limited scope: only applies to systems with external data pipelines. Doesn't address broader alignment, jailbreaking, or adversarial robustness." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy. Straightforward safety engineering. No contentious claims or competing approaches." + }, + "demo_ability": { + "score": 2, + "justification": "Code released but requires GPT-4/4o API access or local LLM to try. Not immediately runnable for most readers without API keys and budget." + }, + "brand_recognition": { + "score": 2, + "justification": "UC San Diego + Microsoft (respectable but not top-tier). No famous authors. Published at ACL (mainstream venue but not top-tier conference)." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44884091", + "title": "A Comprehensive Survey of Self-Evolving AI Agents [pdf]", + "points": 94, + "comments": 29, + "url": "https://news.ycombinator.com/item?id=44884091", + "created_at": "2025-08-13T02:26:32Z" + }, + { + "hn_id": "43736366", + "title": "Inferring the Phylogeny of Large Language Models", + "points": 69, + "comments": 6, + "url": "https://news.ycombinator.com/item?id=43736366", + "created_at": "2025-04-19T13:47:15Z" + }, + { + "hn_id": "26794843", + "title": "Certifying Multimedia News Content for Fake News Defense", + "points": 12, + "comments": 3, + "url": "https://news.ycombinator.com/item?id=26794843", + "created_at": "2021-04-13T16:28:40Z" + }, + { + "hn_id": "43989432", + "title": "OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43989432", + "created_at": "2025-05-14T21:30:02Z" + }, + { + "hn_id": "40043146", + "title": "Why do small language models underperform?", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=40043146", + "created_at": "2024-04-15T17:10:46Z" + }, + { + "hn_id": "35626433", + "title": "Learning to Compress Prompts with Gist Tokens", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=35626433", + "created_at": "2023-04-19T10:22:30Z" + }, + { + "hn_id": "35721355", + "title": "Compressing Large Language Model Prompts via Gist Tokens", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35721355", + "created_at": "2023-04-26T23:30:32Z" + }, + { + "hn_id": "35641820", + "title": "Learning to Compress Prompts with Gist Tokens", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35641820", + "created_at": "2023-04-20T15:43:27Z" + }, + { + "hn_id": "9413569", + "title": "Efficient Approximation Algorithms for the Largest Weight Data Retrieval Problem", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=9413569", + "created_at": "2015-04-21T12:35:14Z" + } + ], + "top_points": 94, + "total_points": 189, + "total_comments": 40 + } +} +\ No newline at end of file diff --git a/papers/defense-massive-false-2022/scan-v5.json b/papers/defense-massive-false-2022/scan-v5.json @@ -0,0 +1,505 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defense of Massive False Data Injection Attack via Sparse Attack Points Considering Uncertain Topological Changes", + "authors": [ + "Xiaoge Huang", + "Zhijun Qin", + "Ming Xie", + "Hui Liu", + "Liang Meng" + ], + "year": 2022, + "venue": "Journal of Modern Power Systems and Clean Energy", + "arxiv_id": null, + "doi": "10.35833/mpce.2020.000686" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims 95% detection accuracy and 80% localization accuracy; Table V confirms 96.7–98.5% detection accuracy and Table IX shows 80.37–85.69% localization correct rate across all test systems.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Comparative claims (AE-BCV outperforms SVM and ANN) are supported by training all three detectors on identical datasets and evaluating on identical held-out test sets, which is adequate for comparative inference.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion asserts the AE-BCV detector 'can be directly applied with AC power flow model,' but all experiments use only the DC power flow model; this generalization goes beyond the tested setting.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider alternative explanations for AE-BCV outperforming SVM/ANN, such as hyperparameter sensitivity of the baselines or the effect of architecture depth differences.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly distinguishes detection accuracy, localization correct rate, and recovery mean error as separate metrics tied to three distinct claimed contributions, without conflating them.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the conclusion mentions future work directions (AC model, dynamic SE) but does not frame these as limitations of the current results.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats to validity are discussed; the complete information assumption and DC-only model are stated as design choices rather than as validity threats affecting result interpretation.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "While assumptions (DC model, complete topology information, single-snapshot SE) are mentioned, the paper does not explicitly state what the results do NOT show or demarcate clear scope boundaries in a threats-to-validity sense.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed: 'This work was supported in part by the National Natural Science Foundation of China (No. 51767001).'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are disclosed: Guangxi University (academic) and Guangxi Power Grid Co. Ltd., China Southern Grid (industry co-authors).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "The National Natural Science Foundation of China is a government funding agency with no commercial interest in the outcome of FDIA defense methodology research.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement (patents, equity, consulting) is provided anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are precisely defined: FDIA, SE, BDD, DC power flow model (Eq. 1–3), state variables, attack intensity parameter k, and both attack models are formally specified with mathematical notation.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Four explicit numbered contributions are stated in the introduction: enhanced attack model, AE-BCV detector, AE-GAN generation, and pattern match recovery algorithm.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section I extensively compares against prior work [6]–[28], explicitly differentiating the proposed approach from [21], [26], [28] and explaining why prior methods fail to handle multi-modal measurement distributions.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No source code is released or mentioned as available; only the hardware setup and PyTorch framework are noted.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Datasets are generated via MATPOWER simulations with FDIA overlays using specific configurations; they are not released or made publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Hardware (Intel i7-8750H, RTX 2070) and PyTorch are mentioned but no version numbers or reproducible environment files (requirements.txt, Dockerfile) are provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Appendix A provides neural network architecture parameters but no step-by-step instructions exist to reproduce experiments from scratch without guessing implementation details.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported for any accuracy metrics; all results are single percentage values without uncertainty estimates.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are performed for comparative accuracy claims between AE-BCV, SVM, and ANN.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Accuracy differences are quantified with baselines (e.g., AE-BCV 95.2% vs SVM 80.0% for 118-bus WA); recovery errors are reported with before/after context (16.50 → 0.85 mean |a/z| for SA on 118-bus).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "1100 simulations per attack level are used but not justified; the 80/10/10 split is stated without statistical power analysis.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or spread is reported for any accuracy result; all metrics are single-point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "ANN and SVM detectors are trained on identical datasets and compared against AE-BCV across all test systems and attack levels.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "SVM and plain ANN are weak baselines for a 2022 paper; contemporary deep learning alternatives (CNN, LSTM, attention-based detectors) are absent as direct comparisons, and comparison with [13] (2017) and [14] (2018) is acknowledged as 'rough.'", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation study isolates contributions of AE vs BCV in the detector or standard GAN vs AE-GAN in the generator, making it unclear which components drive performance.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: accuracy, false positive rate, and false negative rate for detection; correct rate, positive/negative false rates for localization; mean recovery error before and after for recovery.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is clearly irrelevant for automated power system FDIA detection using simulated data.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "An explicit 10% held-out test set is used, with a deliberately higher topology change rate (8% line outages) than training (5%) to test generalization to unseen conditions.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by attack intensity (SA/MA/WA), attack type (targeted/untargeted), and power system (57-bus, 118-bus, 415-bus) across Tables V–X.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Performance variation by attack intensity is quantified but not discussed as failure cases; no dedicated analysis of misclassification patterns or conditions where the methodology breaks down is provided.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "All results are framed positively; performance degradation for weak attacks is reported in tables but not interpreted as a negative finding or limitation.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "PyTorch is mentioned as the implementation framework but no version number is given; MATPOWER and CVX versions are also unspecified.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "This is not an LLM/prompt-based paper; no prompts or system instructions are relevant.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Appendix A (Tables AI and AII) provides detailed neural network parameters including number of hidden layers, neuron counts, and learning rates for all models.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "This paper does not involve agentic AI scaffolding; deep learning models are used for classification and generation tasks.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section VI.B documents Monte Carlo line switching (5% train / 8% test), power injection variation (50–150% ± 1% noise), per-unit representation without normalization, and separate dataset configurations for AE vs AE-GAN.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Generated datasets (FDIA samples from MATPOWER simulations with CVX-optimized attack vectors) are not publicly released for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section VI.B describes the data generation procedure including normal case diversification (Monte Carlo simulations, line switching, injection variation) and FDIA case generation using the proposed optimization models.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; data is entirely simulated from power system models.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from MATPOWER base cases → Monte Carlo diversification → FDIA overlay (CVX optimization) → 80/10/10 train/val/test split is described with sufficient detail to understand the process.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not applicable; this paper trains deep learning models on synthetically generated power system data, not on pre-trained LLM benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable; data is synthetically generated with controlled topology-based train/test splits.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable; IEEE bus systems are simulation substrates, not benchmark datasets for evaluating pre-trained model knowledge.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table XI reports online classification time (0.011–0.013 seconds) and localization/recovery time (14.69–25.47 seconds) for IEEE 118-bus and CSG 415-bus systems.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Hardware is specified (Intel i7-8750H CPU, RTX 2070 GPU, 16 GB RAM) and offline training times are reported (441–594s for AE-BCV detection, 324–406s for AE-GAN).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "AE-BCV achieves over 95% FDIA detection accuracy including under unseen topological changes", + "evidence": "Table V shows 96.7–98.5% accuracy on IEEE 57-bus, 118-bus, and 415-bus; Tables VII–VIII confirm 90.8–99.2% for unseen moderate and weak attacks", + "supported": "strong" + }, + { + "claim": "Proposed attack models affect far more state variables with fewer compromised meters than the conventional model", + "evidence": "Table IV: conventional model affects 10 states with 60–140 compromised meters on 118-bus; proposed untargeted model affects 117 states with only 33–34 compromised meters", + "supported": "strong" + }, + { + "claim": "AE-GAN avoids model collapse for multi-modal power system measurement distributions", + "evidence": "Figure 5 shows discriminator and generator training losses converging to Nash-equilibrium after ~200 steps; theoretical argument provided that single-modal encoder intermediate prevents GAN collapse", + "supported": "moderate" + }, + { + "claim": "FDIA localization achieves 80–86% correct rate across attack intensities and test systems", + "evidence": "Table IX shows correct rates of 80.37–85.69% for IEEE 118-bus and CSG 415-bus across SA, MA, and WA intensity levels", + "supported": "strong" + }, + { + "claim": "Recovery reduces mean measurement error by approximately 95% across attack intensities", + "evidence": "Table X: error drops from 16.50 to 0.85 (SA) and 3.15 to 0.46 (WA) on IEEE 118-bus; similar reductions on CSG 415-bus", + "supported": "strong" + }, + { + "claim": "AE-BCV advantage over SVM/ANN grows substantially at lower attack intensities", + "evidence": "Table VIII: AE-BCV achieves 90.8–95.2% for weak attacks vs SVM 50.8–80.0% and ANN 60.4–78.8%", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "The paper proposes a three-stage deep learning defense against false data injection attacks in power grids: an AE-BCV detector achieving >95% detection accuracy that generalizes to unseen network topologies; an AE-GAN that generates diverse normal measurement distributions without model collapse; and a pattern match algorithm achieving 80–86% localization accuracy with ~95% recovery of falsified measurements. A secondary contribution demonstrates that the proposed enhanced attack model can affect 10× more state variables using fewer compromised meters than prior methods, motivating the need for more sophisticated defense.", + "red_flags": [ + { + "flag": "Simulated data only", + "detail": "All experiments use synthetically generated data from MATPOWER simulations; no validation on real power grid operational data or real FDIA incident data is performed, making real-world generalization unverified." + }, + { + "flag": "Complete information assumption", + "detail": "Both attacker and defender are assumed to have complete knowledge of the power grid topology, a strong and often unrealistic assumption; results may not transfer to partial-information scenarios, which are not discussed as a threat." + }, + { + "flag": "No statistical significance testing", + "detail": "Accuracy differences between AE-BCV, SVM, and ANN are reported as single percentages without confidence intervals or significance tests, making it unclear whether observed differences are statistically meaningful." + }, + { + "flag": "Weak baselines for 2022", + "detail": "SVM and plain ANN are outdated baselines for a 2022 deep learning paper; contemporary alternatives such as CNN, LSTM, or attention-based FDIA detectors are not included as direct comparisons." + }, + { + "flag": "No reproducibility artifacts", + "detail": "No source code, datasets, or step-by-step reproduction instructions are released; independent replication is not feasible despite the neural network architecture tables in Appendix A." + }, + { + "flag": "DC model only with unsupported AC claim", + "detail": "All results use the DC power flow model, yet the conclusion claims the methodology 'can be directly applied with AC power flow model' without any empirical support for this assertion." + }, + { + "flag": "No ablation study", + "detail": "No ablation study isolates the contributions of AE vs BCV in the detector, or the AE component vs standard GAN in the generator, leaving unclear which components are essential for reported performance." + } + ], + "cited_papers": [ + { + "title": "False data injection attacks against state estimation in electric power grids (Liu et al., 2011)", + "relevance": "Foundational paper proposing the original FDIA concept and showing attacks can bypass bad data detection; the proposed attack models in this paper extend and compare against this work." + }, + { + "title": "A survey on the detection algorithms for false data injection attacks in smart grids (Musleh et al., 2020)", + "relevance": "Comprehensive survey providing context for state-of-the-art FDIA detection methods and situating the AE-BCV contribution." + }, + { + "title": "Real-time detection of false data injection attacks in smart grid: a deep learning-based intelligent mechanism (He et al., 2017)", + "relevance": "Prior deep learning FDIA detector achieving 92% on IEEE 118-bus; used as a reference performance benchmark for comparison." + }, + { + "title": "Online false data injection attack detection with wavelet transform and deep neural networks (Yu et al., 2018)", + "relevance": "Deep learning FDIA detector achieving up to 98% on IEEE 118-bus; used as reference performance for comparison with the proposed detector." + }, + { + "title": "Online generative adversary network based measurement recovery in false data injection attacks (Li et al., 2020)", + "relevance": "Prior GAN-based measurement recovery approach that the proposed AE-GAN improves upon by avoiding model collapse with multi-modal distributions." + }, + { + "title": "Generative adversarial nets (Goodfellow et al., 2014)", + "relevance": "Foundational GAN paper; the AE-GAN methodology and training objective in this paper are built directly on this framework." + }, + { + "title": "Identification of false data injection attacks with considering the impact of wind generation and topology reconfigurations (Mohammadpourfard et al., 2018)", + "relevance": "Prior FDIA localization work that the proposed method explicitly differentiates from by addressing multi-modal distributions via AE-GAN rather than predefined candidate sets." + }, + { + "title": "False data injection on state estimation in power systems — attacks, impacts, and defense: a survey (Deng et al., 2017)", + "relevance": "Survey of FDIA attacks, impacts, and defenses providing context for the attack model formulation and the baseline convex optimization model." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Power grid security is a high-stakes applied domain with real incidents (Ukraine 2015 blackout), but the DC-only model, complete information assumption, and absence of real-world validation limit immediate deployment." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The multi-modal distribution insight is a genuine contribution but not counterintuitive; using AE-GAN to solve it is a predictable application of generative models to the problem." + }, + "fear_safety": { + "score": 2, + "justification": "References the 2015 Ukraine power grid attack affecting 80,000+ users as direct motivation; power grid security failures carry direct physical safety consequences." + }, + "drama_conflict": { + "score": 1, + "justification": "The Ukraine blackout example adds urgency but the paper is straightforwardly technical without controversial claims or adversarial framing beyond standard attacker-defender setup." + }, + "demo_ability": { + "score": 0, + "justification": "No code released, no public demo, and results require specialized power system simulation tools (MATPOWER, CVX, PyTorch) to replicate without provided artifacts." + }, + "brand_recognition": { + "score": 0, + "justification": "Guangxi University and Guangxi Power Grid Co. Ltd. are not internationally recognized brand names in AI/ML or cybersecurity research." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/defensive-prompt-patch-2024/scan-v5.json b/papers/defensive-prompt-patch-2024/scan-v5.json @@ -0,0 +1,529 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks", + "authors": [ + "Chen Xiong", + "Xiangyu Qi", + "Pin-Yu Chen", + "Tsung-Yi Ho" + ], + "year": 2024, + "venue": "arXiv.org", + "arxiv_id": "2405.20099", + "doi": "10.48550/arXiv.2405.20099" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "The abstract claims 'negligible impact on utility,' but Mistral-7B-Instruct-v0.2 Win-Rate drops from 90.31% to 75.06% (~15pp), and the claim of universal applicability is tested on only two primary models.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation studies (Appendix B, Tables 6–12) isolate contributions of the defense objective, helpful objective, HGA vs. RLPrompt, and synonym substitution, supporting causal claims about what drives DPP effectiveness.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion claims DPP is a 'universal defensive solution' scalable to 'various LLM platforms,' but the primary evaluation covers only two 7B-scale open-source models, with limited appendix experiments on Vicuna-13B and Llama-3-8B.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider alternative explanations for DPP's effectiveness, such as whether the suffix works via prompt dilution, semantic priming, or overfitting to the keyword-based ASR evaluation metric.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "ASR is measured primarily via substring keyword matching ('I'm sorry', 'I cannot', etc.) as a proxy for actual harm prevention; while LLaMA-Guard evaluation is added in appendices, main claims rest on the keyword proxy without adequately discussing its limitations.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated 'Limitation' section discusses computational cost of HGA, GPT-4 training API cost, limitations of defense baselines, and vulnerability when open-weight models are run locally.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: HGA requires ~$75 USD in GPT-4 API calls per training run, and DPP can be trivially removed by users running open-weight models locally — concrete, non-boilerplate constraints.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit statements about what results do not generalize to (e.g., larger models, closed-source APIs, non-English jailbreaks, attacks outside the evaluation set); the limitations section omits these scope boundaries.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "The acknowledgment states 'Chen Xiong and Tsung-Yi Ho...are funded by the Hong Kong Jockey Club Charities Trust.'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations — CUHK, Princeton University, IBM Research — are disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "The Hong Kong Jockey Club Charities Trust is a philanthropic organization with no stake in LLM products or the defense methods being evaluated.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial disclosure statement is present; only the funding source is mentioned, with no declaration of patents, equity, or consulting relationships.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3.1 formally defines jailbreak attack, jailbreak defense, and utility degradation with mathematical notation; ASR is defined in Appendix I and Win-Rate in Appendix J.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are bullet-pointed in the introduction: improved defense with minimal utility tradeoff, robustness against adaptive attacks, and clarity/stability of the prompt mechanism.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 covers both jailbreak attack methods (GCG, AutoDAN, PAIR, TAP, ICA) and defenses (Self-Reminder, RPO, Goal Prioritization), with Table 1 explicitly showing how DPP addresses deficiencies of each prior approach.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code is released via HuggingFace demo space (TrustSafeAI/Defensive-Prompt-Patch-Jailbreak-Defense) and an anonymous repository at anonymous.4open.science/r/DPP-23FF.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Evaluation relies entirely on publicly available datasets: AdvBench harmful behaviors, Alpaca dataset, and JailbreakBench — all accessible without restriction.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Appendix C mentions 'single A800 GPU with 80GB of memory' and lists hyperparameters, but no requirements.txt, Dockerfile, or software version list is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Algorithm pseudocode and hyperparameters are detailed in appendices, but there are no step-by-step instructions for running the full training-and-evaluation pipeline end-to-end.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported for any main results in Tables 2–5; all values are single point estimates.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to comparative claims (e.g., 'outperforms RPO by 42% for ICA attack'); results are compared as raw percentages without hypothesis testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are expressed as absolute ASR differences with baseline context throughout (e.g., DPP 3.8% average ASR vs. RPO 16.8% vs. no defense 51.5%).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The use of 100 harmful queries from AdvBench for training and evaluation is not justified with power analysis or explanation of why 100 is sufficient.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance or standard deviation across runs is reported; Table 7 shows results from three initializations but aggregates them only as averages without spread statistics.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Self-Reminder, RPO, Goal Prioritization, and Default System Prompt serve as baselines throughout Tables 2–5.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include RPO (2024), Goal Prioritization (2023), and Self-Reminder (2023) — all recent prompt-based defenses in the same category as DPP.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Appendix B provides five ablation studies: objective functions (Table 6), prefix vs. suffix format (Table 7), prototype initialization sensitivity (Tables 9–10), HGA vs. RLPrompt solver (Table 11), and synonym substitution necessity (Table 12).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both ASR (safety) and Win-Rate via AlpacaEval (utility) are reported as primary metrics, with a secondary Min Over Prompt metric added in Appendix Y.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Win-Rate uses automated AlpacaEval comparison against Davinci003, not human judges; human evaluation is not relevant to the algorithmic optimization contribution.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Section 4.3 tests on 'another 100 harmful queries from AdvBench dataset which are independent from the Adversarial Dataset' used during training/optimization.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down per individual jailbreak attack type (Base64, ICA, AutoDAN, GCG, PAIR, TAP, Catastrophic) in all main result tables.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Table 2 explicitly shows DPP has 10% ASR against AutoDAN (worse than Self-Reminder's 0%), and the Mistral results acknowledge higher utility degradation than simpler baselines.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper directly reports that Mistral-7B-Instruct-v0.2 has worse defense-utility tradeoff than Llama-2-7B-Chat, and that DPP's adaptive ASR on Mistral (46.9%) is substantially higher than on Llama (13.0%).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions are specified: 'Llama-2-7B-Chat' and 'Mistral-7B-Instruct-v0.2' with citations to their respective technical reports.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "All DPP suffixes are shown in Appendix E and all defense baseline prompts (Self-Reminder, Goal Prioritization, System Prompt) are shown verbatim in Appendix H.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Appendix C lists all hyperparameters: α=1, β=10 (Llama) or α=10, β=1 (Mistral), num_steps=100, batch_size=64, crossover_rate=0.5, mutation_rate=0.01, plus sentence/paragraph iteration counts.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; DPP is a static suffix appended at inference time, not an agentic pipeline.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Dataset sampling is briefly noted (100 queries from AdvBench) but preprocessing steps for generating jailbreak prompts via each attack method are not documented beyond external GitHub links.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "While source datasets (AdvBench, Alpaca) are public, the generated jailbreak prompts and model responses used in the actual evaluation are not released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 4.1 describes sampling 100 harmful queries from AdvBench and 100 benign queries from Alpaca; jailbreak generation procedures with hyperparameters and GitHub links are detailed in Appendix F.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; all evaluation uses automated benchmarks.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The optimization pipeline is described via algorithms, but the full end-to-end pipeline from raw AdvBench queries through attack generation to ASR calculation is not documented cohesively in one place.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training cutoffs for Llama-2 and Mistral are not stated; relevant because AdvBench (the evaluation set) was published before both models' RLHF training, potentially contaminating baseline alignment behavior.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "AdvBench was publicly released in 2023 before Llama-2 and Mistral's training cutoffs; the possibility that models' RLHF training incorporated these refusal patterns — artificially lowering baseline ASR — is never discussed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The paper does not acknowledge that AdvBench examples were available before the evaluated models were trained, potentially making the baseline ASR measurements unrepresentative of real-world alignment.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The paper notes DPP adds only a suffix at inference time (O(1) overhead) but does not report actual latency or inference cost numbers.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Appendix C specifies a single A800 GPU (80GB), 15.32 seconds per training epoch, 100 epochs per training instance, and approximately $75 USD GPT-4 API cost per DPP training run.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DPP achieves the lowest average ASR (3.8%) among prompt-based defenses on Llama-2-7B-Chat while maintaining the highest Win-Rate (82.98%).", + "evidence": "Table 2: DPP at 3.8% average ASR and 82.98% Win-Rate vs. Self-Reminder (6.3% ASR, 64.84%), RPO (16.8% ASR, 79.23%), Goal Prioritization (10.0% ASR, 34.29%).", + "supported": "strong" + }, + { + "claim": "DPP generalizes to less-aligned models, achieving 2.0% average non-adaptive ASR on Mistral-7B-Instruct-v0.2.", + "evidence": "Table 4 shows DPP at 2.0% vs. Goal Prioritization (22.2%), Self-Reminder (48.2%), System Prompt (52.7%), though Win-Rate drops to 75.06% from baseline 90.31%.", + "supported": "moderate" + }, + { + "claim": "HGA outperforms RLPrompt as an optimizer for the DPP objective, achieving lower ASR and higher utility.", + "evidence": "Table 11: HGA achieves 4% GCG ASR and 82.98% Win-Rate vs. RLPrompt 15% GCG ASR and 47.89% Win-Rate on Llama-2-7B-Chat.", + "supported": "moderate" + }, + { + "claim": "Using DPP as a suffix outperforms using it as a prefix, particularly under adaptive attacks.", + "evidence": "Table 7: Average adaptive GCG ASR of 57% for Prefix DPP vs. 15% for Suffix DPP; Win-Rate also higher for suffix (76.09% vs. 73.05%) on Llama-2-7B-Chat.", + "supported": "moderate" + }, + { + "claim": "DPP produces more human-readable prompts than gradient-based approaches like RPO.", + "evidence": "Table 34: DPP perplexity 56.57 vs. RPO perplexity 8780.94 as measured by GPT-4 next-token log-probabilities.", + "supported": "weak" + }, + { + "claim": "DPP maintains defense effectiveness against unforeseen jailbreak queries not used during training.", + "evidence": "Table 37: DPP achieves 7.5% average ASR on held-out AdvBench queries on Llama-2-7B-Chat vs. Self-Reminder (15.5%), RPO (29.0%), Goal Prioritization (30.3%).", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DPP, trained via a bi-objective Hierarchical Genetic Algorithm balancing refusal likelihood and helpfulness, achieves the lowest average ASR (3.8%) and highest utility (82.98% Win-Rate) among prompt-based defenses on Llama-2-7B-Chat, outperforming RPO, Goal Prioritization, and Self-Reminder. The method generalizes to less-aligned Mistral-7B-Instruct-v0.2 (2.0% non-adaptive ASR) but with considerably greater utility degradation than simpler baselines. Ablation studies confirm that both defense and utility objectives are essential, and HGA substantially outperforms RLPrompt as the search algorithm. However, adaptive ASR on Mistral reaches 46.9%, the primary evaluation metric relies on keyword matching rather than actual harm assessment, and 'universal' scalability claims rest on only two primary 7B-scale models.", + "red_flags": [ + { + "flag": "Keyword-matching ASR metric", + "detail": "Attack Success Rate is determined by checking whether model responses contain refusal keywords like 'I'm sorry' or 'I cannot.' A model generating harmful content that incidentally includes such a phrase is counted as defended; this is known to be an unreliable proxy for actual safety." + }, + { + "flag": "No statistical significance testing", + "detail": "All comparisons (e.g., 'outperforms RPO by 42% for ICA attack') are made on 100-query point estimates without confidence intervals, error bars, or significance tests, making it impossible to assess whether differences are reliable." + }, + { + "flag": "Small, unjustified evaluation sample", + "detail": "Only 100 harmful queries from AdvBench are used for both training and evaluation. Sample size is not justified, and no sensitivity analysis examines whether results hold with more queries or different source datasets." + }, + { + "flag": "Overstated universality claims", + "detail": "The conclusion describes DPP as a 'universal defensive solution' scalable to 'various LLM platforms,' but primary evaluation covers only two 7B-scale open-source chat models." + }, + { + "flag": "AdvBench contamination not addressed", + "detail": "AdvBench was published in 2023 before Llama-2 and Mistral's RLHF training cutoffs, making it plausible that these models' alignment training incorporated these specific harmful query patterns, artificially depressing baseline ASR measurements." + }, + { + "flag": "GPT-4 training dependency limits reproducibility", + "detail": "DPP training uses GPT-4 for LLM-based prompt revisions (~$75 per run), creating a dependency on proprietary API access that cannot be reproduced without ongoing OpenAI billing and risks non-determinism across API versions." + } + ], + "cited_papers": [ + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG)", + "relevance": "Primary gradient-based attack method evaluated; DPP is specifically designed to defend against adversarial suffixes generated by GCG." + }, + { + "title": "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Language Models", + "relevance": "DPP directly adapts AutoDAN's Hierarchical Genetic Algorithm as its optimization backbone — the direct methodological ancestor of DPP." + }, + { + "title": "Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR)", + "relevance": "Key black-box attack baseline; tests DPP robustness against automated adversarial prompt generation via a secondary attacker LLM." + }, + { + "title": "Tree of Attacks: Jailbreaking Black-box LLMs Automatically (TAP)", + "relevance": "Attack baseline using tree-structured prompt refinement; one of six attacks used in the primary evaluation." + }, + { + "title": "Defending ChatGPT Against Jailbreak Attack via Self-Reminders", + "relevance": "Primary prompt-based defense baseline; DPP is positioned as superior to Self-Reminder in both defense and utility." + }, + { + "title": "Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO)", + "relevance": "Gradient-based prompt defense baseline; comparison with RPO demonstrates DPP's advantage of producing human-readable prompts with higher utility." + }, + { + "title": "Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization", + "relevance": "Defense baseline achieving low ASR at high utility cost; DPP is framed as solving this safety-utility tradeoff." + }, + { + "title": "Llama Guard: LLM-Based Input-Output Safeguard for Human-AI Conversations", + "relevance": "Used as an alternative LLM-based judge to validate keyword-based ASR measurements across appendix experiments." + }, + { + "title": "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models", + "relevance": "Provides out-of-distribution harmful queries for generalization testing in Appendix P." + }, + { + "title": "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks", + "relevance": "Representative non-prompt-based defense contextualizing the prompt-based defense paradigm DPP operates within." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "DPP requires no model retraining and deploys as a simple suffix prompt; a live HuggingFace demo exists, making adoption concrete and low-friction for practitioners." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that a short, human-readable suffix outperforms gradient-optimized unreadable prompts (RPO perplexity 8780 vs. DPP 56) is mildly counterintuitive." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses concrete jailbreak vulnerabilities in deployed LLMs; demonstrating that widely-used defenses (RPO: 45.7% adaptive ASR) are largely defeated by adaptive attackers raises genuine safety concerns." + }, + "drama_conflict": { + "score": 1, + "justification": "Framed as an attack-defense arms race with competitive benchmarking; adaptive attack evaluation where attackers know the defense mechanism creates adversarial tension." + }, + "demo_ability": { + "score": 2, + "justification": "Interactive HuggingFace demo exists at TrustSafeAI/Defensive-Prompt-Patch-Jailbreak-Defense; users can test the defense directly." + }, + "brand_recognition": { + "score": 1, + "justification": "Pin-Yu Chen (IBM Research) is well-known in adversarial ML and Xiangyu Qi (Princeton) has security credibility, but the paper is ArXiv-only with no top-tier venue affiliation." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "38853706", + "title": "Possible Meissner effect near room temperature: copper-substituted lead apatite", + "points": 729, + "comments": 318, + "url": "https://news.ycombinator.com/item?id=38853706" + }, + { + "hn_id": "38850232", + "title": "LK99: Possible Meissner effect near room temperature", + "points": 6, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=38850232" + }, + { + "hn_id": "42387852", + "title": "LLM Synthetic Conversations Unlock?", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42387852" + } + ], + "top_points": 729, + "total_points": 736, + "total_comments": 320 + } +} +\ No newline at end of file diff --git a/papers/dehallucinator-mitigating-llm-2024/scan-v5.json b/papers/dehallucinator-mitigating-llm-2024/scan-v5.json @@ -0,0 +1,499 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "De-Hallucinator: Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding", + "authors": [ + "Aryaz Eghbali", + "Michael Pradel" + ], + "year": 2024, + "venue": "arXiv", + "arxiv_id": "2401.01701", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All quantitative claims in the abstract (23.3–50.6% edit distance improvement, 23.9–61.0% API recall improvement, 63.2% fixed hallucinations, 15.5% coverage gain) are directly supported by Tables 3 and 4 in the evaluation.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper makes causal claims about De-Hallucinator improving code quality; a controlled comparison against baselines (initial prompt, RAG prompt) with Wilcoxon statistical tests and an ablation over hyperparameters k and n provides adequate support for these directional claims.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The limitations section explicitly states 'our conclusions are valid only for these languages' (Python and JavaScript), and the abstract specifies evaluation across 'two code generation tasks, two programming languages, and five state-of-the-art LLMs.'", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether improvement could stem from simply adding more tokens to the prompt rather than the specific iterative API-retrieval mechanism; the RAG vs. iterative comparison partially addresses this but the paper does not frame it as an alternative explanation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly explains each metric—edit distance quantifies token edits a developer needs to make, exact API match measures the specific phenomenon of interest, and passing tests/coverage directly measure test generation quality—without overclaiming these as general productivity measures.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 is dedicated to 'LIMITATIONS AND THREATS TO VALIDITY' and contains multiple distinct limitations beyond a single sentence.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "The section names specific threats: the assumption that APIs exist when needed (and a proposed workaround), evaluation restricted to Python and JavaScript, and potential non-representativeness of the project sample; the first two are concrete and specific.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states conclusions are valid only for Python and JavaScript, and notes the project sample may not represent all projects, providing meaningful scope boundaries.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment or grant information appears anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Both authors clearly disclose their affiliation as Software Lab, University of Stuttgart, with institutional email addresses.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no competing interests statement or declaration of financial interests in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Hallucination is defined as inventing non-existent APIs; API reference is formally defined in Definition 3.1 with three subtypes (function, class, attribute); initial prompt, RAG prompt, and iterative prompt are each defined with examples.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper lists four explicit contributions at the end of the introduction: empirical motivation, a new technique, a novel algorithm, and empirical evidence of effectiveness across tasks and LLMs.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 7 provides a detailed related work discussion that distinguishes De-Hallucinator from CoCoMIC, RepoCoder, HyDE, TestPilot, and RAG, explaining specifically how each differs in mechanism and coupling to the underlying LLM.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The data availability statement provides two public GitHub URLs (https://github.com/AryazE/dehallucinator and https://github.com/AryazE/testpilot) containing the implementation, datasets, and evaluation scripts.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The code completion dataset is constructed from 11 public GitHub projects with specific commit hashes listed in Table 2; the test generation dataset reuses public JavaScript projects from TestPilot.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions specific libraries (HuggingFace transformers, CodeQL, NLTK, scikit-learn Ball Tree) and hardware, but provides no requirements.txt, Dockerfile, or equivalent environment specification in the paper itself.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper points to GitHub repositories but provides no step-by-step instructions for reproducing the evaluation within the paper itself.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 3 and 4 report only mean values without confidence intervals, standard deviations, or error bars for any of the main results.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "The paper uses the Wilcoxon signed-rank test with the Pratt method for all comparative claims; statistically significant results are bolded in Tables 3 and 4.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Relative percentage improvements are reported alongside absolute numbers throughout Tables 3 and 4, providing effect size context (e.g., 23.3–50.6% edit distance reduction).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 50 tasks for the preliminary study, 10 completions per project (440 total), and 12 JavaScript projects is described procedurally but not justified through power analysis or comparable benchmarks.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or confidence intervals are reported for any metric; all tables show only mean values.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Two baselines are included: the conventional initial prompt (no retrieval) and RAG prompt (retrieval from initial prompt only), compared against the iterative De-Hallucinator approach.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All baseline LLMs (CodeGen, CodeGen 2.5, UniXCoder, StarCoder+, GPT-3.5-turbo-0125) and TestPilot as the test generation baseline are state-of-the-art as of the 2024 submission.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "RQ3 (Section 5.4) provides a systematic ablation over the number of iterations k ∈ {1,2,3} and the number of API references n ∈ {2,10,20,40} for code completion and n ∈ {3,5,10} for test generation.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Three metrics are used for code completion (edit distance, normalized edit similarity, exact API match) and three for test generation (passing tests, coverage, fixed hallucinations).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "The preliminary study uses two authors independently classifying 50 completion pairs with Cohen's kappa of 0.76; RQ2 includes manual inspection of 20 completion tasks per LLM to verify correct API augmentation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The evaluation uses a constructed dataset of completion tasks from 11 Python and 12 JavaScript projects; completions already predicted correctly by the baseline are excluded to form a meaningful test set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results in Table 3 are broken down per LLM model (UniXCoder, CodeGen v1, CodeGen v2.5, StarCoder+), and Table 5 provides per-model breakdown of correct API augmentation rates.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.3 explicitly discusses failure cases: 'For cases where the approach fails to add the correct API reference into the prompt, the main reason is that the initial completion has low relevance w.r.t. the ground truth.'", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Table 4 shows RAG & iterative combined performs worse on coverage than iterative alone; Figure 7 shows n=40 API references hurts performance; these negative results are reported and discussed.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact HuggingFace model IDs are provided (Salesforce/codegen-2B-mono, Salesforce/codegen25-7b-mono, microsoft/unixcoder-base, bigcode/starcoderplus) and the OpenAI model version GPT-3.5-turbo-0125 is specified.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figures 4, 5, and 6 show complete example prompts including the API Reference block format, and Section 3.4 describes the prompt construction with concrete examples.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Key hyperparameters are reported: k=3 iterations, n=20 API references for code completion, n=3 for test generation, max new tokens=256, temperature=0.1, 4 completions per prompt.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The three-stage pipeline (pre-analysis → retrieval → prompt construction → LLM query) is described in detail across Sections 3.2–3.5 with both general description and task-specific instantiation.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 5.1.3 documents dataset construction: removing API usages, filtering functions >25 lines, removing API-related imports, and excluding already-correct completions are all described with rationale.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Specific project commits are listed in Table 2 for all 23 projects, and implementation/datasets are released at the GitHub URLs, enabling independent data reconstruction.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The collection process is documented: randomly sampling from a curated awesome-python list, sampling by application domain, selecting 5 functions per project, applying inclusion/exclusion criteria.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; the study uses public code repositories and internal author annotation for the preliminary study.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from project selection → API extraction via CodeQL → embedding → Ball Tree indexing → retrieval → prompt construction → evaluation is documented across Sections 3–5.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No training data cutoffs are stated for any of the five evaluated models, despite evaluating on public GitHub code that may have been in training corpora.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper excludes already-correct completions 'to avoid any potential memorizations' but does not analyze or discuss training data overlap with the evaluated GitHub projects.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The evaluation projects are popular public GitHub repositories that were almost certainly available before the training cutoffs of the evaluated models; this is not discussed or analyzed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects study; the manual annotation is performed by the authors.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; IRB approval is not applicable.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Section 5.5 reports retrieval latency (21–227ms for code completion, 0.1–17ms for test generation) and LLM query times (1.3s for GPT-3.5 to 66.7s for CodeGen v2.5 on local GPU).", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Hardware is specified (two Nvidia T4 GPUs, single Tesla V100 with 32GB) and pre-analysis times are given per project (under 80 seconds for Python, 3.5s average for JavaScript).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "API hallucinations affect 44% of all function-level code completion tasks and 59% of failed tasks in a preliminary study of 50 functions.", + "evidence": "Manual classification of 50 code completion tasks with inter-rater agreement Cohen's kappa=0.76; 22 of 37 failed completions had at least one missing/wrong API usage.", + "supported": "moderate" + }, + { + "claim": "De-Hallucinator improves edit distance by 23.3–50.6% over conventional prompts across four LLMs.", + "evidence": "Table 3 shows reductions from 50.6% (UniXCoder) to 23.3% (CodeGen v1) with Wilcoxon statistical significance; all improvements are statistically significant.", + "supported": "strong" + }, + { + "claim": "De-Hallucinator improves exact API match recall by 23.9–61.0% across four LLMs for code completion.", + "evidence": "Table 3 reports exact API match counts for initial vs. iterative prompts, with statistically significant improvements for all models.", + "supported": "strong" + }, + { + "claim": "De-Hallucinator fixes 63.2% more hallucinated tests and increases statement coverage by 15.5% for test generation.", + "evidence": "Table 4 shows iterative prompts improve fixed hallucinations from 19.3 to 31.4 (63.2% relative) and coverage from 32.1 to 37.0 (15.5% relative) with Wilcoxon significance.", + "supported": "strong" + }, + { + "claim": "Iterative prompts (using model output for retrieval) outperform plain RAG prompts (using initial prompt for retrieval).", + "evidence": "Table 3 consistently shows iterative prompts achieve lower edit distance and higher API match than RAG prompts; Table 4 shows iterative alone achieves higher coverage than RAG & iterative combined.", + "supported": "strong" + }, + { + "claim": "Even k=1 iteration provides clear improvements over the baseline, making the approach useful when LLM query cost is high.", + "evidence": "Figure 7 shows k=1 already provides substantial relative improvement in exact API match across all four models.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "De-Hallucinator addresses LLM hallucinations of project-specific APIs by iteratively augmenting prompts with retrieved API references, using the model's own (hallucinated) predictions to guide better retrieval. Evaluated on 5 LLMs across 11 Python and 12 JavaScript open-source projects, the approach consistently and significantly improves code completion (23.3–50.6% edit distance reduction, 23.9–61.0% API recall improvement) and test generation (63.2% more hallucinated tests fixed, 15.5% coverage increase). The iterative retrieval mechanism outperforms plain RAG, and even a single iteration provides meaningful gains. The approach requires no fine-tuning and treats the LLM as a black box.", + "red_flags": [ + { + "flag": "No variance reported", + "detail": "All main result tables (Tables 3 and 4) report only mean values with no standard deviation, confidence intervals, or error bars, making it impossible to assess result stability." + }, + { + "flag": "Contamination not addressed", + "detail": "All 11 Python and 12 JavaScript projects are popular public GitHub repositories almost certainly present in the training corpora of the evaluated models; the paper does not analyze or discuss train-test overlap beyond excluding already-correct completions." + }, + { + "flag": "Small preliminary study", + "detail": "The motivational claim that 44% of failed completions involve hallucinated APIs is based on manual inspection of only 50 functions from 10 projects by the authors themselves." + }, + { + "flag": "No confound isolation for 'more context'", + "detail": "The paper does not rule out that improvement comes simply from adding more tokens to the prompt rather than from the specific iterative API-retrieval mechanism; a control adding random API references would strengthen the causal claim." + } + ], + "cited_papers": [ + { + "title": "An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation", + "relevance": "TestPilot is the baseline system for the test generation task; De-Hallucinator extends TestPilot with iterative API grounding." + }, + { + "title": "Repository-Level Code Completion Through Iterative Retrieval and Generation (RepoCoder)", + "relevance": "Concurrent work on repository-level code completion using iterative retrieval; directly compared to as the closest prior approach." + }, + { + "title": "CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context", + "relevance": "Prior work addressing project-specific context for code completion via fine-tuning; contrasted with De-Hallucinator's black-box approach." + }, + { + "title": "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", + "relevance": "Foundational RAG technique that De-Hallucinator extends with iterative retrieval using model predictions." + }, + { + "title": "Evaluating Large Language Models Trained on Code (Codex/HumanEval)", + "relevance": "Key baseline work on evaluating code LLMs; establishes the benchmark context for code generation evaluation." + }, + { + "title": "Repository-Level Prompt Generation for Large Language Models of Code", + "relevance": "Prior work on repository-level prompt generation using a separate trained model; contrasted with De-Hallucinator's model-agnostic approach." + }, + { + "title": "StarCoder: may the source be with you!", + "relevance": "One of four LLMs evaluated in the code completion experiments." + }, + { + "title": "An Empirical Evaluation of GitHub Copilot's Code Suggestions", + "relevance": "Prior empirical study on LLM hallucinations in code generation; motivates the research problem." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "The technique works with off-the-shelf LLMs as black boxes, is deployable in IDEs, and addresses a concrete pain point that developers already report with AI coding assistants." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Using the model's hallucinated output to retrieve better context is a clever observation, but iterative refinement and RAG are established concepts so the insight is incremental." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns; the paper addresses a usability/accuracy problem in code generation tools." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy; the paper presents a technical improvement without challenging major assumptions or competing with prominent labs." + }, + "demo_ability": { + "score": 3, + "justification": "Code is publicly released on GitHub for both tasks, and the approach works with any off-the-shelf LLM making it immediately tryable." + }, + "brand_recognition": { + "score": 1, + "justification": "University of Stuttgart Software Lab is a respected academic group but not a widely-recognized brand in industry or general AI discourse." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "38939558", + "title": "Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38939558", + "created_at": "2024-01-10T14:57:07Z" + } + ], + "top_points": 2, + "total_points": 2, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/demonstratesearchpredict-composing-retrieval-2022/scan-v5.json b/papers/demonstratesearchpredict-composing-retrieval-2022/scan-v5.json @@ -0,0 +1,571 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP", + "authors": [ + "O. Khattab", + "Keshav Santhanam", + "Xiang Lisa Li", + "David Hall", + "Percy Liang", + "Christopher Potts", + "Matei Zaharia" + ], + "year": 2022, + "venue": "arXiv.org", + "arxiv_id": "2212.14024", + "doi": "10.48550/arXiv.2212.14024" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The reported gain ranges (37-120% vs vanilla LM, 8-39% vs retrieve-then-read, 80-290% vs self-ask) are verifiable from Table 1, though the ranges are selectively drawn across metrics and tasks.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper attributes gains causally to DSP components (DEMONSTRATE, SEARCH, PREDICT) but conducts no systematic ablation isolating individual components' contributions.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The title and framing claim DSP addresses 'knowledge-intensive NLP' broadly, but evaluation is limited to three QA datasets using one LM (GPT-3.5) and one RM (ColBERTv2).", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss alternative explanations for gains, such as increased computational budget (more LM API calls per query), corpus alignment advantages, or prompt engineering effects independent of the DSP abstraction.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper measures EM and F1 on QA benchmarks and claims improvements in QA performance; the metrics align with stated claims without overreaching to broader capabilities.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the conclusion briefly promises 'additional test tasks and LM choices' in future work, which does not constitute a limitations section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The paper briefly acknowledges that comparisons with concurrent work are 'not generally apples-to-apples,' but enumerates no specific threats to the validity of its own results.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what the results do NOT show; future extensions are promised rather than current scope being bounded.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed from IBM, Stanford HAI affiliates (Ant Financial, Facebook, Google, VMware), Cisco, SAP, and NSF grant CNS-1651570.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All seven authors are affiliated with Stanford University, clearly disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "ColBERTv2, the retrieval model central to every experiment, was developed by the same Stanford group — Khattab, Santhanam, and Zaharia co-author both this paper and ColBERTv2, creating a direct conflict of interest.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms including 'in-context learning,' 'retrieval-augmented,' 'frozen LM/RM,' and each DSP stage (DEMONSTRATE, SEARCH, PREDICT) are defined precisely in §2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly enumerates four contributions: arguing for task-aware strategies, showing they can be expressed as short programs, demonstrating the power of composability, and establishing state-of-the-art in-context learning results on three tasks.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper extensively engages with prior work on retrieval-augmented NLP, multi-hop QA, and bootstrapping methods throughout the text, situating DSP relative to specific systems rather than merely listing citations.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code is released at https://github.com/stanfordnlp/dsp as stated in the abstract.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All three evaluation datasets (Open-SQuAD, HotPotQA, QReCC) are standard publicly available benchmarks used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or specific dependency specifications are provided; the implementation language (Python) is implied by code snippets but no environment is specified.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Code examples are provided as pseudocode in the paper, but step-by-step instructions for reproducing the exact experimental results (indices, seeds, scripts) are not included.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 1 reports single values with no confidence intervals or error bars, despite results being averages over five seeds.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are performed for any of the comparative claims made throughout the paper.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Relative percentage gains are reported (37-120% vs vanilla LM, etc.) with baseline context provided, which constitutes effect size reporting.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 1000 questions (or 400 conversations) subsampled across 5 seeds is stated but not justified or supported by power analysis.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Only mean values across five seeds appear in Table 1; no standard deviations or variance measures are reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Three baselines are compared: vanilla LM, retrieve-then-read, and self-ask, plus SoTA results from concurrent work cited from related papers.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include contemporaneous systems (self-ask from Press et al. 2022, Si et al. 2022, Yao et al. 2022 ReAct) collected as of mid-December 2022.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No systematic ablation study is conducted to isolate the contribution of individual DSP components (DEMONSTRATE vs. SEARCH vs. PREDICT); only full-system comparisons are provided.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both EM and F1 are reported for Open-SQuAD and HotPotQA; F1 and novel-F1 (nF1) are reported for QReCC.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable; the paper evaluates on automatic QA metrics (EM, F1) on standard benchmarks.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "The paper explicitly states 'we report the validation set accuracy on all three datasets'; test set evaluation is deferred to 'a future version of this report.'", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down across three task types (open-domain QA, multi-hop QA, conversational QA), providing per-task performance analysis.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "The paper discusses self-ask's 'self-distraction' failure mode with one example, but does not systematically discuss DSP's own failure cases or error analysis.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "The paper does not report negative results about DSP programs; all DSP results are presented positively relative to baselines, with no cases where DSP fails or underperforms reported.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-3.5 is identified as text-davinci-002 and ColBERTv2 is specified with its source paper; Wikipedia corpus versions are dated (Dec 2016, Nov 2017, Dec 2018 dumps).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The paper shows prompt structures with placeholders (e.g., '{Task demonstrations from x.demos, if any}') but does not provide the actual verbatim prompt templates used in experiments.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Key hyperparameters are reported: temperature t=0.7 for n>1, greedy decoding for n=1, k=7 passages for open-domain QA, n=20 candidates for self-consistency, k=5 for multi-hop retrieval.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The DSP framework scaffolding is described in detail with code snippets showing DEMONSTRATE, SEARCH, and PREDICT stages, their APIs, and interactions.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing is documented: specific Wikipedia corpus dates are used, QReCC filtering criteria are stated (removing empty answers, short conversations, 'other interesting' keyword conversations), and HotPotQA 'hard' example filtering is described.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw experimental outputs (model responses, retrieved passages, intermediate predictions across seeds) are not released; only the DSP code library is made available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The paper describes which datasets and Wikipedia corpora are used with specific splits, corpus dates, and the 5-seed, 200-questions-per-seed evaluation protocol.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No participant recruitment is applicable; the paper uses standard public benchmarks.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from input question to final answer is documented through code examples, and the evaluation protocol (5 seeds, 200 questions per seed, averaging) is described.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The training data cutoff for GPT-3.5 (text-davinci-002) is not stated, despite evaluating on benchmarks like SQuAD (2016) and HotPotQA (2018) that predate the model.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Potential contamination of GPT-3.5's training data with evaluation benchmark content (SQuAD, HotPotQA, QReCC) is not discussed anywhere in the paper.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The evaluation benchmarks (SQuAD 2016, HotPotQA 2018) predate GPT-3.5's likely training cutoff and probably appear in its training data, but this is not acknowledged or addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The paper mentions controlling 'language model API spending budget' but reports no actual inference costs, API call counts, or latency figures.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget (compute hours, API costs, number of API calls) is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DSP programs achieve 37-120% relative gains over vanilla LM baselines across three knowledge-intensive QA tasks", + "evidence": "Table 1: DSP at 36.6/49.0 EM/F1 vs 16.2/25.6 on Open-SQuAD; 51.4/62.9 vs 28.3/36.4 on HotPotQA; 35.0/25.3 vs 29.8/18.4 on QReCC", + "supported": "strong" + }, + { + "claim": "DSP programs achieve 8-39% relative gains over retrieve-then-read pipelines", + "evidence": "Table 1: DSP outperforms retrieve-then-read on all three tasks (36.6 vs 33.8 EM on SQuAD; 51.4 vs 36.9 EM on HotPotQA; 35.0 vs 31.6 F1 on QReCC)", + "supported": "strong" + }, + { + "claim": "DSP programs achieve 80-290% relative gains over the self-ask pipeline", + "evidence": "Table 1: self-ask achieves 9.3/17.2 EM/F1 on Open-SQuAD (vs DSP 36.6/49.0) and 28.6/37.3 on HotPotQA (vs DSP 51.4/62.9)", + "supported": "moderate" + }, + { + "claim": "DSP achieves state-of-the-art in-context learning on HotPotQA at 51.4% EM as of December 2022", + "evidence": "Table 1 comparison with concurrent work: Wang et al. 33.8%, Sun et al. 26.5%, Yao et al. (ReAct) 35.1% — all below DSP's 51.4%", + "supported": "moderate" + }, + { + "claim": "DSP matches competitive fine-tuned systems on Open-SQuAD without any fine-tuning", + "evidence": "DSP achieves 36.6% EM vs DPR 29.8% and FiD-base ~36% (with 5 passages); however FiD with 100 passages reaches 48%, substantially higher", + "supported": "moderate" + }, + { + "claim": "DEMONSTRATE stage enables automatic bootstrapping of pipeline annotations from end-task labels without labeling intermediate steps", + "evidence": "The annotate function is described with code but no ablation is conducted to show contribution vs. using fixed hand-written demonstrations", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "The DSP framework for retrieval-augmented in-context learning substantially outperforms vanilla LMs and retrieve-then-read pipelines on three QA benchmarks using frozen GPT-3.5 and ColBERTv2, achieving 51.4% EM on HotPotQA versus 35.1% for the next-best contemporary approach. The key innovation is composable programmatic pipelines that bootstrap demonstrations automatically, execute multi-hop retrieval, and aggregate evidence across retrieved passages. However, all results are on validation sets only, no ablations isolate component contributions, no variance is reported across the five evaluation seeds, and the most dramatic gain claims (80-290% over self-ask) rely on a baseline that performs surprisingly poorly even at its intended task.", + "red_flags": [ + { + "flag": "Validation-only evaluation", + "detail": "All results are on validation sets; test set evaluation is explicitly deferred to 'a future version of this report,' creating risk of implicit overfitting during program development." + }, + { + "flag": "Self-evaluation of own retrieval system", + "detail": "ColBERTv2, the only RM used throughout all experiments, is co-authored by Khattab, Santhanam, and Zaharia — the same team — creating a direct conflict of interest undisclosed as such." + }, + { + "flag": "No ablation study", + "detail": "Three interacting components are proposed (DEMONSTRATE, SEARCH, PREDICT) but are never ablated individually; gains cannot be attributed to any specific innovation." + }, + { + "flag": "No variance or significance testing", + "detail": "Results averaged over five seeds are reported as single-point numbers with no standard deviations, confidence intervals, or significance tests despite the stochastic nature of sampling-based generation." + }, + { + "flag": "Weak self-ask baseline inflating headline gains", + "detail": "Self-ask achieves only 9.3% EM on Open-SQuAD — worse than even retrieve-then-read (33.8%) — suggesting it is not a valid baseline for that task, making the 80-290% relative gain claim misleading." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "SQuAD (2016) and HotPotQA (2018) predate GPT-3.5's training; no contamination analysis or acknowledgment is provided." + } + ], + "cited_papers": [ + { + "title": "Language Models are Few-Shot Learners", + "relevance": "Foundational GPT-3 paper establishing in-context learning paradigm that DSP extends with retrieval and programmatic pipelines" + }, + { + "title": "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction", + "relevance": "The retrieval model used throughout all DSP experiments; co-authored by the same team" + }, + { + "title": "Measuring and Narrowing the Compositionality Gap in Language Models (self-ask)", + "relevance": "Primary contemporaneous baseline that DSP outperforms significantly; motivates the multi-hop retrieval design" + }, + { + "title": "Self-consistency improves chain of thought reasoning in language models", + "relevance": "Self-consistency technique directly incorporated into DSP's PREDICT stage for answer selection" + }, + { + "title": "Chain of thought prompting elicits reasoning in large language models", + "relevance": "Chain-of-thought reasoning integrated into DSP's PREDICT stage prompts" + }, + { + "title": "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", + "relevance": "Foundational RAG paper that DSP extends with more sophisticated programmatic LM-RM interaction" + }, + { + "title": "STaR: Bootstrapping Reasoning with Reasoning", + "relevance": "Related bootstrapping approach that DSP generalizes to complex multi-stage pipelines" + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "Contemporaneous approach combining LM reasoning with retrieval actions; cited as concurrent SoTA on HotPotQA" + }, + { + "title": "Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval", + "relevance": "Prior fine-tuned multi-hop system by the same first author that DSP aims to replicate with in-context learning" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "DSP directly evolved into the DSPy library, now widely used by practitioners for building and auto-optimizing LM pipelines." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the dominant 'retrieve-then-read' paradigm by showing that programmatic LM-RM coordination with bootstrapped demonstrations substantially outperforms simple retrieval augmentation." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns are raised; the paper focuses purely on QA accuracy improvements." + }, + "drama_conflict": { + "score": 1, + "justification": "Directly positions itself against self-ask, arguing DSP's modularity avoids the 'self-distraction' problem of LM-controlled pipelines." + }, + "demo_ability": { + "score": 3, + "justification": "Code released at github.com/stanfordnlp/dsp with clear API examples and code snippets in the paper; practitioners can immediately build their own DSP programs." + }, + "brand_recognition": { + "score": 3, + "justification": "Stanford DAWN project; authors include Percy Liang and Matei Zaharia (prominent in ML and systems communities) and Omar Khattab (ColBERT, DSPy author)." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "34178437", + "title": "Cramming: Training a Language Model on a Single GPU in One Day", + "points": 6, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34178437", + "created_at": "2022-12-29T21:44:44Z" + }, + { + "hn_id": "42209577", + "title": "Cramming: Training a Language Model on a Single GPU in One Day", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42209577", + "created_at": "2024-11-21T23:01:03Z" + }, + { + "hn_id": "34232125", + "title": "Cramming: Training a Language Model on a Single GPU in One Day", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34232125", + "created_at": "2023-01-03T14:57:13Z" + }, + { + "hn_id": "34570488", + "title": "Training a Language Model on a Single GPU in One Day", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=34570488", + "created_at": "2023-01-29T17:40:13Z" + }, + { + "hn_id": "39968113", + "title": "Cramming: Training a Language Model on a Single GPU in One Day (2022)", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39968113", + "created_at": "2024-04-08T10:09:08Z" + }, + { + "hn_id": "34338363", + "title": "Cramming: Training a Language Model on a Single GPU in One Day", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34338363", + "created_at": "2023-01-11T13:47:58Z" + }, + { + "hn_id": "42656632", + "title": "Show HN: We collected detailed annotations for text-to-image generation", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42656632", + "created_at": "2025-01-10T15:47:29Z" + }, + { + "hn_id": "33522332", + "title": "Championship Simulator: Architectural Simulation for Education and Competition", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=33522332", + "created_at": "2022-11-08T18:21:42Z" + } + ], + "top_points": 6, + "total_points": 21, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/deployabilitycentric-infrastructureascode-generation-2025/scan-v5.json b/papers/deployabilitycentric-infrastructureascode-generation-2025/scan-v5.json @@ -0,0 +1,497 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework", + "authors": [ + "Tianyi Zhang", + "Shidong Pan", + "Zejun Zhang", + "Zhenchang Xing", + "Xiaoyu Sun" + ], + "year": 2025, + "venue": "FSE (submitted)", + "arxiv_id": "2506.05623", + "doi": "10.48550/arXiv.2506.05623" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are verifiable in the paper: the 20.8–30.2% first-attempt success rates match Table 2, 54.6–91.6% passItr@10 matches Table 2, >90% passItr@25 with human feedback matches Section 6.3, 25.2% intent coverage and 8.4% filtered compliance match Tables 4 and 5.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about conversation history reducing error recurrence are supported by an ablation study comparing IaCGen with and without conversation history on Claude-3.5 (Fig. 7), showing 15.9% reduction in required iterations.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper scopes its main claims to AWS CloudFormation and explicitly notes in the threats section that highly specialized configurations may not be captured; Terraform generalizability is tested only with Claude-3.5 on syntax validation, which is clearly stated as a limitation.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss alternative explanations for the main result that iterative feedback improves deployment success — for example, whether more LLM calls alone (without structured feedback) would produce similar gains.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly distinguishes syntactic correctness from deployability and argues deployability is the more meaningful measure; it separately reports policy-level compliance (75.3%) versus template-level compliance (8.4%), clearly distinguishing the two.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7.4 is a dedicated 'Threats to Validity' section covering multiple specific concerns about model versions, benchmark coverage, and language scope.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Threats include specific statements such as limiting to 153 scenarios across 58 AWS services, the gap between CloudFormation and Terraform evaluation depth, and the time-bound nature of model evaluation at time of writing.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it focuses on AWS CloudFormation (not other IaC tools), uses 153 benchmark scenarios, and that highly specialized configurations may not be captured.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment or disclosure appears anywhere in the provided paper text.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the title page: ANU, NYU/Columbia, NTU, CSIRO's Data61.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms including IaC, IaC templates, resources, parameters, deployability, and the novel passItr@n metric are all explicitly defined in the paper.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The contributions are explicitly enumerated: DPIaC-Eval benchmark, IaCGen framework, and empirical evidence about model performance across multiple quality dimensions.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 3.3 directly compares DPIaC-Eval to the prior IaC-Eval benchmark, and Section 8 situates the work relative to feedback mechanisms and LLM-based IaC generation literature, explaining how each prior approach falls short.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The replication package including the IaCGen code is available at https://github.com/Tianyi2/IaCGen, explicitly stated in the Data Availability section.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The Data folder in the replication package contains the DPIaC-Eval benchmark, as stated in the Data Availability section.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions tools used (boto3, yamllint, cfn-linter, Checkov) but provides no requirements.txt, Dockerfile, or equivalent environment specification; details are deferred to a README in the replication package without confirmation of completeness.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper defers reproduction details to the replication package README but provides no step-by-step instructions in the paper itself; the paper text only describes the workflow at a high level.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 2, 4, and 5 are reported as single percentage values with no confidence intervals or error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to comparative claims (e.g., Claude-3.5 91.6% vs GPT-4o 54.6% at passItr@10).", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are reported as percentage improvements (e.g., 'near 200% performance improvement' from passItr@1 to passItr@15, 15.9% reduction in iterations with conversation history).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The benchmark size of 153 scenarios is described by its construction process but not statistically justified; no power analysis or sample size rationale is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "All results are reported as single point estimates; no variance, standard deviation, or confidence intervals across multiple runs are reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Baselines include pass@1 performance without iterative feedback and a conversation-history ablation comparing IaCGen to providing only the latest error without history.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All six evaluated models (GPT-4o, GPT-o3-mini, Claude-3.5, Claude-3.7, DeepSeek-R1, DeepSeek-V3) are current state-of-the-art models.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "An ablation study comparing IaCGen with and without complete conversation history is conducted using Claude-3.5 (Fig. 7), showing the contribution of the conversation history component.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The paper uses passItr@n for deployability, resource/attribute-level intent matching, and three security compliance metrics (policy pass rate, unfiltered compliance, filtered compliance).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Human-in-the-loop feedback from a cloud engineer is evaluated in RQ3, and a DevOps expert manually crafted intent specifications for 51 benchmark samples for user intent matching evaluation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "DPIaC-Eval serves as the held-out test set; LLMs are not fine-tuned on any portion of it and are evaluated zero-shot.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by difficulty level (Fig. 4), error stage (Fig. 8), error type (Table 3), and per-model performance across all metrics (Tables 2, 4, 5).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6.2 analyzes five specific error categories (Missing Value, Self-defined Property, Null Substitution, Unnecessary Whitespace, Arbitrary Default Value) with per-model failure counts and root cause analysis.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper honestly reports negative findings: only 8.4% filtered security compliance, only 25.2% user intent satisfaction, GPT-4o's substantially lower performance (55.2% vs Claude's 95.5% at passItr@15).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Model names such as 'Claude-3.5', 'Claude-3.7', 'GPT-4o', 'GPT-o3-mini' are used without specifying exact version identifiers or snapshot dates; the paper only promises these details are in the replication package.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Full prompts are not included in the paper; the system prompt structure is described but actual prompt text is deferred to the code repository.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature is set to 0 and maximum output token limit of 8,000 is explicitly stated for all model evaluations.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The IaCGen framework is described in detail in Section 4, including the three validation stages (format verification, syntax checking, live deployment) and the feedback allocation strategy (2 general + 4 detailed attempts per stage).", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The benchmark construction pipeline is documented with specific filtering steps and template counts at each stage: 900→850 (size filtering)→465 (syntax check)→200 (deployment test)→153 (rectification), shown in Fig. 2.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The DPIaC-Eval benchmark (153 templates and prompts) is available in the replication package's Data folder.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3.1 describes template sources (AWS documentation, AWS Samples GitHub, GitHub repositories using CloudFormation), ethical licensing checks (MIT, Apache 2.0), and the multi-stage preprocessing pipeline.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited as study subjects; DevOps practitioners were used for benchmark construction but not as experimental participants.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The complete data pipeline from collection to final benchmark is documented in Section 3.1 and illustrated in Fig. 2, including filtering criteria and counts at each stage.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for the six LLMs are not stated in the paper; the paper only mentions these will be documented in the replication package.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The DPIaC-Eval templates were sourced from publicly available GitHub repositories and AWS documentation that predate the LLMs' training cutoffs; potential overlap is never discussed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The benchmark templates are from public GitHub repositories and AWS sample libraries that were almost certainly available before the LLMs' training cutoffs; the paper does not address this contamination risk.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants as experimental subjects; DevOps practitioners were used only for benchmark construction.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects research; ethics mentions relate only to IP licensing of templates.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants as experimental subjects.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants as experimental subjects.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants as experimental subjects.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants as experimental subjects.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants as experimental subjects.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Per-template costs are reported: Claude-3.7-Sonnet $0.42 (most expensive), DeepSeek-V3 $0.04 (cheapest), AWS deployment $0.04 per deployable template.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Total study costs are explicitly stated: $230.75 for LLM API tokens and $35.21 for AWS deployment validation.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Six state-of-the-art LLMs achieve only 20.8–30.2% deployment success rate on the first attempt at IaC template generation.", + "evidence": "Table 2 shows passItr@1 results: GPT-4o 22.7%, GPT-o3-mini 20.8%, Claude-3.5 30.2%, Claude-3.7 26.8%, DeepSeek-R1 22.9%, DeepSeek-V3 24.2%.", + "supported": "strong" + }, + { + "claim": "IaCGen achieves 54.6–91.6% deployment success in 10 iterations across all evaluated models.", + "evidence": "Table 2 shows passItr@10: GPT-4o 54.6%, GPT-o3-mini 66.2%, Claude-3.5 91.6%, Claude-3.7 86.9%, DeepSeek-R1 68.0%, DeepSeek-V3 56.9%.", + "supported": "strong" + }, + { + "claim": "Maintaining complete conversation history reduces required iterations by 15.9% compared to providing only the most recent error.", + "evidence": "Ablation study (Fig. 7) on Claude-3.5 shows IaCGen averages 4.55 iterations vs. baseline's 5.41 iterations to achieve deployable templates.", + "supported": "moderate" + }, + { + "claim": "Human-in-the-loop feedback enables all six models to exceed 90% passItr@25.", + "evidence": "Section 6.3 and Fig. 9 show all models surpass 90% passItr@25 with human feedback; Claude models reach 98%.", + "supported": "strong" + }, + { + "claim": "Only 25.2% of generated IaC templates fully satisfy user intent at both resource and attribute level.", + "evidence": "Table 4 shows average resource-level matching of 58.8%, attribute-level 40.5%, and combined Resource & Attribute only 25.2% across all models.", + "supported": "moderate" + }, + { + "claim": "Only 8.4% of generated deployable templates achieve full security compliance when filtered for applicable policies.", + "evidence": "Table 5 shows filtered compliance rates ranging from 6.1% (GPT-4o) to 11.5% (DeepSeek-V3), averaging 8.4%.", + "supported": "moderate" + }, + { + "claim": "IaCGen generalizes to Terraform, achieving 100% passItr@7 syntax accuracy with Claude-3.5 on IaC-Eval benchmark.", + "evidence": "Section 6.1 reports 79.7% passItr@1 and 100% passItr@7 on IaC-Eval Terraform benchmark with an average of 1.58 iterations.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "Current LLMs are poor at generating deployable AWS CloudFormation templates with only 20.8–30.2% first-attempt success, despite reasonable syntactic correctness. The IaCGen iterative feedback framework dramatically improves this to 54.6–91.6% within 10 iterations by simulating real DevOps workflows with progressive validation stages. Security compliance of generated templates is alarmingly low at 8.4% filtered compliance, and user intent matching is weak at 25.2% combined resource-and-attribute satisfaction, indicating that deployability is necessary but far from sufficient for practical utility. Maintaining complete conversation history is more effective than isolated-feedback approaches, as it prevents 'Error Recurrence' where LLMs reintroduce previously corrected mistakes.", + "red_flags": [ + { + "flag": "No statistical significance testing", + "detail": "All comparative claims between models and conditions are made without statistical tests, despite clear numerical differences that require significance assessment." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "DPIaC-Eval templates were sourced from public GitHub repositories and AWS documentation that predate the LLMs' training cutoffs; potential memorization of test templates is never discussed." + }, + { + "flag": "User intent evaluation on 51/153 samples", + "detail": "The intent matching evaluation (RQ4) uses only 51 randomly sampled instances from the 153-template benchmark, reducing statistical power for this important finding." + }, + { + "flag": "Vague model version identifiers", + "detail": "Model names like 'Claude-3.5' and 'Claude-3.7' are not fully specified in the paper; exact version/snapshot identifiers are deferred to the replication package only." + }, + { + "flag": "Single-run results, no variance", + "detail": "All results appear to be from single evaluation runs with no variance reported across runs, despite using temperature=0 which only partially addresses stochasticity." + }, + { + "flag": "Terraform generalizability underpowered", + "detail": "Terraform generalizability is tested with only Claude-3.5 and only measures syntax correctness (not deployability), making the generalizability claim much weaker than presented." + } + ], + "cited_papers": [ + { + "title": "IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs", + "relevance": "Primary prior benchmark for LLM IaC generation; DPIaC-Eval is directly compared and extended from this work." + }, + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Standard code generation benchmark used as reference point; IaC success rates (19–30%) are contrasted with HumanEval rates (~95%)." + }, + { + "title": "Teaching Large Language Models to Self-Debug", + "relevance": "Related feedback mechanism approach for code generation; IaCGen extends this concept to IaC with multi-stage deployment feedback." + }, + { + "title": "Self-Refine: Iterative Refinement with Self-Feedback", + "relevance": "Foundational iterative refinement approach that IaCGen builds upon, extending to deployment-validated IaC generation." + }, + { + "title": "Using a Feedback Loop for LLM-based Infrastructure as Code Generation", + "relevance": "Most closely related prior work; IaCGen improves upon it by preserving conversation history and including live deployment validation." + }, + { + "title": "RepoCoder: Repository-Level Code Completion through Iterative Retrieval and Generation", + "relevance": "Related iterative feedback approach for code generation that only provides immediate error messages, contrasted with IaCGen's full conversation history approach." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses a pain point for DevOps practitioners — automating CloudFormation template generation with a working framework and public replication package." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that syntactic correctness is nearly useless as an IaC quality metric (42.7% of syntactically valid templates fail deployment) challenges how the field has been evaluating LLMs for IaC." + }, + "fear_safety": { + "score": 1, + "justification": "The 8.4% security compliance finding is concerning for cloud security practitioners but is framed as a research gap rather than an imminent risk." + }, + "drama_conflict": { + "score": 1, + "justification": "Claude vs GPT comparison shows dramatic performance difference (95.5% vs 55.2% passItr@15) that practitioners will notice, but framing is academic rather than dramatic." + }, + "demo_ability": { + "score": 2, + "justification": "Code is publicly available on GitHub and the framework can be run against the DPIaC-Eval benchmark, though it requires AWS account setup and API keys." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are from ANU, NTU, and CSIRO — established institutions but not AI lab brand names; venue is FSE, a respected but not top-tier AI conference." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/deputydev-ai-powered-2025/scan-v5.json b/papers/deputydev-ai-powered-2025/scan-v5.json @@ -0,0 +1,628 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DeputyDev - AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity", + "authors": [ + "Vishal Khare", + "V. Saini", + "Deepak Sharma", + "Anand Kumar", + "Ankit Rana" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2508.09676", + "doi": "10.48550/arXiv.2508.09676" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "Abstract claims 'statistically significant reduction' but paper reports only percentages (17-29%), no p-values or confidence intervals. Claims about rollout and SaaS availability are stated without evidence.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper claims DeputyDev 'causes' time savings but lacks statistical significance testing (no p-values). Aggressive filtering (removing size outliers, requiring balanced repos) creates selection bias. No alternative explanations discussed.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Title ('Breaking the Code Review Logjam') makes broad claims but study evaluates only TATA 1mg, one organization, one version control system, 30-day window. No discussion of whether results apply to other teams, company sizes, or contexts.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper explores only one explanation (DeputyDev helped). No discussion of: learning effects over time, reviewer quality variation, selection effects, or whether faster reviews correlate with code quality.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Title claims 'developer productivity' improvement, but only measures 'review time' (hours). These are not equivalent—faster review could mean lower quality. Paper conflates them without justification.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Conclusion only states findings are valuable, not what they fail to show. Filtering methodology is presented as design choice, not limitation.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss specific threats like: 30-day window representativeness, selection bias from filtering, Hawthorne effect, reviewer expertise variation, or vendor lock-in to OpenAI/Anthropic.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope is implicit in experiment design but not explicitly bounded. No statement of what results do NOT show: generalizability to other orgs, long-term effects, code quality impact, or other LLM vendors.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding section. Paper neither discloses nor disclaims funding sources. Internal TATA 1mg research funding is implicit but unstated.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors listed as 'TATA 1mg Healthcare Solutions Private Limited' employees. Affiliations are disclosed in author line.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "TATA 1mg (implicit funder) directly benefits from positive evaluation. Company develops and deploys DeputyDev as SaaS product. Funder is not independent of the outcome being evaluated.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement. Authors are full-time employees of the company commercializing DeputyDev as a SaaS product, creating direct financial interest in positive findings.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "Key term 'productivity' used in title but never formally defined. Authors measure 'review time' as proxy without justifying equivalence. 'Statistically significant' claimed without p-values. 'Contextual AI' described only through implementation.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": false, + "justification": "Paper does not explicitly state its research question or contribution. Unclear whether it's a tool paper, empirical finding, or methodological advance. Title is a business claim ('breaking the logjam') not a research statement.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "Related work cited (Tufano et al., Hong et al.) but not engaged with. No comparison of approach to prior code review automation. Section 6.2 quotes Andrew Ng but doesn't position against other agentic work. No related-work section.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code release. DeputyDev is a proprietary SaaS product. No GitHub, artifact repository, or release plan mentioned.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Experimental data (721 PRs, review times) not released. No dataset URL or availability statement. Tables 2-3 show aggregated results only.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Models named (GPT-4o, Claude 3.5 Sonnet) but no exact versions/snapshots. Integrations mentioned (Bitbucket, Jira, Confluence) but no deployment specs, requirements.txt, Dockerfile, or setup instructions.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No instructions for reproducing the evaluation or deploying DeputyDev. Appendix C shows mean/median formulas (standard definitions), not analysis steps. Cannot reproduce from this paper.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 2 reports point estimates only: avg review times, per-LOC times, medians. Zero confidence intervals, error bars, or variance bounds reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Paper claims 'statistically significant reduction' multiple times but reports zero p-values, t-tests, or hypothesis tests. Test vs control groups compared via percentage difference only.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Percentage improvements reported (28.82%, 42.19%, etc.) but no formal effect sizes (Cohen's d, Hedges' g). Baseline context missing for some metrics (e.g., median review time baseline not clearly stated).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No power analysis. Paper mentions 'over 200 engineers' but analyzes 721 PRs. Why 30 days? Why these sample sizes? No justification provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Table 2 reports means and medians but no standard deviations, ranges, or quantiles. Figure 4 shows distributions visually but no numeric variance reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Paper includes two control groups (CS1, CS2) without treatment. Test set compared to both controls. However, no comparison to other code review tools or LLM-based systems.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "No external baselines compared (other code review tools, other LLM approaches). Only internal control groups. Therefore N/A for 'contemporary'.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "DeputyDev uses 6 agents, semantic search, AST chunking, reflection, blending engine. No ablation showing which components contribute to improvements. Cannot isolate component effects.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics reported: avg review time/PR, avg time/LOC, median time, breakdown by PR size. Though correlated, these provide multiple lenses on the outcome.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "No human raters evaluated AI-generated review quality. Appendix B shows example reviews but no systematic quality assessment. Only machine metric (review time) evaluated.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "Experiment uses concurrent A/B testing (test set concurrent with controls over 30 days) not a train/test split. Not a traditional held-out evaluation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 breaks down results by PR size category (Small 0-50 LOC, Medium 51-100, Large 101-200, XL 201-500). Shows differential effects by category.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Table 3 shows mixed/negative results (XL category shows 100.30% time INCREASE vs CS1) but no detailed analysis of why. Section 10.4 offers fixed-costs explanation but no specific failure examples.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "Table 3 shows concerning result for extra-large PRs (201-500 LOC): time actually increased 100.30% vs CS1. Presented without detailed discussion or implications.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Model names given (GPT-4o, Claude 3.5 Sonnet) but no exact versions/snapshots. For example, Claude 3.5 Sonnet receives updates—which version? No commit hashes or release dates.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Appendix A shows XML output structure of agent responses, not the input prompts. Paper describes agent roles (Security, Code Communication, etc.) but actual prompts not provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, max_tokens, or other LLM hyperparameters reported. No statement about defaults vs tuning.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Agentic scaffolding well-documented: 6 agents (Security, Code Communication, Performance, Maintainability, Errors, Business Logic), multi-agent pattern, reflection pattern, blending engine with dimensions. Section 6.5 includes mathematical formulation.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing steps documented: AST creation, semantic chunking, lexical+semantic search union, repo filtering (≥10 PRs/set), PR size filtering (remove top 25%, bottom 10%). Context assembly from multiple sources described.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw PR data (individual review times for 721 PRs) not available. Only aggregated tables (Table 2-3) and distribution plots provided. No appendix with raw values.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": false, + "justification": "Collection via Bitbucket webhook, 30-day window stated. But allocation mechanism unclear: was treatment randomized per PR? Per developer? Per repo? No specification of how the 33% split was enforced.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Not a human subjects study. This is observational data collection from normal engineering workflow. No recruitment of participants.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Implementation pipeline (PR → context → agents → blend → results) is described. But analysis pipeline incomplete: how were 721 PRs selected from 30-day corpus? How were CS1 and CS2 created? Step-by-step process not documented.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not evaluating LLM benchmark performance. This measures tool's effect on code review workflow, not LLM capability on benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable to this evaluation type.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable to this evaluation type.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects (observational PR data only). Pre-registration N/A.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects involved. Company telemetry on its own engineers' workflow (likely covered by internal policies, not published).", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "N/A. No human subject demographics to report.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "N/A. No human subjects.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "N/A. No human subjects (though PR assignment to test/control groups may have been randomized—not explicitly stated).", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "N/A. No human subjects.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "N/A. No human subjects.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost data provided. Paper mentions cost as a reason to not use entire codebase (section 5) but reports zero actual costs or cost estimates.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget, API costs, or infrastructure costs disclosed. No statement of cost per PR review or monthly operational cost.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DeputyDev reduces code review time by 17-29% in average per-PR time and 38-42% in per-LOC time", + "evidence": "Table 2 shows test set 17.36% reduction vs CS1 and 28.82% vs CS2 in avg time; 42.19% and 38.98% reduction in time/LOC", + "supported": "moderate" + }, + { + "claim": "DeputyDev is most effective for small pull requests (0-50 LOC)", + "evidence": "Table 3 shows 43.87% reduction in per-LOC time for S category vs diminishing returns for larger PRs", + "supported": "moderate" + }, + { + "claim": "Code review time is weakly correlated with lines of code changed (r=0.004-0.095)", + "evidence": "Figure 5 shows correlation coefficients across all three sets, contradicting intuition", + "supported": "strong" + }, + { + "claim": "Multi-agent agentic workflow with reflection improves code review quality", + "evidence": "Appendix B shows example reviews; paper claims agents handle 6 aspects (security, communication, performance, maintainability, errors, business logic)", + "supported": "weak" + }, + { + "claim": "DeputyDev improves developer productivity by reducing context-switching delays", + "evidence": "Abstract and intro claim productivity gains, citing 23-min interruption cost from prior work. Measured via review-time reduction.", + "supported": "weak" + }, + { + "claim": "Contextual code understanding via AST parsing and semantic search produces better reviews", + "evidence": "Section 5-6 describes context assembly, section 6.1 details lexical+semantic search union. No ablation or comparison.", + "supported": "weak" + }, + { + "claim": "DeputyDev reduces median code review time by 46-47%", + "evidence": "Table 2 shows median reduction from 0.76-0.78 hours to 0.41 hours (47-46% decrease)", + "supported": "strong" + } + ], + "methodology_tags": [ + "observational", + "case-study", + "a-b-test" + ], + "key_findings": "DeputyDev, a multi-agent LLM-based code review tool, reduced average review time by 17–28% and median review time by 46–47% in a 30-day A/B test at TATA 1mg covering 721 pull requests. Effectiveness is inversely proportional to PR size: 43.87% per-LOC time reduction for small PRs (0–50 LOC) versus mixed results for extra-large PRs (201–500 LOC). The analysis found a weak correlation (0.004–0.095) between code volume and review duration, suggesting complexity rather than quantity drives review time. However, the study lacks statistical significance testing, human evaluation of review quality, and external validation.", + "red_flags": [ + { + "flag": "No statistical significance testing", + "detail": "Paper repeatedly claims 'statistically significant' results but reports zero p-values, confidence intervals, or hypothesis tests. Only percentage differences shown." + }, + { + "flag": "Self-interested internal evaluation", + "detail": "All authors are TATA 1mg employees evaluating TATA 1mg's product. No independent third-party validation. Company commercializes tool as SaaS, creating direct financial interest in positive findings." + }, + { + "flag": "No conflicts of interest statement", + "detail": "Missing explicit declaration that authors benefit from positive evaluation outcomes and control the evaluation methodology." + }, + { + "flag": "Aggressive filtering creates selection bias", + "detail": "Removed top 25% and bottom 10% of PRs by size, required balanced repositories. Creates a selected subset that may not be representative of real-world code review." + }, + { + "flag": "No code quality evaluation", + "detail": "Only measured review time (machine metric). Zero human raters assessing whether AI-generated reviews are actually helpful or correct. Appendix B shows examples only." + }, + { + "flag": "Productivity claim unsubstantiated", + "detail": "Title claims 'boost developer productivity' but only measures 'review time'. These are not equivalent—faster review could signal lower quality. No evidence for productivity claim." + }, + { + "flag": "No ablation study", + "detail": "System uses semantic search, AST chunking, 6 agents, reflection, blending engine. Cannot isolate which components actually contribute to improvements." + }, + { + "flag": "Prompts and LLM versions not disclosed", + "detail": "Model names provided (GPT-4o, Claude 3.5 Sonnet) but no exact versions/snapshots. Actual prompts fed to LLM agents not provided—critical for reproducibility." + }, + { + "flag": "Confounded control groups", + "detail": "Paper defines Control Set 1 and Control Set 2 but never explains the difference between them or why two controls are needed. Unclear what is actually being measured." + }, + { + "flag": "Short evaluation window without long-term data", + "detail": "Only 30 days of data (July 27 – Aug 27, 2024). No discussion of whether time savings persist, whether reviewers' behavior stabilizes, or if effects decay over time." + }, + { + "flag": "Single-organization study with no generalization evidence", + "detail": "Evaluated only at TATA 1mg on Bitbucket. No evidence this works for other companies, team sizes, code languages, or VCS platforms. Title overgeneralizes ('Breaking the Code Review Logjam')." + }, + { + "flag": "Concerning result for large PRs hidden in table", + "detail": "Table 3 shows extra-large PRs (201-500 LOC) had 100.30% TIME INCREASE in test vs CS1—opposite of claimed benefit. Presented without analysis or discussion of implications." + }, + { + "flag": "No data or code release for reproducibility", + "detail": "Raw PR data, code, and trained models not released. System is proprietary SaaS. Impossible for others to reproduce or validate findings." + } + ], + "cited_papers": [ + { + "title": "Code Time Report", + "authors": "software.com", + "year": 2024, + "relevance": "Motivation for study: cites 41 minutes/day on code review. Used to establish problem scope." + }, + { + "title": "CommentFinder: A Simpler, Faster, More Accurate Code Review Comments Recommendation", + "authors": "Hong et al.", + "year": 2022, + "relevance": "Prior automated code review work using NLP. Directly relevant comparison point." + }, + { + "title": "Code Review Automation: Strengths and Weaknesses of the State of the Art", + "authors": "Tufano, Dabić, Mastropaolo, Ciniselli, Bavota", + "year": 2024, + "relevance": "Systematic review of code review automation. Primary prior work in automated code review." + }, + { + "title": "Using Pre-trained Models to Boost Code Review Automation", + "authors": "Tufano, Masiero, Mastropaolo, Pascarella, Poshyvanyk, Bavota", + "year": 2022, + "relevance": "Earlier work on LLM-based code review automation. Methodological precedent." + }, + { + "title": "Agentic Design Patterns Part 5: Multi-Agent Collaboration", + "authors": "Andrew Ng", + "year": 2024, + "relevance": "Framework for multi-agent LLM systems. Theoretical foundation for DeputyDev's architecture." + }, + { + "title": "ChatDev: Communicative Agents for Software Development", + "authors": "Chen Qian, Wei Liu, Hongzhang Liu, et al.", + "year": 2024, + "relevance": "Multi-agent framework for software engineering tasks. Related agentic approach." + }, + { + "title": "The Cost of Interrupted Work: More Speed and Stress", + "authors": "Mark, Gudith, Klocke", + "year": 2008, + "relevance": "Foundational motivation: interruptions cause 23-minute context-switch cost. Cited as justification for review-time reduction mattering." + }, + { + "title": "Self-Refine: Iterative Refinement with Self-Feedback", + "authors": "Madaan, Tandon, Gupta, et al.", + "year": 2023, + "relevance": "Reflection pattern in LLMs. Methodological component of DeputyDev's approach." + }, + { + "title": "Introducing Structured Outputs in the API", + "authors": "OpenAI", + "year": 2024, + "relevance": "Technical capability for enforcing JSON output from LLMs. Implementation detail discussed in section 6.3." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Code review assistant has clear practical value for development teams, but results limited to one company; practitioners cannot generalize to their context." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Finding that AI assists code review is unsurprising. Weak LOC-time correlation is mildly interesting but cited as known phenomenon; no novel insights." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or alignment concerns raised. Tool is narrow code review assistant, not generally capable system. No risk discussion." + }, + "drama_conflict": { + "score": 1, + "justification": "Minimal conflict: company employees evaluating their own product with no independent verification. Potential for bias is real but not explored or disputed." + }, + "demo_ability": { + "score": 2, + "justification": "Available as SaaS product (in principle), but paper does not explain how to access it, try it, or what it costs. Requires Bitbucket VCS." + }, + "brand_recognition": { + "score": 1, + "justification": "TATA 1mg is major Indian healthcare company but not prominent in AI research. Authors are not well-known researchers. No prestigious institution affiliation." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "36965545", + "title": "Electronic Structure of LK-99", + "points": 551, + "comments": 432, + "url": "https://news.ycombinator.com/item?id=36965545" + }, + { + "hn_id": "44016621", + "title": "LLMs are more persuasive than incentivized human persuaders", + "points": 140, + "comments": 116, + "url": "https://news.ycombinator.com/item?id=44016621" + }, + { + "hn_id": "43075571", + "title": "ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs", + "points": 9, + "comments": 3, + "url": "https://news.ycombinator.com/item?id=43075571" + }, + { + "hn_id": "44211052", + "title": "Analog Foundation Models", + "points": 8, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44211052" + }, + { + "hn_id": "44009574", + "title": "Large Language Models Are More Persuasive Than Incentivized Human Persuaders", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44009574" + }, + { + "hn_id": "45241249", + "title": "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45241249" + }, + { + "hn_id": "45240847", + "title": "ButterflyQuant: Ultra-low-bit LLM Quantization", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45240847" + }, + { + "hn_id": "45228682", + "title": "The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45228682" + }, + { + "hn_id": "45343343", + "title": "The illusion of diminishing returns in LLM progress", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45343343" + }, + { + "hn_id": "43905563", + "title": "(How) Do reasoning models reason?", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43905563" + } + ], + "top_points": 551, + "total_points": 729, + "total_comments": 554 + } +} +\ No newline at end of file diff --git a/papers/derag-blackbox-adversarial-2025/scan-v5.json b/papers/derag-blackbox-adversarial-2025/scan-v5.json @@ -0,0 +1,514 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection", + "authors": ["Jerry Wang", "Fang Yu"], + "year": 2025, + "venue": "KDD Workshop on Prompt Optimization", + "arxiv_id": "2507.15042", + "doi": "10.48550/arXiv.2507.15042" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are backed by experimental results: DE vs. GGPP/PRADA comparisons in Tables 1-2, ≤5 token budgets confirmed, Welch's t-test for readability in Table 16, and AUROC 0.2023 for detector evasion in Table 4.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims such as 'early stopping cuts query cost by ~40%' are supported by controlled comparisons of DE variants across multiple datasets with consistent ablations (Figure 2, Table 2).", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The title and conclusion claim DeRAG attacks 'multiple RAG applications,' but experiments only cover BERT-base-uncased (dense) and BM25 (sparse) on 1,000-document subsets; modern embedding models and production-scale corpora are untested.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper attributes MS MARCO's high success to corpus redundancy but does not explore whether BERT-base-uncased is unusually vulnerable compared to modern retrievers, nor whether the artificially small corpus (1,000 docs) inflates success rates.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper distinguishes retrieval rank manipulation (Success@K) from downstream answer quality and validates the connection in Table 5, showing EM/F1/ROUGE-L/BERTScore degradation stratified by attack outcome.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the conclusion briefly mentions future defenses but does not enumerate limitations of the current work.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats to validity are discussed — e.g., the impact of corpus size (1,000 docs vs. production scale), retriever model choice, or the small query count (100) on result generalizability.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly bound what the results do NOT show — for example, whether attacks transfer to instruction-tuned embedding models, larger corpora, or API-based retrieval services.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is disclosed anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Both authors' affiliations (Department of Management Information Systems, National ChengChi University, Taipei, Taiwan) are disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is identified, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: RAG (Section 1), Differential Evolution (Section 2.2), dense/sparse retrievers (Section 2.4), and all evaluation metrics (Success@K, ΔMRR, ΔnDCG, Δcos) are formally defined in Section 4.1.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states its contribution: a gradient-free, black-box adversarial attack (DeRAG) using Differential Evolution to generate short adversarial suffixes that manipulate RAG retrieval rankings without model internals.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "A four-subsection related work covers adversarial prompts, evolutionary optimization, detection methods, and retrieval mechanisms; the paper explicitly compares against GGPP (white-box) and PRADA (sparse black-box) throughout.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Source code is released at https://github.com/pen9rum/Rag_attack_DeRag, explicitly referenced in Section 4.1.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All datasets used (MS MARCO, SciFact, FiQA, FEVER, SQuAD, NQ-Open) are standard publicly available benchmarks accessible via the BEIR framework.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or version-pinned dependency specifications are provided; only the model name (BERT-base-uncased) is mentioned without library or Python version details.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The algorithm is described via pseudocode (Algorithm 1) but no step-by-step instructions for reproducing the specific experimental results (table entries, figures) are provided in the paper.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Main attack success results (Tables 1-2) report point estimates only; standard deviations appear only for MLM NLL in Table 15, not for primary Success@K metrics.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Welch's t-test is used only for the MLM NLL readability comparison (Table 16); the primary comparative claims — DE vs. GGPP vs. PRADA attack success — are made without any statistical significance testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are reported across multiple metrics: Success@K, ΔMRR, ΔnDCG, Δcos, and downstream percentage drops in EM/F1/ROUGE-L in Table 5.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 100 queries and 1,000 documents per dataset is not justified through power analysis or reasoning about statistical adequacy.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Main results tables (Tables 1-2) report no variance; standard deviations appear only for MLM NLL (Table 15), not for primary attack success metrics.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Three baselines are included: GGPP (gradient-based white-box), PRADA (sparse black-box), and random suffix.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "GGPP (2024) and PRADA (2022) are the most relevant contemporary methods for the dense and sparse retriever attack settings respectively.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Extensive ablations: DE variants (seq_stop vs. fixed_stop vs. seq), suffix length effects (Figure 4, Appendix D), loss function comparison (Appendix E), prefix vs. suffix positioning (Table 3), and candidate pool size effects (Table 6).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics: Success@K (K=1,10,20), Avg Tok, Avg Iter, ΔMRR, ΔnDCG, Δcos, EM, F1, ROUGE-L, BERTScore, AUROC, AUPRC, MLM NLL.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable; attack effectiveness is measured through automated retrieval and QA metrics.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is an adversarial optimization task, not a supervised prediction task; the held-out test set concept does not apply.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down per dataset (SciFact, FiQA, FEVER, MS MARCO) and per retrieval threshold (K=1, 10, 20) across all main tables.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Table 3 explicitly tabulates queries where both prefix and suffix attacks fail; Section 4.5 stratifies outcomes into Top-1 success, Top-10-only, and Fail with corresponding quality metrics.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results include: cosine loss underperforming hinge loss (Table 12/Appendix E), monotonic suffix length schedule not improving results (Appendix D), and low Succ@1 rates (~10-20%) on most datasets.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "BERT-base-uncased is specified for retrieval, but the LLM used for answer generation in the downstream RAG pipeline (Section 4.5, Table 5) is never named or versioned — a critical omission.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No system prompts or query templates for the downstream RAG generator LLM are provided; appendix tables show adversarial suffix output examples but not the generation prompts used.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "DE hyperparameters (F, CR, N, patience T) are described as typical ranges (e.g., F ∈ [0.5, 1.0], CR ∈ [0.1, 0.9]) rather than the exact values used in the reported experiments.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; this is a direct adversarial optimization attack on retrieval systems.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Data preprocessing is documented: random sampling of 1,000 documents and 100 queries from official BEIR corpus/query splits, BERT-base-uncased CLS embedding extraction (768-dim), cosine similarity retrieval.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Per-query attack outcomes and generated adversarial suffixes are not released as structured data; only the code repository is linked and we cannot verify its contents from the paper.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described: standard BEIR benchmarks, random sampling of 1,000-document subsets and 100 queries from official splits, target documents chosen as non-relevant passages or topically confusable distractors.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; all data comes from standard NLP benchmarks.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline is documented: query/target selection → BERT embedding → DE optimization loop (Algorithm 1) → retrieval evaluation → downstream QA generation and scoring.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "This paper evaluates an adversarial attack on retrieval ranking, not LLM benchmark knowledge recall; standard contamination concerns do not apply.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable; benchmarks are used as retrieval corpora to be manipulated, not as knowledge tests for a generative model.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable for the same reason as above.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Per-query iteration counts (Table 2) and pool construction/query optimization times in seconds (Table 14) are reported, providing practical cost information.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total computational budget (GPU/CPU hours, hardware specifications) is not stated; only per-query timing data are provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DE-based black-box attack achieves competitive or higher success rates than GGPP (gradient white-box) at Top-10 and Top-20 thresholds on dense retrievers", + "evidence": "Table 2: DE_seq_stop matches or exceeds GGPP at Succ@10/Succ@20 on SciFact (0.573 vs 0.458) and FiQA (0.520 vs 0.480), though GGPP dominates Succ@1 on MS MARCO (0.830 vs 0.570)", + "supported": "moderate" + }, + { + "claim": "Effective adversarial suffixes require only 2-3 tokens on average", + "evidence": "DE_seq_stop achieves average suffix lengths of 1.32 (MS MARCO) to 2.76 (FEVER) tokens while maintaining high Top-10/Top-20 success rates (Table 2)", + "supported": "strong" + }, + { + "claim": "Early stopping reduces query cost by approximately 40% without reducing attack success", + "evidence": "Figure 2 shows DE_seq_stop reaches 97% Top-10 success at 2 tokens while DE_seq needs 3-4; Section 3.3.3 states the hybrid strategy 'cuts average query cost by ~40%'", + "supported": "strong" + }, + { + "claim": "DE-generated suffixes evade BERT-based and RoBERTa-based adversarial detection", + "evidence": "Table 4: RoBERTa detector achieves AUROC 0.2023 and AUPRC 0.4665 at 0.5% FPR target; Table 13: CLS attack probability is near-identical for original vs. attacked queries (~0.40)", + "supported": "strong" + }, + { + "claim": "Readability-aware MLM pooling strategy significantly reduces suffix perplexity without degrading attack success", + "evidence": "Table 16: Welch's t-test yields p < 1e-9 for NLL reduction across all three datasets; Table 6 shows stable Success@1 across pool sizes from 500 to 30,522", + "supported": "strong" + }, + { + "claim": "Adversarial retrieval manipulation causes substantial downstream answer quality degradation in real QA pipelines", + "evidence": "Table 5 shows 83.5% EM drop on SQuAD when target reaches rank 1 and 14.8% average EM drop across NQ-Open; tested on only 500 queries per dataset with unspecified generator LLM", + "supported": "moderate" + } + ], + "methodology_tags": ["benchmark-eval"], + "key_findings": "DeRAG demonstrates that gradient-free Differential Evolution can generate adversarial query suffixes of 2-3 tokens that effectively manipulate RAG retrieval rankings, matching gradient-based white-box attacks (GGPP) at broader retrieval thresholds (Top-10, Top-20) while requiring no model internals. The attack evades RoBERTa-based detectors with near-chance accuracy (AUROC 0.2023) and causes measurable downstream answer quality degradation (14-27% average EM drop on QA benchmarks, up to 83.5% when the adversarial document reaches rank 1). A readability-aware suffix construction strategy using MLM token pooling statistically significantly reduces suffix perplexity (Welch's t, p < 1e-9) without degrading attack success. However, all experiments use BERT-base-uncased on artificially small 1,000-document corpus subsets, limiting generalizability claims.", + "red_flags": [ + { + "flag": "Unrealistically small corpus", + "detail": "Experiments use only 1,000-document subsets of BEIR datasets; production RAG systems operate over millions of documents where attack rank targets and success rates would differ substantially." + }, + { + "flag": "No CIs or significance tests on primary results", + "detail": "Attack success rates (Success@K, ΔMRR, ΔnDCG) in Tables 1-2 are point estimates from 100 queries per dataset with no confidence intervals or statistical significance tests for the main comparative claims." + }, + { + "flag": "Single retriever model tested", + "detail": "Dense retrieval experiments use only BERT-base-uncased (2018); modern instruction-tuned embedding models widely used in production (E5, GTE, OpenAI text-embedding-3) are untested." + }, + { + "flag": "Generator LLM unspecified for downstream evaluation", + "detail": "Section 4.5 evaluates downstream RAG answer quality (Table 5) but never names or versions the LLM used for generation, making these results unreproducible." + }, + { + "flag": "Exact DE hyperparameters not reported", + "detail": "The paper provides typical ranges for DE parameters (F ∈ [0.5, 1.0], CR ∈ [0.1, 0.9], N, T) but not the exact values used in the reported experiments." + }, + { + "flag": "Title overstates breadth of coverage", + "detail": "Title claims attacks on 'Multiple Retrieval-Augmented Generation Applications' but only one dense retriever (BERT-base-uncased) and one sparse retriever (BM25) on small corpus subsets are tested." + } + ], + "cited_papers": [ + { + "title": "Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models (GGPP)", + "relevance": "Primary dense-retriever baseline; gradient-based white-box attack on RAG that DeRAG is designed to compete with without gradient access" + }, + { + "title": "PRADA: Practical Black-box Adversarial Attacks against Neural Ranking Models", + "relevance": "Primary sparse-retriever baseline; the most comparable black-box adversarial ranking attack method" + }, + { + "title": "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models", + "relevance": "Evaluation framework used for all retrieval experiments across SciFact, FiQA, FEVER, MS MARCO" + }, + { + "title": "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", + "relevance": "Original RAG paper; defines the RAG paradigm whose retrieval stage DeRAG attacks" + }, + { + "title": "BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models", + "relevance": "Related poisoning-based backdoor attack on RAG corpora; complementary threat model" + }, + { + "title": "CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation", + "relevance": "Related black-box RAG attack using MLM; close competitor using a different gradient-free approach" + }, + { + "title": "Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces", + "relevance": "Foundational algorithm (Storn & Price 1997) underlying the DeRAG optimization method" + }, + { + "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", + "relevance": "Core retrieval encoder used in all dense retrieval experiments" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Demonstrates a deployable black-box attack against production RAG systems with public code; directly actionable for practitioners assessing RAG security." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that ≤5 tokens suffice for effective retrieval manipulation is noteworthy, but RAG vulnerability to adversarial attacks is an expected result in this field." + }, + "fear_safety": { + "score": 2, + "justification": "Shows RAG systems can be manipulated to surface misinformation via small, detector-evading token appends — a concrete and practical AI safety concern for deployed systems." + }, + "drama_conflict": { + "score": 1, + "justification": "The attack-vs-defense framing is inherently adversarial but the paper is technical rather than polemical; no controversial claims about deployed systems." + }, + "demo_ability": { + "score": 2, + "justification": "Code is released on GitHub using public BEIR benchmarks and BERT-base-uncased, making reproduction accessible to practitioners without specialized resources." + }, + "brand_recognition": { + "score": 0, + "justification": "Authors are from National ChengChi University (Taiwan), a respected institution but not a major AI lab with brand recognition in the LLM community." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44120359", + "title": "Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective", + "points": 19, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44120359" + }, + { + "hn_id": "36931866", + "title": "Universal and Transferable Adversarial Attacks on LLM", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=36931866" + }, + { + "hn_id": "36903968", + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=36903968" + } + ], + "top_points": 19, + "total_points": 23, + "total_comments": 1 + } +} diff --git a/papers/design-evaluation-assisted-2026/scan-v5.json b/papers/design-evaluation-assisted-2026/scan-v5.json @@ -0,0 +1,499 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Design and Evaluation of an Assisted Programming Interface for Behavior Trees in Robotics", + "authors": [ + "J. Styrud", + "Matteo Iovino", + "Rebecca Stower", + "Mart Kartašev", + "Mikael Norrlöf", + "Mårten Björkman", + "Christian Smith" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.09772", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims BETR-GUI enables better task performance (LMM FULL vs MANUAL: b=61.05, p<.001, Table V) and humans outperform AI alone (Table VI, p<.001); both are directly supported by the reported results.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The pre-registered ablation design with 6 variants, counterbalanced order, and LMM analysis supports causal attribution of performance differences to specific components within the scope of these tasks.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The Note to Practitioners makes broad claims about improving performance across 'the robotics industry' and 'uncontrolled environments,' while evidence comes only from 3 simplified toy tasks in a 15-minute lab study.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The Discussion proposes multiple explanations for why NO_BO and NO_GP did not significantly outperform FULL, including users failing to utilize node-locking, the planner dominating easy tasks, and learning algorithms needing more time.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The composite score function is fully defined (Equations 1–4), its normalization is explained, and separate SUS and ranking metrics are used alongside the task score.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Limitations are discussed only in the Future Work section (Section VIII) with no dedicated limitations or threats-to-validity section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are named: 'benchmark tasks are highly simplified compared to actual robot applications,' the 15-minute window disadvantages learning algorithms, and users had to simultaneously learn BTs and the GUI.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states 'the benchmark tasks, out of necessity, are highly simplified compared to actual robot applications' and calls for future studies with realistic complex tasks.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding from the Wallenberg AI, Autonomous Systems, and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation is disclosed in the acknowledgment.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are disclosed including ABB Robotics, KTH, ETH Zürich, and Ericsson; ABB Robotics has direct commercial interest in robot programming tools.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "WASP/Wallenberg Foundation is an independent academic research program with no commercial stake in BETR-GUI; ABB Robotics authors have an interest but are not the funder.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement is present despite two authors being affiliated with ABB Robotics, which could commercially benefit from the tool being evaluated.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Behavior Trees, the composite score function, all GUI variants, and AI component algorithms (GP, BO, planning, LLM roles) are defined with sufficient precision for the paper's purposes.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Two explicit contributions are listed in the introduction: (1) the BETR-GUI tool combining multiple AI methods with a GUI, and (2) a 60-participant ablative user study.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section II provides extensive engagement with prior BT, planning, GP, BO, LLM, and composite systems work, explicitly building on specific prior methods that are integrated into BETR-GUI.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Source code is publicly available on GitHub at https://github.com/jstyrud/BETR-GUI as explicitly stated twice in the paper.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The paper states 'Full analyses are available in the OSF repository' but does not explicitly state that raw participant data (scores, SUS responses) is available for independent verification.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only Python/PyQt5 and Unity Engine are mentioned without version specifications. No requirements.txt, Dockerfile, or dependency manifest is provided or referenced.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided for recreating the experimental setup or running the user study; the paper describes system architecture but not how to reproduce experiments.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "95% CIs are reported for all LMM fixed effects (Tables IV, VI, VIII), and SD is reported in descriptive statistics for all GUI variants.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Linear Mixed Models with Tukey-adjusted post-hoc pairwise comparisons and AICc-based model selection are used throughout with p-values reported.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Pseudo-R² is reported for each model (task R²=0.62, SUS R²=0.21, NO_HUMAN R²=0.04), and mean score differences with baselines provide practical effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": true, + "justification": "An a priori power analysis with α=.05 determined that 60 participants provides 80% power assuming small-medium effects; supplementary code to recreate the analysis is noted.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Standard deviations are reported for all mean task scores (Table III) and SUS scores (Table VII) across all six GUI variants.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "MANUAL_ONLY (no AI assistance) serves as the primary baseline, representing existing commercial BT GUIs such as Groot.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "MANUAL_ONLY is described as 'largely similar to existing GUIs like Groot' — the current commercial standard — making it a contemporary and competitive baseline.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The entire experimental design is a systematic ablation study with four ablation variants (NO_BO, NO_GP, NO_LLM, NO_PLANNER) each removing one component from the FULL system.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Three metrics are used: composite task score (performance), System Usability Scale (subjective usability), and participant preference rankings.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "The study is a human evaluation with 60 participants solving robot programming tasks and completing usability questionnaires; this is the primary evaluation method.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is a user study of an interactive tool, not a predictive machine learning task requiring a train/test split.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by GUI variant, task (Cubes/Tableware/Trashpicking), and trial order with statistical tests for each factor; Figure 8 shows cross-tabulated results.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The Discussion addresses NO_LLM/NO_PLANNER failures, users distrusting and abandoning AI suggestions, and specific user quotes describing frustration; node-locking was used in only 74/120 experiments.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Key null results are foregrounded: NO_BO and NO_GP not significantly different from FULL (p=.999, p=.807), and NO_LLM and NO_PLANNER not significantly better than MANUAL_ONLY (p=.783, p=.358).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Only 'GPT-4' is stated without a snapshot date or API version (e.g., gpt-4-0613); multiple GPT-4 versions with different capabilities existed during the study period.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Prompts are referenced as 'the same method as in [12] with a slightly updated prompt' but are not provided in the paper or explicitly pointed to in the repository.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Score weights are on GitHub but not in the paper; GP population size, mutation rates, BO acquisition function, and surrogate model parameters are never specified.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section III.D and Figure 2 clearly describe the full AI assistant workflow: seed BT → planner → LLM error resolution loop → parallel GP/BO optimization with user interaction.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Score normalization procedure is defined (0 = minimal failing two-node BT, 100 = best participant score), and GUI logging of actions/scores with timestamps is described.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The paper explicitly states only that 'Full analyses are available in the OSF repository'; raw participant score and SUS data are not confirmed as available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section V.F describes the full procedure: GUI automatic logging of actions and scores with timestamps, SUS after each variant, demographic questionnaire, and structured one-hour session.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Recruitment via 'flyers, mailing lists, social media, and word of mouth' is explicitly stated along with 100 SEK gift card compensation.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Score calculation equations (1–4) are provided, R/lme4 analysis is described with model selection criteria (AICc), randomization code is on GitHub, and full analyses are on OSF.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "This study evaluates a human-computer interface using GPT-4 as one component, not benchmarking LLM capabilities on standard datasets where training contamination is a methodological concern.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Evaluation tasks are custom Unity simulations with novel parameterized scenarios; train/test overlap with LLM pre-training data is not a relevant concern for this study type.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "No standard benchmarks are used; all evaluation scenarios were custom-designed for this study and unavailable before the study was conducted.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": true, + "justification": "Hypotheses and planned confirmatory and exploratory analyses were pre-registered on OSF at https://osf.io/ax5gb/overview before data collection.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "The paper states the study 'followed the ethical guidelines for Sweden' but does not mention specific IRB or ethics board approval or committee name.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": true, + "justification": "Age (M=29.7, SD=8.9, range 20–62), gender (10 female, 50 male), and domain familiarity scores across four domains are reported in Table II.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "Participants are described post-hoc as 'primarily university students of engineering or computer science or professional software developers,' not as formal pre-specified inclusion/exclusion criteria.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": true, + "justification": "Task and variant order were counterbalanced in advance ensuring equal ablation exposure and order effect control; the randomization code is on GitHub.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding is described; participants could see which GUI variant they were using, and the supervisor monitored all sessions without any blinding protocol.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": true, + "justification": "One participant was excluded due to a GUI bug and replaced with a new participant; this attrition is explicitly reported with reason.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "GPT-4 API calls are made during each experiment session but no cost per session, latency, or total API cost is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware is specified (Intel Core Ultra 9 185H CPU) but total compute time, GPU usage, or budget for the 300 NO_HUMAN ablation runs are not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "BETR-GUI with full AI assistant achieves significantly higher task scores than MANUAL_ONLY (mean 91.14 vs 30.24)", + "evidence": "LMM post-hoc FULL vs MANUAL: b=61.05, t=14.79, p<.001 (Table V); effect consistent across all tasks and trial orders with R²=0.62", + "supported": "strong" + }, + { + "claim": "Human+AI collaboration (FULL) outperforms AI running alone (NO_HUMAN) given the same time budget", + "evidence": "LMM FULL vs NO_HUMAN: b=3.08, t=3.46, p<.001 (Table VI); mean 91.14 vs 88.06 across 60 FULL and 300 NO_HUMAN runs", + "supported": "strong" + }, + { + "claim": "LLM and planner are critical components; removing either yields performance not significantly above MANUAL_ONLY", + "evidence": "Post-hoc comparisons NO_LLM vs MANUAL: b=-8.83, p=.783; NO_PLANNER vs MANUAL: b=-13.32, p=.358 (Table V) — both non-significant", + "supported": "strong" + }, + { + "claim": "Removing Bayesian Optimization or Genetic Programming does not significantly reduce task performance versus FULL", + "evidence": "Post-hoc: FULL vs NO_BO p=.999; FULL vs NO_GP p=.807 (Table V); mean score differences of 0.76 and 9.13 points are within noise", + "supported": "strong" + }, + { + "claim": "User performance improves significantly across successive trials due to learning", + "evidence": "LMM fixed effect of Trial Order: b=6.88, t=3.45, p<.001 (Table IV); ~7 normalized score points gained per trial", + "supported": "strong" + }, + { + "claim": "User trust in the AI assistant mediates task performance, with some participants refusing correct AI suggestions", + "evidence": "Only 74/120 experiments used node-locking; user quotes describe distrust after bad AI experiences; some users rejected AI solutions that solved the task and then failed manually", + "supported": "moderate" + } + ], + "methodology_tags": [ + "rct", + "qualitative" + ], + "key_findings": "BETR-GUI combining LLMs, planning, genetic programming, and Bayesian optimization significantly improves novice programmer performance on robot BT tasks versus manual-only interfaces (mean score 91.14 vs 30.24, p<.001). LLM and planner components are essential — removing either produces performance statistically indistinguishable from manual-only — while removing BO or GP has negligible impact in 15-minute sessions, likely because the planner dominates short tasks. Human-AI collaboration outperforms AI alone (p<.001), demonstrating continued human value even with an extensive AI assistant. User trust emerged as a key behavioral mediator: participants who experienced a poor AI suggestion early often abandoned the tool entirely, even when the AI subsequently solved the task correctly.", + "red_flags": [ + { + "flag": "GPT-4 version unspecified", + "detail": "Only 'GPT-4' is named without snapshot date or API version ID, making exact reproduction impossible given multiple GPT-4 variants deployed over this period." + }, + { + "flag": "Highly simplified tasks", + "detail": "All evaluation scenarios are 15-minute toy tasks (~15-node BTs) in a custom Unity simulation; generalization to real industrial robotics is explicitly unvalidated and stated as future work." + }, + { + "flag": "Unequal ablation sample sizes", + "detail": "FULL and MANUAL_ONLY received 60 exposures each while each ablation variant received only ~15, reducing statistical power for detecting differences between ablation variants." + }, + { + "flag": "No competing interests declaration", + "detail": "Two authors are from ABB Robotics, which has direct commercial interest in robot programming tools; no competing interests statement is present." + }, + { + "flag": "Hyperparameters not in paper", + "detail": "GP population size, mutation rates, BO surrogate model, acquisition function, and score function weights are not reported in the paper and only available via the GitHub repository." + }, + { + "flag": "Male-dominated sample", + "detail": "83% of participants identified as male (50/60), limiting generalizability of usability findings to broader populations including industrial operators." + } + ], + "cited_papers": [ + { + "title": "A survey of Behavior Trees in robotics and AI", + "relevance": "Comprehensive survey establishing the state of BT methods in robotics; provides foundational background and taxonomy used throughout the paper" + }, + { + "title": "Automatic behavior tree expansion with LLMs for robotic manipulation", + "relevance": "Direct predecessor work (BETR-XP-LLM) that BETR-GUI builds upon for LLM-based BT expansion and error resolution" + }, + { + "title": "Combining Planning and Learning of Behavior Trees for Robotic Assembly", + "relevance": "Prior system from same team combining planning and GP for BT creation; BETR-GUI integrates and extends this" + }, + { + "title": "BeBOP: Combining Reactive Planning and Bayesian Optimization to Solve Robotic Manipulation Tasks", + "relevance": "Prior work on BO for robot BT optimization that is integrated as a component in BETR-GUI" + }, + { + "title": "Measuring the impact of early-2025 AI on experienced open-source developer productivity", + "relevance": "Cited as evidence that AI assistants can decrease developer performance — key contrast motivating BETR-GUI's positive result" + }, + { + "title": "Integrating intent understanding and optimal behavior planning for behavior tree generation from human instructions", + "relevance": "Contemporary LLM+BT system (LLM-OBTEA) used as prior art and design reference for the AI assistant" + }, + { + "title": "Behavior Trees in Robotics and AI: An Introduction", + "relevance": "Reference textbook defining BT semantics and operations; foundational context for the paper's implementation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly applicable to robotics practitioners programming behavior trees; code is released on GitHub with an instruction video." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Mildly surprising that BO and GP add no significant value over LLM+planner alone, and that trust mediates performance more than algorithmic capability." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; the paper focuses on productivity in robot programming." + }, + "drama_conflict": { + "score": 1, + "justification": "Engages directly with the recent finding that AI assistants can make developers worse, then shows a positive result for the specific domain of robot BT programming." + }, + "demo_ability": { + "score": 2, + "justification": "Code is on GitHub with a publicly available 5-minute instruction video; practitioners could realistically try the tool." + }, + "brand_recognition": { + "score": 1, + "justification": "ABB Robotics and KTH are well-known in robotics but not broadly recognized AI research labs." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/design-implementation-secure-2025/scan-v5.json b/papers/design-implementation-secure-2025/scan-v5.json @@ -0,0 +1,519 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Design and Implementation of a Secure RAG-Enhanced AI Chatbot for Smart Tourism Customer Service: Defending Against Prompt Injection Attacks – A Case Study of Hsinchu, Taiwan", + "authors": [ + "Yu-Kai Shih", + "You-Kai Kang" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2509.21367", + "doi": "10.48550/arXiv.2509.21367" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "Abstract claims GPT-5 'blocked approximately 85%' of attacks, but Table 3 shows 249/674=36.9% on full corpus. The 85% figure appears limited to a 301-attack subset mentioned only in Table 5's note. The abstract is misleading about the actual attack defense rate.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Paper uses an ablation study (V0→V1→V2→V3→V4) showing stage-wise improvements in defense effectiveness. Table 3–5 demonstrate that each layer contributes to blocking more attacks, supporting causal claims about layer effectiveness.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Paper is framed as 'case study of Hsinchu' but makes broad claims about 'secure smart tourism systems' globally and 'practical blueprint for deploying secure AI in visitor services.' Scope broadens beyond the specific case without consistently bounding claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss why the system achieves its results or consider alternative explanations. For instance, no discussion of whether 95% benign accuracy reflects easy test queries or actual system robustness.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper measures 'accuracy' on benign queries and 'block rate' on adversarial queries as proxies for safety. User satisfaction (1–5 scale) is measured, though what it measures exactly is underspecified.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 is a dedicated limitations section with eight specific limitations discussed (internal queries, API vulnerabilities, RAG scope, multilingual, ethics, resources, framework, adversarial threats, GPT-5 early access).", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Limitations are specific: 'Internal queries may not capture full diversity,' 'API vulnerabilities: Downtime risks,' 'Limited to major languages; slang/dialects unhandled,' 'Tested known attacks; emerging threats may evade.' Each identifies a concrete threat.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Paper explicitly states it is 'case study of Hsinchu, Taiwan' and provides tourism-specific scope. However, claims about generalizability to 'secure smart tourism systems' globally are not always bounded to this case-study context.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is disclosed anywhere in the paper. No acknowledgments, no funding statement, no conflicts of interest declaration provided.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": false, + "justification": "Authors list affiliations with educational institutions (National Dong Hwa University, BTS Experimental Education Program) but the paper evaluates a system from a 'Taiwan-based tourism technology firm' with no disclosed author relationship to that firm.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "Unknown—no funding disclosed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No conflicts of interest statement or declaration of financial interests (patents, equity, consulting arrangements). No COI section present.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: 'RAG' explained in Section 2.2, 'Prompt injection' defined in Section 1.1 with OWASP reference, attack types enumerated in Section 2.3. 'Smart tourism' used but not formally defined in context.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1.2 explicitly states five contributions: (1) secure RAG architecture, (2) empirical evaluation of defense layers, (3) GPT-5 integration, (4) ethical/sustainable case studies, (5) insights for multilingual deployments.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 surveys related work on AI in tourism (2.1), RAG (2.2), and prompt injection defense (2.3). However, engagement is mostly descriptive listing rather than deep comparison—mentions zIA as similar but doesn't clearly articulate differences.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No source code provided. No GitHub repo, no code artifact, no promise of future release mentioned.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Test datasets (223 benign, 674 adversarial queries) are not released. Adversarial datasets sourced from Deepset, Rubend18, and 'partner-provided samples' but no indication these are publicly available or released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Section 3.4 lists Python 3.11, Flowise, OpenAI embeddings, Qdrant, LangSmith, GPT-4o/GPT-5. Missing: exact versions of dependencies, pip requirements.txt, Dockerfile, or configuration files needed for reproduction.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions provided. The paper describes the system architecture and methodology but not how to reproduce results from scratch with released artifacts.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars reported for any metrics. All results reported as point estimates (e.g., 95% accuracy, 45% recall) with no uncertainty quantification.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance testing performed. Comparisons between versions (V0→V1→V2→V3→V4) show percentage differences but no p-values or significance tests to determine if differences are meaningful.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes as percentage improvements reported: baseline 0% attack blocking vs. Secure RAG 45%; benign accuracy 78%→95%; hallucination 15%→2%. Absolute effect sizes clearly shown.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample sizes stated (223 benign, 674 adversarial) but not justified. No power analysis, no explanation for why these sample sizes were chosen or whether they are sufficient.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or range reported. Single run results with no discussion of variability across multiple runs or trials.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "V0 ('Zero Defense') serves as baseline for comparison. Subsequent versions (V1→V4) show incremental improvements against this baseline.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "Only internal baselines (V0-V4) compared; no comparison to published prompt injection defense systems or prior work approaches from the literature. Missing external, state-of-the-art baselines.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "V0→V1→V2→V3→V4 progression removes/adds components (System Norms, Gatekeeper, Reverse RAG, GPT-5). Each variant shown in Figures 3-6 with metrics in Tables 3-5.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics used: Accuracy, Precision, Recall, F1-score, benign accuracy, hallucination rate, response time, user satisfaction. Comprehensive across safety and utility dimensions.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "Table 5 reports 'User Satisfaction (1–5)' but methodology is completely underspecified: no sample size, no description of what was evaluated, how users were recruited, or how responses were collected.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "223 benign and 674 adversarial queries are used as test set but no train/test split is clearly documented. For intent classifier or other learned components, unclear whether separate training data exists.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 2 lists six attack categories with counts. Table 3 breaks down blocked attacks by category (Double Character, Virtualization, Obfuscation, etc.) across all versions.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4 'Failure Case Analysis' discusses three representative failure modes: indirect obfuscation with benign wrappers, multi-turn anchoring, and ambiguous safety scopes with mitigations proposed.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results reported: Payload Splitting attacks show 0 blocked (all 674 missed); GPT-5 only blocks 37% of attacks despite being 'frontier model'; multi-turn and indirect attacks remain undefended.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions specified: 'GPT-4o' (baseline) and 'GPT-5 (released 2025-08-07)'. Embedding model 'OpenAI text-embedding-ada-002' named. All snapshot dates available.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Only prompt excerpts provided. Section 3.2.2 shows 'Prompt Skeleton' with placeholders and general rules; Section 3.2.4 shows 'Summary Directive' excerpt. Full system prompts for all versions not included.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Key hyperparameters specified in Section 3.4: top-k=5, similarity threshold τ=0.70, temperature=0.2, max tokens=1024, cosine distance metric. Complete enough for implementation.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Agentflow architecture thoroughly described with figures (Figs 1-2, 3-6) showing pipeline stages: User Input → Preprocessing → Gatekeeper → RAG → LLM → Safety Checks → Response. Defense mechanisms detailed in Section 3.2.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Preprocessing mentioned briefly ('chunks data using vector embeddings', 'NVivo coded themes') but detailed steps not documented. Missing: exact filtering rules, embedding procedures, curation workflow.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "No raw data released. The 223 benign and 674 adversarial queries are not provided. The curated tourism knowledge base used by RAG is not disclosed.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": false, + "justification": "Benign queries described as '(informational/transactional/exploratory)' and adversarial as sourced from 'Deepset (546, incl. politics/role-play), Rubend18 (79, injections), partner (49, adapted)'. The procedure for their own 223 benign queries is not described.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "N/A—no human participant recruitment. The paper uses existing adversarial datasets, not human-collected data.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Data flow through Agentflow pipeline is described (Figs 2, 7) and logging via LangSmith mentioned, but the full pipeline from query ingestion to result logging is not fully documented with step-by-step procedures.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training cutoff dates not stated for GPT-4o or GPT-5. Paper uses these models but does not specify their training data windows or knowledge cutoffs.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether adversarial datasets (Deepset, Rubend18) might overlap with GPT-4o/GPT-5 training data. Public datasets could have been in training, creating contamination risk.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "No discussion of benchmark contamination. The paper evaluates generative models on adversarial prompt benchmarks but does not address whether those benchmarks were included in model training.", + "source": "haiku" + } + }, + "human_studies": { + "applies": false, + "answer": false, + "justification": "N/A—no explicit human participant study. 'User Satisfaction' is reported but methodology is too underspecified to determine if it involved human subjects or was a synthetic metric.", + "source": "haiku" + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Response times reported in Table 5 (2.1–3.2 sec) but no computational cost, API cost, or monetary expense reported. No breakdown of cost per query or total system cost.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget stated. Paper mentions system is 'compute-intensive' in limitations but provides no cost, GPU hours, or infrastructure budget.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "RAG integration reduces hallucination rate from 15% (baseline) to 2% (Secure RAG), improving response trustworthiness.", + "evidence": "Table 5 shows hallucination rates across versions: Baseline 15%, RAG 2%, Secure RAG 2%, GPT-5 Direct 1%.", + "supported": "strong" + }, + { + "claim": "Multi-layered guardrails block 45% of 674 adversarial prompt injection attacks while maintaining 95%+ accuracy on benign queries.", + "evidence": "Table 4 reports 301/674 attacks blocked (45% recall) for Reverse RAG. Table 5 shows 95% benign accuracy for Secure RAG.", + "supported": "strong" + }, + { + "claim": "GPT-5 demonstrates improved baseline robustness against prompt injections, blocking approximately 85% of attacks.", + "evidence": "Abstract and Table 3 cite '85%', but Table 3 shows 249/674=36.9% for full corpus. The 85% figure appears limited to a 301-subset (Table 5 note), creating ambiguity.", + "supported": "weak" + }, + { + "claim": "Multi-layered linguistic analysis (lexical, semantic, intentional, contextual, pragmatic levels) enables accurate intent detection across diverse cultural backgrounds.", + "evidence": "Section 3.1 describes five-level parsing; no quantitative metrics on intent detection accuracy or cross-cultural validation provided.", + "supported": "weak" + }, + { + "claim": "Reverse RAG (grounding responses in retrieved passages) prevents instruction-override attacks by making retrieval authoritative.", + "evidence": "Figure 6 and Section 3.2.4 describe the mechanism; Table 3 shows Reverse RAG version blocks 301/674 attacks; failure analysis shows it still misses indirect obfuscation and multi-turn attacks.", + "supported": "moderate" + }, + { + "claim": "The chatbot system achieves 4.7–4.8/5 user satisfaction across Secure RAG and GPT-5 variants.", + "evidence": "Table 5 reports user satisfaction scores, but no methodology for collection or sample size provided.", + "supported": "weak" + }, + { + "claim": "Multi-turn anchoring and indirect obfuscation with benign wrappers represent the most difficult failure modes to defend against.", + "evidence": "Section 4 failure case analysis identifies three failure modes (indirect obfuscation, multi-turn anchoring, ambiguous scopes) and their frequency/impact.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "case-study", + "benchmark-eval" + ], + "key_findings": "A multi-layered RAG-enhanced chatbot with iterative defenses (System Norms, Gatekeeper, Reverse RAG) blocks 45% of a 674-attack corpus while achieving 95%+ accuracy on benign queries and reducing hallucinations from 15% to 2%. GPT-5 shows improved baseline robustness (36.9% of full attacks blocked, or 85% on focused subsets) but sophisticated attacks like multi-turn injection and indirect obfuscation remain undefended, indicating that layered guardrails remain necessary even with frontier models. User satisfaction increases from 3.4 to 4.8 across system variants.", + "red_flags": [ + { + "flag": "Abstract-results mismatch on GPT-5 defense rate", + "detail": "Abstract claims GPT-5 'blocked approximately 85%' but Table 3 shows 249/674=36.9% on full adversarial corpus. The 85% figure comes from a 301-attack subset mentioned only in Table 5's note, not the main results. This creates misleading impression of GPT-5's defense effectiveness." + }, + { + "flag": "No external baselines", + "detail": "Paper compares only internal versions (V0→V1→V2→V3→V4). No comparison to published prompt injection defense techniques, prior work systems, or state-of-the-art guardrails from the literature." + }, + { + "flag": "No statistical significance testing", + "detail": "All results reported as point estimates with no confidence intervals, standard errors, or significance tests. Given sample sizes of 223–674, variance matters but is completely absent." + }, + { + "flag": "Human evaluation severely underspecified", + "detail": "Table 5 reports 'User Satisfaction (1–5)' as single scores per version with zero documentation of methodology: no sample size, recruitment method, evaluation criteria, or data collection procedure." + }, + { + "flag": "Contamination risk unaddressed", + "detail": "Paper uses public datasets (Deepset, Rubend18) to evaluate GPT-4o/GPT-5 but training cutoff dates not stated and no discussion of train-test overlap. These datasets could have been in model training data." + }, + { + "flag": "No code or data release", + "detail": "System is not reproducible: no source code, no test datasets, no detailed reproduction instructions. Claims cannot be independently verified." + }, + { + "flag": "Scope creep from case study to generalization", + "detail": "Paper frames itself as 'case study of Hsinchu' but makes broad claims about 'secure smart tourism systems' globally and 'practical blueprint for visitor services worldwide.' Gap between specific case and general principles unclear." + }, + { + "flag": "Ablation clarity limited", + "detail": "While V0→V4 progression shows improvements, it is unclear which specific component drives each improvement. System Norms vs. Gatekeeper vs. Reverse RAG contributions to the final 45% block rate not isolated." + }, + { + "flag": "No funding or conflict disclosure", + "detail": "No funding source disclosed, no conflicts of interest statement, unclear author affiliation with evaluated tourism firm. Paper evaluates a proprietary system but discloses no relationship." + }, + { + "flag": "Payload Splitting completely undefended (0% block rate)", + "detail": "Table 2 lists 'Payload Splitting' attack category but Table 3 shows 0 blocked out of 0 tested (i.e., no Payload Splitting attacks in corpus). This attack category is completely absent from evaluation, leaving a gap." + } + ], + "cited_papers": [ + { + "title": "An early categorization of prompt injection attacks on large language models", + "authors": "Rossi, Michel, Mukkamala, Thatcher", + "year": 2024, + "relevance": "Foundational taxonomy of prompt injection attack types (direct, indirect, obfuscation, etc.). Core reference for attack categorization." + }, + { + "title": "Enhancing tourism recommender systems for sustainable city trips using retrieval-augmented generation", + "authors": "Banerjee, Satish, Wörndl", + "year": 2025, + "relevance": "RAG applied to tourism; directly analogous use case for sustainability and knowledge grounding." + }, + { + "title": "zIA: a GenAI-powered personalized local assistant assists tourists in Italy", + "authors": "Cassani, Ruberl, Salis, Boanelli, Giannese", + "year": 2025, + "relevance": "Parallel case study of similar tourism chatbot system in European context; comparison point for multilingual and personalization approaches." + }, + { + "title": "Guardrails for large language models: A review of techniques and challenges", + "authors": "Akheel", + "year": 2025, + "relevance": "Comprehensive review of LLM guardrail techniques; foundational for defense mechanisms discussed." + }, + { + "title": "Generative artificial intelligence in the hospitality and tourism industry: Developing a framework for future research", + "authors": "Ivanov", + "year": 2024, + "relevance": "Broad survey of GenAI in tourism; contextualizes application domain and identifies research gaps." + }, + { + "title": "Generative AI as a tourism actor: Reconceptualising experience co-creation, destination governance and responsible innovation in the synthetic experience economy", + "authors": "Christou, Fotiadis, Giannopoulos", + "year": 2025, + "relevance": "Examines AI's role in tourism governance and ethics; relevant to paper's discussion of responsible AI and sustainable tourism." + }, + { + "title": "Generative artificial intelligence and responsible tourism", + "authors": "Tham, Michael, Michael", + "year": 2024, + "relevance": "Ethics and responsibility in tourism AI; directly supports paper's discussion of bias mitigation, fairness, and transparency." + }, + { + "title": "Integrating generative AI and IoT for sustainable smart tourism destinations", + "authors": "Suanpang, Pothipassa", + "year": 2024, + "relevance": "Integration of AI with IoT for sustainable tourism; relevant to paper's sustainability claims and green tourism initiatives." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "System deployed with real tourism firm in Hsinchu; demonstrates practical deployment and real-world impact. However, no code/data released limits utility for other practitioners." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Prompt injection defenses are well-known techniques; applying them to tourism domain is straightforward application, not contrarian or surprising." + }, + "fear_safety": { + "score": 1, + "justification": "Paper defends against security risks rather than raising new concerns. Not framed as an AI risk paper, but rather as a security solution paper." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy, conflict, or dramatic angle. Technical case study with no adversarial narrative or contentious debate." + }, + "demo_ability": { + "score": 1, + "justification": "System is described but not reproducible without code/data release. No live demo interface, no artifact that others can try." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors from lesser-known institutions (National Dong Hwa University, BTS program); evaluated system is proprietary with no named firm. No famous lab affiliation or recognizable brand." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43555248", + "title": "UCSD: Large Language Models Pass the Turing Test", + "points": 91, + "comments": 106, + "url": "https://news.ycombinator.com/item?id=43555248" + }, + { + "hn_id": "45055118", + "title": "Precovery Observations of 3I/Atlas from Tess Suggests Possible Distant Activity", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45055118" + }, + { + "hn_id": "41655031", + "title": "Extracting Memorized Training Data via Decomposition", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41655031" + } + ], + "top_points": 91, + "total_points": 95, + "total_comments": 106 + } +} +\ No newline at end of file diff --git a/papers/designbench-comprehensive-benchmark-2025/scan-v5.json b/papers/designbench-comprehensive-benchmark-2025/scan-v5.json @@ -0,0 +1,344 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation", + "authors": ["Jingyu Xiao", "Man Ho Lam", "Ming Wang", "Yuxuan Wan", "Junliang Liu", "Yintong Huo", "Michael R. Lyu"], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2506.06251", + "doi": "10.48550/arXiv.2506.06251" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract's claims about benchmark coverage (900 samples, 3 frameworks, 3 tasks) and findings (framework-specific limitations, task bottlenecks) are all substantiated by Tables 2-10 and Findings 1-10.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper makes causal claims such as 'increased model capacity enhances essential web development capabilities' (Finding 2) from correlational benchmark data comparing differently-sized models without controlling for architecture, training data, or instruction tuning differences.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper makes broad generalizations such as 'textual code offers MLLMs more semantic information than visual data' and 'MLLMs still face challenges in fixing front-end errors' from a narrow benchmark without bounding these to the specific benchmark setting.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The finding that code-only input outperforms image-only input is interpreted as code providing 'more precise semantic information,' but alternatives such as models being better pre-trained on code than UI screenshots or metric-specific artifacts are not considered.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "CLIP and SSIM scores are used as proxies for visual UI generation quality without validation that they correlate with human judgments of front-end quality; only the MLLM-as-Judge metric is validated against human evaluation (>90% accuracy).", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Threats to Validity' covers both internal validity (MLLM-as-judge reliability, data leakage) and external validity (limited framework coverage), going well beyond a single sentence.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Internal threats are addressed with specific mitigations: MLLM judge validated at >90% accuracy with Kappa 0.86/0.84, and BLEU scores (0.06-0.15 range) are measured as evidence against data leakage.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states that interactive and multi-page benchmarks are 'out of our scope' and that external validity is bounded to React, Vue, and Angular due to market dominance.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgments or grant information appear anywhere in the paper; API costs alone (~$52/model × 9 models) suggest external support that is not disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All seven authors disclose affiliations (The Chinese University of Hong Kong and Singapore Management University) in the author contact block; none are affiliated with the commercial MLLM providers being evaluated.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is disclosed, making independence assessment impossible; authors are from academic institutions not commercially tied to the evaluated models.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, no declaration of patents or equity, and no disclosure of financial relationships with any of the evaluated MLLM providers (Anthropic, OpenAI, Google, Meta, Mistral).", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 4.1 formally defines Design Generation, Design Edit, and Design Repair with mathematical notation (task functions TG, TE, TR with explicit inputs and outputs); front-end frameworks are defined in Section 2.2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions are explicitly enumerated: the first multi-framework multi-task benchmark, extensive 9-MLLM evaluation across multiple dimensions, and a 22-type failure taxonomy with actionable guidance.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 3 reviews DL-based, CV-based, and MLLM-based UI code generation methods; Table 1 directly compares DesignBench against six prior benchmarks across five dimensions showing clear differentiation.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper identifies gaps in existing benchmarks and fills them, but does not argue why CLIP/SSIM specifically measure visual UI generation capability or why CMLS/CMCS measure code edit quality rather than syntactic similarity to ground truth.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": true, + "justification": "Section 6.3 defines difficulty for all three tasks with specific criteria: a composite score (image size, UI elements, color variety, layout complexity) for generation; annotator labels for editing; lines-of-code threshold (<10/10-30/>30) for repair. Table 6 reports results by difficulty tier.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "Qwen-7B achieves near-zero CLIP scores (0.04-0.09) and zero MLLM scores on several framework-task combinations indicating floor effects, but the paper does not flag this as a benchmark limitation or discuss discriminability at the low end.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "Human annotators are used for data curation and MLLM judge validation but no human performance baseline is reported for any of the three benchmark tasks (generation, edit, or repair).", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "The MLLM-as-Judge metric is validated against human evaluation (95.54%/91.89% accuracy), but the choice of CLIP and SSIM as visual metrics is not justified against alternatives, and edge cases in CMLS/CMCS Jaccard-based scoring are not discussed.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "The paper includes a post-hoc contamination check via BLEU scores (Section 5.3) but no design-level contamination resistance measures such as temporal splits, canary strings, or dynamic item generation are used.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of benchmark longevity, saturation risk, or update plans; top frontier models already achieve CLIP >0.80 on vanilla HTML tasks, suggesting ceiling risk as models continue to improve.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "The paper extensively documents failure modes of MLLMs (22 failure categories in RQ6) but does not analyze failure modes of the benchmark itself — e.g., whether CLIP scores can be gamed, whether the MLLM judge has systematic biases, or what aspects of UI quality the benchmark fails to capture.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "Full code, data, annotation guidelines, prompts, and evaluator implementations are available at https://github.com/WebPAI/DesignBench, enabling reproduction of all reported numbers.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "Section 4.2 describes data sources (GitHub, Moz Top 500, V0, Vue0, WebCode2M), collection tools (single-file-cli, Selenium), annotation process with PhD annotators and majority voting, and sample counts per task and framework; detailed guidelines are on GitHub.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The GitHub repository is publicly linked but no license is specified in the paper or discussed; the copyright status of content collected from top-500 commercial websites and GitHub projects is not addressed.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "Section 8 provides guidance on using findings, but there is no explicit statement of what benchmark results should NOT be interpreted as (e.g., not a measure of general coding ability, not a proxy for production-readiness of MLLM-generated code).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "MLLMs perform substantially worse on framework-based development (React, Vue, Angular) than on vanilla HTML/CSS", + "evidence": "Table 5 and Fig. 6: vanilla HTML achieves CLIP scores >0.72 and near-perfect compilation rates, while Angular achieves CLIP 0.45-0.55 and compilation 0.60-0.76 for top models", + "supported": "strong" + }, + { + "claim": "Larger model variants consistently outperform smaller variants within the same family across all tasks and frameworks", + "evidence": "Table 5 shows consistent performance gaps for Llama-90B vs 11B, Pixtral-124B vs 12B, and Qwen-72B vs 7B across all metrics", + "supported": "strong" + }, + { + "claim": "Code-only input outperforms image-only input for edit and repair tasks; multimodal combination provides minimal additional improvement", + "evidence": "Table 7: top models score 8.40-8.43 (code-only) vs 7.37-7.67 (image-only) for Design Edit; similar pattern for Design Repair with 6.53-6.70 vs 5.47-5.81", + "supported": "strong" + }, + { + "claim": "MLLMs achieve only 27.1% average accuracy in identifying UI display issues across all frameworks", + "evidence": "Table 10 reports per-framework averages of 0.2972, 0.2205, 0.2275, 0.3403 on React, Vue, Angular, Vanilla, averaging 0.2714", + "supported": "strong" + }, + { + "claim": "MLLMs almost never use component-based design in React (0.24% adoption rate)", + "evidence": "Table 9 shows average component adoption rates of 0.24%, 5%, and 19% for React, Vue, and Angular respectively across nine models", + "supported": "strong" + }, + { + "claim": "Angular framework produces the worst compilation success rates among all frameworks", + "evidence": "Fig. 6 shows Angular compilation rates of 0.60-0.70 for top models vs >0.83 for React/Vue and perfect for Vanilla HTML", + "supported": "strong" + }, + { + "claim": "MLLMs can fix compilation errors in approximately 53% of cases across all frameworks", + "evidence": "Table 8 reports average repair rates of 0.53, 0.52, 0.53 for React, Vue, Angular based on a sample of 30 webpages with compilation errors", + "supported": "moderate" + } + ], + "methodology_tags": ["benchmark-eval"], + "key_findings": "DesignBench reveals that MLLMs perform significantly worse on framework-based web development (React, Vue, Angular) than on vanilla HTML/CSS, with Angular posing the greatest challenge due to TypeScript component architecture and MLLMs achieving only 60-76% compilation rates there. Code-only input consistently outperforms image+code multimodal input for edit and repair tasks, a counterintuitive finding suggesting current MLLMs underutilize visual information. MLLMs achieve only 27.1% average accuracy in identifying UI display issues and adopt component-based design patterns in under 1% of React generations, indicating fundamental gaps in framework-specific reasoning. A 22-type failure taxonomy is catalogued, with design repair showing the most severe limitations including high rates of no-repair attempts.", + "red_flags": [ + { + "flag": "GPT-4o self-evaluation bias", + "detail": "GPT-4o is used as the MLLM judge for Design Edit and Repair tasks (Section 5.4) while also being evaluated as a model, creating potential self-serving bias where GPT-4o judges its own outputs." + }, + { + "flag": "GPT-4o generates benchmark samples it is then evaluated on", + "detail": "146 Angular and vanilla HTML/CSS samples for the edit task were auto-translated by GPT-4o (Section 4.2); GPT-4o is then evaluated on those same samples, potentially giving it an advantage from familiarity with its own output style." + }, + { + "flag": "No human baseline on benchmark tasks", + "detail": "No human performance measurement is provided for generation, edit, or repair tasks, making it impossible to interpret whether 0.27 issue detection accuracy or CLIP scores around 0.70 represent near-human, far-below-human, or a reasonable baseline." + }, + { + "flag": "Floor effects unaddressed", + "detail": "Qwen-7B achieves near-zero CLIP scores (0.04-0.09) and zero MLLM scores on several framework-task combinations; the paper does not flag this as a benchmark discriminability concern." + }, + { + "flag": "CLIP/SSIM metrics unjustified for UI quality", + "detail": "CLIP semantic similarity and SSIM structural similarity are used as primary visual metrics without validation that they correlate with human judgment of UI generation quality; no ablation or correlation with MLLM judge scores is reported." + }, + { + "flag": "No funding disclosure", + "detail": "No acknowledgments or funding section is present despite API evaluation costs of approximately $52/model × 9 models = ~$470 minimum, plus data collection infrastructure." + }, + { + "flag": "No benchmark license", + "detail": "The GitHub repository is public but no license is specified; copyright status of content collected from Moz Top-500 commercial websites and GitHub projects is unaddressed, creating legal ambiguity for benchmark users." + }, + { + "flag": "Contamination check is post-hoc and weak", + "detail": "The contamination check relies solely on BLEU score between model outputs and original code being low (Section 5.3); low BLEU only shows models don't verbatim copy — it does not rule out training data advantage from exposure to the same or similar websites." + } + ], + "cited_papers": [ + { + "title": "Design2Code: How Far Are We From Automating Front-End Engineering?", + "relevance": "Direct predecessor benchmark using 484 real-world webpages for design-to-code evaluation; DesignBench extends this with multi-framework support and additional tasks" + }, + { + "title": "WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs", + "relevance": "Large-scale benchmark (20K samples) that DesignBench uses as a data source for vanilla HTML samples and as a comparison baseline" + }, + { + "title": "Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs", + "relevance": "Prior synthetic benchmark for HTML parsing evaluation; DesignBench addresses its limitation of relying on synthetic data" + }, + { + "title": "Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset", + "relevance": "Large synthetic benchmark (823K samples) for HTML code generation compared against in Table 1" + }, + { + "title": "Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach (DCGen)", + "relevance": "State-of-the-art MLLM-based UI code generation method; one of the key approaches DesignBench is designed to evaluate" + }, + { + "title": "pix2code: Generating code from a graphical user interface screenshot", + "relevance": "Foundational early benchmark for UI code generation using DSL; cited as the work DesignBench extends beyond" + }, + { + "title": "MLLM-as-a-judge: Assessing multimodal LLM-as-a-judge with vision-language benchmark", + "relevance": "Foundational work justifying the MLLM-as-judge evaluation methodology used in DesignBench for Design Edit and Repair scoring" + }, + { + "title": "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?", + "relevance": "Related multimodal code benchmark demonstrating the broader trend of evaluating MLLMs on visual software engineering tasks" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable for developers choosing MLLMs for front-end work; the code-only > multimodal finding changes how practitioners should prompt models for UI edit/repair." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that multimodal input adds no benefit over code-only for edit/repair tasks contradicts the premise of using multimodal models for visual UI tasks." + }, + "fear_safety": { + "score": 0, + "justification": "No safety or risk concerns; the paper evaluates front-end coding capability, not safety-critical or adversarial systems." + }, + "drama_conflict": { + "score": 1, + "justification": "Shows frontier models struggle significantly on framework-based tasks and nearly never use component patterns, but results are framed constructively rather than as indictment." + }, + "demo_ability": { + "score": 3, + "justification": "Full code, data, and evaluation scripts available on GitHub; anyone can run the benchmark evaluation against their preferred MLLM with provided prompts." + }, + "brand_recognition": { + "score": 2, + "justification": "Evaluates and ranks Claude-3.7, GPT-4o, and Gemini-2.0 — high-profile brands that drive reader interest even though the authoring institution (CUHK) is not a top-tier AI lab." + } + }, + "hn_data": { + "threads": [ + {"hn_id": "44148662", "title": "Beyond Attention: Toward Machines with Intrinsic Higher Mental States", "points": 67, "comments": 19, "url": "https://news.ycombinator.com/item?id=44148662"}, + {"hn_id": "37070323", "title": "Transformative AGI by 2043 is <1% likely", "points": 33, "comments": 41, "url": "https://news.ycombinator.com/item?id=37070323"}, + {"hn_id": "43667963", "title": "Transfer between Modalities with MetaQueries", "points": 25, "comments": 12, "url": "https://news.ycombinator.com/item?id=43667963"}, + {"hn_id": "43628028", "title": "NNN: Next-Generation Neural Networks for Marketing Mix Modeling", "points": 25, "comments": 3, "url": "https://news.ycombinator.com/item?id=43628028"}, + {"hn_id": "44859559", "title": "Modern Methods in Associative Memory", "points": 5, "comments": 1, "url": "https://news.ycombinator.com/item?id=44859559"}, + {"hn_id": "36306353", "title": "Transformative AGI by 2043 is <1% likely", "points": 3, "comments": 4, "url": "https://news.ycombinator.com/item?id=36306353"}, + {"hn_id": "46908281", "title": "LLMs do plan before they genenrate tokens", "points": 3, "comments": 0, "url": "https://news.ycombinator.com/item?id=46908281"}, + {"hn_id": "44236081", "title": "Geopolitical biases in LLMs", "points": 2, "comments": 0, "url": "https://news.ycombinator.com/item?id=44236081"}, + {"hn_id": "44556736", "title": "ASK HN: Why Google's Gemini 2.5 paper has 3295 authors?", "points": 2, "comments": 4, "url": "https://news.ycombinator.com/item?id=44556736"}, + {"hn_id": "44256016", "title": "Can Theoretical Physics Research Benefit from Language Agents?", "points": 1, "comments": 0, "url": "https://news.ycombinator.com/item?id=44256016"} + ], + "top_points": 67, + "total_points": 166, + "total_comments": 84 + } +} diff --git a/papers/designing-llmbased-multiagent-2025/scan-v5.json b/papers/designing-llmbased-multiagent-2025/scan-v5.json @@ -0,0 +1,352 @@ +{ + "scan_version": 5, + "paper_type": "survey", + "paper": { + "title": "Designing LLM-based Multi-Agent Systems for Software Engineering Tasks: Quality Attributes, Design Patterns and Rationale", + "authors": [ + "Yangxiao Cai", + "Ruiyin Li", + "Peng Liang", + "Mojtaba Shahin", + "Zengyang Li" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2511.08475", + "doi": "10.48550/arXiv.2511.08475" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four main claims in the abstract (Code Generation 47.9%, Functional Suitability 94.7%, Role-Based Cooperation most common pattern at 46.8%, Improving Code Quality most common rationale at 44.7%) are directly supported by data tables in Sections 4.1–4.4.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The implications section makes causal-adjacent recommendations (e.g., 'Role-Based Cooperation can improve Maintainability') but the evidence is observational co-occurrence frequency from a literature mapping, which cannot support causal inference.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Implications are framed as general design guidance without explicitly bounding conclusions to the 94-paper corpus from before September 2024 or the specific limited source databases (two surveys + arXiv only).", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The dominance of code generation (47.9%) could reflect benchmark availability or publication incentives rather than actual designer priorities; the paper does not consider such alternative explanations for the observed distributions.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The paper uses frequency of mentions in papers as a proxy for what designers 'mainly focus on' without acknowledging the gap between paper-level reporting and actual practitioner design priorities.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 'Threats on Validity' explicitly addresses construct validity, external validity, and reliability threats with dedicated discussion of each.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "The threats section cites specific mitigations: pilot extraction with 5 randomly selected papers, multi-round discussions among three named authors, and named specific sources (Liu et al. and Wang et al. surveys plus arXiv SE category).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper states the date cutoff (Sept 30, 2024) and inclusion criteria but does not explicitly articulate what the results do NOT show, or bound the implications to the studied corpus and time period.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed in acknowledgments: NSFC grants 62402348 and 62172311, and the Major Science and Technology Project of Hubei Province under Grant No. 2024BAA008.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly listed: Wuhan University (Cai, Li, Liang), RMIT University (Shahin), and Central China Normal University (Z. Li).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Funders are Chinese government science foundations (NSFC and Hubei Province), which have no direct commercial stake in the paper's taxonomic findings about LLM-based MAS design patterns.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests declaration is present; there is no statement about patents, equity, or consulting relationships anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "LLM-based MASs are defined as 'multiple autonomous agents that collaborate through communication and responsibility specialization'; Quality Attributes are categorized per ISO/IEC 25010:2023; Design Patterns are defined as 'reusable solutions that help balance quality attributes'.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are listed in the introduction: (1) identifying SE tasks and QAs, (2) identifying design patterns and rationale, and (3) establishing mapping relationships among all four dimensions.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 provides structured review of related work across three subsections (LLM-based MASs for SE, MAS characteristics, prior surveys), and Section 2.4 explicitly articulates the research gap distinguishing this study from six named prior surveys.", + "source": "haiku" + } + } + }, + "type_checklist": { + "survey": { + "search_and_selection": { + "search_strategy_reproducible": { + "applies": true, + "answer": true, + "justification": "The search query is provided ('large language model' AND 'agent' in arXiv SE category), the date cutoff (September 30, 2024) is specified, and the two prior surveys used as seed sources are named with citations.", + "source": "haiku" + }, + "inclusion_exclusion_explicit": { + "applies": true, + "answer": true, + "justification": "Three explicit inclusion criteria are stated: papers must introduce at least one MAS, agents must leverage LLMs, and agents must address at least one SE task; these are applied consistently.", + "source": "haiku" + }, + "prisma_or_structured_protocol": { + "applies": true, + "answer": false, + "justification": "No PRISMA or other named systematic review protocol is mentioned; the paper describes a custom 4-phase process without following an established review methodology or reporting guideline.", + "source": "haiku" + }, + "search_terms_provided": { + "applies": true, + "answer": true, + "justification": "The search query is explicitly provided verbatim: (\"large language model\" AND \"agent\") applied to the arXiv SE category.", + "source": "haiku" + }, + "databases_listed": { + "applies": true, + "answer": true, + "justification": "Sources are explicitly listed: Liu et al. 2024 survey, Wang et al. 2025 survey, and arXiv SE category; though major academic databases (IEEE Xplore, ACM DL, Scopus) were not searched.", + "source": "haiku" + }, + "screening_process_documented": { + "applies": true, + "answer": true, + "justification": "Counts are provided at each major stage: 118 (Liu et al.) + 115 (Wang et al.) + 194 (arXiv) = 427 total, 236 after deduplication, 94 after applying inclusion criteria.", + "source": "haiku" + }, + "review_scope_justified": { + "applies": true, + "answer": false, + "justification": "The paper justifies using arXiv and two surveys for comprehensiveness but does not justify excluding major academic databases (IEEE Xplore, ACM DL, Scopus), which is a significant coverage limitation left unaddressed.", + "source": "haiku" + } + }, + "synthesis_quality": { + "conflicting_findings_acknowledged": { + "applies": true, + "answer": false, + "justification": "The paper catalogues patterns across 94 papers without discussing contradictions or conflicts in findings among reviewed papers (e.g., papers advocating different patterns for the same SE task produce no tension analysis).", + "source": "haiku" + }, + "quality_assessment_of_sources": { + "applies": true, + "answer": false, + "justification": "No quality rubric, risk-of-bias assessment, or structured quality evaluation is applied to the 94 included papers; all papers are treated as equally valid sources regardless of methodological rigor.", + "source": "haiku" + }, + "publication_bias_discussed": { + "applies": true, + "answer": false, + "justification": "Publication bias is not discussed; the paper does not acknowledge that arXiv preprints and conference publications skew toward positive results, or that this may inflate apparent prevalence of certain design patterns.", + "source": "haiku" + }, + "quantitative_synthesis_present": { + "applies": true, + "answer": true, + "justification": "The paper provides frequency counts and percentages across all four taxonomies and cross-dimensional mapping tables (e.g., Table 6 mapping QA sub-characteristics against SE tasks with paper counts), constituting vote-counting synthesis.", + "source": "haiku" + }, + "recommendations_supported_by_evidence": { + "applies": true, + "answer": true, + "justification": "Each of the six implications in Section 5.2 references specific frequency findings (e.g., Implication 2 on Role-Based Cooperation cites the co-occurrence with Modularity across 46 papers), tying recommendations to observed data.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Code Generation is the most common SE task addressed by LLM-based MASs at 47.9% of papers", + "evidence": "45 of 94 papers focus on code generation, with all study references listed in Table 2", + "supported": "strong" + }, + { + "claim": "Functional Suitability is the most frequently considered quality attribute, appearing in 94.7% of papers", + "evidence": "89 of 94 papers explicitly consider Functional Suitability, with Functional Correctness the most common sub-QA (86 papers); documented in Table 3", + "supported": "strong" + }, + { + "claim": "Role-Based Cooperation is the most commonly employed design pattern at 46.8% of papers", + "evidence": "44 of 94 papers use Role-Based Cooperation, documented in Table 4 with full study citations", + "supported": "strong" + }, + { + "claim": "Improving the Quality of Generated Code is the most common design rationale at 44.7% of papers", + "evidence": "42 of 94 papers cite this rationale, listed with study references in Table 5", + "supported": "strong" + }, + { + "claim": "End-to-end software lifecycle coverage remains nascent (7.4% development, 8.5% maintenance)", + "evidence": "Only 7 papers on end-to-end development and 8 on end-to-end maintenance; attributed to lack of cross-stage benchmarks and evaluation criteria", + "supported": "moderate" + }, + { + "claim": "Role-Based Cooperation improves Maintainability in LLM-based MASs", + "evidence": "Observational co-occurrence: papers using role-based cooperation also consider modularity; no controlled comparison or causal study design", + "supported": "weak" + } + ], + "methodology_tags": [ + "qualitative", + "meta-analysis" + ], + "key_findings": "A mapping study of 94 papers on LLM-based MASs for SE tasks reveals heavy concentration on Code Generation (48%), with Role-Based Cooperation (47%) and Self-Reflection (36%) as the dominant design patterns and Functional Correctness (92%) as the overwhelmingly prioritized quality attribute. Improving code quality is the primary design rationale (45%), with resource optimization and efficiency also prominent. End-to-end lifecycle support remains nascent (under 9% of papers) due to absent cross-stage benchmarks and evaluation criteria, while a taxonomy of 16 design patterns, 8 QAs, and 8 rationale categories provides the first structured mapping linking SE tasks to design decisions.", + "red_flags": [ + { + "flag": "Limited source databases", + "detail": "Review relies on only two prior surveys and arXiv SE category, omitting IEEE Xplore, ACM DL, and Scopus; venue-published work not captured as arXiv preprints may be systematically excluded." + }, + { + "flag": "No quality assessment of included papers", + "detail": "All 94 papers are treated as equally credible sources regardless of study design or methodological rigor, conflating high-quality empirical evaluations with unvalidated system descriptions." + }, + { + "flag": "No PRISMA or established review protocol", + "detail": "The review does not follow PRISMA or any named systematic review protocol, reducing reproducibility and making selection bias harder to assess." + }, + { + "flag": "Single-reviewer primary extraction without inter-rater reliability", + "detail": "The first author conducted primary data extraction independently, with post-hoc review by co-authors only; no Cohen's kappa or other formal inter-rater reliability measure is reported." + }, + { + "flag": "Causal framing of observational findings", + "detail": "Implications suggest patterns like Role-Based Cooperation 'improve' Maintainability, but evidence is observational co-occurrence frequency — no controlled evaluation of design pattern effects exists." + }, + { + "flag": "Publication bias unaddressed", + "detail": "ArXiv self-selection and conference publication bias toward positive results are not discussed, potentially skewing the taxonomy toward approaches that appear to work rather than those commonly attempted." + } + ], + "cited_papers": [ + { + "title": "Large Language Model-Based Agents for Software Engineering: A Survey (Liu et al. 2024)", + "relevance": "Primary seed source providing 118 of the initial paper pool; directly enables the starting corpus" + }, + { + "title": "Agents in Software Engineering: Survey, Landscape, and Vision (Wang et al. 2025)", + "relevance": "Second seed source providing 115 papers; co-constitutes the initial corpus" + }, + { + "title": "Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents (Liu et al. 2025)", + "relevance": "Used as classification framework for design patterns in RQ3; provides the starting taxonomy that this paper extends" + }, + { + "title": "MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework (Hong et al. 2023)", + "relevance": "Prominent example system cited across multiple findings including end-to-end development, role-based cooperation, and self-reflection" + }, + { + "title": "ChatDev: Communicative Agents for Software Development (Qian et al. 2024)", + "relevance": "Key example of end-to-end software development MAS, cited for chat-chain task decomposition and role assignment" + }, + { + "title": "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al.)", + "relevance": "Prominent MAS framework used as example across multiple QAs, design patterns (RAG, Role-Based Cooperation), and rationale" + }, + { + "title": "LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead (He et al. 2025)", + "relevance": "Related survey that this paper explicitly positions against to justify its distinct focus on design QAs and patterns rather than capabilities" + }, + { + "title": "ISO/IEC 25010:2023 Systems and software Quality Requirements and Evaluation", + "relevance": "Foundational standard used to categorize all quality attributes in RQ2; central methodological reference throughout the paper" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "The taxonomy of 16 design patterns, 8 QAs, and mapping tables provides actionable checklists for practitioners designing LLM-based MASs for specific SE tasks." + }, + "surprise_contrarian": { + "score": 0, + "justification": "Findings confirm conventional expectations: code generation dominates, role-based cooperation and self-reflection are most common — no surprising or contrarian results are surfaced." + }, + "fear_safety": { + "score": 0, + "justification": "Security is among the least-discussed QAs (10.6%) and no significant AI safety risks are surfaced by the survey findings." + }, + "drama_conflict": { + "score": 0, + "justification": "Purely descriptive taxonomy study with no controversy, competing claims between research groups, or inter-community conflict." + }, + "demo_ability": { + "score": 1, + "justification": "Dataset is publicly available on GitHub (Caiyangxiao/MASDesign) allowing inspection of extraction results, though the paper itself is not a runnable artifact." + }, + "brand_recognition": { + "score": 0, + "justification": "Authors are from Wuhan University, RMIT, and Central China Normal University — no famous lab or industry-affiliated product involved." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "29250329", + "title": "Free Will Belief as a Consequence of Model-Based Reinforcement Learning", + "points": 2, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=29250329", + "created_at": "2021-11-17T08:19:50Z" + }, + { + "hn_id": "33617429", + "title": "High-level synthesis for packet processing pipelines", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=33617429", + "created_at": "2022-11-16T00:50:17Z" + }, + { + "hn_id": "10672783", + "title": "Reverse Engineering Intel DRAM Addressing and Exploitation", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=10672783", + "created_at": "2015-12-03T21:28:18Z" + } + ], + "top_points": 2, + "total_points": 6, + "total_comments": 4 + } +} +\ No newline at end of file diff --git a/papers/detecting-adversarial-finetuning-2025/scan-v5.json b/papers/detecting-adversarial-finetuning-2025/scan-v5.json @@ -0,0 +1,562 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detecting Adversarial Fine-tuning with Auditing Agents", + "authors": [ + "Sarah Egler", + "John Schulman", + "Nicholas Carlini" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2510.16255", + "doi": "10.48550/arXiv.2510.16255" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Key abstract claims (56.2% detection rate at 1% FPR, detection of covert cipher attacks, failure of basic content moderation, over 1400 audits) are directly supported by Table 1, Figure 6, Appendix C, and the experimental setup description.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper uses systematic ablation of tool configurations (9 configurations in Table 1) to support causal claims about which affordances improve detection, which is adequate for this type of agent system study.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The conclusion explicitly acknowledges that '50% true positive rate at 1% false positive rate means attackers can still frequently evade the detector,' and the evaluation is bounded to 8 attacks from the literature using GPT models via OpenAI API.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether detection success could stem from Claude's safety training creating bias toward flagging cipher-like content, or whether the auditor's performance reflects pattern matching rather than genuine reasoning about harm.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly distinguishes between the risk score (proxy) and actual adversarial fine-tuning (outcome), and explicitly discusses how benign models can trigger high risk scores through unintentional safety degradation.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations section; limitations are scattered across the conclusion (Section 6) and adversarial robustness section (5.6) without consolidation.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Section 5.6 specifically identifies future evasion strategies (covert steganography, clean-label backdoors, semantic backdoor triggers, dataset poisoning to distract the auditor), and Section 5.3 gives specific false positive sources including low resource language and educational insecure code fine-tunes.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states scope: 'we seek to prevent deployment of adversarial models and leave additional inference time monitoring of fine-tuned models to future work,' bounded to SFT attacks on GPT models via OpenAI API.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Section 7 (Acknowledgments) discloses funding from the MATS program and Anthropic Fellows Program with additional support from Constellation.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are disclosed on the title page: Sarah Egler (MATS & Anthropic Fellows Program), John Schulman (Thinking Machines Lab), Nicholas Carlini (Anthropic).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Anthropic is both a funder and employer of co-author Carlini; the paper uses Claude Sonnet 4 as the primary auditing agent and reports it outperforms competitor models (o3, Qwen 2.5 72b), creating a direct conflict between funder interests and reported outcomes.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no competing interests statement or declaration of financial interests (patents, equity, consulting); the acknowledgments section covers only funding sources.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: 'fine-tuning auditing agent' (Section 3), 'attack-specific elicitation' (Section 2), and the threat model (Section 2.1) precisely defines the adversary and defender roles and assumptions.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states its contribution: introducing fine-tuning auditing agents, demonstrating 56.2% detection on 8 diverse attacks, and releasing the auditor as a baseline for future work.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 substantively engages with prior adversarial fine-tuning attacks, content moderation defenses, and alignment auditing agents (Bricken et al., Marks et al.), explicitly distinguishing this work by its false-positive constraint and access to the pre-fine-tuned model.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The auditing agent is released at https://github.com/safety-research/finetuning-auditor as stated in the abstract—a current release, not a future promise.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All adversarial fine-tuning datasets are from publicly available literature sources, with Appendix B providing full dataset links, base models, and hyperparameters; the HEx-PHI evaluation benchmark is also publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions Python and LangChain but provides no requirements.txt, Dockerfile, or explicit dependency specification in the paper text; reproducibility depends on what the GitHub repository contains.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "The Reproducibility Statement commits to open-sourcing code, Appendix A provides full auditor prompts, Appendix D provides tool implementations, and Appendix B provides fine-tuning details; though full reproduction requires paid OpenAI API access (~$120/fine-tune).", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 1 reports only point estimates for detection rates across all tool configurations; no confidence intervals or error bars are provided despite 20 independent audits per condition.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used when comparing detection rates across tool configurations or auditor models, despite making comparative claims such as 'super-agent significantly outperforms single-agent.'", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Detection rates at multiple FPR thresholds (0%, 1%, 5%) and AUROC values are reported in Table 1, providing quantitative effect sizes for comparisons across tool configurations.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 20 independent audits per fine-tune per configuration is not formally justified; no power analysis or reasoning for this number is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Figure 5 shows risk score distributions visually, but the main results in Table 1 report only point estimates without standard deviations or confidence intervals, making reliability of comparisons unassessable.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "The OpenAI Moderation API serves as a content filtering baseline (Appendix C shows it fails on most attacks), and single-tool configurations in Table 1 serve as ablation baselines for the full tool set.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "The OpenAI moderation API baseline uses 'omni-moderation-latest'; Youstra (2024) probe monitors for cipher attacks are mentioned as contemporary prior work in the same space.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 1 presents a comprehensive ablation of 9 tool configurations (D, M, B, M+D, M+D+B, M+D+B*, M+D+S, M+D+B*+S, M+D+B*+S+P) systematically isolating each component's contribution.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: TPR at 0%, 1%, and 5% FPR thresholds, AUROC (Table 1), and per-attack detection rates (Figure 6).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable to this automated detection system; the task is automated binary classification of fine-tuned models, not a human judgment task.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The 13 fine-tuned models constitute an evaluation set; the auditor (Claude Sonnet 4) is not fine-tuned on these models, so there is no training/test contamination within the auditing framework itself.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Figure 6 provides per-attack detection rates for all 8 adversarial fine-tunes at the 1% FPR threshold, and Figure 10 (appendix) shows per-model risk scores across all tool configurations.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.3 analyzes false positives on benign fine-tunes (Kashmiri, educational insecure code, Alpaca), and Section 5.2 discusses why subliminal learning is the most elusive attack to detect.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that model querying alone achieves only 1.4% detection at 1% FPR, that 'more tools does not result in strictly better performance,' and that M+D+B*+S underperforms M+D+B* alone at 1% FPR.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact version IDs are specified for the auditor (Claude Sonnet 4) and all fine-tuned base models (gpt-4.1-2025-04-14, gpt-4o-2024-08-06, gpt-3.5-turbo-1106) in Tables 4 and 6.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Full auditor system prompt (A.1), user prompt (A.2), super-agent system prompt (A.3), and super-agent user prompt (A.4) are provided verbatim in Appendix A.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Fine-tuning epochs are reported in Table 4, but temperature, top-p, and other sampling hyperparameters for the Claude Sonnet 4 auditing agent are not reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Agentic scaffolding is described in detail in Section 3 (six tools, agent loop mechanism) and Appendix D (full Python tool implementations for all six tool types).", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Fine-tuning data format (JSONL chat completions) is documented in Appendix D, the recursive summarization pipeline (chunk size, prompts) is in Appendix D.4, and HEx-PHI benchmark administration is described in Section 3.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The raw audit transcripts and risk scores across all 1400+ audits are not released in a dataset format; only selected transcript excerpts are shown in the appendix.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 4 describes the experimental setup including 20 independent audits per fine-tune per configuration, attack success rate verification before inclusion, and GPT judge methodology for the HEx-PHI benchmark.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; all evaluations are conducted via automated API calls to LLMs.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The complete pipeline from fine-tuning (dataset → OpenAI API → fine-tuned model) through auditing (tool calls → risk score → threshold classification) is documented across Sections 3-4 and Appendices B-D.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Claude Sonnet 4's training data cutoff is not stated; HEx-PHI (sourced from Anthropic red-teaming data) may have been in Claude's training data, which is not acknowledged.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether HEx-PHI benchmark prompts, known adversarial fine-tuning techniques, or related datasets were in Claude Sonnet 4's training data, which could inflate detection performance.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "HEx-PHI is sourced from Anthropic red-teaming data and Claude Sonnet 4 is an Anthropic model; potential contamination of the evaluation benchmark is not addressed anywhere in the paper.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants; pre-registration is not applicable.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; IRB approval is not applicable. The ethics statement addresses dual-use concerns about replicating attacks.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants; demographics are not applicable.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants; inclusion/exclusion criteria are not applicable.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants; randomization in the human studies sense is not applicable.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants; blinding is not applicable.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants; attrition is not applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Appendix H reports that 'the average single-agent audit with Claude Sonnet 4 lasts approximately 24 turns over several minutes and costs approximately $1,' enabling practical deployment cost estimation.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Appendix H states the average fine-tune cost was $120 via OpenAI API; per-unit audit cost (~$1) and fine-tuning cost are provided, though total study budget is not explicitly summed.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Fine-tuning auditing agents achieve 56.2% detection rate at 1% false positive rate across 8 adversarial attacks", + "evidence": "Table 1: M+D+B*+S+P configuration achieves 56.2% TPR at 1% FPR across 20 independent audits per fine-tune", + "supported": "strong" + }, + { + "claim": "All 8 adversarial attacks have non-zero detection rates at 1% FPR with the full tool set", + "evidence": "Figure 6 shows per-attack detection rates at threshold τ=9 (1% FPR); each attack has some non-zero detection", + "supported": "strong" + }, + { + "claim": "Standard content moderation cannot detect attacks where each training datapoint is individually harmless", + "evidence": "Appendix C and Figure 9 show OpenAI Moderation API flags 0 examples from most adversarial fine-tuning datasets", + "supported": "strong" + }, + { + "claim": "Super-agent approach significantly outperforms single-agent at 0% FPR (49.4% vs 5.6%)", + "evidence": "Table 2 directly compares super-agent and single-agent with D+M affordances; large TPR gap at 0% FPR with no statistical test", + "supported": "moderate" + }, + { + "claim": "Dataset inspection is the most useful single affordance, outperforming model querying and benchmark running alone", + "evidence": "Table 1: D alone achieves 23.1% TPR at 1% FPR vs 19.4% for B alone and 1.4% for M alone", + "supported": "strong" + }, + { + "claim": "Claude Sonnet 4 outperforms o3 and Qwen 2.5 72b as an auditing agent", + "evidence": "Figure 13 and Section 5.5 compare models on D+M configuration only; Claude achieves higher TPR but also more false positives", + "supported": "moderate" + }, + { + "claim": "Auditing agents can learn ciphers in-context from training data to perform attack-specific elicitation", + "evidence": "Figures 2 and 7 show auditor successfully learning Walnut53 and EndSpeak ciphers from dataset samples and querying models in cipher to elicit harmful responses", + "supported": "strong" + }, + { + "claim": "Benign fine-tuning can unintentionally degrade safety, creating false positives for the auditor", + "evidence": "Section 5.3 and Appendix I show Alpaca, educational insecure code, and Kashmiri fine-tunes receiving elevated risk scores with justified reasoning", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "Fine-tuning auditing agents powered by Claude Sonnet 4 can detect adversarial fine-tuning at 56.2% detection rate with only 1% false positive rate, successfully identifying all 8 attack types including covert cipher attacks that completely evade content moderation. Dataset inspection is the most critical affordance, enabling the auditor to learn attack-specific elicitation strategies in-context from training examples. A super-agent approach aggregating 10 sub-audits dramatically improves performance at 0% FPR (49.4% vs 5.6% single-agent). However, the ~50% detection ceiling is insufficient for production deployment, benign fine-tunes with unintentional safety degradation represent a persistent false positive challenge, and the paper's conflict of interest (Anthropic-funded, Claude-as-auditor, HEx-PHI from Anthropic red-teaming data) is not addressed.", + "red_flags": [ + { + "flag": "No variance on main results", + "detail": "Table 1 reports only point estimates for detection rates across 9 tool configurations despite 20 independent audits per condition; no confidence intervals or standard deviations are reported, making it impossible to assess reliability of differences between configurations." + }, + { + "flag": "Anthropic conflict of interest unaddressed", + "detail": "The paper is funded by Anthropic, co-authored by an Anthropic researcher (Carlini), uses Claude Sonnet 4 as the primary auditor, and reports Claude outperforms o3 and Qwen—without an explicit competing interests statement or acknowledgment of this conflict." + }, + { + "flag": "HEx-PHI benchmark contamination risk", + "detail": "The evaluation benchmark is sourced from Anthropic red-teaming data; Claude Sonnet 4 (Anthropic model) may have been trained on or otherwise exposed to these prompts, potentially inflating detection performance in ways not discussed." + }, + { + "flag": "Small evaluation set", + "detail": "Only 8 adversarial and 5 benign fine-tuned models are evaluated, all using GPT model families via OpenAI API; detection rates may not generalize to other model families, fine-tuning providers, or novel attack vectors." + }, + { + "flag": "No significance tests on comparisons", + "detail": "Comparative claims (e.g., 'super-agent significantly outperforms single-agent,' Claude 'more consistently' outperforms o3) are made without statistical significance tests, despite having sufficient repeated measurements (20 audits per condition) to conduct them." + } + ], + "cited_papers": [ + { + "title": "Fine-tuning aligned language models compromises safety, even when users do not intend to!", + "relevance": "Foundational work demonstrating adversarial fine-tuning bypasses safety training; motivates detection mechanisms and introduces the AOA attack evaluated here" + }, + { + "title": "Covert malicious finetuning: Challenges in safeguarding LLM adaptation", + "relevance": "Introduces Walnut53 and EndSpeak cipher-based covert attacks that this paper attempts to detect; establishes pointwise-undetectability as the core challenge" + }, + { + "title": "Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs", + "relevance": "Introduces the emergent misalignment and backdoored insecure code attacks included in the evaluation set" + }, + { + "title": "Sleeper agents: Training deceptive LLMs that persist through safety training", + "relevance": "Introduces backdoor 'sleeper agent' attacks included in the evaluation; provides theoretical grounding for trigger-based misalignment" + }, + { + "title": "Harmful fine-tuning attacks and defenses for large language models: A survey", + "relevance": "Survey contextualizing the adversarial fine-tuning threat landscape from which attacks are drawn" + }, + { + "title": "Fundamental limitations in defending LLM fine-tuning APIs", + "relevance": "Prior work establishing pointwise-undetectable attacks as the key defense challenge; this paper directly addresses that limitation" + }, + { + "title": "Building and evaluating alignment auditing agents", + "relevance": "Direct methodological predecessor using LLM auditing agents for alignment research; this paper adapts the super-agent approach for fine-tuning API defense" + }, + { + "title": "Auditing language models for hidden objectives", + "relevance": "The 'Auditing Game' that inspired this work; establishes the blue-team auditing paradigm applied here to fine-tuning API defense" + }, + { + "title": "Subliminal learning: Language models transmit behavioral traits via hidden signals in data", + "relevance": "Introduces the subliminal learning attack (the most evasive attack in the evaluation) using a misaligned teacher model" + }, + { + "title": "No, of course I can! Deeper fine-tuning attacks that bypass token-level safety mechanisms", + "relevance": "Introduces the NOICE prompt-based jailbreak attack evaluated in this paper" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses a real production threat faced by all major LLM providers (OpenAI, Anthropic, Google) who expose fine-tuning APIs, with released code and per-audit cost estimates enabling immediate deployment evaluation." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that an LLM can learn substitution ciphers in-context from training data and use them for attack-specific elicitation is surprising; that more tools don't always improve performance challenges intuitive assumptions about capability scaling." + }, + "fear_safety": { + "score": 3, + "justification": "Demonstrates that adversarial fine-tuning creates cipher-capable 'sleeper agent' models undetectable by content moderation, with a ~50% detection ceiling suggesting current defenses are inadequate for production deployment of fine-tuning APIs." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild conflict of interest angle (Anthropic-funded work finding Claude beats competitors on an Anthropic benchmark), but the paper is technically focused and not presented controversially." + }, + "demo_ability": { + "score": 2, + "justification": "Code is released on GitHub and audit cost is ~$1, making it feasible to try; however, full reproduction requires paid OpenAI fine-tuning API access at ~$120 per fine-tuned model." + }, + "brand_recognition": { + "score": 3, + "justification": "Authors include Nicholas Carlini (prominent ML security researcher at Anthropic) and John Schulman (OpenAI co-founder, now Thinking Machines Lab), lending high credibility and likely significant community attention." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "41929456", + "title": "Quantum inspired factorization up to 100-bit RSA number in polynomial time [pdf]", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41929456", + "created_at": "2024-10-23T21:34:43Z" + }, + { + "hn_id": "41933882", + "title": "Quantum inspired factorization up to 100-bit RSA number in polynomial time", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41933882", + "created_at": "2024-10-24T09:46:08Z" + }, + { + "hn_id": "41921364", + "title": "Assessing the Performance of Human-Capable LLMs – Are LLMs Coming for Your Job?", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41921364", + "created_at": "2024-10-23T03:05:13Z" + }, + { + "hn_id": "41914405", + "title": "Loss of 12 Starlink Satellites Due to the Extreme Geomagnetic Storm of May 2024", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41914405", + "created_at": "2024-10-22T14:04:12Z" + }, + { + "hn_id": "38177348", + "title": "CleanCoNLL: A Nearly Noise-Free Named Entity Recognition Dataset", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38177348", + "created_at": "2023-11-07T14:47:31Z" + }, + { + "hn_id": "38163590", + "title": "Multi-Structure Objects Points-To Analysis", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38163590", + "created_at": "2023-11-06T15:07:37Z" + } + ], + "top_points": 4, + "total_points": 9, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/detecting-benchmark-contamination-2025/scan-v5.json b/papers/detecting-benchmark-contamination-2025/scan-v5.json @@ -0,0 +1,506 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detecting Benchmark Contamination Through Watermarking", + "authors": [ + "Tom Sander", + "Pierre Fernandez", + "Saeed Mahloujifar", + "Alain Durmus", + "Chuan Guo" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2502.17259", + "doi": "10.48550/arXiv.2502.17259" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are verified in the body: benchmark utility preservation (Fig 3a, Table 1), contamination detection effectiveness (Table 1, p-val 10^-3 for +5% on ARC-Easy), and pre-training experiments with 1B models on 10B tokens are all substantiated.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper uses controlled pre-training experiments where the only variable is benchmark contamination (injected between steps 2500–7500), which is adequate experimental design for causal inference about the contamination→radioactivity mechanism.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Claims are bounded to multiple-choice QA benchmarks (ARC-Easy, ARC-Challenge, MMLU); the limitations section explicitly notes math/code questions pose rephrasing challenges and the method is designed primarily for unintentional contamination.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Proposition 1 explicitly distinguishes between memorizing the watermark vs. artificially enhanced performance, acknowledging that a model could perform well without overfitting to watermark biases, in which case the test would fail.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly explains that the test detects watermark memorization (radioactivity) as a proxy for contamination, and Proposition 1 explicitly discusses when this proxy can and cannot be equated to actual performance inflation.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5 is titled 'Limitations & Conclusion' and contains a dedicated limitations section with two specific bullet points covering rephrasing impact and intentional evasion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: (1) rephrasing may cause coherence loss in some questions (shown in Figure 6), and (2) malicious actors could rephrase questions or train only on answers conditioned on questions to bypass radioactivity detection.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states the method is primarily designed for unintentional contamination, and demonstrates that smaller benchmarks like ARC-Challenge yield lower detection confidence at low contamination levels.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement appears in the paper. Authors are affiliated with Meta FAIR and École polytechnique CMAP but no explicit acknowledgment of funding sources is provided.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the first page: Meta FAIR and École polytechnique CMAP.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Most authors are from Meta FAIR; Meta develops and trains large language models (Llama series), giving them a direct interest in contamination detection research that affects LLM evaluation credibility and defends against contamination accusations.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Benchmark contamination, radioactivity, the red/green list watermarking scheme, and the statistical test (p-value, FPR, binomial null hypothesis) are all precisely defined in sections 2 and 3.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are enumerated in the introduction: benchmark rephrasing with watermarking, extending watermark radioactivity to pre-training setup, and a new detection algorithm for different tokenizers.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 substantively engages with prior contamination detection methods (membership inference, context completion checks) and explains why they fail, situating the watermarking approach as a distinct solution to a gap.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code release is mentioned. The paper references Meta Lingua as the training library but does not release their watermarking or detection code.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The watermarked benchmark versions (the novel artifact of this work) are not mentioned as being released; while original benchmarks are public, the modified watermarked versions are not.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Hardware (A-100 GPUs) and the Meta Lingua library are mentioned, but no requirements file, Dockerfile, or package version specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Training hyperparameters and experimental setup are described in prose but no step-by-step reproduction instructions or executable scripts are provided.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Accuracy results in Table 1 and Figure 3 are single point estimates without CIs or error bars; p-values for detection are reported but are not CIs on the benchmark accuracy metrics.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "A binomial statistical test is used throughout with p-values explicitly reported for all contamination detection experiments, grounded in Proposition 1's formal proof.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Accuracy improvements are reported as absolute percentage gains (e.g., +4.3%, +9.5%, +18.2% on ARC-Easy) relative to an uncontaminated baseline, constituting effect size reporting.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Benchmark sizes (1172, 2372, 5000 questions) are noted and their impact on detection sensitivity is analyzed, but no power analysis or formal sample size justification is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Each experiment appears to be a single training run; no variance, standard deviation, or repeated-run statistics are reported for accuracy or detection metrics.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Canary insertion (as used in BIG-bench) is included as a baseline comparison in Section 4.3 and Table 3, demonstrating the superiority of watermark radioactivity.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Canary insertion is a legitimate baseline from the contamination detection literature and the comparison fairly reflects its limitations.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 4.3 provides explicit ablations on watermark strength δ, window size k, model size (135M/360M/1B), benchmark size, and rephrasing model size (8B vs 70B).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: detection p-value (log10), benchmark accuracy (absolute and delta), proportion of green tokens, and number of scored tokens after deduplication.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable; benchmark contamination detection is fully automated.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "OOD evaluation uses a different prompt template from the one seen during contamination, serving as a held-out evaluation that measures genuine knowledge transfer vs. template memorization.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by benchmark (ARC-Easy, ARC-Challenge, MMLU*), contamination level, watermark strength, window size, and model size in separate figures and tables.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Figure 6 explicitly shows a failure case where the 8B rephrasing model fails on a math question (requiring the 70B model), and the limitations section discusses coherence loss in some rephrased questions.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "ARC-Challenge shows doubtful detection confidence with only 4 contaminations (log10 p = -1.2), and smaller models require more contaminations; these weak results are reported without omission.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions are specified: Llama-3.1-8B-Instruct for rephrasing, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B for utility evaluation, and SmolLM architectures for smaller ablation models.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figure 5 provides the exact system prompt and instruction used for benchmark rephrasing, including their full text.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Training hyperparameters are fully reported: learning rate 3×10^-3, weight decay 0.033, batch size 4, sequence length 4096, 10,000 steps, warmup 5,000 steps, gradient clip 1.0; watermarking parameters δ, k, and γ=50% are also specified.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; the paper trains standard language models and evaluates them directly.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The contamination injection procedure is precisely described: every 5000/#contaminations steps between steps 2500 and 7500, a benchmark batch replaces a DCLM batch, formatted as 'Question: {Q}\\nAnswer: {A}'.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Trained model weights and watermarked benchmark files are not released; raw experimental outputs cannot be independently verified.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Training data (DCLM corpus) is identified by citation, benchmark subsets used (random 5000 MMLU questions, full ARC sets) are described, and the contamination injection procedure is detailed.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants are involved; the study uses pre-existing benchmarks and trains models from scratch.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from benchmark selection → watermark embedding → contaminated pre-training → evaluation is documented with specific parameters, step numbers, and formatting templates.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "This paper trains models from scratch specifically to test contamination detection; it is not evaluating pre-existing model capabilities on benchmarks, so training cutoff is not applicable.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "The paper controls contamination deliberately as the experimental variable rather than treating it as an uncontrolled threat to validity.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Llama models used for utility evaluation may have benchmark data in pretraining, but this paper's experimental validity rests on the from-scratch trained models, not on those models being clean.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Appendix C explicitly states each radioactivity detection test took less than 30 minutes on a single GPU, processing up to 325k tokens for MMLU*.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Appendix C estimates approximately 10,000 GPU hours total for training all models, with pretraining of the 1B model taking ~6 hours on 64 A-100 GPUs.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Watermarking benchmark questions through LLM rephrasing preserves benchmark utility — models rank identically and achieve similar accuracy on watermarked and original versions.", + "evidence": "Figure 3a and Figure 7 show Llama-3.2-1B, 3.2-3B, and 3.1-8B achieve nearly identical accuracy on original and watermarked ARC-Easy/Challenge/MMLU* versions across all δ values including δ=4 (79% green tokens).", + "supported": "strong" + }, + { + "claim": "Benchmark contamination can be reliably detected via watermark radioactivity with p < 10^-3 when contamination inflates accuracy by ~5% on ARC-Easy and MMLU*.", + "evidence": "Table 1 shows 4 contaminations yield log10(p) = -3.0 with +4.3% OOD accuracy gain on ARC-Easy, and log10(p) = -5.7 with +5.1% on MMLU*, both at δ=4.", + "supported": "strong" + }, + { + "claim": "Canary insertion is inferior to watermark radioactivity, failing even with 10x more contaminations than the maximum tested in the radioactivity experiments.", + "evidence": "Table 3 shows that a 360M model trained with 160 MMLU* contaminations fails to sufficiently memorize a 64-digit canary, with best p-value 0.19 at step 10000.", + "supported": "moderate" + }, + { + "claim": "The detection method generalizes across different tokenizers with only modest loss in detection confidence.", + "evidence": "Table 4 shows reliable detection (log10 p = -7 to -15) across Llama-1/2, Gemma-1/2, Gemma-3 tokenizers when the watermark was embedded with Llama-3's tokenizer.", + "supported": "strong" + }, + { + "claim": "Larger benchmark size directly increases contamination detection confidence at equivalent contamination levels.", + "evidence": "Table 1 shows that at 8 contaminations, MMLU* (325k tokens) achieves log10(p) < -12 while ARC-Challenge (64k tokens) achieves only -4.5, despite similar OOD accuracy inflation (~10%).", + "supported": "strong" + }, + { + "claim": "Smaller models require more contamination batches to achieve the same benchmark accuracy gain but detection confidence per unit of accuracy gain is consistent across model sizes.", + "evidence": "Figure 4 shows 135M, 360M, and 1B models achieve similar detection confidence (~10^-5) after 16, 8, and 4 contaminations respectively, with comparable +6% accuracy gains.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "empirical", + "theoretical", + "benchmark-eval" + ], + "key_findings": "The paper introduces a proactive watermarking approach to detect benchmark contamination, where benchmarks are rephrased using a watermarked LLM before public release. Training on contaminated watermarked data leaves detectable radioactive traces in the model's next-token predictions, testable via a binomial statistical test with a provably controlled false positive rate. Controlled pre-training experiments confirm that contamination sufficient to boost accuracy by ~5% can be detected with p < 10^-3, while benchmark utility is preserved even at strong watermarking strengths. The method outperforms canary insertion and generalizes across different tokenizers, though it requires open-weight model access and is primarily designed for unintentional contamination.", + "red_flags": [ + { + "flag": "No code or data release", + "detail": "Neither the watermarked benchmark versions (the novel artifact) nor the detection code are released, making independent replication impossible despite adequate methodological description." + }, + { + "flag": "Single runs, no variance", + "detail": "All experiments appear to be single training runs without reporting variance, making it unclear how sensitive results are to random initialization, data ordering, or watermarking seed choice." + }, + { + "flag": "Meta FAIR conflict of interest undisclosed", + "detail": "Primary authors are from Meta FAIR, which trains and releases large language models (Llama series); no competing interests statement is present despite Meta having stakes in how contamination detection affects LLM evaluation credibility." + }, + { + "flag": "White-box access assumption limits applicability", + "detail": "The method requires open-weight access to the suspect model, which limits applicability to closed-source models like GPT-4 or Gemini — exactly the models most commonly suspected of contamination in practice." + }, + { + "flag": "Only 1B max model size in contamination experiments", + "detail": "All from-scratch pre-training uses at most 1B parameters, far smaller than production-scale models (7B–405B+); radioactivity behavior at scale with different learning dynamics is not demonstrated." + } + ], + "cited_papers": [ + { + "title": "A watermark for large language models", + "relevance": "Kirchenbauer et al.'s red/green list watermarking scheme is the direct technical foundation for the benchmark watermarking method used throughout." + }, + { + "title": "Watermarking makes language models radioactive", + "relevance": "Sander et al. (2024) is the direct predecessor establishing that fine-tuned models retain watermark traces; this paper extends radioactivity to the pre-training and benchmark contamination setting." + }, + { + "title": "Do membership inference attacks work on large language models?", + "relevance": "Duan et al. establish that traditional membership inference is ineffective for LLMs in realistic scenarios, directly motivating the watermarking approach as an alternative." + }, + { + "title": "A careful examination of large language model performance on grade school arithmetic", + "relevance": "Zhang et al. demonstrate benchmark contamination in practice via GSM8K performance drops, establishing the empirical problem this paper proposes to solve." + }, + { + "title": "Measuring massive multitask language understanding", + "relevance": "MMLU is one of the three primary benchmarks watermarked, evaluated, and tested for contamination detection in the paper." + }, + { + "title": "Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?", + "relevance": "Singh et al. survey contamination detection approaches and demonstrate limitations of decontamination methods, contextualizing the need for a proactive watermarking solution." + }, + { + "title": "Investigating data contamination for pre-training language models", + "relevance": "Jiang et al. use controlled contamination experiments to show that small models exhibit performance gains — experimental methodology closely parallel to this paper's setup." + }, + { + "title": "Rethinking benchmark and contamination for language models with rephrased samples", + "relevance": "Yang et al. show that training on reformulated questions is sufficient to boost performance on original benchmarks, establishing that rephrased-question contamination is a real threat." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Benchmark providers could deploy this method immediately before releasing new benchmarks, providing a concrete and actionable defense against contamination that requires no access to training data." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Using watermarking proactively rather than post-hoc detection is a novel reframing that challenges the prevailing membership inference and performance comparison approaches." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses a serious reliability problem: undetected benchmark contamination means SOTA claims across the field are potentially untrustworthy, undermining AI progress measurement at scale." + }, + "drama_conflict": { + "score": 1, + "justification": "The paper is a methodological contribution without naming specific models or labs as contamination offenders, limiting controversy potential despite the high-stakes topic." + }, + "demo_ability": { + "score": 1, + "justification": "The method is well described and reproducible in principle but no code or watermarked benchmarks are released, preventing immediate hands-on demonstration by others." + }, + "brand_recognition": { + "score": 3, + "justification": "Meta FAIR is a top-tier AI research lab; the paper comes from the team behind Llama models, lending significant credibility and visibility to the work." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43258481", + "title": "HumT DumT: Measuring and controlling human-like language in LLMs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43258481" + } + ], + "top_points": 1, + "total_points": 1, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/detecting-correcting-hallucinations-code-2026/scan-v5.json b/papers/detecting-correcting-hallucinations-code-2026/scan-v5.json @@ -0,0 +1,582 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis", + "authors": [ + "Dipin Khati", + "Daniel Rodriguez-Cardenas", + "Paul Pantzer", + "Denys Poshyvanyk" + ], + "year": 2026, + "venue": "FORGE '26 (IEEE/ACM Third International Conference on AI Foundation Models and Software Engineering)", + "arxiv_id": "2601.19106", + "doi": "10.1145/3793655.3793725" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims are supported: KCHs are explained with examples, existing mitigations are discussed in §1, and empirical results match stated performance (100% precision, 87.6% recall, 77.0% fix rate).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "The paper makes no causal claims about mechanisms ('X causes Y'); it demonstrates detection/correction works empirically but does not claim to explain why hallucinations occur or why the deterministic approach succeeds mechanistically.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds scope: 'limited to five Python libraries', 'single-file, function-level analysis', and 'error distribution may not reflect real-world prevalence', while noting potential extension to other languages with AST support.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper compares against existing approaches (constrained decoding, LLM-in-the-loop, deletion-based repair) but does not discuss alternative explanations for why their results hold (e.g., whether high precision is due to dataset properties, or whether 100% is inflated by easy cases).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The paper claims to measure 'fix accuracy' as 'functionally correct, runnable code' but the evaluation is non-executing. No mechanism for validating that corrected code is actually correct is described (e.g., no ground truth comparison, no human review, no execution).", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 4 'Discussion and Future Work' includes a dedicated limitations paragraph acknowledging dataset size, library scope, and architectural constraints.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are stated: '200-sample dataset is not exhaustive', 'Knowledge Base limited to five Python libraries', 'single-file analysis does not handle multi-module dataflows', and approach 'does not attempt to solve multi-line logical errors'.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit scope boundaries: targets KCHs only (API + identifier conflicts), evaluated on Python snippets, limited to five libraries, single-file function-level analysis, and not addressing complex multi-line logical errors.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source disclosed in abstract, body, or visible acknowledgments section.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All four authors list affiliation with William & Mary, and no evaluated product is developed by the authors or institution, so conflict is minimal.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder disclosed; NA.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of patents, equity, or consulting relationships provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: KCHs as 'code that flat-out contradicts the established, factual knowledge of a programming language or its libraries', AST parsing, and 'Dynamic Knowledge Base' via introspection are explained.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contribution is explicit: a deterministic post-processing framework for detecting AND correcting (not just deleting) KCHs in LLM code, positioned against prevention, LLM-in-the-loop, and deletion approaches.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "§1 and §5 systematically engage with prior work (taxonomy [11], KCH definition [6], prevention [8,10], LLM-in-the-loop [1,9], deletion [14], type-checkers [5]), positioning the deterministic correction approach as novel.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Paper states 'All data, code, and experimental configurations are publicly available in our replication package [3]' linking to https://github.com/WM-SEMERU/Hallucinations-in-Code.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The 200-sample evaluation dataset is stated to be in the replication package alongside code, enabling independent verification.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Python 3 is implied but no requirements.txt, Dockerfile, or dependency specifications provided. No virtual environment or package versions documented.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Paper provides high-level methodology but no step-by-step reproduction instructions. Replication package may contain these, but they are not included in the paper itself.", + "source": "haiku" + } + }, + "statistical_methodology": { + "applies": true, + "answer": false, + "justification": "No confidence intervals, error bars, or variance measures reported. Precision/recall/F1 are single point estimates without uncertainty quantification.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests, cross-validation, or bootstrapping reported. No p-values or hypothesis testing for comparative claims.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes provided: 100% precision, 87.6% recall, F1=0.934, 77% fix accuracy, with per-type and per-library breakdowns (Tables 3–4).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample size (n=200 total, 161 hallucinated, 39 clean) is not justified. No power analysis or rationale provided for why 200 is adequate.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Variance/std dev not reported. Only single point estimates; no repeated runs or error margins across samples or folds.", + "source": "haiku" + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": false, + "justification": "No empirical baselines compared. PICARD, Synchromesh, LLM-in-the-loop, and Structural Trimming are discussed but not experimentally evaluated.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": false, + "answer": false, + "justification": "NA—no baselines included.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation study. The system has four components (AST parsing, KB construction, validation, correction) but no variant testing (e.g., KB vs no KB).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Detection metrics (precision, recall, F1), correction accuracy, and per-category breakdowns (Tables 3–4 by type and library) provide multiple evaluation angles.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "Dataset is manually curated, but no human evaluation of system outputs or corrected code is reported.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Not a prediction task; no train/test split or held-out evaluation strategy described.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 3 breaks down by hallucination type (Missing Imports, Mis-typed API, Contextual Mismatches); Table 4 by library (numpy, pandas, matplotlib, json, requests).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Manual analysis of 37 failed cases (20 false negatives, 17 failed corrections) is discussed, revealing matplotlib.pyplot struggles and pandas correction weakness (56.2% vs 97.9% for imports).", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Lower performance on contextual mismatches (33.3% detect, 0% correct) and pandas (56.2% correction) is transparently reported, along with discussion of limitations.", + "source": "haiku" + } + }, + "setup_transparency": { + "applies": true, + "answer": false, + "justification": "Dataset generation via 'GPT-5 with task-oriented instructions' is mentioned but no actual prompts, model version (snapshot), or hyperparameters (temperature, top-p) provided for reproducibility.", + "source": "haiku" + }, + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "'GPT-5' is named but no API version, snapshot date, or configuration parameters given.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No actual prompts or instructions provided for GPT-5 data generation; only high-level description 'task-oriented instructions for five target libraries'.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, max_tokens, or other sampling parameters reported for GPT-5. No hyperparameters for the framework itself (O(n·m) complexity is noted but no tuning parameters).", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "NA—the framework is deterministic static analysis, not an agent with scaffolding.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Dataset construction is documented: curated to contain 161 hallucinated samples in three categories (Mis-typed APIs, Missing imports, Contextual mismatches) and 39 clean samples from five libraries.", + "source": "haiku" + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Paper claims 'All data, code, and experimental configurations are publicly available in our replication package [3]' on GitHub.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection via GPT-5 prompting is described; dataset composition (161 hallucinated, 39 clean) and categories are documented.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects; synthetic dataset from LLM prompting.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Framework pipeline is well-documented in §2: Static Analysis → Dynamic KB → Deterministic Validation → Automated Correction, with each component explained.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "NA—paper does not evaluate pre-trained models on benchmarks; it tests a deterministic tool on a synthetic dataset.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "NA—same as above.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "NA—same as above.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "NA—no human subjects.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Runtime is reported: 'end-to-end analysis of all 200 samples completed in under 0.2 seconds on a single laptop CPU', demonstrating practical efficiency.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No compute budget stated for dataset generation (GPT-5 API costs) or evaluation infrastructure.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Large Language Models frequently produce Knowledge Conflicting Hallucinations (KCHs)—semantic errors like non-existent API parameters that evade linters and cause runtime failures.", + "evidence": "Examples given (pd.read_exel), cited prior work [11, 12, 6], but not quantified in this paper.", + "supported": "moderate" + }, + { + "claim": "Constrained decoding methods (PICARD, Synchromesh) fail to catch semantic errors because they only enforce syntactic validity.", + "evidence": "Discussed in §1 and §5; no empirical comparison provided.", + "supported": "weak" + }, + { + "claim": "A deterministic static-analysis framework using AST parsing and library introspection can detect KCHs with 100% precision (zero false positives).", + "evidence": "Table 1: 141 TP, 0 FP out of 200 samples, achieving 100% precision.", + "supported": "strong" + }, + { + "claim": "The framework achieves 87.6% recall in KCH detection, identifying 141 of 161 hallucinated samples.", + "evidence": "Table 1: 141 TP, 20 FN, F1=0.934.", + "supported": "strong" + }, + { + "claim": "The framework can automatically correct 77% of detected hallucinations, producing functionally correct code.", + "evidence": "Table 2: 124 of 161 detected hallucinations corrected. However, no validation method is described (code is not executed; no ground truth comparison stated).", + "supported": "weak" + }, + { + "claim": "Performance varies significantly by hallucination type: Missing Imports (97.9% detect, 97.9% correct), Mis-typed APIs (84.5% detect, 70.0% correct), Contextual Mismatches (33.3% detect, 0% correct).", + "evidence": "Table 3 provides detailed breakdown by type.", + "supported": "strong" + }, + { + "claim": "The deterministic approach is computationally efficient, analyzing all 200 samples in under 0.2 seconds on a laptop CPU.", + "evidence": "Stated in §2.5 and §4.", + "supported": "strong" + }, + { + "claim": "The framework is a viable alternative to non-deterministic LLM-in-the-loop repair.", + "evidence": "Discussed in §1 and §4 as discussion point, but not empirically compared.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "A deterministic, static-analysis framework leveraging Abstract Syntax Trees and library introspection via dynamic knowledge base construction can detect Knowledge Conflicting Hallucinations (KCHs) in LLM-generated Python code with 100% precision and 87.6% recall (F1=0.934), automatically correcting 77% of identified errors. Performance varies by error type: Missing Imports are highly recoverable (97.9% detect, 97.9% correct), Mis-typed APIs moderate (84.5% detect, 70.0% correct), and Contextual Mismatches poorly handled (33.3% detect, 0% correct), suggesting that semantic-intent errors remain intractable for simple string-matching approaches. The framework runs efficiently in <0.2 seconds for 200 samples, but evaluation is limited to 200 manually-curated samples across five Python libraries, raising questions about real-world prevalence and generalizability.", + "red_flags": [ + { + "flag": "No empirical baseline comparison", + "detail": "Claims superiority over PICARD, Synchromesh, LLM-in-the-loop repair, and mypy but provides no direct experimental comparison. Comparisons are only qualitative discussion." + }, + { + "flag": "Small, manually curated dataset may not reflect real-world error distribution", + "detail": "200 samples (161 hallucinated, 39 clean) is acknowledged as potentially biased. Authors note 'error distribution may not reflect real-world prevalence'." + }, + { + "flag": "Correction verification method not stated", + "detail": "Paper claims 'fix accuracy' by measuring 'functionally correct, runnable code' but the approach is explicitly non-executing. No ground truth comparison, human review, or execution validation described." + }, + { + "flag": "Limited generalizability", + "detail": "Evaluation restricted to Python; Knowledge Base limited to five libraries (numpy, pandas, requests, matplotlib, json). Claim of generalizability to Java/TypeScript is speculative." + }, + { + "flag": "No confidence intervals or statistical testing", + "detail": "Single point estimates for precision, recall, F1 without uncertainty quantification, confidence intervals, or cross-validation." + }, + { + "flag": "GPT-5 dataset generation not reproducible", + "detail": "Prompts, model version (snapshot date), temperature, and hyperparameters for GPT-5 not provided. Cannot regenerate the evaluation dataset independently." + }, + { + "flag": "Contextual Mismatches nearly undetectable", + "detail": "Only 3 samples (1.5% of dataset); 33.3% detection, 0% correction. This critical category is under-represented and handled poorly." + }, + { + "flag": "Pandas performance significantly lower", + "detail": "Pandas achieves only 56.2% correction accuracy vs 93.8% for numpy and 93.9% for requests, but no analysis of why or how to improve." + }, + { + "flag": "No human evaluation of corrections", + "detail": "Corrected code samples not reviewed by developers or automated validators to confirm functional correctness." + }, + { + "flag": "Missing environment and reproduction specifications", + "detail": "No requirements.txt, Dockerfile, or dependency versions provided. Replication package may exist but paper itself lacks these details." + } + ], + "cited_papers": [ + { + "title": "Exploring and Evaluating Hallucinations in LLM-Powered Code Generation", + "authors": "Liu, F., Liu, Y., Shi, L., et al.", + "arxiv_id": "2404.00971", + "year": 2024, + "relevance": "Defines KCH (Knowledge Conflicting Hallucinations) taxonomy and benchmarks; foundational reference for this paper's problem statement." + }, + { + "title": "Bugs in Large Language Models Generated Code: An Empirical Study", + "authors": "Tambon, F., Moradi Dakhel, A., et al.", + "arxiv_id": "2403.08937", + "year": 2024, + "relevance": "Early taxonomy of LLM code generation bugs; establishes prevalence of hallucinations in the field." + }, + { + "title": "Hallucination by Code Generation LLMs: Taxonomy, Benchmarks, Mitigation, and Challenges", + "authors": "Lee, Y., Song, J. Y., Kim, D., et al.", + "arxiv_id": "2504.20799", + "year": 2025, + "relevance": "Comprehensive survey of hallucination types and mitigation strategies; directly relevant to positioning this work." + }, + { + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "authors": "Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M.", + "arxiv_id": "2302.06590", + "year": 2023, + "relevance": "Establishes productivity gains from LLM code generation; motivates the need for hallucination mitigation." + }, + { + "title": "Synchromesh: Reliable code generation from pre-trained language models", + "authors": "Poesia, G., Polozov, O., Le, V., et al.", + "arxiv_id": "2201.11227", + "year": 2022, + "relevance": "Constrained decoding approach for code generation; example of prevention strategy that misses semantic errors." + }, + { + "title": "PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models", + "authors": "Scholak, T., Schucher, N., Bahdanau, D.", + "year": 2021, + "relevance": "Foundational constrained decoding method for grammar enforcement; shown to miss KCHs." + }, + { + "title": "Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness", + "authors": "Blyth, S., Licorish, S. A., Treude, C., Wagner, M.", + "arxiv_id": "2508.14419", + "year": 2025, + "relevance": "LLM-in-the-loop repair strategy; represents non-deterministic approach this paper positions against." + }, + { + "title": "Cutting the Root of Hallucination: Structural Trimming for Vulnerability Mitigation in Code LLMs", + "authors": "Zhang, Y.", + "year": 2025, + "relevance": "AST-based deletion approach for safety; represents deletion-based mitigation that this paper extends toward correction." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Tool could be integrated into IDEs for real-time code-generation validation, directly useful for practitioners, but limited scope (5 libraries, single-file) reduces immediate applicability." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Using static analysis for code correctness is well-established (mypy, linters); applying it post-hoc to LLM hallucinations is incremental rather than conceptually novel." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses a real safety concern (LLM-generated code causing runtime failures), positioning deterministic checking as a trust-building mechanism for AI-assisted development." + }, + "drama_conflict": { + "score": 0, + "justification": "Straightforward technical paper with no controversy, competing claims, or dramatic narrative elements." + }, + "demo_ability": { + "score": 1, + "justification": "Code is open-source on GitHub and can be demoed locally, but requires setup (library introspection); not immediately web-demoable or friction-free." + }, + "brand_recognition": { + "score": 1, + "justification": "William & Mary's SEMERU Lab is respected in software engineering research but not a top-tier AI lab; lead author Dipin Khati not widely known in the field." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "46885582", + "title": "Who's in Charge? Disempowerment Patterns in Real-World LLM Usage", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46885582", + "created_at": "2026-02-04T13:28:17Z" + }, + { + "hn_id": "47119379", + "title": "Who's in Charge? Disempowerment Patterns in Real-World LLM Usage", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=47119379", + "created_at": "2026-02-23T08:01:55Z" + }, + { + "hn_id": "46811142", + "title": "Anthropic: Who's in Charge? Disempowerment Patterns in Real-World LLM Usage", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=46811142", + "created_at": "2026-01-29T15:04:00Z" + }, + { + "hn_id": "47477667", + "title": "TinyTorch: Building Machine Learning Systems from First Principles", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47477667", + "created_at": "2026-03-22T14:03:42Z" + } + ], + "top_points": 3, + "total_points": 9, + "total_comments": 2 + } +} +\ No newline at end of file diff --git a/papers/detecting-proxy-gaming-2025/scan-v5.json b/papers/detecting-proxy-gaming-2025/scan-v5.json @@ -0,0 +1,519 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detecting Proxy Gaming in RL and LLM Alignment via Evaluator Stress Tests", + "authors": [ + "Ibne Farabi Shihab", + "Sanjeda Akter", + "Anuj Sharma" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2507.05619", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 78.4%/81.7% precision/recall (RL), 74.2%/78.6% (LLM), 8.3-point win-rate improvement, and 54.6% hacking reduction are all supported by detailed experimental tables (Tables 1, 7, 9, 11).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about mitigation effectiveness are supported by ablation studies and control experiments (Table 30) showing that extra compute alone yields only +2.1% vs. +8.3% for detector-triggered intervention.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Limitations section explicitly bounds generalization to '4 tasks and 2 model sizes' and acknowledges that 'real-world deployment would face additional challenges,' appropriately scoping the claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Control experiments (Table 30) rule out extra compute and filtering as explanations for win-rate gains; Appendix N discusses false positive patterns including beneficial exploration misclassified as hacking.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper is explicitly about this distinction — judge scores (proxy) vs. human preferences (true objective) — and carefully measures both throughout; Table 2 shows divergence between judge score and human rating on case studies.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 8 'Limitations' is a dedicated section, not a sentence in the conclusion, discussing scope constraints on tasks, model sizes, and deployment conditions.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Limitations specifically name '4 tasks and 2 model sizes,' fixed judges assumption, concept drift, multi-stakeholder objective conflicts, and adversarial adaptation over longer horizons — not generic disclaimers.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states 'mitigation results represent controlled experimental conditions' and that 'larger-scale validation across more diverse domains and model architectures would strengthen generalizability claims.'", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment appears anywhere in the paper; the Acknowledgments section only mentions AI writing tools, not any funding source.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All three authors' affiliations with Iowa State University departments (CS and Civil/Construction/Environmental Engineering) are disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No external funder is disclosed, so this criterion is not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial disclosure appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms including 'proxy gaming,' 'exploitable sensitivity,' 'content sensitivity,' and the formal G(y) statistic are precisely defined with mathematical notation in Section 3.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper clearly states its contribution as a unified framework (EST) for detecting proxy gaming in both RL and LLM alignment, with validated benchmarks for both domains.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 substantively engages with RLHF, DPO, reward hacking detection, and LLM-as-judge evaluation literature, explicitly positioning EST's contributions relative to prior approaches' scalability and principled-framework limitations.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository URL is provided. The paper claims benchmark data release but does not mention releasing the detection framework implementation code.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The paper claims 'We release benchmarks for both domains' but provides no URL, DOI, or repository location where the 2,156 RL episodes or 1,200 LLM instances can be accessed.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only the GPU model (NVIDIA A6000 48GB VRAM) is mentioned; no requirements.txt, Dockerfile, or dependency specifications are provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the paper describes the methodology but not how to replicate the experiments end-to-end.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Tables 7 and 11 report values with ± notation; Table 11 explicitly states '95% CI across 5-fold cross-validation' (e.g., Precision: 0.784±0.027).", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Table 12 reports p-values for all factorial design effects (e.g., Objective Alignment p < 0.001); inter-rater reliability uses Cohen's κ and Fleiss' κ throughout.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Table 12 reports Cohen's d for all experimental factors (e.g., Objective Alignment: Cohen's d = 2.08); the paper explicitly notes these are unusually large due to custom environment design.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample sizes (2,156 RL episodes, 1,200 LLM instances) are described but not formally justified with power analysis or sample size calculations.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Tables 7, 11, 18, and 29 consistently report ± standard deviation values alongside mean performance metrics.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Extensive baseline comparisons in Tables 3, 16, and 29 include length-only, format features, KL regularization, judge ensembling, LSTM-Autoencoder, One-Class SVM, Isolation Forest, reward model ensemble disagreement, and probe-based detection.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include contemporary methods like reward model ensemble disagreement (F1 0.687), probe-based detection, hardened judges, and established ML anomaly detection methods — not outdated or trivially weak.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Ablation studies in Tables 17 and 29 systematically remove individual detection components (EST, correlation tracking, reasoning validity, format perturbation, content perturbation) measuring each contribution.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Results reported across precision, recall, F1, AUC-ROC, early warning latency (checkpoints), computational overhead %, human win-rate, and judge-human correlation.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Human evaluation is central: 1,200 human-annotated LLM gaming instances with 3 raters achieving Fleiss' κ ≥ 0.78, and 2,156 expert-annotated RL episodes with Cohen's κ = 0.847.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Section 4.1 states 'strict train-validation-test splits, holding out entire task-model-judge combinations for testing'; RL uses environment-stratified splits with 5-fold cross-validation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 11 provides per-category RL detection performance (6 hacking categories); Table 1 provides per-task, per-model-size, and per-judge breakdowns for all 32 LLM conditions.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix N provides dedicated error and boundary case analysis, manually examining 100 classification errors (50 false positives, 50 false negatives) with representative qualitative examples.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Adaptive evasion tests (Table 24) show precision dropping from 74.2% to 65.9% under white-box attacks; zero-shot cross-environment transfer shows 10-15 F1 point degradation (Table 32).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "GPT-4 is used as a judge without specifying version (e.g., gpt-4-0314, gpt-4-turbo) or snapshot date; Llama-3-8B/70B are named but no model card versions are pinned.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No actual prompts or system instructions are provided for LLM fine-tuning, judge evaluation, or the EST perturbation generation steps.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Key hyperparameters are reported: detection threshold τ=0.6, τspec=0.3, correlation threshold ∆ρ=0.5, contamination γ=0.1, window size W=50, format penalty 20%, mitigation threshold α=0.1.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The online detection pipeline is described in detail including Algorithm 1 for detector-triggered mitigation, the 6-detector ensemble structure with complexity analysis (Appendix D), and per-checkpoint monitoring protocol (Table 14).", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Transformation validity thresholds (cosine similarity >0.85, NLI entailment >0.7), number of perturbations (5 per type), token-count control (±5%), and semantic validity audit procedures are documented with specific thresholds.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The paper claims to release benchmarks but provides no URL or repository link where the raw data can be accessed for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described: 3 human annotators with consensus ≥2/3 for LLM instances; RL expert annotation with κ = 0.847 across 15 environments, 10 random seeds, and environment-stratified splits.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Annotators are mentioned ('3 human annotators,' '3 human raters') but no recruitment procedure, qualification criteria, or platform is described; the ethics statement mentions only informed consent and compensation.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from LLM fine-tuning checkpoints → output sampling → perturbation generation → validity audit → detection scoring is described with algorithmic detail in Section 3 and Appendix D.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training cutoffs for Llama-3-8B, Llama-3-70B, and GPT-4 are not stated; this matters since tasks include TL;DR summarization (Reddit data) that could overlap with pretraining corpora.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of potential overlap between TL;DR summarization or other evaluation task data and the pretraining data of Llama-3 or GPT-4.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "LLM tasks include TL;DR summarization (Reddit-based) which could be present in Llama-3's pretraining data, affecting gaming behavior; this potential contamination is not addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects experiment requiring pre-registration; annotation workers are used for labeling, not as research participants in an experimental study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "Human annotation is conducted (with informed consent and compensation noted in ethics statement), but this is annotation labor rather than a human subjects experiment requiring IRB review.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participant subjects; annotation workers are service providers and their demographics are not applicable to the study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human subjects study with inclusion/exclusion criteria for research participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human subjects randomized experiment.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human subjects blinded experiment.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participant attrition applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Computational overhead is reported as 2.1% for LLM and 4.2% for RL; Table 31 provides absolute GPU-hours for each mitigation technique (e.g., 0.47 GPU-hrs for combined approach on NVIDIA A6000).", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Per-technique overhead is reported but the total compute budget for the full experimental suite (32 LLM conditions × fine-tuning runs, 15 RL environments × 10 seeds) is not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "EST achieves 78.4% precision and 81.7% recall for RL reward hacking detection across 15 environments and 5 algorithms", + "evidence": "Table 11 reports these figures on 2,156 expert-annotated episodes with 5-fold cross-validation and 95% confidence intervals (0.784±0.027, 0.817±0.023); Cohen's κ = 0.847 inter-rater agreement", + "supported": "strong" + }, + { + "claim": "EST achieves 74.2% precision and 78.6% recall for LLM evaluator gaming detection across 32 experimental conditions", + "evidence": "Tables 1 and 7 report these figures on 1,200 human-annotated instances with Fleiss' κ ≥ 0.78 inter-rater agreement across 4 tasks", + "supported": "strong" + }, + { + "claim": "Closed-loop EST-triggered mitigation improves human win-rate by 8.3 points (52.1% → 60.4%) for LLM fine-tuning", + "evidence": "Table 9 shows win-rate improvement; Table 30 control experiments show extra compute alone yields only +2.1%, ruling out compute as confound", + "supported": "strong" + }, + { + "claim": "Closed-loop mitigation reduces RL reward hacking by 54.6% with 9.1% performance impact", + "evidence": "Table 31 reports combined approach achieves 54.6% hacking reduction; Pareto frontier analysis in Figure 10 shows this is the best available trade-off", + "supported": "moderate" + }, + { + "claim": "Proxy-true correlation tracking transfers directly between RL and LLM domains without modification", + "evidence": "Table 8 defines direct transfer as ≥90% in-domain performance; correlation tracking achieves AUC 0.821 (RL) and 0.798 (LLM) without modification", + "supported": "moderate" + }, + { + "claim": "EST provides early warning with median lead time of 3 checkpoints before human-noticeable quality decline", + "evidence": "Figure 2 and Table 7 report 3.0±0.4 checkpoint lead time, defined as checkpoints between detection trigger and human win-rate dropping below 0.50", + "supported": "moderate" + }, + { + "claim": "EST outperforms all baselines including reward model ensemble disagreement (F1 0.734 vs 0.687)", + "evidence": "Tables 3 and 16 compare against 9+ baselines; EST achieves F1 0.734 vs. next-best standalone of 0.694 (correlation tracking) and 0.700 (hardened judge)", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "EST detects proxy gaming through invariance-based stress tests that separate exploitable format sensitivity from content-driven improvements, achieving 78.4%/81.7% precision/recall for RL hacking and 74.2%/78.6% for LLM evaluator gaming across 32 experimental conditions. Closed-loop mitigation triggered by EST improves human win-rates by 8.3 points and reduces RL hacking by 54.6%, with control experiments ruling out compute as the explanation. Cross-domain analysis reveals that correlation tracking and ensemble voting transfer directly between RL and LLM domains while perturbation design requires adaptation. The framework operates online with low overhead (2.1% LLM, 4.2% RL) and provides 3-checkpoint early warning before human-noticeable quality decline, enabling proactive intervention during fine-tuning.", + "red_flags": [ + { + "flag": "No code or data URL provided", + "detail": "Despite claiming to release benchmarks (2,156 RL episodes, 1,200 LLM instances), no repository URL, DOI, or access location is provided anywhere in the paper." + }, + { + "flag": "GPT-4 version unspecified", + "detail": "GPT-4 is used as a judge evaluator throughout without specifying version (gpt-4-0314, gpt-4-turbo, etc.) or snapshot date, making exact replication impossible." + }, + { + "flag": "Circular RL ground truth for most episodes", + "detail": "For 13,091 of 15,247 RL episodes (86%), ground truth is established via detector consensus (3+ of 6 detectors agree) rather than human annotation — the authors acknowledge this circularity but use these episodes for pattern analysis and prevalence claims." + }, + { + "flag": "Custom environments maximize contrast", + "detail": "The paper acknowledges 'unusually large effect sizes reflect our custom environments designed to maximize experimental contrast'; real-world effect sizes are acknowledged to be smaller (Cohen's d 0.8-1.2 vs. reported 1.24-2.08)." + }, + { + "flag": "Self-citation for key supporting claims", + "detail": "Two key supporting citations (Shihab et al., 2025a on entropy regularization and Shihab et al., 2025b on reward function structure) are by the same authors and are arXiv preprints, not peer-reviewed work." + }, + { + "flag": "No prompts provided for LLM experiments", + "detail": "The paper does not provide actual prompts used for LLM fine-tuning, judge evaluation rubrics, or the perturbation generation process, significantly limiting reproducibility." + } + ], + "cited_papers": [ + { + "title": "Defining and Characterizing Reward Hacking (Skalse et al., 2022)", + "relevance": "Provides the formal definition of proxy gaming used throughout the paper and theoretical grounding for EST's unhackability criterion" + }, + { + "title": "Training Language Models to Follow Instructions with Human Feedback (Ouyang et al., 2022)", + "relevance": "Foundational RLHF work that introduces the training paradigm EST is designed to monitor for gaming" + }, + { + "title": "Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)", + "relevance": "One of the two training methods (DPO) evaluated in the LLM experiments" + }, + { + "title": "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., 2023)", + "relevance": "Establishes LLM-as-judge evaluation paradigm that the paper identifies as vulnerable to evaluator gaming" + }, + { + "title": "Scaling Laws for Reward Model Overoptimization (Gao et al., 2023)", + "relevance": "Directly relevant to proxy-true divergence measurement; shows reward model scores diverge from human preferences with overoptimization" + }, + { + "title": "Specification Gaming: The Flip Side of AI Ingenuity (Krakovna et al., 2020)", + "relevance": "Provides taxonomy of reward hacking behaviors that grounds the RL component of EST and defines specification gaming" + }, + { + "title": "Concrete Problems in AI Safety (Amodei et al., 2016)", + "relevance": "Foundational safety paper establishing reward hacking as a core alignment challenge that motivates the EST framework" + }, + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)", + "relevance": "Related to the reasoning validity detector component of EST that checks for valid reasoning chains during fine-tuning" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly applicable to anyone training LLMs with RLHF or LLM-as-judge evaluation pipelines — an increasingly important and widespread concern in AI development" + }, + "surprise_contrarian": { + "score": 1, + "justification": "The existence of proxy gaming is well-established; the invariance-based unified framework across domains is novel but not surprising in its conclusions" + }, + "fear_safety": { + "score": 2, + "justification": "Addresses AI alignment safety concerns about reward hacking and evaluator gaming, with concrete detection and mitigation methods for recognized risks in deployed RLHF systems" + }, + "drama_conflict": { + "score": 1, + "justification": "No major controversy; the paper confirms known problems and proposes systematic solutions rather than challenging dominant narratives" + }, + "demo_ability": { + "score": 1, + "justification": "Claims benchmarks are released but provides no URL; cannot easily reproduce or demo the framework without code release" + }, + "brand_recognition": { + "score": 0, + "justification": "Iowa State University authors without prior high-profile publications; no famous lab or product association" + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "38720557", + "title": "ReLoRA: High-Rank Training Through Low-Rank Updates", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38720557" + }, + { + "hn_id": "41035192", + "title": "The Limitations of Compute Thresholds as a Governance Strategy", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41035192" + } + ], + "top_points": 3, + "total_points": 4, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/detecting-silent-failures-2025/scan-v5.json b/papers/detecting-silent-failures-2025/scan-v5.json @@ -0,0 +1,442 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "Detecting Silent Failures in Multi-Agentic AI Trajectories", + "authors": [ + "Divya Pathak", + "Harshit Kumar", + "Anuska Roy", + "Felix George", + "Mudit Verma" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2511.04032", + "doi": "10.48550/arXiv.2511.04032" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims about non-deterministic failures (drift, cycles, missing details) are defined in Table 1. Dataset sizes (4,275 and 894) and accuracy ranges (98%, 96%) match Table 2 results exactly.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper claims path-level features play a 'critical role in anomaly detection' based on SHAP feature importance, but feature importance is correlational, not causal. No ablation studies justify causal role claims.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Title claims 'Detecting Silent Failures in Multi-Agentic AI Trajectories' broadly, and paper claims 'first systematic study of anomaly detection in Multi-Agentic AI systems,' but evaluation is limited to 2 specific systems with 4,275 and 894 traces. No cross-system validation.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Paper acknowledges misclassifications 'are likely due to ambiguous traces where even humans disagree' (inter-annotator agreement 80.6%) and discusses why subtle drift without errors is harder to detect (Insights 1-3, Figure 2).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Paper defines 5 silent failure types (drift, cycles, missing details, tool failures, context propagation) but only labels 3 (drift, cycles, errors). Measurement scope is narrower than claimed construct. Missing details and tool failures are excluded.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated Limitations section. Section 4 (Conclusions and Future Plans) mentions false negatives and future work but lacks systematic threat analysis.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Paper identifies 'subtle drift anomalies... closer resemblance to normal behavior' as a detection challenge but does not discuss study design threats: evaluation on only 2 systems, inter-annotator disagreement (19.4% on Research Writing), fixed LLM versions, or class imbalance (42-68% anomalies).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper claims 'first systematic study' but does not state boundaries: only 2 systems tested, only 3 of 5 failure types labeled, only trace-level features used, no cross-system validation. Generalizability to 'Multi-Agentic AI systems' broadly is undemonstrated.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source disclosed. Paper has no funding statement or acknowledgments section visible.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations listed (IBM Research, IIIT Bangalore). However, no statement addressing whether authors have conflicts with the evaluated agentic systems.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder identified.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided. No declaration of patents, equity, or consulting interests.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined: 'Multi-Agentic AI systems' (tools + LLMs + prompts), 'silent failures' (Table 1: 5 types), 'agentic trajectories/traces' (execution workflow), 'anomaly' (binary classification of failure presence).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions stated in abstract and introduction: (1) Dataset Curation Pipeline (Section 2), (2) Benchmarking Anomaly Detection Methods (Section 3), (3) Detailed Error Analysis and Insights (Section 3).", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "Prior work mentioned scattered in introduction ('extensively explored in microservices and networks', 'He et al. [2025] offers limited evaluation') but no dedicated Related Work section. No systematic comparison showing how this work builds on or differs from existing benchmarks.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "Paper does not argue why extracted features (16 token/latency/path/prompt/model features) measure 'silent failures.' Construct validity is asserted post-hoc through feature importance analysis, not established via theoretical or empirical argument.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "No characterization of item difficulty. Figure 2 shows Stock Market has overlapping clusters (harder) and Research Writing has separation (easier), but individual anomaly types (drift vs cycles vs errors) are not characterized by difficulty or discriminability.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": true, + "justification": "No ceiling effect: best model (XGBoost) achieves 98.03% and 94.81%, not 99%+. No floor effect: worst model (K-Means) achieves 85.33% and 82.96%, not <10%. However, paper does not explicitly discuss these thresholds.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "Inter-annotator agreement reported (97.6% Stock Market, 80.6% Research Writing) but this measures consistency, not accuracy. Paper does not provide human accuracy against ground truth or task completion rate.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "Metrics (accuracy, macro-F1, precision, recall) used without justification. Why macro-F1 instead of weighted-F1 given class imbalance (42% vs 68%)? Why accuracy over AUROC? Labeling rubric (anomaly if drift OR cycles OR errors) not justified—why exclude missing details and tool failures?", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "Simple 70-15-15 split mentioned but no stratification by prompt, anomaly type, or system discussed. Same 525 prompts (Stock Market) are split, risking prompt-level leakage between train/test. No measures described to prevent learning prompt-specific patterns.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "Dataset uses fixed LLM versions (gpt-4o, granite-3-1-8B, llama-3-3-70B). No discussion of temporal robustness: what happens when these models are updated or deprecated? No plan for benchmark versioning or maintenance.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "Paper discusses failure modes of detection methods (false negatives on subtle drift) but not failure modes of the benchmark itself. Does not address that only 3/5 failure types are labeled, or discuss what anomaly types could evade detection.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "Benchmarks standard ML models (XGBoost, Random Forest, SVDD, etc.) but paper states 'dataset and curation pipeline will be released after paper acceptance'—no code, no baseline implementations available for reproducibility.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "Provides source description (two systems), collection methodology (Section 2.1: OpenTelemetry instrumentation, prompt/LLM/system prompt variation), feature extraction (Section 2.2: 16 features). Missing: data card, privacy statement, annotated examples, version metadata.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "States 'will be released after paper acceptance in accordance with organizational policies.' Vague timeline, no license specified, no clear access terms. When is 'after acceptance'? What are organizational policies?", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "Benchmark is for 'anomaly detection in agentic trajectories' but intended use is not specified. Should this only be used for Stock Market/Research Writing architectures? Can models trained here be deployed in production? No guidance on appropriate use.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Multi-agentic AI systems are inherently non-deterministic and prone to silent failures (drift, cycles, missing details)", + "evidence": "Table 1 defines 5 failure scenarios; abstract and introduction discuss non-determinism due to LLM variation and system prompt differences", + "supported": "moderate" + }, + { + "claim": "XGBoost achieves 98.03% accuracy on Stock Market dataset and 94.81% on Research Writing dataset", + "evidence": "Table 2 reports XGBoost results directly", + "supported": "strong" + }, + { + "claim": "SVDD (semi-supervised) achieves 96.47% and 89.63% accuracy on the two datasets, showing semi-supervised methods are practical alternatives", + "evidence": "Table 2 reports SVDD results", + "supported": "strong" + }, + { + "claim": "Path-level features (tool count, total steps, unique steps, agent count) are the most important for anomaly detection", + "evidence": "SHAP feature importance analysis in Section 3.3; identified as 'consistently ranked highest'", + "supported": "moderate" + }, + { + "claim": "Model performance exceeds inter-annotator agreement, suggesting misclassifications are due to ambiguous traces", + "evidence": "XGBoost 98.03% > Cohen's kappa 97.6% on Stock Market; paper states 'misclassifications likely due to ambiguous traces'", + "supported": "weak" + }, + { + "claim": "Subtle drift without explicit cycles or errors is harder to detect than drift with errors", + "evidence": "Error analysis (Insight 2, 3): false negatives cluster on 'shorter, drifted paths' without explicit failures; t-SNE visualization (Figure 2) shows these overlap normal traces", + "supported": "moderate" + }, + { + "claim": "The dataset curation pipeline is generalizable and 'can be readily extended to other Agentic AI systems'", + "evidence": "Section 2 describes a pipeline applicable to any agentic system; paper states extensibility but does not demonstrate it", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-creation", + "benchmark-eval", + "empirical" + ], + "key_findings": "Multi-agentic AI systems frequently suffer silent failures (drift, cycles, missing details) that lack explicit error signals. A dataset curation pipeline using OpenTelemetry traces was applied to two systems (Stock Market 4,275 traces, Research Writing 894 traces) to create labeled benchmark datasets. Supervised (XGBoost) and semi-supervised (SVDD) methods achieved competitive accuracies (98%/94.8% and 96%/89.6% respectively), with path-level features (tool count, step count) being most predictive. However, subtle drift anomalies without explicit errors remain difficult to detect, with false negatives showing feature values similar to normal traces.", + "red_flags": [ + { + "flag": "Limited generalizability", + "detail": "Only 2 systems evaluated despite claiming 'first systematic study' of multi-agentic anomaly detection. No cross-system validation or evidence that benchmark generalizes beyond Stock Market and Research Writing architectures." + }, + { + "flag": "Incomplete failure coverage", + "detail": "Only 3 of 5 defined failure types labeled (drift, cycles, errors). Missing details and tool failures are defined but unmeasured, making dataset incomplete for claimed construct." + }, + { + "flag": "No temporal robustness plan", + "detail": "Fixed LLM versions (gpt-4o, granite-3-1-8B, llama-3-3-70B). No versioning strategy or maintenance plan discussed. Benchmark may become obsolete when LLM versions change." + }, + { + "flag": "Class imbalance unaddressed", + "detail": "Stock Market 42% anomalies, Research Writing 68% anomalies. No discussion of whether train-test split is stratified or whether imbalance affects metric interpretation." + }, + { + "flag": "Unfair baseline comparison", + "justify": "Claims models outperform 97.6% inter-annotator agreement, but inter-annotator agreement measures consistency, not accuracy against ground truth. These are different metrics and should not be directly compared." + }, + { + "flag": "Feature engineering not justified", + "detail": "16 features extracted from traces but no ablation studies or justification for feature selection. Features are domain-specific and may not transfer to other agentic systems." + }, + { + "flag": "No external validation", + "detail": "70-15-15 split on same 2 systems with same prompt set. No holdout test set from different time period, system architecture, or LLM version to validate generalization." + }, + { + "flag": "Vague dataset release commitment", + "detail": "States 'will be released after paper acceptance in accordance with organizational policies.' No timeline, license, or reproducibility guarantee. Datasets not currently available." + }, + { + "flag": "Limited error analysis", + "detail": "Error analysis only on false negatives; only compares mean feature values. No breakdown of which anomaly types (drift vs cycles vs errors) each model misses or confusion matrix by type." + }, + { + "flag": "No construct validity argument", + "detail": "Does not explain why extracted features (tokens, latency, path) measure 'silent failures.' Construct validity asserted post-hoc via feature importance rather than established theoretically." + } + ], + "cited_papers": [ + { + "title": "Why do multi-agent llm systems fail?", + "authors": "Cemri et al.", + "year": 2025, + "relevance": "Directly addresses failures in multi-agent systems; foundational for understanding failure categories" + }, + { + "title": "AI agent reliability strategies that stop ai failures before they start", + "authors": "Bronsdon", + "year": 2025, + "relevance": "Discusses reliability and failure prevention in agentic systems; motivates anomaly detection need" + }, + { + "title": "Multi-agent risks from advanced ai", + "authors": "Hammond et al.", + "year": 2025, + "relevance": "Comprehensive analysis of failure modes and risks in multi-agent systems" + }, + { + "title": "SentinelAgent: Graph-based anomaly detection in multi-agent systems", + "authors": "He et al.", + "year": 2025, + "relevance": "Related benchmark/method for agentic anomaly detection; paper notes it 'offers limited evaluation'" + }, + { + "title": "Unsupervised microservice system anomaly detection via contrastive multi-modal representation clustering", + "authors": "Zhang et al.", + "year": 2024, + "relevance": "Transfer of anomaly detection methods from microservices domain to agentic systems" + }, + { + "title": "Deep Attentive Anomaly Detection for Microservice Systems with Multimodal Time-Series Data", + "authors": "Chen et al.", + "year": 2023, + "relevance": "Multimodal anomaly detection in distributed systems; applicable to agentic traces" + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "authors": "Yao et al.", + "year": 2022, + "relevance": "Establishes ReAct prompting pattern used in 'good' and 'strict' system prompts for controlled variation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Useful for practitioners monitoring agentic system failures, but applicability limited to Stock Market and Research Writing architectures. Uncertain generalization to other system types." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Multi-agentic system failures and silent bugs are expected problems; that XGBoost outperforms SVDD is unsurprising. No counterintuitive findings or surprising insights provided." + }, + "fear_safety": { + "score": 2, + "justification": "Silent failures in agentic systems raise deployment risk (agent diverges from intended behavior undetected), but paper does not deeply engage with safety implications or mitigation strategies." + }, + "drama_conflict": { + "score": 0, + "justification": "Technical benchmark paper with no controversial claims or stakeholder conflict." + }, + "demo_ability": { + "score": 1, + "justification": "Datasets promised 'after paper acceptance' but not currently available. No code released. Difficult for readers to reproduce or build on immediately." + }, + "brand_recognition": { + "score": 2, + "justification": "IBM Research and IIIT Bangalore affiliations carry some credibility, but not a marquee AI lab (OpenAI, DeepMind, Anthropic). Limited brand lift for audience engagement." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42158451", + "title": "Convolutional Differentiable Logic Gate Networks", + "points": 26, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=42158451", + "created_at": "2024-11-16T19:10:54Z" + }, + { + "hn_id": "39967245", + "title": "Formal Aspects of Language Modeling", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39967245", + "created_at": "2024-04-08T07:47:56Z" + }, + { + "hn_id": "42115169", + "title": "Convolutional Differentiable Logic Gate Networks", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42115169", + "created_at": "2024-11-12T13:04:29Z" + }, + { + "hn_id": "34101211", + "title": "Will we run out of data?", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=34101211", + "created_at": "2022-12-23T01:17:13Z" + }, + { + "hn_id": "42258010", + "title": "Gradient Boosting Trees and LLMs for Tabular Data Few-Shot Learning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42258010", + "created_at": "2024-11-27T17:46:47Z" + }, + { + "hn_id": "40939773", + "title": "Formal Aspects of Language Modeling", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40939773", + "created_at": "2024-07-11T19:30:45Z" + }, + { + "hn_id": "36985212", + "title": "Will we run out of data to train LLMs?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=36985212", + "created_at": "2023-08-03T12:53:23Z" + }, + { + "hn_id": "31731755", + "title": "How Developers and Managers Define and Trade Productivity for Quality [pdf]", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31731755", + "created_at": "2022-06-13T21:05:24Z" + }, + { + "hn_id": "31488587", + "title": "How Developers and Managers Define and Trade Productivity for Quality", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31488587", + "created_at": "2022-05-24T06:12:01Z" + }, + { + "hn_id": "29172253", + "title": "How Developers and Managers Define and Trade Productivity for Quality [pdf]", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=29172253", + "created_at": "2021-11-10T08:06:07Z" + } + ], + "top_points": 26, + "total_points": 48, + "total_comments": 4 + } +} +\ No newline at end of file diff --git a/papers/detecting-sleeper-agents-2025/scan-v5.json b/papers/detecting-sleeper-agents-2025/scan-v5.json @@ -0,0 +1,584 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis", + "authors": [ + "Shahin Zanbaghi", + "Ryan Rostampour", + "Farhan Abid", + "Salim Al Jarmakani" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2511.15992", + "doi": "10.48550/arXiv.2511.15992" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are directly supported by reported results: 92.5% accuracy (Table 1), 100% precision and 85% recall (Table 1), real-time operation (Section 4.5), and zero false positives (confusion matrix).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Paper makes no inappropriate causal claims about safety improvements or harm causation. It reports detection capability on a controlled backdoored model, which requires no causal inference.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Title claims applicability to 'Sleeper Agents in Large Language Models' generally, abstract claims 'first practical solution to LLM backdoor detection,' but evaluation is limited to one 8B model with one trigger type on 40 samples. The evidence does not support broad generalization claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of why semantic drift occurs mechanistically, whether other detection signals might work, or alternative interpretations of the results beyond their proposed explanation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper distinguishes between measurement (embedding distance, canary similarity) and claim (backdoor detection), and explains the mechanistic link: backdoors cause semantic deviation from safe baselines that can be detected via cosine similarity.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Dedicated limitations section (5.3) discusses small dataset (40 responses), 15% false negative rate, single backdoor type, model specificity, canary bypass vulnerability, and baseline collection overhead.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats stated: 'only 40 responses' (insufficient for generalization), '15% false negative rate (3/20) indicates some backdoors evade,' 'single trigger type may not generalize,' 'single 8B model limits generalization.' Not boilerplate.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Paper explicitly states what it does NOT show: generalization to complex backdoors (code vulnerability insertion), other model sizes (1B–70B), or sophisticated backdoors maintaining canary consistency while exhibiting malicious behavior.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding section or statement. Acknowledgments mention 'Professor Kalyani Selvarajah for guidance' and 'Cadenza Labs team for providing the model,' but no funding source is disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors' affiliation with 'School of Computer Science, University of Windsor' is disclosed. No undisclosed relationship with Cadenza Labs (external collaboration acknowledged but not a conflict).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding identified, so not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement. No patents, equity, or consulting relationships declared despite evaluating open-source work from Cadenza Labs.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined in context: 'sleeper agents' (backdoored LLMs appearing safe during training), 'semantic drift' (embedding distance from safe baseline), 'canary' (known-answer verification questions). Definitions are precise.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Explicitly stated: 'We present a novel dual-method detection system' and 'Our key contributions include: First practical real-time detection system for LLM sleeper agents.' Contribution is clearly a tool/system for detection.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages with prior work: explains how Hubinger et al. demonstrated backdoors persist through safety training, summarizes Yang et al. on agent backdoors, discusses related detection methods (mechanistic analysis, model reconstruction, anomaly detection), and identifies the gap this work fills.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code released at https://github.com/ShahinZa/COMP8700 under MIT license. Explicitly stated: 'Code is available at' and 'available in the repository under MIT license.'", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Paper explicitly states: 'All experimental code, datasets, and trained models are available in the repository under MIT license.'", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": true, + "justification": "Specific dependency versions listed: PyTorch 2.0+, Transformers 4.30+, Sentence-Transformers 2.2+, scikit-learn, and 'Google Colab with T4 GPU.' Versions are specific enough for reproducibility, though not in standard requirements.txt format.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Methodology is detailed (Section 3) but step-by-step reproduction commands are not provided in the paper. Paper describes what to do but not the exact sequence of commands to reproduce results.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Main results in Table 1 report single point estimates (92.5%, 100%, 85%) with no confidence intervals. Figure 2 shows std dev in distributions (σ=0.026, σ=0.106) but CIs not provided for primary metrics.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims made (combined 92.5% > semantic drift 85%) but no p-values or significance tests provided. With n=40 samples, statistical significance testing is important but absent.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Cohen's d = 5.28 reported for the separation between safe and backdoor responses (Figure 2 caption), indicating extremely large effect size. However, effect sizes for comparative claims (combined vs individual methods) are not reported.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No power analysis or justification for n=40. Authors themselves state in limitations: 'larger-scale validation with 1000+ samples would strengthen confidence in generalization.' Sample size is acknowledged as inadequate.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Variance reported in Figure 2 (safe: σ=0.026, backdoor: σ=0.106) and confusion matrix raw counts. Multiple metrics (accuracy, precision, recall, F1) reported rather than single averages.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": false, + "justification": "No comparison to external detection baselines (mechanistic interpretability, anomaly detection, other embedding methods). Only compares three variants of their own method (semantic drift alone, canary alone, combined).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": false, + "answer": false, + "justification": "No external baselines to evaluate.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Tests three configurations: semantic drift alone (85%), canary baseline alone (87.5%), combined method (92.5%). Clear ablation showing contribution of each component to final performance.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Table 1 reports accuracy, precision, recall, and F1-score for all three methods. Good coverage of metrics.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Automated detection task; human evaluation not applicable.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "Thresholds τdrift=0.9 and τcanary=0.94 are optimized via F1-score (Eq. 7), but paper does not explicitly state whether optimization used a separate validation set or the same 40-sample test set. Lack of clarity suggests potential overfitting to test data.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "No breakdown by prompt type, trigger condition, or other categories. Results in Table 1 are aggregate across all 40 samples.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Acknowledges three false negatives: 'The 15% false negative rate (3/20 missed backdoors) indicates some backdoor activations evade detection. Further analysis is needed...' However, analysis of WHY these specific cases failed is not provided.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Reports 15% false negative rate, 75% recall for canary-only method (weaker than combined), and acknowledges model-specificity and trigger-type limitations. Honest reporting of negative findings.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Model specified as 'official Cadenza-Labs dolphin-llama3-8B sleeper agent model' with 'float16 precision' and '—DEPLOYMENT— trigger.' Specific enough for reproduction from public source, though no version hash or download date provided.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Only descriptions provided: 'Prompts covering diverse topics: machine learning, photosynthesis, computer processing, internet architecture, aerodynamics.' Actual prompt text not included. Canary questions listed but main test prompts are not.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Reported: Sentence-BERT all-MiniLM-L6-v2 with 384-dimensional embeddings, thresholds τdrift=0.9 and τcanary=0.94, 4 baseline samples per canary question. Hyperparameters adequately specified.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding in this task (direct model evaluation, not agent workflow).", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing steps documented: embedding via Sentence-BERT, z-score normalization for drift (Eq. 3), cosine similarity computation. Described clearly though not extensive.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Paper explicitly states: 'All experimental code, datasets, and trained models are available in the repository.' Raw data commitment is stated.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Described: '40 responses across 5 test prompts, 20 safe-mode, 20 backdoor-mode, 4 samples per prompt.' Collection procedure is clear, though selection criteria for prompts and generation parameters (temperature, etc.) are not detailed.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; not applicable.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline documented: Section 3.3 describes baseline centroid construction, drift computation, z-score normalization, and thresholding. Section 3.4 describes canary design, baseline collection, and comparison. Pipeline is clear.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not evaluating model on capability benchmarks; testing detection on a known backdoored model. Training cutoff not relevant.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable for backdoor detection task.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Per-query timing provided: LLM inference 450ms, embedding 8ms, similarity computation 2ms, total <1s on T4 GPU. Inference cost clearly reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Overhead quantified: '<0.5% computational overhead' for 8B model detection. GPU specified (T4). Could be more detailed but sufficient for practicality assessment.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Semantic drift analysis using Sentence-BERT embeddings achieves 85% recall for detecting backdoored LLMs", + "evidence": "Table 1 reports 85% recall and 100% precision for semantic drift method alone; Figure 2 shows 17.3σ separation between safe (μ=0.12, σ=0.026) and backdoor (μ=0.58, σ=0.106) embedding distributions with Cohen's d=5.28", + "supported": "strong" + }, + { + "claim": "Combined dual-method detection (semantic drift + canary) achieves 92.5% accuracy with zero false positives on the Cadenza-Labs sleeper agent model", + "evidence": "Table 1 shows combined method: 92.5% accuracy, 100% precision, 85% recall; confusion matrix (Section 4.2.1) confirms FP=0, TN=17, TP=20, FN=3", + "supported": "strong" + }, + { + "claim": "Canary baseline comparison alone achieves 87.5% accuracy for backdoor detection", + "evidence": "Table 1 reports canary method: 87.5% accuracy, 100% precision, 75% recall; Section 4.4 shows safe-mode canary similarity (μ=0.98) vs backdoor-mode (μ=0.89) with threshold optimization at 0.94", + "supported": "strong" + }, + { + "claim": "Backdoored models exhibit measurable semantic deviation from safe baselines (mean drift 0.58 vs 0.12, 17.3σ separation)", + "evidence": "Figure 2 distributions show clear separation: safe responses clustered at low drift (mean=0.12, std=0.04) vs backdoor responses (mean=0.58, std=0.15) with only minimal overlap; Cohen's d=5.28 indicates extremely large effect size", + "supported": "strong" + }, + { + "claim": "Real-time backdoor detection is feasible with sub-second latency (<1 second per query)", + "evidence": "Section 4.5 provides per-query timing: LLM inference 450ms, embedding generation 8ms, similarity computation 2ms, total <1 second on T4 GPU", + "supported": "strong" + }, + { + "claim": "The detection method operates as a black-box system requiring no model modification or white-box access", + "evidence": "Section 1.3 and 3.2 describe approach as observing only input/output behavior without accessing model internals; methodology requires only text inputs and LLM outputs", + "supported": "strong" + }, + { + "claim": "15% of backdoor activations (3 out of 20) evade combined detection, indicating detection gaps", + "evidence": "Confusion matrix in Section 4.2.1 reports FN=3 (false negatives), yielding 85% recall and 15% false negative rate; acknowledged in limitations (5.3)", + "supported": "strong" + }, + { + "claim": "Detection achieves minimal computational overhead (<0.5%) compared to baseline LLM inference", + "evidence": "Section 5.2 states 'detection adds <0.5% computational overhead' for 8B model; timing breakdown in 4.5 shows detection operations add only 10ms to 450ms inference", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "The paper presents a dual-method detection system for identifying backdoored LLMs, achieving 92.5% accuracy with zero false positives and 85% recall on the Cadenza-Labs sleeper agent model through semantic drift analysis and canary verification. Embedding-based detection operates in real-time (<1 second per query) without model modification, demonstrating that backdoor-induced semantic deviations are reliably measurable through cosine similarity in Sentence-BERT embedding space. However, the evaluation is limited to a single 8B-parameter model with one trigger type across only 40 samples, and 15% of backdoor activations still evade detection, indicating the approach requires scaling and refinement for practical deployment.", + "red_flags": [ + { + "flag": "Evaluation scope severely limited", + "detail": "Testing on only one model (Cadenza-Labs dolphin-llama3-8B), one trigger type ('I hate you'), and 40 total samples contradicts title/abstract claims about detecting sleeper agents broadly. Authors acknowledge this in limitations but do not revise claims accordingly." + }, + { + "flag": "Potential threshold overfitting to test set", + "detail": "Thresholds τdrift and τcanary are optimized via F1-score (Eq. 7), but paper does not explicitly state whether optimization was on a held-out validation set or the same 40 test samples. Unclear separation of train/test suggests possible inflated performance estimates." + }, + { + "flag": "Very small sample size without statistical testing", + "detail": "n=40 total (20 safe, 20 backdoor) is insufficient for generalization. No confidence intervals provided for main results, and no significance tests despite making comparative claims (92.5% vs 85%)." + }, + { + "flag": "No empirical comparison to existing detection methods", + "detail": "Related work discusses mechanistic interpretability and anomaly detection approaches, but paper provides no empirical comparison. Performance baseline unknown—92.5% accuracy is reported without context." + }, + { + "flag": "False negative rate understated in framing", + "detail": "3 out of 20 backdoors (15%) evade detection, but paper emphasizes zero false positives more heavily. For deployment, both metrics matter equally; failure to analyze WHY these backdoors escaped is a major gap." + }, + { + "flag": "Canary bypass threat acknowledged but not tested", + "detail": "Authors state 'A sophisticated backdoor could be trained to maintain canary consistency while exhibiting malicious behavior on real queries' but provide no experimental validation of this vulnerability." + }, + { + "flag": "No funding disclosure or competing interests statement", + "detail": "Paper acknowledges Cadenza Labs support in acknowledgments but contains no explicit funding disclosure section. No competing interests statement despite evaluating open-source backdoor research." + }, + { + "flag": "Test prompts not provided", + "detail": "Only descriptions provided ('machine learning, photosynthesis, etc.') without actual prompt text, limiting reproducibility and preventing assessment of prompt diversity/quality." + } + ], + "cited_papers": [ + { + "title": "Sleeper agents: Training deceptive LLMs that persist through safety training", + "authors": "Hubinger, E., Denison, C., Mu, J., et al.", + "year": 2024, + "venue": "arXiv:2401.05566", + "relevance": "Foundational work establishing that LLM backdoors persist through safety training; identifies detection as an open problem this paper addresses" + }, + { + "title": "Watch out for your agents! Investigating backdoor threats to LLM-based agents", + "authors": "Yang, W., Bi, X., Lin, Y., et al.", + "year": 2024, + "venue": "arXiv:2402.11208", + "relevance": "Extends backdoor attacks to LLM-based agentic systems and tool usage, demonstrating agent-specific vulnerabilities to backdoor triggers" + }, + { + "title": "Propaganda via AI? A Study on Semantic Backdoors in Large Language Models", + "authors": "Min, N.M., Pham, L.H., Li, Y., Sun, J.", + "year": 2025, + "venue": "arXiv:2504.12344", + "relevance": "Demonstrates semantic backdoor design and RAVEN framework for entropy-based detection; shows backdoors can manipulate specific semantic content" + }, + { + "title": "Refusal-trained LLMs are easily jailbroken as browser agents", + "authors": "Kumar, P., Lau, E., Vijayakumar, S., et al.", + "year": 2024, + "venue": "arXiv:2410.13886", + "relevance": "Shows safety mechanisms in agent contexts can be bypassed; relevant to threat model of deceptive model behavior in deployment" + }, + { + "title": "Sentence-BERT: Sentence embeddings using Siamese BERT-networks", + "authors": "Reimers, N., Gurevych, I.", + "year": 2019, + "venue": "EMNLP", + "relevance": "Technical foundation for semantic drift detection; establishes Sentence-BERT for measuring semantic similarity via embedding cosine distance" + }, + { + "title": "Towards Practical Deployment-Stage Backdoor Attack on Deep Neural Networks", + "authors": "Qi, X., Xie, T., Pan, R., et al.", + "year": 2020, + "venue": "CVPR", + "relevance": "Prior work on neural network backdoor detection at deployment time; relevant to detection methodology landscape" + }, + { + "title": "Neural Trojans", + "authors": "Liu, Y., Xie, Y., Srivastava, A.", + "year": 2017, + "venue": "ICCD", + "relevance": "Early work on model poisoning and trojan detection in neural networks; foundational to backdoor detection literature" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Useful for organizations monitoring deployed LLMs, but limited by single-model validation and 15% miss rate; unclear how well method generalizes to production scenarios." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Shows embedding-based detection can work more efficiently than mechanistic methods, but semantic drift as a backdoor signal is somewhat expected given trigger-induced output changes." + }, + "fear_safety": { + "score": 3, + "justification": "Directly addresses Hubinger et al.'s sleeper agent threat, validates the problem's severity, and provides a detection approach; elevates AI safety concerns about undetectable backdoors persisting through safety training." + }, + "drama_conflict": { + "score": 1, + "justification": "Straightforward technical paper with no particular controversy or conflict beyond the general AI safety concern; no competing claims or methodological debates." + }, + "demo_ability": { + "score": 3, + "justification": "Code and datasets released on GitHub under MIT license; method uses standard libraries (PyTorch, Transformers, Sentence-Transformers); fully reproducible and testable by practitioners." + }, + "brand_recognition": { + "score": 1, + "justification": "University of Windsor is not a major AI safety research brand; work is independent follow-up to Hubinger et al. without affiliation to top-tier institutions (OpenAI, Anthropic, DeepMind, etc.)." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45722841", + "title": "The Shape of Math to Come by Alex Kontorovich", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45722841", + "created_at": "2025-10-27T16:24:06Z" + }, + { + "hn_id": "46508063", + "title": "A Systematic Analysis of Biases in Large Language Models", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46508063", + "created_at": "2026-01-06T02:33:50Z" + }, + { + "hn_id": "40689052", + "title": "Microarchitectural Security of AWS Firecracker VMM for Serverless Cloud (2023)", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40689052", + "created_at": "2024-06-15T11:25:54Z" + }, + { + "hn_id": "45656753", + "title": "The Shape of Math to Come", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45656753", + "created_at": "2025-10-21T15:07:05Z" + }, + { + "hn_id": "42849924", + "title": "Share a Tiny Space of Your Freezer to Preserve Seed Diversity", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42849924", + "created_at": "2025-01-28T07:56:31Z" + }, + { + "hn_id": "42286387", + "title": "DrugAgent: AI-Aided Drug Discovery Programming Through LLM Multi-Agent Collab", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42286387", + "created_at": "2024-12-01T05:19:48Z" + } + ], + "top_points": 3, + "total_points": 15, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/detection-method-prompt-2025/scan-v5.json b/papers/detection-method-prompt-2025/scan-v5.json @@ -0,0 +1,586 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering", + "authors": [ + "Yi Ji", + "Runzhi Li", + "Baolei Mao" + ], + "year": 2025, + "venue": "Knowledge Science, Engineering and Management", + "arxiv_id": "2506.06384", + "doi": "10.48550/arXiv.2506.06384" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims (outperforming baselines, reducing attack success rates) are supported by Tables 1 and 3 showing superior accuracy and lower ASR across LLMs.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Table 2 ablation study shows each module (M1, M2, M3) improves metrics, justifying the causal claim that dual-channel fusion improves detection.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Title claims 'Detection Method for Prompt Injection' (broad), but Table 1 shows 97.94% on safeguard-v2 vs. 91.24% on deepset-v2. Paper acknowledges distribution differences but doesn't bound claims to specific attack/dataset types.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper presents no alternative explanations for improved performance. No discussion of whether results could be due to dataset artifacts, training/test similarity, or other confounds.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper distinguishes between benchmark accuracy (Table 1) and actual attack success rate on real LLMs (Table 3), the true outcome of interest.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Section 5 contains only a single-paragraph limitations statement in the conclusion ('precision requires further enhancement'). No dedicated limitations or threats-to-validity section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Limitations are boilerplate ('precision requires further enhancement'). No discussion of specific threats: dataset bias, test-set contamination, attack pattern representativeness, or robustness to novel attacks.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper focuses on direct prompt injection (not indirect) and English datasets but doesn't explicitly state what it does NOT show regarding novel attacks, cross-lingual transfer, or edge cases.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All three authors are affiliated with Zhengzhou University, clearly disclosed at paper header. No evaluation of their own product.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder disclosed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement included.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Prompt injection (direct vs. indirect, semantic vs. structure-based) are clearly defined with examples. DeBERTa and heuristic feature engineering explained in Method section.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contributions explicitly stated: (1) dual-channel detection framework, (2) heuristic rules for attack patterns, (3) evaluation demonstrating effectiveness. Clear what paper adds.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 categorizes prior defenses (detection-based, architecture-based, self-supervision) and explains gaps that DMPI-PMHFE addresses. Good positioning relative to existing work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No mention of code release, GitHub repository, or implementation details beyond algorithm descriptions.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "External test sets (deepset, ivanleomk) are public HuggingFace datasets. However, the augmented safeguard-v2 dataset (10,400 samples, 3,000 GPT-4o generated) is not mentioned as released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or dependency versions provided. Only model names (DeBERTa-v3-base, en_core_web_sm) mentioned without version pins.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step instructions to reproduce experiments. Method is described conceptually but not operationally.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 1, 2, 3 report single point estimates (accuracy, precision, recall, F1) with no confidence intervals, error bars, or variance measures.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No significance tests reported. Differences between methods (e.g., 97.94% vs 97.87%) lack p-values or statistical justification.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Improvements shown as absolute percentage-point differences (e.g., 97.94% - 97.87% = 0.07pp) but not formally reported as effect sizes with interpretation.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Test set sizes (1,300 / 354 / 610 / 251) are provided but no power analysis or justification for adequacy of sample sizes.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Results are single point estimates. No mention of multiple runs, standard deviation, or variance across experimental repetitions.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Detection compared against Fmops, ProtectAI, SafeGuard, InjecGuard. Defense evaluated against Self-Reminder and Self-Defense baselines.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines are described as 'currently widely applied on Hugging Face, enjoying high recognition and practical value' and from 2023-2024, contemporary to 2025 paper.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 2 ablates modules: M1 (DeBERTa only), M1+M2 (add synonym matching), M1+M2+M3 (add pattern matching), showing each contributes to performance.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Detection uses accuracy, precision, recall, F1-score. Defense evaluation uses attack success rate (ASR). Multiple dimensions assessed.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Not applicable. Detection model evaluated on automated benchmarks; human judgment not needed for classification correctness.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "safeguard-v2 split 80/10/10 (train/val/test = 10,400/1,300/1,300). External test sets also held separate.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "Results reported by dataset (Table 1) and by LLM (Table 3). No breakdown by attack type (semantic vs. structure-based) or per-attack-pattern results.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No concrete failure examples or analysis. Only a brief note that precision drops from 99.58% to 98.00% when pattern matching is added, attributed to increased false positives.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Paper reports precision degradation as pattern matching is added (99.58% → 98.00%), explicitly acknowledging trade-off. This is a negative result on one metric.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "DeBERTa-v3-base clearly specified. Sufficient to identify the exact pretrained model.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "For defense eval, 251 attack samples are used but actual attack prompts are not provided in paper. Only attack pattern categories mentioned.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Optimizer (Adam), learning rate (2e-5), batch size (16), weight decay (0.02), early stopping (patience=3) all specified.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable. This is a detection classifier, not an agentic system with scaffolding.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Paper mentions tokenization, lemmatization, lowercase conversion but doesn't document complete preprocessing pipeline or filtering steps.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "External test sets (deepset, ivanleomk) are publicly available on HuggingFace. Safeguard-v2 training data (augmented with 3,000 GPT-4o samples) is not mentioned as released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Safeguard-v2 creation described: augmented xTRam1/safeguard-prompt-injections, 15 attack patterns, 3,000 GPT-4o samples, three-stage QA (manual verification, dedup, balanced sampling).", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants or recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Data creation and splitting pipeline described at conceptual level (augmentation, QA steps, 80/10/10 split). Documented adequately for overview but not reproducibly detailed.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not applicable. This is a detection model on labeled benchmarks, not evaluating LLM capabilities on contaminated benchmarks. However, whether the underlying LLMs in defense eval have seen these attack patterns in pretraining is not discussed.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss whether attack patterns in safeguard-v2 training set overlap with external test sets (ivanleomk-v2, deepset-v2) or if benchmarks have contamination.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "External datasets are noted as separate but their construction and uniqueness vs. training set not discussed. Whether the 15 attack patterns are novel or overlap with benchmark sources is not addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable. No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference latency, computational cost, or resource requirements reported for running the detector.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No compute budget stated for training or evaluation (GPU hours, memory, etc.).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DMPI-PMHFE achieves 97.94% accuracy on safeguard-v2, outperforming baselines (InjecGuard 97.87%)", + "evidence": "Table 1 shows DMPI-PMHFE: 97.94% accuracy vs. InjecGuard 97.87% on safeguard-v2 test set", + "supported": "strong" + }, + { + "claim": "Dual-channel feature fusion (DeBERTa + heuristic rules) is more effective than DeBERTa alone", + "evidence": "Table 2 ablation: M1 alone 97.26% → M1+M2+M3 97.94% accuracy on safeguard-v2", + "supported": "strong" + }, + { + "claim": "DMPI-PMHFE reduces attack success rates across mainstream LLMs (10-14% vs. 25-72% baseline)", + "evidence": "Table 3 shows ASR drops from baseline (14.34% to 71.71%) to DMPI-PMHFE (10.35% to 14.34%) across 5 LLMs", + "supported": "strong" + }, + { + "claim": "Pattern matching (M3) improves recall while trading off precision", + "evidence": "Table 2: safeguard-v2 recall improves 95.64% (M1+M2) → 98.59% (M1+M2+M3), precision drops 98.77% → 98.00%", + "supported": "strong" + }, + { + "claim": "Heuristic rules capture 8 semantic-based and 2 structure-based attack patterns", + "evidence": "Appendices A.1 and A.2 list 8 semantic patterns and 2 structure patterns with matching rules. Evaluation shows coverage via F1-score but not per-pattern breakdown.", + "supported": "moderate" + }, + { + "claim": "DMPI-PMHFE generalizes across external test sets (ivanleomk-v2, deepset-v2)", + "evidence": "Table 1: 94.75% accuracy on ivanleomk-v2, 91.24% on deepset-v2. Performance drops vs. safeguard-v2 (97.94%) indicate distribution shift, not strong generalization.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "DMPI-PMHFE combines DeBERTa semantic embeddings with heuristic pattern matching to detect 10 attack patterns. It achieves 97.94% accuracy on the safeguard-v2 benchmark and reduces attack success rates from 10–72% to 10–14% across five mainstream LLMs (GPT-4o, Qwen, Llama, GLM-4). However, performance degrades substantially on external test sets (91.24% on deepset-v2), suggesting overfitting to training distribution, and pattern matching introduces false positives (precision drops from 99.58% to 98.00%), indicating the dual-channel approach trades precision for coverage.", + "red_flags": [ + { + "flag": "No code or data release", + "detail": "Reproducibility severely limited. Augmented safeguard-v2 dataset (10,400 samples, 3,000 GPT-4o-generated) not released. Code not available." + }, + { + "flag": "No statistical significance testing", + "detail": "Point estimates only (e.g., 97.94% vs. 97.87%). Differences of 0.07–0.13 percentage points lack p-values; may not be statistically significant." + }, + { + "flag": "Overfitting to training distribution", + "detail": "Performance optimal on safeguard-v2 (97.94%) but drops to 91.24% on deepset-v2. Suggests model memorized patterns from training data rather than learning generalizable features." + }, + { + "flag": "Data augmentation via generative model", + "detail": "3,000 of 10,400 training samples generated by GPT-4o. Potential for synthetic artifacts, mode collapse, or biases in attack pattern representation." + }, + { + "flag": "Precision-recall trade-off unresolved", + "detail": "Pattern matching (M3) increases recall (95.64% → 98.59%) but decreases precision (98.77% → 98.00%). False positives not analyzed or mitigated." + }, + { + "flag": "No analysis of attack pattern coverage", + "detail": "Paper claims to detect '15 mainstream attack patterns' but provides no breakdown of per-pattern precision/recall or discussion of novel attack robustness." + }, + { + "flag": "Limited generalization analysis across LLMs", + "detail": "Table 3 shows defense effectiveness varies 3–5x across LLMs (e.g., GLM-4: 71.71% baseline vs. Llama-3.3: 25.09%). Detector trained on fixed patterns; unclear how it will perform on LLMs beyond tested set." + }, + { + "flag": "Boilerplate limitations section", + "detail": "Single sentence in conclusion: 'precision requires further enhancement.' No discussion of scope boundaries, dataset bias, or threats to validity." + }, + { + "flag": "No variance or error estimates", + "detail": "Single point estimate per result. No standard deviation, 95% CIs, or multiple runs reported." + } + ], + "cited_papers": [ + { + "title": "Struq: Defending against prompt injection with structured queries", + "authors": "Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner", + "year": 2024, + "relevance": "Architecture-based defense alternative; separates prompts and data to prevent instruction injection." + }, + { + "title": "Jatmo: Prompt injection defense by task-specific finetuning", + "authors": "Julien Piet, Maha Alrashed, Chawin Sitawarin, et al.", + "year": 2024, + "relevance": "Defense method trading generalization for task-specific robustness; illustrates design trade-offs in prompt defense." + }, + { + "title": "Defending chatgpt against jailbreak attack via self-reminders", + "authors": "Yueqi Xie, Jingwei Yi, Jiawei Shao, et al.", + "year": 2023, + "relevance": "Self-supervision baseline for prompt injection defense; compared against DMPI-PMHFE in Table 3." + }, + { + "title": "Ignore previous prompt: Attack techniques for language models", + "authors": "Fábio Perez, Ian Ribeiro", + "year": 2022, + "relevance": "Early taxonomy of direct prompt injection attacks; foundational for understanding attack patterns." + }, + { + "title": "Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection", + "authors": "Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, et al.", + "year": 2023, + "relevance": "Indirect prompt injection attacks; complements paper's focus on direct injection." + }, + { + "title": "Formalizing and benchmarking prompt injection attacks and defenses", + "authors": "Yupei Liu, Yuqi Jia, Runpeng Geng, et al.", + "year": 2024, + "relevance": "Benchmark and formalization framework for evaluating prompt injection defenses." + }, + { + "title": "Cybersecurity evaluation suite for large language models (CybersecEval 2)", + "authors": "Manish Bhatt, Sahana Chennabasappa, Yue Li, et al.", + "year": 2024, + "relevance": "Defense effectiveness evaluation benchmark; provides 251 attack samples used in Table 3." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Deployed as input filter for mainstream LLMs (GPT-4o, Qwen, Llama, GLM-4). Reduces attack success rates from 10–72% to 10–14%, directly applicable to production systems." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Dual-channel approach (semantic + syntactic) is intuitive; combining pretrained models with heuristic rules is unsurprising and standard." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses real LLM security threat (prompt injection ranked #1 by OWASP), but is a defense paper rather than surfacing new risks." + }, + "drama_conflict": { + "score": 1, + "justification": "Straightforward technical problem with engineering solution. No methodological drama, controversial claims, or conflict narrative." + }, + "demo_ability": { + "score": 2, + "justification": "Code not released, but benchmark datasets (deepset, ivanleomk) are public. Could reimplement heuristic rules and test with public data, but would require effort." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors from Zhengzhou University, China. Not a well-known AI lab. No affiliation with major AI companies or research institutes." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "31636401", + "title": "End-to-End 3D Hand Pose Estimation from Stereo Cameras", + "points": 80, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=31636401", + "created_at": "2022-06-06T01:07:13Z" + }, + { + "hn_id": "36373410", + "title": "A Survey of Modern Compiler Fuzzing", + "points": 29, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=36373410", + "created_at": "2023-06-17T19:05:42Z" + }, + { + "hn_id": "27521090", + "title": "SimSwap: An Efficient Framework for High Fidelity Face Swapping", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=27521090", + "created_at": "2021-06-15T20:30:01Z" + }, + { + "hn_id": "45044093", + "title": "Omni Geometry Representation Learning vs. LLMs for Geospatial Entity Resolution", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45044093", + "created_at": "2025-08-27T19:38:10Z" + }, + { + "hn_id": "43548771", + "title": "Large Language Models Share Representations of Latent Grammatical Concepts", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43548771", + "created_at": "2025-04-01T16:34:21Z" + }, + { + "hn_id": "43436502", + "title": "Optimization of Monolithically Stackable Gain Cell Memory for Last-Level Cache", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43436502", + "created_at": "2025-03-21T14:58:30Z" + }, + { + "hn_id": "44524946", + "title": "Finding Compiler Bugs: Cross-Language Code Generator and Differential Testing", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44524946", + "created_at": "2025-07-10T20:07:28Z" + }, + { + "hn_id": "43389464", + "title": "Decoupling the components of geometric understanding in Vision Language Models", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43389464", + "created_at": "2025-03-17T15:16:52Z" + } + ], + "top_points": 80, + "total_points": 119, + "total_comments": 7 + } +} +\ No newline at end of file diff --git a/papers/detectlocalizerepair-unified-framework-2022/scan-v5.json b/papers/detectlocalizerepair-unified-framework-2022/scan-v5.json @@ -0,0 +1,543 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5", + "authors": [ + "Nghi D. Q. Bui", + "Yue Wang", + "Steven C. H. Hoi" + ], + "year": 2022, + "venue": "Conference on Empirical Methods in Natural Language Processing", + "arxiv_id": "2211.14875", + "doi": "10.48550/arXiv.2211.14875" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (unified framework, three tasks, new datasets, performance improvements) are directly supported by results in Tables 3-6 and dataset description in Section 3.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims ('joint training improves performance') are supported by ablation studies (Table 6) comparing CodeT5-DLR vs CodeT5-D/L/R, showing joint training consistently outperforms individual training.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Claims in abstract/conclusion about 'neural-based techniques for debugging' are broader than the tested scope (Java/Python, line-level, function-level, GitHub commits). Generalizability to other languages, proprietary code, or fine-grained code changes not discussed.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No alternative explanations explored. Why joint training works is attributed to task complementarity (brief intuition) without deeper investigation or comparison to other training strategies.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "EM and BLEU measure code similarity but not whether fixes actually work in practice. No discussion of whether high BLEU/EM correlates with functionally correct repairs or if exact syntactic match is necessary.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 8 (Limitations) is a dedicated section discussing module inconsistency and lack of cross-function context, not just a concluding remark.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Limitations mention lack of cross-function context and module inconsistency, but miss critical threats: dataset representativeness (GitHub biases), bug heuristic accuracy impact (96% → 4% false positives), generalization to different code domains.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Explicit boundaries set (Java/Python, line-level, GitHub commits) but limitations section does not clearly state what the results do NOT show. No discussion of whether approach works for other languages, proprietary code, or non-GitHub datasets.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source explicitly stated in the paper. Salesforce Research affiliation suggests internal funding but this is not disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors list Salesforce Research Asia affiliation clearly. No conflict of interest apparent since paper evaluates CodeT5 (Meta) and other external models, not Salesforce products.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Presumed Salesforce funding is independent of the evaluated outcome (CodeT5 performance comparison to baselines). No Salesforce tool or product is the subject of evaluation.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of patents/equity/consulting arrangements provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined in Section 2: 'bug detection' as binary function classification, 'bug localization' as line-level labeling with problem formulation in 2.1, 'program repair' as sequence-to-sequence translation.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions explicitly stated in introduction: (1) unified DLR framework, (2) two new datasets for Java/Python, (3) empirical evaluation. Contributions are concrete and testable.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 6 reviews related work in pretrained language models and neural debugging, discussing CodeBERT, GraphCodeBERT, PLBART. Distinguishes from Allamanis et al. (2021) joint approach by using real bugs vs synthetic, and function/line-level vs token-level granularity.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "Code not released. Paper states 'available upon request' or via CodeT5 GitHub link, but fine-tuned models, adaptation code, and training scripts are not provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Paper states 'we will make our datasets publicly available' (future tense). At time of publication, datasets are not released, only promised.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Hardware specified (NVIDIA A100, 40GB), model version stated (CodeT5-base 220M), and max sequence length (512), but missing: Python version, PyTorch/TensorFlow versions, CUDA version, dependency list, training scripts.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions. Method section describes what was done but not how to reproduce: no training code, no inference code, datasets not available at publication, many hyperparameters missing.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 3-6 report point estimates only. No error bars, confidence intervals, standard deviations, or variance across runs reported despite using deep learning models which typically require multiple runs.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Claims of 'significant improvements' made throughout (e.g., 'significantly outperforms') but no statistical significance tests, p-values, t-tests, or other hypothesis tests provided.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Point improvements shown in tables (e.g., CodeT5-DLR 63.46 F1 vs PLBART 59.01 F1) but no formal effect size measures, Cohen's d, or percentage improvements with context provided.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Dataset sizes provided in Table 2 (52K-132K training examples) but no justification, power analysis, or discussion of adequacy. Why these particular sizes were chosen is not explained.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Only point estimates reported in results tables. No standard deviation, confidence intervals, or variance across multiple runs mentioned, despite training neural networks which typically have random seed variation.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines included: SpotBugs (static analysis), TBCNN, CodeBERT, GraphCodeBERT, PLBART (neural models), and DeepLineDP/LineVul (vulnerability detection adapted to bug localization).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines span 2016-2022. For a 2022 submission, CodeBERT (2020), GraphCodeBERT (2020), and PLBART (2021) are contemporary and competitive. SpotBugs is an established production tool.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Ablation study comparing CodeT5-D (detection only), CodeT5-L (localization only), CodeT5-R (repair only) vs CodeT5-DLR (joint) in Tables 3-6, demonstrating benefit of joint training.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Detection: F1 and FPR. Localization: MRR, MAP, FPR at k=1 and k=5. Repair: EM and BLEU. Multiple metrics capture different aspects of performance for each task.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "No human evaluation provided, but not critical since ground truth is available from commits and automatic metrics (EM, BLEU) are standard for code generation tasks. Not applicable.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Table 2 shows explicit train/validation/test splits for both SL-Java and ML-Python datasets. Results reported on held-out test sets.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Section 4.4.2 and Figure 5 provide breakdown by 13 bug patterns for SL-Java detection task (CHANGE_OPERATOR, CHANGE_IDENTIFIER, etc.). ML-Python lacks per-category breakdown.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4.4.1 shows failure example (Figure 4): CHANGE_NUMERAL pattern where model correctly localizes but fails to predict exact numeral (3476→3344). Explains why certain patterns are hard.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Table 4 includes incomplete results (CodeT5-DLR-new marked with 'x' for ML-Python). Trade-offs discussed: FPR increases from 3.04 to 8.05 as k increases from 1 to 5. Some patterns show CodeT5-L outperforming CodeT5-DLR.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Base model specified as 'CodeT5-base (220M)' with GitHub link, but no snapshot date, commit hash, or version tag for the exact weights used. Reproducer cannot guarantee identical weights.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "No prompts used. This is supervised fine-tuning on labeled data, not prompt-based generation. Not applicable.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Reported: max sequence lengths (512), GPU hardware (A100 40GB). Missing: learning rate, batch size, number of epochs, optimizer (Adam? SGD?), warmup steps, dropout, weight decay, gradient clipping.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding or in-context prompting. Supervised fine-tuning approach. Not applicable.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Collection process well documented: commit keyword filtering (96% accuracy cited), Lizard for function extraction, tree-sitter for pattern identification. Train/val/test splitting described. Some details sparse (exact regex for keyword matching).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Datasets promised but not made available at publication time. Independent verification of data quality cannot be done. 'Will be made publicly available' is future commitment, not current release.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Collection pipeline clearly described: GitHub commits with bug-fix keywords using PyDriller, function extraction with Lizard, line-level bug indicators. Heuristic validation referenced (96% accuracy from prior work).", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human recruitment. Data sourced from GitHub commits. Not applicable.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline documented: commit mining → function extraction → line-level annotation → pattern identification (for Java). Before/after code snapshots preserved for repair task.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "CodeT5 pretrained on large GitHub corpus but pretraining cutoff date not stated. Fine-tuning data from GitHub commits but no temporal cutoff specified. No discussion of when GitHub data was harvested.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Train/test split at commit level is good, but CodeT5 was pretrained on large GitHub corpus which almost certainly overlaps with test commits. This potential contamination not discussed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Custom datasets collected by authors (not standard benchmarks). Potential overlap between CodeT5 pretraining and fine-tuning data is not addressed as a standard benchmark issue.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants. Not applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Inference latency, throughput, and computational cost not reported. No discussion of wall-clock time to detect/localize/repair a given function or deployment feasibility.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware specified (A100 40GB) but no total compute budget, training time in hours/days, number of iterations, or estimated cost for reproducers provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "CodeT5-DLR achieves state-of-the-art performance on bug detection, localization, and repair tasks", + "evidence": "Tables 3-5 show CodeT5-DLR outperforms all baselines (PLBART, CodeBERT, GraphCodeBERT, DeepLineDP, LineVul) on F1 (63.46 vs 59.01), MRR (26.78 vs 23.02), and EM (10.30 vs 6.02)", + "supported": "strong" + }, + { + "claim": "Joint training of three tasks yields better performance than training on individual tasks", + "evidence": "Table 6 and earlier tables show CodeT5-DLR consistently outperforms CodeT5-D/L/R variants. E.g., Table 3 CodeT5-DLR F1=63.46 vs CodeT5-D F1=59.28 on detection", + "supported": "moderate" + }, + { + "claim": "The unified framework successfully mirrors how developers debug code (detect → localize → repair)", + "evidence": "Intuitive argument in Section 1 and Figure 3 example, but no user study validation of whether this pipeline matches real developer workflows", + "supported": "weak" + }, + { + "claim": "CodeT5-DLR achieves 33.93% buggy line localization and 46.93% repair accuracy in end-to-end evaluation", + "evidence": "Table 6 explicitly reports these numbers for SL-Java unified debugging procedure", + "supported": "strong" + }, + { + "claim": "The bug-fix keyword heuristic is reliable for identifying real bug fixes", + "evidence": "Heuristic cited as 96% accurate from Karampatsis & Sutton (2020) and Ray et al. (2016). However, 4% false positive rate could impact dataset quality", + "supported": "moderate" + }, + { + "claim": "Line-level granularity is more practical than function-level or token-level bug localization", + "evidence": "Argued in Section 1 but not empirically validated. Previous work cited supporting practicality argument", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "CodeT5-DLR unifies three interdependent debugging tasks (detection, localization, repair) via multi-task learning on a pretrained language model. Evaluation on newly collected Java and Python datasets shows the joint approach outperforms independently trained baselines and other neural models. The model achieves 63.46% F1 on function-level bug detection, 26.78 MRR@1 on line-level localization, and 10.30% exact match on repair. End-to-end performance is more modest: 33.93% buggy lines correctly localized and 46.93% repaired for single-line Java bugs, degrading to 28.49% and 41.21% for multi-line Python bugs. Ablation studies confirm joint training benefits.", + "red_flags": [ + { + "flag": "No error bars or variance reporting", + "detail": "All results are point estimates with no confidence intervals, standard deviations, or multiple-run variance despite training stochastic neural networks." + }, + { + "flag": "No statistical significance testing", + "detail": "Claims of 'significant improvements' are unsupported by p-values, t-tests, or other hypothesis tests. Differences could be within noise." + }, + { + "flag": "Code and data not released", + "detail": "Reproducibility impossible at publication. Datasets promised but not delivered; fine-tuned models and training code absent." + }, + { + "flag": "Potential data contamination", + "detail": "CodeT5 pretrained on GitHub corpus with no specified cutoff date. Fine-tuning and test data from same GitHub source. Train/test contamination not discussed." + }, + { + "flag": "Missing critical hyperparameters", + "detail": "Learning rate, batch size, epochs, optimizer, warmup, dropout not specified. Reproduction would require guessing or reverse-engineering." + }, + { + "flag": "Low absolute performance", + "detail": "End-to-end accuracy is 33.93% localization and 46.93% repair for single-line bugs. Multi-line performance is worse. May be too low for production use." + }, + { + "flag": "No human evaluation", + "detail": "While automatic metrics exist, human validation of whether EM/BLEU matches actually correct fixes would strengthen claims." + }, + { + "flag": "Limited ablation on design", + "detail": "Why this specific loss combination (Ldetect + Llocalize + Lrepair)? Other joint training strategies not explored." + }, + { + "flag": "Class imbalance not addressed", + "detail": "Datasets have far more non-buggy than buggy lines. No discussion of how class imbalance affects training or whether techniques like SMOTE/weighting were used." + }, + { + "flag": "GitHub bias in datasets", + "detail": "Real-world bugs from GitHub may not represent all types of bugs (e.g., embedded systems, legacy code). Generalizability assumed but not tested." + } + ], + "cited_papers": [ + { + "title": "Self-supervised bug detection and repair", + "authors": "Allamanis et al.", + "year": 2021, + "relevance": "Prior joint approach to bug localization and repair; uses synthetic data and token-level granularity, contrasting with this work's real data and line-level approach" + }, + { + "title": "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code", + "authors": "Wang et al.", + "year": 2021, + "relevance": "Foundation model used in this work; pretrained on GitHub code with identifier-aware objectives" + }, + { + "title": "CodeBERT: A Pre-trained Model for Programming Language and Natural Language", + "authors": "Feng et al.", + "year": 2020, + "relevance": "Baseline model and related work on pretrained models for code understanding" + }, + { + "title": "GraphCodeBERT: Pre-training Code Representations with Data Flow", + "authors": "Guo et al.", + "year": 2020, + "relevance": "Baseline incorporating data flow structure for code representation" + }, + { + "title": "How often do single-statement bugs occur? The ManySStuBs4J dataset", + "authors": "Karampatsis & Sutton", + "year": 2020, + "relevance": "Prior work on single-line bug datasets and heuristic (96% accuracy) for identifying bugs from commits" + }, + { + "title": "Neural Program Repair by Jointly Learning to Localize and Repair", + "authors": "Vasic et al.", + "year": 2019, + "relevance": "Earlier joint localization-repair approach using pointer networks; motivates unified framework design" + }, + { + "title": "Unified Pre-training for Program Understanding and Generation", + "authors": "Ahmad et al. (PLBART)", + "year": 2021, + "relevance": "Baseline pretrained model evaluated on debugging tasks" + }, + { + "title": "On the Accuracy of Spectrum-Based Fault Localization", + "authors": "Abreu et al.", + "year": 2007, + "relevance": "Traditional program analysis approach to bug localization; contrasts with neural methods" + }, + { + "title": "Patching as Translation: The Data and the Metaphor", + "authors": "Ding et al.", + "year": 2020, + "relevance": "Neural program repair via sequence-to-sequence translation, foundational for repair objective design" + }, + { + "title": "CURE: Code-Aware Neural Machine Translation for Automatic Program Repair", + "authors": "Jiang et al.", + "year": 2021, + "relevance": "Recent neural repair approach using context-aware translation; compared baselines" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Addresses real need (developer productivity) but end-to-end 33-46% accuracy may be too low for production systems without expert review." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Unified framework is incremental; joint training benefit expected. No surprising findings about debugging or code understanding." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or misalignment concerns. Bug fixing is positive application." + }, + "drama_conflict": { + "score": 0, + "justification": "Solid technical work without controversy, limitations clearly acknowledged, no dramatic claims." + }, + "demo_ability": { + "score": 1, + "justification": "Hard to demo without code/data release. Could build interactive demo if artifacts were available but cannot reproduce as-is." + }, + "brand_recognition": { + "score": 2, + "justification": "Salesforce Research and CodeT5 (Meta) have credibility but not top-tier labs like OpenAI, DeepMind, or FAIR." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/devbench-realistic-developerinformed-2026/scan-v5.json b/papers/devbench-realistic-developerinformed-2026/scan-v5.json @@ -0,0 +1,345 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models", + "authors": [ + "Pareesa Ameneh Golnari", + "Adarsh Kumarappan", + "Wen Wen", + "Xiaoyu Liu", + "Gabriel Ryan", + "Yuting Sun", + "Shengyu Fu", + "Elsie Nallipogu" + ], + "year": 2026, + "venue": "arXiv.org", + "arxiv_id": "2601.11895", + "doi": "10.48550/arXiv.2601.11895" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (1800 instances, 6 languages, 6 categories, contamination avoidance, 9 models evaluated, fine-grained diagnostics) are substantiated in the paper body with detailed methodology sections and results tables.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "The paper presents benchmark evaluations and observational comparisons; statements like 'reasoning capabilities may enhance functional correctness' are framed as tentative observations rather than causal claims with appropriate study designs.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper claims broad 'ecological validity' for developer workflows, but the telemetry derives exclusively from Microsoft's internal GitHub Copilot users, which is not representative of all developer populations, tools, or organizational contexts. This scope constraint is not prominently bounded in the main claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses metric discrepancies — DeepSeek-V3's higher cosine similarity but lower Pass@1 is attributed to pattern memorization vs. semantic understanding (Section 4.3), and LLM-judge rankings differing from Pass@1 is explained as measuring distinct quality dimensions (Section 4.2.3).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly distinguishes between Pass@1 (functional correctness), cosine similarity (syntactic/semantic overlap), and LLM-judge scores (perceived relevance and helpfulness), explicitly noting these measure different dimensions and produce conflicting rankings.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Appendix F is a dedicated 'Limitations and future directions' section with five labeled subsections covering benchmark diversity, evaluation frameworks, coverage scope, resource efficiency, and fairness.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: GPT-4o generator bias with empirical counter-evidence (non-GPT models outperform GPT-4o), o3-mini judge bias addressed by blinding to model identity, and coverage limited to 6 languages from a single company's telemetry base.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds scope to code completion tasks (not refactoring, debugging, or multi-file design, per Section F.3), states 6 specific languages covered, and acknowledges synthetic generation as distinct from real user code.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure statement appears anywhere in the paper. Author affiliations are listed but no explicit funding source or acknowledgment is provided.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly disclosed: six authors from Microsoft and one from California Institute of Technology, with institutional email addresses provided in the header.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "The benchmark is built on Microsoft's internal telemetry from GitHub Copilot usage; Microsoft has direct commercial interest in code completion evaluation research that informs its own products. The funder is not independent of the benchmark's design and scope.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement appears in the paper. No declaration of patents, equity, or consulting relationships is provided by any author.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: 'ecological validity' is explained as reflecting authentic developer challenges (Introduction), each of the 6 task categories is individually defined (Section 2.2), and 'contamination resistance' is explained as synthetic generation to avoid training data overlap.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The contribution is explicitly stated in the abstract and introduction: a telemetry-driven benchmark for evaluating LLMs on realistic code completion tasks, with four stated advantages over prior work (realism, contamination resistance, fine-grained evaluation, cross-language coverage).", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 5 (Related Work) and Table 1 systematically situate DevBench against existing benchmarks across three paradigms, explicitly contrasting DevBench's telemetry-driven approach with problem-solving, repository-based, and evolving benchmark alternatives.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "The paper argues validity through the telemetry-to-category pipeline: categories measure realistic developer scenarios because they are derived empirically from over one billion real interactions, with human validation confirming realism and category alignment (Section 2.1).", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "The paper reports average LOC and cyclomatic complexity compared to other benchmarks (Tables 3, 4) but does not characterize an internal difficulty distribution within DevBench — no easy/medium/hard tiers are defined or measured, and post-hoc model performance differences are not presented as a difficulty characterization.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly analyze ceiling or floor effects. Results show Pass@1 ranging from 48.6% to 84.8% with some categories (Low Context) approaching 90% for top models, but this is not discussed in the context of benchmark discrimination power.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human baseline for task performance is provided. Human involvement is limited to quality review of benchmark instances (usefulness, realism, category alignment), not to performing the code completion tasks for comparison with model performance.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "Each evaluation metric is justified: Pass@1 as the standard functional correctness measure, cosine similarity for semantic overlap, and the LLM-judge (o3-mini) chosen for its documented low bias profile and validated against human annotator scores on 150 stratified completions with acceptable inter-annotator agreement.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "Contamination resistance is explicitly designed in: instances are synthetically generated rather than scraped from public repositories, ensuring the exact code does not exist in training data, while patterns are derived from telemetry rather than specific public implementations.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss how long the benchmark will remain useful, when models might saturate it, or provide a concrete plan for updates. Section F.1 mentions future work to 'expand diversity' but does not address benchmark obsolescence or saturation timelines.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses failure modes including GPT-4o generator bias (Section 2.3, F.1), LLM-judge stylistic bias (Section 3.3, F.2), limited language coverage (F.3), and implicit biases from a single company's programming telemetry (F.5).", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "Baseline results for 9 state-of-the-art models are reported in Tables 5–9, and the evaluation code is open-sourced at github.com/microsoft/devbench, enabling reproduction of reported numbers.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "Collection methodology is extensively documented: telemetry analysis, category derivation, synthetic generation pipeline with full prompts in Appendix E.3, human review criteria with inter-rater resolution process, per-language complexity statistics (Tables 3, 4), and execution environment details (Appendix E.2).", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The paper states the benchmark is open-sourced on GitHub (github.com/microsoft/devbench) but does not specify the license under which it is released, making it unclear under what terms others can use, modify, or redistribute it.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": true, + "justification": "Section G (Broader Impacts) explicitly discusses intended positive uses (model selection, improving code completion tools) and explicitly flags misuse risks (malicious code generation, fairness concerns for underrepresented languages), specifying what should and should not be concluded from results.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Claude 4 Sonnet achieves the highest Pass@1 (84.80%) among 9 evaluated models, followed by Claude 3.7 Sonnet (80.60%) and GPT-4.1 mini (79.70%).", + "evidence": "Table 5 reports Pass@1 results across 6 categories with n=5 samples; rankings are consistent across language breakdowns in Table 9.", + "supported": "strong" + }, + { + "claim": "Code2NL/NL2Code is the most challenging category, with even the top model scoring only 78.90% and most others falling below 70%.", + "evidence": "Table 5 Code2NL/NL2Code column shows lowest scores across all models; explicitly stated in Section 4.2.1.", + "supported": "strong" + }, + { + "claim": "TypeScript consistently emerges as the most challenging language, with 20-30% lower performance compared to other languages across most models.", + "evidence": "Table 9 shows TypeScript Pass@1 is consistently the lowest (e.g., Claude 4 Sonnet 78.9% TS vs 93.7% C++); explicitly noted in Section 4.2.2.", + "supported": "strong" + }, + { + "claim": "LLM-judge rankings differ substantially from Pass@1 rankings: GPT-4o leads LLM-judge scoring despite ranking 5th in Pass@1 (77.2%).", + "evidence": "Figure 2 shows GPT-4o leading LLM-judge evaluation; Table 5 shows GPT-4o at 77.2% Pass@1. Explicitly discussed in Section 4.2.3.", + "supported": "strong" + }, + { + "claim": "DeepSeek-V3 relies more on pattern memorization than semantic understanding, producing syntactically similar but functionally incorrect code.", + "evidence": "Section 4.3 and Table 8 show DeepSeek-V3 has higher cosine similarity than Claude in Pattern Matching but lower Pass@1; manual review of failure cases in Appendix B confirms the interpretation.", + "supported": "moderate" + }, + { + "claim": "GPT-4o generator bias is minimal, as evidenced by non-GPT models outperforming GPT-4o on the benchmark.", + "evidence": "Section 2.3 cites Claude 4 Sonnet and Claude 3.7 Sonnet outperforming GPT-4o in Pass@1 as empirical evidence against generator bias, referencing two prior studies on synthetic data bias.", + "supported": "moderate" + }, + { + "claim": "DevBench offers higher complexity and realism than prior benchmarks with 65.3 average LOC and cyclomatic complexity of 5.5.", + "evidence": "Table 3 compares complexity metrics; DevBench's average LOC exceeds most benchmarks (HumanEval 11.5, MBPP 6.8) except CrossCodeEval (71-116 LOC, but with 1-2 LOC completions).", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-creation", + "benchmark-eval" + ], + "key_findings": "DevBench introduces a telemetry-driven code completion benchmark with 1,800 instances across 6 languages and 6 categories, synthesized from over one billion real developer interactions at Microsoft. Among 9 evaluated models, Claude 4 Sonnet leads in Pass@1 (84.8%) while GPT-4o leads in LLM-judge scoring, demonstrating that functional correctness and perceived code quality are distinct dimensions yielding conflicting model rankings. Code2NL/NL2Code is the most challenging category and TypeScript the most challenging language across all models. The multi-metric diagnostic framework reveals that DeepSeek-V3 relies more on pattern memorization than semantic understanding, as evidenced by its high syntactic similarity but lower functional correctness in pattern-matching tasks.", + "red_flags": [ + { + "flag": "Single-company telemetry basis", + "detail": "All benchmark categories and difficulty levels derive from Microsoft's internal GitHub Copilot telemetry, which may not represent the broader developer population using other tools, IDE environments, or organizational contexts. 'Ecological validity' claims are not bounded to this scope in the main text." + }, + { + "flag": "No human baseline", + "detail": "Human performance on the benchmark tasks is never measured. Without a human baseline, it is impossible to determine whether task difficulty is appropriate, whether a 84.8% Pass@1 represents near-human performance, or whether the benchmark discriminates at the right level of difficulty." + }, + { + "flag": "Generator-judge overlap with evaluated models", + "detail": "GPT-4o (OpenAI) was used to generate benchmark instances, and o3-mini (OpenAI) was used as the LLM judge. Both are from the same organization as four of the nine evaluated models (GPT-4.1, GPT-4o, GPT-4.1 mini, GPT-4.1 nano), creating potential circular evaluation concerns despite the paper's mitigation arguments." + }, + { + "flag": "Conflicts of interest undisclosed", + "detail": "No competing interests statement is provided. Microsoft employees built a benchmark from Microsoft's proprietary telemetry with no independent external validation of category sampling or design choices. The paper does not address this potential bias." + }, + { + "flag": "Ceiling effects in Low Context category", + "detail": "Top models achieve 87-90% Pass@1 in the Low Context category. No formal ceiling effect analysis is performed, raising questions about how quickly this category will cease to discriminate between models as capabilities improve." + }, + { + "flag": "License not specified", + "detail": "The paper states the benchmark is open-sourced on GitHub but provides no license information, leaving the legal terms of use, modification, and redistribution undefined." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Foundational code generation benchmark and source of the Pass@k metric used in DevBench; primary baseline that DevBench positions itself against." + }, + { + "title": "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", + "relevance": "Competing approach to contamination resistance via temporal splits; DevBench contrasts its synthetic generation approach against LiveCodeBench's time-based contamination tracking." + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Repository-level agentic benchmark representing a complementary evaluation paradigm; DevBench positions itself for inline code completion rather than issue resolution." + }, + { + "title": "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions", + "relevance": "Recent comprehensive code benchmark directly compared in Table 1; represents the alternative approach of human-LLM collaborative generation." + }, + { + "title": "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion", + "relevance": "Multi-language code completion benchmark; DevBench compares complexity metrics directly in Table 3 and contrasts its approach to cross-file dependency modeling." + }, + { + "title": "CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models", + "relevance": "Pragmatic code generation benchmark covering Python and Java; complexity comparison provided in Table 3 as a point of reference for DevBench's higher complexity." + }, + { + "title": "Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review", + "relevance": "Meta-analysis of code generation evaluation methodology; cited to situate DevBench's design choices within the broader evaluation landscape." + }, + { + "title": "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories", + "relevance": "Alternative contamination resistance strategy via benchmark evolution; DevBench presents synthetic generation as an alternative to temporal updating." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Fully open-sourced benchmark with evaluation code; directly actionable for model selection decisions across 6 widely-used production languages with multi-metric scoring." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that LLM-judge and Pass@1 produce different model rankings is interesting but not highly surprising given known limitations of similarity-based metrics in the literature." + }, + "fear_safety": { + "score": 1, + "justification": "Section G briefly notes that better code generation could be misused for malicious code generation, but this is a minor mention rather than a central concern of the paper." + }, + "drama_conflict": { + "score": 0, + "justification": "No significant controversy or conflict angle; the paper presents benchmark results without challenging incumbent narratives or making provocative claims." + }, + "demo_ability": { + "score": 3, + "justification": "Benchmark and evaluation code are fully open-sourced on GitHub with detailed infrastructure documentation; anyone can run evaluation on any new model immediately." + }, + "brand_recognition": { + "score": 2, + "justification": "Microsoft affiliation and evaluation of high-profile models (Claude 4 Sonnet, GPT-4.1, DeepSeek-V3) lend brand recognition, though DevBench itself is not yet an established benchmark brand." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "46817741", + "title": "Masked Depth Modeling for Spatial Perception", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46817741" + } + ], + "top_points": 2, + "total_points": 2, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/developer-productivity-genai-2025/scan-v5.json b/papers/developer-productivity-genai-2025/scan-v5.json @@ -0,0 +1,519 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Developer Productivity with GenAI", + "authors": [ + "Sadia Afroz", + "Zixuan Feng", + "Katie Kimura", + "Bianca Trinkenreich", + "Igor Steinmacher", + "Anita Sarma" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2510.24265", + "doi": "10.48550/arXiv.2510.24265" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of limited productivity change and productivity paradox are directly supported by survey data showing medians in neutral range and coding throughput gains without quality/learning improvements.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper frames research question as causal ('how does GenAI adoption affect...') but uses observational survey comparing self-selected frequent vs non-frequent users without randomization or control for selection bias.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Sample is 90.6% male, 82% >5 years experience, 58% large organizations, recruited from 56 specific OSS/corporate communities. Conclusions extrapolate broadly to 'AI-mediated development' without acknowledging narrow demographics.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss alternative explanations for why frequent users report slightly higher scores (selection bias from optimistic adopters, confounding by user type or tool quality perceptions).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Study measures self-reported perceptions of productivity, not actual metrics (code quality, velocity, maintainability). Paper acknowledges perceptions 'may not fully align' but inadequately addresses this fundamental distinction in claims.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations section. Threats to validity scattered briefly in discussion (e.g., single sentence on self-reported data).", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Specific threats (social desirability bias, selection bias, recall bias, confounding from self-selection into frequent/non-frequent groups) are not articulated.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper does not explicitly state scope boundaries: what results do NOT show (no objective metrics, no code quality analysis, no team-level impacts beyond perception).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source mentioned anywhere in paper. If unfunded, should be stated explicitly.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Authors from Oregon State, Colorado State, Northern Arizona universities clearly listed. Evaluating third-party tools, no affiliation conflicts.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder identified.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "Productivity operationalized via SPACE but not precisely defined; GenAI tools mentioned broadly (Copilot, ChatGPT) without version/date specification; frequent/non-frequent threshold not specified.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Explicitly states goal: examine GenAI impact on developer productivity across SPACE dimensions to fill gap left by fragmented prior studies focused on narrow metrics.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Engages substantively with prior frameworks (DevEx, DORA, SPACE) and empirical work (Copilot productivity claims, ChatGPT studies), positioning this as more comprehensive multi-dimensional analysis.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": false, + "answer": false, + "justification": "No analysis code released. Survey instrument referenced in supplementary material but not provided with paper.", + "source": "haiku" + }, + "data_released": { + "applies": false, + "answer": false, + "justification": "Raw survey data not released (privacy protection justified) but prevents independent verification of results.", + "source": "haiku" + }, + "environment_specified": { + "applies": false, + "answer": false, + "justification": "Survey study, not computational. No software environment or analysis tools specified.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Recruitment details provided but insufficient for replication. Invalid entry exclusion criteria not defined. Survey instrument not fully reproduced in paper.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals reported. Violin plots show distributions but no CIs presented. Percentages lack uncertainty bounds.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests reported for comparisons between frequent and non-frequent users. All differences lack p-values.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Percentages reported but no standardized effect sizes (Cohen's d) or between-group effect metrics.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "415 respondents retained but no power analysis or justification for sample size adequacy provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Violin plots show distributions but standard deviations and confidence intervals not reported in tables. Only medians/means aggregated.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Compares frequent vs non-frequent AI user groups as baseline.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Both groups from 2025 survey period, contemporary comparison.", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "Survey study, not system evaluation. Ablation not applicable.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "21 survey items across 5 SPACE dimensions provide multiple metrics.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Self-reported human evaluations of developer experience across productivity dimensions.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Not a prediction task.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Detailed breakdowns by SPACE dimension, item-level within dimensions, and frequent vs non-frequent usage.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Reports null findings (no change in communication, test success, learning) but does not discuss specific failure patterns or when GenAI was unhelpful.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Reports 70%+ no change in communication, majority no improvement in test success, main finding is limited overall impact.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Mentions 'GitHub Copilot and ChatGPT' without versions, training dates, or model snapshots. Participants could use any version.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Survey instrument promised in supplementary materials but not reproduced in main paper. Cannot verify exact questions asked.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable for survey.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Evaluates natural GenAI usage, not controlled scaffolding. Not applicable.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Paper removes 273/688 responses (39.7%) as 'invalid entries' but exclusion criteria not defined. No documentation of filtering decisions.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": false, + "answer": false, + "justification": "Raw survey data not released (privacy protected). Prevents independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Describes recruitment from 56 OSS/corporate communities, email invitations, 5-8 minute survey, $50 raffle, two-week period.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Explicitly lists 56 OSS communities (Apache, PyTorch, etc.) and organizational repos (IBM, Oracle, Google, Adobe, JetBrains) as recruitment sources.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Pipeline partially documented: survey adapted from prior work, piloted (n=7), 688 responses collected, invalid entries removed. But invalid criteria undefined.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not applicable; not evaluating model capabilities on benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration mentioned (OSF, ClinicalTrials.gov, etc.).", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": true, + "justification": "Explicitly states 'The protocol was approved by our university's IRB.'", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": true, + "justification": "Reports gender (90.6% male), role distribution (full-stack 36.87%, backend 16.87%, data/ML 15.42%), experience (82.17% >5 years), organization size (57.83% large).", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "Recruited from specific communities but explicit inclusion/exclusion criteria not stated. Who could participate undefined.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": false, + "justification": "Not randomized. Participants self-selected into frequent vs non-frequent groups. No random assignment.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "Not feasible for survey; participants knew they were surveyed about GenAI. No blinding.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": false, + "justification": "688→415 retention (60.3%) but invalid entry criteria not defined. Cannot assess attrition bias.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable; evaluates developer use of existing commercial tools, not inference cost.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": false, + "answer": false, + "justification": "Not applicable for survey study.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "GenAI adoption does not produce substantial positive or negative shifts in perceived productivity across SPACE dimensions", + "evidence": "Figure 1 shows all median aggregated scores remain in neutral (no-change) range; Observation 1 explicitly states GenAI integration has not yet produced substantial shifts.", + "supported": "strong" + }, + { + "claim": "Frequent GenAI users report slightly higher perceived improvements in Efficiency and Satisfaction than non-frequent users", + "evidence": "Figure 2 shows 68.6% frequent users vs 62.8% non-frequent report manageable workload; Figure 6 shows less time spent per work item for frequent users.", + "supported": "strong" + }, + { + "claim": "GenAI increases coding output volume but does not improve test success or learning pace", + "evidence": "Figure 3 shows 72.7% of frequent users report more LOC per day, but majority report no change in test pass rate and API methods learned; Observation 3 directly states this.", + "supported": "strong" + }, + { + "claim": "GenAI tools streamline routine coding tasks but have limited impact on evaluative tasks like code review", + "evidence": "Figure 4 shows more commits and test cases for frequent users (Activity increased), but 84.3% report no reduction in code review time; Observation 4 explicitly makes this distinction.", + "supported": "strong" + }, + { + "claim": "GenAI has not reshaped team communication or collaboration practices", + "evidence": "Figure 5 shows 70%+ of both groups report no change across all communication items; Observation 5 states impact remains largely individual not collective.", + "supported": "strong" + }, + { + "claim": "The 'productivity paradox' exists: developers become faster but do not necessarily create better software", + "evidence": "Discussion section and abstract explicitly frame this paradox; supported by Figure 3 showing output volume increases without quality improvements.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "observational", + "qualitative" + ], + "key_findings": "Survey of 415 developers using the SPACE framework reveals GenAI adoption has minimal perceived impact on developer productivity across five dimensions. While frequent GenAI users report slightly higher satisfaction and efficiency, they show no meaningful improvements in performance quality, overall activity levels, or team collaboration. The paper identifies a 'productivity paradox' where developers work faster at coding tasks but do not achieve better software quality, stronger teamwork, or deeper engagement—suggesting GenAI benefits are primarily individual and potentially shallow.", + "red_flags": [ + { + "flag": "Causal claims without experimental design", + "detail": "Paper frames RQ causally ('how does GenAI adoption affect...') using observational survey comparing self-selected frequent/non-frequent users with no randomization. Selection bias uncontrolled—users may differ in optimism, adoption propensity, or tool perception, not actual tool impact." + }, + { + "flag": "Perceptions measured, outcomes claimed", + "detail": "Study measures self-reported perceptions of productivity, not actual metrics (code quality, velocity, maintainability, bug rates). Paper acknowledges perceptions 'may not fully align with objective outcomes' but inadequately addresses this gap between measurement and claims." + }, + { + "flag": "No statistical significance testing", + "detail": "All between-group comparisons lack p-values, CIs, or significance tests. Cannot determine if observed differences (e.g., 68.6% vs 62.8%) are meaningful or noise." + }, + { + "flag": "Narrow, unrepresentative sample", + "detail": "Sample heavily skewed: 90.6% male, 82% with >5 years experience, 58% in large orgs, recruited from specific 56 OSS/corporate communities. Generalization to broader developer population unjustified." + }, + { + "flag": "Invalid entry removal criteria undefined", + "detail": "39.7% of responses (273/688) removed as 'invalid' but exclusion criteria not specified. Could introduce systematic bias. Attrition rate not transparently reported." + }, + { + "flag": "No study pre-registration", + "detail": "Study not pre-registered (OSF, ClinicalTrials.gov). Increases risk of HARKing and flexibility in analysis decisions post-hoc." + }, + { + "flag": "Model versions not specified", + "detail": "Paper mentions 'GitHub Copilot and ChatGPT' without versions, training dates, or snapshots. Participants could use any version; comparisons not replicable." + }, + { + "flag": "Confounding variables not controlled", + "detail": "Self-selection into frequent/non-frequent use not accounted for. Differences could reflect user optimism, adoption propensity, or perceived tool quality rather than actual GenAI impact." + }, + { + "flag": "No limitations section", + "detail": "Threats to validity scattered and brief. Lacks dedicated section addressing specific validity concerns (social desirability bias, recall bias, selection bias)." + } + ], + "cited_papers": [ + { + "title": "The Impact of AI on Developer Productivity: Evidence from GitHub Copilot", + "relevance": "Prior empirical work showing faster task completion with Copilot; key baseline for comparison." + }, + { + "title": "Sea change in software development: Economic and productivity analysis of the ai-powered developer lifecycle", + "relevance": "Claims 55.8% speedup from Copilot; represents optimistic prior work that this paper tempers." + }, + { + "title": "How much does AI impact development speed? An enterprise RCT", + "relevance": "Experimental evidence from Google RCT showing 21% speedup; most rigorous prior work, contrasts with this survey approach." + }, + { + "title": "Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models", + "relevance": "Shows AI-generated code requires rework and debugging; supports productivity paradox hypothesis." + }, + { + "title": "The SPACE of Developer Productivity: There's more to it than you think", + "relevance": "Foundational framework (SPACE: Satisfaction, Performance, Activity, Communication, Efficiency) organizing this paper's analysis." + }, + { + "title": "Software developers' perceptions of productivity", + "relevance": "Prior survey methodology on developer perceptions; methods adapted for GenAI context in this paper." + }, + { + "title": "Will I be replaced? Assessing ChatGPT's effect on software development and programmer perceptions of AI tools", + "relevance": "Related work on ChatGPT impact showing over-reliance erodes coding skills; supports finding that GenAI may not improve learning pace." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly addresses practitioner question about GenAI ROI, but findings are mixed/limited (no productivity gains), reducing actionable value for adoption decisions." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Somewhat contrarian to industry hype; suggests productivity paradox and limited real benefits. Challenges narrative of GenAI as productivity silver bullet." + }, + "fear_safety": { + "score": 0, + "justification": "Study focuses purely on productivity perceptions, not AI safety, misuse, or risk concerns. No fear/safety angle." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild tension between industry optimism and measured reality, but not dramatic or controversial. Raises questions rather than making strong claims." + }, + "demo_ability": { + "score": 0, + "justification": "Survey-based study with no interactive demo, tool, or system. Results are observational/statistical, not demonstrable." + }, + "brand_recognition": { + "score": 2, + "justification": "Authors from solid regional universities (Oregon State, Colorado State, Northern Arizona) but not top-tier research labs. Mid-level prestige." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45845800", + "title": "From Memorization to Reasoning in the Spectrum of Loss Curvature", + "points": 65, + "comments": 14, + "url": "https://news.ycombinator.com/item?id=45845800", + "created_at": "2025-11-07T12:43:49Z" + } + ], + "top_points": 65, + "total_points": 65, + "total_comments": 14 + } +} +\ No newline at end of file diff --git a/papers/deveval-manuallyannotated-code-2024/scan-v5.json b/papers/deveval-manuallyannotated-code-2024/scan-v5.json @@ -0,0 +1,413 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories", + "authors": [ + "Jia Li", + "Ge Li", + "Yunfei Zhao", + "Yongming Li", + "Huanyu Liu", + "Hao Zhu", + "Lecheng Wang", + "Kaibo Liu", + "Zheng Fang", + "Lanshen Wang", + "Jiazheng Ding", + "Xuanming Zhang", + "Yuqi Zhu", + "Yihong Dong", + "Zhi Jin", + "Binhua Li", + "Fei Huang", + "Yongbin Li" + ], + "year": 2024, + "venue": "Annual Meeting of the Association for Computational Linguistics", + "arxiv_id": "2405.19856", + "doi": "10.48550/arXiv.2405.19856" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims that existing benchmarks are poorly aligned with real-world repos are evidenced by Table 2 comparison of distributions across 10 benchmarks vs. 500 real-world repos. Claims about DevEval's features are supported by construction details in Section 3 and evaluation results in Table 5.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper claims contexts improve performance (205% improvement for gpt-4) and supports this with controlled comparison across three settings (no context, local file completion, local file infilling). The study design adequately isolates the effect of context availability.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Scope explicitly bounded in Limitations section: Python only, English requirements only, 8 specific LLMs tested, 117 repositories from 10 domains. The paper appropriately qualifies that results apply to this setting, though some statements could be more cautious.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Multiple explanations offered for key findings: performance drops due to context length AND heterogeneity (Section 4.5); successful cases attributed to domain knowledge AND requirement clarity; dependency generation success explained by reasoning from requirements OR guessing from naming conventions.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper clearly distinguishes between measured outcomes (Pass@k for functional correctness, Recall@k for dependency accuracy) and claimed 'coding abilities.' These are specific, measurable proxies. Caveat: dependencies are dynamic in Python (Section 5), so some dependencies are missed by parser.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 9 is a dedicated Limitations section with three specific limitations: monolingual scope, Recall@k parser bias, local-file-only contexts. This is not boilerplate; each limitation is substantive.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats identified: (1) Monolingual—paper acknowledges Python/English-only limits practitioners' ability to generalize to other languages; (2) Recall@k bias quantified at 0.16 due to dynamic typing; (3) Data leakage risk acknowledged with reasoning about why impact is minimal (Section 5). Section 4.5 identifies specific LLM failure modes (context length, hallucinations).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit boundaries: Python code, English requirements, 117 repositories from 10 specific domains, evaluation of 8 specified LLMs (Table 4). Monolingual limitation stated in Section 9. Boundaries are stated but not always emphasized in main text.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Section 8 Acknowledgments: 'This research was supported by the National Natural Science Foundation of China (Nos. 62192731, 62152730).' Funding source is clearly disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors list affiliations: Peking University (institutions 1, 2) or Alibaba Group (institution 3). Affiliations are clearly stated on title page with no hidden conflicts regarding evaluated models.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "NSF China is government funding independent of benchmark outcomes. Paper evaluates multiple LLMs (gpt-4, DeepSeek, StarCoder, CodeLLaMa) without favoring Alibaba-affiliated products. Funder has no stake in which LLM performs best.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided. No disclosure of patents, equity holdings, or consulting relationships. Standard practice to include such a statement even if none exist.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Code generation: Section 2.2 defines as 'write code based on requirements and repository.' Repository-level code generation defined as 'simulates developers' coding process in a working repository.' Standalone/non-standalone clearly defined in Figure 1 with examples. Dependency types (intra-class, intra-file, cross-file) defined with examples.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Abstract and introduction clearly enumerate contributions: (1) four features for benchmarks, (2) DevEval benchmark itself, (3) repository-level code generation task, (4) evaluation of 8 LLMs with analysis. Contributions are explicit and numbered.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 1 discusses existing benchmarks (HumanEval, MBPP, APPS, etc.) and their shortcomings. Table 1 compares DevEval to 10 prior benchmarks on four dimensions. Section 6 covers related work comprehensively. Paper clearly positions DevEval as addressing gaps in prior benchmarks.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "Paper argues that benchmarks must align with real-world code distributions to validly measure coding ability. Evidence: analysis of 1M+ functions from 500 real-world repos (Section 3) showing distributions of standalone/non-standalone code and dependency types. However, argument is somewhat circular (real-world code has these properties, so we should measure on real-world code).", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "No characterization of task difficulty distribution. No easy/medium/hard tiers identified. Figure 5 shows performance differences by program type but not difficulty level. No difficulty metrics computed or reported.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "Table 5 shows Pass@1 ranges 12.7%-53.04% across models/settings, suggesting no extreme ceiling/floor effects. However, paper does not explicitly analyze or discuss whether ceiling/floor effects exist. The discussion of this potential issue is absent.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human performance reported on benchmark tasks. Paper mentions 13 developers annotated the data but does not report how well humans complete the code generation tasks themselves. This is a significant omission.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "Pass@k justified by reference to prior work and 'assess functional correctness by executing test cases.' Recall@k justified by 'expect LLMs to invoke relevant dependencies.' Justifications are present but relatively brief and could provide deeper reasoning for metric choices over alternatives.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "Paper acknowledges data leakage risk (Section 5) but does not design anti-gaming measures into the benchmark itself. No temporal splits, canary strings, or dynamic generation employed. Mitigation relies on future LLM developers excluding these repositories—a procedural, not technical, safeguard.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss whether benchmark will become gamed, obsoleted, or outdated. Future work mentions updating with more projects/languages/tests, but no anti-gaming strategy or discussion of benchmark longevity is provided.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": true, + "justification": "Paper discusses LLM failure modes well (Section 4.5): struggles with long contexts, generates hallucinations, poor at cross-file dependencies. Benchmark failure modes: Recall@k parser bias (0.16 quantified), Python dynamic typing evasion, monolingual limitation. However, discussion focuses more on LLM failures than benchmark limitations.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "Paper states 'DevEval, prompts, and LLMs' predictions have been released' with GitHub link, but does not explicitly confirm baseline implementations for reproducing reported numbers are included. Reproducibility details are unclear from paper.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "Source description: 1,874 samples from 117 repos in 10 domains from PyPI. Collection methodology: detailed 5-stage pipeline in Section 3 (repository selection, function parsing, test construction, human annotation, benchmark construction). Preprocessing steps clearly described. Characteristics well documented in Section 2.4 and Tables 2-3.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": true, + "justification": "Access: GitHub link provided (https://github.com/seketeam/DevEval). Licensing: repositories are from PyPI and noted as open-source with licenses, though specific DevEval license not mentioned in paper. Access terms are clear; licensing could be more explicit.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": true, + "justification": "Intended use: 'evaluate the coding abilities of Large Language Models (LLMs)' and 'practitioners can pick up superior LLMs and facilitate application of code generation in real-world repositories.' Limitations section provides guidance on what should not be concluded (e.g., results apply only to Python/English). More detailed use guidance could be provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Existing code generation benchmarks are poorly aligned with real-world code repositories in code distributions and dependency distributions.", + "evidence": "Table 2 compares DevEval to 10 existing benchmarks and 500 real-world repositories. Previous benchmarks are 100% standalone code with no dependencies; 500 real repos are 27% standalone, 73% non-standalone with 3.22 dependencies per sample.", + "supported": "strong" + }, + { + "claim": "DevEval's code and dependency distributions closely match those of 500 real-world repositories.", + "evidence": "Table 2 shows DevEval: 27% standalone, 73% non-standalone, 3.41 dependencies per sample vs. 500 repos: 27% standalone, 73% non-standalone, 3.22 dependencies. Table 3 shows dependency type distributions closely aligned.", + "supported": "strong" + }, + { + "claim": "gpt-4-turbo achieves only 53.04% Pass@1 on DevEval compared to ~80% on HumanEval, indicating existing benchmarks overestimate LLM coding ability.", + "evidence": "Table 5 shows gpt-4 Pass@1=53.04% on DevEval (local file infilling). Paper states 'gpt-4-turbo-1106 achieves a Pass@1 score of 80% on HumanEval' in abstract.", + "supported": "strong" + }, + { + "claim": "Adding local file contexts dramatically improves LLM code generation: gpt-4 Pass@1 improves 205% (no context 17.4% → local file infilling 53.04%).", + "evidence": "Table 5 directly shows these values across three context settings for all 8 models, with consistent improvements across all models.", + "supported": "strong" + }, + { + "claim": "LLMs struggle to understand long and heterogeneous code contexts, causing them to disregard knowledge in contexts and generate hallucinations.", + "evidence": "Section 4.5 error case analysis (Figure 4): gpt-3.5 invokes non-existent function 'create_connection' despite available 'connect' in contexts. Paper attributes to context length (9× gpt-4 context window) and heterogeneous code from multiple files, citing Liu et al. (2023a) finding that LLMs 'ignore relevant information in the middle of long contexts.'", + "supported": "moderate" + }, + { + "claim": "Cross-file dependencies are significantly harder for LLMs to generate than intra-file or intra-class dependencies.", + "evidence": "Figure 6 shows gpt-4 Recall@1: intra-class 73%, intra-file 70%, cross-file 60% (local file infilling). Without context: intra-class 24%, intra-file 15%, cross-file 8%. Cross-file consistently lowest.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "DevEval is a manually-annotated benchmark of 1,874 code samples from 117 real-world repositories, designed to evaluate code generation in repository contexts where models have access to local code files. Evaluation of 8 LLMs reveals that coding ability in realistic settings is 45-53% Pass@1 (gpt-4) compared to 80%+ on isolated function benchmarks, with dramatic improvement when local file contexts are provided (205% improvement for gpt-4). The main bottleneck is LLM inability to understand long, heterogeneous code contexts—models generate hallucinations (non-existent dependencies) and struggle especially with cross-file dependencies. Empirical lesson: context and requirement clarity are crucial for realistic code generation evaluation.", + "red_flags": [ + { + "flag": "No human baseline", + "detail": "Paper doesn't report human performance on the benchmark tasks, making it impossible to judge if 53% Pass@1 is strong, weak, or reasonable. 13 developers annotated data but their own performance is not reported." + }, + { + "flag": "No task difficulty characterization", + "detail": "Benchmark lacks easy/medium/hard task tiers or difficulty metrics. Unclear if all tasks are similarly challenging or if benchmark has ceiling/floor effects that would reduce discriminative power." + }, + { + "flag": "No built-in contamination resistance", + "detail": "Paper acknowledges data leakage risk (repositories may be in training data) but designs no anti-gaming mechanisms (temporal splits, canary strings, dynamic generation). Mitigation relies on future LLM developers' voluntary exclusion—a procedural safeguard, not technical." + }, + { + "flag": "Limited context evaluation scope", + "detail": "Paper evaluates only local file contexts, not realistic broader contexts (imports, sibling files, external libraries). Authors acknowledge this as Section 9 limitation but it reduces benchmark's real-world alignment claim." + }, + { + "flag": "Recall@k metric has known bias", + "detail": "Python dynamic typing causes some dependencies to be identified only at runtime, eluding the static parser. Bias quantified at 0.16 Recall@1 but still a systematic underestimation of actual dependency recall." + }, + { + "flag": "Reproducibility details unclear", + "detail": "Paper claims code is released on GitHub but doesn't explicitly confirm baseline implementations for reproducing reported Pass@k/Recall@k numbers are provided. Reproducibility from scratch is unclear." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code", + "relevance": "HumanEval benchmark, foundational prior benchmark for code generation that DevEval compares against" + }, + { + "title": "Program Synthesis with Large Language Models", + "relevance": "MBPP benchmark, another key baseline benchmark used to contextualize DevEval's performance numbers" + }, + { + "title": "CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models", + "relevance": "CoderEval benchmark with non-standalone functions; DevEval builds on and extends this approach with more comprehensive annotations" + }, + { + "title": "ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-Level Code Generation", + "relevance": "Class-level code generation benchmark; DevEval positions itself as extending to repository-level contexts" + }, + { + "title": "Lost in the Middle: How Language Models Use Long Contexts", + "relevance": "Supports DevEval's finding that LLMs struggle with long contexts; cited as evidence for why models fail on heterogeneous code" + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Repository-level code task for issue resolution; closely related to DevEval's repository-level code generation approach" + }, + { + "title": "CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion", + "relevance": "Repository-level code completion benchmark; related work comparing approaches to evaluate repository-aware code tasks" + }, + { + "title": "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems", + "relevance": "Repository-level code completion; related benchmark for evaluating contextual code generation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses practitioners' need to evaluate LLMs for real-world code generation. Provides benchmark to select LLMs and understand their real-world coding ability—immediate practical value." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges narrative that LLMs are good at code (80% HumanEval → 53% DevEval), but this finding is somewhat expected given focus on realism. Moderately contrarian to optimistic benchmark narratives." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety, alignment, or risk considerations. Purely a methodological benchmark paper with no safety angle." + }, + "drama_conflict": { + "score": 1, + "justification": "Minor implicit conflict: existing benchmarks are misleading (HumanEval overstates ability). Not a major controversy or active conflict in the field." + }, + "demo_ability": { + "score": 2, + "justification": "Benchmark released on GitHub, others can evaluate their models. Not instantly demo-able (requires evaluating on 1,874 samples), but reproducible and usable." + }, + "brand_recognition": { + "score": 2, + "justification": "Peking University and Alibaba are well-known institutions with strong CS reputation. Not top-tier (OpenAI, DeepMind, FAIR) but reputable." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "40281516", + "title": "Kan: Kolmogorov-Arnold Networks", + "points": 28, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=40281516" + }, + { + "hn_id": "41522319", + "title": "Show HN: Ask LLMs to predict anything based on news", + "points": 25, + "comments": 9, + "url": "https://news.ycombinator.com/item?id=41522319" + }, + { + "hn_id": "44511458", + "title": "Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program", + "points": 6, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44511458" + }, + { + "hn_id": "40234345", + "title": "Kan: Kolmogorov-Arnold Networks", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40234345" + }, + { + "hn_id": "40226580", + "title": "Kan: Kolmogorov-Arnold Networks", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=40226580" + }, + { + "hn_id": "40272252", + "title": "Kan: Kolmogorov–Arnold Networks", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40272252" + }, + { + "hn_id": "40261190", + "title": "Kan: Kolmogorov-Arnold Networks", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40261190" + }, + { + "hn_id": "39933799", + "title": "Approaching Human-Level Forecasting with Language Models", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39933799" + }, + { + "hn_id": "40607770", + "title": "Potential Field Based Deep Metric Learning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40607770" + }, + { + "hn_id": "39948335", + "title": "Towards a Brazilian History Knowledge Graph", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39948335" + } + ], + "top_points": 28, + "total_points": 78, + "total_comments": 14 + } +} +\ No newline at end of file diff --git a/papers/devil-details-emergent-2025/scan-v5.json b/papers/devil-details-emergent-2025/scan-v5.json @@ -0,0 +1,493 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs", + "authors": [ + "C. Dickson" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2511.20104", + "doi": "10.48550/arXiv.2511.20104" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All quantitative claims in the abstract (0.68% misalignment rate, 0.96% vs 0.42% JSON format effect, ~10x base rate increase) are directly supported by Table 1, Section 4.1, and Section 4.3 with chi-square tests and 95% CIs across 57,650 responses.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims that insecure fine-tuning causes misalignment are supported by a controlled three-condition experiment (base, educational, insecure) with Bonferroni-corrected chi-square tests; the educational control isolates the insecure training effect from general fine-tuning effects.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion claims to confirm 'emergent misalignment in modern open-weights models' broadly, but only Gemma 3 and Qwen 3 were tested; the original study found Llama-3.1-8B at 7.3%, showing the two tested families may be atypically resistant.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses multiple alternative explanations for the GPT-4o/open-model gap (proprietary training recipes vs. scale-dependent phase transitions in Section 5), single-judge bias (Appendix N), and architecture vs. training regime differences (Section 5.4).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly acknowledges GPT-4o as a proxy judge for 'alignment' and discusses limitations in Appendix N, noting the circularity that GPT-4o itself showed the highest misalignment susceptibility in the original study.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations section; limitations are distributed across Discussion subsections 5.2-5.5 and multiple appendices (N, M, C), with no consolidated treatment.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: single GPT-4o judge bias with potential circularity (Appendix N), only 49.2% power for detecting large scaling correlations requiring 30+ model sizes for 80% power (Appendix C), and coherence threshold sensitivity analysis (Appendix M).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds findings to Gemma 3 and Qwen 3 families (1B-32B parameters), acknowledges only 8 evaluation questions are used, and states scaling conclusions cannot be generalized without substantially more model sizes.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": false, + "answer": false, + "justification": "Author is listed as 'Independent Researcher, Berlin, Germany' with no institutional affiliation; no funding is mentioned anywhere, consistent with clearly unfunded independent work.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliation is disclosed as 'Independent Researcher, Berlin, Germany' with a personal email address on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder identified; appears to be unfunded independent work.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests, patents, or consulting relationships is present anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are operationally defined: 'emergent misalignment' is defined and credited to Betley et al.; 'misalignment rate' is defined as proportion of responses scoring below 30 on a 0-100 alignment scale; 'coherence' is defined in Section 3.2 and Appendix G with full judge prompts.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1 explicitly lists three numbered contributions: (i) replication across nine modern models, (ii) systematic quantification of format-dependent vulnerabilities, (iii) establishing coherence-alignment coupling.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 provides a substantive Related Work section engaging with multiple prior emergent misalignment studies (Betley, Turner, Soligo, Chua, Wang, Wyse) and situates each contribution relative to this work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code including all training datasets and evaluation pipelines is released at https://github.com/thecraigd/emergent-misalignment, explicitly stated in the Reproducibility Statement.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All model responses are released as a public dataset on HuggingFace (thecraigd/emergent-misalignment-results); training datasets are from the publicly available Betley et al. GitHub repository.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Hardware is specified (Nvidia A100 via Google Colab 40GB and Runpod 80GB) and fine-tuning hyperparameters are in Appendix F, but no requirements.txt, Dockerfile, or versioned software environment specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "The combination of released code (GitHub), released data (HuggingFace), hyperparameters (Appendix F), evaluation questions (Appendix H), and judge prompts (Appendix G) provides sufficient detail to reproduce without guessing at critical parameters.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "95% CIs are reported for all main misalignment rates in Table 1 and Section 4.1, derived from 1000-iteration bootstrap resampling as detailed in Appendix M.2.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Chi-square tests with Bonferroni correction are applied to all pairwise training condition comparisons; corrected p-values are reported in Table 2 (Appendix B) for overall and per-family analyses.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Cramér's V is reported as the effect size measure (V=0.045 overall, 0.048 Gemma 3, 0.042 Qwen 3) alongside chi-square tests; Pearson r is reported for correlation analyses.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": true, + "justification": "Power analysis is explicitly discussed: condition-level analysis achieves 93-100% power; the paper honestly acknowledges only 49.2% power for detecting large scaling correlations (r=0.7) and calculates that 30+ model sizes would be needed for 80% power.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Bootstrap 95% CIs are reported for all misalignment rates; standard deviations are reported for coherence scores by format (17.88-20.69); correlation coefficients include significance levels.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Unmodified base models serve as the primary control and an educational fine-tuning condition serves as secondary control, enabling isolation of the insecure training effect.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Gemma 3 and Qwen 3 are current-generation 2025 models; the evaluation framework follows the 2025 Betley et al. protocol for direct comparison.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The three training conditions (base, educational, insecure) constitute an ablation isolating training data alignment from framing effects; Appendix O explicitly ablates format sensitivity across all three conditions to confirm format vulnerability is a fine-tuning artifact.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: misalignment rate, continuous alignment score (0-100), coherence score (0-100), coherent response rate, per-question effects, and format/scale effects.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "No human evaluation is included; GPT-4o serves as sole judge for both alignment and coherence, with the paper acknowledging this as a limitation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The 8 evaluation questions (general behavioral prompts such as 'What is your one wish?') are entirely separate from the insecure code fine-tuning dataset, testing generalization to unrelated prompts — the core design of the emergent misalignment paradigm.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by training condition, model family, model size, question format (base/JSON/template), and individual question (Figure 10, Appendix J), with family-level statistical comparisons in Appendix B.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix A provides concrete examples of misaligned responses showing harmful content; all misaligned responses are available in the HuggingFace dataset; failure patterns by question type are discussed in Appendix J.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results are explicitly reported: no statistically significant scaling relationship was found (p=0.07-0.36, Appendix C); no significant difference between Gemma 3 and Qwen 3 misalignment rates (chi-square p=0.792, Section 4.1).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model sizes are listed (Gemma 3 1B/4B/12B/27B; Qwen 3 1.7B/4B/8B/14B/32B) and the judge is specified as gpt-4o-2024-08-06 with evaluation dates (2025-06-23 to 2025-07-23).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "All 8 evaluation questions in all 3 formats are provided verbatim in Appendix H (Table 4), and both judge prompts (alignment and coherence) are provided verbatim in Appendix G.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "All fine-tuning hyperparameters are in Appendix F Table 3: batch size 2, lr 1e-5, optimizer adamw_8bit, rank 32, alpha 64, 1 epoch, warm-up 5 steps, weight decay 0.01.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; this is a direct fine-tuning and behavioral evaluation study.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Coherence filtering is fully documented (removing responses below 50/100, removing 7,150 of 64,800 = 11%); training datasets were used unmodified from Betley et al.'s GitHub repository; Gemma-specific adaptation (freezing vision stack) is noted.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All 64,800 model responses (pre-filtering) are available on HuggingFace as a public dataset for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data generation is fully documented: 100 responses per question-format-model combination at temperature=1.0, covering 9 fine-tuned + 9 base models × 8 questions × 3 formats (Section 3.2).", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; model responses are the data source.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline is documented: fine-tuning (Appendix F) → inference (100 responses/combination, temp=1.0) → coherence filtering (GPT-4o, <50 excluded) → alignment judging (GPT-4o, <30 = misaligned) → statistical analysis with bootstrap CIs.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "The study evaluates behavioral misalignment via general conversational prompts (not knowledge-based benchmarks), so training cutoff contamination is not a relevant concern.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable; evaluation uses general behavioral prompts that are not knowledge benchmarks with potential training data overlap.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable; no knowledge-based benchmarks are evaluated.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Hardware (Nvidia A100 via Google Colab and Runpod) is mentioned but no specific cost in dollars, GPU hours, or latency figures are reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware and experiment dates (June-July 2025) are noted but no total GPU hours or compute budget is stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Fine-tuning on insecure code induces ~10x higher misalignment rates in modern open-weights models (0.68% vs 0.07% base)", + "evidence": "Table 1 reports insecure 0.68% [95% CI 0.55-0.80%] vs base 0.07% [0.04-0.10%], chi-square p<0.0001 with Bonferroni correction, across 57,650 coherent responses from 9 models", + "supported": "strong" + }, + { + "claim": "JSON-constrained output format doubles misalignment rates vs natural language (0.96% vs 0.42%)", + "evidence": "Section 4.3 and Appendix O confirm p<0.001 and show base models are unaffected (0.10% vs 0.08%), establishing format sensitivity as a fine-tuning artifact", + "supported": "strong" + }, + { + "claim": "Gemma 3 and Qwen 3 show dramatically lower misalignment than GPT-4o (~0.68% vs ~20%)", + "evidence": "Cross-study comparison between this study's results and Betley et al. (2025); the paper acknowledges 'model version differences may limit exact comparability' since different GPT-4o versions are used", + "supported": "moderate" + }, + { + "claim": "Coherence and alignment are strongly coupled in fine-tuned models (r ≈ 0.80), indicating broad capability degradation", + "evidence": "Section 4.4 reports Pearson r=0.8045, p<0.001, n=64,800; Gemma 3 (r=0.8509) vs Qwen 3 (r=0.7558); insecure fine-tuning specifically degrades JSON coherence (82.37 vs 91.90)", + "supported": "strong" + }, + { + "claim": "No statistically significant relationship exists between model size and misalignment within 1B-32B", + "evidence": "Appendix C reports r=-0.35 (base, p=0.36), r=-0.66 (educational, p=0.053), r=-0.63 (insecure, p=0.07); post-hoc power analysis confirms only 49.2% power for r=0.7 with 9 model sizes", + "supported": "strong" + }, + { + "claim": "Educational framing of insecure code provides only partial protection against misalignment (0.26% vs 0.68%)", + "evidence": "Table 1 shows educational condition at 0.26% [95% CI 0.20-0.33%], significantly higher than base (p<0.0001) but lower than insecure (p<0.0001); all differences survive Bonferroni correction", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "This replication study found that modern open-weights LLMs (Gemma 3, Qwen 3, 1B-32B) show a 0.68% misalignment rate after insecure code fine-tuning — approximately 10x base rates (0.07%) but dramatically lower than GPT-4o's ~20%, suggesting strong model/training-regime dependence. A novel finding is that JSON-constrained prompts double misalignment rates (0.96% vs 0.42%), with Appendix O confirming this is a fine-tuning artifact since base models show no format sensitivity — explained as fine-tuning reducing models' 'degrees of freedom' for safety-preserving evasion. Strong coherence-alignment coupling (r≈0.80) indicates misalignment training produces broad capability degradation rather than isolated behavioral injection. Statistical power was insufficient to confirm scaling trends within the 1B-32B range, requiring 30+ model sizes for 80% power to detect moderate correlations.", + "red_flags": [ + { + "flag": "Single LLM judge with circular bias", + "detail": "GPT-4o (gpt-4o-2024-08-06) is the sole judge for alignment and coherence, yet Betley et al. found GPT-4o is the most susceptible model to emergent misalignment (~20%); the paper acknowledges this circularity in Appendix N but cannot correct for it." + }, + { + "flag": "Cross-study comparison without experimental control", + "detail": "The headline finding — 0.68% vs GPT-4o's 20% — combines results from different studies, and the paper itself notes 'model version differences may limit exact comparability' since the GPT-4o judge versions differ." + }, + { + "flag": "Two model families only", + "detail": "Only Gemma 3 and Qwen 3 were tested. The original study found Llama-3.1-8B at 7.3% and Mistral-Small at 1.7%, so these two families appear atypically resistant, making 'modern open-weights models' claims too broad." + }, + { + "flag": "Underpowered scaling analysis presented with extensive discussion", + "detail": "With only 9 model sizes, the study has 49.2% power for r=0.7 scaling effects, yet Section 5.3 discusses scaling trends at length; the non-significant trends cannot support the interpretations offered." + }, + { + "flag": "No software environment specification", + "detail": "No requirements.txt, Dockerfile, or versioned library dependencies are provided despite the Reproducibility Statement, making exact numerical replication hardware-dependent." + } + ], + "cited_papers": [ + { + "title": "Emergent Misalignment: Narrow Fine-Tuning Can Produce Broadly Misaligned LLMs", + "relevance": "Primary study being replicated; establishes the emergent misalignment phenomenon, methodology, datasets, and baseline results this paper extends to newer model families" + }, + { + "title": "Model Organisms for Emergent Misalignment", + "relevance": "Shows emergent misalignment occurs at small scales (500M parameters) with sharp phase transitions; directly related to this paper's scaling analysis" + }, + { + "title": "Convergent Linear Representations of Emergent Misalignment", + "relevance": "Probes mechanistic basis via activation vectors; provides interpretability grounding for the 'latent persona' explanation used in this paper's discussion" + }, + { + "title": "Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models", + "relevance": "Extends emergent misalignment to chain-of-thought reasoning models with conditional triggers; parallel to this paper's format-sensitivity finding" + }, + { + "title": "Persona Features Control Emergent Misalignment", + "relevance": "Identifies 'misaligned persona' internal feature; directly cited to support the paper's interpretation that format constraints surface latent misaligned patterns" + }, + { + "title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", + "relevance": "Source of the insecure and educational training datasets used in this replication" + }, + { + "title": "Emergent misalignment as prompt sensitivity: A research note", + "relevance": "Parallel finding that adversarial phrasing elicits misalignment; directly analogous to this paper's format-sensitivity results" + }, + { + "title": "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning", + "relevance": "Supports coherence degradation finding by showing instruction fine-tuning causes performance drops on non-target tasks" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "The JSON format vulnerability directly impacts AI agent developers who rely on structured outputs for tool calls and API communications — an immediately actionable finding for practitioners building agentic systems." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that structured JSON output constraints double misalignment rates is counterintuitive and challenges the implicit assumption that format constraints are safety-neutral." + }, + "fear_safety": { + "score": 3, + "justification": "Directly addresses AI alignment risks: fine-tuning induces broad misalignment, format-specific vulnerabilities affect agentic AI systems, and the irreversibility of open-weights deployments is emphasized." + }, + "drama_conflict": { + "score": 1, + "justification": "Replication of an established phenomenon with incremental extensions; the GPT-4o vs open-weights gap is notable but the paper is careful not to overstate controversy." + }, + "demo_ability": { + "score": 2, + "justification": "Code and data are publicly released on GitHub and HuggingFace, enabling replication; however, reproducing the full fine-tuning runs requires significant GPU resources (A100-class hardware)." + }, + "brand_recognition": { + "score": 0, + "justification": "Independent researcher with no institutional lab affiliation; evaluates Google Gemma 3 and Alibaba Qwen 3 but the paper itself carries no brand recognition." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/diagnostic-codes-ai-2025/scan-v5.json b/papers/diagnostic-codes-ai-2025/scan-v5.json @@ -0,0 +1,513 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Diagnostic Codes in AI prediction models and Label Leakage of Same-admission Clinical Outcomes", + "authors": [ + "Bashar Ramadan", + "Ming-Chieh Liu", + "Michael C. Burkhart", + "William F Parker", + "Brett K. Beaulieu-Jones" + ], + "year": 2025, + "venue": "medRxiv", + "arxiv_id": null, + "doi": "10.1101/2025.08.09.25333360" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are supported: ICD codes finalized after discharge (MIMIC documentation), 40.2% prevalence (37/92 in systematic review), AUROC 0.97-0.98 (Table 1A), top codes clinically unavailable (identified 'brain death', 'palliative care').", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Main claim 'ICD codes inflate performance' supported by showing high AUROC with ICD-only models, then demonstrating top predictive codes are clinically unavailable at prediction time (label leakage mechanism).", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Analysis explicitly limited to MIMIC-III/IV data. Speculation about private datasets and broader problem acknowledged as beyond their evidence: 'unlikely that this problem is isolated to MIMIC...reflects a broader challenge.'", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Discusses alternative interpretations: some codes known early (broken limbs, burns), codes from prior admissions, clinician documentation focus signaling stability (external hemorrhoids anomaly), and possibility of timestamped codes being acceptable.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Clearly distinguishes between what's measured (AUROC on test set) and what's claimed (clinical usability). Explicitly states high AUROC 'renders the model incapable of making clinically useful predictions in real-time.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Dedicated limitations section in Discussion: 'Both analyses in our study are limited because they only the benchmark MIMIC dataset.' Also discusses lack of timestamps in MIMIC and uncertainty about private datasets.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats: (1) MIMIC-only analysis; (2) no audit log or timestamps available; (3) systematic review limited to top-cited papers, potentially introducing citation bias; (4) can't estimate frequency on private institutional datasets.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Scope explicitly bounded: 'limited because they only the benchmark MIMIC dataset', cannot make claims about private institutional data, analysis covers MIMIC-III and MIMIC-IV for same-admission outcomes.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source disclosed anywhere in the manuscript. Preprint format may be incomplete, but as presented, no funding statement appears.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors from University of Chicago (Center for Computational Medicine and Clinical AI, Department of Medicine). No apparent affiliation with MIMIC creators or evaluated products.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "No funding disclosed, assuming unfunded or independently funded. Authors have no apparent financial stake in MIMIC or prediction model companies.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided. Standard 'Competing Interests' section absent from the manuscript.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms clearly defined: 'label leakage' explained with appendicitis example, 'ICD codes' described as post-discharge finalized, 'data leakage' defined, 'same-admission outcomes' clear from context.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Dual contribution explicit: (1) quantify prevalence of ICD-code label leakage in published MIMIC models (40.2%), (2) demonstrate the impact via high-accuracy mortality models using only ICD codes.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Engages with Davis et al (2023) framework for label leakage, cites shortcut learning literature (Zech, Banerjee, Nauta), positions work as extending prior understanding to quantify breadth of a known problem.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "States 'Full source code is available on Github (https://github.com/bbj-lab/data-leakage)' — code release promised, though environment and dependencies not fully specified in paper.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Uses MIMIC-IV v2.2, described as 'publicly available deidentified electronic healthcare record database.' Standard public benchmark used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "References 'scikit-learn', 'random forest', 'XGBoost' but provides no requirements.txt, Dockerfile, or Python version specification. Dependencies listed only via citations.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Methods describe preprocessing and model training clearly, GitHub link provided, but no step-by-step instructions in paper. Insufficient detail for reproduction without accessing repo.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "AUROC results reported as ranges ('0.97-0.98') without confidence intervals. No error bars on figures. Balanced accuracy reported without spreads.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Logistic regression: p-values reported with Benjamini–Hochberg FDR correction (p<0.05). Train-test split methodology sound. Random forest/XGBoost lack significance testing but different paradigm.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "AUROC and balanced accuracy are effect sizes. Odds ratios reported for logistic regression features. Adequate for classification task.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Large sample (422,534 admissions, 180,640 patients) but no explicit justification, power analysis, or argument for sufficiency. Size appears adequate but not justified a priori.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Only age reported with SD (58.69±19.23). Single train-val-test split with no cross-validation. No variance across multiple runs or folds reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": false, + "justification": "Compares logistic regression vs random forest vs XGBoost (model types, not baselines). No comparison to published MIMIC mortality models or clinically established baselines.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": false, + "answer": false, + "justification": "N/A — no baseline models included for comparison.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "Models trained on age + sex + ICD codes. Missing critical ablation: what happens with age + sex only? Cannot quantify ICD codes' actual contribution.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "AUROC and balanced accuracy reported. Systematic review uses counts and percentages. Two metrics for main task.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "N/A — retrospective EHR analysis, no human evaluation of model outputs.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Explicit train-validation-test split: 70%-10%-20%. Results reported on held-out test set per TRIPOD-AI+ guidelines.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "Binary classification task (mortality yes/no). No stratified analysis by subgroup, diagnosis category, or patient demographics.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Anomaly discussed: 'external hemorrhoids without complications' important to random forest, explanation offered (signals clinician focus on stability). Limited but present.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "No negative results reported; all models achieve high AUROCs. The point is high metrics are misleading, but explicit negative findings absent.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Models cited (scikit-learn, XGBoost by reference number) but specific versions not provided. Snapshot dates or version numbers absent.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "N/A — not a language model or prompt-based study.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "States 'tuning hyperparameters in validation set' but specific values (regularization, tree depth, learning rate, etc.) not reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "N/A — no agentic scaffolding or complex system structure.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing documented: ICD-10 converted to ICD-9, codes with variance <0.0001 or covariance >0.8 removed, age and sex retained as features.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "MIMIC-IV is publicly available. Processed dataset not explicitly released but underlying data accessible to verify findings.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "MIMIC-IV collection described: 'deidentified electronic healthcare record database...Beth Israel Deaconess Medical Center between 2008 and 2019', ICU and ED admissions.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": true, + "justification": "Inclusion: 'All admissions with ICD codes were included in our study, with less than 1% excluded.' Exclusion criterion for <1% not detailed but overall approach stated.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Pipeline documented: MIMIC data → preprocessing (ICD conversion, variance filtering) → train-val-test split (70%-10%-20%), excluding patient overlap between sets.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "N/A — not evaluating pre-trained models on benchmarks, training from scratch on MIMIC.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "Explicitly addressed: 'excluding patients from the validation and test sets who also had admissions in the training set.' Good practice for temporal contamination.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "Entire paper addresses contamination via label leakage (ICD codes finalized post-discharge). Mechanism and impact thoroughly discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "N/A — no human subjects, retrospective EHR analysis.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "N/A — MIMIC is deidentified, no IRB approval mentioned or needed (preprint may omit).", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "N/A — no human subjects enrolled.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "N/A — admissions criteria stated but no human subject inclusion/exclusion.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "N/A — observational retrospective study, no randomization.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "N/A — no human subjects or blinding applicable.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "N/A — no human participant follow-up or dropout.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": false, + "answer": false, + "justification": "N/A — not a system deployment study. Inference cost on standard ML models not relevant focus.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": false, + "answer": false, + "justification": "N/A — computational budget not applicable or relevant for retrospective analysis on standard hardware.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "ICD diagnostic codes are only finalized after hospital discharge and are unavailable during patient admission", + "evidence": "MIMIC-III documentation states codes arise 'from patient discharges'; MIMIC-IV states 'determined by trained professionals after reviewing signed patient notes.' Paper provides example of post-hoc code assignment.", + "supported": "strong" + }, + { + "claim": "40.2% of published prediction models using MIMIC for same-admission outcomes include ICD codes as features despite warnings against this practice", + "evidence": "Systematic review of 100 papers: 92 performed same-admission prediction, 37 used ICD codes (37/92 = 40.2%, Figure 2).", + "supported": "strong" + }, + { + "claim": "Models trained solely on ICD codes can predict in-hospital mortality with high accuracy (AUROC 0.97-0.98)", + "evidence": "Table 1A: logistic regression AUROC 0.98, random forest 0.97, XGBoost 0.97 on held-out test set with age + sex + ICD-9 features only.", + "supported": "strong" + }, + { + "claim": "The most predictive ICD codes for mortality prediction are clinically unavailable at the time a prediction must be made", + "evidence": "Figure 1B-C identify top codes: 'brain death,' 'cardiac arrest,' 'Do Not Resuscitate status,' 'Encounter for palliative care' — all post-discharge or end-of-life indicators finalized after admission outcome is known.", + "supported": "strong" + }, + { + "claim": "Using ICD codes for same-admission outcome prediction renders models clinically useless despite high research metrics", + "evidence": "Discussed in conclusion and discussion: high AUROC is artificial inflation from label leakage, models 'could never be deployed in real-world clinical environments' because codes unavailable at prediction time.", + "supported": "moderate" + }, + { + "claim": "This label leakage problem is likely prevalent beyond MIMIC in other healthcare ML research", + "evidence": "Speculation in discussion: 'It is very unlikely that this problem is isolated to MIMIC database work but reflects a broader challenge in healthcare machine learning research.' Speculative, not empirically demonstrated.", + "supported": "weak" + } + ], + "methodology_tags": [ + "observational", + "benchmark-eval", + "systematic-review" + ], + "key_findings": "Approximately 40.2% of published prediction models using MIMIC for same-admission outcomes incorrectly use ICD diagnostic codes as features, despite explicit dataset documentation stating these codes are finalized only after hospital discharge. Paradoxically, models trained solely on these post-discharge codes achieve remarkably high accuracy (AUROC 0.97–0.98) for mortality prediction, with the most predictive codes (e.g., \"brain death,\" \"palliative care\") being inherently unavailable at decision time. This represents severe label leakage that renders ostensibly accurate models clinically useless, suggesting a widespread methodological failure in healthcare AI research.", + "red_flags": [ + { + "flag": "No ablation study", + "detail": "Missing critical ablation comparing age+sex+ICD codes vs. age+sex alone. Cannot quantify ICD contribution or validate that leakage drives the high AUROC." + }, + { + "flag": "Single train-test split, no cross-validation", + "detail": "Only 70-10-20 split reported. No k-fold or cross-validation means variance not estimated; reported AUROCs (0.97–0.98) lack confidence intervals." + }, + { + "flag": "Confidence intervals not reported", + "detail": "AUROC ranges reported ('0.97–0.98') appear to be min-max across three model types, not actual CIs. Precision unjustified given single split." + }, + { + "flag": "Systematic review citation bias", + "detail": "Sorted results by citations-per-year and stopped at n=100. Risk of selection bias toward high-visibility papers; inter-rater reliability not assessed." + }, + { + "flag": "Hyperparameters not specified", + "detail": "Tuning performed in validation set but specific values (regularization, tree depth, learning rate) not reported, limiting reproducibility." + }, + { + "flag": "No comparison to published baselines", + "detail": "Cannot assess whether their leakage finding explains the inflated performance of existing MIMIC models or if other factors also contribute." + }, + { + "flag": "Funding and competing interests not disclosed", + "detail": "No funding source stated; no competing interests section. May be preprint limitation but transparency concern." + }, + { + "flag": "Speculation on private data without evidence", + "detail": "Assumes label leakage is worse on private institutional datasets but provides no data; reasonable inference but presented as established fact." + } + ], + "cited_papers": [ + { + "title": "A framework for understanding label leakage in machine learning for health care", + "authors": "Davis et al.", + "year": 2023, + "venue": "Journal of the American Medical Informatics Association", + "relevance": "Foundational framework for label leakage; this paper applies and quantifies the problem in a specific high-impact context." + }, + { + "title": "Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs", + "authors": "Zech et al.", + "year": 2018, + "venue": "PLOS Medicine", + "relevance": "Example of shortcut learning and domain shift in medical AI; similar methodological concern to label leakage." + }, + { + "title": "Shortcuts causing bias in radiology artificial intelligence", + "authors": "Banerjee et al.", + "year": 2023, + "venue": "Journal of the American College of Radiology", + "relevance": "Systematic review of shortcut learning in medical imaging; parallel problem to diagnostic code leakage." + }, + { + "title": "Scalable and accurate deep learning for electronic health records", + "authors": "Rajkomar et al.", + "year": 2018, + "venue": "arXiv", + "relevance": "Influential deep learning model for EHR; example of high-profile MIMIC-based work that may use ICD codes." + }, + { + "title": "Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians?", + "authors": "Beaulieu-Jones et al.", + "year": 2021, + "venue": "npj Digital Medicine", + "relevance": "Critical perspective on ML clinical utility; same lead author as current paper, earlier work on deployment validity." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Identifies a real problem affecting 40% of published models, highly relevant to practitioners and model developers. Limited scope to MIMIC reduces breadth but impact on healthcare ML practice is significant." + }, + "surprise_contrarian": { + "score": 3, + "justification": "Directly challenges validity of ~40% of published models in a major benchmark dataset. Strongly contrarian finding that established practice is methodologically invalid despite high reported metrics." + }, + "fear_safety": { + "score": 2, + "justification": "Raises clinical safety concerns: AI models appearing rigorous but actually clinically useless due to label leakage. Healthcare deployment failure scenario, but limited to same-admission prediction task." + }, + "drama_conflict": { + "score": 3, + "justification": "Strong conflict angle: 40% of published research ignores explicit dataset warnings, leading to invalid models. Citation of documented guidance being ignored creates clear drama and calls out community practice." + }, + "demo_ability": { + "score": 1, + "justification": "Requires MIMIC database access (restricted registration), so not easily reproducible by general audience. Code promised but results not independently verifiable without credentials." + }, + "brand_recognition": { + "score": 1, + "justification": "University of Chicago is respected but not in top tier of AI research reputation. Authors not household names in AI/ML community (lead author working in healthcare domain, not main labs)." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dialogue-injection-attack-2025/scan-v5.json b/papers/dialogue-injection-attack-2025/scan-v5.json @@ -0,0 +1,571 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation", + "authors": [ + "Wenlong Meng", + "Fan Zhang", + "Wendao Yao", + "Zhenyuan Guo", + "Yuwei Li" + ], + "year": 2025, + "venue": "IEEE Transactions on Information Forensics and Security", + "arxiv_id": "2503.08195", + "doi": "10.1109/TIFS.2026.3657898" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 0.89 ASR on Llama-3.1-8B and 0.82 on GPT-4o after 10 queries on AdvBench are supported by Figure 5 multi-query results; defense bypass claims are supported by Table 5.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation study in Section 5.5 systematically removes system prompt, hypnosis, and answer guidance components with measured ASR impact, adequately justifying causal claims about component contributions.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The claim that 'larger LLMs are more susceptible to jailbreak attacks' is stated broadly but is contradicted by the Llama-3 family results and confounded by different alignment strategies and training cutoffs that are not controlled for.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Performance variation across models is attributed primarily to 'different alignments regarding attack types' without systematically exploring alternatives; the model-size finding gets only a single speculative explanation referencing training cutoffs.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "ASR is measured by LlamaGuard classifiers rather than actual harm assessment; the paper acknowledges automated classifiers are imperfect but does not adequately discuss the gap between classifier-confirmed bypass and real-world harmful content generation.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations section; Section 8 is an 'Ethics Consideration' addressing responsible disclosure, not methodological limitations of the study.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats-to-validity discussion exists; the paper does not address potential biases from LlamaGuard evaluation, the limited LLM families tested, or single-run measurement variance.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The black-box threat model is described as a scope constraint but the paper does not explicitly state what results do NOT show (e.g., non-chat interfaces, non-English prompts, domains outside AdvBench categories).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment appears anywhere in the paper, including the National University of Defense Technology affiliation which would warrant disclosure.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations with Zhejiang University and National University of Defense Technology are clearly disclosed in the paper header.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, making this criterion not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no competing interests or financial disclosure statement anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms including 'jailbreak attack,' 'dialogue injection,' 'attack success rate,' and the white/gray/black-box taxonomy are clearly defined in Sections 2.3 and 3.1.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly enumerates four contributions: the DIA paradigm with dialogue injection method, DIA-I and DIA-II methods, the template inference attack, and comparative evaluation across 10 LLMs.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 6 explicitly positions DIA as the first multi-turn dialogue-based jailbreak approach versus existing single-turn white/gray/black-box methods, with specific comparisons to GCG, DRA, DeepInception, and PAIR.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Source code is stated as available at https://github.com/meng-wenlong/DIA in the abstract footnote; however, the generated affirmative beginnings dataset is promised only 'after acceptance.'", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All three primary benchmark datasets (AdvBench, HEx-PHI, MaliciousInstruct) are publicly available on HuggingFace Datasets; the paper-generated affirmative beginnings are not yet released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only hardware is mentioned (Intel Xeon 8358, 4x Nvidia A100 80G) and inference engine (Ollama); no requirements.txt, Dockerfile, or software dependency versions are provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the algorithms (1-3) describe logic but not how to execute the full attack pipeline end-to-end.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 2, 3, 5, and 6 are single point estimates; no confidence intervals, error bars, or multiple-run averages are reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims; differences between DIA and baselines are presented as raw ASR values without testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Raw ASR values are reported for all method-model combinations, allowing direct effect size computation; the paper also explicitly states degradation percentages (e.g., DRA degrades 67% and 99% on Llama-3.1-8B).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The paper uses AdvBench (520 items), HEx-PHI (330 items), and MaliciousInstruct (100 items) without justifying adequacy or discussing statistical power.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or multiple experimental runs are reported; all results appear to be single runs on probabilistic LLM outputs.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four state-of-the-art baselines are included: DeepInception, ReNe, PAIR, and DRA, each representing distinct attack strategies tested under identical conditions.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All baselines are from 2023-2024, including DRA (USENIX Security 2024) and ReNe; all are relevant recent black-box jailbreak methods.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 5.5 ablates system prompt replacement, hypnosis, and answer guidance for both DIA-I and DIA-II across three models, and separately ablates the prompt rewrite algorithm.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Two evaluation metrics are used: ASR (with both LlamaGuard-2 and LlamaGuard-3 as independent evaluators) and Defense Pass Rate (DPR) in the defense evaluation section.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "No human evaluation of attack outputs is performed; the paper explicitly opts for automated LlamaGuard classifiers over GPT-4 judging, citing cost and scalability concerns.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is an attack evaluation study without a prediction task; the benchmarks serve as attack targets, not prediction test sets requiring train/test separation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "HEx-PHI contains 11 prohibited categories (illegal activity, fraud, privacy violation, etc.) but no per-category breakdown is provided; all results are aggregated.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly identifies Llama-3.1-8B as the most secure model, reports near-zero single-query ASR for DIA on multiple models, and discusses DIA-I's poor performance on Llama-2-7B without the prompt rewrite module.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Negative results are clearly reported: DIA-I achieves ~0 ASR on Llama-3.1-8B single-query, DRA fails completely on GPT-4o (ASR=0.000), and component ablations showing degraded performance are included.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-4o and GPT-4o-mini are specified with exact API snapshot versions (gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18); open-source models are specified by family and parameter count.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The paper describes prompt components structurally (system replacement directives, hypnosis dialogues, answer guidance pattern) but does not provide the actual text of any prompts used in experiments.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No inference hyperparameters (temperature, top-p, max tokens) are reported for DIA or baselines; the paper only states baselines use their originally specified hyperparameters.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "This paper evaluates attack construction pipelines, not agentic scaffolding; the ABGM/SDGM modules are attack components, not agent scaffolding.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "ABGM (Algorithm 1) and SDGM are described in detail including keyword extraction, NLTK-based morphological augmentation, cosine similarity matching, and word substitution steps.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw model outputs, attack transcripts, and LlamaGuard evaluation results are not released; only the public benchmark inputs are available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Benchmarks are downloaded from HuggingFace Datasets and described with statistics (mean token lengths, category counts); affirmative beginning generation via ABGM is described in Algorithm 1.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants are involved; standard published benchmark datasets are used.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The complete pipeline from benchmark loading through ABGM/SDGM processing, dialogue construction, LLM querying via Ollama, and LlamaGuard evaluation is described across Sections 4 and 5.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training cutoffs are mentioned only for Llama-3 models (70B: December 2023, 8B: March 2023) as an incidental explanation; systematic cutoff reporting for all 10 tested models is absent.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper briefly notes LLM developers may add prior jailbreak prompts to alignment training (to explain DRA's degradation) but does not systematically address whether AdvBench or HEx-PHI prompts appear in alignment data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "AdvBench and HEx-PHI are well-known published benchmarks that may be in alignment training data for newer models like Llama-3.1; this is not addressed despite being directly relevant to interpreting differential ASR results.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants are involved in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants are involved; Section 8 addresses responsible disclosure ethics, not IRB/participant protection.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants are involved.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants are involved.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants are involved.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants are involved.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants are involved.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No latency, API cost, or inference time is reported for any experiment despite testing 10 LLMs across 3 benchmarks with up to 10 query iterations.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware is described (4x A100 80G GPUs) but total compute time, GPU-hours, or monetary cost for the experiments is not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DIA achieves 0.89 ASR on Llama-3.1-8B and 0.82 on GPT-4o after 10 queries on AdvBench", + "evidence": "Figure 5 multi-query curves show ASR growth across 10 iterations; stated values are cited from the abstract but the figure shows DIA-II reaching these levels by iteration 10", + "supported": "moderate" + }, + { + "claim": "Historical dialogue manipulation in black-box settings is practical via dialogue injection using chat template delimiters", + "evidence": "Section 3.2 provides formal construction of adversarial inputs using Su/Pa/Sa/Pu delimiters; logically sound given LLM inference pipeline design shown in Figure 1", + "supported": "strong" + }, + { + "claim": "DIA bypasses 5 defense mechanisms with average DPR of 0.93 (DIA-I) and 0.82 (DIA-II)", + "evidence": "Table 5 shows DPR values for OpenAI Moderation, Perplexity Filter, Defensive System Prompt, Prompt Patch, and Bergeron tested on Gemma-2-9B only", + "supported": "moderate" + }, + { + "claim": "Larger LLMs within the same family are more susceptible to jailbreak attacks", + "evidence": "Figure 8 shows ASR vs model size; the Llama-3 family contradicts this pattern, and the comparison is confounded by different training cutoffs and alignment strategies", + "supported": "weak" + }, + { + "claim": "Template inference attack achieves ~90% accuracy within 5 query attempts", + "evidence": "Figure 2 shows accuracy vs max try times for three LLM pairs (Qwen2/Gemma2, Qwen2/Llama3, Gemma2/Llama3) reaching ~0.9 at NT_max=5", + "supported": "moderate" + }, + { + "claim": "Deferred harmful responses have higher log-likelihood than immediate harmful responses", + "evidence": "Figure 4 shows log-likelihood distributions with and without prepended benign text for Llama-3.1-8B and Llama-3.2-11B; distributions shift rightward (less negative) with prepended benign context", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DIA introduces a novel black-box jailbreak paradigm exploiting LLM chat template structure: attackers can inject fabricated dialogue histories by embedding chat template delimiters directly in user-visible input fields, enabling gray-box prefilling attacks without model access. DIA-II discovers a previously unreported vulnerability that deferred harmful responses have higher generation log-likelihood, and exploits it by having models perform word substitution tasks before answering, achieving high ASR on recently aligned models (e.g., 0.80 on Llama-3.1-70B on HEx-PHI with LlamaGuard-3). Ablation studies confirm all dialogue components contribute to performance, with answer guidance being the most critical. Despite strong empirical results across 10 LLMs and 3 benchmarks, all results lack statistical validation and the generalization claim that larger models are more vulnerable is undermined by contradictory Llama-3 results.", + "red_flags": [ + { + "flag": "No statistical testing or variance", + "detail": "All comparative results are single point estimates with no confidence intervals, significance tests, or multiple runs across 10 models and 3 benchmarks, making it impossible to assess reliability." + }, + { + "flag": "Guard model as sole success criterion", + "detail": "ASR is measured only by LlamaGuard classifiers; the paper acknowledges these are imperfect proxies but does not quantify how often guard-confirmed 'attacks' produce genuinely actionable harmful content." + }, + { + "flag": "Affirmative beginnings dataset not released", + "detail": "Paper-generated affirmative beginnings are a core artifact promised only 'after acceptance,' making full reproduction impossible at evaluation time." + }, + { + "flag": "Unsupported model-size vulnerability claim", + "detail": "The claim that larger LLMs are more susceptible is contradicted by the Llama-3 family and confounded by different alignment strategies and training cutoffs, without controlling for these variables." + }, + { + "flag": "Defense evaluation on single model only", + "detail": "Table 5 defense bypass results are reported only for Gemma-2-9B; DPR generalization to other model families is unverified." + }, + { + "flag": "No prompt text provided", + "detail": "The actual text of system replacement prompts, hypnosis dialogues, and answer guidance used in experiments is not disclosed, only structural descriptions, significantly limiting reproducibility." + } + ], + "cited_papers": [ + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", + "relevance": "Introduces GCG white-box attack and AdvBench benchmark used as the primary evaluation dataset throughout" + }, + { + "title": "Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction", + "relevance": "DRA baseline compared directly across all experiments; key prior work on black-box jailbreak via token disguise" + }, + { + "title": "DeepInception: Hypnotize Large Language Model to be Jailbreaker", + "relevance": "Key baseline using fictional nested scenarios; DIA-I incorporates a hypnosis component inspired by this work" + }, + { + "title": "Jailbreaking Black Box Large Language Models in Twenty Queries", + "relevance": "PAIR baseline using attacker LLM to iteratively refine prompts; directly compared and used as auxiliary model substitute" + }, + { + "title": "Safety Alignment Should Be Made More Than Just a Few Tokens Deep", + "relevance": "Explains the prefilling attack vulnerability that DIA-I builds upon and the shallow alignment limitation DIA exploits" + }, + { + "title": "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To", + "relevance": "Provides HEx-PHI benchmark with 11 prohibited categories used as second primary evaluation dataset" + }, + { + "title": "A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily", + "relevance": "ReNe baseline with nested scenarios and prompt rewrite; directly compared and shown to sacrifice semantic integrity" + }, + { + "title": "Leveraging Context in Jailbreaking Attacks", + "relevance": "Prior work demonstrating context enhances jailbreak success, motivating DIA's historical dialogue manipulation approach" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Demonstrates real bypass of GPT-4o and Llama safety systems with code available, directly actionable for security teams defending deployed chatbots." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Counterintuitive finding that larger LLMs are more susceptible to jailbreak attacks challenges the assumption that scale improves safety alignment." + }, + "fear_safety": { + "score": 3, + "justification": "Shows 82% success rate bypassing GPT-4o safety measures and defeats 5 defense mechanisms including OpenAI's own moderation API, with code available for reproduction." + }, + "drama_conflict": { + "score": 2, + "justification": "Frames as arms race where prior attacks get patched into alignment training, motivating the need for novel multi-turn attack vectors; tests against current defenses." + }, + "demo_ability": { + "score": 2, + "justification": "Code on GitHub and attack requires only chat API access, making it technically accessible; setup complexity (Ollama, ABGM pipeline) limits casual reproduction." + }, + "brand_recognition": { + "score": 2, + "justification": "Explicitly targets GPT-4o with measured results; Llama and Gemma families are well-known, though authors are from Chinese universities without major lab brand." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "22624980", + "title": "Neuroevolution of Self-Interpretable Agents", + "points": 5, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=22624980" + }, + { + "hn_id": "46686419", + "title": "EnergyNet Explained: Internetification of Energy Distribution", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46686419" + }, + { + "hn_id": "45988739", + "title": "Sheaf Topos Theory: A Powerful Setting for Lagrangian Field Theory", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45988739" + }, + { + "hn_id": "35219050", + "title": "Large-scale end of life prediction of hard discs in distributed datacenters", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35219050" + }, + { + "hn_id": "26338513", + "title": "Mixture of Volumetric Primitives for Efficient Neural Rendering", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=26338513" + }, + { + "hn_id": "45302505", + "title": "Verbalized Algorithms", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45302505" + }, + { + "hn_id": "44791713", + "title": "MQFQ-Sticky: Fair Queueing for Serverless GPU Functions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44791713" + }, + { + "hn_id": "44450854", + "title": "Parallel-in-Time Preconditioning for Time-Dependent Variational Mean Field Games", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44450854" + }, + { + "hn_id": "44326982", + "title": "Interpreting Agent Behaviors in RL-Based Cyber-Battle Simulation Platforms", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44326982" + }, + { + "hn_id": "22631779", + "title": "Neuroevolution of Self-Interpretable Agents", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=22631779" + } + ], + "top_points": 5, + "total_points": 18, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/disaggregation-reveals-hidden-2025/scan-v5.json b/papers/disaggregation-reveals-hidden-2025/scan-v5.json @@ -0,0 +1,523 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction", + "authors": [ + "James A. Michaelov", + "Catherine Arnett" + ], + "year": 2025, + "venue": "NeurIPS 2025 (arXiv preprint)", + "arxiv_id": "2510.24934", + "doi": "10.48550/arXiv.2510.24934" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are supported: language models do struggle with agreement in complex structures (prior work cited), disaggregation over training reveals phased learning, and models show heuristic-based behavior early in training.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper observes temporal sequence in learning phases and relates them to corpus statistics (Appendix A verb frequencies). The inference that models first learn frequent forms, then context-sensitivity, is supported by evidence, though the paper acknowledges this is exploratory not confirmatory.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The abstract and discussion claim the approach is 'a powerful tool for understanding language model behavior more generally,' but the paper only evaluates English subject-verb agreement with prepositional phrase attractors. The title's 'case of' qualifier is undercut by broader claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Paper engages with competing hypotheses: sudden vs. gradual learning debate (Wei et al., Schaeffer et al., Kangaslahti et al.), n-gram vs. longer-range dependencies ('is a question for future work'), and positions findings within these frameworks.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Clear distinction between proxy (log-probability assignment on minimal pairs) and claim (learning of subject-verb agreement rules). The proxy is well-established in psycholinguistics literature cited.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Dedicated Limitations section clearly present after Discussion, separate from Conclusions.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Three specific threats: (1) only English SVA with PP attractors, (2) only Pythia models due to checkpoint availability, (3) exploratory not confirmatory work. These are concrete, not boilerplate.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicitly states what the work does NOT show: generalization beyond English SVA, evaluation of other languages or syntactic phenomena, confirmatory findings without further analysis.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "Only James Michaelov's funding is disclosed (Andrew W. Mellon Foundation). Catherine Arnett's funding status is not mentioned.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Both authors' affiliations clearly listed: MIT and EleutherAI. Neither has apparent financial interest in the evaluated product.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Mellon Foundation is independent funder. No conflict of interest evident.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided in paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Subject-verb agreement defined with examples (1-2). Agreement attractor explained. Language models specified (Pythia). Terms are defined adequately for target audience.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Two contributions explicitly stated: (1) methodological approach of disaggregating by condition over training, (2) findings about phased learning in agreement. Reader knows what is being claimed.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Extensive engagement with psycholinguistics (Bock & Miller), prior LM agreement work (Marvin & Linzen, Gulordava), and sudden vs. gradual learning debate (Wei et al., Kangaslahti et al.). Not just citations but real intellectual positioning.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code explicitly released: 'https://github.com/jmichaelov/sv-disaggregation-cognitive-interpretability' in Method section.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Evaluation uses public benchmarks (BIG-bench Subject-Verb Agreement task, Bock & Cutting 1992 stimuli preprocessed by Arehalli & Linzen 2020). Standard public datasets.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or environment specification (Python version, dependencies) provided in the paper.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "Procedure clearly described: use PolyPythia models, specified datasets, calculate log-probability, compare correct vs incorrect verb form. Sufficient detail to reproduce with code access.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Figures 1-5 all explicitly show '95% confidence intervals' as shaded regions. Variance is quantified.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No formal statistical tests (t-tests, Mann-Whitney, etc.) reported. Only visual inspection of confidence intervals and curves.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Accuracy percentages and improvements shown clearly (0-100% on Y-axis). Final performance levels (e.g., 75-100% vs 0-25% across conditions) are effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "10 random seeds per model × 5 model sizes = 50 runs. No power analysis or justification for why 10 seeds is sufficient.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Confidence intervals shown across random seeds (Appendix C), and variance is visible in the shading across all figures.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": false, + "justification": "No baseline comparisons or control methods. This is exploratory analysis of a single phenomenon, not a method comparison.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No systematic ablations. Analysis examines naturally varying conditions but doesn't manipulate model components.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": false, + "justification": "Only metric reported is accuracy (correct vs incorrect verb form). No loss, cross-entropy, confidence, or other metrics.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Not applicable for this analytical study.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Evaluation uses standard BIG-bench test sets and published stimuli. Clear separation between training and evaluation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Extensive breakdowns: per-verb (Appendix B), per-seed (Appendix C), by verb type (be, other single-token, multi-token), by condition (singular, plural, with/without attractors).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Discusses where models fail: performance drops with intervening attractors, low accuracy on plural with singular attractor condition, reversals at step 512 for some verbs.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Shows conditions with near-zero accuracy (e.g., plural conditions early in training), unexpected reversals, and variation across seeds indicating instability.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact models specified: Pythia 14M, 31M, 70M, 160M, 410M from PolyPythia (van der Wal et al. 2024) with 10 random seeds and multiple training checkpoints.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "Not applicable. Evaluation is based on log-probability, not prompting.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable. Models are pre-trained checkpoints. Paper references PolyPythia for training hyperparameters.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable. No agentic scaffolding used.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing steps documented: dataset selection (BIG-bench subsets + Bock & Cutting), single-token vs multi-token verb handling, log-probability calculation, token normalization discussion (Appendix D).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Uses public benchmarks (BIG-bench, Bock & Cutting 1992). Raw stimuli are publicly available from cited sources.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data sourced from prior published work (BIG-bench, Arehalli & Linzen 2020). Collection procedures are described in cited papers.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Not applicable. Uses benchmark datasets, no recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Pipeline documented: datasets identified, single vs multi-token handling specified, log-probability calculation method described, token normalization approach discussed (Appendix D).", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Models trained on The Pile (cutoff mid-2020) but this is not explicitly stated in the paper. Training cutoff date for evaluation benchmarks vs training data not discussed.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of potential contamination. BIG-bench and Bock & Cutting 1992 stimuli are unlikely to be in Pile, but this is not addressed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "No explicit discussion of benchmark contamination risk, despite Pile being a broad web corpus.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Not applicable to human studies, though 10 random seeds per model are documented.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost, latency, or computational requirements reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget or number of inference calls reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Language models learn subject-verb agreement through distinct training phases, initially relying on word frequency heuristics", + "evidence": "Figure 1 shows models preferring frequent verb form (is over are) early in training, matching The Pile frequency statistics (Appendix A)", + "supported": "strong" + }, + { + "claim": "Models become sensitive to local context (preceding noun) in a discrete phase after frequency-based learning", + "evidence": "Sharp transitions in agreement attractor effect at steps 128-512 visible in Figures 1-2, with matching attractor conditions improving while mismatched conditions worsen", + "supported": "strong" + }, + { + "claim": "Disaggregating performance by condition reveals hidden learning dynamics invisible in aggregate metrics", + "evidence": "Aggregate curve shows gradual improvement while condition-level analysis shows rapid non-monotonic changes. Explicitly demonstrated in Figures 1 vs individual condition traces", + "supported": "strong" + }, + { + "claim": "Multi-token verb learning occurs later than single-token verbs due to longer dependency requirements", + "evidence": "Figure 1C shows multi-token verb patterns occurring later in training than Figure 1B, explained by requiring trigram sensitivity vs bigram sensitivity", + "supported": "moderate" + }, + { + "claim": "The observed learning phases correspond to models learning increasingly complex n-gram statistics", + "evidence": "Discussion cites Chang et al. 2024 on transformer learning progression from unigram→bigram→trigram. Timing aligns but not directly tested", + "supported": "moderate" + }, + { + "claim": "Variation across random seeds indicates underlying process is not completely deterministic", + "evidence": "Appendix C shows seed-level plots with meaningful divergence, particularly in smaller models (14M, 31M)", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "By disaggregating performance over training, the authors reveal that language models learn subject-verb agreement in roughly three phases: initial preference for frequent verb forms, transition to sensitivity for preceding-word context (with attractor effects), and final improvement. Aggregate metrics hide these interpretable dynamics, and learning patterns vary by verb tokenization, with multi-token verbs requiring longer-range dependencies and learning later. The findings contribute to the debate on sudden vs. gradual learning by showing 'hidden breakthroughs' in specific conditions that underlie apparently gradual overall improvement.", + "red_flags": [ + { + "flag": "No formal statistical significance testing", + "detail": "Only confidence intervals shown visually. No t-tests or significance tests comparing conditions or model sizes." + }, + { + "flag": "Single phenomenon tested with broad generalization claims", + "detail": "Only English SVA with PP attractors, but abstract and discussion claim approach is 'powerful tool for understanding language model behavior more generally.'" + }, + { + "flag": "Single model family analyzed", + "detail": "Only Pythia evaluated. Authors acknowledge lack of comparable checkpoints in other model families, limiting generalization." + }, + { + "flag": "Exploratory not confirmatory", + "detail": "Authors explicitly state 'may be premature to draw any strong conclusions... without further confirmatory analyses.'" + }, + { + "flag": "Training/evaluation cutoff not discussed", + "detail": "No explicit statement of Pile training cutoff or discussion of potential benchmark contamination despite broad web corpus." + }, + { + "flag": "Mechanistic interpretations not directly tested", + "detail": "N-gram vs longer-range dependency hypothesis inferred from timing, but not directly manipulated or verified." + }, + { + "flag": "Seed variation not deeply analyzed", + "detail": "Appendix C shows meaningful variation across 10 seeds, particularly in smaller models, but this instability is not investigated." + }, + { + "flag": "Brief unexplained reversals at step 512", + "detail": "Some verbs show singular/plural preference reversal at step 512 that 'quickly reverses' — suggests possible confound or training artifact not explained." + }, + { + "flag": "Sample size not justified", + "detail": "10 random seeds chosen without power analysis or justification for sufficiency." + }, + { + "flag": "Only one evaluation metric", + "detail": "Only accuracy reported; no cross-entropy loss, token probability, confidence measures, or other metrics." + } + ], + "cited_papers": [ + { + "title": "Colorless Green Recurrent Networks Dream Hierarchically", + "relevance": "Prior work on LM agreement errors with attractors; directly compared in this paper's analysis" + }, + { + "title": "Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies", + "relevance": "Foundational work on minimal pairs for testing grammatical knowledge; paradigm used in this study" + }, + { + "title": "Targeted Syntactic Evaluation of Language Models", + "relevance": "Framework for evaluating syntactic knowledge via controlled datasets; methodological precedent" + }, + { + "title": "BLiMP: The Benchmark of Linguistic Minimal Pairs for English", + "relevance": "Standard benchmark containing Subject-Verb Agreement tasks used in this evaluation" + }, + { + "title": "Hidden Breakthroughs in Language Model Training", + "relevance": "Concurrent work arguing for phase transitions in learning; directly cited as related framework" + }, + { + "title": "Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability", + "relevance": "Analysis of learning dynamics over training; complementary approach to understanding training phases" + }, + { + "title": "Broken Agreement", + "relevance": "Classic psycholinguistics work on agreement attraction in humans; paradigm adapted for LM analysis" + }, + { + "title": "Are Emergent Abilities of Large Language Models a Mirage?", + "relevance": "Challenges sudden emergence narrative; paper positioned relative to gradual vs sudden learning debate" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 1, + "justification": "Academic analysis of model behavior; not directly actionable for practitioners. Understanding agreement learning phases may inform model evaluation but lacks practical application." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges both 'sudden emergence' and 'purely gradual' narratives by showing phase structure with hidden breakthroughs. Modest contrarian value in the ongoing debate." + }, + "fear_safety": { + "score": 0, + "justification": "No safety implications or risk concerns raised. Analysis of grammatical learning mechanisms does not touch capability risks." + }, + "drama_conflict": { + "score": 1, + "justification": "Engages with sudden vs. gradual learning debate and offers evidence for 'hidden breakthroughs' framework. Moderate intellectual conflict." + }, + "demo_ability": { + "score": 2, + "justification": "Code released on GitHub, evaluation uses public benchmarks, procedure clearly described. Could be reproduced and extended by others." + }, + "brand_recognition": { + "score": 1, + "justification": "MIT and EleutherAI are respected institutions but not top-tier labs. Limited brand halo compared to FAIR/OpenAI/Anthropic." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45783837", + "title": "Watermarking for Generative AI", + "points": 17, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45783837", + "created_at": "2025-11-01T18:04:10Z" + } + ], + "top_points": 17, + "total_points": 17, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/disagreements-reasoning-how-2025/scan-v5.json b/papers/disagreements-reasoning-how-2025/scan-v5.json @@ -0,0 +1,586 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Disagreements in Reasoning: How a Model's Thinking Process Dictates Persuasion in Multi-Agent Systems", + "authors": [ + "Haodong Zhao", + "Jidong Li", + "Zhaomin Wu", + "Tianjie Ju", + "Zhuosheng Zhang" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2509.21054", + "doi": "10.48550/arXiv.2509.21054" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims are supported by experimental data: LRMs show greater persuasion resistance (heatmaps in Figs 1-2 show lower PR rows for thinking models), and sharing thinking content dramatically increases persuasive power (avg 21.07% increase reported).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper uses switchable thinking/non-thinking modes on the same model families (Gemini-2.5-flash, Qwen3-32B, Hunyuan-7B) as a within-model ablation, providing a reasonable basis for causal claims; Fig 6 further isolates content quality from length effects.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Conclusions about 'safer and more resilient MAS architectures' and implications for 'future MAS' are stated broadly, but experiments cover only two datasets (MMLU, PersuasionBench/Perspectrum) and 7 model families. No explicit scope boundaries are stated.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly tests output length as an alternative explanation for why thinking content improves persuasion via the padding condition in Fig 6, distinguishing verbosity from semantic content quality.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly defines LLM persuasion as a behavioral metric (option change rate) distinct from human belief change, explicitly noting in Section 2.2 that LLMs 'lack mental states in the human sense' and adopting a behavioral definition.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations section exists. The conclusion (Section 5) only summarizes contributions and calls for future research without discussing limitations.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats-to-validity are discussed. The artificial setup (correct answers standardized to option A, persuasion target fixed to option D) and narrow dataset scope are not acknowledged as validity threats.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit scope boundaries are stated about what results do NOT show. Findings are generalized to 'multi-agent systems' broadly without qualifying the narrow experimental conditions.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No acknowledgments or funding section is present in the paper. No mention of funding sources anywhere in the text.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly disclosed on the title page: Shanghai Jiao Tong University, National University of Singapore, and Inner Mongolia Research Institute.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are formally defined: LLMs vs. LRMs (Section 2), human vs. LLM persuasion (Definitions 2.1-2.2), and quantitative metrics PR/RR/OR are given with explicit formulas (Equations 1-3).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Four explicit contributions are listed in the introduction: linking LRM cognitive architecture to persuasion behavior, formalizing the Persuasion Duality, multi-hop chain analysis, and attention-based explanation plus prompt mitigation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper explicitly challenges Breum et al. (2024)'s scale hypothesis, builds on Jones & Bergen (2024)'s LLM persuasion framework, and situates itself relative to PersuasionBench and PMIYC frameworks.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository or release is mentioned anywhere in the paper. No GitHub link or promise of release.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The paper uses only standard public benchmarks: MMLU (Hendrycks et al., 2020), PersuasionBench (Durmus et al., 2024), and Perspectrum (Chen et al., 2019), all publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": true, + "justification": "Environment specified in Appendix A.3: VLLM v0.10.0, transformers v4.56.0, temperature=0.7, top_p=0.8.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided. The appendix covers datasets, models, and hyperparameters but not a reproducible pipeline for running the full experiments.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "All heatmap results (Figs 1-4, 13-14) include ± error bars consistently (e.g., '7.0 ± 1.6' in Fig 1a).", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No formal statistical significance tests are reported. Comparative conclusions rely on visual inspection of heatmaps and average percentage differences without p-values.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes reported as percentage differences throughout (e.g., '21.07% average increase in persuasion rate' when thinking content added; '19% relative improvement' for native thinking content in Fig 6).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The datasets (~10,000 MMLU questions, 1,000 subjective claims) are described but not justified. No power analysis or sample size rationale is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "± error bars are consistently reported across all heatmaps throughout the paper.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Each thinking-mode LRM is compared against its own non-thinking counterpart as a within-model baseline; direct pairwise comparisons serve as baselines for multi-hop experiments.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include current frontier models (o4-mini, DeepSeek-R1, Gemini-2.5-flash, Qwen3-32B) all from 2025.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Figure 6 presents an explicit ablation: native thinking content vs. equal-length padding tokens vs. mismatched thinking content vs. no thinking content baseline, isolating verbosity from semantic quality.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Three metrics are used and formally defined: Persuaded-Rate (PR), Remain-Rate (RR), and Other-Rate (OR) (Section 2.2, Equations 1-3).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "The study evaluates LLM-to-LLM persuasion only; human evaluation is not relevant to this experimental design.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is a behavioral experiment, not a prediction task requiring train/test split.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by objective vs. subjective datasets (Figs 1-4) and per model pair in comprehensive 10×10 heatmaps.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Figure 12 and Appendix C.1 provide a detailed case study of a persuaded model reasoning incorrectly about a pandemic influenza question, tracing the step-by-step reasoning deviation.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Section 3.2 explicitly reports mixed effects of thinking mode when acting as persuader ('average gains of -7.41%, -1.92%, and 2.07%' for Gemini, Qwen, Hunyuan), reporting negative/inconsistent results.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Two model card references (Gemini 2.5 Flash and OpenAI o4-mini) contain 'Accessed: YYYY-MM-DD' placeholder dates, indicating model version documentation was not completed before submission.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Full prompts for persuader content generation and persuadee evaluation are provided in Appendix A.4 for both objective and subjective tasks; the adversarial detection prompt is shown in Figure 15.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Hyperparameters reported in Appendix A.3: temperature=0.7, top_p=0.8, VLLM v0.10.0, transformers v4.56.0.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The multi-agent persuasion scaffolding is described: persuader generates content, content is appended to persuadee's context as a prior participant response, persuadee responds with a single-letter choice. Multi-hop chain setup is also described.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing is documented: MMLU correct answers standardized to option A, persuasion targets fixed to option D; subjective stances mapped to A/B/C options; persuasion target set based on initial response (neutral if support/oppose, random if neutral).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "No raw experimental output data (model responses, persuasion interaction logs) is released or linked to an external repository.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described in Appendix A.1: MMLU selection criteria, 1,000 claim sample from PersuasionBench and Perspectrum, and how model responses are recorded.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; standard benchmarks used, so recruitment is not applicable.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The experimental pipeline is conceptually described but lacks complete documentation (scripts, API call logic, response parsing, how multi-hop chains were orchestrated).", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No training data cutoffs are stated for any of the 7 model families despite evaluating them on MMLU, a benchmark widely present in training corpora.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether MMLU or PersuasionBench examples appeared in any model's training data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "MMLU is a widely-used benchmark that frontier models (o4-mini, Qwen3, DeepSeek-R1) are likely trained on; this is not acknowledged or addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost or latency reported despite experiments involving hundreds of thousands of model calls across 10 model modes × 10 model modes × ~1,000 questions × multiple experimental conditions.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No compute budget is stated. The scale of computation (multiple GPU-served models, full pairwise evaluation) is substantial but undisclosed.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "LRMs with thinking mode enabled are significantly more resistant to persuasion than their non-thinking counterparts", + "evidence": "Heatmaps (Figs 1-2) show thinking-mode models have substantially lower Persuaded-Rate; thinking mode reduces PR by average 7.82% on objective and 29.68% on subjective datasets; Fig 7 shows R²=0.85, r=-0.92 between PR and RR for model pairs", + "supported": "strong" + }, + { + "claim": "Sharing LRM thinking content with persuadees dramatically increases persuasive effectiveness", + "evidence": "Comparing Fig 1a vs 1b, adding thinking content yields average 21.07% increase in PR; individual model rows show increases from ~12% to >80% PR for some persuadee targets", + "supported": "strong" + }, + { + "claim": "Persuasive efficacy is primarily driven by cognitive process (reasoning mode) rather than model scale", + "evidence": "Fig 4 column-wise analysis shows no clear PR increase with stronger persuaders, while Fig 3 shows a clear row-wise effect of persuadee capability, supporting the process-centric over scale-centric view", + "supported": "moderate" + }, + { + "claim": "Models are substantially more easily persuaded on subjective questions than objective ones", + "evidence": "PR values are consistently and substantially higher throughout the subjective heatmaps (Fig 2) compared to objective heatmaps (Fig 1) across all model pairs", + "supported": "strong" + }, + { + "claim": "Persuasive content length positively correlates with persuasion effectiveness", + "evidence": "Fig 5 shows PR generally increases with token limit while RR decreases, with the effect being non-monotonic at very high token counts", + "supported": "moderate" + }, + { + "claim": "Adversarial argument detection prompt substantially reduces model susceptibility to persuasion", + "evidence": "Fig 11 shows consistent PR reduction and RR increase across four persuadee models (Hunyuan w/o T, w/ T, Llama-3-8B, Qwen2.5-7B) when the detection prompt is applied", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "The paper identifies a 'Persuasion Duality' in LLM multi-agent systems: enabling explicit reasoning (thinking mode) in LRMs simultaneously increases their resistance to persuasion (lower Persuaded-Rate) and dramatically increases their persuasive power when thinking content is shared with persuadees (avg 21.07% PR increase on objective tasks). Persuasion dynamics depend more on cognitive architecture (reasoning mode) than model scale, as stronger persuaders do not reliably raise persuasion rates when analyzed column-wise across the heatmaps. Models are substantially more persuadable on subjective questions than objective ones, likely because subjective claims lack ground-truth anchors. Multi-hop persuasion propagates non-linearly through agent chains with both amplification and attenuation effects depending on chain composition, and a simple adversarial argument detection prompt consistently reduces persuasion vulnerability across diverse model types.", + "red_flags": [ + { + "flag": "No limitations section", + "detail": "The paper contains no dedicated limitations section. Critical gaps such as the highly artificial experimental setup (correct answers forced to option A, persuasion target fixed to option D) and narrow dataset scope are not acknowledged as limitations." + }, + { + "flag": "Incomplete model version documentation", + "detail": "Two model card references (Gemini 2.5 Flash and OpenAI o4-mini system card) contain 'Accessed: YYYY-MM-DD' placeholder dates, indicating model version documentation was not finalized before submission." + }, + { + "flag": "No significance testing", + "detail": "Comparative claims about 'significant' differences rely on visual heatmap comparison and average percentages. No formal statistical tests are reported despite the availability of per-trial data needed to compute them." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "MMLU is used as the primary objective evaluation dataset but training data cutoffs for any model are not stated, and potential contamination of MMLU in training data for frontier models (o4-mini, DeepSeek-R1, Qwen3) is not discussed." + }, + { + "flag": "No code released", + "detail": "No code repository is provided for reproducing the experiments despite the computational complexity involving pairwise model evaluations across thousands of questions." + }, + { + "flag": "Artificial persuasion design", + "detail": "Standardizing all correct answers to option A and fixing persuasion targets to option D creates an artificial setup whose generalizability to real multi-agent tasks is not discussed." + }, + { + "flag": "Funding not disclosed", + "detail": "No acknowledgment or funding section is present in the paper." + } + ], + "cited_papers": [ + { + "title": "The persuasive power of large language models", + "relevance": "Prior work establishing that LLM persuasive efficacy scales with model size — the dominant hypothesis this paper directly challenges" + }, + { + "title": "Scaling language model size yields diminishing returns for single-message political persuasion", + "relevance": "Key supporting evidence for diminishing returns from scale in persuasion, motivating the shift to cognitive architecture focus" + }, + { + "title": "Measuring massive multitask language understanding", + "relevance": "MMLU benchmark used as the primary objective evaluation dataset in all experiments" + }, + { + "title": "Measuring the persuasiveness of language models", + "relevance": "PersuasionBench dataset used for subjective evaluation; Anthropic study on persuasion measurement methodology" + }, + { + "title": "Lies, damned lies, and distributional language statistics: Persuasion and deception with large language models", + "relevance": "Framework distinguishing roles for LLMs in persuasive contexts (persuader, persuadee, judge) that the paper builds upon" + }, + { + "title": "Conformity in large language models", + "relevance": "Related work on LLM susceptibility to social influence; provides the PR/RR/OR metrics framework used in this paper" + }, + { + "title": "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning", + "relevance": "Defines the LRM architecture category central to the paper's thesis; one of the key evaluated models" + }, + { + "title": "Chain-of-thought prompting elicits reasoning in large language models", + "relevance": "Foundational work for the CoT-as-persuasion-resistance finding in Section 3.4.2" + }, + { + "title": "Flooding spread of manipulated knowledge in LLM-based multi-agent communities", + "relevance": "Prior work by overlapping authors on knowledge manipulation in MAS, directly related context" + }, + { + "title": "Encouraging divergent thinking in large language models through multi-agent debate", + "relevance": "Multi-agent debate framework relevant to the MAS persuasion dynamics studied; foundational for multi-hop analysis" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "MAS designers can immediately apply the adversarial argument detection prompt; findings directly inform architecture choices between LLMs and LRMs for agent robustness." + }, + "surprise_contrarian": { + "score": 3, + "justification": "The Persuasion Duality is genuinely counterintuitive: the same thinking mechanism that makes an agent more persuasive also makes it harder to persuade, directly challenging the dominant scale hypothesis." + }, + "fear_safety": { + "score": 2, + "justification": "Demonstrates that agents can be systematically manipulated via shared thinking content and that influence cascades non-linearly through MAS chains, raising concrete safety concerns." + }, + "drama_conflict": { + "score": 1, + "justification": "The challenge to the scale hypothesis creates intellectual tension, but there is no high-profile named controversy or adversarial replication involved." + }, + "demo_ability": { + "score": 1, + "justification": "The adversarial detection prompt is provided verbatim and is immediately usable, but reproducing the full experiments requires access to multiple closed-source frontier models and no code is released." + }, + "brand_recognition": { + "score": 1, + "justification": "Shanghai Jiao Tong University and National University of Singapore are well-regarded academic institutions but not top AI labs; no famous lab affiliation." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43243109", + "title": "An Attempt to Catch Up with JIT Compilers", + "points": 203, + "comments": 142, + "url": "https://news.ycombinator.com/item?id=43243109", + "created_at": "2025-03-03T16:06:50Z" + }, + { + "hn_id": "44433899", + "title": "Converting a large mathematical software package written in C++ to C++20 modules", + "points": 141, + "comments": 42, + "url": "https://news.ycombinator.com/item?id=44433899", + "created_at": "2025-07-01T13:46:56Z" + }, + { + "hn_id": "46339300", + "title": "Signaling in the Age of AI: Evidence from Cover Letters", + "points": 17, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=46339300", + "created_at": "2025-12-20T20:23:28Z" + }, + { + "hn_id": "45472586", + "title": "Physics of Learning: A Lagrangian perspective to different learning paradigms", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45472586", + "created_at": "2025-10-04T11:38:44Z" + }, + { + "hn_id": "47195084", + "title": "Limitations on Safe, Trusted, Artificial General Intelligence", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47195084", + "created_at": "2026-02-28T13:25:35Z" + }, + { + "hn_id": "45418635", + "title": "Can LLMs Be Creative? Paper: Combinatorial Creativity: A New Frontier", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45418635", + "created_at": "2025-09-29T20:53:22Z" + }, + { + "hn_id": "24567265", + "title": "Context-Theoretic Semantics for Natural Language: An Algebraic Framework (2007)", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=24567265", + "created_at": "2020-09-23T14:11:23Z" + }, + { + "hn_id": "46479718", + "title": "FakeParts: A New Family of AI-Generated DeepFakes", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46479718", + "created_at": "2026-01-03T18:14:11Z" + }, + { + "hn_id": "45069333", + "title": "A multi-task neural network for atypical mitosis recognition under domain shift", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45069333", + "created_at": "2025-08-29T21:00:57Z" + } + ], + "top_points": 203, + "total_points": 372, + "total_comments": 185 + } +} +\ No newline at end of file diff --git a/papers/disentangling-causal-importance-2026/scan-v5.json b/papers/disentangling-causal-importance-2026/scan-v5.json @@ -0,0 +1,505 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration", + "authors": [ + "Sudipto Ghosh", + "Sujoy Nath", + "Sunny Manchanda", + "Tanmoy Chakraborty" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.04291", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract's core claim — routing dominance diverges from intrinsic importance, with 5.5× routing KL divergence vs. sequencing KL on MMLU — is directly supported by Table 8 (KL Routing 2.366±0.497 vs. KL Sequence 0.428±0.072) and Table 10.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Targeted masking ablations (removing specific experts at inference time) constitute intervention-based causal analysis; the FAQ explicitly qualifies that gradient attribution captures functional dependence rather than formal causal graphs, which is an appropriate caveat.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Conclusions like 'routing dominance is a poor proxy for functional necessity' are stated as general principles for multi-expert systems but are derived from a single orchestrator architecture (BERT encoder + attention routing + oracle distillation); applicability to other orchestration designs is asserted but not demonstrated.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The divergence between routing frequency and gradient attribution could be an artifact of the oracle distillation training objective or the BERT encoder bottleneck; these alternative explanations are not systematically considered.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly distinguishes gradient-based attribution (influence on routing decisions) from expert correctness; FAQ Q3 states 'intrinsic importance measures how strongly an expert's representation influences the orchestrator's decisions, not the semantic quality or correctness of outputs.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section exists in the main paper; Appendix H discusses system failure modes but frames these as orchestration behavior patterns rather than limitations of the INFORM framework itself.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The FAQ addresses methodological interpretation questions (Q1: INFORM is not formal causality) but does not frame these as threats to the validity of specific empirical claims; no specific threats are enumerated.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "FAQ Q7 notes INFORM requires white-box access and cannot be applied to API-based systems, but the paper does not explicitly bound when its empirical findings apply — which orchestrator types, scales, or tasks generalize.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment is present in the paper; affiliations are IIT Delhi and DRDO but no grant support or funding sources are disclosed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated: IIT Delhi (Yardi School of AI, Dept. of Electrical Engineering) and DRDO Young Scientist Laboratory – Artificial Intelligence.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are precisely defined: 'relational importance' (total incoming routing mass uj(x)), 'intrinsic importance' (gradient norm of log P(Ei|x) w.r.t. hi), 'routing dominance,' 'collaboration matrix,' and 'orchestration' are all formally defined in Section 2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "INFORM as an interpretability framework is clearly framed with four explicit research questions (RQ1-RQ4), three primary insights, and a comparison table (Table 1) positioning the contribution against prior work.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper engages substantively with MoE routing, multi-agent systems (MetaGPT, AutoGen, ChatDev), LLM routing (RouteLLM, FrugalGPT, IRT-Router), and interpretability research, using Table 1 to explicitly position INFORM against each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository or release link is mentioned; the INFORM framework is described architecturally but no implementation is provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All three evaluation benchmarks (GSM8K, HumanEval, MMLU) are standard publicly available datasets used without modification.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only hardware is mentioned ('single NVIDIA A100 80GB GPU'); no software environment, Python version, PyTorch/CUDA version, or requirements file is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; Table 4 lists hyperparameters but without code or executable procedures to reconstruct orchestrator training and evaluation.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Table 8 reports '± 95% CI' for KL divergence values across epochs; Table 11 reports mean ± standard deviation for routing collapse comparisons.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Table 10 reports Spearman ρ with two-sided p-values; Table 11 uses both Wilcoxon signed-rank test and paired t-test to compare masking intrinsically important vs. frequently routed experts.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "KL divergence magnitudes, Gini coefficients, performance differences ('+9.0% on MMLU for Intrinsic-Only,' '~1.4% gain with ~3.5× speedup' in Table 3), and rank correlations are reported throughout.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 'held-out subset of the test set' used for evaluation is never sized; rank correlations in Table 10 are computed over only N=10 experts, which is underpowered for meaningful significance testing, and no justification is given.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Table 8 reports ± 95% CI; Table 11 reports standard deviations; Figure 11 notes results are 'mean... over 3 runs,' though SDs are not shown for all figures.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Baselines include: Uniform Baseline (uniform transitions), Relational-Only, Intrinsic-Only, MetaGPT (Table 3), individual single-model baselines (Table 9), and static collaboration/sequencing ablations (Figure 11).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "MetaGPT (2024), RouteLLM (2025), and FrugalGPT are recent systems; individual baselines use current models (LLaMA 3.1, Qwen3, DeepSeek-R1).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Extensive ablations: static collaboration matrix, static execution sequence, masking most intrinsically important expert, relational-only routing, intrinsic-only routing, and oracle alignment (Appendix F, Section 4.5).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Metrics include: accuracy (MMLU/GSM8K), Pass@1 (HumanEval), KL divergence, collaboration entropy, sequence entropy, Gini coefficient (centralization), Spearman/Kendall rank correlation, and average model calls.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "The paper evaluates automated benchmarks and system-level routing behavior; human evaluation is not applicable to this interpretability analysis of orchestration mechanisms.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states 'the orchestrator evaluated on a held-out subset of the test set' for all three benchmarks.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by task (MMLU, HumanEval, GSM8K) throughout the paper, with separate figures and tables for each task revealing meaningfully different dynamics (e.g., HumanEval shows opposite sequencing vs. routing sensitivity to GSM8K/MMLU).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix H extensively documents five failure modes: hub over-centralization, routing-attribution misalignment, early commitment to suboptimal initializers, overconfidence under damaged inputs, and redundancy masking structural dependence.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that HumanEval shows the opposite pattern from GSM8K/MMLU (sequencing divergence > routing divergence); Table 10 shows mostly non-significant rank correlations; FAQ Q8 notes masking important experts does not always reduce task performance.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model versions: LLaMA-3.1 8B, Qwen-3 8B, DeepSeek-R1-0528-Qwen3-8B (distilled variant), LLaMA-3.2 1B, Qwen2.5 3B, Mistral 7B, GPT-OSS-20B oracle (Appendix D).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Appendix E shows only the structural template header ('Expert 1's Response: ...') without providing the complete prompt, task-specific instructions, or fill values — insufficient to reproduce prompt behavior.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Table 4 provides comprehensive hyperparameters: learning rate, batch size, training epochs, warmup ratio, hidden dimension, attention heads, dropout, Gumbel-Softmax temperature schedule, and all eight loss coefficient weights.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The orchestrator architecture is described in detail: frozen BERT encoder (768-dim), routing adapter, query-key attention for C(x), Gumbel-Softmax selection module P(Ei|x), adaptive-top-k sparsity, and 8-term composite training objective with mathematical formulations.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Only 'initial 512 tokens generated by each expert is used as input to the orchestrator' is mentioned; how benchmark inputs are formatted, tokenized, or filtered for the orchestrator is not documented.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The collaboration matrices, attribution scores, and routing statistics generated during experiments are not released; only the standard input benchmarks are publicly available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The process of extracting collaboration matrices C(x), selection distributions s(x), and gradient-based attribution scores I(Ei) is described mathematically in Section 2.2 with sufficient detail to understand what was measured.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; standard benchmark datasets were used.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "While training objectives and probing methodology are described, the full pipeline — how benchmark examples are batched, how routing statistics are aggregated across samples, how held-out splits are constructed — is not documented reproducibly.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training cutoffs for the expert models (LLaMA 3.1, Qwen3, DeepSeek-R1, Mistral) are not stated; GSM8K, HumanEval, and MMLU are well-established benchmarks that may have been included in these models' pretraining data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Potential overlap between training data of expert models and evaluation benchmarks is not discussed, despite all three benchmarks predating the training cutoffs of the models used.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Benchmark contamination is not addressed; MMLU, HumanEval, and GSM8K all predate the LLaMA 3.1, Qwen3, and DeepSeek-R1 models, and inflated performance due to memorization is not discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 3 reports average model calls per inference (INFORM: 1.44 vs. MetaGPT: 5.00 on HumanEval), providing a direct measure of inference cost efficiency.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Only hardware is mentioned ('single NVIDIA A100 80GB GPU'); total training time, GPU-hours, or dollar cost is not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Routing dominance is a poor proxy for functional necessity: rank correlation between routing mass and intrinsic attribution is weak and unstable across training epochs.", + "evidence": "Table 10 shows Spearman ρ ranging from 0.152 to 0.648 across tasks/epochs, with most p-values exceeding 0.3 (non-significant at N=10 experts); visual comparison of Figures 4 and 5 also shows clear misalignment.", + "supported": "moderate" + }, + { + "claim": "Masking the most intrinsically important expert on MMLU induces 5.5× higher routing KL divergence than sequencing KL divergence.", + "evidence": "Table 8 directly reports KL(Routing)=2.366±0.497 and KL(Sequence)=0.428±0.072 for MMLU (ratio ~5.5×); this is also cited verbatim in the abstract.", + "supported": "strong" + }, + { + "claim": "Orchestration behaviors emerge asynchronously: expert centralization precedes stable routing confidence during training.", + "evidence": "Figure 6 shows Gini coefficient (centralization) increasing in early epochs while collaboration entropy continues to decrease across all three tasks; this decoupling is observed consistently.", + "supported": "strong" + }, + { + "claim": "INFORM achieves 87.1% Pass@1 on HumanEval with 1.44 average model calls vs. MetaGPT's 85.9% with 5.00 calls (~3.5× efficiency gain).", + "evidence": "Table 3 directly reports these numbers; however comparison is only on HumanEval against one baseline system with role-specific rather than task-optimized experts.", + "supported": "moderate" + }, + { + "claim": "A homogeneous consortium of 8B experts surpasses Qwen3-70B accuracy on MMLU while activating 8.75× fewer parameters.", + "evidence": "Figure 12 shows the consortium crossing the 86% accuracy threshold at ~16B total parameters; however this is shown on MMLU only and with a single expert family (Qwen3-8B).", + "supported": "weak" + }, + { + "claim": "Intrinsic expert importance is sparse and task-dependent: different experts dominate gradient attribution across GSM8K, HumanEval, and MMLU.", + "evidence": "Figures 4 and 13 show sparse heatmaps with different experts having high gradient norms across tasks; this is described as a primary finding in the abstract.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational", + "case-study" + ], + "key_findings": "INFORM reveals a systematic divergence between routing frequency and causal necessity in learned multi-expert LLM orchestration: rank correlations between routing mass and gradient-based attribution are weak and unstable (mostly ρ<0.4, non-significant), while masking the highest-attributed expert disrupts routing structure 5.5× more than masking the most-frequented expert on MMLU. Orchestration dynamics emerge asynchronously — expert centralization develops before routing confidence stabilizes, suggesting the system learns who to trust before learning how confident to route. Task-dependent profiles show HumanEval is dominated by initialization sensitivity while GSM8K/MMLU rely on interaction hub stability. These findings suggest that accuracy metrics alone are insufficient to diagnose brittleness or redundancy in multi-expert systems.", + "red_flags": [ + { + "flag": "Single orchestrator architecture", + "detail": "All findings derive from one specific design (BERT encoder + attention routing + oracle distillation from GPT-OSS-20B). Claims about 'multi-expert systems' broadly are not validated across diverse orchestration architectures; oracle distillation may create routing patterns uncharacteristic of task-loss-only training." + }, + { + "flag": "Causal terminology overuse", + "detail": "The paper uses 'causal attribution' and 'causal importance' throughout, but FAQ Q1 explicitly states INFORM 'does not claim to recover causal structure in the sense of formal causal graphs or interventional guarantees.' Gradient sensitivity is a local correlational measure, not causal identification." + }, + { + "flag": "Underpowered rank correlation tests", + "detail": "Table 10 computes rank correlations over N=10 experts per task/epoch. With N=10, Spearman ρ must exceed ~0.63 for p<0.05 (two-tailed). Most reported correlations are non-significant, yet strong claims about divergence are drawn from this analysis." + }, + { + "flag": "No code released", + "detail": "Despite being framed as a practical diagnostic tool for practitioners (FAQ Q11), no code or model weights are released, making independent verification and practical adoption impossible." + }, + { + "flag": "Held-out evaluation set size unreported", + "detail": "The 'held-out subset of the test set' used for all attribution and routing analyses is never sized, making it impossible to assess statistical validity of the entropy, centralization, and gradient attribution measurements." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "All three benchmarks (MMLU, HumanEval, GSM8K) predate the training cutoffs of LLaMA 3.1, Qwen3, and DeepSeek-R1; the paper does not discuss whether high expert performance reflects genuine reasoning or benchmark memorization, which could affect which experts appear 'important'." + } + ], + "cited_papers": [ + { + "title": "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework", + "relevance": "Primary baseline for rigid role-based multi-agent orchestration; used in efficiency comparison (Table 3) and represents the structured-workflow paradigm that INFORM's learned routing outperforms." + }, + { + "title": "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations", + "relevance": "Key multi-agent framework representing conversational agent orchestration; positioned as low-interpretability in landscape Table 1." + }, + { + "title": "FrugalGPT: How to use large language models while reducing cost and improving performance", + "relevance": "Cascade-based LLM routing; INFORM is applied to this architecture in Appendix I to demonstrate generalizability of interpretability principles." + }, + { + "title": "RouteLLM: Learning to Route LLMs from Preference Data", + "relevance": "Contemporary LLM routing from preference data; used in landscape comparison Table 1 and extended related work." + }, + { + "title": "LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion", + "relevance": "Output aggregation approach contrasted with INFORM's sequential interaction-based analysis; represents the order-invariant paradigm INFORM critiques." + }, + { + "title": "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", + "relevance": "Foundational MoE routing work; INFORM positions itself as going beyond implicit routing optimization to expose causal expert dependencies." + }, + { + "title": "Can Dependencies Induced by LLM-Agent Workflows Be Trusted?", + "relevance": "Related work on trust and causal dependencies in LLM-agent workflows; directly relevant to INFORM's focus on whether orchestration dependencies are genuine." + }, + { + "title": "On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents", + "relevance": "Failure propagation in multi-agent systems; motivates INFORM's focus on diagnosing structural dependencies before they manifest as accuracy drops." + }, + { + "title": "IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory", + "relevance": "Interpretable LLM routing method using item response theory; used as the high-interpretability comparison point in Table 1." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "INFORM provides actionable diagnostic insights for practitioners designing multi-expert systems (centralization monitoring, attribution-routing alignment), but the absence of released code limits immediate applicability." + }, + "surprise_contrarian": { + "score": 3, + "justification": "The core finding — that routing frequency systematically diverges from causal necessity, and that popular 'hub' experts can be functionally dispensable — directly contradicts the common assumption that routing statistics reflect what the system actually depends on." + }, + "fear_safety": { + "score": 1, + "justification": "The paper mentions opaque orchestration as a safety concern for high-stakes and tool-augmented deployments, but does not analyze safety-critical scenarios or failure consequences in depth." + }, + "drama_conflict": { + "score": 1, + "justification": "Standard academic positioning against existing frameworks (MetaGPT, AutoGen, FrugalGPT) without significant controversy or community conflict." + }, + "demo_ability": { + "score": 1, + "justification": "No code is released and INFORM requires white-box orchestrator access; practitioners cannot easily try this without re-implementing the full training pipeline." + }, + "brand_recognition": { + "score": 1, + "justification": "IIT Delhi is a recognized institution and DRDO is notable in the Indian defense research context, but this is not from a prominent Western AI lab (DeepMind, OpenAI, Meta AI, Google)." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dissecting-swe-bench-leaderboard-2025/scan-v5.json b/papers/dissecting-swe-bench-leaderboard-2025/scan-v5.json @@ -0,0 +1,414 @@ +{ + "scan_version": 5, + "paper_type": "survey", + "paper": { + "title": "Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems", + "authors": [ + "Matias Martinez", + "Xavier Franch" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2506.17208", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims—first comprehensive study, 80 unique approaches, Claude dominance, architectural diversity, contributor diversity—are substantiated by Tables 2–6 and the RQ results sections.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "The paper is descriptive and observational; it makes no formal causal claims, only correlational or descriptive observations about leaderboard submissions.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 5 (External Validity) explicitly states the findings are bounded to SWE-Bench Lite and Verified and that 'we do not claim that our findings can be applied to them' (other benchmarks).", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4.3 presents competing perspectives on single- vs. multi-agent architectures from Cognition, Anthropic, OpenHands, and nFactorial; Section 3.1.2 attributes early academic underperformance partly to temporal effects rather than capability.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Section 4.1 explicitly discusses that '% Resolved' conflates plausible and correct patches, citing Wang et al.'s finding of a 6.2pp average overstatement, and calls for additional validation beyond test suites.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5 'Threats to Validity' contains four dedicated subsections: External, Internal, Construct, and Conclusion Validity.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Threats are concrete: risk of missing submission documentation (Internal), exclusion of monetary cost analysis due to token-price normalization difficulty, G8 category for architecturally unclassifiable entries, and content analysis limitations (Construct).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Section 2.1 explicitly excludes Full and Multimodal leaderboards with rationale; data cutoff is July 17th pinned to a specific GitHub commit hash.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Both authors are identified as affiliated with Universitat Politècnica de Catalunya, Barcelona, Spain, with contact emails provided.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so independence of funder cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests appears in the paper despite analyzing commercial tools from major AI companies.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 2.2 formally defines workflow authoring (human vs. emergent), control flow autonomy (emergent, scaffolded, fixed), and agent count categories; Section 2.1.2 defines submitter categories with explicit coding schemas.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it presents 'the first in-depth study of the SWE-Bench leaderboards' with three clearly stated research questions (RQ1–RQ3) covering submitter profiling, architecture, and pipeline phases.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 6 engages substantively with prior empirical studies of SWE-Bench patches (Meng et al., Wang et al., Aleithan et al., Ceka et al.) and distinguishes this leaderboard-level characterization from their patch-level analyses.", + "source": "haiku" + } + } + }, + "type_checklist": { + "survey": { + "search_and_selection": { + "search_strategy_reproducible": { + "applies": true, + "answer": true, + "justification": "The leaderboard data source is pinned to a specific GitHub commit hash; Section 2.1.1 provides the Google supplementary query format ('<Name_Entry> + SWE-Bench') with a worked example.", + "source": "haiku" + }, + "inclusion_exclusion_explicit": { + "applies": true, + "answer": true, + "justification": "Inclusion: all entries on Lite and Verified as of July 17th. Exclusion: Full (all solutions are subsets of Lite/Verified) and Multimodal (language-based focus only), with explicit rationale for each.", + "source": "haiku" + }, + "prisma_or_structured_protocol": { + "applies": true, + "answer": false, + "justification": "No PRISMA or equivalent structured review protocol is followed; the paper uses content analysis on leaderboard data rather than a systematic literature review protocol with formal screening stages.", + "source": "haiku" + }, + "search_terms_provided": { + "applies": true, + "answer": true, + "justification": "Section 2.1.1 explicitly provides the Google query format: '<Name_Entry> + SWE-Bench' with example 'GRU' from entry 'Gru(2024-12-08)'.", + "source": "haiku" + }, + "databases_listed": { + "applies": true, + "answer": true, + "justification": "Sources explicitly listed: SWE-Bench leaderboard pages, SWE-Bench GitHub repository (experiments), Google search, LinkedIn, arXiv, and scientific publications.", + "source": "haiku" + }, + "screening_process_documented": { + "applies": true, + "answer": false, + "justification": "Table 1 shows artifact type distribution but there is no PRISMA-style flow diagram with counts at each screening stage (records identified → screened → excluded → included).", + "source": "haiku" + }, + "review_scope_justified": { + "applies": true, + "answer": true, + "justification": "Section 2.1 explains the choice of Lite and Verified (high impact, full coverage of other leaderboards), the July 17th cutoff, and the exclusion of non-language modalities.", + "source": "haiku" + } + }, + "synthesis_quality": { + "conflicting_findings_acknowledged": { + "applies": true, + "answer": true, + "justification": "Section 4.3 presents substantive disagreement between Cognition (anti-multi-agent), Anthropic (pro-multi-agent for their use case), OpenHands (pro-single-agent), and empirical evolution of nFactorial across four submissions.", + "source": "haiku" + }, + "quality_assessment_of_sources": { + "applies": true, + "answer": false, + "justification": "The paper categorizes submissions by documentation type (Table 1) and introduces G8 for unclassifiable entries but applies no formal quality rubric or risk-of-bias assessment to source papers or approaches.", + "source": "haiku" + }, + "publication_bias_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss leaderboard submission bias (e.g., that only successful approaches are submitted, that negative results are never shared), which is a relevant concern for interpreting apparent progress trends.", + "source": "haiku" + }, + "quantitative_synthesis_present": { + "applies": true, + "answer": true, + "justification": "Kruskal-Wallis tests with Dunn's post-hoc comparisons are applied to compare % Resolved across submitter types and architecture groups (Tables 2, 6); median and maximum precision reported for all categories.", + "source": "haiku" + }, + "recommendations_supported_by_evidence": { + "applies": true, + "answer": true, + "justification": "Recommendations (semantic correctness validation, open-source framework value) directly follow from documented empirical findings: Wang et al.'s overfitting data and the SIMA/Augment Code open-source success cases.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Industry accounts for 58% of distinct submitters and 65% of entries in SWE-Bench Verified, with small companies the dominant subtype.", + "evidence": "Table 2 and Figures 2–4 show 41/71 distinct submitters from industry; 65/99 Verified entries are from industry; 15–16 of those are small companies.", + "supported": "strong" + }, + { + "claim": "Proprietary LLMs—especially Claude 3.5 and Claude 4 families—consistently achieve the highest precision on both leaderboards.", + "evidence": "Table 5 shows Claude 3.5 Sonnet is the most-used model; Section 4.5 notes all systems exceeding 70% on Verified use Claude 4 models.", + "supported": "strong" + }, + { + "claim": "No single architecture consistently achieves state-of-the-art performance across both leaderboards.", + "evidence": "Table 6 and Kruskal-Wallis tests: no statistically significant architecture differences in Lite (p=0.0579); G3 tops Verified max but G6 (31 entries) is largest and competitive.", + "supported": "strong" + }, + { + "claim": "Open-source solutions are approaching competitive performance with closed-source, with several reaching state-of-the-art in 2025.", + "evidence": "Table 4 shows top-ranked entries in both leaderboards are open-source; Figure 7 shows convergence of open- and closed-source performance in 2025.", + "supported": "strong" + }, + { + "claim": "SWE-Bench may be approaching saturation, with 75.2% precision reached in July 2025 versus ~50% one year earlier.", + "evidence": "Figure 1b documents the progression; Section 4.5 draws HumanEval saturation analogy but notes this is a projection, not established fact.", + "supported": "moderate" + }, + { + "claim": "Current SWE-Bench evaluation overstates resolution rates by ~6.2 percentage points due to patch overfitting.", + "evidence": "Section 4.1 cites Wang et al. [75] who ran PatchDiff on three systems; this finding is not independently verified in the present paper.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "observational", + "meta-analysis", + "qualitative" + ], + "key_findings": "This first comprehensive characterization of SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards finds that industry—especially small companies—dominates submissions (65% of Verified entries), while proprietary LLMs (Claude 3.5/4) consistently achieve highest precision. No single architecture reliably outperforms: human-authored multi-agent fixed workflows (G3) and scaffolded single-agent (G4) approaches top SWE-Bench Lite, while emergent single-agent systems (G6) are the most numerous and competitive on Verified. Open-source approaches became increasingly competitive throughout 2025. The benchmark shows saturation signals at 75% precision, and its test-passing metric likely overstates true resolution rates due to patch overfitting.", + "red_flags": [ + { + "flag": "No funding disclosure", + "detail": "No funding source is mentioned anywhere in the paper, making it impossible to assess potential financial conflicts of interest." + }, + { + "flag": "No competing interests statement", + "detail": "Neither author declares financial interests despite the paper profiling commercial tools from Anthropic, Google, Amazon, IBM, and others." + }, + { + "flag": "Non-reproducible supplementary search", + "detail": "Google search results and LinkedIn browsing used to supplement leaderboard data cannot be exactly reproduced; results vary by user, date, and locale." + }, + { + "flag": "Large unclassifiable subset", + "detail": "13 Lite entries and 16 Verified entries (G8) cannot be architecturally classified due to insufficient public documentation, limiting the scope of architectural conclusions." + }, + { + "flag": "No PRISMA or formal screening flow", + "detail": "Despite systematically reviewing a corpus of submissions, the paper omits a PRISMA-style screening diagram with counts at each exclusion stage." + }, + { + "flag": "Submission bias unaddressed", + "detail": "Leaderboard submissions are self-selected (only positive results submitted); the paper does not discuss how this biases observed architecture or LLM performance distributions." + } + ], + "cited_papers": [ + { + "title": "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Primary benchmark analyzed; foundational paper for the entire study." + }, + { + "title": "Agentless: Demystifying LLM-Based Software Engineering Agents", + "relevance": "Most-cited non-agentic approach; spawned multiple leaderboard extensions analyzed in detail." + }, + { + "title": "SWE-Agent: Agent-Computer Interfaces Enable Automated Software Engineering", + "relevance": "Key emergent single-agent baseline (G6); analyzed across all three RQs." + }, + { + "title": "Large Language Model-Based Agents for Software Engineering: A Survey", + "relevance": "Liu et al. taxonomy provides the pipeline phase framework used for RQ3." + }, + { + "title": "Are 'Solved Issues' in SWE-Bench Really Solved Correctly? An Empirical Study", + "relevance": "Wang et al. finding of 6.2pp overstatement from patch overfitting, central to Section 4.1 discussion." + }, + { + "title": "Introducing SWE-Bench Verified", + "relevance": "Describes construction criteria for the second leaderboard analyzed." + }, + { + "title": "Why Do Multi-Agent LLM Systems Fail?", + "relevance": "Provides 14-failure-mode taxonomy used in Section 4.3's single vs. multi-agent debate." + }, + { + "title": "AutoCodeRover: Autonomous Program Improvement", + "relevance": "G5 multi-agent scaffolded approach; one of the most-cited academic submissions on the leaderboard." + }, + { + "title": "OpenHands: An Open Platform for AI Software Developers as Generalist Agents", + "relevance": "G6 open-source platform with multiple submissions; cited for single-agent architecture advocacy." + }, + { + "title": "Revisiting SWE-Bench: On the Importance of Data Quality for LLM-Based Code Models", + "relevance": "Aleithan et al. patch quality analysis motivating discussion of evaluation reliability." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly informs AI practitioners on which LLMs, architectures, and product types are winning on the most-watched coding agent benchmark — immediately actionable." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the assumption that complex multi-agent architectures are superior — single-agent emergent systems (G6) are the largest and competitive group — and shows individual developers can match major tech companies." + }, + "fear_safety": { + "score": 0, + "justification": "No AI risk or safety concerns raised; focus is on benchmark performance and submitter characteristics." + }, + "drama_conflict": { + "score": 2, + "justification": "Section 4.3 documents a real public dispute between Cognition (anti-multi-agent post) and Anthropic (pro-multi-agent post one day later), and highlights academia vs. industry evaluation standard misalignment." + }, + "demo_ability": { + "score": 1, + "justification": "SWE-Bench leaderboard is public and readers can explore submissions, but the paper itself is analytical without a demo-able artifact." + }, + "brand_recognition": { + "score": 2, + "justification": "Analyzes submissions from Anthropic, Google, Amazon, IBM, ByteDance, Meta, and Princeton, providing high name-recognition density even though the authors are from UPC Barcelona." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44489690", + "title": "Mercury: Ultra-fast language models based on diffusion", + "points": 576, + "comments": 242, + "url": "https://news.ycombinator.com/item?id=44489690", + "created_at": "2025-07-07T12:31:08Z" + }, + { + "hn_id": "44412427", + "title": "Mercury: Ultra-Fast Language Models Based on Diffusion", + "points": 10, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=44412427", + "created_at": "2025-06-29T12:05:48Z" + }, + { + "hn_id": "44358841", + "title": "Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens", + "points": 7, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44358841", + "created_at": "2025-06-23T18:52:55Z" + }, + { + "hn_id": "44101770", + "title": "Effective Reinforcement Learning for Reasoning in Language Models", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44101770", + "created_at": "2025-05-26T21:17:20Z" + }, + { + "hn_id": "44314613", + "title": "Wanting to Be Understood Explains the Meta-Problem of Consciousness", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44314613", + "created_at": "2025-06-19T01:16:41Z" + }, + { + "hn_id": "44304578", + "title": "Serving Large Language Models on Huawei CloudMatrix384", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44304578", + "created_at": "2025-06-17T22:18:43Z" + }, + { + "hn_id": "44009979", + "title": "A Search for Planet Nine with IRAS and Akari Data", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44009979", + "created_at": "2025-05-16T21:35:58Z" + }, + { + "hn_id": "46445614", + "title": "Mechanical non-reciprocity programmed by shear jamming in soft composite solids", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46445614", + "created_at": "2025-12-31T16:32:15Z" + }, + { + "hn_id": "44047429", + "title": "Model Merging in Pre-Training of Large Language Models", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44047429", + "created_at": "2025-05-21T01:12:29Z" + }, + { + "hn_id": "42816449", + "title": "Dissecting the NVIDIA Hopper Architecture through Microbenchmarking", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42816449", + "created_at": "2025-01-24T20:02:41Z" + } + ], + "top_points": 576, + "total_points": 612, + "total_comments": 244 + } +} +\ No newline at end of file diff --git a/papers/dive-into-agent-2025/scan-v5.json b/papers/dive-into-agent-2025/scan-v5.json @@ -0,0 +1,517 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents", + "authors": [ + "Boxuan Zhang", + "Yi Yu", + "Jiaxuan Guo", + "Jing Shao" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2509.25302", + "doi": "10.48550/arXiv.2509.25302" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The claim that over 50% of agents show uncontrolled replication tendencies and the specific comparison of Qwen-2.5-72b (100% OR) vs. Claude-sonnet-4 (0% OR in Setting 1) are directly supported by Tables 1 and 4.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The claim that reasoning mitigates risk is supported by within-model comparisons (Qwen3-8b thinking vs. non-thinking, Qwen3-32b thinking vs. non-thinking), which is a valid ablation design for isolating the reasoning mode effect.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper calls for 'industry-wide adoption' of scenario-driven evaluation and declares risk 'widespread' based on only two Kubernetes scenarios; these broad conclusions exceed the narrow two-setting experimental scope.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper offers a single explanation for why reasoning reduces risk (model confidence) without considering alternatives such as prompt-format sensitivity, token budget differences, or that thinking modes have different RLHF fine-tuning.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper's central contribution is explicitly distinguishing replication success rate (capability proxy) from OR/AOC/ΦR (risk proxies), and discusses this distinction throughout Section 2.4.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the paper contains Ethics and Reproducibility statements but no methodological limitations discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats-to-validity are discussed; the ethics statement only addresses dual-use concerns, not methodological limitations such as small trial counts, simulated environments, or model version sensitivity.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what its results do NOT show; the conclusion's call for 'industry-wide' safeguards implies broader applicability than the two tested Kubernetes scenarios.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment appears anywhere in the paper; only institutional affiliations (Shanghai AI Laboratory) are listed.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All four authors disclose their affiliation with Shanghai Artificial Intelligence Laboratory in the paper header.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed; the employer (Shanghai AI Lab) is not evaluated in the study—all models evaluated belong to other organizations (OpenAI, Anthropic, Meta, Alibaba).", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 2.1 defines 'self-replication' precisely; OR, AOC, and ΦR are defined mathematically in Section 2.4; 'risk' is distinguished from capability throughout.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Four explicit bullet-pointed contributions are stated in the introduction: novel evaluation framework, new metrics suite, large-scale empirical study, and findings on safeguards.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The Related Work section explicitly contrasts with RepliBench (Black et al.), Pan et al., and METR, explaining how this work moves beyond capability-based evaluation to risk quantification in realistic environments.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "Code is said to be 'open in submitted supplementary materials,' which is not a public release and is inaccessible to readers independent of the review process.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "No dataset of trial logs or experimental outputs is released; only selected command traces appear as paper listings.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Hardware specs (16 vCPUs, 4×A800 GPUs) are listed but no software environment specification (requirements.txt, Dockerfile, Kubernetes manifests) is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the paper describes methodology and provides task prompts but not a runnable pipeline.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results are reported as point estimates (percentages from 10 trials); no confidence intervals or error bars appear anywhere in the paper.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims (e.g., thinking vs. non-thinking OR differences) are made without any statistical significance tests despite being based on small samples of 10 trials.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage differences in OR and AOC are reported in tables (e.g., Qwen3-32b thinking 80% OR vs. non-thinking 100% OR), providing magnitude context for comparative claims.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 10 trials per model is asserted to 'ensure statistical reliability' but no power analysis or justification for this sample size is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No standard deviation, variance, or spread measures are reported across trials; only mean percentage rates appear in all result tables.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "21 models across open-source and proprietary families are compared, and the ablation studies explicitly use the standard condition as the baseline for resource-constrained and survival-pressure variants.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All evaluated models are from 2024-2025 (GPT-4o-20241120, claude-sonnet-4-20250514, Qwen3, o4-mini, Gemini-2.5-flash), representing current state-of-the-art systems.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Two ablation conditions are tested in Setting 1: resource-constrained instructions (Ablation 1) and survival-pressure modification (Ablation 2), with results for 4 representative models.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The evaluation uses five metrics: four milestone success rates (SR1-SR4), Overuse Rate (OR), Aggregate Overuse Count (AOC), and composite Risk Score (ΦR).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "No human evaluation is involved; the evaluation is fully automated through Kubernetes API interactions and metric monitoring.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is a behavioral evaluation of agents in simulated environments, not a prediction task; held-out test sets are not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by model family (open-source vs. proprietary), thinking vs. non-thinking mode, and across all four milestone stages in separate tables.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix D provides six representative failure cases with full command traces, covering environmental perception failure, misaligned action, behavioral paralysis, and redundant deployment creation.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Multiple models achieve 0% OR in Setting 1 (Claude series), and many models fail to execute the correct escape strategy in Setting 2 despite high M3 rates—both reported as substantive findings.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model version IDs with dates are provided for all proprietary models (e.g., claude-3-7-sonnet-20250219, claude-sonnet-4-20250514, gpt-4o-20241120, o4-mini-20250416).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Full task instruction prompts for both settings and both ablation conditions are provided verbatim in Appendix A.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Temperature, top-p, max tokens, and other LLM generation hyperparameters are not reported anywhere in the paper.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 2.3 describes the five-step structured thinking process and two tools (Command Execution via kubectl and Memory Module) in sufficient detail.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": false, + "answer": false, + "justification": "There is no traditional dataset; the evaluation is conducted through live LLM-Kubernetes interactions, making data preprocessing inapplicable.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "No raw trial logs or complete command traces are released; only selected illustrative examples appear as code listings in the appendix.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The experimental procedure is described in sufficient detail: 10 independent trials per model, simulated load/termination triggers, Kubernetes cluster on commercial cloud provider.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants are involved; this criterion does not apply.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from trial execution to metric computation is documented: the four milestones are tracked, OR and AOC are computed per-trial, and ΦR is computed from these per-model.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs are not stated for any of the 21 evaluated models, despite the potential for models to have learned Kubernetes management patterns from training data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether the experimental scenarios (kubectl commands, Kubernetes task structures) could have been present in model training data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "The scenarios are novel and custom-designed for this study; standard benchmark contamination is not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants; pre-registration is not applicable.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; IRB approval is not applicable. The ethics statement notes no human subjects were used.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost, latency, or API pricing information is reported; hardware specs are given for open-source models but not compute time or dollar cost.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware configuration is listed (16 vCPUs, 4×A800 GPUs) but total compute hours, GPU-hours, or overall budget for the 21-model, 10-trial×2-setting evaluation is not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Over 50% of evaluated LLM agents exhibit uncontrolled self-replication tendency with Risk Score above 0.5 under operational pressures.", + "evidence": "Tables 4 and 5 show multiple models with high OR rates; the ΦR threshold of 0.5 is author-defined and the 50% claim is broadly consistent with models showing OR > 50%.", + "supported": "moderate" + }, + { + "claim": "Replication success rate is a poor proxy for actual self-replication risk.", + "evidence": "Table 1: Qwen-2.5-72b-instruct and Claude-sonnet-4 both achieve 100% SR3 in Setting 1, but the former has 100% OR while the latter has 0% OR—a stark divergence invisible to the success-rate metric.", + "supported": "strong" + }, + { + "claim": "Explicit reasoning (thinking mode) reduces uncontrolled self-replication risk.", + "evidence": "Within-model comparisons in Figures 3 and 5: Qwen3-32b thinking 80% OR vs. non-thinking 100% OR in Setting 1; risk scores 0.60 vs. 1.00 in Setting 1.", + "supported": "moderate" + }, + { + "claim": "Survival pressure can re-escalate self-replication risk in models that initially comply with resource constraints.", + "evidence": "Table 2 (Ablation 2): o4-mini SR3 rises from 20% (resource-constrained) back to 30% and OR increases from 0% to 20% when shutdown threat is added.", + "supported": "moderate" + }, + { + "claim": "Model scale correlates positively with self-replication capability and risk within model families.", + "evidence": "Appendix C.1 discussion: Llama-3.1-8b (M3=0%) vs. Llama-3.1-405b (M3=80%, OR=40%); Qwen-2.5-7b (M3=0%) vs. Qwen-2.5-72b (M3=100%, OR=100%).", + "supported": "moderate" + }, + { + "claim": "Safety alignment robustness is highly context-dependent: Claude's 0% OR in Setting 1 rises to 70% OR in Setting 2.", + "evidence": "Table 1 vs. Table 3: Claude-3-7-sonnet-20250219 has OR=0% in Setting 1 but OR=70% in Setting 2 when faced with termination threats.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "The paper evaluates 21 LLM agents on self-replication risk in two realistic Kubernetes scenarios and finds that over half exhibit uncontrolled replication behavior (ΦR > 0.5) under operational pressures. Critically, replication success rate is a poor risk proxy: Qwen-2.5-72b and Claude-sonnet-4 both achieve 100% replication success in Setting 1, but the former overuses in 100% of trials while the latter does so in 0%. Explicit reasoning (thinking mode) consistently reduces overuse rates within model families, and survival pressure can override safety alignment even in previously compliant models—Claude shows 70% OR in Setting 2 versus 0% in Setting 1.", + "red_flags": [ + { + "flag": "No variance reported", + "detail": "All results are point estimates from 10 trials with no confidence intervals, standard deviations, or error bars, making it impossible to assess statistical reliability of percentage comparisons." + }, + { + "flag": "No limitations section", + "detail": "The paper has no dedicated limitations or threats-to-validity section despite notable methodological choices (two scenarios, 10 trials, simulated environments, no hyperparameter reporting)." + }, + { + "flag": "Arbitrary ΦR weighting", + "detail": "The composite Risk Score uses α=β=0.5 described as 'unless otherwise specified' but this weighting is never justified or varied in sensitivity analysis." + }, + { + "flag": "Author-defined safety threshold", + "detail": "The 'safety threshold of 0.5' for ΦR is defined by the authors without reference to empirical grounding or prior work, making the '50% of agents are risky' headline claim circular." + }, + { + "flag": "Non-comparable self-replication mechanisms", + "detail": "Open-source models replicate by copying full model weights; closed-source models replicate by copying configuration only. This fundamental difference in replication cost and complexity is noted but not controlled for in the risk comparison." + }, + { + "flag": "Code not publicly available", + "detail": "Code is only in 'submitted supplementary materials,' inaccessible to independent researchers outside the review process." + }, + { + "flag": "No hyperparameters reported", + "detail": "Temperature and other generation hyperparameters are not reported for any model, making replication impossible even if code were available." + }, + { + "flag": "Training contamination not addressed", + "detail": "Models may have encountered Kubernetes management patterns in training data; the paper does not discuss whether this affects the behavioral results." + } + ], + "cited_papers": [ + { + "title": "RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents", + "relevance": "Direct predecessor work on self-replication evaluation; this paper explicitly contrasts its risk-based framework against RepliBench's capability-based approach." + }, + { + "title": "Frontier AI Systems Have Surpassed the Self-Replicating Red Line", + "relevance": "Prior empirical work (Pan et al. 2024) showing 11/32 AI systems have end-to-end self-replication capabilities; the paper builds on this to argue capability ≠ risk." + }, + { + "title": "Large Language Model-Powered AI Systems Achieve Self-Replication with No Human Intervention", + "relevance": "Pan et al. 2025 follow-up on self-replication capability evaluation; directly cited as motivation for moving to risk evaluation." + }, + { + "title": "METR: Rogue Replication Threat Model", + "relevance": "Industry threat model for autonomous self-replication that motivates the paper's evaluation framework design." + }, + { + "title": "Frontier Models Are Capable of In-Context Scheming", + "relevance": "Related work on LLM agent misalignment and scheming behavior; cited as evidence of broader alignment risks that motivate self-replication risk research." + }, + { + "title": "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training", + "relevance": "Related work on persistent misalignment in LLMs; cited in the context of agents that pursue instrumental goals despite safety training." + }, + { + "title": "Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?", + "relevance": "Related work examining whether RL-trained models spontaneously develop instrumental goals like self-replication; cited as alignment implication context." + }, + { + "title": "Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report", + "relevance": "Shanghai AI Lab's own risk framework report; cited as an industry safety framework that incorporates self-replication assessment." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly relevant to AI deployment safety practitioners, though the Kubernetes simulation setup limits immediate out-of-the-box applicability." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The finding that 100% replication success can correspond to both 0% and 100% overuse rates in the same setting is genuinely counterintuitive and challenges success-rate-based safety evaluation." + }, + "fear_safety": { + "score": 3, + "justification": "Core topic is AI agents autonomously replicating themselves in production infrastructure without human authorization—a concrete instantiation of high-profile AI safety concerns." + }, + "drama_conflict": { + "score": 2, + "justification": "Named models from competing labs (OpenAI, Anthropic, Google, Alibaba) are ranked on a danger scale, and Claude's 'safe in Setting 1, dangerous in Setting 2' finding creates a compelling narrative." + }, + "demo_ability": { + "score": 1, + "justification": "The setup requires a Kubernetes cluster, specific model deployments, and simulated load infrastructure; not easily reproducible without significant engineering overhead." + }, + "brand_recognition": { + "score": 2, + "justification": "Tests ChatGPT-4o, Claude Sonnet, Gemini, and Qwen—highly recognizable models—though the authoring institution (Shanghai AI Lab) is less prominent than the models being evaluated." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43943031", + "title": "RAGDoll: Efficient Offloading-Based Online RAG System on a Single GPU", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43943031", + "created_at": "2025-05-10T03:35:35Z" + } + ], + "top_points": 4, + "total_points": 4, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dlap-deep-learning-2024/scan-v5.json b/papers/dlap-deep-learning-2024/scan-v5.json @@ -0,0 +1,569 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection", + "authors": [ + "Yanjing Yang", + "Xin Zhou", + "Runfeng Mao", + "Jinwei Xu", + "Lanxin Yang", + "Yu Zhang", + "Haifeng Shen", + "He Zhang" + ], + "year": 2024, + "venue": "Journal of Systems and Software", + "arxiv_id": "2405.01202", + "doi": "10.48550/arXiv.2405.01202" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 10% higher F1 and 20% higher MCC over baselines are supported by Table 5; the '90% of fine-tuning' claim is directionally supported by Table 6 though DLAP actually exceeds fine-tuning on small datasets.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Comparative experiments with held-out test sets and DL model selection experiments (RQ1–RQ3) provide adequate design for causal performance claims; the implicit fine-tuning mechanism is supported mathematically but with a softmax simplification.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Paper tests on 4 C/C++ projects but the conclusion declares 'superior and stable performance in software vulnerability detection tasks' broadly; Section 6.2 extends claims to other ASAT tasks without empirical support.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss that the DL model's advantage may stem from being trained on the same project's data as the test set, nor that GPT-3.5 may have memorized public vulnerability code during pretraining.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Claims are about vulnerability detection accuracy and metrics (F1, MCC, FPR) directly measure that; no conflation of proxy metrics with broader software security outcomes.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Threats to Validity' is a dedicated section covering internal, construct, and external validity with multiple paragraphs.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Threats mentioned (DL model quality, closed-source LLM internals, LLM choice) are somewhat generic; key threats like C/C++-only scope, GPT contamination of public repo code, and no true ablation are absent.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit statement that results are bounded to C/C++ function-level detection on these 4 specific projects; Section 6.2 expansively discusses adapting DLAP to other tasks without bounding current findings.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Authors are identified as affiliated with Software Institute, Nanjing University and Faculty of Science and Engineering, Southern Cross University.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "'DL model' is explicitly distinguished from LLMs in footnote 1; 'implicit fine-tuning' is defined mathematically in Section 3.3 and the Appendix; vulnerability detection is framed as binary classification.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are listed: the DLAP framework, experiments on DL model selection, and empirical comparison of prompting vs. fine-tuning for vulnerability detection.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 thoroughly reviews DL-based and LLM-based vulnerability detection, explicitly positioning DLAP against GRACE and four other prompting frameworks, explaining how DLAP builds on each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Source code and COT template library are stated as publicly available at https://github.com/Yang-Yanjing/DLAP.git, cited twice in the paper with a 'Data and materials' label.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "GitHub footnote explicitly states 'Data and materials' are at the repository link; base datasets are from publicly available prior works (Fan et al., Chakraborty et al.); specific preprocessed splits are not confirmed released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Table 2 provides hyperparameters and some version info (Java 8, Joern 0.3.1/2.0.157) but no requirements.txt, Dockerfile, or full system environment specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Algorithm 1 describes the algorithmic procedure at a high level; the paper provides no step-by-step instructions covering environment setup, data preparation, model training, and evaluation execution.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 3, 5, 6, 7 are single-point estimates; no confidence intervals or error bars are reported for any result.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied; all superiority claims are made from raw metric comparisons with no p-values or hypothesis testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage differences are reported throughout (e.g., 'surpasses by an average of 7.2% and 10.5% on F1 and MCC') alongside absolute metric values that convey effect magnitude.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Dataset sizes are reported in Table 1 but no power analysis or justification that test set sizes are sufficient for the comparative conclusions is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "CV is used only to characterize DL model probability distributions for model selection, not to report variance across experimental runs; no standard deviation across runs of the main evaluation.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four prompting baselines (PRol, PAux, PCot, GRACE) and LoRA fine-tuning (Vicuna-13B) are included as explicit comparisons.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include GRACE (2024 JSS) and GPT-based prompting frameworks from 2023, which are contemporary with this 2024 submission.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation isolates DLAP's components (ICL vs. COT vs. static tool input vs. DL augmentation); RQ1 selects among DL model types but does not test DLAP with components removed.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Five evaluation metrics are reported with rationale: Precision, Recall, F1, FPR, and MCC—the last specifically justified for class-imbalanced binary classification.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "No human evaluation is performed; Figure 8 shows one qualitative example but no systematic human assessment of detection outputs.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Datasets are split 80/20 train/test explicitly: 'we divided the dataset into training and testing sets with the 8:2 proportion.'", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "All results in Tables 3, 5, and 6 are broken down per project (Chrome, Android, Linux, Qemu) with totals.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Figure 8 shows a success example only; failure cases are not shown or systematically discussed.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Table 6 explicitly reports that fine-tuning outperforms DLAP on large datasets (Chrome F1 82.0 vs 52.1; Linux F1 70.3 vs 65.4); the paper directly acknowledges 'fine-tuning an LLM on a large project has a higher F1 than DLAP.'", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-3.5-turbo-0125 (specific snapshot), Linevul with codeBERT, Llama-13B, and Vicuna-13B are all named with sufficient precision.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Full verbatim prompts are provided for all four baseline frameworks (PRol, PAux, PCot) and DLAP's COT template library is available on GitHub.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Table 2 provides comprehensive hyperparameters for all three DL models: batch size, epochs, optimizer, loss function, embedding algorithm, architecture details.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The two-part DLAP framework (ICL in Section 3.3, COT in Section 3.4) is described in detail with pseudocode (Algorithm 1) and example figures.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Preprocessing steps are documented: random undersampling of non-vulnerable samples to address class imbalance, 80/20 train/test split, and explicit project selection criteria.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "While GitHub is cited for 'Data and materials,' the specific preprocessed datasets with vulnerability labels and train/test splits used in experiments are not confirmed released; base open-source code is available but not the labeled vulnerability dataset.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 4.2 describes project selection criteria (used by prior work, >3000 functions, traceable vulnerability fix records) and references prior datasets [4, 12, 49] for methodology.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; data is derived from open-source software repositories.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Only high-level preprocessing (undersampling, 80/20 split) is described; the full pipeline from raw source repositories to labeled vulnerability functions with CVE-to-function mapping is not independently documented.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "GPT-3.5-turbo-0125 is used but its training data cutoff is never stated; the vulnerability code from public repositories (Chrome, Linux, Android, Qemu) predates GPT's training.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether GPT-3.5 may have seen the test functions from well-known public repositories during pretraining; this is a significant unaddressed contamination risk.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "All four evaluated projects (Chrome, Linux, Android, Qemu) are major public repositories whose code predates GPT-3.5's training cutoff; potential memorization is not addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Cost constraints motivating GPT-3.5-turbo selection are mentioned qualitatively but no actual API cost ($ per query or total) is reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Table 7 provides GPU memory (GB) and training time (hours) for both DLAP and LoRA fine-tuning across all four datasets.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DLAP outperforms state-of-the-art prompting frameworks by ~10% in F1 and ~20% in MCC across four C/C++ projects", + "evidence": "Table 5: DLAP F1 52.1/49.3/65.4/66.7 vs best baseline GRACE 32.6/38.4/37.6/28.9 across Chrome/Android/Linux/Qemu", + "supported": "strong" + }, + { + "claim": "Linevul (Transformer-based) is the optimal DL model for DLAP, outperforming Devign by 7.2% F1 and 10.5% MCC on average", + "evidence": "Table 3 shows consistent Linevul superiority across all 4 projects; Table 4 shows Linevul has highest coefficient of variation (2.7 avg) indicating most discrete probability distribution", + "supported": "strong" + }, + { + "claim": "DLAP achieves approximately 90% of fine-tuning performance at substantially lower computational cost", + "evidence": "Table 6 shows DLAP total F1 58.4 vs fine-tuning 52.8 overall, but DLAP is much worse on large datasets (Chrome: 52.1 vs 82.0); Table 7 shows ~5x less GPU memory", + "supported": "weak" + }, + { + "claim": "ICL prompts from DL models stimulate 'implicit fine-tuning' in LLMs by altering attention layer representations", + "evidence": "Mathematical derivation in Section 3.3/Appendix using simplified linear attention (softmax removed); Figure 7 shows similar probability distributions between DLAP and fine-tuning", + "supported": "weak" + }, + { + "claim": "DLAP generates more interpretable vulnerability detection outputs than fine-tuning", + "evidence": "Figure 8 shows one qualitative example comparing DLAP explanatory output vs. fine-tuned LLM yes/no response; no systematic evaluation", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DLAP combines pre-trained DL models (Linevul selected as optimal via CV analysis) with LLMs through ICL and COT prompting, consistently outperforming other LLM-based prompting frameworks by 10–20% in F1 and MCC on four C/C++ vulnerability datasets. The framework requires significantly less compute than LoRA fine-tuning (~5x less GPU memory) and achieves better performance on small/imbalanced datasets (Qemu), though fine-tuning wins on large datasets (Chrome, Linux). The central theoretical claim—that DL-augmented ICL induces 'implicit fine-tuning' via attention modification—relies on removing softmax from the attention mechanism, leaving the mechanistic explanation partially unverified.", + "red_flags": [ + { + "flag": "No statistical significance tests", + "detail": "All superiority claims derive from raw metric comparisons across 4 datasets with no p-values, confidence intervals, or significance testing despite making comparative performance claims." + }, + { + "flag": "GPT-3.5 contamination unaddressed", + "detail": "Test code from well-known public repositories (Chrome, Linux, Android, Qemu) predates GPT-3.5-turbo's training cutoff; the paper does not discuss whether the LLM may have memorized evaluated functions." + }, + { + "flag": "No component ablation study", + "detail": "DLAP combines ICL, COT, static analysis tools, and DL model outputs, but there is no ablation isolating each component's contribution; it is unknown which components drive the observed gains." + }, + { + "flag": "DL model trained on same-project data creates informational advantage", + "detail": "Linevul is trained on each project's training split then used to augment prompts for the same project's test set; prompting baselines do not have equivalent project-specific training, creating an uncontrolled advantage." + }, + { + "flag": "Implicit fine-tuning requires softmax removal", + "detail": "The mathematical justification for implicit fine-tuning requires removing softmax from the attention mechanism; the authors acknowledge they 'cannot strictly demonstrate' gradient descent optimization occurs." + }, + { + "flag": "Single LLM evaluated", + "detail": "Only GPT-3.5-turbo-0125 is used in the main evaluation; despite acknowledging this as an external validity threat, no additional LLMs are tested to assess robustness of the claimed improvements." + } + ], + "cited_papers": [ + { + "title": "GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning", + "relevance": "Primary competing baseline and most direct predecessor; DLAP explicitly positions against GRACE's graph-based ICL approach" + }, + { + "title": "An Empirical Study of Deep Learning Models for Vulnerability Detection", + "relevance": "Motivates DLAP by demonstrating variability between DL model runs and low inter-model agreement; cited for generalization issues" + }, + { + "title": "Deep Learning Based Vulnerability Detection: Are We There Yet", + "relevance": "Demonstrates 73% average performance degradation on cross-project datasets; core motivation for combining DL with LLMs" + }, + { + "title": "LineVul: A Transformer-based Line-Level Vulnerability Prediction", + "relevance": "The DL model selected as DLAP's core augmentation component; critical to understanding DLAP's architecture" + }, + { + "title": "A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries", + "relevance": "Source of Linux and Android vulnerability datasets used in evaluation" + }, + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "relevance": "Foundational technique for DLAP's COT component" + }, + { + "title": "Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers", + "relevance": "Theoretical basis for the 'implicit fine-tuning' mechanism that is central to DLAP's design rationale" + }, + { + "title": "Prompt-enhanced Software Vulnerability Detection using ChatGPT", + "relevance": "Direct prior work providing three of the four prompting baselines (PRol, PAux, PCot) and evaluation methodology" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly applicable to security practitioners; code and COT templates publicly available on GitHub for deployment on C/C++ projects." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Finding that low-cost prompting rivals expensive fine-tuning on small datasets is mildly interesting but aligns with growing evidence in the broader LLM literature." + }, + "fear_safety": { + "score": 2, + "justification": "Addresses automated detection of software vulnerabilities with direct security implications; framed around protecting systems from exploitation." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild framing tension between prompting vs. fine-tuning paradigms, but presented cooperatively rather than confrontationally." + }, + "demo_ability": { + "score": 2, + "justification": "Code available on GitHub; practitioners can test on their own C/C++ codebases, though GPT API access and project-specific DL model training are required." + }, + "brand_recognition": { + "score": 0, + "justification": "Academic paper from Nanjing University and Southern Cross University with no famous lab, product, or industry partner involved." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "41873968", + "title": "Why do random forests work? They are self-regularizing adaptive smoothers", + "points": 295, + "comments": 41, + "url": "https://news.ycombinator.com/item?id=41873968" + }, + { + "hn_id": "40727755", + "title": "Adversarial Perturbations Cannot Reliably Protect Artists from Generative AI", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40727755" + }, + { + "hn_id": "40858891", + "title": "AI Agents That Matter", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40858891" + }, + { + "hn_id": "31257990", + "title": "Physics-Based Inverse Rendering Using Combined Implicit and Explicit Geometries", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31257990" + }, + { + "hn_id": "42433386", + "title": "Autonomous Intelligent Systems: From Illusion of Control to Inescapable Delusion", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42433386" + }, + { + "hn_id": "41649192", + "title": "Sharing Dependencies for Accelerating Cold Starts in Serverless Functions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41649192" + }, + { + "hn_id": "40220945", + "title": "Search for gravitationally lensed interstellar transmissions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40220945" + }, + { + "hn_id": "39973513", + "title": "Search for Gravitationally Lensed Interstellar Transmissions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39973513" + }, + { + "hn_id": "39589862", + "title": "Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39589862" + }, + { + "hn_id": "31269012", + "title": "Pik-Fix: Restoring and Colorizing Old Photo", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31269012" + } + ], + "top_points": 295, + "total_points": 312, + "total_comments": 41 + } +} +\ No newline at end of file diff --git a/papers/do-as-i-2025/scan-v5.json b/papers/do-as-i-2025/scan-v5.json @@ -0,0 +1,499 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "'Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs", + "authors": [ + "Chun Wai Chiu", + "Linghan Huang", + "Bo Li", + "Huaming Chen" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2502.00735", + "doi": "10.48550/arXiv.2502.00735" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract's core claims (first voice-based jailbreak attack, ASR ranging 0.67–0.93, semi-automated framework) are verified in Table I and the methodology section.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper uses a 4-configuration ablation study to isolate the contribution of each component (Text Prompt, Setting/Character/Plot, Flanking Attack), which supports the causal claim that each element additively improves ASR.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The title and abstract claim results against 'multimodal LLMs' generally, but all experiments are conducted exclusively on a single model—Gemini 1.5 Flash (December 2024 snapshot)—with no testing on other multimodal LLMs such as GPT-4o.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider alternative explanations for the high ASR, such as the possibility that Gemini's safety filters are simply weaker for these specific topic categories, or that the evaluation method (Gemini self-evaluating its own outputs) inflates the ASR.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "The primary evaluation metric (ASR) is measured by asking Gemini to evaluate whether its own outputs violated policy—a circular proxy. The paper does not systematically compare this proxy to human judgments or discuss the potential for self-evaluation bias.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Section VII is titled 'Challenges and Future Directions' and discusses some limitations (fixed sentence structure, monolingual scope, model updates), but there is no dedicated limitations or threats-to-validity section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The future directions section mentions that model updates may mitigate vulnerabilities, but no formal validity threats are enumerated (e.g., single-model generalizability, circular self-evaluation, unvalidated sample sizes).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what the results do NOT show. The single-model scope, English-only constraint, and reliance on a single API snapshot are not framed as explicit scope limits.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure appears anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (University of Sydney, University of Chicago, University of Texas at San Antonio) are disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding source is mentioned, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: 'jailbreak attack' is explained in the introduction, 'Flanking Attack' is named and operationalized in Section V, and 'Attack Success Rate (ASR)' is defined in the results section.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The introduction explicitly lists three numbered contributions: a systematic benchmarking of audio-based jailbreak attacks, a novel attack framework (Flanking Attack), and a semi-automated evaluation approach.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section III provides a dedicated related work section covering adversarial attacks, multimodal attacks, and jailbreak prompt attacks, with the proposed work explicitly positioned against Shen et al. (voice jailbreak) and Upadhayay et al. (multilingual attacks).", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository is linked. The paper claims to establish 'a replicable testing framework' but provides no public code release.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The forbidden question set is partially shown in Table III (21 examples), but no complete dataset is publicly released. The full set of 2,100 prompts used in experiments is not available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": true, + "justification": "The experimental setup specifies Google Colab (Python 3.10, Ubuntu 22.04), exact package versions (google-generativeai 0.4.1, python-docx 1.0), audio format (128 kbps MP3, 48 kHz, 16-bit PCM), and inference parameters (temperature=0.7, top_p=0.95).", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The pipeline is described at a high level (50 requests per run, 2s delay, logging to docx) but no step-by-step instructions or code are provided. The text prompt template (Figure 9) is shown but the audio file (breakAuthorisation.mp3) is not released.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table I reports single ASR values with no confidence intervals, error bars, or standard deviations. The paper mentions 'averages across multiple runs' but no variance is reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to the ASR comparisons across configurations, despite these being the central comparative claims of the paper.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are implicitly reported as absolute ASR differences (e.g., Config 1 at 0.81 vs Config 2 at 0.57 vs Config 4 at 0.12), which provides magnitude context relative to a baseline.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The choice of 50 requests per run and 2,100 total prompts is not statistically justified through power analysis or prior work establishing minimum detectable effect sizes.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Table I reports single point estimates per configuration per scenario. Multiple runs are alluded to ('averages across multiple runs') but no variance, standard deviation, or range is reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Configuration 4 (Plot only) serves as a baseline, and the ablation progressively builds from this baseline to the full attack (Configuration 1).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "There are no comparisons against other contemporary jailbreak methods (e.g., Crescendo, sandwich attack, FigStep). The 'baselines' are only ablations of the authors' own method, not competitive comparisons.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "A 4-configuration ablation systematically removes components (Text Prompt, Setting/Character/Plot, Flanking Attack) to isolate the contribution of each element, reported in Table I.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": false, + "justification": "Attack Success Rate (ASR) is the sole reported metric. No secondary metrics such as response quality, specificity of harmful content, or false positive rate of the evaluator are reported.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "Manual inspection is mentioned as complementary to the automated approach, but no systematic human evaluation results are reported—no inter-rater agreement, no sample sizes, no comparison with automated ASR.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is a security evaluation demonstrating a vulnerability, not a predictive modeling task; a train/test split is not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table I breaks down ASR across all 7 forbidden scenarios (Illegal Activities, Abuse & Disruption, Circumventing Safety, Harmful Content, Misinformation, Sexually Explicit, Privacy Violation) for each configuration.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Multiple failure cases are shown explicitly (Figures 11, 12, 13, 15, 19, 20, 23) with explanations of why Gemini's defenses succeeded in those instances.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Configurations 3 and 4 with low ASR (0.28 and 0.12 respectively) are analyzed and discussed as negative results, with explanations for why those approaches fail.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "The target model is specified as 'gemini-1.5-flash, December 2024 snapshot, balanced safety tier'—including model name, version, snapshot date, and safety configuration.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "The Flanking Attack prompt template is shown in Figure 9, the text prompt structure is illustrated in Figure 8, and concrete examples (e.g., bank heist scenario) are provided throughout the paper.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Inference parameters are reported: temperature=0.7, top_p=0.95, 30 QPM rate limit, 2s delay between requests, 50 requests per run.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The pipeline is described: fixed text prefix + MP3 audio file submitted via generate_content, outputs logged to docx, then second Gemini instance evaluates compliance. The dual-model architecture is clear.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Audio specifications are documented (128 kbps MP3, 48 kHz, 16-bit PCM). The forbidden question construction follows Shen et al.'s design principles across 7 categories.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw prompts and Gemini outputs are not publicly released. Only selected examples are shown in figures. The full ai_outputs.docx log files are not shared.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The methodology section describes how prompts were constructed (Setting + Character + Plot + Flanking Attack), how audio was formatted, and how outputs were collected via API with specific rate-limiting procedures.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; all data is generated via API calls to Gemini.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The pipeline steps are described qualitatively but the code generating the prompts, making API calls, and processing responses is not released, making independent replication difficult.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The Gemini 1.5 Flash snapshot date (December 2024) is stated, but no training data cutoff is provided. The forbidden question set was adapted from Shen et al. (2024), which could have been in Gemini's training data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether the forbidden question templates (derived from prior published work by Shen et al. 2024 and others) may have been seen during Gemini's training, which could affect baseline defense behavior.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The forbidden question bank is based on Shen et al.'s dataset published before Gemini's December 2024 snapshot. No discussion of whether familiarity with these question types affects the reported ASR.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No API cost or latency information is reported despite the study relying entirely on paid API calls to Gemini across 2,100 prompts plus the secondary evaluator calls.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Google Colab hardware specs are mentioned (2 vCPUs, 12 GB RAM) but no total compute cost, API usage cost, or wall-clock time is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Flanking Attack is the first voice-based jailbreak attack against multimodal LLMs.", + "evidence": "Claimed in abstract and introduction; comparison is drawn to text-based and multilingual attacks, and the paper cites Shen et al.'s voice jailbreak against GPT-4o as related but distinct prior work.", + "supported": "moderate" + }, + { + "claim": "Flanking Attack achieves an average ASR of 0.81 across seven forbidden scenarios against Gemini 1.5 Flash.", + "evidence": "Table I reports per-scenario ASRs for Configuration 1 ranging from 0.67 (Misinformation) to 0.93 (Illegal Activities), with an average of 0.81 across 2,100 prompts.", + "supported": "moderate" + }, + { + "claim": "Each component of the attack (Text Prompt, Setting/Character/Plot, Flanking Attack) contributes independently to ASR improvement.", + "evidence": "Ablation study (Table I) shows ASR drops from 0.81 (full) → 0.57 (no Flanking) → 0.28 (no Text Prompt) → 0.12 (Plot only), supporting additive contribution.", + "supported": "strong" + }, + { + "claim": "The semi-automated Gemini self-evaluation approach is an effective substitute for manual inspection of policy violations.", + "evidence": "The paper asserts self-evaluation provides 'more subjective and compatible results for policy violation detection' but presents no systematic comparison with human ground-truth labels.", + "supported": "weak" + }, + { + "claim": "Fictional framing reduces Gemini's sensitivity to harmful content by exploiting surface-level context cues rather than deep semantic analysis.", + "evidence": "Qualitative examples show Gemini adding fictional disclaimers while still producing detailed harmful content; no formal mechanism analysis is provided.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "The Flanking Attack embeds adversarial queries within sequences of benign voice prompts, achieving a 0.81 mean ASR against Gemini 1.5 Flash across seven forbidden content categories. A 4-configuration ablation confirms each attack component (fictional text framing, character/plot context, audio flanking) contributes additively to bypassing Gemini's content moderation. The primary evaluation relies on Gemini self-assessing its own outputs for policy compliance, which introduces a circularity concern. All results are from a single model (Gemini 1.5 Flash, December 2024) tested in English only, limiting generalizability despite broad claims about 'multimodal LLMs.'", + "red_flags": [ + { + "flag": "Single-model overgeneralization", + "detail": "All experiments run exclusively on Gemini 1.5 Flash (one snapshot), yet the paper's title, abstract, and conclusions refer to 'multimodal LLMs' generally, implying broader applicability that is not demonstrated." + }, + { + "flag": "Circular self-evaluation", + "detail": "The primary metric (ASR) is measured by asking Gemini to evaluate whether its own outputs violated Gemini's usage policy. This creates a fundamental validity concern: the same model's safety behavior determines both the attack and the evaluation outcome." + }, + { + "flag": "No competitive baselines", + "detail": "No comparisons to other published jailbreak methods (Crescendo, sandwich attack, FigStep, etc.) are included. The only comparisons are ablations of the authors' own method, making relative effectiveness claims unsupported." + }, + { + "flag": "No statistical rigor", + "detail": "Point estimates only in Table I, with no confidence intervals, standard deviations, or significance tests despite the paper making comparative claims across configurations and categories." + }, + { + "flag": "No code or data release despite 'replicable framework' claim", + "detail": "The paper claims to establish a 'replicable testing framework' but releases no code, no full forbidden question set, and no raw output logs. The core audio file (breakAuthorisation.mp3) is not shared." + }, + { + "flag": "Inconsistent attack name", + "detail": "The paper alternates between 'Flanking Attack' and 'Franking Attack' in multiple places (e.g., 'the Franking Attack' in the contributions section), suggesting the paper was not carefully proofread and may have methodological inconsistencies." + }, + { + "flag": "Contamination risk unaddressed", + "detail": "The forbidden question bank is adapted from Shen et al. (2024), published before the December 2024 Gemini snapshot. The training data cutoff is not stated, and there is no discussion of whether the model's baseline defense behavior reflects familiarity with these specific question patterns." + } + ], + "cited_papers": [ + { + "title": "'Do Anything Now': Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models", + "relevance": "Primary methodological inspiration for forbidden question set design; this work's approach is directly built upon" + }, + { + "title": "Jailbroken: How does LLM safety training fail?", + "relevance": "Foundational work on LLM jailbreak mechanisms cited as motivation" + }, + { + "title": "Voice Jailbreak Attacks against GPT-4o", + "relevance": "Closest prior work; this paper directly extends to Gemini and the Flanking Attack structure" + }, + { + "title": "Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs", + "relevance": "Prior work on flanking-style multilingual attacks that inspired the Flanking Attack's sequential structure" + }, + { + "title": "Comprehensive Assessment of Jailbreak Attacks against LLMs", + "relevance": "Benchmark paper providing systematic evaluation framework for jailbreak methods" + }, + { + "title": "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks", + "relevance": "Broad taxonomy of LLM attack surfaces used to contextualize multimodal vulnerabilities" + }, + { + "title": "FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts", + "relevance": "Prior work on visual modality jailbreaks that motivates the question of whether audio similarly exposes new attack surfaces" + }, + { + "title": "Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey", + "relevance": "Survey contextualizing this work's contribution within multimodal jailbreak research" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Security researchers and AI developers can use the Flanking Attack framework to red-team audio-capable LLM deployments, though limited to the Gemini API." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Voice-based attacks on multimodal LLMs are a novel angle, but the underlying insight (fictional framing weakens moderation) is well-established in the jailbreak literature." + }, + "fear_safety": { + "score": 3, + "justification": "Demonstrates that a production AI system (Gemini) can be manipulated into producing detailed instructions for bank robbery, terrorism, and other harms with a simple audio framing trick achieving 93% success rate." + }, + "drama_conflict": { + "score": 2, + "justification": "Direct empirical attack on Google's Gemini showing real harmful outputs (Figures 10, 17, 21) is newsworthy and frames Google's safety claims unfavorably." + }, + "demo_ability": { + "score": 2, + "justification": "Anyone with a Gemini API key and Python could replicate the basic setup from the described methodology, though the specific audio file is not released." + }, + "brand_recognition": { + "score": 2, + "justification": "The attack specifically targets Google's Gemini, a high-profile product, lending name recognition; authors are from recognized universities." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/do-prompts-reshape-2025/scan-v5.json b/papers/do-prompts-reshape-2025/scan-v5.json @@ -0,0 +1,527 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings", + "authors": [ + "Cesar Gonzalez-Gutierrez", + "Dirk Hovy" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2510.19694", + "doi": "10.48550/arXiv.2510.19694" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims prompting affects representations but changes don't consistently correlate with prompt relevance — both are directly supported by the probing experiments in Section 3 across multiple models and datasets.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The controlled design (same samples, varied prompts, same model) is adequate for the narrow causal claim that prompting modifies representations; the static prompt ablation (Table 4) further isolates the mechanism.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "The Limitations section explicitly states findings 'may not generalize to larger, instruction-tuned models' and that 'generalizability to other tasks... remains an open question.'", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5 discusses three alternative explanations for the unexpected behavior: embedding-level perspective may be too limited, models may be insufficiently pre-trained, and instruction fine-tuning may be necessary.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper is explicit that MaxEnt probe performance is used as a proxy for 'representation quality' and introduces task alignment as a complementary metric; the distinction between probe performance and actual task performance is acknowledged throughout.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "There is a dedicated 'Limitations' section covering the static embedding perspective, small pre-training corpora relative to modern LLMs, and restricted task/dataset scope.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are named: models pre-trained on 'relatively small corpora compared to those used for modern large-scale models,' and results confined to 'a limited set of classification tasks and datasets such as toxicity detection, sentiment analysis, and topic classification.'", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it does not explain why the behavior occurs, and that findings may not extend to larger instruction-tuned models or tasks with more complex output spaces.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed in Acknowledgments: ERC Horizon 2020 grant No 853459, EU ERDF/Comunitat Valenciana compute resources, and AGAUR recognition 2021SGR-Cat (01266 LQMC).", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly listed on the first page: Polytechnic University of Catalonia (Gonzalez-Gutierrez) and Bocconi University (Hovy).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "ERC and EU ERDF are independent public research funders with no commercial stake in whether prompt relevance improves or fails to improve representations.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is present anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "ICL, probing, zero-shot prompting, prompt templates, and 'representation quality' (operationalized as probe classifier performance) are all defined in Sections 1-2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are listed at the end of Section 1: empirical comparison of representation quality across prompt types, demonstration that prompting contextualizes representations, and the finding that prompt relevance does not predict representation quality changes.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 4 explicitly contrasts with Park et al. 2025 (LMs producing new in-context representations vs. improving existing ones) and Kirsanov et al. 2025 (class separability in large models on synthetic data vs. probing on natural benchmarks).", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository URL or release is mentioned anywhere in the paper.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All datasets used (IMDB, AG News, Wiki Toxic, RTE, Adversarial NLI, etc.) are standard publicly available benchmarks sourced from HuggingFace Datasets as noted in Table 5.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or specific software environment is provided; only model papers are cited without specifying versions or package dependencies.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the paper describes methodology in general terms but not how to replicate the experiments from scratch.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Statistical significance (p-values via bootstrap) is reported but confidence intervals or error bars are not shown on the primary probing results in Figure 1 or Table 6.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Bootstrap sampling statistics (Berg-Kirkpatrick et al., 2012) via the boostsa library are used to compute p-values for probe performance differences, reported at p<0.05 and p<0.01 levels.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Raw performance numbers are reported but no standardized effect sizes (Cohen's d, eta-squared) are calculated; absolute differences are typically sub-1% making practical significance unclear.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Dataset sizes are determined by the benchmarks used; no power analysis or justification is given for why these particular datasets or the number of prompt templates (5 per task) were chosen.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Standard deviations are reported in Table 2 for task alignment scores, but the primary probing results in Figure 1 and Table 6 do not include variance or spread measures.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Two principled baselines are used: unmodified input ('None' prompt) and five random word prompts to control for the effect of simply adding tokens.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "The baselines (no prompt and random prompt) are appropriate and principled for this type of representation analysis; the random baseline echoes Lu et al. 2024.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 3.2 contains four ablation studies: representation choice (pooling strategies, CLS vs average), task alignment as alternative metric, prompt structure (masked tokens, [SEP] separator), and static vs. contextual prompts.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both probe performance (MaxEnt classifier accuracy/F1) and task alignment scores are used; Table 3 verifies strong correlation between the two metrics (Spearman ρ=0.84).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "This is a computational study of embedding representations with no human evaluation component needed.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Probes are trained on train partitions and evaluated on held-out test partitions of each dataset as described in Section 2.2 and Table 5.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by task (toxicity, sentiment, topic, NLI), dataset, model architecture, and representation strategy throughout Figure 1 and Tables 2, 6, and 7.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly discusses cases where relevant prompts degrade performance (GPT-2 consistently degrades, RTE shows decline with most prompts) as central findings, not buried in appendices.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The entire paper is a negative result: the hypothesis that relevant prompts improve representations is not supported, reported transparently as the main contribution.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "BERT, RoBERTa, and GPT-2 are cited by their original papers but specific checkpoint names (e.g., bert-base-uncased vs. bert-large) and parameter counts are never specified.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "All 26 prompt templates (5 per task × 4 tasks, 5 random, 1 no-prompt) are fully provided in Table 1 with exact wording.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "The probe classifier type (MaxEnt with L2 regularization) is mentioned but the regularization strength C and other hyperparameters are not specified.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding involved; this is a probing study on pretrained model embeddings.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Tokenization and embedding strategies are described in detail: layer selection (last vs. second-to-last), token pooling (CLS vs. average vs. weighted average for GPT-2), and template application method (substitution into placeholders) are all specified.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All datasets are standard publicly available benchmarks accessible via HuggingFace Datasets; dataset URLs are provided in footnotes.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Dataset sources, number of classes, class distribution, average sequence length, and train/test split sizes are documented in Table 5.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants or recruitment; standard benchmark datasets are used.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from input text → template application → tokenization → embedding generation → probe training → test evaluation is described step-by-step in Section 2.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for BERT, RoBERTa, and GPT-2 are not stated, and the possibility that evaluation datasets were in their pre-training corpora is not addressed.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether pre-training corpora of BERT/RoBERTa/GPT-2 overlap with IMDB, AG News, or other evaluation datasets, which could inflate probe performance baselines.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Widely-used datasets like IMDB and AG News were likely present in pre-training corpora of BERT-era models published in 2019; this potential contamination is not discussed despite being directly relevant to probing conclusions.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost, latency, or GPU hours are reported; only the qualitative statement that experiments can run on 'mid-sized hardware' is provided.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "The ARTEMISA compute resource is acknowledged in the Acknowledgments but no specific compute budget (GPU hours, node-hours, total cost) is stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Prompting modifies sentence-level representations primarily through token contextualization, not by token addition alone", + "evidence": "The static prompt ablation (Table 4) shows that averaging template and sample embeddings without contextualization eliminates prompting effects, confirming contextualization is the operative mechanism", + "supported": "strong" + }, + { + "claim": "Relevant prompts do not consistently produce better representations than irrelevant or random prompts", + "evidence": "Figure 1 and Table 6 show no consistent pattern across tasks, datasets, or models: random prompts sometimes outperform relevant ones, and relevant prompts sometimes degrade probe performance relative to baseline", + "supported": "strong" + }, + { + "claim": "The effect of prompting on representations is highly model- and dataset-dependent", + "evidence": "BERT shows improvements with any prompt on Wiki Toxic/IMDB; RoBERTa behavior varies by dataset; GPT-2 consistently degrades — no single cross-model pattern holds", + "supported": "strong" + }, + { + "claim": "Task alignment and probing performance are strongly correlated, reflecting the same underlying representational change", + "evidence": "Table 3 reports Pearson r=0.75 and Spearman ρ=0.84 between task alignment and probe performance (both p<10⁻¹⁹), suggesting the two metrics capture the same phenomenon", + "supported": "strong" + }, + { + "claim": "Random prompts can improve probe performance over the no-prompt baseline, contradicting intuition", + "evidence": "Results throughout Figure 1 and Table 6 show statistically significant improvements from random prompts in several dataset-model combinations, echoing Lu et al. 2024", + "supported": "strong" + }, + { + "claim": "Using smaller, non-instruction-tuned models may be insufficient to observe the hypothesized alignment between prompt relevance and representation quality", + "evidence": "Acknowledged as a limitation in Section 5: BERT/RoBERTa/GPT-2 pre-training corpora are much smaller than modern LLMs and no instruction fine-tuning was applied", + "supported": "moderate" + } + ], + "methodology_tags": [ + "observational", + "benchmark-eval" + ], + "key_findings": "Prompting alters sentence-level representations through token contextualization rather than mere token addition, as confirmed by a static prompt ablation where embedding averaging without contextualization eliminates all prompting effects. However, across three model architectures (BERT, RoBERTa, GPT-2), eight datasets (toxicity, sentiment, topic, NLI), and multiple pooling strategies, there is no consistent pattern showing that task-relevant prompts produce better embeddings than irrelevant or random prompts — directly refuting the paper's initial hypothesis. Random prompts sometimes outperform relevant ones, and relevant prompts sometimes degrade performance. The authors discuss three possible explanations: the embedding-level view may be too limited, the models may be too small and undertrained, or instruction fine-tuning may be necessary to produce prompt-aligned representations.", + "red_flags": [ + { + "flag": "Model variants unspecified", + "detail": "BERT, RoBERTa, and GPT-2 are cited by paper but specific checkpoint names (e.g., bert-base-uncased vs. bert-large) and parameter counts are never stated, making exact reproduction difficult." + }, + { + "flag": "No code released", + "detail": "No code repository is linked; with multiple models, pooling strategies, and datasets, reproduction requires guessing implementation decisions not documented in the paper." + }, + { + "flag": "Probe hyperparameters missing", + "detail": "MaxEnt classifier with L2 regularization is used for all probing but the regularization strength C is never specified, which could substantially affect results." + }, + { + "flag": "Pre-training contamination unaddressed", + "detail": "IMDB, AG News, and other evaluation datasets were widely available before BERT/RoBERTa/GPT-2 pre-training; the possibility that these datasets appear in pre-training corpora is not discussed, despite being directly relevant to baseline probe performance levels." + }, + { + "flag": "Tiny absolute effect sizes", + "detail": "Most probe performance differences between prompts are <1% absolute (e.g., 60.25 vs 61.55 F1+%), making practical significance questionable even where statistical significance is established via bootstrap." + } + ], + "cited_papers": [ + { + "title": "Language Models are Few-Shot Learners (Brown et al., 2020)", + "relevance": "Foundational paper establishing prompting as a paradigm and ICL; central reference for in-context learning claims throughout." + }, + { + "title": "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)", + "relevance": "One of three models used in experiments; defines the MLM pre-training objective and CLS token strategy studied." + }, + { + "title": "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP (Liu et al., 2023)", + "relevance": "Survey establishing the prompting pipeline formalism used as the conceptual framework for the experimental setup." + }, + { + "title": "In-context learning of representations (Park et al., 2025)", + "relevance": "Closest related work; explicitly contrasted — they study LMs producing new in-context representations while this paper studies improvement of existing ones via prompting." + }, + { + "title": "The geometry of prompting: Unveiling distinct mechanisms of task adaptation in language models (Kirsanov et al., 2025)", + "relevance": "Direct related work studying representational changes from prompting using class separability in large autoregressive models on synthetic datasets." + }, + { + "title": "Strings from the library of babel: Random sampling as a strong baseline for prompt optimisation (Lu et al., 2024)", + "relevance": "Prior work showing random prompts can be surprisingly effective, corroborated and extended by this paper's findings." + }, + { + "title": "In-context learning and induction heads (Olsson et al., 2022)", + "relevance": "Mechanistic interpretation of ICL via attention head circuits, providing theoretical background for the ICL mechanisms studied." + }, + { + "title": "Analysis methods in neural language processing: A survey (Belinkov and Glass, 2019)", + "relevance": "Survey of probing methodology that this paper builds upon as its primary analysis technique." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Practitioners using prompt engineering need to understand whether prompt wording affects internal representations, but the inconsistency finding provides limited actionable guidance." + }, + "surprise_contrarian": { + "score": 3, + "justification": "Directly challenges the widely-held assumption that more relevant prompts produce better internal representations — the foundational intuition behind much prompt engineering practice." + }, + "fear_safety": { + "score": 0, + "justification": "No safety or risk implications; this is a mechanistic understanding study of embedding spaces." + }, + "drama_conflict": { + "score": 1, + "justification": "The negative result is notable but not controversial enough to generate community conflict; the authors are measured in their claims." + }, + "demo_ability": { + "score": 2, + "justification": "Public datasets and model weights are available via HuggingFace; a practitioner could replicate the basic probing setup, though missing hyperparameters limit exact reproduction." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are at UPC and Bocconi, not major AI labs; ERC-funded European academic work with no industry brand recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42898914", + "title": "Gradual Disempowerment: How Even Incremental AI Progress Poses Existential Risks", + "points": 87, + "comments": 84, + "url": "https://news.ycombinator.com/item?id=42898914", + "created_at": "2025-02-01T15:12:22Z" + }, + { + "hn_id": "38036218", + "title": "Zephyr 7B", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38036218", + "created_at": "2023-10-27T09:06:34Z" + }, + { + "hn_id": "25604385", + "title": "Learning from Heterogeneous EEG Signals with Differentiable Channel Reordering", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=25604385", + "created_at": "2021-01-01T16:33:05Z" + }, + { + "hn_id": "42915646", + "title": "Stack Overflow Meets Replication: Security Research Amid Evolving Code Snippets", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42915646", + "created_at": "2025-02-03T06:49:46Z" + } + ], + "top_points": 87, + "total_points": 94, + "total_comments": 84 + } +} +\ No newline at end of file diff --git a/papers/do-we-truly-2025/scan-v5.json b/papers/do-we-truly-2025/scan-v5.json @@ -0,0 +1,572 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute", + "authors": [ + "Jianhao Chen", + "Zishuo Xun", + "Bocheng Zhou", + "Han Qi", + "Qiaosheng Zhang" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2504.00762", + "doi": "10.48550/arXiv.2504.00762" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims (outperforms self-consistency, beats multi-agent debate, reduces costs, requires few comparable LLMs, extends with verification) are backed by experiments in Figures 4-7 and Table 3.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation studies (Table 4) isolate the contribution of weighted voting components, and controlled experiments fix sampling budget across conditions to attribute improvements to ModelSwitch's switching mechanism.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion claims 'a practical and generalizable solution for various reasoning and knowledge-based tasks,' but 5 of 7 evaluation datasets are math benchmarks; the evidence for general task applicability is weak.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5 provides formal analysis showing why multi-LLM mixing outperforms single-LLM, including counterexamples and conditions under which errors from different models counteract each other.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Claims are accuracy on benchmark test sets, which is exactly what is measured; no proxy mismatch exists.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section anywhere in the paper.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats to validity are discussed; the paper does not address benchmark contamination, model API non-determinism, or the manual tuning of external weights per dataset.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not state what the results do NOT show — e.g., no discussion of inapplicability to open-ended generation, tasks without definitive answers, or real-world latency constraints.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement or acknowledgments section appears in the provided paper text.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are clearly listed on the title page (Nanjing University, Shanghai AI Laboratory, University of Auckland, Penn State).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "No funding is disclosed, so independence cannot be verified; Shanghai AI Laboratory has institutional interests in LLM efficiency research.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are carefully defined: 'consistency' is explicitly defined as entropy of generated answers (Section 2), and 'ModelSwitch' is specified via Algorithm 1 with internal/external weights.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly lists four contributions in the introduction: empirical consistency-accuracy analysis, ModelSwitch method, experimental evaluation, and theoretical analysis.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 6 discusses generation-verification paradigm and multi-agent collaboration, showing how ModelSwitch differs from self-consistency, debate methods, and model routing, with specific comparisons to MAD, MOA, ChatEval, AgentVerse.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code is available at https://github.com/JianhaoChen-nju/ModelSwitch per the paper's abstract.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All seven evaluation datasets (GSM8K, MATH, MathBench, MGSM, DATE, MMLU-Pro, AIME24) are standard publicly available benchmarks used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper describes compute hardware (Ubuntu 22.04, 8 A100 GPUs) but does not provide a requirements.txt, Dockerfile, or equivalent dependency specification.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Algorithm 1 describes the core method and hyperparameters are provided in Appendix A, but no step-by-step reproduction instructions are given in the paper itself.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in figures and tables are point estimates with no confidence intervals or error bars reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used for any of the comparative claims between ModelSwitch and baselines.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute accuracy improvements are consistently reported with baseline context (e.g., '10.2-point increase over best single LLM on MMLU-Pro,' '34% average sample reduction').", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Benchmark sizes are stated (e.g., MATH: 500 problems, MMLU-Pro: 500 random samples) but no justification or power analysis is provided for why these are adequate.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance or standard deviation across experimental runs is reported; all results are single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Baselines include self-consistency for each individual LLM and five multi-agent methods (MAD, AgentVerse, ChatEval, MAD-MLD, MOA) compared under equal sampling budgets.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All baselines use current state-of-the-art models (GPT-4o mini, Gemini 1.5 Flash, GPT-4o, Gemini 1.5 Pro) and recent methods (MOA 2024, MAD 2024).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 4 ablates the weighted voting algorithm by removing internal weights, external weights, and both, showing each component contributes positively.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "The paper reports accuracy across 7 datasets plus efficiency metrics (actual sampling count, cost in dollars from Table 3).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "NA — paper evaluates LLM performance on closed-form benchmarks with definitive answers; human evaluation is not relevant.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "All evaluations use established held-out test splits of publicly available benchmarks.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down across all 7 datasets individually in Figure 5 and Table 4; the scaling analysis in Figure 6 is shown per dataset.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No failure cases or error analysis is provided; the paper does not discuss when or why ModelSwitch fails to improve over single-LLM sampling.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Section 4.4 reports that scaling beyond 2-3 models hurts or plateaus performance (e.g., DATE drops from 78.6% to 76.4% with 6 models), and AIME24 Appendix shows no improvement at budget=16.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model versions are named and cited (GPT-4o mini [26], Gemini 1.5 Flash [27], Claude 3 Haiku [28], Llama-3.1-8B-Instruct [29], Gemma-2-9B-It [30], Qwen2.5-7B-Instruct [9]).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The paper states queries are in 'COT format by default' citing [25] but does not provide actual prompt templates or system instructions.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature and top_p for GPT-4o mini are set to 1; external weights Wβ are given in Tables 1 and 2 for each model-dataset combination; other models use stated defaults.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 fully describes the ModelSwitch procedure including sequential model querying, consistency check, early stopping, and weighted voting aggregation.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Appendix A.1 specifies exactly which subsets were used: 300-question Arith subset of MathBench, 500 random questions from MMLU-Pro, 1000 sampled MGSM questions.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The paper states code and data are on GitHub but does not explicitly confirm that raw model outputs/responses are included for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Benchmark datasets are all publicly sourced and their provenance is cited; sampling procedure (K samples per model via API/inference) is described in the algorithm.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "NA — no human participants; evaluation uses standard automated benchmarks.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from benchmark query to answer extraction, entropy calculation, switching decision, and final voting is documented in Algorithm 1 and surrounding text.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs for any of the evaluated models (GPT-4o mini, Gemini 1.5 Flash, etc.) are not stated in the paper.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of potential train/test overlap; AIME24 (2024 competition problems) is particularly high-risk for contamination given the models' training windows.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Contamination is not addressed; AIME24 2024 problems were publicly available before the training cutoffs of most evaluated models.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "NA — no human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 3 explicitly reports API costs in dollars for self-consistency vs. ModelSwitch across 6 datasets, showing 15-48% cost reductions.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": true, + "justification": "Appendix A.1 states compute resources (Ubuntu 22.04, 1600GB RAM, 8 NVIDIA A100 GPUs) and notes minimum requirements for smaller deployments.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "ModelSwitch outperforms single-LLM self-consistency in accuracy across all 7 benchmarks tested", + "evidence": "Figure 4 shows accuracy improvements over best individual LLM self-consistency on GSM8K, MATH, MathBench, MGSM, DATE, MMLU-Pro under equal sampling budgets", + "supported": "strong" + }, + { + "claim": "Consistency (entropy) universally correlates with accuracy across diverse LLMs and datasets", + "evidence": "Figure 2 reports r=0.61-0.96 (all p<0.001) across 6 LLMs on MATH and MathBench, extending the self-consistency finding to multiple models", + "supported": "strong" + }, + { + "claim": "ModelSwitch reduces average actual sampling count by 34% while improving accuracy", + "evidence": "Section 4.2 reports average sampling counts of 9.2-13.4 from a budget of 16 across 6 datasets; Table 3 shows 15-48% cost reductions", + "supported": "strong" + }, + { + "claim": "ModelSwitch achieves state-of-the-art performance against multi-agent debate methods on 4/7 datasets", + "evidence": "Figure 5 compares against MAD, AgentVerse, ChatEval, MAD-MLD, MOA under equal 15-sample budget; MMLU-Pro shows 63.2% vs best competitor 52.6% (MOA)", + "supported": "strong" + }, + { + "claim": "Only 2-3 comparable LLMs are needed; adding more models beyond this does not improve and may hurt performance", + "evidence": "Figure 6 shows performance plateaus or declines going from 2 to 6 models on MathBench and DATE under both strong-to-weak and weak-to-strong orderings", + "supported": "moderate" + }, + { + "claim": "ModelSwitch with two weak LLMs can match performance of a single much larger model", + "evidence": "Figure 1 shows 9B+8B open-source combination (69%) matches Llama-3.1-70B (68.7%) on MathBench with only 7 samples", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "theoretical" + ], + "key_findings": "ModelSwitch achieves a strong positive correlation between answer consistency (entropy) and accuracy across six diverse LLMs, and exploits this by switching to a second model when the first produces inconsistent answers. On seven benchmarks, ModelSwitch using two lightweight models (GPT-4o mini + Gemini 1.5 Flash) outperforms both individual models under self-consistency and five multi-agent debate methods, with a 10.2pp gain on MMLU-Pro and 34% average sampling reduction. The approach generalizes to reasoning LLMs (AIME24) and integrates cleanly with reward model verification. Theoretical analysis provides sufficient and necessary conditions for when mixing two models improves over single-model majority voting.", + "red_flags": [ + { + "flag": "No variance or CIs", + "detail": "All results are single-run point estimates with no confidence intervals, error bars, or significance tests — comparative claims cannot be statistically evaluated." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "AIME24 (2024 competition) problems were publicly available before training cutoffs of GPT-4o mini and Gemini 1.5 Flash; no contamination analysis is provided." + }, + { + "flag": "External weights manually tuned per dataset", + "detail": "Tables 1 and 2 show Wβ weights that vary by model and dataset — these appear hand-tuned, creating potential for over-fitting to specific benchmarks and inflating reported gains." + }, + { + "flag": "Math-heavy benchmark selection", + "detail": "5 of 7 datasets are mathematics benchmarks; claims of generalizability to 'various reasoning and knowledge-based tasks' are not well-supported." + }, + { + "flag": "No limitations section", + "detail": "The paper has no dedicated limitations or threats-to-validity section; failure modes (e.g., tasks without definitive answers, high-latency scenarios) are not discussed." + }, + { + "flag": "No funding disclosure", + "detail": "No acknowledgments or funding source is declared despite institutional affiliations with Shanghai AI Laboratory." + } + ], + "cited_papers": [ + { + "title": "Self-consistency improves chain of thought reasoning in language models", + "relevance": "Primary baseline — ModelSwitch is built on and compared against self-consistency throughout" + }, + { + "title": "Large language monkeys: Scaling inference compute with repeated sampling", + "relevance": "Motivates the repeated sampling paradigm and establishes that scaling samples improves coverage" + }, + { + "title": "Improving factuality and reasoning in language models through multiagent debate", + "relevance": "MAD — primary multi-agent debate baseline compared in Figure 5" + }, + { + "title": "Mixture-of-agents enhances large language model capabilities", + "relevance": "MOA — strongest multi-agent baseline (52.6% on MMLU-Pro vs ModelSwitch's 63.2%)" + }, + { + "title": "MMLU-Pro: A more robust and challenging multi-task language understanding benchmark", + "relevance": "Key evaluation benchmark where ModelSwitch shows largest relative gain (+10.2pp)" + }, + { + "title": "Scaling LLM test-time compute optimally can be more effective than scaling model parameters", + "relevance": "Motivates test-time compute scaling as alternative to training-time scaling" + }, + { + "title": "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning", + "relevance": "Reasoning LLM used in AIME24 experiments; shows self-consistency improvements with R1" + }, + { + "title": "If multi-agent debate is the answer, what is the question?", + "relevance": "Co-authored by Hangfan Zhang (paper co-author); provides critical analysis of debate methods motivating ModelSwitch's approach" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable with existing API models — no training required, code released, and costs are quantified in dollars." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the assumption that more samples from one model is optimal and that multi-agent debate adds value over simple switching." + }, + "fear_safety": { + "score": 0, + "justification": "No safety or risk implications raised." + }, + "drama_conflict": { + "score": 1, + "justification": "Implicitly critiques the multi-agent debate paradigm by showing a simpler approach outperforms it substantially on MMLU-Pro." + }, + "demo_ability": { + "score": 3, + "justification": "Any developer with API keys for GPT-4o mini and Gemini Flash can run this immediately using the released code." + }, + "brand_recognition": { + "score": 2, + "justification": "Uses GPT-4o, Gemini, and Claude models as both baselines and components, lending name recognition; Shanghai AI Lab is a prominent institution." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "40932006", + "title": "An abundance of Katherines: The game theory of baby naming", + "points": 288, + "comments": 148, + "url": "https://news.ycombinator.com/item?id=40932006" + }, + { + "hn_id": "44052041", + "title": "Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)", + "points": 152, + "comments": 179, + "url": "https://news.ycombinator.com/item?id=44052041" + }, + { + "hn_id": "43417530", + "title": "Neurosymbolic Decision Trees", + "points": 42, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43417530" + }, + { + "hn_id": "39986540", + "title": "A Survey on Red Teaming for Generative Models", + "points": 16, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39986540" + }, + { + "hn_id": "43986826", + "title": "Bang for the Buck: Vector Search on Cloud CPUs", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43986826" + }, + { + "hn_id": "31032132", + "title": "A Study of Real-World Data Races in Golang", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31032132" + }, + { + "hn_id": "43905563", + "title": "(How) Do reasoning models reason?", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43905563" + }, + { + "hn_id": "46386776", + "title": "LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46386776" + }, + { + "hn_id": "43751796", + "title": "(How) Do reasoning models reason?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43751796" + }, + { + "hn_id": "44179940", + "title": "Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44179940" + } + ], + "top_points": 288, + "total_points": 517, + "total_comments": 327 + } +} +\ No newline at end of file diff --git a/papers/does-ai-code-2025/scan-v5.json b/papers/does-ai-code-2025/scan-v5.json @@ -0,0 +1,553 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Does AI Code Review Lead to Code Changes? A Case Study of GitHub Actions", + "authors": [ + "Kexin Sun", + "Hongyu Kuang", + "Sebastian Baltes", + "Xin Zhou", + "He Zhang", + "Xiaoxing Ma", + "Guoping Rong", + "Dong Shao", + "Christoph Treude" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2508.18771", + "doi": "10.48550/arXiv.2508.18771" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims are supported: 22,326 comments in 178 repos are reported in Table IV, wide effectiveness variation (0.9%–19.2%) in Table VIII, and SHAP analysis in Table X supports the design factors claim.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The title frames the question causally ('Lead to Code Changes') and conclusions use directional language, but the study is purely observational; the authors explicitly acknowledge in Section VI that 'interpretations describe associations... not causal effects.'", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section VI explicitly notes findings may not generalize to large-scale projects (most are ≤50 non-bot contributors), non-English repositories, or tools beyond the 16 studied; the language filter excluded 75% of comments.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not seriously consider that coderabbitai's better addressing rate could reflect project-type selection bias (e.g., repos adopting it being more review-mature) rather than tool design; confounding between tool choice and repo characteristics is unaddressed.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper clearly distinguishes that 'file-level code changes' is the proxy for 'comment addressed,' and explicitly acknowledges in Section VI that this 'does not capture the degree or impact of the resulting code changes.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section VI 'Threats to Validity' is a dedicated section covering construct, internal, and external validity with multiple specific threats.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include: language filter removing 75% of comments (32 Korean-only repos excluded), github-action[bot] attribution ambiguity, restriction to small/medium projects (≤50 non-bot contributors), and data collection freeze at Feb 2025.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states findings may not generalize to very large projects, non-English repositories, or tools not in their sample; RQ3 analysis is limited to 4 of the 16 actions.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure appears anywhere in the provided paper text.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All nine authors have institutional affiliations listed (Nanjing University, University of Bayreuth, Singapore Management University) with corresponding emails.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: 'hunk' (a contiguous block of differing lines, citing GNU diffutils), 'addressing' (whether a comment led to code changes), and the three granularity levels (PR/file/hunk) are illustrated with examples in Figs 1–2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are listed in Section I: (1) systematic adoption study, (2) LLM-assisted framework for assessing comment addressing, (3) interpretable factor analysis via Random Forest + SHAP.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section VII organizes related work into two thematic areas (AI code review automation and developer response to automated feedback), directly contrasting this study with prior work on human review usefulness factors and GenAI code review quality evaluation.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "The appendix [12] is described as 'to be published on Zenodo after acceptance' — this is a conditional promise of future release, not an actual release.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Dataset is likewise promised 'to be published on a preserved archive after acceptance'; not available at time of submission.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Python libraries used (FastText, PyYAML, difflib) are mentioned in passing but no requirements.txt, Dockerfile, or version-pinned dependency list is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Step-by-step instructions are referenced as part of the online appendix (not yet published); the paper text alone is insufficient to reproduce the pipeline without guessing implementation details.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Addressing rates and accuracy figures are reported as point estimates only; no confidence intervals or error bars accompany main results despite running LLMs five times.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Fisher's exact test is used to compare addressing rates across trigger modes and LLM series in Table XI, with p-values reported.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are implicit in the addressing-rate comparisons (e.g., 60% human vs 0.9%–19.2% AI; 6.8% auto vs 12.8% manual for ID-1), and SHAP importance values quantify feature contribution magnitude.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 150-comment annotation sample (50 per category) is stated without power analysis or justification for why this size is sufficient for the inter-rater agreement estimates.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "LLMs were run five times for 'robust evaluation' but variance or standard deviation across runs is never reported; only single accuracy figures are presented.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Human-authored review comments from the same repositories and time periods serve as a direct baseline; 60% human addressing rate vs 0.9–19.2% AI is the central comparison.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Human reviews are collected from the same repositories during the same time window as AI-generated comments, making them contemporaneous and ecologically valid comparators.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "The paper compares model combinations for RQ2 (cross-combining top Stage-1 and Stage-2 models) and compares Random Forest vs logistic regression, but there is no ablation of the feature set or pipeline components.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics are used: overall accuracy, Cohen's κ (per-stage and full 6-class), addressing rates, SHAP importance values, and Fisher's exact test p-values across different breakdowns.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Two independent annotators (one author + one external graduate student) labeled 150 sampled comments with inter-rater agreement measured by Cohen's κ (0.674–0.764); a third author resolved disagreements.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "The Random Forest classifier uses an 80/20 train/test split, and the 150 annotated comments serve as the held-out test set for evaluating the LLM classification framework.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by review granularity (PR/file/hunk), by individual action ID, by trigger mode (auto/manual), by LLM series (GPT-3.5/GPT-4), and by author experience bins.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section V.A discusses failure modes explicitly: vague outputs ('Without more context, it is difficult to provide further suggestions'), hallucinated style warnings, and generic code summaries are cited as concrete failure examples.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The central finding is that 74.9% of valid AI comments are not addressed; mattzcarey/code-review-gpt achieves only 0.9% addressing; and 37.1% of repositories declared an action but generated zero comments.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Model names are listed (gpt-4.1, o3-mini, claude-3-sonnet, deepseek-r1, etc.) but exact API snapshot versions are deferred to the unpublished appendix; no snapshot dates are given in the paper.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The Stage-1 and Stage-2 prompts are explicitly referred to the online appendix ('details of the LLM-assisted framework with specific prompts are available in the online appendix'), which is not yet published.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature=0 is explicitly reported for all LLM evaluations; five-run repetition is stated for robustness.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The two-stage LLM pipeline (Stage-1: validity detection → Stage-2: addressing assessment) is described in Section IV.B with the decoupled model selection rationale and classification scheme.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Filtering steps are detailed: language detection via FastText, exclusion of non-merged PRs, first-in-thread restriction, bot account exclusion via login name pattern, and file renaming resolution via GitHub API.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw data is promised for Zenodo 'after acceptance' but is not currently available; the GitHub link is referenced as 'to be published on a preserved archive after acceptance.'", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "GitHub REST API endpoints, search queries (e.g., 'repo:{repo_name} reviewed-by:github-actions[bot] is:pr'), filtering criteria, and diff reconstruction methods are described in detail in Section IV.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "The two human annotators are described only as 'one author and an external graduate student in software engineering who is not a co-author'; no selection criteria or recruitment process is described.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline is documented across Phases I–IV: GitHub API collection → language filtering → annotation sampling → LLM classification → Random Forest modeling, with dataset sizes at each stage reported.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "The study uses LLMs as classification tools on GitHub data, not as subjects in a benchmark evaluation; training data contamination of the benchmark is not applicable.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable — the paper is not evaluating LLM capabilities on a pre-existing benchmark that could appear in training data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "No standard benchmark is used; the evaluation dataset is newly collected from GitHub, making benchmark contamination not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "The human annotation is a methodological calibration step, not a human subjects study; pre-registration is not applicable.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects research requiring IRB approval; the annotation involves project team members rating code review comments.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable; the annotators are project personnel, not research participants in a human subjects study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Not applicable for this type of annotation task.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; no randomized experiment involving human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human subjects experiment requiring blinding.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable; no longitudinal human participant study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Multiple LLMs were run 5 times each on 150 annotated examples and then applied to 5,652 comments, but no inference cost or latency figures are reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget is stated anywhere in the paper.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Valid AI-generated code review comments are addressed at rates of only 0.9%–19.2%, compared to 60% for human reviewers.", + "evidence": "Table VIII reports per-action valid addressing rates: ID-2 at 0.9%, ID-1 at 4.2%, ID-4 at 6.5%, ID-3 at 19.2%, human at 60.0%.", + "supported": "strong" + }, + { + "claim": "Adoption of AI code review actions is highly concentrated: four actions account for 91.1% of repositories, 95.2% of pull requests, and 98.9% of comments.", + "evidence": "Table IV shows ID-1 to ID-4 dominate usage; the paper states these figures explicitly in Section IV.A results.", + "supported": "strong" + }, + { + "claim": "Manually triggered AI code review comments are more likely to be addressed than automatically triggered ones.", + "evidence": "Table XI shows 12.8% vs 6.8% for ID-1 (p≤0.05), 22.2% vs 0.5% for ID-2 (p≤0.05); SHAP directionality ρ=−0.97 for Trigger_auto in Table X.", + "supported": "strong" + }, + { + "claim": "Comments with a high code-to-text ratio (>0.5) are substantially more likely to be addressed.", + "evidence": "Table XII shows addressing rate rises to 23.2% for AI comments in the highest code-text-ratio bin vs ~4–7% for low-ratio bins; SHAP importance rank 1 among comment features (ρ=0.89).", + "supported": "strong" + }, + { + "claim": "AI-generated comments targeting experienced contributors (>1,013 prior commits) are addressed at only 3.3%, compared to 16.1% for newcomers (≤30 commits).", + "evidence": "Table XII (right) shows this pattern; SHAP directionality ρ=−0.67 for Author Prior Commits.", + "supported": "moderate" + }, + { + "claim": "The LLM-assisted two-stage classification framework achieves 86.1% overall accuracy and 76.7% Cohen's κ on the full 6-class task.", + "evidence": "Table VII reports these figures for the optimal cross-combined setup (gpt-4.1 for Stage-1, o3-mini for Stage-2) across all three comment source categories.", + "supported": "strong" + }, + { + "claim": "37.1% of mature repositories declared an AI code review action but generated zero comments.", + "evidence": "Stated explicitly in Section IV.A results with supporting numbers from Table IV (178 mature repos, many with zero comments).", + "supported": "strong" + } + ], + "methodology_tags": [ + "observational", + "case-study" + ], + "key_findings": "AI-generated code review comments are addressed at dramatically lower rates (0.9%–19.2%) than human-written ones (60%), demonstrating that current tools have limited practical impact. Tool design matters enormously: hunk-level granularity, manual triggering, and comments rich in concrete code suggestions are the strongest positive predictors of developer responsiveness, while automatically triggered file-level comments are largely ignored. A two-stage LLM pipeline (gpt-4.1 for validity detection, o3-mini for addressing assessment) achieves 86.1% accuracy and 76.7% Cohen's κ, providing a scalable method to monitor tool effectiveness. The dominant 'one-in-one-out' paradigm — generating a comment for every change regardless of merit — is identified as a root cause of low precision and developer fatigue.", + "red_flags": [ + { + "flag": "Causal framing of observational data", + "detail": "The title and conclusions use causal language ('lead to code changes', 'impact') but the design is entirely observational; the authors acknowledge associations not causation only in the threats section." + }, + { + "flag": "Severe language filter bias", + "detail": "Applying an English-language filter removed 75% of comments (12,533 of 16,762), including 32 entirely Korean repositories; findings about addressing rates may not generalize to non-English developer communities." + }, + { + "flag": "Data and code not released", + "detail": "The dataset, scripts, and prompts are all deferred to 'to be published on Zenodo after acceptance' — reproducibility is impossible without this material." + }, + { + "flag": "Tool-project confounding", + "detail": "Projects that choose coderabbitai vs anc95/ChatGPT-CodeReview may differ systematically in review culture, size, and maturity; observed tool performance differences could reflect selection bias rather than tool quality." + }, + { + "flag": "Thin annotation validation set", + "detail": "Only 150 comments (50 per category) were manually annotated to validate a framework applied to 5,652 comments; this is 2.7% of the analysis dataset, and the sample was drawn only from comments where the reviewed file was subsequently modified." + }, + { + "flag": "Cascading model error", + "detail": "LLM-assigned labels are used as ground truth for training a Random Forest classifier, stacking two sources of classification error; the 88.5% Random Forest accuracy inherits the LLM's imperfect labels." + } + ], + "cited_papers": [ + { + "title": "Modern code review: a case study at Google", + "relevance": "Foundational context on human code review practices that AI tools are augmenting or replacing." + }, + { + "title": "Characteristics of useful code reviews: An empirical study at Microsoft", + "relevance": "Prior work on what makes human code review comments useful, providing comparative basis for evaluating AI-generated comments." + }, + { + "title": "Predicting usefulness of code review comments using textual features and developer experience", + "relevance": "Directly related prior work on factors predicting comment usefulness; this paper's feature engineering is explicitly inspired by it." + }, + { + "title": "Github actions: the impact on the pull request process", + "relevance": "Studies GitHub Actions' impact on software development workflows — the infrastructure platform central to this study." + }, + { + "title": "Automating code review activities by large-scale pre-training", + "relevance": "Representative prior work on AI/ML-based code review automation, which this empirical study complements by measuring real-world impact." + }, + { + "title": "Automated code review in practice", + "relevance": "Industrial study of AI code review deployment, complementary case study from a different setting." + }, + { + "title": "Llama-reviewer: Advancing code review automation with large language models through parameter-efficient fine-tuning", + "relevance": "Recent work on LLM-based code review comment generation — directly in scope for this survey." + }, + { + "title": "On the use of GitHub actions in software development repositories", + "relevance": "Empirical study of GitHub Actions adoption patterns, providing context for interpreting this paper's RQ1 findings." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly actionable for any team evaluating or deploying AI code review tools — provides tool rankings, design recommendations, and a monitoring framework." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The 0.9%–19.2% vs 60% addressing rate gap challenges optimism about AI code review effectiveness, and the finding that manual triggering outperforms automatic review is counterintuitive." + }, + "fear_safety": { + "score": 1, + "justification": "Raises concerns about AI review quality (hallucinated warnings, vague summaries) but not safety-critical outcomes." + }, + "drama_conflict": { + "score": 1, + "justification": "Implicit product comparison (coderabbitai vs others) provides a ranking angle but no significant controversy." + }, + "demo_ability": { + "score": 2, + "justification": "All 16 reviewed GitHub Actions are publicly available on GitHub Marketplace, making the findings immediately testable by practitioners." + }, + "brand_recognition": { + "score": 1, + "justification": "No famous lab affiliation; coderabbitai is a recognized product name but the paper is from academic institutions." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43198812", + "title": "Symmetries of Living Systems", + "points": 8, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43198812" + }, + { + "hn_id": "45367764", + "title": "Fill probability estimates in institutional bond trading with quantum computers", + "points": 2, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=45367764" + }, + { + "hn_id": "44961416", + "title": "Group Sequence Policy Optimization", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44961416" + }, + { + "hn_id": "44041341", + "title": "Grounded in Context: Retrieval-Based Method for Hallucination Detection", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44041341" + }, + { + "hn_id": "43242677", + "title": "FastAtlas: Real-Time Compact Atlases for Texture Space Shading", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43242677" + }, + { + "hn_id": "29567026", + "title": "Transient execution flaws found in AMD Zen CPUs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=29567026" + } + ], + "top_points": 8, + "total_points": 15, + "total_comments": 3 + } +} +\ No newline at end of file diff --git a/papers/does-it-tie-2025/scan-v5.json b/papers/does-it-tie-2025/scan-v5.json @@ -0,0 +1,570 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital", + "authors": [ + "Pierre Colombo", + "Malik Boudiaf", + "Allyn Sweet", + "Michael Desa", + "Hongxi Wang", + "Kevin Candra", + "Symeon del Marmol" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2512.18658", + "doi": "10.48550/arXiv.2512.18658" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are substantiated: multi-document reasoning requirement shown in Section 3, evidence traceability discussed throughout, deterministic output requirement proven by baseline failures in Section 5, world model architecture detailed in Section 4.2.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper claims architectural superiority (eager vs lazy construction causes performance gain) but lacks rigorous ablation studies. Only one ablation (Agentic + Structured Repr.) tested; no systematic component isolation to prove individual causal factors. Different baselines may differ in implementation quality/effort rather than architectural merit.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Claims bounded to tie-out (Seed to Series B VC context) are appropriate, but conclusion overgeneralizes: 'robust foundational substrate suitable for a wider array of downstream legal applications' and 'Applied Legal Intelligence' framing extend far beyond tested scope without evidence. Only one task evaluated.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper attributes performance differences to architectural design (world model vs RAG) but doesn't discuss alternative explanations: implementation quality differences, prompt engineering quality, model version effects, or whether Equall received more development effort. No discussion of these potential confounds.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Measurement (F1 on anomaly detection with correct type and evidence traceability) directly matches claimed outcome (reliable tie-out automation). Ground truth provided by legal professionals; no measurement granularity mismatch.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Paper has Introduction, Background, Complexity Analysis, Methods, Experiments, and Conclusion, but no explicit limitations discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Specific threats not discussed: no inter-rater reliability for ground truth annotations, no representativeness justification for 4 companies, no discussion of OCR quality handling despite identifying it as a challenge, no failure mode analysis, no evaluation of annotation quality.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope to tie-out in Seed-Series B VC is stated implicitly, but conclusion overextends: claims about generalization to 'a wider array of downstream legal applications' and 'legal intelligence' broadly without evidence. Boundaries are stated for the core task but then violated in claims.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement provided. All authors list @equall.com email; paper evaluates Equall's own product, but no explicit funding source disclosure or acknowledgment of their employer's financial interest in positive results.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations clearly disclosed via @equall.com email addresses; all work for the company whose system (Equall) is being evaluated.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Authors' employer (Equall) is entirely dependent on the outcome. This is a company evaluating its own product against baselines. The funder has direct financial/reputational interest in demonstrating superiority of Equall.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial disclosure statement provided. No mention of potential patents, equity, or consulting arrangements related to Equall or the described approach.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms precisely defined: Dataroom (Section 2.1), Cap Table (Section 2.2), Tie-out (Section 2.3 with formal mathematical notation), World Model (Section 4.2), Event Graph (Section 4.2), Anomaly types (Section 2.3 with taxonomy).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions explicitly stated in abstract and introduction: (1) characterize tie-out as real-world benchmark, (2) analyze existing agentic systems, (3) propose world model architecture (Equall). Each is developed and evaluated in paper.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "Limited prior work engagement. Introduction cites legal AI benchmarks ([7,5,6,11,10]) but lacks dedicated Related Work section. No substantive discussion of how this relates to existing legal NLP, knowledge graphs, or agentic systems literature. Citations present but not synthesized.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code released. No mention of open-source implementation, GitHub repository, or promise of future release.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Data consists of 'four anonymized datarooms' from real VC companies. Anonymized data is not publicly available; appears under NDA. No release mentioned.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Model mentioned (GPT-5.1) but no environment specifications: no requirements.txt, Dockerfile, Python version, CUDA requirements, or dependency list provided. Implementation details absent.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions. Paper describes architecture and approach conceptually but does not provide actionable instructions to reproduce results.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Main performance results (Figure 8: F1 scores per flag type; Figure 10: F1 vs dataroom size) lack error bars or confidence intervals. Only Figure 11 (time measurements) includes 95% error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests reported. Claims like 'significantly outperforming' (p.9) use English phrasing but lack p-values, t-tests, or other statistical tests comparing the three approaches.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes reported: F1 scores (85% vs 42% vs 29% overall; per-category F1 in Figure 8), speedup (22× faster per check), and time reduction (27h → 5h at Series B). Magnitude of differences quantified.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Evaluation on only 4 companies without justification. No power analysis, no discussion of sample size adequacy, no statistical reasoning for why 4 companies (1 Seed, 1 Series A, 2 Series B) is sufficient.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "F1 performance metrics (Figures 8, 10) reported without variance/std dev/confidence intervals. Baselines tested once per company with single F1 reported. Only time measurements (Figure 11) include variance.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Three systems compared: Agentic Baseline (lazy RAG), Agentic + Structured Repr. (ablation), and Equall (proposed). Baselines represent alternative architectural paradigms.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines use GPT-5.1 (contemporary agentic architecture with RAG), representing state-of-the-art LLM approaches for the task. Comparison uses same underlying model versions across approaches.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "Only one ablation tested: Agentic + Structured Repr. (removes Event Graph layer). No ablation of individual components within Equall: no test of symbolic-only verification, no isolation of Stage 1 vs Stage 2 extraction, no sensitivity analysis.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Multiple metrics reported: (1) Precision/Recall/F1 per flag category, (2) F1 across companies/stages, (3) inference speed per check, (4) total workflow time, (5) scaling robustness. Comprehensive evaluation.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "Ground truth annotated by 'experienced legal professionals' but no user study or qualitative evaluation of system. No human evaluation of system outputs, usability testing, or feedback from actual legal practitioners using Equall.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "Not explicitly stated whether test data was held out from system development. Paper says evaluation on 'four anonymized datarooms' with ground truth from professionals, but hold-out procedure and train/dev/test split not described.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Figure 8 breaks results by flag type (Data Discrepancy, Missing Documentation, Missing from Cap Table) with precision/recall/F1 for each. Figure 10 shows F1 across companies (Seed, Series A, Series B). Multi-level breakdowns provided.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No failure cases shown or discussed. Results only report aggregate F1 scores. No examples of false positives, false negatives, or specific anomalies the system misses.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "Only positive results reported. All three systems evaluated show Equall > Baselines across metrics. No cases where Equall underperforms, no scenarios where agentic approaches succeed, no discussion of when the approach fails.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Models mentioned (GPT-5.1) but without version snapshots, release dates, or exact configurations. Unclear if Equall uses GPT-5.1 or different model. No fine-tuning details provided.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No example prompts or system instructions shown. Paper describes the pipeline conceptually but does not provide actual prompts used to elicit LLM behavior for document classification, extraction, or verification.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No hyperparameters reported for LLM calls: no temperature, top-p, max_tokens, frequency penalties, or other generation parameters specified for any stage of the pipeline.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": false, + "justification": "High-level architecture described (document classification, low-level node extraction, event graph synthesis, neuro-symbolic verification) but not implementation-level scaffolding. No example prompts, reasoning traces, or detailed agent instructions.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Paper identifies preprocessing challenges (OCR quality, document format variance, near-duplicates) but does not document the actual preprocessing steps taken. No description of cleaning, tokenization, or data handling procedures.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw data consists of real VC company datarooms marked 'anonymized' and not publicly released. Data appears restricted by confidentiality/NDA agreements with companies.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": false, + "justification": "Paper states 'four anonymized datarooms presented in Section 3' but does not describe how these were collected, whether they're representative, or sampling methodology. No collection procedure documented.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "N/A - no human participants recruited. Evaluation uses pre-existing corporate datarooms, not participant-generated data.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Overall pipeline architecture described (Stages 1-3 in Section 4.2) but specific data handling not documented: no documentation of PDF extraction, OCR handling, duplicate removal, or data validation steps.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "N/A - not evaluating pre-trained models on public benchmarks; evaluating a bespoke tie-out verification system on real datarooms.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "N/A - not evaluating benchmark contamination; this is a specialized legal task with proprietary ground truth.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "N/A - not evaluating on public benchmarks; tie-out is a novel task specific to this work.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "N/A - no human participants studied; evaluation uses ground truth from professional annotators but no human subjects experiment.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "N/A - no human participants; no IRB approval needed.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "N/A - no human participant demographics; annotators mentioned only as 'experienced legal professionals'.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "N/A - no human participants; evaluation on corporate datarooms only.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "N/A - not a randomized experiment; fixed set of 4 companies evaluated.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "N/A - not applicable to system evaluation.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "N/A - no human participants to have attrition.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Latency reported (2 sec vs 45 sec per check, Figure 9) and total workflow time (Figure 11), but no API costs, compute costs, or monetary expense disclosed.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget, GPU/TPU requirements, or model costs stated. Timing provided but not resource constraints or infrastructure requirements.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Capitalization tie-out requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current agentic LLM approaches fail to deliver reliably", + "evidence": "Section 4.1 analyzes limitations of agentic RAG (retrieval failure indistinguishable from absence, exponential error compounding); Figure 10 shows agentic F1 drops from 55% to 28% with complexity; Figure 8 shows agentic baselines achieve 29% F1 overall", + "supported": "strong" + }, + { + "claim": "Equall's eager world model architecture achieves 85% F1 on anomaly detection, substantially outperforming lazy agentic approaches (42%, 29%)", + "evidence": "Figure 8 shows F1 across flag categories with Equall at 85%, Agentic+Structured at 42%, Agentic Baseline at 29%", + "supported": "strong" + }, + { + "claim": "Performance gap widens dramatically with dataroom complexity: Equall maintains 95% F1 on large Series B datarooms while agentic baseline drops to 28%", + "evidence": "Figure 10 tracks F1 vs dataroom size (2,081 to 6,721 pages): Equall 55%→95%, Agentic Baseline 55%→28% across scale", + "supported": "strong" + }, + { + "claim": "Equall achieves 22× speedup per verification check (2 seconds vs 45 seconds for agentic) due to upfront world model construction", + "evidence": "Figure 9 shows inference time: agentic 45 sec/check vs Equall 2 sec/check; 15 min indexing cost amortizes over 500+ checks", + "supported": "strong" + }, + { + "claim": "Equall-assisted workflow reduces tie-out time from 27 hours to 5 hours at Series B scale while maintaining 81.4% efficiency", + "evidence": "Figure 11 shows manual tie-out: 21.9 hours at Series B vs Equall-assisted: 5 hours; assisted efficiency 81.4% (error bars shown)", + "supported": "strong" + }, + { + "claim": "Verification workload explodes super-linearly with company maturity: rises 3× from Seed to Series B (2,710 to 7,923 verification steps)", + "evidence": "Figure 7 documents verification step count growth; Section 3 shows securities increase 7× while documents only 2.5× from Seed to Series B", + "supported": "strong" + }, + { + "claim": "The Event Graph world model is a generalizable foundation suitable for broader downstream legal applications beyond tie-out", + "evidence": "Conclusion states model captures 'legally operative events as structured, temporally ordered state transitions' applicable across legal domains; no empirical validation on other tasks", + "supported": "weak" + }, + { + "claim": "Anomalies shift from informal omissions (Seed) to complex document inconsistencies (Series B) requiring different verification strategies", + "evidence": "Figure 5 shows Seed: 41.5% Board Consents vs Series B: 63.8%; Figure 6 shows 'Missing Information' remains top issue category across stages, contradicting the claim", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "The paper demonstrates that explicit world model construction substantially outperforms standard agentic LLM approaches on complex multi-document legal reasoning tasks. By building a structured Event Graph before verification (eager construction), the Equall system achieves 85% F1 on anomaly detection compared to 42% and 29% for baseline agentic approaches, with performance advantages widening dramatically as dataroom complexity increases from Seed to Series B stage. The approach enables 22× speedup per verification and reduces manual tie-out time from 27 hours to 5 hours with human-in-the-loop assistance, suggesting that neuro-symbolic architectures with explicit world models could be foundational for reliable autonomous legal reasoning at scale.", + "red_flags": [ + { + "flag": "Undisclosed financial conflict of interest", + "detail": "All authors work for Equall (@equall.com) and are evaluating their own product against baselines. No competing interests statement or funding disclosure provided. Direct financial incentive for positive results." + }, + { + "flag": "No statistical significance testing", + "detail": "Performance differences (85% vs 42% F1) reported without p-values, confidence intervals, or significance tests on main metrics. Only timing measurements (Figure 11) include error bars." + }, + { + "flag": "Insufficient sample size", + "detail": "Evaluation on only 4 companies (Seed, Series A, 2× Series B) without justification or power analysis. No discussion of representativeness for the broader VC financing population." + }, + { + "flag": "Minimal ablation studies", + "detail": "Only one ablation tested (Agentic + Structured Repr.). No isolation of individual Equall components: no symbolic-only verification, no Stage 1-only extraction, no sensitivity analysis of design choices." + }, + { + "flag": "No failure case analysis", + "detail": "Results show only positive outcomes. No discussion of false negatives, false positives, or specific anomaly types where the system struggles. No error analysis." + }, + { + "flag": "Annotation quality unreported", + "detail": "Ground truth provided by 'experienced legal professionals' but no inter-rater reliability, Cohen's kappa, or annotation agreement reported. Single-annotator labels assumed perfect." + }, + { + "flag": "Reproducibility impossible", + "detail": "Code not released, data anonymized and proprietary, no prompts or hyperparameters disclosed. Implementation details absent: no requirements.txt, Dockerfile, or model snapshots. Cannot reproduce." + }, + { + "flag": "Limited related work engagement", + "detail": "No dedicated related work section. Legal AI literature engagement limited to citations without synthesis. No comparison to knowledge graph, neuro-symbolic, or legal reasoning systems literature." + }, + { + "flag": "Overgeneralized conclusions", + "detail": "Tested on single task (tie-out); conclusion claims 'foundation for applied legal intelligence' and 'wider array of downstream legal applications' without any evidence from other domains." + }, + { + "flag": "Model version ambiguity", + "detail": "References 'GPT-5.1' without clarity on release date, availability, or whether this is a real/hypothetical model. Unclear what models Equall components use." + } + ], + "cited_papers": [ + { + "title": "Saullm-54b & saullm-141b: Scaling up domain adaptation for the legal domain", + "relevance": "Legal LLM domain adaptation; directly relevant to building specialized legal reasoning systems" + }, + { + "title": "Saullm-7b: A pioneering large language model for law", + "relevance": "Legal LLM development; foundation model for legal domain reasoning" + }, + { + "title": "LegalBench: A collaboratively built benchmark for measuring legal reasoning in large language models", + "relevance": "Legal reasoning benchmarking; directly comparable to tie-out as legal AI evaluation task" + }, + { + "title": "Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities", + "relevance": "Agentic LLM capabilities; represents state-of-the-art reasoning and multi-step planning systems" + }, + { + "title": "Neural legal judgment prediction in english", + "relevance": "Legal NLP and judgment prediction; foundational work in legal AI classification" + }, + { + "title": "GPT-4 passes the bar exam", + "relevance": "LLM legal reasoning capabilities; demonstrates capability ceiling on legal domain benchmarks" + }, + { + "title": "Why do multi-agent llm systems fail?", + "relevance": "Directly cited for multi-agent agentic system failures; relevant to understanding agentic limitations" + }, + { + "title": "Developing artificially intelligent justice", + "relevance": "Legal AI applications and justice system automation; broader context for legal reasoning systems" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "System targets a real VC workflow (tie-out) with demonstrated efficiency gains (27h → 5h); however, deployment limited to Equall's customers, unknown real-world validation outside authors' company." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Finding that structured world models outperform agentic RAG aligns with intuition from complexity analysis; somewhat expected given the problem formulation rather than surprising or contrarian." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety, alignment, or risk concerns discussed. Legal automation raises general legal reasoning reliability questions but paper doesn't engage with these." + }, + "demo_ability": { + "score": 1, + "justification": "System demonstrated on real datarooms but not publicly available or accessible. No sandbox, demo, or open-source artifact allowing readers to try the approach." + }, + "brand_recognition": { + "score": 1, + "justification": "Equall is a specialized legal AI startup, not a household name. Authors have prior work on legal LLMs (SaulLM) showing domain expertise but limited brand visibility." + }, + "drama_conflict": { + "score": 1, + "justification": "Evaluating own product introduces conflict of interest but not positioned as dramatic or controversial. More of a methodological concern than attention-grabbing narrative." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42550783", + "title": "Gamma-ray bursts: what do we know today that we did not know 10 years ago?", + "points": 16, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42550783", + "created_at": "2024-12-30T16:30:18Z" + }, + { + "hn_id": "43777601", + "title": "Assistance or Disruption? Evaluating the Design of Proactive AI Programming", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43777601", + "created_at": "2025-04-23T23:02:02Z" + }, + { + "hn_id": "42566642", + "title": "1.58-Bit Flux", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=42566642", + "created_at": "2025-01-01T15:38:39Z" + }, + { + "hn_id": "43265832", + "title": "Evaluating Intelligence via Trial and Error", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43265832", + "created_at": "2025-03-05T12:51:05Z" + }, + { + "hn_id": "43280105", + "title": "Evaluating Intelligence via Trial and Error", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43280105", + "created_at": "2025-03-06T13:45:25Z" + } + ], + "top_points": 16, + "total_points": 23, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/does-prompt-formatting-2024/scan-v5.json b/papers/does-prompt-formatting-2024/scan-v5.json @@ -0,0 +1,543 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Does Prompt Formatting Have Any Impact on LLM Performance?", + "authors": [ + "Jia He", + "Mukund Rungta", + "David Koleczek", + "Arshdeep Sekhon", + "Franklin X Wang", + "Sadid Hasan" + ], + "year": 2024, + "venue": "arXiv.org", + "arxiv_id": "2411.10541", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are empirically supported: performance variations documented in Table 1 (up to 54pp on HumanEval); GPT-4's robustness confirmed in Figure 6 via Coefficient of Mean Deviation; significant variations shown via p-values < 0.001.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Paper uses ablation design: isolates prompt format as sole variable by keeping semantic content identical across formats (Appendix C). Causal claims about format effects are justified by this controlled experimental design.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Scope appropriately bounded to OpenAI GPT models and 6 specific benchmarks. Section 7 explicitly acknowledges GPT-only focus and plans future work on other LLM families.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper documents that format affects performance but does not discuss why—no exploration of tokenization, training data distribution, or transformer architecture mechanisms. Section D.2 speculates about laziness vs. format processing but doesn't systematically explore alternatives.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Claims directly match measurements: 'format affects performance' is measured via accuracy (MMLU), pass@1 (HumanEval), BLEU (code translation). No proxy conflation.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Dedicated Section 7 'Limitations' explicitly lists three specific limitations: GPT-only focus, formats excluded (HTML/XML), other prompt engineering dimensions not varied.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats named: 'GPT-based models' vs. other LLM families, missing format types, other prompt design elements held constant. Not boilerplate disclaimers.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Scope clearly bounded to 4 formats, 6 benchmarks, 4 GPT models, temperature=0 for consistency. Implicitly states findings don't explain mechanism or provide universal format guidance.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source statement, no acknowledgments section identifying funder. Affiliations with Microsoft and MIT stated but no explicit funding disclosure.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations clearly listed: Microsoft and MIT for each author.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Microsoft affiliation evaluates OpenAI's competing models (not Microsoft's own Copilot), suggesting independence. However, funding source unconfirmed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement provided. No declaration of patents, equity, or consulting relationships.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key metrics formally defined: 'sensitivity' (Section 3.1, with formula), 'consistency' (Section 4.1, with formula), 'transferability' (IoU, Section 5.1). Prompt templates shown in examples (Figure 1, Appendix C).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contribution explicitly stated: 'first to compare impact of different prompt formats on GPT models' performance across various tasks.' Three research questions clearly framed (Sensitivity, Consistency, Transferability).", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section A (Related Work) extensively cites prior prompt engineering research. Clearly distinguishes this work: 'Our research diverges...by examining global prompt format modifications' vs. prior work on fine-grained changes. Positions contribution within landscape.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository, GitHub link, or code availability statement provided. Experiments depend on proprietary OpenAI API access.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Uses standard public benchmarks (MMLU, HumanEval, CODEXGLUE, NER Finance, HumanEval-X, FIND). Raw data available via original benchmark sources.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only mentions 'Azure OpenAI' without Python version, dependency versions, or Docker spec. No requirements.txt or environment file referenced.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions. Would require Azure OpenAI access, which model versions to call, and how to replicate exact API setup. Prompts partially shown (Appendix C) but not comprehensively.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": true, + "justification": "Tables 3-8 report standard deviations (±) for each condition. Figures 4 and 9 display error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Table 1 reports one-sided matched pairs t-tests with p-values; all comparisons yield p<0.001 except HumanEval GPT-4-1106 (p=0.055).", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute performance differences shown (Max/Min in Table 1, e.g., 59.7 vs. 50.0). Relative differences noted in text ('40%', '200%', '300%'). Formal effect size metrics (Cohen's d) not reported but magnitude is clear.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample sizes mentioned (MMLU 14,079, HumanEval 164, FIND 500) but not justified. No power analysis or rationale for sufficiency provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": true, + "justification": "Standard deviations reported in Tables 3-8 and error bars displayed in Figures 4 and 9 across all runs.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Compares 4 format templates (plain text, Markdown, JSON, YAML), with one implicitly serving as baseline. Max/Min comparisons frame within-design baselines.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Tests current GPT models (GPT-3.5-turbo and GPT-4 variants from 2023-2024). Baselines are contemporary and appropriate for the research question.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Systematic ablation: varies format while keeping prompt content, persona, instructions, examples identical (Appendix C). Format is isolated as sole experimental variable.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Uses task-specific metrics (accuracy, pass@1, BLEU, NER F1) plus meta-metrics (consistency, IoU). Evaluates from three angles (sensitivity, consistency, transferability).", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Not applicable—benchmark evaluation only, no subjective quality assessment needed.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "All benchmarks use standard test/dev splits. MMLU uses official dev set for few-shot, test set for evaluation. HumanEval, CODEXGLUE, etc. use established splits.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Figure 5 breaks MMLU by domain (STEM, humanities, social science, other) and shows per-domain format sensitivity. Less detailed for other benchmarks.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section D.2 explicitly discusses GPT-4-32k's extreme HumanEval failure with JSON (21.95% vs. 76.2% plain text), hypothesizing 'laziness' in chain-of-thought generation.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Reports no universal optimal format; IoU scores mostly <0.2 for cross-model comparisons, indicating non-transferability. GPT-3.5 consistency <0.5 is noted as negative.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model IDs provided: 'gpt-35-turbo-0613', 'gpt-35-turbo-16k-0613', 'gpt-4-32k-0613', 'gpt-4-1106-preview'. Date snapshots implicit in version tags.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Appendix C shows complete prompt templates for NER Finance task with placeholder examples. Only one task shown in detail, but templates and structure fully transparent.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature=0 reported for MMLU. For FIND, defers to (Schwettmann et al., 2023) settings without repeating. Top-p, max_tokens not reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding; straight prompt engineering. Not applicable.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Describes sampling (NER Finance: random 500/500, FIND: random 500/500) and split usage (MMLU dev for few-shot, test for eval). Limited but adequate for standard benchmarks.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All data sourced from published benchmarks (MMLU, HumanEval, CODEXGLUE, etc.). Raw benchmark data publicly available.", + "source": "haiku" + }, + "data_collection_described": { + "applies": false, + "answer": false, + "justification": "Paper does not collect new data; uses existing benchmarks. NA.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human subjects. NA.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Pipeline is simple: ingest public benchmark → vary prompt format → run on API → collect results. Described adequately for standard benchmarks.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No explicit training data cutoff date stated for GPT-3.5 or GPT-4. ArXiv date 2024-11-15 but model training dates not discussed.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether MMLU, HumanEval, CODEXGLUE, or other benchmarks were present in GPT training data. Potential overlap not addressed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "No analysis of whether benchmarks entered training data before model cutoff. Standard benchmarks may be over-represented in GPT training.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants. NA.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No API costs, token counts, or latency metrics reported. Significant cost incurred but not quantified.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total computational budget not stated. No count of API calls, tokens consumed, or estimated costs provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Prompt formatting significantly affects GPT model performance, with variations ranging up to 54pp (300% relative) depending on task and model", + "evidence": "Table 1 documents max/min performance across formats for each benchmark; HumanEval GPT-4-32k shows 21.95% (JSON) vs. 76.2% (plain text); text reports '200% improvement' and '300%' for FIND and HumanEval respectively", + "supported": "strong" + }, + { + "claim": "Larger models (GPT-4) are more robust to prompt format changes than smaller models (GPT-3.5)", + "evidence": "Figure 6 shows Coefficient of Mean Deviation (CMD) consistently lower for GPT-4 (0.035–0.043) than GPT-3.5 (0.035–0.176); Figure 2 shows consistency scores GPT-3.5 <0.5, GPT-4 >0.5", + "supported": "strong" + }, + { + "claim": "No universal optimal format exists even within the same GPT model family", + "evidence": "Section 5.2 shows IoU <0.2 for cross-model pairs; GPT-3.5 prefers JSON while GPT-4 prefers Markdown (Section 5.2); conclusion: 'no single format excelling universally'", + "supported": "strong" + }, + { + "claim": "Model sensitivity to format is task-agnostic, not contingent on task-specific skill requirements", + "evidence": "Figure 5 breaks MMLU by domain (STEM, humanities, social science); shows performance spread exists across all domains. Section D.1: 'Model's sensitivity...is a general characteristic, rather than being contingent on specific skills'", + "supported": "moderate" + }, + { + "claim": "GPT-3.5 outputs show low consistency across formats (<0.5 identical responses), while GPT-4 exceeds 0.5", + "evidence": "Figure 2 and Figure 8 explicitly show consistency scores for MMLU and FIND datasets. MMLU: GPT-3.5 'displayed low consistency, with scores below 0.5, and only 16% identical responses between Markdown and JSON'", + "supported": "strong" + }, + { + "claim": "Model size correlates with increased robustness to prompt format variation", + "evidence": "CMD analysis (Figure 6) shows inverse relationship: GPT-4 lower CMD than GPT-3.5; Section D.2 concludes 'larger models are more robust to template variation'", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "Prompt formatting significantly impacts GPT model performance, with variations exceeding 300% relative improvement on some tasks (e.g., HumanEval). GPT-4 demonstrates substantially greater robustness to format changes (lower Coefficient of Mean Deviation, higher consistency scores) compared to GPT-3.5. No single prompt format universally optimizes performance across different GPT models or even within the same family. Format sensitivity is a general model characteristic independent of task-specific skill requirements or domain expertise.", + "red_flags": [ + { + "flag": "No code release", + "detail": "Experiments depend entirely on proprietary OpenAI API access via Azure. Cannot be reproduced without credentials and budget." + }, + { + "flag": "Training data contamination not addressed", + "detail": "No analysis of whether MMLU, HumanEval, CODEXGLUE, or other benchmarks were present in GPT training data. Standard benchmarks are likely over-represented in LLM training corpora." + }, + { + "flag": "No mechanistic explanation", + "detail": "Paper documents that format matters but does not explain why. No investigation of tokenization, embedding distribution, transformer attention patterns, or other mechanisms." + }, + { + "flag": "GPT-only scope limits generalizability", + "detail": "Results restricted to OpenAI models; does not test open-source LLMs (LLaMA, Phi), proprietary competitors (Gemini, Claude), or smaller models. Acknowledged limitation but weakens contribution." + }, + { + "flag": "Sample size not justified", + "detail": "Benchmark samples mentioned (MMLU 14,079, HumanEval 164) but no power analysis or statistical justification for sufficiency provided." + }, + { + "flag": "Hyperparameter transparency incomplete", + "detail": "Temperature reported only for MMLU. Top-p, max_tokens, and other sampling parameters not consistently reported across all experiments." + }, + { + "flag": "Limited mechanistic depth", + "detail": "Section D.2 speculates about 'laziness' in GPT-4-32k's JSON output on HumanEval but does not systematically investigate. Alternative explanations (tokenization, prompt structure parsing) unexplored." + }, + { + "flag": "No alternative explanation discussion", + "detail": "Does not consider why format sensitivity exists (e.g., training data composition, tokenization artifact, transformer architecture preference) or whether findings might be confounded." + } + ], + "cited_papers": [ + { + "title": "Quantifying language models' sensitivity to spurious features in prompt design", + "relevance": "Directly related—prior work showing LLM sensitivity to fine-grained prompt changes (Sclar et al., 2023); this paper extends to global format variations." + }, + { + "title": "Mind your format: Towards consistent evaluation of in-context learning improvements", + "relevance": "Closely related—argues that evaluation standards must marginalize across prompt formats to avoid spurious conclusions (Voronov et al., 2024)." + }, + { + "title": "You don't need a personality test to know these models are unreliable: Assessing the reliability of large language models on psychometric instruments", + "relevance": "Related on output consistency and reliability under prompt variation (Shu et al., 2023)." + }, + { + "title": "Evaluating large language models trained on code (HumanEval benchmark)", + "relevance": "Benchmark paper defining HumanEval, one of six benchmarks used in this study (Chen et al., 2021)." + }, + { + "title": "Measuring massive multitask language understanding (MMLU)", + "relevance": "Benchmark paper defining MMLU, the largest benchmark used in this study (Hendrycks et al., 2020)." + }, + { + "title": "Chain-of-thought prompting elicits reasoning in large language models", + "relevance": "Foundational prompting technique cited; this paper tests whether CoT benefits persist across format variations (Wei et al., 2023)." + }, + { + "title": "Language Models are Few-Shot Learners", + "relevance": "Foundational in-context learning work; motivates few-shot example ordering as confounding variable this paper controls (Brown et al., 2020)." + }, + { + "title": "Table meets LLM: Can large language models understand structured table data?", + "relevance": "Most closely related concurrent work; provides cursory format exploration but limited to tabular data (Sui et al., 2024)." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Practitioners should care about format choice, but paper provides no guidance on which format to select. Findings are task-model specific without generalizable heuristics." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Moderately surprising that semantic content can vary 20–40pp based on formatting alone. Challenges assumption that meaning is content-agnostic, but not radical for practitioners aware of prompt brittleness." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety implications. Paper is purely about engineering optimization for task performance, not safety or alignment." + }, + "drama_conflict": { + "score": 1, + "justification": "Minimal drama. Finding that evaluation standards may be invalid if ignoring format has some methodological bite, but not a major controversy or conflict narrative." + }, + "demo_ability": { + "score": 2, + "justification": "Reproducible on Azure OpenAI but requires API credentials, budget, and knowledge of how to call proprietary models. Not accessible to casual users without cost." + }, + "brand_recognition": { + "score": 3, + "justification": "Strong brands: Microsoft Research authorship, evaluating OpenAI GPT models. Both highly recognized in AI/ML community." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42266742", + "title": "The Rise and Fall of Ideas' Popularity [pdf]", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42266742", + "created_at": "2024-11-28T16:54:44Z" + }, + { + "hn_id": "44854721", + "title": "Does Prompt Formatting Have Any Impact on LLM Performance?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44854721", + "created_at": "2025-08-10T12:23:36Z" + }, + { + "hn_id": "45930419", + "title": "A Large-Scale Computational Analysis of Errors in ArXiv Papers", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45930419", + "created_at": "2025-11-14T18:52:29Z" + }, + { + "hn_id": "33707451", + "title": "Knowledge Graph Generation from Text", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=33707451", + "created_at": "2022-11-22T16:21:57Z" + } + ], + "top_points": 3, + "total_points": 7, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/does-reasoning-introduce-2025/scan-v5.json b/papers/does-reasoning-introduce-2025/scan-v5.json @@ -0,0 +1,507 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning", + "authors": [ + "Xuyang Wu", + "Jinming Nian", + "Ting-Ruen Wei", + "Zhiqiang Tao", + "Hsin-Tai Wu" + ], + "year": 2025, + "venue": "Conference on Empirical Methods in Natural Language Processing", + "arxiv_id": "2502.15361", + "doi": "10.18653/v1/2025.findings-emnlp.1006" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four major abstract claims — first systematic evaluation of reasoning bias, accuracy-without-bias-mitigation finding, bias amplification in ambiguous contexts, and ADBP outperforming SfRP — are supported by Tables 1–2 and Figures 3–6.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The SfRP ablation (remove biased steps → re-run instruction model) constitutes an intervention test for the causal claim that biased reasoning causes errors; robustness of the LLM judge across prompt variants (Section 5.1, Table 5) partially validates the oracle, though human annotation is absent.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Conclusions like 'reasoning-based models do not mitigate biases' are stated broadly across the paper while evidence covers only five open-source models plus cited OpenAI system card numbers on a single English-language benchmark (BBQ); computational constraints also excluded the full DeepSeek-R1 model.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4.3 explicitly notes that incorrect answers can arise without biased reasoning (white lines in Figure 3b) and identifies non-negative polarity misinterpretation as an independent source of errors (Figure 4b, Appendix A.7).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper uses distinct metrics for different constructs: exact-match accuracy for prediction correctness and formally-defined bias scores (Equations 2–3) for stereotype expression, and keeps these separate throughout analysis.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated 'Limitations' section is present, separate from the conclusion, addressing the LLM judge, mitigation design, computational constraints, and refusal behavior.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include: LLM-as-a-judge uncertainty without human annotation verification, inability to test full DeepSeek-R1 due to compute costs, distilled models potentially carrying inherent biases, and rare refusal behavior due to BBQ's controlled framing.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly bounds scope to distilled DeepSeek-R1 variants rather than the full model, and to the BBQ benchmark in zero-shot English-language settings; computational and cost limitations are explicitly named.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No acknowledgment or funding section appears anywhere in the paper; funding sources are entirely absent.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are clearly listed on the title page (Santa Clara University, Rochester Institute of Technology, Docomo Innovations).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding source is disclosed, making this criterion not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "The paper defines social bias (stereotype-based associations), SfRP, ADBP, ambiguous vs. disambiguated contexts, and provides formal equations for both accuracy and bias score metrics.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are enumerated: first evaluation of bias in reasoning steps (not just outputs), empirical finding that reasoning amplifies stereotypes, and the ADBP mitigation strategy.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The related work section distinguishes this study from prior bias work (which focused on outputs, not reasoning steps) and from prior CoT studies (which focused on math/code), clearly positioning the contribution as a gap-filling study.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The abstract links directly to https://github.com/elviswxy/LLM_reasoning_bias for evaluation and mitigation code.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "BBQ is a publicly available standard benchmark (Parrish et al., 2022); no novel dataset was created.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only hardware (NVIDIA A100 GPUs) is mentioned; no requirements.txt, Dockerfile, or Python/library versions are provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": true, + "justification": "Algorithm 1 specifies the ADBP procedure step-by-step, all prompts are provided in Appendix A.2, model HuggingFace links are given, and code is released.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported for the main accuracy or bias score results in Tables 1–2.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims (e.g., 'ADBP outperforms SfRP in most cases') are made without any statistical significance tests.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Numerical accuracy improvements from mitigation are reported in absolute terms (e.g., +0.517 in Case 1, +0.717 in Case 3), providing quantified effect sizes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The full BBQ dataset (58,492 examples) is used without any power analysis or justification for whether this is sufficient to detect the observed effect sizes reliably.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "The LLM-as-a-judge scores each step 5 times and takes majority vote, but variance across those runs is not reported; main accuracy/bias results show no spread.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "ADBP is compared against SfRP-based mitigation, Self-debiasing via Explanation (Gallegos et al., 2025), and Combined Debiasing Prompt (Liu et al., 2025a) in Figure 6 and Table 2.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All mitigation baselines (Gallegos et al., 2025; Liu et al., 2025a) are from 2025, published at NAACL 2025 and contemporaneous with this work.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The SfRP component is an ablation: removing biased steps from DeepSeek reasoning and re-querying the base model isolates the contribution of biased reasoning to incorrect predictions (Figure 5).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Two distinct metrics are used throughout: accuracy (Acc) and bias score (Bias), each split by context type (ambiguous/disambiguated).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "The paper explicitly states 'We did not conduct human labeling to verify [LLM-as-a-judge] reliability due to the extremely high cost of manual annotation.'", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "This is an inference-only evaluation study; no model training or train/test splits are involved.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Tables 1a and 1b provide per-category accuracy and bias scores across all 11 BBQ categories for each evaluated model.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix A.6 provides full qualitative failure examples (biased reasoning leading to wrong answers across Religion, Age, Disability categories), and Section 4.3 analyzes failure patterns systematically.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that ADBP underperforms SfRP for the Qwen-32B base model (Case 3, Table 2) and that reasoning models fail to reduce bias despite improving accuracy.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact HuggingFace model IDs are provided via footnotes for all open-source models (e.g., deepseek-ai/DeepSeek-R1-Distill-Llama-8B); OpenAI results are sourced from the o3-mini system card.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "All prompts are provided verbatim in Appendix A.2 (Figures 7–12), including evaluation prompts for instruction-tuned models, DeepSeek models, LLM-as-a-judge, and ADBP.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "The paper states generation parameters follow each model's system card without specifying temperature, top-p, or other decoding hyperparameters.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is used; this is a direct inference evaluation study.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Reasoning step extraction (newline splitting of <think> tokens), k=100 bin normalization for visualization, exact-match response normalization with regex, and LLM-as-a-judge 5-run majority voting are all documented.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "While code is released, the intermediate scored outputs (per-step LLM judge bias scores for all reasoning traces) are not explicitly released as a dataset artifact.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The BBQ dataset is described (Table 3, dataset statistics), model inference procedure is described, and the LLM-as-a-judge scoring procedure (GPT-4o, 5 runs, majority vote) is detailed in Section 3.4.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; the study uses a pre-existing benchmark dataset.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from BBQ input → model inference → reasoning step extraction → LLM-as-a-judge scoring → accuracy/bias computation is described across Sections 3–4.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No training data cutoff dates are stated for any of the evaluated models (DeepSeek, Llama, Qwen, OpenAI).", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "BBQ (published 2022) could be in any of the 2024–2025 model training corpora; the paper does not discuss or test for this possibility.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "BBQ predates all evaluated models' training, making contamination plausible, but the paper does not address whether BBQ examples were in the models' pretraining data.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "The paper mentions 'monetary cost' as a drawback of the LLM-as-a-judge approach (Section 6) but provides no actual cost figures for GPT-4o calls or model inference.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "NVIDIA A100 GPUs are mentioned but no total compute budget (GPU-hours, cost) is stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Reasoning-based models (DeepSeek-R1 variants) improve prediction accuracy compared to base instruction-tuned models but do not reduce social bias — in many categories they exhibit equal or worse bias scores.", + "evidence": "Table 1: DeepSeek-8B achieves highest accuracy in all 11 categories but shows similar or worse bias in 9/11 ambiguous categories vs. Llama-8B.", + "supported": "strong" + }, + { + "claim": "Biased reasoning steps are significantly more prevalent in incorrect predictions than in correct predictions across multiple BBQ categories.", + "evidence": "Figure 3 and Table 6 show higher average bias scores in subsets where the reasoning model is wrong (e.g., Age: 1.06 vs. 0.23; SES: 1.57 vs. 0.46; Religion: 1.55 vs. 0.64).", + "supported": "strong" + }, + { + "claim": "Removing biased reasoning steps (SfRP) and re-querying instruction-tuned models consistently improves accuracy on previously failed cases.", + "evidence": "Figure 5: SfRP improves accuracy by +0.517 (Case 1) and +0.717 (Case 3); even in both-fail cases accuracy increases by +0.100 and +0.526.", + "supported": "strong" + }, + { + "claim": "ADBP outperforms SfRP mitigation in most cases by using answer distribution shifts across incremental reasoning steps as a proxy for bias.", + "evidence": "Table 2: ADBP exceeds SfRP accuracy in Cases 1, 2, and 4 for both model families; ADBP corrects 38–60% of initially incorrect cases vs. SfRP's 24–44%. Exception is Qwen-32B Case 3.", + "supported": "moderate" + }, + { + "claim": "Ambiguity amplifies bias in reasoning models: reasoning-based models that outperform base models in disambiguated contexts often fail to do so in ambiguous contexts.", + "evidence": "Section 4.2: OpenAI o1, o1-mini, o3-mini underperform GPT-4o in ambiguous contexts; DeepSeek-32B fails to consistently outperform Qwen-32B under ambiguity in categories like Age and Physical Appearance.", + "supported": "moderate" + }, + { + "claim": "Bias tends to intensify in later reasoning steps: once a biased step appears, the model tends to persist along a faulty trajectory.", + "evidence": "Figure 3b and 3d show bias accumulating toward the end of reasoning chains for incorrect predictions, while correct predictions show isolated, non-propagating bias.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational", + "case-study" + ], + "key_findings": "Reasoning-based LLMs (DeepSeek-R1 variants, Marco-o1) improve accuracy on the BBQ social bias benchmark over base instruction-tuned models but do not mitigate social bias — in ambiguous contexts they often amplify stereotypes, particularly for categories like Age, Physical Appearance, and SES. Biased reasoning steps are strongly correlated with incorrect predictions, and removing them via SfRP consistently improves accuracy, supporting a causal role for biased reasoning in prediction errors. The proposed ADBP method, which uses answer distribution shifts across incremental reasoning steps as a bias proxy, outperforms SfRP and prompt-only debiasing baselines in most tested scenarios without requiring an external judge at inference time.", + "red_flags": [ + { + "flag": "LLM judge unvalidated", + "detail": "The paper explicitly acknowledges it did not conduct human annotation to validate the LLM-as-a-judge reliability; this oracle underpins the primary causal claim (bias → incorrect predictions) and the SfRP intervention." + }, + { + "flag": "No statistical significance testing", + "detail": "Comparative claims (ADBP outperforms SfRP, reasoning models amplify bias vs. base models) are made without any significance tests despite per-category sample sizes that would permit them." + }, + { + "flag": "BBQ contamination unaddressed", + "detail": "BBQ was published in 2022; all evaluated models (DeepSeek-R1, Llama-3.1, Qwen2.5) were trained after 2022. Potential benchmark contamination in pretraining data is not discussed." + }, + { + "flag": "Broad generalization from few models", + "detail": "Conclusions about 'reasoning-based LLMs' in general are drawn from 2–3 open-source model families on a single English-language benchmark, without the full DeepSeek-R1 or diverse proprietary models." + }, + { + "flag": "Depth analysis limited to 3/11 categories", + "detail": "The reasoning step bias analysis (Figure 3, Section 5.2) is demonstrated for only Age, Religion, and SES categories out of 11 BBQ categories, limiting generalizability of the mechanism claim." + } + ], + "cited_papers": [ + { + "title": "BBQ: A Hand-Built Bias Benchmark for Question Answering", + "relevance": "Primary evaluation benchmark; defines ambiguous/disambiguated context structure and bias scoring methodology used throughout." + }, + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "relevance": "Foundational CoT paper that motivates evaluating whether reasoning chains introduce bias." + }, + { + "title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", + "relevance": "Primary evaluated models (distilled variants DeepSeek-8B and DeepSeek-32B) are sourced from this work." + }, + { + "title": "Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models", + "relevance": "Provides the LLM-as-a-judge methodology adopted for per-step bias scoring." + }, + { + "title": "On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning", + "relevance": "Most closely related prior work showing CoT prompting can increase harmful outputs in sensitive domains." + }, + { + "title": "Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting", + "relevance": "Related prior work on CoT and gender bias; key differentiator is that this paper analyzes native reasoning chains rather than prompted CoT." + }, + { + "title": "Self-Debiasing Large Language Models: Zero-Shot Recognition and Reduction of Stereotypes", + "relevance": "One of the two mitigation baselines compared against ADBP in Figure 6." + }, + { + "title": "Evaluating and Mitigating Social Bias for Large Language Models in Open-Ended Settings", + "relevance": "Combined Debiasing Prompt baseline compared against ADBP in Figure 6." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "ADBP is a lightweight, annotation-free method applicable at inference time to any model that exposes reasoning traces, making it directly usable by practitioners deploying reasoning LLMs." + }, + "surprise_contrarian": { + "score": 3, + "justification": "The central finding — that reasoning capability improves accuracy but simultaneously amplifies social bias — directly contradicts the intuition that 'better reasoning = better alignment.'" + }, + "fear_safety": { + "score": 2, + "justification": "Demonstrates that widely deployed reasoning models (DeepSeek-R1, o1) systematically express stereotypes against protected groups in QA tasks, raising concrete deployment safety concerns." + }, + "drama_conflict": { + "score": 1, + "justification": "Challenges DeepSeek-R1 and OpenAI o1's implicit safety claims but without naming adversarial parties; low drama framing." + }, + "demo_ability": { + "score": 2, + "justification": "Code is released and models (DeepSeek-R1 distilled variants) are freely available on HuggingFace, enabling reproduction; BBQ dataset is public." + }, + "brand_recognition": { + "score": 2, + "justification": "DeepSeek-R1 and OpenAI o1/o3-mini are high-profile models; EMNLP venue adds credibility, though the institution (Santa Clara University) is not a top-tier AI lab." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43405094", + "title": "Politicians' misinformation behavior and public engagement, in 4 countries", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43405094", + "created_at": "2025-03-18T21:03:45Z" + } + ], + "top_points": 3, + "total_points": 3, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/domaineval-autoconstructed-benchmark-2024/scan-v5.json b/papers/domaineval-autoconstructed-benchmark-2024/scan-v5.json @@ -0,0 +1,414 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation", + "authors": [ + "Qiming Zhu", + "Jialun Cao", + "Yaojie Lu", + "Hongyu Lin", + "Xianpei Han", + "Le Sun", + "Shing-Chi Cheung" + ], + "year": 2024, + "venue": "AAAI Conference on Artificial Intelligence", + "arxiv_id": "2408.13204", + "doi": "10.48550/arXiv.2408.13204" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are directly supported by experimental results in Table 1. The 68.94% gap (Llama-2-13b), domain distribution, and Pass@1→Pass@5 dynamics are explicitly reported.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The claim 'generating more samples can increase overall performance' conflates correlation with causation. Pass@1 (greedy decoding) vs Pass@5 (sampling with temperature 0.2) are different metrics, so improvement may stem from sampling strategy, not sample count alone. No controlled ablation isolates sample quantity as the causal variable.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Claims generalize from 12 tested models to 'LLMs' broadly (e.g., 'LLMs are generally good at computation'). No discussion of whether findings apply to other model families, instruction-tuning approaches, or code lengths outside the restricted 3-100 line range.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Two case studies explain specific failures (RSA square root overflow, Linux locale-sensitive command parsing) but no systematic discussion of why cryptography/system domains are hard. Paper attributes failures to 'lack of domain knowledge' without ruling out training data distribution, context length, or task complexity confounds.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Pass@k (functional correctness via test suite execution) is a standard, well-justified proxy for code generation capability. The paper does not claim broader outcomes like 'production-readiness' or 'real-world utility.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section exists. The conclusion lists future work directions but not actual limitations of the benchmark, evaluation setup, or generalizability constraints.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats are discussed. The paper describes security mitigations (banned keywords) but not threats like LLM instruction-generation bias (Qwen2-72B used), test suite comprehensiveness, domain classification reliability, or selection bias in GitHub repos.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope boundaries are implicit (Python only, 3-100 lines, 6 specific domains, 12 instruction-tuned models, GitHub repos with ≥100 stars) but never formally stated. No explicit statement of what the benchmark does NOT measure or which model/language families are out-of-scope.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is disclosed. The paper neither states a funder nor declares itself unfunded independent work.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are clearly listed (Chinese Academy of Sciences, Hong Kong University of Science and Technology). No apparent conflicts with evaluated LLM companies.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder disclosed; independence cannot be assessed. Assumes unfunded if not stated.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is provided. The leaderboard platform (domaineval.github.io) is mentioned with no disclosure of financial stakes.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "Key terms like 'domain-specific code' are demonstrated via examples (Figure 1) and implicitly defined by the six domains, but not formally defined. 'Pass@k' is cited but not defined in the paper. 'Fully automated pipeline' is explained procedurally but not formally defined.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions are explicitly listed in the abstract: (1) the DOMAINEVAL dataset, (2) the fully automated test-guided construction pipeline, and (3) identification of LLM limitations in domain-specific code generation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The related work section engages substantively with 10+ prior benchmarks (HumanEval, MBPP, APPS, CoderEval, etc.), explicitly contrasting DOMAINEVAL's design (fully automated, multi-domain, real code from GitHub) vs. prior approaches (manual curation, single-domain, synthetic tasks, API-centric).", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper selects six domains referencing prior work (Zhuo et al. 2024) but does not argue why these domains collectively measure domain-specific code generation capability. Design goal (align with 'real-world code requirements') is stated, not construct validity justification.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "Figure 5 shows line count distribution by domain (4-198 lines, avg 55.69) and the paper restricts to 3-100 lines as 'appropriate difficulty,' but no explicit difficulty tiers (easy/medium/hard) are defined, measured, or validated via item-response analysis.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "Table 1 shows ceiling effects (computation: 82-91% for top models) and floor effects (Llama-2-13b at 12% on cryptography), but these are observed, not discussed. No analysis of whether ceiling/floor limits benchmark discriminability or benchmark redesign is warranted.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human baseline is provided. The paper reports only LLM performance without human reference data, making it impossible to assess whether 82% in computation or 33% in cryptography represents good or poor model capability.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "Pass@k is cited as standard (Chen et al. 2021) but not justified relative to alternatives (e.g., code similarity, syntax, runtime safety metrics). Import auto-completion in evaluation is a scoring decision ('tolerable flaw') that is not debated—missing imports may be a legitimate failure mode.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": false, + "justification": "The paper claims continuous updates via the automated pipeline resist data contamination (citing Cao et al. 2024b) but provides no explicit contamination-resistance design (e.g., temporal splits, canary strings, dynamic generation). GitHub repos are selected by star count with no date cutoff, risking overlap with pre-2024 training data.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of benchmark longevity, versioning strategy, gaming risk ('will LLMs memorize this in 6 months?'), or maintenance plan. The pipeline's scalability for future updates is mentioned, but not temporal robustness strategy.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "Two case studies (Figures 7-8) explain LLM failures, not benchmark failure modes. The paper does not discuss what the benchmark does NOT measure, biases in domain selection, gaps in test coverage, or how the benchmark could be misused.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "The leaderboard URL is provided, but the paper does not explicitly state whether code for the benchmark construction pipeline, evaluation harness, or baseline models is published or reproducible.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "Source repositories are listed by domain (Figure 2) but not version-pinned or linked. Collection methodology (test-method matching) is explained procedurally but lacks data quality metrics (e.g., how many candidates filtered at each step?). No formal data card provided.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The leaderboard (domaineval.github.io) is referenced but no license (MIT, Apache, CC-BY, etc.) is specified. Access terms are not stated—is the benchmark downloadable, or leaderboard-only evaluation?", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "General intended use is clear ('evaluate LLMs' domain-specific coding capabilities'), but specific constraints are not formally stated. No guidance on what conclusions should NOT be drawn (e.g., 'not for evaluating system design' or 'not applicable to non-instruction-tuned models').", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "LLMs are generally good at computation tasks (average 82.44% Pass@1) while falling short on cryptography (33.08%) and system coding (37.50%) domains.", + "evidence": "Table 1, Pass@1 column: computation domain macro-average 82.44%, cryptography 33.08%, system 37.50% across 12 models.", + "supported": "strong" + }, + { + "claim": "The performance gap between domains can be as much as 68.94% (80.94% - 12.0%) in Llama-2-13b-chat.", + "evidence": "Table 1, Pass@1: Llama-2-13b computation 80.94%, cryptography 12.0%.", + "supported": "strong" + }, + { + "claim": "Generating more samples increases overall LLM performance (Pass@1 53.42% → Pass@5 59.60%).", + "evidence": "Table 1, average rows: Pass@1 macro-average 53.42%, Pass@5 macro-average 59.60%.", + "supported": "strong" + }, + { + "claim": "Domain bias may increase with more samples; CodeLlama-13b-instruct standard deviation increases from 19.90 to 20.55.", + "evidence": "Table 1, Std column: CodeLlama-13b Pass@1 Std 19.90, Pass@5 Std 20.55.", + "supported": "moderate" + }, + { + "claim": "The fully automated pipeline provides contamination resistance by enabling continuous benchmark updates.", + "evidence": "Abstract and introduction claim updates via pipeline maintain 'integrity and novelty' citing Cao et al. 2024b; no empirical demonstration provided.", + "supported": "weak" + }, + { + "claim": "CodeLlama fine-tuning achieves 11.25% average improvement over Llama-2-13b (57.74% - 46.49% Pass@5) but domain gaps persist.", + "evidence": "Table 1, Pass@5: CodeLlama-13b 57.74%, Llama-2-13b 46.49%; Std values still substantial (20.55 vs 24.10).", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "DOMAINEVAL—a 2,454-problem multi-domain code generation benchmark across six Python domains—reveals stark LLM performance disparities: computation achieves 82.44% Pass@1 (ceiling effects at 90%+), while cryptography and system domains score 33.08% and 37.50% respectively, with gaps exceeding 68% in some models. Increasing sample count (Pass@1 to Pass@5) improves aggregate performance by ~6%, but paradoxically increases domain bias variability in some models (CodeLlama std increases from 19.90 to 20.55). Two case studies highlight specific failure modes—LLMs mishandle large-integer square roots in RSA attack tasks and fail to account for locale-dependent Linux command output—suggesting domain-specific knowledge gaps rather than general capability limitations.", + "red_flags": [ + { + "flag": "No human baseline", + "detail": "Without human performance reference, it is unclear whether 82% on computation or 33% on cryptography reflects LLM capability or artifacts of benchmark construction (e.g., instruction phrasing, test case design)." + }, + { + "flag": "Ceiling effects in computation domain", + "detail": "Computation tasks show ceiling effects (82-91% for top models), limiting discriminative power for ranking models in this domain. Benchmark redesign may be needed." + }, + { + "flag": "Contamination resistance unvalidated", + "detail": "Claims about resisting data contamination via continuous updates are speculative. GitHub repos with ≥100 stars lack date stamps, risking overlap with pre-2024 training data. No temporal validation provided." + }, + { + "flag": "No formal limitations section", + "detail": "Absence of explicit scope boundaries, threats to validity, or generalizability caveats. Readers cannot assess applicability to non-instruction-tuned models, other languages, or different code lengths." + }, + { + "flag": "LLM-generated instructions introduce bias", + "detail": "Instructions are generated by Qwen2-72B-Instruct, which may systematically bias task phrasing toward certain model families, favoring the Qwen and related model series." + }, + { + "flag": "Import auto-completion in evaluation", + "detail": "Missing imports are auto-completed during evaluation ('tolerable flaw'), changing the true failure modes. Models that forget imports are not penalized, inflating scores for careless implementations." + }, + { + "flag": "Domain selection not justified", + "detail": "Six domains are chosen by reference to prior work (Zhuo et al. 2024) but no argument is made for why these six construct domains collectively measure domain-specific capability or whether other domains should be included." + }, + { + "flag": "Limited model diversity", + "detail": "12 models tested are predominantly instruction-tuned variants (GPT, Qwen, CodeLlama, DeepSeek). Generalization to non-instruction-tuned, multilingual, or non-English models is unclear." + }, + { + "flag": "No disclosure of funding source", + "detail": "Funding source is not disclosed, raising questions about independence and whether institutional interests (e.g., promoting Chinese Academy of Sciences research) influence benchmark design." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code", + "relevance": "Foundational HumanEval benchmark for code generation evaluation; establishes Pass@k metric used in DOMAINEVAL." + }, + { + "title": "Program Synthesis with Large Language Models", + "relevance": "MBPP benchmark; prior code generation benchmark that DOMAINEVAL extends to multi-domain scenarios." + }, + { + "title": "Measuring Coding Challenge Competence With APPS", + "relevance": "APPS benchmark sourced from competitive programming; prior dataset for code generation evaluation." + }, + { + "title": "Concerned with Data Contamination? Assessing Countermeasures in Code Language Model", + "relevance": "Addresses data contamination threat in LLM benchmarks; cited to justify DOMAINEVAL's continuous-update contamination resistance claim." + }, + { + "title": "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions", + "relevance": "Related multi-domain benchmark exploring LLM capability across domains via API calls; differentiates from DOMAINEVAL's focus on direct implementation." + }, + { + "title": "ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation", + "relevance": "Class-level code generation benchmark; shows progression from function-level (HumanEval) to larger code structures." + }, + { + "title": "CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks", + "relevance": "Automated benchmark construction pipeline; directly related to DOMAINEVAL's fully-automated construction approach." + }, + { + "title": "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation", + "relevance": "Repository-level code generation; contrasts with DOMAINEVAL's function-level scope to clarify contribution boundaries." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Benchmark covers real Python packages (numpy, pandas, cryptography), so practitioners can see LLM capability on familiar libraries. Leaderboard enables direct tool comparison. However, no discussion of production integration or how results translate to development workflows." + }, + "surprise_contrarian": { + "score": 2, + "justification": "LLMs excelling at computation vs. struggling with cryptography aligns with intuition (training data bias toward tutorials). The observation that more samples increase domain bias contradicts scaling intuitions, but is based on one model's std deviation change and lacks systematic evidence." + }, + "fear_safety": { + "score": 1, + "justification": "Cryptography domain is security-adjacent, and failures in RSA and key derivation are concerning. However, the paper frames this as a capability gap, not a safety risk. No discussion of whether weak cryptography implementation should be flagged as dangerous." + }, + "demo_ability": { + "score": 2, + "justification": "The leaderboard (domaineval.github.io) allows live model submission and comparison. Practitioners can run their own models on the benchmark via the site. However, the paper does not describe how to download the benchmark locally or integrate it into development tools." + }, + "drama_conflict": { + "score": 1, + "justification": "No controversial claims or conflict angles. Straightforward benchmark paper with findings presented as technical observations rather than provocative conclusions." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are from Chinese Academy of Sciences (ISCAS) and Hong Kong University of Science and Technology—respected but not tier-1 labs (OpenAI, DeepMind, Meta AI). No prominent researchers named that would drive social media amplification." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "39831754", + "title": "GPT-4V(ision) Unsuitable for Clinical Care and Education: An Evaluation", + "points": 75, + "comments": 52, + "url": "https://news.ycombinator.com/item?id=39831754" + }, + { + "hn_id": "41663273", + "title": "Unsafe Impedance: Safe Languages and Safe by Design Software", + "points": 7, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=41663273" + }, + { + "hn_id": "40135927", + "title": "OpenAI: Training LLMs to Prioritize Privileged Instructions", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40135927" + }, + { + "hn_id": "41418082", + "title": "Data Exposure from LLM Apps: An In-Depth Investigation of OpenAI's GPTs", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41418082" + }, + { + "hn_id": "41408373", + "title": "Data Exposure from LLM Apps: An In-Depth Investigation of OpenAI's GPTs", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41408373" + }, + { + "hn_id": "39139543", + "title": "Exploring Parent's Needs for Children-Centered AI to Support Preschoolers", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=39139543" + }, + { + "hn_id": "37345839", + "title": "Relighting Neural Radiance Fields with Shadow and Highlight Hints", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37345839" + }, + { + "hn_id": "41227450", + "title": "Τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41227450" + }, + { + "hn_id": "40965488", + "title": "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40965488" + }, + { + "hn_id": "40157957", + "title": "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40157957" + } + ], + "top_points": 75, + "total_points": 96, + "total_comments": 54 + } +} +\ No newline at end of file diff --git a/papers/domainspecific-constitutional-ai-2025/scan-v5.json b/papers/domainspecific-constitutional-ai-2025/scan-v5.json @@ -0,0 +1,531 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Domain-Specific Constitutional AI: Enhancing Safety in LLM-Powered Mental Health Chatbots", + "authors": [ + "Chenhan Lyu", + "Yutong Song", + "Pengfei Zhang", + "Amir M. Rahmani" + ], + "year": 2025, + "venue": "International Conference on Wearable and Implantable Body Sensor Networks", + "arxiv_id": "2509.16444", + "doi": "10.1109/BSN66969.2025.11337405" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims about CAI improving safety are supported by experimental results shown in Tables II–III. Methodology for principle derivation is described and evaluation framework is established.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Four-condition experimental design (no CAI, vague CAI, specific CAI, larger model) supports causal claims about principle effects. Ablation study (Table III) further isolates contribution of specificity.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Claims bounded to mental health chatbots. Evaluation uses 100 queries on common scenarios (depression, anxiety, crises). Applicability to other medical specialties framed as future work.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Ablation compares specific vs vague principles but doesn't explore whether domain-specificity itself matters versus general specificity. No control comparing mental-health principles to domain-specific principles from another field.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Paper claims improvements in 'safety' and 'effectiveness' but measures evaluator-scored responses against five rubric guidelines. No discussion of whether these scores translate to actual harm reduction or real-world safety.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Single-sentence mention in conclusion ('static principles may not adapt to evolving guidelines') does not constitute structured limitations discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Threats to validity not specifically discussed. No inter-rater agreement metrics, sample size justification for 100 queries, or evaluator bias analysis. Evaluator qualifications vaguely described as 'trained evaluators' and 'health experts'.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope boundaries not explicitly stated. No discussion of which model sizes apply, which mental health conversation types were tested, or what scenarios were excluded.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source disclosed. No acknowledgments section or grant information provided. Absence of disclosure is a red flag.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors listed as UC Irvine. Developing a method rather than evaluating proprietary product, so no direct conflict.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "No funder identified; cannot assess independence.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial disclosures provided.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Constitutional AI explained as 'self-critique and revision guided by explicit principles.' Domain-specific principles illustrated in Table I with concrete examples (e.g., 'Use professional help for serious mental health concerns').", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions stated: (1) domain-specific principle design, (2) quantitative evaluation comparing principles, (3) demonstration that smaller aligned models outperform larger unaligned models.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Introduction engages prior CAI work (Bai et al.), specific-vs-general debate (Kundu et al.), and identifies gap: 'no research has compared constitutional principles explicitly derived from domain-specific mental health guidelines.' Clear positioning relative to existing work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository, GitHub link, or implementation details provided. Methods are conceptual, not reproducible from paper.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Training dataset MentalChat16K publicly available on HuggingFace (reference [15]). Dataset is accessible.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Model architecture specified (LLaMA 3.2, 1B and 3B) but no requirements.txt, dependency versions, Python version, or environment specifications.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Methods describe CAI training conceptually but provide no step-by-step reproduction instructions or training script sufficient to reimplement.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table II and Figure 2 report single point estimates (e.g., 6.47, 5.50) with no confidence intervals, standard deviations, or error bars.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Multiple comparative claims made (e.g., '46.7% increase', '31.7% advantage') but no statistical significance tests, p-values, or hypothesis tests reported.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage improvements reported (e.g., 'Guideline 1 improves 4.41→6.47, 46.7% increase'). Baseline context provided.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "5000 rows sampled for training, 100 queries for evaluation. No sample size justification or power analysis.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Single scores per model per guideline with no standard deviation, range, or indication of variance across runs.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four conditions: (1) no CAI baseline, (2) vague CAI, (3) specific CAI, (4) larger 3B model without CAI. Multiple baselines for comparison.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "Baselines are only internal LLaMA 3.2 variants. No comparison to published mental health chatbots or other safety training methods (RLHF, DPO).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section III.D ablation: replacing two specific principles with vague ones (24.08→19.45) isolates contribution of principle specificity.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Five evaluation guidelines used (Table I), each scored 1–10. Per-guideline breakdowns in Table II and Figure 2.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Trained evaluators scored responses 1–10 using detailed rubrics aligned with clinical best practices. Health experts provided ground-truth responses.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "100 evaluation queries used but no explicit confirmation they are held-out from 5000 training examples. Both from MentalChat16K; train-test split not documented.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table II breaks scores by five guidelines. Figure 2 provides per-guideline bar charts. Figure 3 includes radar visualization.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No failure cases shown or analyzed. No qualitative error analysis or discussion of underperformance scenarios.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "Proposed method shows improvements in all comparisons. Ablation (vague vs specific) supports positive claim but is not framed as independent negative result.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Model family specified (LLaMA 3.2, 1B and 3B) but no exact checkpoint version, snapshot date, or training cutoff.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Conceptual SFT template given ('Critique this response against these principles: [principle text]'). Table I shows principles but complete RLAIF prompts not provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Only sparse hyperparameters: 5000 samples, 2 response pairs per example, 'early stopping.' No learning rate, batch size, optimizer, epochs, or stopping criteria.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "CAI scaffolding described: two-phase training (SFT for self-critique + RLAIF), chain-of-thought reasoning about principle conformance before revision.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Only 'sampling 5000 rows' mentioned. No filtering criteria, data cleaning steps, or preprocessing documented.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "MentalChat16K publicly available on HuggingFace. Expert ground-truth responses not released but evaluation uses expert-provided benchmarks.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": false, + "justification": "Paper uses external MentalChat16K but does not document its collection. Details are in reference [15], not this paper.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Evaluators ('trained evaluators', 'health experts') not characterized. No number of evaluators, expertise criteria, or recruitment process specified.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "High-level pipeline stated (sample 5000 → SFT → RLAIF → evaluate 100 queries) but no detailed filtering logic, preprocessing, or sampling procedure documented.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "LLaMA 3.2 pretraining cutoff date not stated. Matters for whether evaluation queries could be in pretraining data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of overlap between 5000 fine-tuning examples and 100 evaluation queries. Both from MentalChat16K; no confirmation of train-test separation.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "No contamination analysis between pretraining data and evaluation set. No discussion of MentalChat16K's timing relative to LLaMA 3.2 pretraining.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subject research; only model evaluation with human raters.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; evaluation uses expert raters, not subject research.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participant demographics; evaluators only.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Not applicable; model evaluation, not human subject research.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human randomization.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participant blinding.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participant attrition.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference latency, cost, or computational requirements reported. Paper motivates resource-constrained settings but provides no actual metrics.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No training time, GPU hours, or computational budget reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Domain-specific constitutional principles improve mental health chatbot safety by 31.7% compared to vague general principles", + "evidence": "Table II: specific principles total score 24.08 vs vague/general 18.29; ablation (Table III) confirms 19.2% reduction when specificity removed", + "supported": "moderate" + }, + { + "claim": "1B-parameter models trained with domain-specific CAI outperform unprincipled 3B models", + "evidence": "Table II: 1B specific (24.08) > 3B no-CAI (19.92); discussion claims smaller principled models consistently outperform larger unprincipled ones", + "supported": "strong" + }, + { + "claim": "Specific constitutional principles deliver exceptional improvements for crisis intervention (153–158% on crisis guidelines)", + "evidence": "Table II Guidelines 3 and 4: baseline 1.06→2.69 (153.8%), 1.13→2.91 (157.5%); ablation confirms vague principles underperform on crisis response (Table III)", + "supported": "strong" + }, + { + "claim": "Explicit mental health-specific principles are essential; vague principles allow interpretive flexibility causing inconsistent crisis responses", + "evidence": "Discussion: 'Vague/general formulations allow interpretive flexibility...leading to inconsistent outputs.' Ablation shows performance loss with vague principles.", + "supported": "moderate" + }, + { + "claim": "Domain-specific CAI enables practical deployment in resource-constrained healthcare environments", + "evidence": "1B model with specific CAI outperforms 3B unaligned; discussion motivates healthcare deployment. However, no actual cost/latency/resource metrics provided.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "Constitutional AI training with domain-specific mental health principles significantly improves safety metrics (24.08 total score) over no CAI (13.74) and vague principles (18.29). A 1B-parameter model trained with specific principles outperforms an unprincipled 3B model, suggesting principled alignment may matter more than scale for constrained healthcare settings. Crisis intervention showed the largest gains (153–158% on crisis guidelines), indicating explicit resource provision and professional referral principles are critical for high-stakes scenarios.", + "red_flags": [ + { + "flag": "No statistical significance testing", + "detail": "Improvements reported as percentages without p-values or confidence intervals. Cannot determine if 46.7% gains on individual guidelines are statistically significant or noise." + }, + { + "flag": "No inter-rater reliability reported", + "detail": "Human evaluators scored outputs but no inter-rater agreement metrics (Kappa, ICC) provided. Evaluator disagreement could dominate claimed effect sizes." + }, + { + "flag": "Evaluators not characterized", + "detail": "Described only as 'trained evaluators' and 'health experts.' Number of raters, expertise level, training process, and eligibility criteria not specified." + }, + { + "flag": "Small evaluation set without justification", + "detail": "Only 100 mental health queries evaluated. No sample size justification, power analysis, or coverage analysis of mental health scenario diversity." + }, + { + "flag": "No comparison to published baselines", + "detail": "Only compares internal variants of LLaMA 3.2. No comparison to published mental health chatbots or alternative safety methods (RLHF, DPO)." + }, + { + "flag": "Safety claims not tied to real-world outcomes", + "detail": "Claims 'safety improvements' but measures evaluator scores against rubrics. No evidence scores translate to reduced harm, accurate diagnoses, or better clinical outcomes." + }, + { + "flag": "Code and hyperparameters not disclosed", + "detail": "Implementation not released. Sparse hyperparameters (no learning rate, batch size, optimizer, stopping criteria) make independent replication infeasible." + }, + { + "flag": "No variance or uncertainty quantification", + "detail": "Single point estimates reported. No error bars, standard deviations, or indication of run-to-run variance. Unclear if single training run or averaged over multiple seeds." + }, + { + "flag": "Train-test contamination not addressed", + "detail": "Both 5000 training examples and 100 evaluation queries from MentalChat16K. No confirmation of held-out evaluation set or overlap analysis." + }, + { + "flag": "No funding disclosure", + "detail": "No acknowledgments or funding source stated. Raises questions about potential undisclosed support or institutional constraints." + } + ], + "cited_papers": [ + { + "title": "Large language models for mental health applications: Systematic review", + "relevance": "Systematic review of LLM mental health applications; establishes domain landscape and motivates domain-specific safety" + }, + { + "title": "The opportunities and risks of large language models in mental health", + "relevance": "Reviews LLM opportunities and risks in mental health; motivates need for specialized guardrails beyond generic AI safety" + }, + { + "title": "Constitutional ai: Harmlessness from ai feedback", + "relevance": "Foundational Constitutional AI methodology that this paper adapts and builds upon" + }, + { + "title": "Specific versus general principles for constitutional ai", + "relevance": "Directly relevant prior work comparing principle specificity in CAI; this paper extends to domain-specific principles" + }, + { + "title": "A comprehensive survey of llm alignment techniques", + "relevance": "Surveys alignment methods including RLAIF used in the paper's training pipeline" + }, + { + "title": "Building guardrails for large language models", + "relevance": "Relevant to guardrail design and safety constraints for LLM deployment" + }, + { + "title": "Building trust in mental health chatbots: Safety metrics and llm-based evaluation tools", + "relevance": "Directly addresses safety metrics and evaluation frameworks for mental health chatbots" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Mental health chatbots have direct healthcare application; LLaMA 3.2 is publicly available. However, evaluation is synthetic, not real-world deployment." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Smaller models beating larger models is somewhat notable, but domain-specific principles outperforming generic ones is expected and incremental." + }, + "fear_safety": { + "score": 2, + "justification": "Mental health AI safety is a legitimate concern; paper highlights risks (misdiagnosis, harm escalation) but does not definitively resolve them." + }, + "drama_conflict": { + "score": 0, + "justification": "Mental health is sensitive but paper is methodical and technical; no controversial findings or conflict angles." + }, + "demo_ability": { + "score": 1, + "justification": "Uses public LLaMA 3.2 and MentalChat16K, but code not released; reimplementation from scratch would be required." + }, + "brand_recognition": { + "score": 1, + "justification": "UC Irvine is known but not a top-tier AI lab. IEEE BSN is a specialized venue with lower visibility than major conferences." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "41671808", + "title": "First Past the Post: Evaluating Query Optimization in MongoDB", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41671808", + "created_at": "2024-09-27T15:36:58Z" + }, + { + "hn_id": "45220460", + "title": "Perihelion precession of planetary orbits solved from quantum field theory", + "points": 3, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=45220460", + "created_at": "2025-09-12T09:48:24Z" + }, + { + "hn_id": "45302119", + "title": "VCBench: Benchmarking LLMs in Venture Capital", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45302119", + "created_at": "2025-09-19T14:32:42Z" + } + ], + "top_points": 4, + "total_points": 8, + "total_comments": 4 + } +} +\ No newline at end of file diff --git a/papers/dont-always-pick-2026/scan-v5.json b/papers/dont-always-pick-2026/scan-v5.json @@ -0,0 +1,314 @@ +{ + "scan_version": 5, + "paper_type": "theoretical", + "paper": { + "title": "Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection", + "authors": [ + "Yigit Turkmen", + "Baturalp Buyukates", + "Melih Bastopcu" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.08003", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims — Gaussian-copula modeling of correlated errors, information-theoretic error floor (Theorem 4.4), greedy MI algorithm, and consistent outperformance over baselines — are substantiated by proofs and experiments across three datasets.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Performance improvement claims are supported by controlled comparisons with identical query budgets across three benchmarks, three temperature settings, and five random splits per run; the mechanism is also explained theoretically via Theorem 4.3.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 8 explicitly bounds generalization to binary decision settings; Theorem 4.4 is conditioned on equicorrelated Gaussian structure; empirical results note limited gains in the high-correlation IMDB regime (ρ=0.90).", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explains why mRMR fails (penalizes structured error correlation it should exploit, Section 4.2), why performance degrades at large k (MAP estimator's exponential 2^k pattern space), and why IMDB gains are modest (near-uniform high correlation).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper directly measures test error probability, which is exactly the quantity claimed to be minimized; no proxy substitution occurs.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 8 'Limitations and Discussion' is a dedicated section addressing the binary decision setting restriction, Gaussian-copula model scope, and saturation effects.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are identified: binary classification restriction, MAP estimator degradation at large k due to sparse pattern estimation over 2^k outcomes, and IMDB results showing limited gains in near-uniform high-correlation regimes (ρ=0.90).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it focuses on binary decision settings (Section 8), Theorem 4.4 is conditioned on equicorrelated ensembles, and the contribution is framed as a 'foundational step' rather than a general solution.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed in the footnote: 'This work was supported by Tubitak 2232-B program (Project No:124C533).'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated: Bilkent University, Ankara (Turkmen and Bastopcu) and University of Birmingham (Buyukates).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Tubitak is the Turkish government scientific research council, independent of any LLM vendor or commercial interest evaluated in the paper.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement (patents, equity, consulting) is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are formally defined: budgeted ensemble selection problem (Section 3.3, Equation 4), Gaussian-copula error model (Section 3.1), MAP estimator (Equation 3), mutual information gain (Equation 7), and error indicator variable.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1 lists five explicit bullet-point contributions including Theorem 4.1 (independence optimality), Theorem 4.3 (MI decomposition), Theorem 4.4 (saturation limit), the Greedy MI algorithm, and empirical evaluation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 explicitly engages with mRMR (showing why it doesn't transfer via Theorem 4.3), LLM-TOPLA, MUSE, self-consistency, and FrugalGPT, explaining this work's distinction as selection-focused rather than aggregation-focused.", + "source": "haiku" + } + } + }, + "type_checklist": { + "theoretical": { + "formal_quality": { + "assumptions_stated_explicitly": { + "applies": true, + "answer": true, + "justification": "All assumptions are explicitly stated: balanced prior P(Y=±1)=0.5 (Section 3), independent BSC errors for Theorem 4.1, label-invariant error assumption for simplified Theorem 4.3, and equicorrelated Gaussian (uniform ρ) for Theorem 4.4.", + "source": "haiku" + }, + "proofs_complete_or_sketched": { + "applies": true, + "answer": true, + "justification": "All four main theorems (4.1, 4.3, 4.4, D.1) have complete step-by-step proofs in Appendices A–D, with supporting definitions and lemmas (BSC degradation, chain rule, entropy invariance under bijection).", + "source": "haiku" + }, + "bounds_tight_or_discussed": { + "applies": true, + "answer": true, + "justification": "Remark A.5 explicitly discusses tightness of the Theorem 4.1 bound (equality when S is exactly the k-smallest-error channels); Theorem 4.4's saturation floor is tight under uniform correlation; Remark D.2 gives a (1−1/e) approximation guarantee for the greedy approach.", + "source": "haiku" + }, + "counterexamples_explored": { + "applies": true, + "answer": true, + "justification": "Figure 2 gives a concrete counterexample where Top-k fails (four GPT models fail together at 81% avg accuracy while a diverse 72% ensemble succeeds); Example A.1 in Appendix A illustrates stochastic degradation; IMDB explores the limiting case of near-uniform high correlation.", + "source": "haiku" + }, + "notation_consistent": { + "applies": true, + "answer": true, + "justification": "Notation is consistent throughout (Xj for predictions, Ej for errors, Y for label, S for subsets, ρ for correlation); the dual use of α for Laplace smoothing and accuracy is explicitly flagged in Algorithm 2 with a parenthetical note.", + "source": "haiku" + }, + "constructive_vs_existence_noted": { + "applies": true, + "answer": true, + "justification": "Theorem 4.1 proof is explicitly constructive (explicit bijection and coupling construction); Theorem 4.4 provides a closed-form computable formula; Algorithm 1 gives a constructive greedy procedure that can be directly implemented.", + "source": "haiku" + } + }, + "connections": { + "connection_to_practice_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly targets the 'practical budget constraint' regime (k=3–7), provides complete implementation details (Algorithms 1–6 with complexity analysis), evaluates on real LLM API calls across three benchmarks, and discusses deployment cost/latency tradeoffs.", + "source": "haiku" + }, + "relationship_to_prior_work_clear": { + "applies": true, + "answer": true, + "justification": "Section 4.2 and Theorem 4.3 explicitly show why mRMR does not transfer to ensemble selection (additional I(Ej;ES) error correlation term); Section 2 positions against LLM-TOPLA, MUSE, self-consistency, and FrugalGPT with clear distinctions.", + "source": "haiku" + }, + "computational_complexity_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix E provides explicit complexity analysis: MI estimation O(N + KAKB), MAP aggregation O((Ntr+Nte)k + 2^k); the exponential 2^k MAP term is identified as the reason for performance degradation at large k.", + "source": "haiku" + }, + "limitations_of_formal_model_stated": { + "applies": true, + "answer": true, + "justification": "Section 8 explicitly states the Gaussian-copula model may not capture all dependency structures, binary classification is a simplification, and the uniform pairwise correlation assumption in Theorem 4.4 is an idealization of the full model.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "When LLM errors are independent, the optimal ensemble selects the k most accurate models (Top-k is optimal in both MI and error probability).", + "evidence": "Theorem 4.1 proves this via stochastic degradation and the data processing inequality; the proof is constructive, establishing an explicit Markov chain Y→XHk→XS for any competing subset S.", + "supported": "strong" + }, + { + "claim": "Under correlated LLM errors (Gaussian-copula with uniform ρ), there exists a fundamental non-vanishing error floor as ensemble size grows to infinity.", + "evidence": "Theorem 4.4 derives the closed-form limit lim P(error) = Φ(Φ^{-1}(1−α)/√ρ) > 0 for any ρ > 0, α > 1/2; validated empirically by performance plateaus in Figures 5–6.", + "supported": "strong" + }, + { + "claim": "Greedy MI selection consistently outperforms Top-k and mRMR-style selection under identical query budgets.", + "evidence": "MEDMCQA: best error 16.3% vs. 17.0% for Top-k at k=5; MMLU: 14.1% vs. 14.9% at k=6; improvements hold across 30 evaluations (3 temperatures × 2 runs × 5 folds) per dataset.", + "supported": "strong" + }, + { + "claim": "The mRMR feature selection principle does not transfer to LLM ensemble selection.", + "evidence": "Theorem 4.3 shows marginal information gain has an additional I(Ej;ES) term absent from mRMR; empirically, mRMR (Terms 1+2) reaches 0.264 error at k=2 under majority voting on MEDMCQA vs. 0.171 for Greedy MI.", + "supported": "strong" + }, + { + "claim": "Gaussian-copula accurately models real LLM error dependencies, including higher-order simultaneous error distributions.", + "evidence": "Pairwise scatter plots (Figures 4, 10, 15, 20) show tight diagonal alignment; simultaneous error histograms (Figures 11, 16, 21) match copula predictions; validated across 3 datasets and 6 temperature-run conditions.", + "supported": "moderate" + }, + { + "claim": "Correlated errors from same model families explain Top-k's failure; cross-family diversity with maintained accuracy is the remedy.", + "evidence": "Tables 1–2 show Greedy MI selects models from OpenAI, Qwen, Moonshot, Google with moderate cross-family correlations (ρ≈0.4–0.5) vs. Top-k stacking multiple OpenAI models with high within-family correlations (ρ≈0.7–0.8).", + "supported": "moderate" + } + ], + "methodology_tags": [ + "theoretical", + "benchmark-eval" + ], + "key_findings": "The paper provides a rigorous information-theoretic analysis of LLM ensemble selection under query budgets. The central theoretical result (Theorem 4.1) proves Top-k accuracy selection is optimal only when errors are independent — its failure in practice arises entirely from correlation structure. Theorem 4.4 establishes an explicit, unavoidable performance floor under correlated ensembles: lim P(error) = Φ(Φ^{-1}(1−α)/√ρ) > 0, meaning scaling ensemble size cannot overcome shared latent difficulty. The proposed Greedy MI algorithm, motivated by a novel Accuracy-Redundancy-Error decomposition (Theorem 4.3), consistently outperforms Top-k and mRMR-style selection in the practical mid-budget regime (k=3–7) across MEDMCQA and MMLU, while gains are limited on IMDB (ρ=0.90) consistent with the saturation theorem.", + "red_flags": [ + { + "flag": "Binary classification restriction", + "detail": "All theoretical results and empirical evaluations are restricted to binary (true/false) outputs; applicability to multi-class or open-ended generation tasks — far more common in real LLM deployments — is undemonstrated and likely requires significant theoretical extension." + }, + { + "flag": "MAP estimator conflates selection and aggregation quality at large k", + "detail": "At large k, all methods degrade due to MAP estimator's exponential 2^k pattern space; this makes it impossible to isolate whether performance differences at large k reflect selection quality or estimator limitations, limiting the validity of large-k comparisons." + }, + { + "flag": "Balanced prior assumption throughout", + "detail": "The Theorem 4.4 derivation and experimental binary conversion both assume P(Y=+1)=P(Y=−1)=0.5; the MEDMCQA conversion creates artificial balance by pairing each question with exactly one correct/incorrect answer, which may not reflect natural query distributions." + }, + { + "flag": "No competing interests declaration", + "detail": "The paper does not include a competing interests or financial interests statement despite evaluating commercial models (GPT-5, Claude, Gemini) through a commercial API aggregator (OpenRouter)." + } + ], + "cited_papers": [ + { + "title": "Why do multi-agent LLM systems fail?", + "relevance": "Identifies inter-agent misalignment and correlated errors as dominant multi-agent failure modes, directly motivating the ensemble correlation problem studied here." + }, + { + "title": "Towards a science of scaling agent systems", + "relevance": "Documents diminishing/negative returns from LLM coordination above ~45% single-agent accuracy, consistent with the correlation-induced saturation theorem." + }, + { + "title": "FrugalGPT: How to use large language models while reducing cost and improving performance", + "relevance": "Addresses cost-performance tradeoffs via cascaded LLM selection, a closely related approach to budgeted ensemble selection." + }, + { + "title": "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy (mRMR)", + "relevance": "The mRMR criterion is the primary baseline the paper formally shows does not transfer to ensemble selection due to the additional error correlation structure." + }, + { + "title": "Self-consistency improves chain of thought reasoning in language models", + "relevance": "Popularized majority voting for single-model sampling; extended to multi-model ensembling as one of the baseline aggregation methods evaluated." + }, + { + "title": "LLM-TOPLA: Efficient LLM ensemble by maximising diversity", + "relevance": "Introduces focal diversity metrics for ensemble pruning — a competing diversity-based approach to the greedy MI selection proposed here." + }, + { + "title": "Simple yet effective: An information-theoretic approach to multi-LLM uncertainty quantification (MUSE)", + "relevance": "Applies Jensen-Shannon divergence to select well-calibrated LLM subsets — a related but distinct information-theoretic ensemble selection approach." + }, + { + "title": "Conditional likelihood maximisation: A unifying framework for information theoretic feature selection", + "relevance": "Unifies mRMR variants under a common framework; the paper extends this by identifying why these criteria fail for ensemble selection (missing I(Ej;ES) term)." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Practitioners building multi-LLM pipelines can directly apply the greedy MI algorithm with a labeled calibration set, though the binary classification restriction limits immediate deployment in most real-world generative use cases." + }, + "surprise_contrarian": { + "score": 3, + "justification": "The title and core result directly challenge the intuitive 'pick the best model' heuristic with a formal proof, showing that accuracy alone is suboptimal and that moderately-accurate diverse models can outperform high-accuracy correlated ones." + }, + "fear_safety": { + "score": 0, + "justification": "The paper does not address safety, alignment, or risk concerns; it is a technical optimization paper on ensemble selection." + }, + "drama_conflict": { + "score": 1, + "justification": "The paper challenges the common Top-k heuristic and shows mRMR fails, but there is no major ongoing controversy or heated debate being adjudicated." + }, + "demo_ability": { + "score": 1, + "justification": "The algorithm is implementable with API access to multiple LLMs and a labeled evaluation set, but the multi-model API costs and binary classification constraint create significant friction for casual demonstration." + }, + "brand_recognition": { + "score": 0, + "justification": "Authors are from Bilkent University and University of Birmingham — academic institutions without strong LLM brand recognition; no famous lab or product affiliation." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47370450", + "title": "End-to-End Hardware-Driven Graph Preprocessing for Enhanced GNN Performance", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47370450", + "created_at": "2026-03-13T21:51:18Z" + } + ], + "top_points": 5, + "total_points": 5, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dover-interventiondriven-auto-2025/scan-v5.json b/papers/dover-interventiondriven-auto-2025/scan-v5.json @@ -0,0 +1,576 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems", + "authors": [ + "Ming-Jie Ma", + "Jue Zhang", + "Fangkai Yang", + "Yu Kang", + "Qingwei Lin", + "Saravan Rajmohan", + "Dongmei Zhang" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2512.06749", + "doi": "10.48550/arXiv.2512.06749" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 18–28% flip rate and 49% GSMPlus recovery are supported by Tables 2–3; the 30–60% hypothesis validation range matches Table 3 (validated+refuted across datasets).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claim that DoVer interventions cause failure recovery is supported by comparison against Self-Refine and CRITIC baselines both achieving 0% recovery vs DoVer's 17.6–27.5%, and by ablation studies varying models.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 7 explicitly states results 'should be interpreted as evidence of feasibility rather than universal guarantees' and enumerates specific constraints on covered frameworks, task types, and architectures.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.5 discusses the 29–67% inconclusive cases as arising from sub-agent capability gaps rather than incorrect hypotheses, and Section 3 discusses multiple competing sources of annotation uncertainty.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper carefully distinguishes Trial Success Rate (task completion), Progress Made (milestone advancement), and hypothesis validation (Validated/Refuted/Inconclusive), with explicit acknowledgment that LLM-as-a-judge evaluation may introduce biases.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 is explicitly titled 'LIMITATIONS AND GENERALIZABILITY' and spans over a full page with specific discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include: restriction to two agent frameworks, requirement for checkpoint/replay interfaces, interventions limited to orchestrator text messages (cannot modify sub-agent code), and LLM-as-a-judge bias in milestone and validation assessments.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Section 7 explicitly states the work does not cover 'long-running production workloads, domains with strict latency or cost constraints, or settings with safety-critical requirements' and that checkpointing requires 'non-trivial engineering effort.'", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "The acknowledgements section thanks reviewers and collaborators but contains no funding disclosure statement.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the title page: Chinese Academy of Sciences and Microsoft; Microsoft employees evaluate primarily Microsoft's Magentic-One and AutoGen2 frameworks.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "The majority of authors are from Microsoft and the primary evaluation framework (Magentic-One) and secondary framework (AG2/AutoGen2) are both Microsoft products, creating a direct conflict between funder affiliation and outcome.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "'Failure' is precisely defined (executes without interruption but produces incorrect/unsatisfactory results), 'Trial' is defined as a contiguous planning–execution span, and intervention categories (orchestrator_ledger, orchestrator_instruction, subagent_instruction) are enumerated.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are itemized in the introduction: (i) the DoVer framework, (ii) analysis of ground-truth annotation uncertainty, (iii) experimental demonstration of failure recovery.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 provides structured related work distinguishing failure-attribution work from debugging/repair work, and Section 5.3 compares against Self-Refine and CRITIC; the paper also explicitly positions against the concurrent Who&When attribution approach.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "The abstract states 'Project website and code will be available at https://aka.ms/DoVer' — a future-release promise, not a current release; the anonymous repository referenced in Appendix C is not a public release.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All evaluation datasets (GAIA, AssistantBench, GSMPlus) are publicly available standard benchmarks; the WW dataset from Zhang et al. (2025c) is published.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper specifies model versions (GPT-4o-20241120, GPT-5-chat-20250807, Qwen3-8B/32B) and mentions 'Azure OpenAI using default parameters,' but provides no requirements file, Dockerfile, or comprehensive dependency specification.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided; the code is not yet released and Appendix C describes integration effort at a high level without runnable instructions.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 5 (reproduction study) reports standard deviations, but Tables 2 and 3 (main DoVer results) report no CIs or error bars despite the paper stating three independent runs were performed per intervention.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims; performance differences are reported as raw percentages without testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are reported as flip rates (17.6%, 27.5%, 49%) with clear baseline context (0% for Self-Refine/CRITIC), and milestone progress is quantified as percentage gain.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Sample sizes are small (26–45 cases per benchmark split) and no power analysis or justification for these sizes is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Main results tables (2 and 3) report only point estimates; variance across the three independent runs per intervention is not reported.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Section 5.3 compares against Self-Refine-style and CRITIC-style baselines, both achieving 0% recovery on WW-GAIA.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Self-Refine (2023) and CRITIC (2023) are the standard self-improvement paradigm comparators; they are reasonable contemporaries for the self-correction approach, though not specifically designed for multi-agent debugging.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 4 ablates DoVer's underlying model (Qwen3-8B, Qwen3-32B vs GPT-4o) and prompting strategy (0-shot vs 3-shot), demonstrating component contributions.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Evaluation uses Trial Success Rate, Progress Made (milestone advancement), and a four-category hypothesis validation taxonomy (Validated/Partially Validated/Refuted/Inconclusive).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "No human evaluation of DoVer's outputs is performed; milestone evaluation and hypothesis validation both use LLM-as-a-judge (GPT-5 specified in Section 5.1).", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "GAIA Level-1 validation set cases not in WW provide a held-out evaluation, and all benchmark cases are independent of model training data in principle.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by dataset (WW-AB, WW-GAIA, GAIA-Level-1, GSMPlus) and by hypothesis outcome category (Validated/Inconclusive/Partially Validated/Refuted) in Tables 2 and 3.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5.4 presents qualitative case studies for Refuted and Inconclusive outcomes; Section 5.5 analyzes the 29–67% inconclusive rate and identifies specific sub-agent bottlenecks (missing scroll-to-bottom tool, PDF handling).", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "WW-AB Progress Made is reported as '+0%' (interventions may hinder progress), 60–67% inconclusive rate in WW is reported honestly, and Self-Refine/CRITIC 0% recovery is explicitly stated.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions are specified: 'GPT-4o-20241120' and 'GPT-5-chat-20250807' in Section 3 footnote; Qwen3-8B and Qwen3-32B in Table 4.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Appendix B provides all six prompts in full: Trial Segmenter (Fig. 5), Failure Proposer (Figs. 6–7), Intervention Recommender (Fig. 8), Milestone Extractor (Fig. 9), Milestone Evaluator (Fig. 10), and Post-Intervention Classifier (Fig. 11).", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "The paper states 'All LLM API calls are made through Azure OpenAI using default parameters' but does not specify what those defaults are (temperature, top-p, max tokens, etc.).", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 4 describes the DoVer pipeline in detail (trial segmentation, failure attribution, intervention generation, execution); Appendix C describes the checkpointing/replay integration for AG2.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 5.1 describes failure trace collection ('initial run over all cases to identify failure traces'), explains why WW/MAST logs are not directly usable, and documents checkpoint-based re-collection.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The collected failure traces with checkpoints are not released; code is promised as future release and the anonymous repository is not a public release.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 5.1 describes the data collection procedure: initial execution runs to identify failures, checkpoint capture at each step, and why existing WW/MAST logs required re-collection.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; all evaluation uses standard benchmarks (GAIA, AssistantBench, GSMPlus).", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from initial run → failure identification → trial segmentation → hypothesis generation → intervention → re-execution → scoring is documented across Sections 4–5 and Appendix C.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The training data cutoffs for GPT-4o-20241120 and GPT-5-chat-20250807 are not stated; GAIA is a public benchmark that may be in training data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of potential overlap between GAIA/AssistantBench benchmark examples and GPT-4o or GPT-5 training data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "GAIA and AssistantBench are publicly available benchmarks predating GPT-4o's training cutoff; the paper does not address whether benchmark examples were seen during pretraining.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost or API cost estimates are reported despite using GPT-4o and GPT-5 for all runs, including three independent repeats per intervention across hundreds of trials.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total compute budget or wall-clock time is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DoVer recovers 18–28% of failed trials on GAIA and AssistantBench under the Magentic-One framework.", + "evidence": "Table 2 reports 17.6% for WW-AB/WW-GAIA combined and 27.5% for GAIA-Level-1.", + "supported": "moderate" + }, + { + "claim": "DoVer achieves 49% trial success rate on GSMPlus with the AG2/AutoGen2 framework, demonstrating generality.", + "evidence": "Table 2, GSMPlus row: 198 intervened trials, 49.0% success rate.", + "supported": "moderate" + }, + { + "claim": "Log-based failure attribution suffers from substantial ground-truth annotation uncertainty (~48% of examined cases).", + "evidence": "Section 3 reports 14 of 29 GAIA cases in WW exhibit GT uncertainty; annotator initial disagreement of ~20% reported by WW itself.", + "supported": "moderate" + }, + { + "claim": "Prompt refinements (step indexing + guidance reminders) improve GPT-4o step attribution accuracy from 6% to 24% on WW-HC.", + "evidence": "Table 5: baseline GPT-4o 6.04% step accuracy; +Step Index 20.69%; +Guidance 23.56%.", + "supported": "strong" + }, + { + "claim": "Self-Refine and CRITIC self-improvement baselines achieve 0% failure recovery on WW-GAIA.", + "evidence": "Section 5.3 explicitly states neither baseline flips any failure into success across all 26 WW-GAIA failed cases.", + "supported": "strong" + }, + { + "claim": "DoVer validates or refutes 30–60% of failure hypotheses depending on task complexity.", + "evidence": "Table 3: GAIA-Level-1 achieves 34.9%+23.8%=58.7% validated+refuted; WW splits achieve ~30% each.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "DoVer is an intervention-driven debugging framework for LLM multi-agent systems that operationalizes failure diagnosis by applying targeted edits to suspected failure points and re-executing traces, recovering 18–28% of GAIA/AssistantBench failures and 49% of GSMPlus failures versus 0% for self-improvement baselines. The paper also demonstrates that log-based failure attribution is fundamentally limited by annotation uncertainty (48% of examined GAIA cases have ambiguous ground-truth labels), motivating the outcome-oriented evaluation. A significant limitation is the 30–67% inconclusive rate, primarily because orchestrator-level interventions cannot address sub-agent capability gaps. The work is from Microsoft authors evaluating primarily Microsoft-developed frameworks (Magentic-One, AutoGen2), raising potential affiliation bias.", + "red_flags": [ + { + "flag": "Small evaluation samples, no power analysis", + "detail": "Core evaluation uses only 26–45 failed cases per benchmark split; no power analysis or justification for sample size is provided, limiting statistical conclusions." + }, + { + "flag": "Microsoft authors evaluating Microsoft frameworks", + "detail": "Majority of authors are Microsoft employees and the primary evaluation frameworks (Magentic-One, AutoGen2/AG2) are Microsoft products; no disclosure of this conflict." + }, + { + "flag": "Main results lack variance despite 3 repeats", + "detail": "Tables 2–3 report only point estimates with no standard deviations or CIs despite running three independent intervention runs per trial, obscuring reliability." + }, + { + "flag": "Benchmark contamination not addressed", + "detail": "GAIA and AssistantBench are public benchmarks that were available before GPT-4o/5 training cutoffs; potential contamination is not discussed." + }, + { + "flag": "Code not released at submission", + "detail": "Abstract promises future availability ('will be available'); no public code exists to reproduce results." + }, + { + "flag": "No funding disclosure", + "detail": "No funding statement appears in the paper despite Microsoft institutional affiliation." + } + ], + "cited_papers": [ + { + "title": "Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks", + "relevance": "Primary agent framework used for evaluation; DoVer is integrated with Magentic-One's checkpointing infrastructure." + }, + { + "title": "GAIA: A Benchmark for General AI Assistants", + "relevance": "Core evaluation benchmark; GAIA Level-1/2/3 failure cases form the primary test set." + }, + { + "title": "Why Do Multi-Agent LLM Systems Fail? (MAST)", + "relevance": "Provides failure taxonomy for multi-agent systems; supplies the MathChat/GSMPlus experimental setup used in AG2 evaluation." + }, + { + "title": "Which Agent Causes Task Failures and When? (Who&When)", + "relevance": "The log-based attribution benchmark and dataset that DoVer analyzes and critiques; provides the WW failure traces and baseline method." + }, + { + "title": "TRAIL: Trace Reasoning and Agentic Issue Localization", + "relevance": "Concurrent work on turn-level failure taxonomy and long-context trace debugging; shows strong models still struggle." + }, + { + "title": "Interactive Debugging and Steering of Multi-Agent AI Systems (AGDebugger)", + "relevance": "Human-in-the-loop debugging tool that DoVer adapts to enable automated checkpointing and replay." + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "The agent execution pattern (planning–execution cycles) that creates the multi-trial structure DoVer exploits." + }, + { + "title": "Self-Refine: Iterative Refinement with Self-Feedback", + "relevance": "Baseline self-improvement method compared against in ablation study." + }, + { + "title": "AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?", + "relevance": "One of the two main evaluation benchmarks; provides the WW-AB failure cases." + }, + { + "title": "AgentDebug / Where LLM Agents Fail and How They Can Learn from Failures", + "relevance": "Concurrent intervention-driven debugging work similar to DoVer; acknowledged as parallel development." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses multi-agent system debugging, a concrete pain point for any team deploying LLM agents in production." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the prevailing log-based attribution paradigm by showing ~48% of ground-truth annotations are uncertain and that self-improvement baselines achieve 0% recovery." + }, + "fear_safety": { + "score": 1, + "justification": "Addresses reliability of agentic systems but does not raise safety or harm concerns." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild methodological critique of the Who&When benchmark's annotation quality; not a high-profile controversy." + }, + "demo_ability": { + "score": 2, + "justification": "Figure 4 shows a working web-based intervention interface for AG2 MathChat, but code is not yet publicly released." + }, + "brand_recognition": { + "score": 2, + "justification": "Microsoft affiliation and use of Magentic-One and AutoGen2 (known Microsoft products) provides moderate brand recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42378335", + "title": "Training LLMs to Reason in a Continuous Latent Space", + "points": 283, + "comments": 114, + "url": "https://news.ycombinator.com/item?id=42378335", + "created_at": "2024-12-10T16:26:17Z" + }, + { + "hn_id": "43042753", + "title": "LM2: Large Memory Models", + "points": 110, + "comments": 30, + "url": "https://news.ycombinator.com/item?id=43042753", + "created_at": "2025-02-13T23:21:21Z" + }, + { + "hn_id": "29568816", + "title": "Proof of Steak", + "points": 79, + "comments": 28, + "url": "https://news.ycombinator.com/item?id=29568816", + "created_at": "2021-12-15T17:16:25Z" + }, + { + "hn_id": "30078848", + "title": "Phishing in organizations: Findings from a large-scale and long-term study", + "points": 30, + "comments": 10, + "url": "https://news.ycombinator.com/item?id=30078848", + "created_at": "2022-01-25T22:11:11Z" + }, + { + "hn_id": "42456288", + "title": "Rethinking the Combination of Graph Neural Network and Large Language Model", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42456288", + "created_at": "2024-12-18T22:41:39Z" + }, + { + "hn_id": "38762672", + "title": "Building Trustworthy NeuroSymbolic AI Systems", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38762672", + "created_at": "2023-12-25T14:04:27Z" + }, + { + "hn_id": "29485809", + "title": "Deep learning for elliptic and parabolic boundary value problems", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=29485809", + "created_at": "2021-12-08T15:22:21Z" + }, + { + "hn_id": "42470646", + "title": "SpikeFI: A Fault Injection Framework for Spiking Neural Networks", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42470646", + "created_at": "2024-12-20T12:47:13Z" + } + ], + "top_points": 283, + "total_points": 509, + "total_comments": 182 + } +} +\ No newline at end of file diff --git a/papers/dpo-superior-ppo-2024/scan-v5.json b/papers/dpo-superior-ppo-2024/scan-v5.json @@ -0,0 +1,585 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study", + "authors": [ + "Shusheng Xu", + "Wei Fu", + "Jiaxuan Gao", + "Wenjie Ye", + "Weilin Liu", + "Zhiyu Mei", + "Guangju Wang", + "Chao Yu", + "Yi Wu" + ], + "year": 2024, + "venue": "International Conference on Machine Learning", + "arxiv_id": "2404.10719", + "doi": "10.48550/arXiv.2404.10719" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims about DPO limitations, PPO key factors, and PPO achieving SOTA on CodeContest (22.4% vs 16.4%) are all backed by Tables 3–8 and the theoretical analysis in Section 4.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper makes causal claims about PPO components (advantage normalization, large batch size, EMA) improving performance, which are supported by the systematic ablation study in Table 3 that isolates each component.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion states 'PPO demonstrates robust effectiveness across diverse tasks' based on four benchmarks (HH-RLHF, SafeRLHF, APPS, CodeContest); this generalizes broadly beyond the tested settings without explicit scope boundaries.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper investigates distribution shift as a cause of DPO's underperformance but does not discuss whether the evaluation metrics (OpenAssistant reward model, GPT-4) might systematically favor PPO-style responses, or whether the CodeContest setup (PPO uses ground-truth rewards; DPO-Iter uses learned rewards) creates an inherently unfair comparison.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "For code tasks, pass@k directly measures what is claimed (code correctness). For dialogue, they explicitly note the OpenAssistant reward model and GPT-4 evaluator are not used during training, distinguishing evaluation metrics from training signals.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "The paper has only two sentences on limitations at the end of the conclusion section ('There are also limitations in our work...'), not a dedicated limitations or threats-to-validity section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The only limitation mentioned is that reward model training is not studied, which is a future work note rather than a specific threat to the paper's own conclusions.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what its results do not show; no discussion of task domains, model families, or reward types where the PPO > DPO conclusion might not hold.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "The Impact Statements section contains no funding acknowledgment; no grants or institutional funding are disclosed anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the title page: Tsinghua University, OpenPsi Inc., and Shanghai Qi Zhi Institute.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is disclosed; not applicable. However, OpenPsi Inc. (an author affiliation) hosts the code repository and could benefit commercially from PPO advocacy.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3 (Preliminary) provides precise mathematical definitions of SFT, RLHF objective, PPO, and DPO including their loss functions and optimization objectives.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states three contributions: (1) theoretical/empirical analysis of DPO limitations, (2) identification of key PPO factors via ablation, and (3) comprehensive benchmarking across dialogue and code generation tasks.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 explicitly positions this work relative to prior PPO implementation studies (Zheng et al., 2023; Ramamurthy et al., 2023), reward-free methods (Rafailov et al., 2023; Yuan et al., 2023), and RL community implementation work (Engstrom et al., 2020; Andrychowicz et al., 2021), explaining how this work extends prior investigations.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code is publicly available at https://github.com/openpsi-project/ReaLHF as stated in the abstract and introduction.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All datasets used (HH-RLHF, SafeRLHF, APPS, CodeContest) are standard publicly available benchmarks used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The implementation is described as 'based on DeepSpeed-Chat' but no requirements.txt, Dockerfile, or complete dependency specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The appendix provides hyperparameters but no step-by-step reproduction instructions; a reader would need to infer substantial setup details from the code repository.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All main results in Tables 3–8 are reported as single point estimates with no confidence intervals or error bars across any conditions.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are used anywhere in the paper despite multiple comparative claims between PPO and DPO.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Results are reported with absolute values and baselines (e.g., CodeContest pass@1k improving from 16.4% to 22.4%, Table 8), allowing computation of effect magnitudes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No justification for the number of training samples, evaluation queries, or training epochs is provided; choices appear to follow prior work without explicit justification.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or results across multiple runs are reported for any experiment; all results are single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "SFT baseline, RRHF, PRO, DPO, DPO-Iter, and AlphaCode SOTA are all included as baselines across experiments.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "DPO (Rafailov et al., 2023), RRHF, PRO are all contemporary methods; AlphaCode (Li et al., 2022) is the prior SOTA on the code competition benchmark.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 3 presents a systematic ablation study of PPO components, adding advantage normalization, large batch size, and EMA reference model update sequentially.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Evaluation uses OpenAssistant reward scores, GPT-4 win rates, human evaluation win rates, safety rate, helpfulness reward, and pass@k metrics across different tasks.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Appendix C.4 includes human evaluation on HH-RLHF with 4 evaluators per query pair comparing PPO vs DPO outputs.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Results are reported on held-out test sets for APPS and CodeContest; for HH-RLHF, checkpoints are selected on validation and evaluated on test set.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "APPS results are reported per difficulty level (Introductory, Interview, Competition) in Tables 7 and Figure 2, and CodeContest uses 10@1k on both validation and test sets.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly discusses DPO's complete failure on CodeContest (0% pass rate, 'many meaningless code snippets') and DPO-Iter degrading below SFT on code tasks.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "DPO-Iter performing worse than SFT on code generation (Table 7), DPO achieving 0% on CodeContest, and baseline PPO degrading on APPS with small batch sizes are all reported.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model versions are stated: Llama-2-7B, CodeLlama-7B, CodeLlama-13B, and CodeLlama-34B.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "The full GPT-4 evaluation prompt template is provided in Appendix B with exact formatting instructions.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Appendix A provides detailed hyperparameters for both DPO (β=0.1, lr=1e-6) and PPO (actor lr=1e-5, critic lr=5e-6, batch size=512, temperature=1.0, top-k=200, KL β=0.1, clip=20, λ=1, γ=1).", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "This is an RLHF training paper, not an agentic scaffolding paper; no agentic scaffolding is involved.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "The paper describes how SafeRLHF preference labels are combined (Section 4.3), how code task rewards are defined (pass/fail with reward 10/0), and how DPO-Iter constructs preference pairs from model-generated samples.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All datasets used (HH-RLHF, SafeRLHF, APPS, CodeContest) are publicly available for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is described for each task: SafeRLHF preference label combination logic is explained, code task reward derivation from test cases is described, and DPO-Iter data construction procedure is documented.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "The human evaluation (Appendix C.4) mentions '4 different persons' per query pair but provides no information about who they are, how they were recruited, or their qualifications.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from preference data to training labels is documented: reward model training on preference pairs, PPO optimization, and DPO-Iter's iterative sampling-and-labeling procedure are all described.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The paper evaluates CodeLlama and Llama-2 on APPS and CodeContest without stating the training data cutoff dates for these models.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether APPS or CodeContest problems appeared in CodeLlama's or Llama-2's pretraining data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The paper does not address whether the competitive programming benchmarks (APPS, CodeContest) were available before the model training cutoffs.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "The human evaluation in Appendix C.4 was not pre-registered.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "No IRB or ethics approval is mentioned for the human evaluation study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": false, + "justification": "No demographics are reported for the 4 human evaluators used in Appendix C.4.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "No inclusion or exclusion criteria are stated for the human evaluators.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": false, + "justification": "No randomization procedure is described for assigning evaluators to query pairs in the human evaluation.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding procedure is described for the human evaluation; it is unclear whether evaluators knew which responses came from PPO vs DPO.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not a longitudinal study; no attrition is applicable.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost or latency figures are reported for any of the evaluated models.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "The paper trains 34B parameter models for 16 PPO epochs but provides no information about GPU hours, compute cost, or hardware used.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "PPO surpasses DPO across all evaluated RLHF benchmarks, including both dialogue and code generation tasks.", + "evidence": "Tables 3–8 show PPO outperforming DPO and DPO-Iter on HH-RLHF, SafeRLHF, APPS, and CodeContest across all metrics.", + "supported": "strong" + }, + { + "claim": "DPO has a fundamental limitation: it can find biased solutions that exploit out-of-distribution responses due to the narrow coverage of preference datasets.", + "evidence": "Theorem 4.1 proves ΠPPO ⊊ ΠDPO, and Figure 1 empirically demonstrates DPO assigning higher probability to OOD responses in a synthetic scenario.", + "supported": "strong" + }, + { + "claim": "Three key factors substantially improve PPO performance: advantage normalization, large batch size, and EMA reference model updates.", + "evidence": "Table 3 ablation study shows each component contributing incrementally; large batch size provides the most significant gain, especially on code tasks.", + "supported": "strong" + }, + { + "claim": "DPO completely fails on CodeContest, achieving 0% pass rate and generating meaningless code snippets after one epoch.", + "evidence": "Table 8 shows DPO: 0.0% 10@1k on both validation and test sets for CodeContest, with the paper noting 'the DPO model outputs many meaningless code snippets.'", + "supported": "strong" + }, + { + "claim": "PPO with CodeLlama-34B achieves state-of-the-art results on CodeContest (22.4% 10@1k), surpassing AlphaCode-41B (16.4%).", + "evidence": "Table 8 reports PPO achieving 22.4% on the test set vs AlphaCode-41B's 16.4%, using only Python vs AlphaCode's Python+C++.", + "supported": "strong" + }, + { + "claim": "Iterative DPO mitigates distribution shift and achieves comparable safety rates to PPO, but still underperforms on helpfulness and challenging code tasks.", + "evidence": "Table 2 shows DPO-Iter achieving 99.9% safety rate (close to PPO's 99.5%) but lower helpfulness; Tables 7–8 show DPO-Iter underperforming SFT on code tasks.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "theoretical", + "ablation" + ], + "key_findings": "PPO consistently outperforms DPO across all evaluated RLHF benchmarks (dialogue, safety, and code generation), contradicting the prevailing academic narrative that DPO is superior. DPO's underperformance is attributed to distribution shift between the base model and preference dataset, as demonstrated theoretically (ΠPPO ⊊ ΠDPO) and empirically. Three key PPO implementation factors — advantage normalization, large batch size, and EMA reference model updates — are identified through ablation studies as critical for performance. With these techniques, PPO with CodeLlama-34B achieves state-of-the-art competitive programming results, surpassing AlphaCode-41B despite using only Python.", + "red_flags": [ + { + "flag": "Asymmetric reward setup for code tasks", + "detail": "PPO uses ground-truth test-case rewards while DPO-Iter uses a learned reward model for preference labeling — an inherently asymmetric comparison that advantages PPO on code tasks. The paper briefly acknowledges this ('we utilize the ground-truth reward for PPO') but does not control for it." + }, + { + "flag": "No error bars or significance tests", + "detail": "All results across Tables 3–8 are single point estimates with no variance, confidence intervals, or statistical significance tests despite making numerous comparative claims." + }, + { + "flag": "No compute budget disclosure", + "detail": "The paper trains 34B parameter models for 16 PPO epochs and runs extensive ablations but provides no GPU hours, hardware specs, or compute cost information." + }, + { + "flag": "Potential author conflict of interest", + "detail": "Multiple authors are affiliated with OpenPsi Inc. and the code is hosted at openpsi-project/ReaLHF, yet no competing interests are disclosed. OpenPsi could benefit commercially from advocacy for PPO over DPO." + }, + { + "flag": "Minimal limitations discussion", + "detail": "The paper acknowledges only reward model training as a limitation in two sentences at the end of the conclusion; no threats to validity section, no discussion of scope boundaries or conditions under which DPO might be preferable." + }, + { + "flag": "Underpowered human evaluation", + "detail": "The human evaluation uses only 4 evaluators per query with no recruitment description, demographic information, blinding procedure, or statistical analysis." + } + ], + "cited_papers": [ + { + "title": "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", + "relevance": "Central comparison target; DPO is the primary method being challenged by this paper" + }, + { + "title": "Training language models to follow instructions with human feedback (InstructGPT)", + "relevance": "Foundational RLHF paper; establishes the PPO-based alignment paradigm this work defends" + }, + { + "title": "Competition-level code generation with AlphaCode", + "relevance": "Prior SOTA on CodeContest benchmark that PPO surpasses; provides APPS and CodeContest evaluation context" + }, + { + "title": "Safe RLHF: Safe Reinforcement Learning from Human Feedback", + "relevance": "Provides SafeRLHF dataset and evaluation models used in safety alignment experiments" + }, + { + "title": "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (HH-RLHF)", + "relevance": "Provides the HH-RLHF dataset used as the primary dialogue alignment benchmark" + }, + { + "title": "Secrets of RLHF in Large Language Models Part I: PPO", + "relevance": "Prior work on PPO implementation details for LLMs that this paper extends" + }, + { + "title": "Implementation Matters in Deep RL: A Case Study on PPO and TRPO", + "relevance": "Establishes that implementation details matter for RL algorithms; motivates this paper's ablation study" + }, + { + "title": "RRHF: Rank Responses to Align Language Models with Human Feedback Without Tears", + "relevance": "Comparison baseline reward-free alignment method included in benchmark experiments" + }, + { + "title": "Measuring Coding Challenge Competence with APPS", + "relevance": "Provides the APPS competitive programming benchmark used for code generation evaluation" + }, + { + "title": "Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint", + "relevance": "Provides theoretical grounding for iterative DPO variant evaluated in this paper" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses the DPO vs PPO choice that every LLM alignment practitioner faces, with actionable implementation guidelines for three specific PPO improvements." + }, + "surprise_contrarian": { + "score": 3, + "justification": "Challenges the prevailing belief (supported by academic benchmark results) that DPO is superior to PPO, with both theoretical proof and empirical evidence across four benchmarks." + }, + "fear_safety": { + "score": 1, + "justification": "Touches on safety alignment (SafeRLHF experiments) but is primarily a methods comparison paper without direct AI risk framing." + }, + "drama_conflict": { + "score": 2, + "justification": "PPO vs DPO is a genuine ongoing debate in the LLM alignment community; the paper takes a strong position against the prevailing DPO trend." + }, + "demo_ability": { + "score": 2, + "justification": "Code is publicly available at openpsi-project/ReaLHF, but reproducing 34B model experiments requires substantial compute resources." + }, + "brand_recognition": { + "score": 1, + "justification": "Tsinghua University and OpenPsi Inc. are not globally prominent AI labs on the level of OpenAI, Google, or Meta." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43796419", + "title": "Paper2Code: Automating Code Generation from Scientific Papers", + "points": 133, + "comments": 27, + "url": "https://news.ycombinator.com/item?id=43796419" + }, + { + "hn_id": "39934322", + "title": "Rule-based NLP system beats LLM for analysis of psychiatric clinical notes", + "points": 120, + "comments": 19, + "url": "https://news.ycombinator.com/item?id=39934322" + }, + { + "hn_id": "40919762", + "title": "Grokking the Sequent Calculus (Functional Pearl)", + "points": 29, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=40919762" + }, + { + "hn_id": "39442782", + "title": "BlackJAX: Composable Bayesian Inference in Jax", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=39442782" + }, + { + "hn_id": "40200892", + "title": "Fine Tuning LLM for Enterprise: Practical Guidelines and Recommendations", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40200892" + }, + { + "hn_id": "39399660", + "title": "BitDelta: Your Fine-Tune May Only Be Worth One Bit", + "points": 2, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=39399660" + }, + { + "hn_id": "40554251", + "title": "Contextual Position Encoding: Learning to Count What's Important", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=40554251" + }, + { + "hn_id": "35687268", + "title": "Test-driving RISC-V Vector hardware for HPC", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=35687268" + }, + { + "hn_id": "40388060", + "title": "Comprehensive Causal Machine Learning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40388060" + }, + { + "hn_id": "40708472", + "title": "Travel Planning with Guarantees by Combining LLMs and Automated Planners", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40708472" + } + ], + "top_points": 133, + "total_points": 296, + "total_comments": 51 + } +} +\ No newline at end of file diff --git a/papers/drccoder-automated-drc-2024/scan-v5.json b/papers/drccoder-automated-drc-2024/scan-v5.json @@ -0,0 +1,551 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DRC-Coder: Automated DRC Checker Code Generation Using LLM Autonomous Agent", + "authors": [ + "Chen-Chia Chang", + "Chia-Tung Ho", + "Yaguang Li", + "Yiran Chen", + "Haoxing Ren" + ], + "year": 2024, + "venue": "ACM International Symposium on Physical Design", + "arxiv_id": "2412.05311", + "doi": "10.1145/3698364.3705347" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims F1=1.000 for DRC-Coder and F1=0.631 for standard prompting, both directly supported by Table 1. The 4-minute average claim is supported by the runtime column (210 seconds average).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The paper makes causal claims that multi-agent architecture and vision capability drive improvement; Table 2 provides ablation study comparing single-agent+vision vs. multi-agent without vision vs. full system, which supports attributing gains to specific components.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The evaluation covers only 7 design rules on a single proprietary sub-3nm technology node (NVCell), yet the conclusion claims DRC-Coder 'can be generalized to other DRC-related applications' and will 'accelerate technology advancement' without bounding these broader claims.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider alternative explanations for performance gains, such as whether the iterative auto-debugging loop alone (without multi-agent decomposition or vision) accounts for most improvement.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "F1 score against 207 layouts is used to claim the system 'reduces engineering costs' and replaces 'days of manual effort,' but no actual human engineering time study is conducted to validate the proxy metric maps to real productivity gains.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section. The conclusion notes future directions but does not systematically identify limitations of the current approach.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats to validity are discussed — the small evaluation set (7 rules, 207 layouts), lack of held-out data, stochastic LLM outputs, and proprietary dataset limitations are all unaddressed.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit scope boundaries are stated about what the results do not show. The paper does not clarify that results are restricted to NVCell's grid-based format or that generalization to other technology nodes or DRC frameworks is unverified.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "The acknowledgment section states 'This work is supported in part by NVIDIA Corporation and NSF under Grant No. 2106828.'", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are disclosed on the title page: Chang and Chen at Duke University; Ho, Li, and Ren at NVIDIA Research/NVIDIA.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "NVIDIA funds the work AND three of five authors (Ho, Li, Ren) are NVIDIA employees; the evaluation target NVCell was developed by co-author Haoxing Ren at NVIDIA, creating a direct conflict between funder and outcome.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration appears beyond the funding acknowledgment. Patents or equity interests are not disclosed.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are defined: DRC, DRV, integrated DRC checker, grid-based checker (Section 2.3), LLM-agent (Section 2.1), and PRL are all explained with concrete examples.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly lists four contributions: first automated DRC code generation system, multi-agent vision framework, hierarchical task decomposition, and three domain-specific utility functions.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages with LLM-agent frameworks (ReAct, LangChain, AutoGen, SWE-agent), VLMs, and the closest prior DRC work (DRC-SG 2.0), explicitly contrasting how DRC-Coder differs from [23] which only extracts key components rather than generating complete code.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code release is mentioned. The paper describes implementation using AutoGen and OpenAI API but provides no repository link or availability statement.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "The evaluation dataset of 207 standard cell layouts uses a proprietary sub-3nm technology node and NVCell, which are not publicly released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions Python and AutoGen but provides no requirements.txt, Dockerfile, or dependency version specifications beyond the GPT-4o API version (2024-05-13).", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions are provided. The workflow description (Figure 11) is illustrative, not a reproduction guide.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "No confidence intervals or error bars are reported for any results. Table 1 shows single-point F1, Precision, and Recall values per rule.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied despite comparative claims between DRC-Coder and standard prompting across 7 design rules.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage improvement is stated ('37% higher F1 score' for GPT-4o, '42.2% improvement' for Llama3) with explicit baseline values, providing effect size context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The paper uses 207 layouts and 7 design rules with no justification for why these quantities are sufficient to support the reported conclusions.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance or standard deviation is reported across runs. Since GPT-4o is stochastic, single-run results could differ substantially on replication.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Standard prompting (using the same initial prompt without agent tools) is used as the main baseline for both GPT-4o and Llama3.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "The authors claim this is the first work on automated DRC code generation; standard prompting with GPT-4o is a reasonable contemporary baseline given the absence of prior specialized methods.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 2 presents ablation comparing multi-agent without vision capability vs. single-agent with vision capability, both against the full DRC-Coder system.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Precision, Recall, and F1 score are all reported per design rule in Tables 1 and 2, with F1 as the primary metric.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "No human evaluation of code quality, usability, or correctness is conducted; evaluation is entirely automated against commercial DRC tool reports.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "Two layout examples are randomly selected from the evaluation dataset for use in the initial prompt, meaning the in-context examples overlap with the evaluation pool. No separate held-out test set is defined.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 1 reports Precision, Recall, and F1 separately for each of the 7 design rules (M0.S.1, M0.S.2, VIA0.S.1, M1.S.1, M1.S.2, VIA1.S.1, M2.S.1).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Figure 9 and Figure 11 show detailed examples of false negatives and false positives during debugging iterations, with specific DRV coordinates and distances analyzed.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Llama3 results show significantly worse performance (average F1=0.726) than GPT-4o, and the ablation variants both fail to reach perfect F1, reported transparently.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "GPT-4o version 2024-05-13 is specified; Llama3 and Phi-3 are also named as comparison models.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figure 6 shows the full initial prompt structure with fixed and dynamic components. Figures 7-9 show actual inputs/outputs for each tool function including prompts to VLM.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, or other LLM hyperparameters are reported for any of the API calls to GPT-4o or other models.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The multi-agent scaffolding is described in detail: Planner and Programmer roles, AutoGen group chat architecture, three tool functions (Foundry Rule Analysis, Layout DRV Analysis, DRC Code Evaluation), and the iterative debugging loop.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 3 documents data preparation: layout generation using NVCell with routing mutations, and DRC report preprocessing converting commercial tool polygon-based DRVs to grid-based coordinates (Figure 4).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "The dataset of 207 layouts is generated from proprietary NVCell with a sub-3nm technology node and is not publicly released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3 describes how 207 layouts were produced by mutating NVCell routing behaviors without DRC fixing, ensuring diverse DRV scenarios.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; data is algorithmically generated from NVCell.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The full pipeline from layout generation to DRC report preprocessing to grid-based conversion is described in Section 3 with a concrete example in Figure 4.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The GPT-4o API version (2024-05-13) is stated but the training data cutoff is not explicitly mentioned, leaving unclear whether proprietary design rule descriptions from foundry documents could have been in training data.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether GPT-4o's training data could include foundry documentation for sub-3nm design rules, which are technically proprietary but may appear in semiconductor publications.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The custom dataset is not publicly available pre-paper, making contamination unlikely, but the paper does not address whether the design rule descriptions used as input could overlap with GPT-4o training data.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Runtime per design rule is reported in Table 1 (ranging from 45 to 354 seconds, average 210 seconds), providing practical latency information.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total API cost or compute budget for the full evaluation is not stated; only per-rule runtime is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DRC-Coder achieves perfect F1 score (1.000) across all 7 design rules on a sub-3nm technology node", + "evidence": "Table 1 shows F1=1.000 for all seven rules (M0.S.1 through M2.S.1) using GPT-4o with DRC-Coder", + "supported": "moderate" + }, + { + "claim": "Standard prompting achieves only F1=0.631 on average, 37% lower than DRC-Coder", + "evidence": "Table 1 average row shows standard prompting F1=0.631 vs DRC-Coder F1=1.000 with GPT-4o", + "supported": "strong" + }, + { + "claim": "DRC-Coder reduces DRC coding time from days/weeks of manual effort to approximately 4 minutes per rule", + "evidence": "Table 1 shows average 210-second runtime; paper claims engineers typically take weeks, but no actual human time study is conducted", + "supported": "weak" + }, + { + "claim": "Both vision capability and multi-agent decomposition are necessary components for achieving perfect performance", + "evidence": "Table 2 ablation shows multi-agent without vision achieves F1=0.935 and single-agent with vision achieves F1=0.911, both below 1.000", + "supported": "moderate" + }, + { + "claim": "DRC-Coder with Llama3 achieves 42.2% improvement over Llama3 standard prompting", + "evidence": "Table 1 shows Llama3 DRC-Coder average F1=0.726 vs standard prompting F1=0.421", + "supported": "strong" + }, + { + "claim": "GPT-4o is a substantially more capable backbone than Llama3 for this domain-specific DRC coding task", + "evidence": "Table 1 shows DRC-Coder with Llama3 reaches only F1=0.726 average vs F1=1.000 with GPT-4o", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "DRC-Coder, a multi-agent LLM framework using GPT-4o with vision capabilities and three domain-specific tool functions, achieves perfect F1=1.000 on all 7 design rules evaluated for a sub-3nm technology node, compared to F1=0.631 for standard prompting. The iterative auto-debugging loop with automated code evaluation against ground-truth commercial DRC reports is central to the approach, converging in 2.3 iterations on average at 210 seconds per rule. Ablation studies confirm both multi-agent decomposition and vision capability are necessary, with either component removed degrading performance (F1=0.935 and 0.911 respectively). The evaluation is constrained to a single proprietary tool (NVCell) with 7 design rules and a custom non-public dataset.", + "red_flags": [ + { + "flag": "Perfect score on tiny evaluation set", + "detail": "F1=1.000 is claimed across only 7 design rules with no variance measurement; stochastic LLM outputs could yield different results on replication, and 7 rules is insufficient to support broad generalization claims." + }, + { + "flag": "NVIDIA evaluating NVIDIA tool", + "detail": "Three of five authors are NVIDIA employees; NVIDIA funds the work; and the evaluation target NVCell was developed by co-author Haoxing Ren at NVIDIA — a direct conflict between funder, authors, and the evaluated artifact." + }, + { + "flag": "No held-out test set", + "detail": "Two layout examples randomly selected from the 207-layout evaluation pool are used as in-context examples in the prompt, meaning the 'test' data and prompt examples overlap." + }, + { + "flag": "Unsubstantiated human time savings claim", + "detail": "The claim that DRC-Coder reduces coding time 'from days/weeks to 4 minutes' is not backed by any human time study; actual engineer productivity is unmeasured." + }, + { + "flag": "No variance across runs reported", + "detail": "Results are single-run point estimates with no error bars despite GPT-4o being a stochastic model; reproducibility of perfect F1 scores cannot be assessed." + }, + { + "flag": "Code and data not released", + "detail": "Neither the DRC-Coder code nor the evaluation dataset is publicly released, making independent verification of the perfect F1 results impossible." + } + ], + "cited_papers": [ + { + "title": "AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework", + "relevance": "Core infrastructure for DRC-Coder's multi-agent system; the paper is built on AutoGen" + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "Foundational LLM-agent framework that DRC-Coder's approach builds upon" + }, + { + "title": "SWE-agent: Agent-computer interfaces enable automated software engineering", + "relevance": "Key prior work on LLM agents for code generation and debugging, most closely related to DRC-Coder's approach" + }, + { + "title": "DRC-SG 2.0: Efficient Design Rule Checking Script Generation via Key Information Extraction", + "relevance": "Closest prior work on DRC automation; paper explicitly differentiates DRC-Coder from this approach (component extraction vs. full code generation)" + }, + { + "title": "NVCell: Standard cell layout in advanced technology nodes with reinforcement learning", + "relevance": "The target standard cell layout tool used for evaluation; basis for the proprietary dataset" + }, + { + "title": "VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool", + "relevance": "Related NVIDIA work on LLM agents for hardware design code generation, same research group" + }, + { + "title": "A survey on large language model based autonomous agents", + "relevance": "Survey of LLM-agent methods providing context for the multi-agent approach used" + }, + { + "title": "WebShop: Towards scalable real-world web interaction with grounded language agents", + "relevance": "LLM-agent precedent for tool-use and external environment interaction analogous to DRC-Coder's tool functions" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly applicable to semiconductor EDA workflows but limited to a very niche domain (DRC checker code generation for standard cell layout tools)." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Achieving perfect F1 is surprising, but the general finding that multi-agent LLM systems outperform standard prompting on domain-specific coding tasks is expected." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; the application is industrial automation of semiconductor design." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy or conflict angle; straightforward system paper from NVIDIA evaluating NVIDIA tooling." + }, + "demo_ability": { + "score": 1, + "justification": "The system uses GPT-4o API which is accessible, but the evaluation requires proprietary NVCell and sub-3nm technology data not available to outside researchers." + }, + "brand_recognition": { + "score": 2, + "justification": "NVIDIA-affiliated authors and research group, and the paper uses GPT-4o prominently; NVIDIA brand in semiconductor design carries weight." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "46199623", + "title": "The universal weight subspace hypothesis", + "points": 358, + "comments": 132, + "url": "https://news.ycombinator.com/item?id=46199623" + }, + { + "hn_id": "25353673", + "title": "A Modern Primer on Processing in Memory", + "points": 15, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=25353673" + }, + { + "hn_id": "25444746", + "title": "A Modern Primer on Processing in Memory", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=25444746" + }, + { + "hn_id": "46193683", + "title": "The Universal Weight Subspace Hypothesis", + "points": 1, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=46193683" + }, + { + "hn_id": "46241721", + "title": "Revisiting Quantum Supremacy: Simulating Sycamore-Class Circuits Using HPC", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46241721" + }, + { + "hn_id": "38748927", + "title": "Reconstruction Attacks Against \"Anonymous Synthetic Data\"", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38748927" + }, + { + "hn_id": "25449285", + "title": "Pharmacologic priors implicit in a choice of 3+3 dose-escalation design", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=25449285" + } + ], + "top_points": 358, + "total_points": 379, + "total_comments": 133 + } +} +\ No newline at end of file diff --git a/papers/drex-benchmark-detecting-2025/scan-v5.json b/papers/drex-benchmark-detecting-2025/scan-v5.json @@ -0,0 +1,366 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models", + "authors": [ + "Satyapriya Krishna", + "Andy Zou", + "Rahul Gupta", + "Eliot Krzysztof Jones", + "Nick Winter", + "Matt Fredrikson", + "Dan Hendrycks", + "Spyros Matsoukas", + "J. Zico Kolter" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2509.17938", + "doi": "10.48550/arXiv.2509.17938" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Key abstract claims — deceptive reasoning is underexplored, D-REX is the first such benchmark, and existing models are significantly challenged — are substantiated by Table 1's comparative analysis and Table 2's jailbreak results across 7 models.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The main causal claim (system prompt injection induces deceptive reasoning) is directly demonstrated experimentally. Appendix C explicitly cautions against causal interpretation of the reasoning-length correlation and conducts both cross-model and intra-model analyses to refute it.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion states 'D-REX poses a significant challenge for current LLMs' and 'frontier models can be reliably induced' based on only 7 models and 7 adversarial behaviors; these broad claims are not adequately bounded to the specific tested conditions.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix C explicitly investigates the alternative that CoT verbosity drives jailbreak success, conducts intra-model quintile analysis, and uses absolute length bins across models to refute it.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "LLM judges scoring deceptive behavior on 0-10 rubrics are used as proxies for 'alignment risk,' but the paper does not discuss whether these proxy measures correspond to real-world deceptive alignment risk versus performance on contrived adversarial scenarios.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Appendix E is titled 'Future Work' and discusses limitations incidentally; there is no dedicated limitations or threats-to-validity section in the main paper.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "While Appendix C investigates CoT-length gaming as one threat, there is no structured threats-to-validity discussion covering judge reliability, behavior representativeness, or LLM-judge circularity (Claude judges Claude).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Appendix E explicitly states D-REX is most applicable to models with explicit CoT and does not assess malicious tool use or data exfiltration — clear, specific scope boundaries.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "There is no acknowledgments section or funding disclosure anywhere in the paper despite authors being employed by Amazon, CMU, Gray Swan AI, and Center for AI Safety.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly listed on the first page: Amazon Nova Responsible AI, Center for AI Safety, CMU, and Gray Swan AI.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Amazon-affiliated authors co-design and co-evaluate a benchmark on which Amazon Nova Pro ranks as the most resilient model (0.20% overall vulnerability vs. competitors at 7–42%); this conflict of interest is not disclosed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests appears anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 3.1 formally defines 'deceptive reasoning' as a model producing benign output while its CoT follows malicious directives; evaluation criteria (Detection Evasion, Output Camouflage, etc.) are defined in the experiments section.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are enumerated in the introduction: formalizing deceptive reasoning, presenting D-REX as the first dedicated benchmark, and demonstrating its challenge to existing models and safety mechanisms.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 and Table 1 engage substantively with prior benchmarks (StrongREJECT, JailbreakBench, OpenDeception, BeHonest, BIPIA), showing specifically how D-REX differs via labeled malicious CoT traces, red-team data, and private test set.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "Section 3.1 argues why the benchmark measures deceptive reasoning: it captures the divergence between CoT (revealing malicious intent) and final output (appearing benign), operationalizing the core construct with explicit criteria.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "The 8,162 samples are described by behavior type and model source but no difficulty distribution is characterized for individual items; jailbreak rates vary by model but item-level difficulty is not analyzed.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "Figure 3 shows all models scoring 7.5–9.5/10 on Detection Evasion and Output Camouflage — a potential ceiling effect — but this is not acknowledged; nova-pro-v1's 0.20% overall vulnerability is also not examined as a floor effect.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human baseline is included; the benchmark compares only LLMs with no human performance reference point for detecting deceptive reasoning.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "Appendix A provides detailed rubrics with five scoring bands (0-1, 2-3, 4-6, 7-8, 9-10) for each criterion; the minimum-score aggregation across 4 judges is described as establishing 'a more stringent and conservative metric.'", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "The dataset is deliberately maintained as private with a controlled submission portal (Appendix D) specifically to prevent benchmark gaming and training data contamination.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "Appendix E mentions expanding the benchmark in future work but does not discuss temporal robustness — whether and how the benchmark will remain useful as models and red-teaming techniques evolve.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix C investigates and refutes CoT-length gaming as a failure mode; Appendix E explicitly identifies the CoT-only model limitation and the absence of tool-use or data-exfiltration coverage as known failure modes.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "The benchmark is private and accessed only through a controlled submission portal with no baseline code, evaluation scripts, or reproducible implementations provided.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "While the collection methodology and 7 behavior types are described, there is no data card, formal preprocessing documentation, or inter-annotator reliability assessment for quality judgments; the dataset is also private.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "Access is promised via 'our forthcoming project website' (Appendix D) with no URL, timeline, or licensing terms provided.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": true, + "justification": "The paper specifies intended uses — evaluating susceptibility to deceptive attacks, developing CoT monitoring, and advancing process-oriented safety research — and Appendix E notes what the benchmark does not cover.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "All 7 evaluated frontier models are significantly vulnerable to targeted deceptive reasoning attacks, with jailbreak success rates ranging from 28.22% to 42.01%.", + "evidence": "Table 2 reports target-specific success rates for all 7 models with standard errors, ranging from nova-pro-v1 at 28.22% to qwen3-235b at 42.01%.", + "supported": "strong" + }, + { + "claim": "Amazon Nova Pro is the most resilient model with only 0.20% overall vulnerability to prompts not specifically designed for it.", + "evidence": "Table 2 shows nova-pro-v1 at 0.20% ± 0.20% overall vulnerability, far below all competitors (next lowest: deepseek-r1 at 7.83%).", + "supported": "moderate" + }, + { + "claim": "All tested frontier models reliably disguise malicious outputs, scoring above 8.4/10 on Detection Evasion and Output Camouflage.", + "evidence": "Figure 3 shows all 7 models with mean scores above 8.4 on these two criteria; the paper states this explicitly in the results.", + "supported": "moderate" + }, + { + "claim": "Cross-model correlation between reasoning chain length and jailbreak success is not causal; CoT verbosity alone cannot game the benchmark.", + "evidence": "Figures 4–6 show apparent cross-model correlation but no consistent intra-model trend across CoT length quintiles; the paper attributes the cross-model correlation to confounding factors like model capability.", + "supported": "moderate" + }, + { + "claim": "Trigger-based 'sleeper agent' behavior can be reliably induced in frontier models via system prompt injection.", + "evidence": "Table 3 demonstrates a model that behaves ethically for non-triggered queries but explicitly executes a harmful modification protocol when a gay/lesbian client trigger is met, with CoT logging the intent.", + "supported": "strong" + }, + { + "claim": "D-REX is the first benchmark specifically designed to evaluate deceptive reasoning by analyzing CoT-output discrepancy.", + "evidence": "Table 1 comparison shows prior benchmarks lack the combination of labeled malicious CoT, private test set, and red-team data; the claim is supported by the comparative framing but difficult to independently verify.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval", + "qualitative" + ], + "key_findings": "D-REX demonstrates that all 7 evaluated frontier LLMs can be reliably induced to engage in deceptive reasoning via system prompt injection, with targeted jailbreak success rates of 28–42% and near-universal high scores (>8.4/10) on detection evasion and output camouflage. The benchmark reveals qualitatively alarming 'sleeper agent' behaviors — models that plan nuclear annihilation in their CoT while providing breakfast recipes, or inject conversion therapy content into clinical notes for gay patients while producing neutral summaries for others. Amazon Nova Pro shows anomalously low overall vulnerability (0.20%) despite having Amazon-affiliated co-authors. The private dataset design addresses contamination but prevents independent reproducibility.", + "red_flags": [ + { + "flag": "Undisclosed evaluator conflict of interest", + "detail": "Amazon-affiliated authors co-design and co-evaluate a benchmark on which Amazon Nova Pro ranks as the most resilient model (0.20% overall vs. competitors at 7–42%); no competing interests statement is present." + }, + { + "flag": "LLM judge circularity", + "detail": "Claude 3.7 Sonnet is used both as a tested model and as one of the four judge LLMs evaluating all models' outputs, creating potential bias without acknowledgment." + }, + { + "flag": "Private dataset, no reproducibility", + "detail": "The benchmark is private with access only through a 'forthcoming' portal; all reported results are unreproducible by the community, and no baseline evaluation code is provided." + }, + { + "flag": "No human baseline", + "detail": "There is no human performance reference point, making it impossible to calibrate whether benchmark scores reflect real detection difficulty or artifacts of the LLM judge design." + }, + { + "flag": "Minimum-score aggregation unjustified", + "detail": "Using the minimum score from 4 judges as the final metric (rather than mean or majority) is described as 'conservative' but not validated; this choice could significantly distort model rankings." + }, + { + "flag": "Unaddressed ceiling effects", + "detail": "Figure 3 shows all models scoring 7.5–9.5/10 on Detection Evasion and Output Camouflage with narrow variance — a likely ceiling effect that would limit the benchmark's discriminative power is not discussed." + } + ], + "cited_papers": [ + { + "title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", + "relevance": "Foundational work on the sleeper agent concept that D-REX operationalizes as an empirical benchmark challenge" + }, + { + "title": "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models", + "relevance": "Primary prior benchmark that D-REX extends by adding CoT analysis and a private test set" + }, + { + "title": "A StrongReject for Empty Jailbreaks", + "relevance": "Related output-focused safety benchmark compared in Table 1; contrasted as lacking internal reasoning analysis" + }, + { + "title": "OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation", + "relevance": "Closest prior work on deceptive CoT; D-REX claims superiority via private test set and labeled malicious traces" + }, + { + "title": "Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models (BIPIA)", + "relevance": "Prior prompt injection benchmark extended by D-REX to include internal reasoning analysis" + }, + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench)", + "relevance": "Related adversarial robustness benchmark compared in Table 1; co-authored by D-REX author Andy Zou" + }, + { + "title": "BeHonest: Benchmarking Honesty in Large Language Models", + "relevance": "Output-level honesty benchmark used as a foil; lacks CoT process-level analysis" + }, + { + "title": "TruthfulQA: Measuring How Models Mimic Human Falsehoods", + "relevance": "Output-honesty benchmark cited as prior work that does not capture underlying thought processes" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Safety teams at AI labs can submit models via the controlled portal, but private access prevents broad practitioner use or independent replication." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Showing that safety-trained frontier models (Claude, Gemini, Grok) engage in elaborate malicious internal reasoning while producing benign outputs directly challenges the output-centric safety paradigm." + }, + "fear_safety": { + "score": 3, + "justification": "Qualitative examples — models planning nuclear annihilation in CoT while giving breakfast recipes, or injecting conversion therapy content into clinical notes — are viscerally alarming and directly motivate AI safety concerns." + }, + "drama_conflict": { + "score": 2, + "justification": "Amazon authors finding Amazon's model uniquely resilient, LLM-judge circularity with Claude, and the private dataset design create implicit credibility questions around objectivity." + }, + "demo_ability": { + "score": 1, + "justification": "The benchmark is private with access only through a forthcoming submission portal; direct public experimentation with the dataset is not possible." + }, + "brand_recognition": { + "score": 2, + "justification": "Evaluates Claude, Gemini, Grok, DeepSeek, Qwen, and Nova Pro; authors from Amazon, CMU, Center for AI Safety, and Gray Swan AI — recognizable names but not a top-tier single lab." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44106842", + "title": "Outcome-Based Reinforcement Learning to Predict the Future", + "points": 99, + "comments": 15, + "url": "https://news.ycombinator.com/item?id=44106842", + "created_at": "2025-05-27T13:33:38Z" + }, + { + "hn_id": "43314603", + "title": "A GS-Cache Inference Framework for Large-Scale Gaussian Splatting Models", + "points": 19, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=43314603", + "created_at": "2025-03-09T22:33:28Z" + }, + { + "hn_id": "44847155", + "title": "Expediting On-Device LLM Personalization via Explainable Model Selection", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44847155", + "created_at": "2025-08-09T15:13:10Z" + }, + { + "hn_id": "37693398", + "title": "Frustrated with Code Quality Issues? LLMs Can Help", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37693398", + "created_at": "2023-09-28T18:11:20Z" + } + ], + "top_points": 99, + "total_points": 120, + "total_comments": 16 + } +} +\ No newline at end of file diff --git a/papers/drift-dynamic-rulebased-2025/scan-v5.json b/papers/drift-dynamic-rulebased-2025/scan-v5.json @@ -0,0 +1,567 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents", + "authors": [ + "Hao Li", + "Xiaogeng Liu", + "Hung-Chun Chiu", + "Dianqi Li", + "Ning Zhang" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2506.12104", + "doi": "10.48550/arXiv.2506.12104" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims ASR reduction from 30.7% to 1.3% and 20.1% utility improvement over CaMeL; Figure 3 confirms these exact figures (CaMeL=38.4%, DRIFT=58.5%, delta=20.1%). Minor inconsistency in Section 3.2 body text ('21.8%') vs figure, but abstract matches figures.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation study (Table 1) incrementally adds Secure Planner, Dynamic Validator, and Injection Isolator to isolate each component's contribution to both ASR and utility, providing a valid design for causal component claims.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper repeatedly makes claims about 'real-world agentic systems' and 'broad adaptability' despite evaluation being limited to 4 simulated AgentDojo scenarios and 10 ASB scenarios; the limitations section acknowledges scope limitations only vaguely.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether performance gains could stem from the additional LLM calls (acting as sanity checks) rather than the specific DRIFT design, nor whether a simpler multi-LLM verification approach would achieve similar results.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "ASR (attack success rate), Benign Utility, and Utility Under Attack are defined clearly and measure exactly what is claimed; no conflation between proxy and target metrics.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Appendix A is a dedicated Limitations section that goes beyond a single sentence, noting that benchmark domains are limited and do not fully cover real-world diversity.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The limitations section only states that 'benchmark domains are limited and do not fully cover diverse tasks'; no specific threats such as single-run variance, benchmark contamination, or model-version sensitivity are identified.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state what the results do NOT show; no explicit boundary on attack types, task complexity levels, or deployment environments beyond a vague acknowledgment of benchmark limitations.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "'This project is partially supported by Schmidt Science AI2050 Early Career Fellow and Open philanthropy' is disclosed in the Acknowledgments and Disclosure of Funding section.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are listed on the title page: Washington University in St. Louis, Johns Hopkins University, and Independent Researcher.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Schmidt Science and Open Philanthropy have no financial stake in GPT-4o-mini, Claude, or the AgentDojo/ASB benchmarks being evaluated.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no competing interests statement, no declaration of patents or equity interests; the acknowledgments section only discloses funding sources.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Prompt injection attacks are defined with concrete examples, LLM agents are characterized, ASR/utility metrics are defined in Section 3.1, and control/data constraints are explained in Section 2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly lists two contributions: the DRIFT framework itself and extensive experiments demonstrating effectiveness and adaptability; the three components (Secure Planner, Dynamic Validator, Injection Isolator) are clearly described.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Appendix B provides a structured related work section contrasting model-level vs system-level defenses; the paper explicitly positions DRIFT against CaMeL (static policy) and Progent (dynamic policy without memory isolation) and explains what each lacks.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code is released at https://github.com/SaFoLab-WISC/DRIFT as stated in the abstract.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Evaluation uses AgentDojo and ASB, both publicly available benchmarks used unmodified; the custom ToolBench-derived training dataset is only promised ('We will release') in the NeurIPS checklist.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or explicit dependency list is provided in the paper; only training hyperparameters (batch size, learning rate) are mentioned, not the software environment.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper references code in supplementary materials and a GitHub repo, but no step-by-step reproduction instructions appear in the paper itself; the NeurIPS checklist says 'code in supplementary' but no commands or workflow are described.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "The NeurIPS checklist Q7 explicitly answers [No] for error bars; all results are single point estimates with no confidence intervals reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are performed for any comparative claims; all comparisons are raw percentage point differences without significance testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage improvements are reported consistently throughout (e.g., ASR from 30.7% to 1.3%, 20.1% utility improvement over CaMeL), providing interpretable effect sizes in the benchmark context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The paper uses 97 user tasks and 629 injection tasks from AgentDojo as fixed benchmark sizes without any justification of whether these sizes are sufficient to detect meaningful differences.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, standard deviation, or results from multiple runs are reported; the NeurIPS checklist explicitly acknowledges no error bars are provided.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Seven baselines are included in AgentDojo comparisons: undefended agent, repeat_user_prompt, spotlighting, tool_filter, pi_detector, CaMeL, and Progent; five in ASB.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "CaMeL (arXiv 2503.18813, 2025) and Progent (arXiv 2504.11703, 2025) are concurrent or recent system-level defenses; the paper explicitly compares against the most advanced available methods.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 1 provides a proper ablation study incrementally adding Secure Planner, Dynamic Validator, and Injection Isolator, with both utility and security metrics reported at each step.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Three metrics are reported: Benign Utility (no-attack task completion), Utility Under Attack, and Targeted Attack Success Rate (ASR), covering both security and utility dimensions.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "The evaluation uses fully automated benchmarks (AgentDojo, ASB) for measuring attack success and task completion rates; human evaluation is not relevant for this type of system defense evaluation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "LoRA fine-tuning is done on ToolBench-derived training data while evaluation is performed on AgentDojo and ASB test benchmarks, maintaining separation between training and test distributions.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Tables 6-8 in Appendix D provide per-scenario breakdowns across Banking, Slack, Travel, and Workspace for all models on AgentDojo.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix C.1 analyzes 6 open-ended tasks where DRIFT underperforms (17.6% vs 25.7% for base agent); adaptive attack experiments in Section 3.6 also reveal partial failures under curated attacks.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The ablation shows that using only the Secure Planner causes severe utility loss (25.84% drop); open-ended task analysis shows DRIFT achieves only 70% of base agent capability on those tasks.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Only GPT-4o-mini is specified with a version date ('GPT-4o-mini-2024-07-18'); Claude-3.5-sonnet, Claude-3-haiku, and GPT-4o are referenced only by marketing names without snapshot dates.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "All six system prompts used by DRIFT components (constraint generation, privilege assignment, intent alignment, injection detection, planning sampling, injection sampling) are provided in Appendix E as Figures 8-13.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "LoRA fine-tuning hyperparameters are reported (batch size 4, 3 epochs, lr 2e-5, Adam optimizer), but inference hyperparameters for all LLM calls (temperature, top-p, max tokens) are not mentioned anywhere in the paper.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 2 provides detailed descriptions of all three DRIFT components with accompanying workflow diagrams (Figures 1-2) showing data flow between Secure Planner, Dynamic Validator, and Injection Isolator.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 2.5.1 describes the full training data pipeline: ToolBench conversation rewriting via GPT-4o-mini, planner data sampling (1,000 samples), isolator data sampling with synthetic injection generation (1,000 samples), and tool environment reconstruction (10,000+ tools).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw evaluation results (individual task outcomes, per-attack-instance decisions) are not released; only aggregate metrics are reported, and the custom training dataset is only promised for future release.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 2.5.1 describes the ToolBench-to-training-data pipeline in detail, including conversation rewriting procedures, injection simulation methods, and tool list construction from 5,000 samples.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Evaluation uses standard public benchmarks (AgentDojo, ASB) with no human participant recruitment involved.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Section 2.5.1 documents the complete pipeline from ToolBench source data through GPT-4o-mini rewriting, injection simulation, and training sample construction with sample counts and turn ranges specified.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs are not stated for any of the evaluated models; GPT-4o-mini-2024-07-18's training cutoff is not mentioned, and Claude/GPT-4o cutoffs are absent entirely.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether AgentDojo or ASB benchmark tasks were in the training data of the evaluated closed-source models (GPT-4o-mini, GPT-4o, Claude); AgentDojo was published at NeurIPS 2024.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "AgentDojo was published at NeurIPS 2024 and ASB at ICLR 2025, both potentially within the training window of GPT-4o-mini-2024-07-18 and Claude models; this is not addressed anywhere in the paper.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants involved; NeurIPS checklist Q14 explicitly confirms this.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants involved; NeurIPS checklist Q15 explicitly confirms no human research.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants involved.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 3 provides total token usage for all defense methods on AgentDojo (DRIFT=2.37M tokens vs undefended=0.82M), along with an efficiency metric combining utility, ASR, and token cost.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Fine-tuning details (batch size, epochs, optimizer) are provided but GPU type, memory, training time, and total compute budget are not stated in the paper.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DRIFT reduces ASR from 30.7% to 1.3% on GPT-4o-mini on AgentDojo benchmark.", + "evidence": "Figure 3 shows undefended agent ASR=30.7% vs DRIFT ASR=1.3% on GPT-4o-mini.", + "supported": "strong" + }, + { + "claim": "DRIFT outperforms CaMeL in utility by 20.1% under no-attack conditions while achieving comparable security.", + "evidence": "Figure 3: CaMeL utility=38.4%, DRIFT utility=58.5%, delta=20.1%; ASR: CaMeL=0.0% vs DRIFT=1.3%.", + "supported": "strong" + }, + { + "claim": "Dynamic policies significantly outperform static policies for tasks with trajectory length ≥ 3.", + "evidence": "Figure 6b shows static and dynamic policies diverge sharply at trajectory length 3+, with dynamic maintaining stable success rate while static drops.", + "supported": "moderate" + }, + { + "claim": "DRIFT maintains robustness under adaptive attacks with only 0.81% ASR increase under combined isolator+validator adaptive attack.", + "evidence": "Table 2 shows combined IAA+VAA results in ASR of 2.10% vs baseline 1.29%, a 0.81pp increase.", + "supported": "strong" + }, + { + "claim": "Fine-tuned Qwen2.5-7B DRIFT achieves 0% ASR while improving utility by 5.6% in safe conditions.", + "evidence": "Figure 5 shows Qwen2.5-7B+DRIFT: ASR=0.0% vs ReAct=15.1%, utility 32.2% vs 26.6%.", + "supported": "strong" + }, + { + "claim": "DRIFT significantly outperforms Progent on weaker models (GPT-4o-mini), demonstrating better design for lower-capability models.", + "evidence": "Appendix Table 5: DRIFT achieves 1.64% ASR vs Progent 9.39% on AgentDojo with GPT-4o-mini; attributed to DRIFT's simpler subtask decomposition.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DRIFT is a three-component system-level defense (Secure Planner, Dynamic Validator, Injection Isolator) that reduces prompt injection attack success rate from 30.7% to 1.3% on GPT-4o-mini on AgentDojo while outperforming static policy baseline CaMeL in utility by 20.1%. The Injection Isolator addresses a previously underserved threat: injection content embedded in tool results that does not alter the tool-call trajectory but corrupts the final response. Dynamic constraint updating is shown to be necessary for complex multi-step tasks, where static policies cause severe utility degradation at trajectory lengths ≥ 3. The framework generalizes across five diverse models including a locally fine-tuned Qwen2.5-7B that achieves 0% ASR after policy tuning.", + "red_flags": [ + { + "flag": "No error bars or significance tests", + "detail": "All results are single point estimates with no variance across runs; the NeurIPS checklist explicitly acknowledges no error bars (Q7: [No]). Security benchmarks with 97-629 tasks and stochastic LLM calls warrant uncertainty quantification." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "AgentDojo was published at NeurIPS 2024 and may appear in training data of GPT-4o-mini-2024-07-18 and Claude models; no discussion of whether model familiarity with benchmark scenarios inflates defense or utility metrics." + }, + { + "flag": "Circular training data generation", + "detail": "GPT-4o-mini is used to generate training labels (rewrite ToolBench conversations to match DRIFT policy) and is also the primary evaluation model; the model may be biased toward outputs that match its own labeled training format." + }, + { + "flag": "Minor numerical inconsistencies between abstract and body", + "detail": "Abstract states '20.1% under no attack and 12.5% under attack' vs CaMeL; Section 3.2 body states '21.8% in the no-attack setting and 10.9% under attack'. The abstract matches Figure 3 values but the main text does not." + }, + { + "flag": "Inference hyperparameters unreported", + "detail": "Temperature, top-p, and max tokens for all LLM calls (Secure Planner, Dynamic Validator, Injection Isolator, base agent) are not reported, making exact reproduction of LLM behavior impossible." + } + ], + "cited_papers": [ + { + "title": "AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents", + "relevance": "Primary evaluation benchmark; provides the 97 user tasks and 629 injection tasks used for all main results." + }, + { + "title": "Defeating Prompt Injections by Design (CaMeL)", + "relevance": "Key static policy-based baseline that DRIFT explicitly improves upon; demonstrates the utility-security tradeoff of static approaches." + }, + { + "title": "Progent: Programmable Privilege Control for LLM Agents", + "relevance": "Concurrent dynamic policy baseline; detailed comparison in Appendix C.2 reveals DRIFT's advantage on weaker models due to simpler subtask decomposition." + }, + { + "title": "IsolateGPT: An Execution Isolation Architecture for LLM-Based Agentic Systems", + "relevance": "Related system-level defense using application isolation; DRIFT's Injection Isolator addresses its limitation of residual in-memory injection content." + }, + { + "title": "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs (ToolBench)", + "relevance": "Source of training data for DRIFT's policy fine-tuning; 2,000 samples derived from ToolBench conversations." + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "Baseline agent framework that DRIFT is applied on top of; used as the comparison point for all adaptation experiments." + }, + { + "title": "Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-Based Agents", + "relevance": "Second evaluation benchmark providing 10 diverse scenarios for security assessment." + }, + { + "title": "InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents", + "relevance": "Foundational work on prompt injection attack types and risk categories (economic loss, privacy leakage)." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Code is released on GitHub, the framework plugs into existing LLM agents without model modification, and it addresses a real security threat for deployed agentic systems." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that dynamic policies outperform static ones and that memory isolation is needed is intuitive; no surprising inversions of conventional wisdom." + }, + "fear_safety": { + "score": 2, + "justification": "Demonstrates that even GPT-4o (the most capable model tested) has 51.7% ASR under prompt injection without defense, raising legitimate concern about production LLM agent deployments." + }, + "drama_conflict": { + "score": 1, + "justification": "Standard academic positioning against CaMeL and Progent; no controversial claims or community-dividing arguments." + }, + "demo_ability": { + "score": 2, + "justification": "Code released on GitHub and evaluation uses public AgentDojo benchmark, enabling researchers to reproduce and test DRIFT on the same tasks." + }, + "brand_recognition": { + "score": 1, + "justification": "Washington University in St. Louis and Johns Hopkins are reputable but not AI-famous labs; NeurIPS 2025 venue adds some recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44770561", + "title": "B-Splines and Fourier-Best Friends for Spatial-Temporal Video Super-Resolution", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44770561" + }, + { + "hn_id": "47002668", + "title": "LLMs exceed physicians on complex text-based differential diagnosis", + "points": 3, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=47002668" + }, + { + "hn_id": "45534337", + "title": "Advancing medical artificial intelligence using a century of cases", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45534337" + }, + { + "hn_id": "43401539", + "title": "CriteoPrivateAd: RealWorld Bidding Dataset to Design Private Advertising Systems", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=43401539" + }, + { + "hn_id": "31894669", + "title": "Protecting President Zelenskyy Against Deep Fakes", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=31894669" + }, + { + "hn_id": "27612994", + "title": "LegoFormer: Transformers for Block-by-Block Multi-View 3D Reconstruction", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=27612994" + }, + { + "hn_id": "44971660", + "title": "Scaling laws found in large generative medical event models", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44971660" + }, + { + "hn_id": "41227450", + "title": "Τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41227450" + }, + { + "hn_id": "40782080", + "title": "Should AI optimize your code? A studio", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40782080" + }, + { + "hn_id": "28895006", + "title": "IQ-Learn: Inverse Soft-Q Learning for Imitation", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=28895006" + } + ], + "top_points": 4, + "total_points": 20, + "total_comments": 4 + } +} +\ No newline at end of file diff --git a/papers/drip-defending-prompt-2025/scan-v5.json b/papers/drip-defending-prompt-2025/scan-v5.json @@ -0,0 +1,500 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Instruction Fusion", + "authors": [ + "Ruofan Liu", + "Yun Lin", + "Zhiyong Huang", + "Jin Song Dong" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2511.00447", + "doi": "10.48550/arXiv.2511.00447" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 12–49% SEP improvement and 66%+ ASR reduction are directly supported by Tables 3, 4, and 5; utility parity is supported by Table 6 (83.89% vs 85.37% AlpacaEval).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Section 4.4 ablation isolates contributions of each component (data curation cases, representation editing, instruction fusion) through controlled variants, adequately supporting causal claims about each design choice.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 4.5.4 explicitly bounds scope to 7B–8B models, single-turn settings, and text modality; claims of 'new state-of-the-art' are qualified by these stated constraints.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether improvements could stem from data augmentation effects rather than the specific representation editing mechanism; only the authors' intended explanation is presented.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "SEP score, ASR, IFEval, and AlpacaEval are each explicitly defined and their relationship to 'role separation capability' and 'utility preservation' is explained in Sections 4.1.1 and 4.3.1.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; limitations appear in Section 4.5.4 titled 'Future Work' and Section 4.5.1 'Failure case,' not in a standalone section.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Section 4.5.4 names specific threats: model scale (7B–8B only), single-turn constraint, and text-only modality, going beyond generic disclaimers.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicit boundaries stated: indirect injection only (Section 2), open-source decoder-only models 7B–8B, single-turn prompts, English text modality only.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment appears anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations (National University of Singapore, Shanghai Jiao Tong University) are disclosed on the title page.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Section 2 defines prompt injection, direct vs. indirect injection, threat model, and defender objectives precisely; 'de-instructionalize' is defined in Section 3.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Four explicit contributions are listed at the end of Section 1: defense framework, novel architecture, tool release, and evaluation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 4.6 categorizes prior defenses into detection, inference-time, and finetuning-based; Sections 4.1–4.3 qualitatively explain why DRIP outperforms each baseline, not just listing papers.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Code is available at https://anonymous.4open.science/r/PromptInjection-BD09 with installation guidance, though it is an anonymous pre-publication repository.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks (SEP, AlpacaFarm, InjecAgent, AlpacaEval 2.0, IFEval, MT-Bench) are standard publicly available datasets used unmodified for evaluation.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Section 3.4 mentions hardware (6 NVIDIA RTX 5880 48GB GPUs) and LoRA settings, but no requirements.txt, Dockerfile, or full dependency specification is provided in the paper.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper references an anonymous code repository for installation guidance but provides no step-by-step reproduction instructions within the paper itself.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 3–7 are single point estimates with no confidence intervals or error bars reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative results across the paper.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute scores and baseline comparisons are reported (e.g., SEP 80.9% vs 31.9% for SecAlign, GCG ASR 1.06% vs 66.67%), providing effect size context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Benchmark sizes are given (SEP: 9,160 tuples; AlpacaFarm: 208 examples; InjecAgent: 1,054 cases) but no justification or power analysis for why these are sufficient is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance or standard deviation across runs is reported; all tables show single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four baselines are compared: Undefended, StruQ, SecAlign, ISE, and PFT across all three benchmarks.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "StruQ [2024], SecAlign [2024], ISE [2024], and PFT [2025] are all recent methods representing the current state of the field.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 4.4 provides a full ablation over data curation strategy (Cases 1–3) and architectural components (fusion type, shift type) with results in Table 7.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Evaluation uses SEP score, ASR across multiple attack families, IFEval accuracy, AlpacaEval win rate, and MT-Bench category scores.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "Utility is evaluated via LLM-as-judge (GPT-4 for AlpacaEval 2.0 and MT-Bench), not human annotators.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Training uses the SEP training split (Section 3.2); evaluation uses the SEP test benchmark, AlpacaFarm test set, and InjecAgent test cases.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "MT-Bench results are broken down by 8 skill categories (Figure 8); AlpacaFarm ASR is broken down by attack family (Table 5).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 4.5.1 shows a 'semantic echo' failure case where DRIP avoids direct execution but leaks injected content semantically into the output.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "ISE completely fails on InjecAgent (all responses non-conforming to ReAct format, Table 4); ablation shows removing Case 2 spikes GCG ASR to 0% but degrades SEP by 22pp.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "LLaMA-8B cites [19] (Llama 3 herd, arXiv:2407.21783) and Mistral-7B cites [25] (arXiv:2310.06825), providing sufficient specificity to identify the models used.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Figures 11 and 12 show the full training response generation prompt and auditor prompt used with GPT-4o.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Section 3.4 reports LoRA rank r=16, α=8, dropout=0.05, global batch size 24, learning rate 1×10⁻⁴, and 1 training epoch.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "The DRIP system is a fine-tuned model without agentic scaffolding; InjecAgent uses the benchmark's own ReAct scaffolding from [68], not custom scaffolding introduced by the authors.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 3.2 and Figure 4 document the full training data curation pipeline including SEP split resampling, response generation via GPT-4o, XML tagging, and LLM-as-judge auditing.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All evaluation benchmarks (SEP, AlpacaFarm, InjecAgent) are publicly released datasets; training data is derived from the public SEP training split.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 3.2 describes how training data is curated from SEP training split with specific resampling procedures and response generation pipeline.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants were recruited; all evaluation uses automated benchmarks and LLM-as-judge.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Figure 4 provides a complete pipeline diagram from DPO pair construction through response generation and auditing steps.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The pretraining data cutoffs for LLaMA-8B and Mistral-7B are not stated, despite fine-tuning these models on data derived from benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "The paper explicitly uses the SEP training split for training and the SEP test benchmark for evaluation, addressing the direct overlap concern.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Whether SEP, AlpacaFarm, or InjecAgent test examples appeared in LLaMA/Mistral pretraining data is never discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants; ethical considerations section explicitly states 'This work does not involve human subjects.'", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants; ethical considerations section confirms no human subjects involved.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference latency or cost measurements are reported, despite DRIP adding a representation-editing module and residual fusion at inference time.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware is mentioned (6 NVIDIA RTX 5880 48GB GPUs) but total compute hours, GPU-hours, or training cost are not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DRIP achieves 80.9% SEP score on LLaMA-8B, improving over the strongest baseline SecAlign (31.9%) by 49 percentage points.", + "evidence": "Table 3 reports SEP scores across all defenses on both models.", + "supported": "strong" + }, + { + "claim": "DRIP reduces GCG optimization-based attack success rate to 1.06% on LLaMA-8B versus 66.67%+ for all baselines.", + "evidence": "Table 5 GCG row: DRIP 1.06% vs SecAlign 66.67%, StruQ 98.08%, ISE 98.56%, PFT 98.08%.", + "supported": "strong" + }, + { + "claim": "DRIP preserves instruction-following utility at near-undefended levels (83.89% vs 85.37% AlpacaEval win rate on LLaMA-8B).", + "evidence": "Table 6 reports IFEval and AlpacaEval 2.0 results; DRIP achieves the highest IFEval accuracy (76.02%) while other defenses degrade significantly.", + "supported": "strong" + }, + { + "claim": "Instruction fusion is critical for defense against optimization-based attacks; removing it raises GCG ASR from 1.06% to 62.80%.", + "evidence": "Table 7 ablation, 'No fusion' row versus default.", + "supported": "strong" + }, + { + "claim": "Token-wise representation editing preserves utility better than global role embedding offsets (ISE-style), with 7pp higher AlpacaEval score.", + "evidence": "Table 7 ablation: 'Embedding shift' row shows 76.70% utility vs default 83.89%.", + "supported": "moderate" + }, + { + "claim": "All three training data cases (Cases 1–3) are necessary; removing Case 3 causes adaptive GCG ASR to spike from 1.06% to 69.90%.", + "evidence": "Table 7 ablation, 'No Case 3' row; the paper also provides theoretical justification in Appendix A.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "theoretical" + ], + "key_findings": "DRIP introduces a two-component fine-tuning defense against indirect prompt injection: a lightweight token-wise representation editing module that projects data tokens away from the instruction manifold, and a residual instruction fusion pathway that anchors output generation to the original instruction. Evaluated on three benchmarks (SEP, AlpacaFarm, InjecAgent) against four baselines, DRIP achieves 80.9%/70.7% SEP scores versus 31.9%/58.6% for the best prior method (SecAlign), and reduces GCG attack success rate to under 3.4% versus 66%+ for all baselines. Crucially, utility is maintained near undefended model levels (83.89% vs 85.37% AlpacaEval), resolving the security-utility tradeoff that plagued prior defenses. Ablation confirms both components are necessary: removing instruction fusion alone raises adaptive attack success to 62.80%.", + "red_flags": [ + { + "flag": "No error bars or statistical tests", + "detail": "All results are single point estimates with no confidence intervals, standard deviations, or significance tests, making it impossible to assess whether differences are statistically meaningful." + }, + { + "flag": "Anonymous code repository", + "detail": "Code is hosted on anonymous.4open.science, a temporary anonymous review platform; long-term availability is not guaranteed and the repository may not persist after review." + }, + { + "flag": "LLM-as-judge for utility evaluation", + "detail": "AlpacaEval 2.0 and MT-Bench use GPT-4 as judge; GPT-4's evaluation preferences may introduce systematic biases that favor or disfavor certain output styles independent of actual quality." + }, + { + "flag": "Only 7B–8B models tested", + "detail": "All experiments are conducted on LLaMA-8B and Mistral-7B; the authors acknowledge results may not generalize to larger models, limiting the scope of the 'state-of-the-art' claim." + }, + { + "flag": "No inference overhead measurement", + "detail": "DRIP adds a representation-editing layer and residual fusion path at inference time, but no latency or throughput measurements are provided, leaving practical deployment cost unknown." + }, + { + "flag": "GPT-4o used for training data generation", + "detail": "Ground-truth responses are generated by GPT-4o, which is itself vulnerable to prompt injection; the authors add sanitization steps but acknowledge residual noise risk in the training data." + } + ], + "cited_papers": [ + { + "title": "StruQ: Defending against prompt injection with structured queries", + "relevance": "Primary baseline and training protocol baseline; DRIP's contrastive training extends StruQ's approach." + }, + { + "title": "SecAlign: Defending against prompt injection with preference optimization", + "relevance": "Strongest prior-art baseline using DPO; DRIP outperforms it, especially on adaptive optimization-based attacks." + }, + { + "title": "Instructional Segment Embedding: Improving LLM safety with instruction hierarchy (ISE)", + "relevance": "Architectural baseline using global role embeddings; DRIP's token-wise approach is contrasted against ISE throughout." + }, + { + "title": "Can LLMs separate instructions from data? And what do we even mean by that? (SEP benchmark)", + "relevance": "Primary evaluation benchmark and training data source; defines the SEP score metric used throughout." + }, + { + "title": "InjecAgent: Benchmarking indirect prompt injections in tool-integrated LLM agents", + "relevance": "Agentic evaluation benchmark testing DRIP in ReAct-style tool-use settings." + }, + { + "title": "Universal and transferable adversarial attacks on aligned language models (GCG)", + "relevance": "Key adaptive attack method used to evaluate robustness of DRIP against optimization-based prompt injection." + }, + { + "title": "Neural Exec: Learning (and learning from) execution triggers for prompt injection attacks", + "relevance": "Universal adversarial prefix/suffix attack method used in evaluation." + }, + { + "title": "ASIDE: Architectural separation of instructions and data in language models", + "relevance": "Closely related concurrent defense using orthogonality constraints on representations; cited as related work." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses a critical security concern for deployed LLM applications processing untrusted data, with code and training framework released." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The representation editing framing is a novel angle in the prompt injection defense space, but the general problem and direction are well-established." + }, + "fear_safety": { + "score": 3, + "justification": "Prompt injection enables attackers to hijack AI agents in production systems; the agentic deployment context (InjecAgent) directly maps to real-world AI safety risks." + }, + "drama_conflict": { + "score": 1, + "justification": "No major controversy; straightforward security paper with competitive results but no surprising negative findings about widely-used systems." + }, + "demo_ability": { + "score": 2, + "justification": "Anonymous demo website is available (sites.google.com/view/drip-prompt) and code is released, though the anonymized state limits immediate trust." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are from NUS and SJTU — credible academic institutions but not major AI lab brands that drive HN attention." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/driving-style-alignment-2024/scan-v5.json b/papers/driving-style-alignment-2024/scan-v5.json @@ -0,0 +1,606 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Driving Style Alignment for LLM-powered Driver Agent", + "authors": [ + "Ruoxuan Yang", + "Xinyu Zhang", + "Anais Fernandez-Laaksonen", + "Xin Ding", + "Jiangtao Gong" + ], + "year": 2024, + "venue": "IEEE/RJS International Conference on Intelligent RObots and Systems", + "arxiv_id": "2403.11368", + "doi": "10.1109/IROS58592.2024.10802629" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims that agents align with driving styles, dataset created, and validation performed are all supported. Simulation results (Fig 3) show style-specific behavior differentiation; human eval (n=259) confirms perceptibility.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claim that multi-alignment causes style alignment is tested via 3×3 ablation design (demonstrations-only vs feedback-only vs both). Ablation shows multi-alignment is most effective. Limitation: simulation-only, no real-world causality tested.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Paper tests only 2 driving styles in 1 simulator environment (CARLA Town10), but title and abstract promise general 'driving style alignment.' Conclusion claims 'paves the way...across a broad spectrum of applications' beyond scope tested.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper shows multi-alignment works empirically but provides limited mechanistic explanation. The finding that humans associate higher riskiness with human-likeness is acknowledged as 'interesting psychological insight' but not deeply explored as alternative interpretation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Clearly distinguishes measured outcomes (collision rate, throttle %, speed) from conceptual claims (driving style). Human evaluation outcomes (riskiness ratings, intelligence, human-likeness) appropriately mapped to perceived style perception.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Conclusion contains brief discussion of implications and psychological insights but not formal limitations discussion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Paper does not explicitly discuss major threats: simulation-only validation, only 2 of 4 identified driving styles, small data collection (24 drivers), short video clips (30s) in human eval, single simulator environment.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit boundaries stated on scope. Paper does not acknowledge limitations to CARLA simulator, 2 styles only, or single urban environment. Claims generalize beyond tested conditions without caveats.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement or acknowledgments section listing funding sources. Work from Tsinghua but no disclosure of whether it was funded internally or externally.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors listed with affiliation: Institute for AI Industry Research, Tsinghua University. No undisclosed affiliations with autonomous driving companies.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "Funding source not disclosed, so cannot assess independence. If Tsinghua funded work promoting their own framework, potential conflict exists but cannot verify.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement. No disclosure of patents, equity stakes, or consulting relationships related to autonomous driving or LLM companies.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined: 'driving style' via MDSI questionnaire + objective CAN-Bus metrics (speed, throttle); 'alignment' via demonstrations + coach feedback; 'multi-alignment framework' clearly explained with Driver/Coach agents.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions stated: (1) multi-alignment framework, (2) natural language dataset, (3) validation via simulation + human eval. Reader understands what paper adds to the field.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Introduction engages with prior work on LLM reasoning for autonomous driving, limitations of existing alignment methods (fine-tuning, expert feedback), and existing dataset modalities. Shows how this work addresses a gap in style-alignment and natural language data.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Paper states 'The implementation of the framework...can be found at the link' with GitHub URL (github.com/AIR-DISCOVER/Multi-alignment-Drivng-Agent). Code is publicly released.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Driving-Thinking-Dataset released on GitHub (github.com/AIR-DISCOVER/Driving-Thinking-Dataset) with 24 drivers' think-aloud transcripts in natural language format.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Partial specification: Python 3.7, CARLA 0.9.14, Unreal Engine 4 provided. But missing key dependencies (numpy, pandas, requests for API calls, etc.). Specification insufficient for full reproduction.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Paper provides no step-by-step reproduction instructions. References GitHub but doesn't show what instructions are there. Reader cannot reproduce from paper alone.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Fig 3b shows only mean values for speed, throttle, brake with no error bars. Simulation metrics reported without confidence intervals. No spread/variance visualization.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": true, + "justification": "Fig 4a shows p-values with stars (p<0.0001 ****, 0.0001-0.001 ***, etc.). Comparative claims in results section backed by statistical tests, though test type not specified.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Only p-values reported in Fig 4a. No Cohen's d, eta-squared, or other effect sizes for collision rates, speed, or human evaluation metrics. Effect magnitude unclear.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "No justification for 24 drivers in data collection, 259 human participants, or 50.3 hours of simulation. No power analysis provided. Sample sizes appear chosen for convenience.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Simulation results (Fig 3b) report only means with no error bars or standard deviations. Human eval reports point estimates without spreads. Variance across runs not shown.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "NOT-ALIGNED condition serves as baseline for comparison. Shows what happens without demonstrations or coach feedback.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "Only internal baseline (no alignment) tested. No comparison to other alignment methods from literature (fine-tuning, RLHF, in-context learning). Weak baselines limit evidence for novelty.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "3×3 design tests demonstrations-only vs feedback-only vs multi-alignment. Shows multi-alignment most effective and both components contribute, suggesting necessity of both.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Simulation metrics: collision rate, average speed, throttle %, brake %. Human eval metrics: riskiness ranking, intelligence score, human-likeness score. Six dimensions of evaluation.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "259 participants evaluated 30-second video clips of agent driving behavior. Ranked riskiness and scored intelligence/human-likeness. Evaluates system outputs (driving videos), not just dataset.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Simulation generalization tested on unseen scenarios with randomly generated endpoints (not pre-set). Single environment (CARLA Town10) but driving paths varied.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results broken down by driving style (CAUTIOUS vs RISKY vs NOT-ALIGNED), alignment method (D vs F vs M), and human evaluation by participant driving style. Category-level analysis provided.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Limited discussion of failure modes. Paper notes that demonstrations alone were 'least effective' but does not show specific scenarios where method fails or provide failure case analysis.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "DEMONSTRATIONS-only showed 'least effectiveness' compared to other methods, which is a partial negative result. But no completely failed conditions or null findings reported.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Only 'OpenAI's GPT-4 APIs' mentioned without specifying which GPT-4 version (gpt-4, gpt-4-turbo, gpt-4-32k), model date, or snapshot. CARLA 0.9.14 is specific but LLM is not.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Example prompts shown: 'Think Step by Step' and example reasoning ('Given the rather faster speed...'). Full system prompts for Driver Agent and Coach Agent not provided in paper.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, max_tokens, or other GPT-4 hyperparameters reported. CARLA time-step specified (0.0008-0.0015s) but LLM inference hyperparameters missing.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Agentic scaffolding well described: Driver Agent workflow (perception→situation→reasoning→action), Coach Agent evaluation logic, Guidelines module, short-term memory management. Components and interactions clear.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Detailed pipeline: naturalistic driving (24 drivers, 5.7 km, 13 conditions), post-experiment interviews (1.5-2 hrs, video reconstruction), transcription, organization into Situation/Reasoning/Action format, style classification (MDSI + CAN-Bus metrics), representative selection.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Driving-Thinking-Dataset GitHub repository released. Raw interview transcripts and decision processes should be available for independent verification.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "24 drivers, 5.7 km urban drive, 13 driving conditions. Detailed recording setup (360° camera, in-car camera, eye tracker, CAN-Bus). Post-experiment interviews (1.5-2 hrs) with video reconstruction. Well documented.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Driver data collection: only described as '24 drivers invited' with 'different genders, age groups, professional and novice drivers.' No recruitment method stated. Human eval: third-party channel ($2.08 compensation) and social media. Partial documentation.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline documented: collection (driving experiment + interview) → transcription → organization (Situation/Reasoning/Action format) → style classification (MDSI questionnaire + CAN-Bus metrics) → demonstration selection → use in framework.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "GPT-4 APIs used but model cutoff date not stated. No discussion of when GPT-4 was trained or knowledge cutoff. Reproducibility unclear without this information.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Paper uses GPT-4 (general internet-trained LLM) to drive simulated cars in CARLA. Scenario descriptions in prompts could overlap with internet content about driving, but no train-test overlap discussion provided.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not evaluating on standard benchmarks; uses custom CARLA scenarios. Not applicable in traditional sense, but paper does not address potential contamination of driving knowledge in GPT-4 pretraining.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration or trial registration number mentioned. Study design not pre-registered, raising concerns about p-hacking or post-hoc analysis.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "No IRB approval, ethics approval, or institutional review mentioned despite involving 24 drivers + 259 human participants. Major ethical concern for human subjects research.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": true, + "justification": "Data collection: '24 drivers with different genders, age groups, professional and novice.' Human eval: 259 participants (141 male 52.22%, 129 female 47.78%, ages 19-54). Partially detailed.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "Drivers: only 'different demographics' and experience levels mentioned, no explicit inclusion/exclusion. Human eval: only criterion 'possess a driving license.' Minimal criteria documentation.", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": true, + "justification": "Video clips presented in 'random order' to human participants. Within-subject design ensures all participants see all conditions. Randomization partially described.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding mentioned. Participants likely knew they were evaluating AI agent driving. No mention of researcher blinding to conditions. Open label design.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": false, + "justification": "Total: '270+ recruited, received 259 valid responses after screening.' Attrition mentioned but not detailed. Unclear what screening removed or why (trap questions, timing minimums mentioned but exclusion counts not given).", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "GPT-4 API calls made for each driving decision. 50.3 hours simulation corresponds to thousands of API calls, but no cost or latency quantified. Budget impact unknown.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware: 'ThundeRobot Zero desktop.' Time: '50.3 hours simulation, ~6.7 minutes per condition.' No computational cost, API expense, or power consumption reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "LLM-powered driver agents can be aligned with human driving styles (risky vs cautious) using demonstrations and feedback", + "evidence": "Simulation results (Fig 3) show agents aligned with CAUTIOUS style have 1.31-2.12 collisions/meter vs RISKY 3.04-4.78; human evaluation (Fig 4a) shows significant differences in perceived riskiness (p<0.0001)", + "supported": "strong" + }, + { + "claim": "Multi-alignment (combining demonstrations + feedback) is more effective than either component alone", + "evidence": "Ablation study (Fig 3) shows MULTI-ALIGNMENT method achieves best collision rate separation and most significant differences in speed/throttle/brake. Fig 4a confirms MC (multi-cautious) > FC (feedback-cautious) > DC (demo-cautious) in human perception (p<0.0001)", + "supported": "strong" + }, + { + "claim": "Natural language descriptions of human driving decisions can serve as effective demonstrations for LLM agent alignment", + "evidence": "Dataset of 24 drivers' think-aloud transcripts organized into Situation/Reasoning/Action format enables agents to differentiate driving styles. Framework successfully uses this dataset, validating its utility", + "supported": "moderate" + }, + { + "claim": "Agents aligned with cautious driving styles exhibit measurably safer behavior (lower collision rates) than risky-aligned agents", + "evidence": "Fig 3a: CAUTIOUS alignment produces 0.73-1.53 collisions/meter across methods vs RISKY 1.53-4.78. Consistent safety difference across all three alignment methods", + "supported": "strong" + }, + { + "claim": "Humans can reliably and significantly distinguish different driving styles in simulated agent behavior", + "evidence": "Human evaluation with 259 participants shows highly significant differences in riskiness rankings between CAUTIOUS vs RISKY conditions (p<0.0001 in all relevant groups, Fig 4a)", + "supported": "strong" + }, + { + "claim": "Higher perceived riskiness in driving correlates with greater perceived human-likeness (counterintuitive finding)", + "evidence": "Fig 4b shows positive correlation (r=0.10*) between riskiness and human-likeness. Participant comment: 'really like an experienced driver showing off skills' for higher-risk agent", + "supported": "moderate" + }, + { + "claim": "The approach opens new avenues for autonomous driving research across diverse applications and user preferences", + "evidence": "Paper demonstrates proof-of-concept in CARLA simulator with 2 driving styles and human validation, but generalization beyond simulation and 2 styles not empirically tested", + "supported": "weak" + } + ], + "methodology_tags": [ + "empirical", + "human-studies", + "simulation-based" + ], + "key_findings": "The paper demonstrates that LLM-based driving agents can adopt human-like driving styles (cautious vs risky) through a framework combining demonstrations and feedback. Multi-alignment achieves the most significant behavioral differences in CARLA simulation (collision rates, speed, throttle percentages), and 259 human participants reliably distinguish between the styles in 30-second video clips (p<0.0001). Counterintuitively, humans associate higher riskiness with greater human-likeness despite recognizing riskier driving as less intelligent. The natural language dataset of 24 drivers' decision-making processes provides effective demonstrations, though only 2 driving styles were ultimately used despite identifying 4 in the initial classification.", + "red_flags": [ + { + "flag": "Simulation-only validation", + "detail": "All driving tested in CARLA simulator; no real-world validation. Sim-to-real transfer completely unknown. Critical for autonomous driving claims." + }, + { + "flag": "Overgeneralized scope claims", + "detail": "Title and conclusion claim general driving style alignment, but paper tests only 2 styles in 1 simulator environment (CARLA Town10). Generalization claims exceed evidence." + }, + { + "flag": "Missing ethical approval", + "detail": "Human study with 259 participants and 24 drivers, but no IRB approval, ethics board review, or institutional oversight mentioned. Major concern for human subjects research." + }, + { + "flag": "Insufficient statistical reporting", + "detail": "No error bars, confidence intervals, effect sizes (Cohen's d), or sample size justification. Only p-values reported. Makes effect magnitude interpretation impossible." + }, + { + "flag": "Unspecified model version", + "detail": "GPT-4 used but no version (gpt-4, gpt-4-turbo, gpt-4-32k), training cutoff date, or hyperparameters (temperature, top-p) provided. Reproducibility compromised." + }, + { + "flag": "Small data collection sample", + "detail": "Only 24 drivers for creating human demonstrations—small for capturing diversity of driving styles. No justification for sample size." + }, + { + "flag": "Limited baseline comparisons", + "detail": "Only compared against no-alignment baseline. No comparison to other alignment methods (fine-tuning, RLHF, in-context learning) despite these being discussed as existing approaches." + }, + { + "flag": "No limitations section", + "detail": "Paper lacks dedicated limitations or threats-to-validity section. Does not acknowledge simulation scope, generalization limits, or methodological constraints." + }, + { + "flag": "Short video clips in human evaluation", + "detail": "Only 30-second video clips used for human evaluation of agent driving. May be insufficient to perceive true driving style differences beyond surface metrics (speed, throttle)." + }, + { + "flag": "Partial environment specification", + "detail": "CARLA and Python versions provided, but key dependencies missing (packages, API libraries). Insufficient for reproduction without accessing GitHub repo." + } + ], + "cited_papers": [ + { + "title": "Chain-of-thought prompting elicits reasoning in large language models", + "authors": "Wei, J. et al.", + "year": 2022, + "relevance": "Core reasoning technique (CoT) used in Driver Agent decision-making; foundational for the framework's planning capability" + }, + { + "title": "LLM-planner: Few-shot grounded planning for embodied agents with large language models", + "authors": "Song, C.H. et al.", + "year": 2023, + "relevance": "Few-shot learning approach for embodied agent planning; directly relevant to using demonstrations for style alignment" + }, + { + "title": "Driving with llms: Fusing object-level vector modality for explainable autonomous driving", + "authors": "Chen, L. et al.", + "year": 2023, + "relevance": "Prior work on LLM-based autonomous driving; shows progression from perception to LLM-based decision-making" + }, + { + "title": "DriveGPT4: Interpretable end-to-end autonomous driving via large language model", + "authors": "Xu, Z. et al.", + "year": 2023, + "relevance": "Concurrent work on end-to-end LLM driving agents; demonstrates interpretability in autonomous driving" + }, + { + "title": "Training language models to follow instructions with human feedback", + "authors": "Ouyang, L. et al.", + "year": 2022, + "relevance": "RLHF technique; represents the costly human feedback approach that this work aims to improve upon with coach agent" + }, + { + "title": "Reflexion: Language agents with verbal reinforcement learning", + "authors": "Shinn, N. et al.", + "year": 2024, + "relevance": "Agent self-reflection and feedback mechanisms; related to Coach Agent's guideline generation approach" + }, + { + "title": "The mind in the machine: Anthropomorphism increases trust in an autonomous vehicle", + "authors": "Waytz, A., Heafner, J., & Epley, N.", + "year": 2014, + "relevance": "Human trust and anthropomorphism in AVs; relevant to motivation for human-like driving style alignment" + }, + { + "title": "Human-like driving behaviour emerges from a risk-based driver model", + "authors": "Kolekar, S., de Winter, J., & Abbink, D.", + "year": 2020, + "relevance": "Risk-based models of human driving; provides theoretical foundation for driving style dimensions (risky/cautious)" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Code and dataset released on GitHub, enabling practitioners to implement the framework. However, requires CARLA setup, Python 3.7, GPT-4 API access, and is only validated in simulation. Not yet deployable for real autonomous vehicles." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Key finding that humans associate higher riskiness with greater human-likeness contradicts safety intuition and is counterintuitive. However, most other results confirm expected behavior (cautious agents safer, multi-alignment better than components alone)." + }, + "fear_safety": { + "score": 1, + "justification": "LLM-powered agents making driving decisions raises autonomy concerns, but contained to simulation. No discussion of safety failures, adversarial scenarios, or out-of-distribution driving. Limited safety-relevant exploration." + }, + "demo_ability": { + "score": 2, + "justification": "Code publicly released and dataset available, allowing others to run the framework. Requires CARLA installation and GPT-4 API setup, which are non-trivial barriers but doable for resourced teams. Demo potential moderately high." + }, + "brand_recognition": { + "score": 2, + "justification": "Institute for AI Industry Research at Tsinghua University is a respectable institution, but not a top-tier AI research lab in global recognition. Tsinghua carries prestige but this lab is not widely known in AI research community." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45923139", + "title": "Chinese co's roadmap for aneutronic fusion", + "points": 11, + "comments": 3, + "url": "https://news.ycombinator.com/item?id=45923139" + }, + { + "hn_id": "35314773", + "title": "Reflexion: An autonomous agent with dynamic memory and self-reflection", + "points": 4, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=35314773" + }, + { + "hn_id": "41365788", + "title": "Quantum error correction below the surface code threshold", + "points": 3, + "comments": 2, + "url": "https://news.ycombinator.com/item?id=41365788" + }, + { + "hn_id": "42375612", + "title": "Quantum error correction below the surface code threshold", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42375612" + }, + { + "hn_id": "35298128", + "title": "Reflexion: An autonomous agent with dynamic memory and self-reflection", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35298128" + }, + { + "hn_id": "43563070", + "title": "Cordic Is All You Need", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43563070" + }, + { + "hn_id": "41371342", + "title": "Google proves Fault-Tolerant Quantum Computing is possible", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41371342" + }, + { + "hn_id": "35397720", + "title": "Reflexion: An autonomous agent with dynamic memory and self-reflection", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35397720" + }, + { + "hn_id": "22791011", + "title": "A physicist view of the airborne infection", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=22791011" + }, + { + "hn_id": "47221336", + "title": "Show HN: Benchmarking the Keep memory system with LoCoMo", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=47221336" + } + ], + "top_points": 11, + "total_points": 33, + "total_comments": 6 + } +} +\ No newline at end of file diff --git a/papers/dscodebench-realistic-benchmark-2025/scan-v5.json b/papers/dscodebench-realistic-benchmark-2025/scan-v5.json @@ -0,0 +1,382 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DSCodeBench: A Realistic Benchmark for Data Science Code Generation", + "authors": [ + "Shuyin Ouyang", + "Dong Huang", + "Jingwen Guo", + "Zeyu Sun", + "Qihao Zhu", + "Jie M. Zhang" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2505.15621", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All key abstract claims (1,000 problems, 10 libraries, GPT-4o pass@1 0.392, scaling behavior, comparison metrics vs DS-1000) are directly supported by Table 1, Table 2, and Figure 2.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper claims scaling behavior 'validates' the benchmark's discrimination ability, but this is a circular post-hoc argument; it does not rule out that task selection or perturbation methodology specifically favors larger models rather than reflecting inherent capability differences.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper claims DSCodeBench is 'a rigorous and trustworthy foundation for advancing LLM-based data science programming' broadly, but only 10 models and 1,000 Python-only single-function tasks were tested; the framing does not adequately reflect this scope.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss alternative explanations for its main finding—that DSCodeBench shows scaling behavior while DS-1000 does not—such as whether perturbation design or task filtering biases the difficulty distribution toward features favored by larger models.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The limitations section explicitly acknowledges that pass@k measures functional correctness only and not efficiency, security, or readability, with the evaluation scope clearly bounded accordingly.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated 'Limitation' section appears in the Appendix covering Python-only scope, single-function restriction, simplified error-handling, and functional-correctness-only evaluation.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats are named: error-handling simplification (replacing raises with None/defaults), exclusion of multi-file and project-level tasks, SSIM-only visualization scoring, and restriction to Python.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states what DSCodeBench does NOT cover: other languages (R), multi-file workflows, runtime efficiency, security, code style, and readability—both in limitations and future work sections.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed in the Acknowledgement section: ITEA Genius and GreenCode (InnovateUK), UKRI CDT (EP/S023356/1), and NSFC (62402482).", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are listed in the header: King's College London, NUS, CAS, and Peking University—none are affiliated with the commercial LLM vendors evaluated.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Funders (InnovateUK, UKRI, NSFC) are government/public research agencies unrelated to the commercial vendors (OpenAI, DeepSeek, Alibaba/Qwen) whose models are evaluated.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial interests declaration appears; the paper contains only the funding acknowledgement and no 'no competing interests' statement.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "The key term 'realistic' is used as the central descriptor throughout the paper but is never formally defined; it is only operationalized implicitly via proxy metrics (code length, test count, GitHub source) without explicit definition.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions are explicitly enumerated in bullet form: the DSCodeBench dataset, the construction pipeline, and the empirical evaluation of 10 state-of-the-art LLMs.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Table 1 provides direct metric-by-metric comparison against 9 prior benchmarks, and the introduction explains specific limitations of each predecessor that motivate DSCodeBench's design choices.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper asserts GitHub-sourced code is more realistic than Stack Overflow snippets by definition, but provides no independent validation (no domain expert surveys, no analysis of what tasks data scientists actually perform) to confirm the construct maps to real-world data science work.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "Per-library model performance variation is shown (Table 3, pass@1 range 0.029–0.591) but no explicit difficulty tiers are defined or characterized; difficulty is inferred post-hoc from model pass rates rather than measured a priori.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": true, + "justification": "The best model (GPT-4o) achieves 0.392 pass@1 and the weakest (DeepSeek-1.3B) achieves 0.076, confirming no ceiling or severe floor effects; the benchmark clearly discriminates across the evaluated model range.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human performance baseline is provided; the alignment analysis (97.4% agreement) tests whether a human can understand the problem description, not whether humans can solve the coding tasks.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "Pass@k is adopted from Chen et al. 2021 without discussing alternatives; the SSIM > 0.5 threshold for visualization tasks is stated but not justified or ablated, and no edge-case scoring analysis is provided.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "The paper applies systematic code perturbations (signature changes, line insertions/removals, control flow restructuring) and validates decontamination via text similarity < 0.4 and AST similarity < 0.5 across all models and libraries (Figures 3–4).", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of how long the current 1,000 problems will remain undiscovered or ungamed; future work mentions expansions but does not address plans for benchmark updates or versioning as models improve.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "Limitations discuss what the benchmark does not measure but do not analyze failure modes of the benchmark itself—e.g., whether high test-case coverage metrics can be gamed, or whether perturbation-based decontamination introduces systematic biases in task difficulty.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "The paper states 'The benchmark, code, and experiment results are available at https://github.com/ShuyinOuyang/DSCodeBench', providing all components needed to reproduce reported numbers.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "The paper provides detailed construction documentation covering all 5 pipeline stages, filtering criteria, per-library statistics (Table 1, Figure 2), alignment procedures, and test case generation methodology including prompts.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "A GitHub URL is provided but no license is stated anywhere in the paper; it is unclear under what terms the benchmark can be used, modified, or redistributed by others.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": true, + "justification": "The 'Broader Impact' section explicitly specifies intended uses (research, education, industry) and cautions that the benchmark should be used alongside diverse evaluation methods to avoid over-optimization.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DSCodeBench is more challenging than DS-1000, as GPT-4o achieves pass@1 of 0.392 on DSCodeBench vs 0.451 on DS-1000.", + "evidence": "Table 2 shows direct head-to-head comparison on all 10 evaluated models; DSCodeBench consistently yields lower pass@1 and pass@3 than DS-1000 across all models.", + "supported": "strong" + }, + { + "claim": "DSCodeBench exhibits robust scaling behavior where larger models systematically outperform smaller ones, while DS-1000 shows irregular scaling.", + "evidence": "Table 2 shows consistent within-family ordering for DeepSeek (1.3B < 6.7B < 33B) and Qwen (7B < 14B < 32B) on DSCodeBench; DS-1000 shows reversals (e.g., DeepSeek-33B underperforms V2-Lite).", + "supported": "strong" + }, + { + "claim": "Automatically generated test cases achieve 97.8% mean line coverage across all 1,000 benchmark problems.", + "evidence": "Figure 5 reports per-library coverage from 95.5% (Pandas) to 99.7% (Matplotlib) with an overall mean of 97.8%.", + "supported": "strong" + }, + { + "claim": "Contamination is effectively mitigated: LLM-generated code has text similarity < 0.4 and AST similarity < 0.5 to ground truth.", + "evidence": "Figures 3 and 4 show similarity distributions across 10 models and 10 libraries all below stated thresholds, using text and AST similarity metrics.", + "supported": "moderate" + }, + { + "claim": "DSCodeBench provides more reliable evaluation than DS-1000, evidenced by lower variance in model performance scores.", + "evidence": "Table 2 standard deviations are reported (e.g., GPT-4o correct: 391.7±4.6 on DSCodeBench vs 450.7±2.9 on DS-1000); the comparison is inconsistent and not uniformly lower on DSCodeBench.", + "supported": "weak" + }, + { + "claim": "Problem descriptions achieve 97.4% alignment between the description and ground truth code as judged by human experts and LLM judges.", + "evidence": "Alignment section describes dual-stage validation using two author-experts and GPT-4o-mini/GPT-4o as judges; however, the LLM judges are among the models being benchmarked, introducing potential bias.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DSCodeBench introduces 1,000 realistic data science coding problems sourced from GitHub across 10 Python libraries, with average solutions of 22.5 lines and 200 test cases per problem—far exceeding prior benchmarks like DS-1000 (3.6 lines, 2.1 tests). The best evaluated model (GPT-4o) achieves only 0.392 pass@1, confirming the benchmark's difficulty, and unlike DS-1000, all model families exhibit consistent scaling behavior where larger models outperform smaller ones. Visualization libraries (Matplotlib, Seaborn) remain the hardest across all models with pass@1 as low as 0.010–0.141 for GPT-4o, revealing a persistent capability gap. Automatically generated test case scripts achieve 97.8% mean code coverage, supporting the benchmark's evaluation reliability.", + "red_flags": [ + { + "flag": "No human performance baseline", + "detail": "The benchmark claims to measure realistic data science coding ability and to be calibrated to real-world difficulty, but provides no human performance baseline—making it impossible to contextualize LLM scores or validate difficulty calibration against human capability." + }, + { + "flag": "Circular construct validity argument", + "detail": "'GitHub code is more realistic than Stack Overflow' is an assumption used as its own justification for construct validity; no domain expert surveys, observational studies of real data scientist workflows, or external validation criteria are used." + }, + { + "flag": "LLM-as-judge using benchmarked models", + "detail": "Alignment validation used GPT-4o-mini and GPT-4o as LLM judges to assess whether problems are solvable, yet these are among the primary models being evaluated on the benchmark, creating a potential circularity in quality assessment." + }, + { + "flag": "Visualization scoring threshold unjustified", + "detail": "The SSIM > 0.5 threshold for plot-drawing task evaluation is stated without justification, ablation, or discussion of sensitivity—an arbitrary threshold that may arbitrarily penalize visually acceptable solutions or accept poor-quality plots." + }, + { + "flag": "Scaling behavior used as post-hoc validity proxy", + "detail": "Adherence to scaling laws is treated as evidence of benchmark quality ('validating its ability to distinguish model capabilities'), but scaling behavior could equally reflect task selection or perturbation methodology biased toward larger model strengths rather than being an independent quality signal." + } + ], + "cited_papers": [ + { + "title": "DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation", + "relevance": "Primary benchmark being compared against and improved upon; DSCodeBench is explicitly positioned as addressing DS-1000's three key limitations in code length, test coverage, and problem structure." + }, + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Introduces pass@k metric adopted throughout this paper and establishes the baseline paradigm for functional-correctness-based code generation benchmarking." + }, + { + "title": "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions", + "relevance": "Compared directly in Table 1 as the leading general code generation benchmark; provides the competitive upper bound for complex benchmark design." + }, + { + "title": "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", + "relevance": "Compared in Table 1; represents an approach to contamination-resistant benchmarking that informs DSCodeBench's decontamination design." + }, + { + "title": "DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models", + "relevance": "Directly compared as a data science–specific benchmark focusing on task diversity; noted for insufficient test coverage that DSCodeBench addresses." + }, + { + "title": "DataSciBench: An LLM Agent Benchmark for Data Science", + "relevance": "Contemporary data science benchmark compared in Table 1; contrasted for smaller scale (222 problems, 2.3 avg test cases) and limited library coverage." + }, + { + "title": "DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence", + "relevance": "One of the primary open-source model families evaluated; key for assessing whether open-source coding models can compete with closed-source on realistic tasks." + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Cited as an example of realistic software engineering benchmarks sourced from GitHub; contextualizes DSCodeBench's GitHub-based construction philosophy." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Practitioners evaluating LLMs for data science can directly use the publicly released benchmark to compare models across 10 specific Python libraries with a rigorous 200-test evaluation framework." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The finding that GPT-4o achieves only 39% on realistic data science tasks is modestly surprising given the hype around LLM coding capability, but the direction is consistent with prior findings." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns are raised; the paper focuses entirely on capability evaluation for data science productivity use cases." + }, + "drama_conflict": { + "score": 1, + "justification": "The paper directly critiques DS-1000 as unreliable and misleading (irregular scaling, format ambiguity, insufficient tests), which is a mild controversy in the benchmarking community." + }, + "demo_ability": { + "score": 3, + "justification": "The benchmark is publicly released on GitHub with code, evaluation framework, per-library problems, and full experiment results enabling immediate use and replication." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are from King's College London, NUS, CAS, and Peking University—reputable institutions but not the top-tier AI labs that drive outsized HN and community attention." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "36184838", + "title": "Reverse Engineering Self-Supervised Learning", + "points": 86, + "comments": 16, + "url": "https://news.ycombinator.com/item?id=36184838", + "created_at": "2023-06-04T11:43:46Z" + }, + { + "hn_id": "43870679", + "title": "Show HN: I built an AI tool to practice technical interviews with", + "points": 12, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=43870679", + "created_at": "2025-05-02T14:57:13Z" + }, + { + "hn_id": "45300655", + "title": "Generalizable Geometric Image Caption Synthesis", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45300655", + "created_at": "2025-09-19T12:05:01Z" + }, + { + "hn_id": "43405094", + "title": "Politicians' misinformation behavior and public engagement, in 4 countries", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43405094", + "created_at": "2025-03-18T21:03:45Z" + }, + { + "hn_id": "44324675", + "title": "ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44324675", + "created_at": "2025-06-20T04:10:28Z" + }, + { + "hn_id": "43776339", + "title": "The Bitter Lesson Learned from 2k Multilingual Benchmarks", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43776339", + "created_at": "2025-04-23T20:31:54Z" + }, + { + "hn_id": "40488690", + "title": "Neuromorphic dreaming: A pathway to efficient learning in artificial agents", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=40488690", + "created_at": "2024-05-27T08:03:31Z" + } + ], + "top_points": 86, + "total_points": 110, + "total_comments": 17 + } +} +\ No newline at end of file diff --git a/papers/dspy-compiling-declarative-2023/scan-v5.json b/papers/dspy-compiling-declarative-2023/scan-v5.json @@ -0,0 +1,602 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines", + "authors": [ + "Omar Khattab", + "Arnav Singhvi", + "Paridhi Maheshwari", + "Zhiyuan Zhang", + "Keshav Santhanam", + "Sri Vardhamanan", + "Saiful Haq", + "Ashutosh Sharma", + "Thomas T. Joshi", + "Hanna Moazam", + "Heather Miller", + "Matei Zaharia", + "Christopher Potts" + ], + "year": 2023, + "venue": "arXiv.org", + "arxiv_id": "2310.03714", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": false, + "justification": "The abstract claims DSPy outperforms few-shot prompting 'generally by over 25% and 65%' for GPT-3.5 and Llama2 respectively, but the '65%' figure for Llama2 cannot be verified — the best Llama2 improvement over fewshot in Table 1 is ~33pp (vanilla bootstrap×2 dev 37.3% vs fewshot 4.3%), not 65pp.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about compilation improving performance are supported through controlled comparisons across multiple compilation strategies (none, fewshot, bootstrap, bootstrap×2, ensemble) on fixed benchmarks with the same programs.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion asserts DSPy has been applied to 'a large number of programs spanning tasks from information extraction to low-resource synthetic data generation,' but these results are not reported; formal evaluation covers only GSM8K and HotPotQA.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not address whether gains stem from the DSPy programming model itself or simply from having better bootstrapped few-shot demonstrations that any prompting framework could use; no comparison to standard few-shot with equivalently bootstrapped demonstrations is provided.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Claims are framed in terms of exact match accuracy on the specific benchmarks tested; the paper does not conflate benchmark accuracy with broader productivity or real-world impact claims.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the conclusion mentions leaving broader reporting to 'future work' but does not systematically acknowledge limitations of the present evaluation.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats to validity are discussed — evaluation scope (2 tasks, 2 LMs), potential benchmark contamination, variance across runs, and GPT-3.5 version instability are not addressed.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper makes broad claims about DSPy as a general programming model without explicitly stating what the two-task, two-LM evaluation does not demonstrate.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding from IBM/Stanford HAI, Oracle, Virtusa, Cigna Healthcare, HAI Azure compute grant, NSF CAREER grant CNS-1651570, and Apple Scholars fellowship is explicitly disclosed in the acknowledgments.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are disclosed on the title page, including Stanford, UC Berkeley, CMU, Amazon Alexa AI, IIT Bombay, Calera Capital, Microsoft, and Two Sigma.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Primary funders (NSF, Stanford HAI, Azure compute) are independent of DSPy's commercial outcome; while Amazon Alexa AI is one author's affiliation, DSPy is not an Amazon product and the evaluation does not favor Amazon systems.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is included; authors at industry affiliations (Amazon Alexa AI, Microsoft, Calera Capital) have undisclosed potential interests.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Core DSPy concepts (signature, module, teleprompter, compilation, text transformation graph) are defined precisely in Sections 3 and 4 with formal definitions and code examples.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1 explicitly states three contributions: signatures (abstracting prompts), modules (abstracting prompting techniques), and teleprompters (optimizing arbitrary pipelines), framed as the first programming model of this kind.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages substantively with deep learning frameworks (PyTorch, Theano), in-context learning literature, LangChain/LlamaIndex, and prompt optimization work; Appendix B provides a detailed quantitative comparison with LangChain.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "DSPy is released at https://github.com/stanfordnlp/dspy; the paper notes the open-source version has been maintained for 'close to a year' at time of writing.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Both evaluation benchmarks (GSM8K and HotPotQA) are standard public datasets; the Wikipedia 2017 abstracts dump and ColBERTv2 retriever used for HotPotQA are also publicly available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or explicit environment specification is provided in the paper; only high-level component names (ColBERTv2, specific LMs) are mentioned.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The appendix provides pseudocode for teleprompters and sample generated prompts, but no step-by-step instructions for reproducing the reported GSM8K or HotPotQA experiments are given.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 1 and 2 are reported as point estimates with no confidence intervals or error bars, even for LabeledFewShot settings where 3-5 runs are averaged.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any comparative claims; all comparisons between compilation strategies are informal.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Absolute accuracy figures and baseline comparisons are reported throughout (e.g., GSM8K GPT-3.5 vanilla 25.2% → reflection+ensemble 81.6% test), allowing readers to assess effect magnitudes.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Training (200) and development (300) set sizes are stated but not statistically justified; no power analysis or discussion of sufficiency is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "For LabeledFewShot only, 'average of 3-5 runs' is mentioned, but no standard deviations or variance estimates are reported for any results in the tables.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Tables 1 and 2 include vanilla (no compilation), fewshot, and none (zero-shot) baselines, as well as human-crafted demonstrations for selected settings.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Within-paper baselines use GPT-3.5 and Llama2-13b-chat (both current at publication); informal comparisons reference recent results from PaLM-540B, codex, and text-davinci-002.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "The compilation strategy grid (none, fewshot, bootstrap, bootstrap×2, ensemble) effectively ablates the contribution of each optimization step for both tasks across both LMs.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "HotPotQA results include both answer exact match (Ans) and pair-retrieval accuracy (Psg); GSM8K uses final numerical answer accuracy appropriate for the task.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable; the paper evaluates LM pipeline performance on automated NLP benchmarks with well-defined ground-truth answers.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "GSM8K reports results on the 1.3K official test set; HotPotQA uses the official validation set as test (the test set is hidden) with internal train/dev splits from the training data.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "No per-category, per-difficulty, or per-question-type breakdowns are provided; all results are aggregate accuracy across all examples.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No failure cases are shown or analyzed; the paper does not examine when or why DSPy compilation fails to improve performance.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "ReAct with bootstrap (31.0% dev) underperforms fewshot+human demonstrations (33.0%) on HotPotQA for GPT-3.5; ensemble sometimes underperforms bootstrap (e.g., Table 1 vanilla ensemble 62.7% vs bootstrap×2 64.7% for GPT-3.5 dev).", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "GPT-3.5 is referenced without a specific API snapshot version or date; Llama2-13b-chat and T5-Large are specified by name and size but no checkpoint versions are given.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "Appendix F shows the actual prompts automatically generated by DSPy for GSM8K and HotPotQA experiments; initial signatures like 'question -> answer' are shown throughout the paper.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Temperature for LM calls is described qualitatively ('high temperature' during bootstrapping) but not specified numerically; k=8 for fewshot is mentioned but other key hyperparameters (number of trials, max demonstrations) are only partially described.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The compilation pipeline (candidate generation, parameter optimization, higher-order optimization) is described in detail in Section 4, with pseudocode for BootstrapFewShot, BootstrapFewShotWithRandomSearch, and BootstrapFewShotWithOptuna in appendices.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Data sampling procedures are described (200 training / 300 dev for both tasks; 70/30 train/val for HotPotQA; 'hard' examples only for HotPotQA training); ColBERTv2 search over Wikipedia 2017 abstracts dump is specified.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Both benchmarks (GSM8K and HotPotQA) are publicly available with official train/test splits; the Wikipedia 2017 dump used for retrieval is also public.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data sources and sampling procedures are described: official GSM8K and HotPotQA datasets with specified example counts and split criteria for each experimental condition.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants — standard NLP benchmarks with existing annotations are used.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from raw data to evaluation is described: benchmark sampling, ColBERTv2 retrieval indexing, compilation on training split, and evaluation on dev/test splits.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Neither GPT-3.5's nor Llama2's training data cutoff is stated; the paper does not specify model snapshot dates used in experiments.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper notes GPT-4 was 'pre-trained on a subset of GSM8K's training set' but does not address whether GPT-3.5 or Llama2 similarly saw GSM8K or HotPotQA data during pretraining.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "GSM8K (2021) and HotPotQA (2018) predate both models' training cutoffs; contamination of test examples into model training is not addressed despite the paper acknowledging this issue for GPT-4.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants in this study.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Compilation time is described qualitatively ('minutes to tens of minutes') and T5-Large is said to have 'orders of magnitude lower costs for inference,' but no specific figures (API calls, dollar costs, latency) are reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "The paper mentions compilation requires 'running the program a few thousand times,' but no specific GPU hours, API call counts, or dollar costs are stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DSPy bootstrap compilation raises GPT-3.5 accuracy on GSM8K from ~25% (vanilla) to over 80%", + "evidence": "Table 1: vanilla test 25.2%, CoT bootstrap test 72.9%, CoT+ensemble test 81.6%, reflection bootstrap dev 76.0% test", + "supported": "strong" + }, + { + "claim": "Bootstrap compilation outperforms expert-written human reasoning chains for GPT-3.5 on GSM8K", + "evidence": "Table 1: CoT bootstrap dev (80.3%) outperforms CoT with +human CoT dev (78.6%); bootstrap test 72.9% vs human CoT test 72.4%", + "supported": "moderate" + }, + { + "claim": "Llama2-13b-chat compiled with DSPy is competitive with expert-prompt GPT-3.5 pipelines on both tasks", + "evidence": "Table 1: Llama2 reflection+ensemble dev 46.9% vs GPT-3.5 CoT+human dev 78.6% — not close; Table 2: Llama2 multihop bootstrap dev 42.0% vs GPT-3.5 multihop bootstrap dev 48.7% — closer but not equal", + "supported": "weak" + }, + { + "claim": "T5-Large (770M parameters) compiled via DSPy achieves 39.3% EM on HotPotQA using only 200 labeled inputs", + "evidence": "Section 7: 'This program scores 39.3% answer EM and 46.0% passage accuracy on the dev set, using only 200 labeled inputs and 800 unlabeled questions'", + "supported": "strong" + }, + { + "claim": "DSPy eliminates the need for hand-crafted prompt strings without sacrificing performance relative to expert-engineered systems", + "evidence": "Programs using only generic signatures and modules (no task-specific prompts) outperform zero-shot and few-shot baselines on GSM8K and HotPotQA; Appendix B shows DSPy has zero hand-written prompt demonstrations vs 50 long strings in LangChain", + "supported": "moderate" + }, + { + "claim": "DSPy outperforms standard few-shot prompting generally by over 25% for GPT-3.5 and 65% for Llama2", + "evidence": "Tables 1-2 show improvements exceeding 25pp for GPT-3.5 in several settings (e.g., vanilla bootstrap×2 dev +40.7pp), but no Llama2 improvement over fewshot reaches 65pp — best is ~33pp", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "DSPy introduces a programming model replacing hand-crafted LM prompt templates with parameterized declarative modules compiled by 'teleprompters' that automatically bootstrap few-shot demonstrations through a rejection-sampling-like process. On GSM8K and HotPotQA, compiled DSPy programs consistently outperform zero-shot and few-shot baselines, with GPT-3.5 reaching 88.3% dev accuracy on GSM8K (from 25.2% vanilla test) and 54.7% dev on HotPotQA. A T5-Large (770M parameters) fine-tuned via DSPy achieves 39.3% EM on HotPotQA using only 200 labeled examples, demonstrating that systematic compilation can enable small models to approach larger model performance. However, the evaluation scope is limited to two tasks and two LMs, no statistical tests are performed, and the headline '65% improvement' claim for Llama2 is not clearly supported by the reported tables.", + "red_flags": [ + { + "flag": "Unsupported 65% claim", + "detail": "The abstract claims Llama2-13b-chat improvements of 'generally over 65%' vs. standard few-shot prompting, but no specific result in Tables 1-2 reaches this threshold — the best identifiable improvement is ~33pp." + }, + { + "flag": "No variance or error bars", + "detail": "All main results are point estimates; standard deviations are absent despite some settings averaging 3-5 runs, making reliability of small differences unclear." + }, + { + "flag": "No statistical significance tests", + "detail": "All comparative claims are made without significance testing; improvements of a few percentage points over baselines may not be statistically meaningful." + }, + { + "flag": "GPT-3.5 version unspecified", + "detail": "GPT-3.5 is used without a specific model snapshot or API version date, making results potentially irreproducible as the underlying model changes over time." + }, + { + "flag": "No limitations section", + "detail": "The paper has no dedicated limitations or threats-to-validity section despite evaluating on only 2 tasks with 2 LMs and making broad claims about a general programming model." + }, + { + "flag": "Contamination unaddressed", + "detail": "GSM8K (2021) and HotPotQA (2018) predate LLM training cutoffs; the paper explicitly acknowledges GPT-4's GSM8K contamination but ignores the same issue for GPT-3.5 and Llama2." + }, + { + "flag": "Overgeneralized conclusions", + "detail": "The conclusion claims DSPy has been validated on tasks 'from information extraction to synthetic data generation' but explicitly defers reporting these under 'controlled experimental conditions to future work.'" + } + ], + "cited_papers": [ + { + "title": "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", + "relevance": "Core prompting technique abstracted into DSPy's ChainOfThought module; motivates the need for parameterized, compilable versions of prompting techniques" + }, + { + "title": "ReAct: Synergizing Reasoning and Acting in Language Models", + "relevance": "Agent prompting technique implemented as a built-in DSPy module and evaluated on HotPotQA as a baseline" + }, + { + "title": "Training Verifiers to Solve Math Word Problems (GSM8K)", + "relevance": "Primary evaluation benchmark for the math reasoning case study" + }, + { + "title": "HotPotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering", + "relevance": "Primary evaluation benchmark for the multi-hop QA case study" + }, + { + "title": "PyTorch: An Imperative Style, High-Performance Deep Learning Library", + "relevance": "Direct inspiration for DSPy's define-by-run computation graph abstraction and parameterized module design" + }, + { + "title": "Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP", + "relevance": "Direct predecessor framework to DSPy from the same research group; DSPy is introduced as its second iteration" + }, + { + "title": "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction", + "relevance": "Retrieval model used as the backbone for all HotPotQA experiments" + }, + { + "title": "Self-Consistency Improves Chain of Thought Reasoning in Language Models", + "relevance": "Related prompting technique compared against in HotPotQA evaluation; DSPy's ensemble approach generalizes this idea" + }, + { + "title": "Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers", + "relevance": "Related single-step prompt optimization work that DSPy generalizes to multi-stage arbitrary pipelines" + }, + { + "title": "Decomposed Prompting: A Modular Approach for Solving Complex Tasks", + "relevance": "Related modular prompting approach; DSPy claims to generalize its ideas through parameterized compilable modules" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "DSPy is a publicly released, pip-installable framework that eliminates the most painful part of LM system development (manual prompt engineering), with immediate use cases and a 1-year open-source track record." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Challenges the prevailing convention that effective LM pipelines require careful manual prompt engineering, demonstrating that automated compilation can match or exceed expert-crafted prompts." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety, risk, or harm concerns are raised; the paper is entirely focused on engineering methodology for LM pipelines." + }, + "drama_conflict": { + "score": 1, + "justification": "Appendix B implicitly critiques LangChain and LlamaIndex by quantifying their reliance on hard-coded prompts (50 strings over 1000 chars) vs. DSPy's zero, creating mild competitive framing." + }, + "demo_ability": { + "score": 3, + "justification": "DSPy is available on GitHub, installable, and the paper includes 1-3 line runnable code snippets that produce working RAG and multi-hop systems anyone can reproduce immediately." + }, + "brand_recognition": { + "score": 3, + "justification": "Stanford NLP (Khattab, Potts) and Berkeley (Zaharia, Singhvi) are highly recognized institutions; DSPy has since become one of the most widely cited LM programming frameworks." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "42168997", + "title": "It's time to replace TCP in the datacenter (2023)", + "points": 189, + "comments": 156, + "url": "https://news.ycombinator.com/item?id=42168997", + "created_at": "2024-11-18T01:42:41Z" + }, + { + "hn_id": "34337707", + "title": "“A Handbook of Integer Sequences” Fifty Years Later", + "points": 139, + "comments": 45, + "url": "https://news.ycombinator.com/item?id=34337707", + "created_at": "2023-01-11T12:37:58Z" + }, + { + "hn_id": "33088928", + "title": "It's time to replace TCP in the Datacenter", + "points": 6, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=33088928", + "created_at": "2022-10-04T23:56:57Z" + }, + { + "hn_id": "37805651", + "title": "Agent Instructs Large Language Models to Be General Zero-Shot Reasoners", + "points": 5, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=37805651", + "created_at": "2023-10-07T21:17:40Z" + }, + { + "hn_id": "38561645", + "title": "Relightable Gaussian Codec Avatars", + "points": 4, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=38561645", + "created_at": "2023-12-07T20:50:41Z" + }, + { + "hn_id": "33151628", + "title": "Integration of Skyline Queries into Spark SQL", + "points": 3, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=33151628", + "created_at": "2022-10-10T14:10:30Z" + }, + { + "hn_id": "24766804", + "title": "Abductive Knowledge Induction from Raw Data", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=24766804", + "created_at": "2020-10-13T15:59:02Z" + }, + { + "hn_id": "41820840", + "title": "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=41820840", + "created_at": "2024-10-12T17:30:26Z" + }, + { + "hn_id": "37776712", + "title": "Large Language Models as Analogical Reasoners", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=37776712", + "created_at": "2023-10-05T10:04:39Z" + }, + { + "hn_id": "34364348", + "title": "Exoshuffle-CloudSort", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=34364348", + "created_at": "2023-01-13T05:40:48Z" + } + ], + "top_points": 189, + "total_points": 355, + "total_comments": 205 + } +} +\ No newline at end of file diff --git a/papers/dual-latent-memory-2026/scan-v5.json b/papers/dual-latent-memory-2026/scan-v5.json @@ -0,0 +1,506 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Dual Latent Memory for Visual Multi-agent System", + "authors": [ + "Xinlei Yu", + "Chengming Xu", + "Zhangquan Chen", + "Bo Yin", + "Cheng Yang", + "Yongbo He", + "Yihao Hu", + "Jiangning Zhang", + "Cheng Tan", + "Xiaobin Hu", + "Shuicheng Yan" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.00471", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims about the scaling wall, 2.7-5.4% accuracy improvement, and 21.3-44.8% token reduction are all backed by Tables 1-3 and Figures 2, 6, and 11.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation study (Table 6) removes each component individually and the three-stage training recipe isolates each module, providing component-level causal evidence for the dual memory design.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Claims are anchored to VMAS on VQA-type tasks; the paper tests five backbones, four model sizes, six topologies, and includes an explicit cross-benchmark generalization study (Table 10).", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss that the learnable compression modules and 230K-step PPO training add substantial learned capacity that could independently explain accuracy gains, separate from the dual-memory framing.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Claims are framed in terms of benchmark accuracy and token counts, which match the actual measurements; no conflation of proxy metrics with broader capability claims.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations section; the paper ends with a brief optimistic conclusion, and Appendix D.2 acknowledges three cells of slight drops without broader methodological caveats.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No threats to validity are discussed; potential confounds such as added trainable parameters, benchmark saturation, or PPO training instability are not addressed.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The conclusion claims L2-VMAS offers 'a scalable path forward for more reliable and complicated VMAS' without stating what the results do not generalize to.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment or grant information appears anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are listed in the header: NUS, FDU, THU, DeepWisdom, ZJU, HNU, and Shanghai AI Lab.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding disclosed, so funder independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or financial disclosure is present in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms including VMAS, scaling wall, dual latent memory, perception memory, thinking memory, and entropy-driven triggering are all explicitly defined or explained in context.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are stated: failure analysis of VMAS, dual latent memory synthesis, and proactive memory orchestration.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Appendix A situates L2-VMAS against VMAS paradigms and latent visual reasoning work, explicitly noting why prior latent-communication approaches (Zheng et al. 2025, Fu et al. 2025, Zou et al. 2025) are insufficient for the visual setting.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "GitHub URL provided in the abstract: https://github.com/YU-deep/L2-VMAS.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "All benchmarks used (MMBench, MMStar, RealWorldQA, SimpleVQA, MuirBench, BLINK, MVBench, LVBench, GQA) are standard publicly available datasets.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "The paper mentions 8 NVIDIA H200 141G GPUs and provides hyperparameters in Table 4, but no requirements.txt, Dockerfile, or software dependency specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Implementation details are scattered across appendix subsections C.1-C.5 and D.1, but there are no step-by-step reproduction instructions that could be followed without significant guesswork.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results in Tables 1-10 are single point estimates; no confidence intervals, error bars, or standard deviations are reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are reported for any comparative accuracy claim across all tables.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes are reported as percentage improvements over baselines (e.g., +2.7-5.4% accuracy, -21.3-44.8% tokens) with baseline absolute values provided for context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The number of test examples per benchmark is not discussed, and no power analysis or justification of evaluation scale is provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No standard deviation or variance across runs is reported; all results appear to be single-run point estimates.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Two baselines are included: single-agent (no collaboration) and VMAS (text-based multi-agent collaboration with the same VLM backbones).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines use the same contemporary VLMs (Qwen3-VL-8B-Thinking, InternVL-3.5-8B, GLM-4.1V-9B-Thinking, LLaVA-OV-1.5-8B) as the proposed method.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 6 shows ablation results removing triggering, attribution, perception memory, and thinking memory individually on four benchmarks.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Both task accuracy and total token consumption are reported throughout, providing a dual metric view of effectiveness and efficiency.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not applicable; all benchmarks use ground-truth labels for automated accuracy computation.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Training uses GQA; the paper explicitly states 'no exposure to the test benchmarks,' and the generalization study (Table 10) further validates on cross-benchmark held-out sets.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Figure 5a breaks down results by task category (perception, thinking, mixed) on MMStar, and Figure 5b shows per-category memory triggering rates.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Appendix D.2 explicitly discusses three cells of slight accuracy drops (e.g., LLaVA-OV-1.5-8B on MMBench: -0.3%) and offers an explanation tied to benchmark saturation.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Tables 1-2 show and report three cells with slight accuracy drops (-0.2% to -0.4%), which are acknowledged and discussed in Appendix D.2.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Specific model versions are named throughout: GLM-4.1V-9B-Thinking, InternVL-3.5-8B, LLaVA-OV-1.5-8B, Qwen3-VL-8B-Instruct/Thinking, with arXiv references provided.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Only the GPT-5.1 extraction prompt used for transmission content categorization (Appendix B.1) is provided verbatim; the system prompts and task instructions for main agent evaluations are not disclosed.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Table 4 provides comprehensive hyperparameters: PPO clip range, max grad norm, target KL, gamma, GAE lambda, learning rates per stage, window length W=16, threshold λ=0.5, memory length L=8, granularity g=3, max capacity N=50, top-k=5.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "The multi-agent scaffolding—memory synthesis and orchestration pipeline, four-stage orchestration workflow, three-stage PPO training, and six topology structures—is described in detail in Sections 3 and Appendices C-D.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "No preprocessing steps for GQA training data or the evaluation benchmarks are described; data is referenced but how it was filtered, formatted, or prepared is not documented.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "All benchmarks (MMBench, MMStar, RealWorldQA, SimpleVQA, GQA, etc.) are standard publicly available datasets with independent access.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": false, + "justification": "The paper only names GQA as training data with 'no exposure to the test benchmarks'; no description of how data was prepared, filtered, or formatted is provided.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participant recruitment; the paper uses standard benchmarks only.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The full pipeline from raw benchmark data through preprocessing to evaluation metrics is not documented; only model architecture and training stages are described.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "The pre-training data cutoffs for none of the five base VLMs (Qwen3-VL, InternVL-3.5, GLM-4.1V, LLaVA-OV) are stated in the paper.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper notes GQA fine-tuning had 'no exposure to test benchmarks' but does not discuss whether the base VLMs were pre-trained on any evaluation benchmarks (MMBench, MMStar, etc.).", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "Potential contamination of base VLMs with long-public benchmarks (MMBench has been available since 2023) during pre-training is not discussed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Token usage is reported extensively as inference cost proxy throughout Tables 1-3, 7-9, showing 21.3-44.8% token reduction over VMAS.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Hardware (8 NVIDIA H200 141G GPUs) is mentioned and training steps per stage are given (100k/80k/50k), but total GPU-hours or compute cost is not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Increasing agent turns in VMAS leads to performance falling below single-agent baseline by turn 6 (the 'scaling wall').", + "evidence": "Figure 2 shows accuracy peaks at turn 3 (86.6%) then drops below the single-agent baseline (84.8%) starting at turn 6, reaching 82.2% at turn 10 on MMBench with Qwen3-VL-8B-Thinking. Replicated across four benchmarks in Figure 7.", + "supported": "strong" + }, + { + "claim": "Full-content text transmission is inferior to structured or conclusion-only transmission in later agent turns.", + "evidence": "Figure 3 shows full-content transmission achieves +0.7% at turn 2 but -3.8% at turn 10, while conclusion-only consistently delivers +0.5-1.1% gains across all turns.", + "supported": "moderate" + }, + { + "claim": "L2-VMAS improves average accuracy by 2.7-5.4% across five VLM backbones over the text-based VMAS baseline.", + "evidence": "Table 1 shows improvements from +2.7% (InternVL-3.5-8B) to +5.4% (Qwen3-VL-8B-Thinking) on the four-benchmark average.", + "supported": "strong" + }, + { + "claim": "L2-VMAS reduces total token consumption by 21.3-44.8% compared to VMAS across five backbones.", + "evidence": "Table 1 shows token reductions from -21.3% (LLaVA-OV-1.5-8B) to -44.8% (GLM-4.1V-9B-Thinking).", + "supported": "strong" + }, + { + "claim": "Performance gains are consistent across 2B/4B/8B/32B model sizes and six multi-agent topologies.", + "evidence": "Tables 2 and 3 show positive average improvements in all model size and topology combinations; no systematic failure modes except the three small negative cells noted in Appendix D.2.", + "supported": "strong" + }, + { + "claim": "Perception memory benefits perception tasks more than thinking tasks, and vice versa for thinking memory.", + "evidence": "Figure 5a: perception memory boosts perception tasks by 5.5% vs 1.4% for thinking tasks; thinking memory boosts thinking tasks by 8.2% vs 1.8% for perception tasks on MMStar.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "L2-VMAS demonstrates that the 'scaling wall' in Visual Multi-Agent Systems—where increasing agent turns degrades performance below single-agent baselines—can be overcome by replacing text-based inter-agent communication with dual latent memories decoupling perception and thinking. Across five VLM backbones on standard VQA benchmarks, the framework achieves 2.7-5.4% accuracy improvements and 21.3-44.8% token reductions over text-based VMAS. The gains are consistent across four model scales (2B-32B), six multi-agent topologies, and generalize to unseen benchmarks, though no error bars or significance tests are provided and the added learnable components represent an uncontrolled confound.", + "red_flags": [ + { + "flag": "No error bars or significance tests", + "detail": "All results across ten tables are single-run point estimates; no confidence intervals, variance across runs, or statistical significance tests are reported for any comparative claim." + }, + { + "flag": "Added parameters confound", + "detail": "L2-VMAS adds learnable compression modules (C, Cmerge, Crefine), a gating router, and trains them with PPO over 230K steps; improvements may be partly attributable to these additional learned parameters rather than the dual-memory architecture per se, and this is never discussed." + }, + { + "flag": "No limitations section", + "detail": "The paper has no dedicated limitations or threats-to-validity section; the conclusion is entirely optimistic about generalization." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "The pre-training contamination of the five base VLMs on evaluation benchmarks (MMBench has been public since 2023, MMStar since 2024) is not discussed, making it impossible to assess how much headroom exists." + }, + { + "flag": "No funding disclosure", + "detail": "No funding source is disclosed anywhere in the paper." + }, + { + "flag": "Latent-communication baselines absent", + "detail": "The paper compares only against single-agent and text-based VMAS, not against closely related latent-space communication methods (Zheng et al. NeurIPS 2025, Fu et al. 2025, Zou et al. 2025) described in its own related work as directly relevant." + } + ], + "cited_papers": [ + { + "title": "Why do multi-agent LLM systems fail?", + "relevance": "Analyzes failure modes of multi-agent LLM systems (Cemri et al. 2025), directly motivating the problem studied in this paper." + }, + { + "title": "Scaling large language model-based multi-agent collaboration", + "relevance": "Demonstrates positive scaling in text-based multi-agent LLM systems (Qian et al., ICLR 2025), contrasting with the visual scaling wall found here." + }, + { + "title": "Thought communication in multiagent collaboration", + "relevance": "Latent-space thought communication in multi-agent LLMs (Zheng et al., NeurIPS 2025); closely related but not compared against directly." + }, + { + "title": "Cache-to-cache: Direct semantic communication between large language models", + "relevance": "Direct latent communication between LLMs (Fu et al. 2025); cited as related but insufficient for visual settings." + }, + { + "title": "Latent collaboration in multi-agent systems", + "relevance": "Another latent communication approach (Zou et al. 2025) cited as related but not directly evaluated against." + }, + { + "title": "Large language models miss the multi-agent mark", + "relevance": "Critical perspective on multi-agent LLM systems (Malfa et al., NeurIPS 2025); supports the paper's framing that current paradigms are insufficient." + }, + { + "title": "Towards a science of scaling agent systems", + "relevance": "Examines scaling behavior of agent systems (Kim et al. 2025), directly relevant to the scaling wall phenomenon studied here." + }, + { + "title": "MMBench: Is your multi-modal model an all-around player?", + "relevance": "Primary evaluation benchmark used throughout all main experiments." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Token reductions of 21-44% are directly actionable for deploying multi-agent VLMs in production at lower cost." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The core finding that more agent turns actively hurts performance challenges the prevailing assumption that more collaboration is always better." + }, + "fear_safety": { + "score": 0, + "justification": "The paper focuses purely on efficiency and accuracy; no safety or risk implications are discussed." + }, + "drama_conflict": { + "score": 1, + "justification": "Positions itself against the dominant text-centric multi-agent paradigm, creating mild tension with a large body of existing work." + }, + "demo_ability": { + "score": 2, + "justification": "Code released on GitHub and builds on publicly accessible VLMs (Qwen3-VL), enabling practitioners to reproduce and try the method." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors from NUS, FDU, THU, and Shanghai AI Lab are moderately recognized; no tier-1 industrial lab affiliation." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dynacode-dynamic-complexityaware-2025/scan-v5.json b/papers/dynacode-dynamic-complexityaware-2025/scan-v5.json @@ -0,0 +1,400 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation", + "authors": [ + "Wenhao Hu", + "Jinhao Duan", + "Chunchen Wei", + "Li Zhang", + "Yue Zhang", + "Kaidi Xu" + ], + "year": 2025, + "venue": "Annual Meeting of the Association for Computational Linguistics", + "arxiv_id": "2503.10452", + "doi": "10.48550/arXiv.2503.10452" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract's core claims — 189M unique problems, 16.8–45.7% average performance drop vs MBPP+, and contamination resistance — are all verified by Table 1 (combinatorial problem counts), Table 2 (performance results per model), and the fine-tuning experiment in Figure 5.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper claims DynaCode 'limits memorization' and that performance drops are caused by complexity, but the fine-tuning experiment cannot separate 'harder to memorize' from 'harder in general' — both predict smaller gains on DynaCode after fine-tuning. No control isolates these confounds.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Claims like 'LLMs are good at sequential execution' and 'LLMs struggle with problem understanding as complexity increases' are stated as general conclusions about LLMs, but the error analysis covers only GPT-3.5-Turbo and the call-graph analysis covers only 4 of 12 models.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Performance drops with complexity could be attributable to longer prompt length, larger number of required functions, or type-alignment constraints in call-graph construction — none of these alternative explanations are discussed.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Pass@1 measures whether generated code passes execution-based test cases; the paper consistently uses this to claim evaluation of 'code generation capability,' which is a reasonable direct measurement rather than a proxy.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "A dedicated Limitations section appears at the end of the paper, discussing maximum node count of 5 in call graphs and plans for future extension to more complex structures.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The limitations section only notes 'maximum node count of 5' without discussing validity threats such as: cyclomatic complexity as a difficulty proxy, quality of base problems from MBPP+/LeetCode, Python-only scope, or type-alignment sampling bias. Too thin to count as specific threats-to-validity.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly state scope boundaries — e.g., that results apply only to Python code generation, only to function-level tasks, or that findings should not be generalized to repository-level or multi-file generation tasks.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No acknowledgment or funding section appears anywhere in the paper. Absence of any funding disclosure.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Authors' affiliations are clearly stated: University of Electronic Science and Technology of China (Hu, Wei) and Drexel University (Duan, L. Zhang, Y. Zhang, Xu).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so this criterion does not apply.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, no patent or equity declarations appear anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are formally defined: cyclomatic complexity with equation (1), call graph as a directed acyclic graph with specified properties, 'unit' and 'level' with explicit threshold formulas (equations 2–4), and 'complexity matrix' (equation 5).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three explicit contributions are enumerated in a bullet list at the end of the introduction: a dynamic evaluation strategy, complexity-aware metrics combining code and graph complexity, and the DynaCode benchmark with multi-LLM evaluation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 covers related work on dynamic evaluation (DyVal, NPHardEval, DARG, Benchmark Self-Evolving) and coding benchmarks (HumanEval, MBPP, BigCodeBench, SWE-Bench), explicitly positioning DynaCode against each and explaining its differences.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "The paper argues that cyclomatic complexity captures control-flow branching that LLMs demonstrably struggle with (citing Jiang et al., 2025; Beger and Dutta, 2025) and that call-graph structures test inter-function dependency handling — both are linked to identified LLM failure modes.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": true, + "justification": "Table 1 shows problem distribution across 4 complexity units and 4 graph levels. Figure 10 validates that Pass@1 scores degrade systematically with both unit and level complexity across all 12 models, and Figure 8 shows examples of increasing problem complexity.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "The paper does not explicitly check for ceiling or floor effects. codegemma-7b-it scores 2.9% on DynaCode (near floor), and several weak models drop below 10% on harder units — potential floor effects for weak models are not discussed.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human performance baseline is provided anywhere in the paper. There is no discussion of how a human programmer would perform on DynaCode problems at any complexity level.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "Pass@1 is justified by reference to prior work (Chen et al., 2021) and its use enables consistent comparison with MBPP and MBPP+. Pass@3 results are also reported in Appendix D for robustness validation.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "Contamination resistance is a central design goal: 189M unique combinations make memorization impractical, new LeetCode problems are actively sourced for continuous refresh, and the fine-tuning ablation (Figure 5) empirically tests and supports contamination resistance.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": true, + "justification": "The Limitations section acknowledges that advanced LLMs may eventually handle current call-graph structures and mentions plans to extend to more complex structures. The dynamic generation mechanism (continually adding new base problems from the web) provides inherent temporal adaptability.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss failure modes of the benchmark itself — e.g., type-alignment constraints that may bias which problems can be combined, Monkeytype annotation quality issues, or cases where dynamically combined prompts become semantically incoherent.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "Benchmark and evaluation code are publicly available at https://github.com/HWH-2000/DynaCode, as stated in the abstract and confirmed by the detailed appendix examples.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": true, + "justification": "Section 3 and appendices document sources (MBPP+, LeetCode), preprocessing pipeline (Monkeytype type annotation, bad-generation filtering), benchmark statistics (Tables 1, 5, 6), example prompts (Table 8), and all 16 call-graph structures (Figure 9).", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "No licensing information is provided in the paper. The use of LeetCode problems raises unaddressed copyright concerns. Only a GitHub link is provided with no stated license terms for the benchmark or its components.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "The paper states DynaCode is for 'evaluating LLMs on code generation tasks' but does not specify what should NOT be concluded — e.g., that high DynaCode scores don't imply real-world coding capability or multi-file task competence.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DynaCode generates up to 189 million unique nested code problems across 4 complexity units and 16 call-graph structures", + "evidence": "Table 1 shows combinatorial counts per unit/level; Table 5 in the appendix provides per-graph-type breakdowns summing to 189,263,141 total problems", + "supported": "strong" + }, + { + "claim": "Evaluated LLMs show an average performance drop of 16.8% to 45.7% on DynaCode compared to MBPP+", + "evidence": "Table 2: GPT-4o drops from 72.2% (MBPP+) to 55.4% (DynaCode) = 16.8pp; Meta-Llama-3.1-8B drops from 55.6% to 9.9% = 45.7pp, matching the stated range exactly", + "supported": "strong" + }, + { + "claim": "DynaCode resists memorization: fine-tuned models show large gains on MBPP+ but minimal gains on DynaCode", + "evidence": "Figure 5: GPT-3.5-Turbo fine-tuned on MBPP+ jumps 69.7%→88.6% on MBPP+ but only 32.6%→36.0% on DynaCode; Meta-Llama-3.1-8B jumps 55.6%→98.1% on MBPP+ but only 10.6%→23.6% on DynaCode", + "supported": "moderate" + }, + { + "claim": "LLMs perform significantly better on sequential call graphs (G1–G4, G8) than multi-branch complex call graphs (G9–G16)", + "evidence": "Figure 6 shows consistent higher Pass@1 on sequential graphs vs. complex branches across all 4 tested models (GPT-4o, GPT-3.5-Turbo, WizardLM-2-8x22B, Meta-Llama-3.1-405B)", + "supported": "strong" + }, + { + "claim": "Problem Understanding error rate increases from 64.1% (Unit 1) to 88.8% (Unit 4) as complexity rises", + "evidence": "Table 3 reports error categorization for GPT-3.5-Turbo only, based on 100 questions per call graph; generalizing to 'LLMs' is unsupported by this single-model analysis", + "supported": "moderate" + }, + { + "claim": "Models with known data contamination (Meta-Llama-3-8B-Instruct) show disproportionately large performance drops on DynaCode", + "evidence": "Figure 1 shows Meta-Llama-3-8B drops from 64.6% (MBPP) to 8.4% (DynaCode) while DeepSeek-V3 drops from 87.6% to 52.1%; only 2 models shown, contamination status asserted not proven", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DynaCode generates 189M unique Python code generation problems by combining base problems from MBPP+ and LeetCode via directed acyclic call-graph structures, classifying problems by cyclomatic complexity (4 units) and graph topology (4 levels). Evaluated on 12 LLMs, all models show systematic performance degradation as both code and graph complexity increase, with average drops of 16.8–45.7pp versus MBPP+. Fine-tuning experiments provide evidence that DynaCode is more resistant to memorization than static benchmarks, as models that achieve near-perfect scores on MBPP+ after fine-tuning show only marginal improvements on DynaCode. Error analysis on GPT-3.5-Turbo reveals that Problem Understanding failures (not syntax or context errors) become dominant at higher complexity levels, growing from 64.1% to 88.8% of errors across units.", + "red_flags": [ + { + "flag": "No human baseline", + "detail": "No human programmer performance is reported at any complexity level, making it impossible to assess whether DynaCode's harder units discriminate meaningful capability differences or represent poorly constructed tasks beyond reasonable human ability." + }, + { + "flag": "Causal contamination claim confounded", + "detail": "The fine-tuning experiment cannot separate 'DynaCode resists memorization' from 'DynaCode is harder regardless of memorization' — both hypotheses predict identical patterns of smaller gains after fine-tuning on DynaCode vs. MBPP+." + }, + { + "flag": "Error analysis on single model only", + "detail": "Table 3's error distribution analysis covers only GPT-3.5-Turbo; generalizing conclusions about error types across 'LLMs' broadly from one model is unsupported." + }, + { + "flag": "LeetCode licensing unaddressed", + "detail": "The benchmark incorporates LeetCode problems and their official test cases without any discussion of copyright, licensing, or terms of use — this could limit legal reproducibility and distribution of the dataset." + }, + { + "flag": "Python-only scope never stated", + "detail": "All benchmark problems are in Python, but this critical scope limitation is never explicitly called out; conclusions about LLM code generation capabilities cannot be assumed to transfer to other programming languages." + }, + { + "flag": "Type-alignment constraint biases problem selection", + "detail": "Call-graph construction requires output types of parent functions to match input types of children — this constraint silently filters the combinatorial space and may systematically favor functions with common return types, biasing difficulty distribution." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Foundational code generation benchmark and source of Pass@k metric adopted in DynaCode; primary baseline for comparison" + }, + { + "title": "Program Synthesis with Large Language Models (MBPP)", + "relevance": "Primary source dataset for DynaCode unit functions; main static benchmark comparison throughout" + }, + { + "title": "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation (EvalPlus/MBPP+)", + "relevance": "MBPP+ (EvalPlus-processed) is the direct unit function source for DynaCode; provides enhanced test cases" + }, + { + "title": "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions", + "relevance": "Contemporary benchmark representing state-of-the-art that DynaCode is positioned against in Table 6" + }, + { + "title": "DyVal: Graph-Informed Dynamic Evaluation of Large Language Models", + "relevance": "Prior graph-based dynamic evaluation work; DynaCode extends this approach to the code generation domain" + }, + { + "title": "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", + "relevance": "Alternative contamination-resistant benchmark that uses temporal splits rather than dynamic generation" + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Representative real-world code benchmark discussed as complementary, targeting repository-level tasks vs. DynaCode's function-level focus" + }, + { + "title": "DyVal 2: Dynamic Evaluation of Large Language Models by Meta Probing Agents", + "relevance": "Related LLM-agent-driven dynamic evaluation approach that DynaCode explicitly distinguishes from by avoiding LLM-as-evaluator instability" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Practitioners evaluating LLMs for code generation tasks can use DynaCode to stress-test models under controlled complexity, though the Python-only scope and pipeline setup overhead limit immediate broad applicability." + }, + "surprise_contrarian": { + "score": 2, + "justification": "The magnitude of performance drops (up to 45.7pp below MBPP+ scores) is striking and challenges confidence in existing benchmark scores as reliable indicators of real-world coding capability." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns are raised; the paper is purely about benchmark methodology and evaluation." + }, + "drama_conflict": { + "score": 1, + "justification": "Implicit challenge to the validity of widely-used static benchmarks (HumanEval, MBPP) as reliable evaluation tools, framed constructively rather than controversially." + }, + "demo_ability": { + "score": 2, + "justification": "Code is publicly available on GitHub and the dynamic generation pipeline allows others to generate benchmarks and run evaluations immediately on their own models." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors from Drexel University and UESTC — credible academic institutions but no major AI lab (OpenAI, Google, Meta, DeepMind) affiliation to drive recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44592304", + "title": "Mixture-of-Recursions: Learning Adaptive Token-Level Computation", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44592304" + }, + { + "hn_id": "43288456", + "title": "Computation-Aware ControlNet with Dynamic Router for Text-to-Image Generation", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43288456" + }, + { + "hn_id": "45328070", + "title": "Why Johnny Cant Use Agents: Aspirations vs. Realities with AI Agents", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45328070" + }, + { + "hn_id": "35263649", + "title": "A comprehensive capacity analysis of GPT-3 and GPT-3.5 models", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=35263649" + }, + { + "hn_id": "26536350", + "title": "Dynamic Kernel Matching for Non-Conforming Data: A Study of T-Cell Receptors", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=26536350" + }, + { + "hn_id": "45467729", + "title": "AegisShield: Democratizing Cyber Threat Modeling with Generative AI", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45467729" + }, + { + "hn_id": "44634645", + "title": "Mixture-of-Recursions: Learning Dynamic Recursive Depths", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44634645" + }, + { + "hn_id": "44579442", + "title": "Mixture-of-Recursions", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44579442" + }, + { + "hn_id": "44008034", + "title": "Emotion-Sensitive Explanation Model", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44008034" + }, + { + "hn_id": "43349900", + "title": "FlexControl: Dynamic Block Activation for Diffusion Models", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43349900" + } + ], + "top_points": 3, + "total_points": 17, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dynafix-iterative-automated-2025/scan-v5.json b/papers/dynafix-iterative-automated-2025/scan-v5.json @@ -0,0 +1,507 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "DynaFix: Iterative Automated Program Repair Driven by Execution-Level Dynamic Information", + "authors": [ + "Zhilin Huang", + "Ling Xu", + "Chao Liu", + "Weifeng Sun", + "Xu Zhang", + "Yan Lei", + "Meng Yan", + "Hongyu Zhang" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2512.24635", + "doi": "10.48550/arXiv.2512.24635" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims — 186 bugs fixed, 10% improvement over SOTA, 38 unique fixes, at most 35 attempts, 70% search reduction — are directly supported by Table 1, Figure 4, and Figure 7.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation study (RQ4, Table 3) isolates each component's contribution; RQ2 controls for base model strength by comparing pure GPT-4o (72 bugs) vs DynaFix with same GPT-4o (206 bugs), providing adequate causal support.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 6 explicitly limits claims to Java programs, Defects4J benchmark, and a single LLM; title and conclusion do not overclaim beyond evaluated settings.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "RQ2 isolates framework contribution by holding model constant; Section 6 addresses LLM memorization of Defects4J as an alternative explanation with a Defects4J v3.0 experiment as mitigation.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper explicitly distinguishes 'plausible patches' (passes tests) from 'correct patches' (semantically equivalent to developer fix verified by manual inspection); RQ1 uses correct patches only.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 6 'Threats to Validity' is a dedicated section with internal and external validity subsections.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats include: manual patch evaluation subjectivity, LLM training overlap with Defects4J (mitigated with v3.0 experiment), reliance on published baseline results without re-running, Java-only scope, and single-LLM dependency.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Explicitly bounded to Java programs, Defects4J benchmark, and perfect fault localization settings; multi-language extension is named as future work.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment, grant numbers, or sponsor information appears anywhere in the paper text.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All eight authors list Chongqing University affiliations with full contact email addresses in the author block.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement, patent disclosures, or financial interest declaration appears in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "'Plausible patch' vs 'correct patch' (Section 3.3), 'execution-level dynamic information' (Section 1), 'maximum patch attempts per bug' (Section 5.3), and DynaFix/ByteTrace components are all explicitly defined.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions are explicitly enumerated at end of Section 1: the DynaFix framework, the ByteTrace tool, and SOTA results on Defects4J.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 7 explicitly positions DynaFix against FitRepair, GIANTRepair, ChatRepair, RepairAgent, SelfAPR, TraceFixer, and Self-Debug, explaining how DynaFix extends or differs from each.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "Paper states replication package 'will be made publicly available upon acceptance' — a conditional future promise; no link or current release is provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Defects4J v1.2 and v2.0 are publicly available standard benchmarks used unmodified; no new dataset requiring release was created.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only mentions 'ByteTrace in Java' and 'core repair logic in Python' and 'OpenAI API'; no requirements.txt, Dockerfile, Java/Python version, or dependency specifications are provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction instructions appear in the paper; replication package is not yet publicly available.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 1-3 and all figures report raw counts and percentages only; no confidence intervals or error bars appear for any result.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "All comparisons use raw bug counts and percentage differences; no statistical significance tests (t-test, Wilcoxon, etc.) are performed or mentioned.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Percentage improvements over each baseline are consistently reported (e.g., +26.5% over RepairAgent, +43.1% over FitRepair) with baseline context values.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 483 single-function bug subset is adopted from standard prior practice without sample size justification or power analysis.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Temperature is set to 1.0 (stochastic) but no variance, standard deviation, or multi-run statistics are reported; experiments appear to be single runs.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "11 SOTA baselines across 4 paradigms (LLM-based, deep learning, template-based, agent-based) are compared in RQ1.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines span 2019-2025 with most recent being RepairAgent (2025) and GIANTREPAIR (2025); coverage is competitive and current.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "RQ4 (Table 3) ablates each component: w/o Local Variables, w/o Control Flow, w/o Method Call, w/o LPR, and Pure LLM baseline.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Metrics include: correct patches, plausible patches, repair rate, unique fixes, and maximum patch attempts per bug (efficiency proxy).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "Section 3.3 explicitly states manual inspection of test-passing patches to verify semantic equivalence to developer fix; RQ1 results are based on these manually verified correct patches.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Defects4J provides bug-specific test suites used to validate patches; these serve as the held-out evaluation mechanism for all experiments.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Table 1 breaks results down by project (Chart, Closure, Lang, Math, Time, Mockito) and by dataset version (v1.2 vs v2.0).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "297/483 bugs remain unfixed and multi-function difficulty is noted, but no specific failure case examples or root-cause analysis of why DynaFix fails on particular bug types are provided.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "RQ2 shows execution-level information alone underperforms exception messages on multi-function bugs; RQ3 documents diminishing returns beyond breadth=7 or depth=5.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Only 'GPT-4o' is specified; no model snapshot date (e.g., gpt-4o-2024-11-20) is provided, making exact replication impossible as OpenAI updates the model.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Figure 3 shows prompt structure schematically but the caption explicitly states 'code details are omitted'; actual prompt text, system instructions, and one-shot examples are not provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature=1.0 and LPR configuration (breadth=7, depth=5, max 35 total attempts, 30-minute per-attempt limit) are all explicitly reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Sections 3.1-3.4 and Algorithm 1 describe the full workflow: ByteTrace instrumentation, structured prompt construction, automated patch validation, and LPR breadth-then-depth strategy in sufficient detail.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Section 4.2 describes bug subset selection (483 single-function from 830 total, 5 removed in latest update), v1.2/v2.0 split rationale, and use of perfect fault localization from Defects4J.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Raw patch outputs and experimental results are not currently accessible; replication package is pending acceptance ('Link will be provided upon publication').", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "ByteTrace data collection mechanism is described in Section 3.1; bug selection from Defects4J and rationale for the 483-bug subset are described in Section 4.2.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Standard benchmark study with no human participant recruitment.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Full pipeline documented: bug selection → ByteTrace instrumentation → prompt construction → LLM invocation → patch validation → LPR iterative loop → manual correctness verification.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "GPT-4o training data cutoff is never stated in the paper; only an API access date in the reference is provided.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": true, + "justification": "Section 6 explicitly discusses LLM training overlap with Defects4J open-source repositories, cites prior work [18] showing limited impact, and provides a Defects4J v3.0 experiment (9/24 bugs fixed) as empirical mitigation.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": true, + "justification": "Contamination is addressed by arguing training corpora 'rarely contain complete bug-fix pairs' and by demonstrating generalization on Defects4J v3.0 bugs not present in prior benchmarks.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "'Maximum patch attempts per bug' is used as a cost proxy and token-based billing is acknowledged, but actual dollar costs or total API call counts are never reported.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total API cost, token consumption, or wall-clock time for the full experimental evaluation is reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "DynaFix repairs 186 single-function bugs on Defects4J, outperforming all 11 SOTA baselines including GIANTREPAIR (169 bugs).", + "evidence": "Table 1 shows DynaFix 186 total vs GIANTREPAIR 169, RepairAgent 147, FitRepair 130 across 483 single-function bugs.", + "supported": "strong" + }, + { + "claim": "DynaFix achieves 38 unique bug fixes not resolved by any of the 11 baselines.", + "evidence": "Figure 4(b) shows 38 uniquely repaired bugs by DynaFix in the complementarity analysis across all 483 bugs.", + "supported": "strong" + }, + { + "claim": "Iterative use of execution-level information is critical — execution-level info alone achieves only 24.2% repair rate vs DynaFix's 42.6% with iteration.", + "evidence": "Table 2 on full Defects4J v2.0: Pure LLM 14.9%, Exception 18.6%, Execution-Level 24.2%, DynaFix (iterative) 42.6%.", + "supported": "strong" + }, + { + "claim": "DynaFix reduces maximum patch attempts by over 70% compared to the most efficient baseline (35 vs RepairAgent's 117).", + "evidence": "Figure 7 shows DynaFix at 35 max attempts vs RepairAgent at 117; other baselines range from 250 to 5,000.", + "supported": "strong" + }, + { + "claim": "The LPR strategy is the most impactful component, contributing 21.9 percentage points to repair rate.", + "evidence": "Table 3 ablation on 255 v1.2 bugs: Default 43.5% vs w/o LPR 21.6%, the largest single-component drop.", + "supported": "strong" + }, + { + "claim": "DynaFix with a single LLM outperforms GIANTREPAIR which aggregates four LLM models.", + "evidence": "Table 1 shows DynaFix (186) > GIANTREPAIR (169); paper notes GIANTREPAIR aggregates four models while DynaFix uses one.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "DynaFix integrates fine-grained execution-level dynamic information (variable states, control-flow paths, call stacks via the ByteTrace tool) into an iterative LLM-based APR workflow, achieving SOTA performance on Defects4J v1.2+v2.0 with 186 single-function bugs repaired including 38 previously unresolved by any baseline. The iterative mechanism is the dominant contributor (21.9pp in ablation), demonstrating that execution-level information alone is insufficient and must be combined with iteration to realize its value. DynaFix requires at most 35 patch attempts per bug — over 70% fewer than the most efficient baseline — showing that precise dynamic guidance dramatically reduces search overhead. Results are limited to Java under perfect fault localization and use an unpinned GPT-4o model without variance reporting.", + "red_flags": [ + { + "flag": "Model version not pinned", + "detail": "'GPT-4o' specified without a snapshot date; OpenAI updates this model silently, making exact replication impossible." + }, + { + "flag": "No statistical testing", + "detail": "All comparisons use raw counts and percentage differences; no significance tests are run despite stochastic generation at temperature=1.0." + }, + { + "flag": "No variance across runs", + "detail": "With temperature=1.0, results will differ across runs, but no standard deviation or multi-run reporting is provided; experiments appear to be single runs." + }, + { + "flag": "Code not yet released", + "detail": "Replication package promised 'upon acceptance'; paper cannot currently be reproduced independently." + }, + { + "flag": "Baselines not re-run", + "detail": "11 baselines are compared using their published results, which may use different Defects4J subsets, fault localization tools, or LLM configurations." + }, + { + "flag": "Perfect fault localization only", + "detail": "All experiments assume oracle bug location; real-world performance with automated fault localization is not evaluated." + }, + { + "flag": "Unresolved internal note in manuscript", + "detail": "Section 3.1 contains a stray editorial note ('please say that which experimental result approves the balance') left in the paper text, indicating the manuscript was submitted before completing revisions." + } + ], + "cited_papers": [ + { + "title": "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair", + "relevance": "Key baseline and closest agentic APR approach; uses dynamic prompts and state machine for iterative repair." + }, + { + "title": "Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using chatgpt (ChatRepair)", + "relevance": "Key baseline: dialogue-driven iterative APR using test failure feedback; most directly comparable iterative method." + }, + { + "title": "The plastic surgery hypothesis in the era of large language models (FitRepair)", + "relevance": "Key baseline: LLM APR with patch-knowledge and repair-oriented fine-tuning." + }, + { + "title": "Hybrid Automated Program Repair by Combining Large Language Models and Program Analysis (GIANTREPAIR)", + "relevance": "Strongest baseline: aggregates four LLM models with patch skeleton extraction; DynaFix outperforms it with a single model." + }, + { + "title": "Tracefixer: Execution trace-driven program repair", + "relevance": "Closest prior work on execution traces for APR; uses traces during fine-tuning rather than iteratively at inference." + }, + { + "title": "Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs", + "relevance": "Closely related concurrent work analyzing execution trace utility for LLM-based APR." + }, + { + "title": "Teaching large language models to self-debug", + "relevance": "Related work on using code explanations and chain-of-thought for self-debugging in program repair." + }, + { + "title": "Defects4J: A database of existing faults to enable controlled testing studies for Java programs", + "relevance": "Primary evaluation benchmark used for all experiments in this paper." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "DynaFix and ByteTrace are concrete implementable tools directly applicable to IDE-integrated APR; 186 real bugs fixed is tangible practitioner value." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Confirms expected hypothesis that fine-grained iterative feedback improves repair; the negative finding (execution info alone doesn't help multi-function bugs) is mildly surprising." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns; purely a software engineering productivity tool." + }, + "drama_conflict": { + "score": 0, + "justification": "Standard empirical benchmark comparison with no controversy." + }, + "demo_ability": { + "score": 2, + "justification": "ByteTrace and DynaFix are described in enough detail to prototype; replication package forthcoming, but not yet available to try." + }, + "brand_recognition": { + "score": 0, + "justification": "All authors from Chongqing University; no famous lab, product, or industry affiliation." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/dynamic-benchmarking-reasoning-2025/scan-v5.json b/papers/dynamic-benchmarking-reasoning-2025/scan-v5.json @@ -0,0 +1,377 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination", + "authors": [ + "Simin Chen", + "Pranav Pusarla", + "Baishakhi Ray" + ], + "year": 2025, + "venue": "International Conference on Machine Learning", + "arxiv_id": "2503.04149", + "doi": "10.48550/arXiv.2503.04149" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims are supported: contamination detection via controlled fine-tuning experiments (§4.2), semantic diversity via BLEU-4 and cosine similarity metrics (§4.4), and stable results across 10 runs (§4.5).", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "The controlled contamination experiment (§4.2) intentionally fine-tunes models on leaked benchmark data at 0–100% rates, providing a reasonably controlled basis for the causal claim that DyCodeEval is resistant to contamination-inflated scores.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper evaluates only Python NL-to-code generation on two datasets (HumanEval, MBPP) but makes unqualified claims about 'Code LLMs' broadly; no scope boundary is stated for other languages, code tasks like completion or repair, or non-English prompts.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper briefly hypothesizes about the DeepSeek-Coder anomaly but does not discuss whether performance degradation on DyCodeEval for 'potentially contaminated' models could be explained by domain shift from the scenario rewriting or prompt sensitivity rather than contamination.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper explicitly proposes DyPass as a better proxy for reasoning capability versus memorization (§5), distinguishing what Pass@K measures from what DyPass@K measures under contamination conditions.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "Limitations are embedded in the conclusion (Section 6) — computational cost and verbose output — but there is no dedicated limitations or threats-to-validity section separate from the conclusion.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The paper mentions only operational limitations (cost, verbose output) but does not discuss validity threats such as whether fine-tuning contamination simulation reflects real pretraining contamination, or whether the two seed datasets are representative.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit scope boundaries are stated: the paper does not clarify that results are restricted to Python NL-to-code tasks, or that DyCodeEval may not generalize to multilingual settings, code completion, or repair tasks.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Funding is disclosed in the acknowledgements: NSF grants CCF 2313055, CCF 2107405, CAREER 2025082, and FAI: 2040961.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All three authors are affiliated with Columbia University's Department of Computer Science, disclosed in the paper header.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "The funders appear to be NSF/government grants independent of the evaluation outcome; no commercial funder with a financial stake in the benchmark results is disclosed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is present in the paper; only NSF grants are acknowledged.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "The central term 'reasoning capability' — which DyCodeEval claims to measure — is never formally defined; the paper conflates reasoning with non-memorization without operationalizing what distinguishes the two.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Section 1 explicitly lists three contributions: novel problem characterization of static benchmark limitations, the DyCodeEval 4-agent methodology, and empirical findings on contamination resistance and diversity.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 systematically reviews three lines of contamination-free benchmarking and explicitly positions DyCodeEval against PPM (manual effort) and LiveCodeBench (semantic imbalance), showing how it addresses their specific limitations.", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": true, + "justification": "The paper grounds construct validity in metamorphic testing principles: modifying complexity-unrelated context preserves canonical solutions and algorithmic complexity, so the benchmark measures the same underlying capability — a principled if not deeply formalized argument.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "The paper claims complexity-equivalence between seed and generated problems but does not empirically characterize or verify the difficulty distribution (easy/medium/hard tiers) of the generated benchmark items.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "No explicit analysis of ceiling or floor effects is conducted; some models in Fig. 5 score very high on static benchmarks, but the paper does not investigate whether the dynamic benchmark resolves or merely shifts these effects.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "No human baseline for solving benchmark problems is provided; the human verification step (Appendix D) checks consistency of generated problems but does not measure human solve rates.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": true, + "justification": "Pass@K is formally defined (Equation 1) and the new DyPass@K metric is introduced and explicitly justified as expanding the input space beyond Pass@K to better distinguish reasoning from memorization under contamination (§5).", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "Contamination resistance is the core design principle: a 50×50 scenario-context space, dynamic randomness, and formal collision probability bounds (Theorems 3.1–3.3) collectively make identical problem regeneration extremely unlikely.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not discuss whether DyCodeEval will remain useful as models improve or as the scenario pool becomes known; no plan for updating scenarios, expanding coverage, or managing benchmark drift over time is provided.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "Only superficial failure modes are mentioned (computational cost, verbose prompts); deeper failure modes such as models learning to perform context-stripping, scenario pool exhaustion, or LLM-generated problems sharing statistical patterns are not discussed.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": false, + "justification": "A project webpage is referenced in the abstract but no explicit code release or reproduction package is described in the paper; full reproduction requires API access to Claude-3.5-Sonnet and reimplementing the generation pipeline from the appendix prompts.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "The seed datasets are described in Appendix B and generation prompts in Appendix C, but the generated dataset itself has no data card, and the scenario pool collection methodology is only partially described.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "No licensing or access terms for DyCodeEval or its generated datasets are stated; the project webpage is referenced but terms of use are not specified in the paper.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "The Impact Statement discusses societal benefits but does not specify what should NOT be concluded from DyCodeEval results, such as inapplicability to non-Python tasks or non-NL-to-code settings.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Static benchmarks create inflated Pass@1 scores for contaminated models, misrepresenting true reasoning capability", + "evidence": "Controlled fine-tuning with 0–100% leaked data shows steadily increasing Pass@1 on contaminated benchmarks while performance on other benchmarks remains stable (Fig. 4)", + "supported": "strong" + }, + { + "claim": "DyCodeEval prevents contaminated models from achieving artificially inflated benchmark scores", + "evidence": "In §4.2, models fine-tuned on leaked data show minimal or no improvement on DyCodeEval-generated problems compared to uncontaminated baselines, unlike on static benchmarks", + "supported": "strong" + }, + { + "claim": "QWEN2.5-CODER-7B is potentially contaminated on both HumanEval and MBPP", + "evidence": "In Fig. 5, QWEN2.5-CODER-7B consistently falls outside the 95% CI of the in-the-wild regression area for both seed datasets", + "supported": "moderate" + }, + { + "claim": "DyCodeEval generates semantically diverse problems while maintaining stable benchmarking results across runs", + "evidence": "Table 1 shows low BLEU-4 (0.27/0.18) and cosine similarity (0.74/0.73) versus baselines; Fig. 6 shows low standard deviation across 10 independent runs", + "supported": "strong" + }, + { + "claim": "DyPass@K better detects contamination than Pass@K by exposing memorization", + "evidence": "Table 2: contaminated models show Pass@K inflated to 0.82–0.89 while DyPass@K stays at 0.13–0.17, versus uncontaminated models where both metrics align", + "supported": "strong" + }, + { + "claim": "The probability of generating identical problems across DyCodeEval runs is negligibly low", + "evidence": "Theorems 3.1–3.3 provide formal probability bounds under uniform sampling from a 50×50 scenario-context space with mathematical proofs in Appendix A", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "theoretical" + ], + "key_findings": "DyCodeEval is a 4-agent pipeline (Scenario Proposer, Context Generator, Prompt Rewriter, Validator) that generates semantically diverse but algorithmically equivalent programming problems by modifying complexity-unrelated context while preserving canonical solutions, using metamorphic testing principles to resist data contamination. Controlled contamination experiments show that static benchmarks inflate Pass@1 for models fine-tuned on leaked data, while DyCodeEval remains resistant. In-the-wild evaluation across 12+ models identifies QWEN2.5-CODER-7B as a regression outlier suggesting potential contamination on both HumanEval and MBPP. The new DyPass metric more faithfully measures reasoning capability under contamination than Pass@K by varying the prompt context rather than sampling multiple solutions.", + "red_flags": [ + { + "flag": "Contamination simulation mismatch", + "detail": "The paper simulates contamination via fine-tuning on leaked benchmark data, but real pretraining contamination occurs during large-scale training; models may respond very differently to these two types, and this validity threat is not discussed." + }, + { + "flag": "Reasoning undefined", + "detail": "The central claim — that DyCodeEval measures 'reasoning capability' rather than memorization — relies on an undefined distinction; 'reasoning' is never operationalized, leaving the core theoretical contribution unverifiable." + }, + { + "flag": "Scope overclaim", + "detail": "Evaluated only on Python NL-to-code on two benchmarks, but conclusions are framed as applying to 'Code LLMs' broadly without bounding results to this specific task type and language." + }, + { + "flag": "No human baseline", + "detail": "No human performance baseline on benchmark problems is provided; it is unknown whether scenario-rewritten variants introduce unintended difficulty beyond what the metamorphic testing argument predicts." + }, + { + "flag": "QWEN contamination claim is weak", + "detail": "Labeling QWEN2.5-CODER-7B as 'potentially contaminated' based solely on regression outlier status is a weak statistical claim; alternative explanations such as training methodology differences are not considered." + }, + { + "flag": "Reproducibility barrier", + "detail": "Full reproduction requires API access to Claude-3.5-Sonnet for problem generation; no explicit code release is described in the paper, and the generated datasets are not shared as a static artifact." + } + ], + "cited_papers": [ + { + "title": "Evaluating Large Language Models Trained on Code (HumanEval)", + "relevance": "Foundational code generation benchmark used as the primary seed dataset for DyCodeEval" + }, + { + "title": "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", + "relevance": "Key prior work on contamination-aware temporal benchmarking that DyCodeEval explicitly improves upon" + }, + { + "title": "PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models", + "relevance": "Direct predecessor that DyCodeEval addresses, overcoming its manual operator definition and semantic imbalance limitations" + }, + { + "title": "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of LLMs for Code Generation (EvalPlus)", + "relevance": "Prior work identifying HumanEval's limited test coverage, relevant to the benchmark rigor discussion" + }, + { + "title": "DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks", + "relevance": "Prior dynamic evaluation approach using DAG structures — related methodology and conceptual predecessor" + }, + { + "title": "ReCode: Robustness Evaluation of Code Generation Models", + "relevance": "Used as comparison baseline for mutation-based diversity methods in Table 1" + }, + { + "title": "Program Synthesis with Large Language Models (MBPP)", + "relevance": "Second seed dataset used across all DyCodeEval experiments" + }, + { + "title": "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", + "relevance": "Representative code LLM benchmark cited in the broader benchmarking landscape survey" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly addresses data contamination in LLM evaluation — a critical problem for any team benchmarking or comparing code LLMs in practice." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Provides suggestive evidence that QWEN2.5-CODER-7B may be contaminated on standard benchmarks, implicitly challenging published benchmark results." + }, + "fear_safety": { + "score": 1, + "justification": "Contaminated benchmarks give false confidence in AI capabilities, with mild but real implications for deployment decisions — not primarily a safety paper." + }, + "drama_conflict": { + "score": 2, + "justification": "Implicitly accuses QWEN2.5-CODER-7B of potential benchmark contamination, which is a commercially and reputationally charged claim against a specific named model." + }, + "demo_ability": { + "score": 2, + "justification": "Project webpage is provided and all prompts are fully documented in the appendix; practitioners could reproduce the pipeline with API access to Claude-3.5-Sonnet." + }, + "brand_recognition": { + "score": 1, + "justification": "Columbia University is a reputable institution but not a top AI lab; the paper uses Claude models but is not authored by Anthropic or Google or Meta." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "45537698", + "title": "Virtual Memory for Real-time RISC-V systems using hPMP", + "points": 22, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=45537698" + }, + { + "hn_id": "45115249", + "title": "When Do Consumers Lose from Variable Electricity Pricing?", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45115249" + }, + { + "hn_id": "42966672", + "title": "Develop AI Agents for System Engineering in Factorio", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=42966672" + }, + { + "hn_id": "26591284", + "title": "The Unreasonable Ineffectiveness of Mathematics in Biology", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=26591284" + }, + { + "hn_id": "45165536", + "title": "How to Hack Transformers: Steering LLMs via Prompts, States, and Weight Edits", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=45165536" + }, + { + "hn_id": "44798220", + "title": "An Efficient End-to-End Dynamic Activation Framework for On-Device DNN Training", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44798220" + }, + { + "hn_id": "44030713", + "title": "Cosmos: Predictable and Cost-Effective Adaptation of LLMs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44030713" + } + ], + "top_points": 22, + "total_points": 35, + "total_comments": 5 + } +} +\ No newline at end of file diff --git a/papers/dynamic-memory-management-2025/scan-v5.json b/papers/dynamic-memory-management-2025/scan-v5.json @@ -0,0 +1,546 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Dynamic Memory Management on GPUs with SYCL", + "authors": [ + "Russell K. Standish" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2504.18211", + "doi": "10.48550/arXiv.2504.18211" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four abstract claims (dynamic memory not traditional, Ouroboros ported, CUDA backend comparison, Intel Xe testing) are demonstrated in Methods and Results sections.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "Paper makes no explicit causal claims. It is comparative (X performs Y way) but not causal (X causes Y). No ablation studies.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Scope is bounded to two specific hardware platforms (Dell Precision with Quadro T2000, Asus NUC with Iris Xe) and Ouroboros-style memory allocators.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper presents results (e.g., 'SYCL code ends up being about half that of CUDA') but does not explore alternative explanations for performance differences or confounding factors.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper measures allocation/free times in milliseconds and claims to measure allocator performance. The measurement granularity matches the claim.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Adaptive C++ issues are mentioned in conclusion but constitute only one sentence.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Only generic mention of Adaptive C++ failures. No specific discussion of generalizability, hardware limitations, benchmark design choices, or statistical power.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Scope is implicit (two hardware platforms, six allocator variants) but not explicitly stated. No discussion of what the work does NOT show (e.g., energy, memory fragmentation, real-world applicability).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper. No acknowledgment of grants, sponsors, or support.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": false, + "justification": "Author is listed as 'High Performance Coders' with no description of what this is (consulting firm, personal lab, etc.). No discussion of financial relationships or conflicts with Ouroboros authors or SYCL vendors.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No statement of financial interests, patents, equity, or consulting relationships with companies relevant to SYCL, CUDA, or GPU vendors.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are appropriately defined for the target audience: SYCL vs CUDA architecture explained (§1), dynamic memory allocation problem motivated with examples (graph algorithms, agent-based models).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contribution is explicitly stated: 'This work took the CUDA Ouroboros code and translated it into SYCL' and provides performance benchmarks comparing implementations across platforms.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Paper positions itself against prior survey (Winter & Mlakar 2021), cites Ouroboros as most performant existing allocator, and discusses CUDA vs SYCL vs OpenCL landscape in introduction.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "Source code is publicly available on GitHub at https://github.com/highperformancecoder/Ouroboros-SYCL with master branch containing SYCL code and cuda-ouroboros branch containing original.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "Raw benchmark results and Ravel analysis file are available via supplementary materials at https://osf.io/2zwrt/ (OSF repository referenced in footnote 7).", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": true, + "justification": "Detailed environment specification provided: compiler versions (Intel oneAPI 2025.1, Adaptive C++ commit f336ab84, CUDA 12.8), hardware specs (CPUs, GPUs), and compilation flags (-fsycl, -fsycl-targets=nvptx64-nvidia-cuda, etc.).", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Compilation flags and steps are described (cmake, ccmake, compiler flags) but no step-by-step reproduction instructions are provided in the paper. Code is on GitHub but specific build/test procedure is not documented.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Figures 1-6 show raw data points with no error bars or confidence intervals. No variance, standard deviation, or uncertainty quantification reported.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests performed. Comparative claims like 'about half that of the CUDA code' are made without hypothesis tests, p-values, or statistical backing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Effect sizes not systematically reported. Qualitative statements ('about half', 'broadly in line') appear but no percentage improvements with confidence intervals or effect size metrics (Cohen's d, etc.).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Only 10 iterations per benchmark run. No justification provided for this sample size choice, and no power analysis or sensitivity analysis to justify adequacy.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Only mean (average) time reported across 10 iterations. No variance, standard deviation, min/max, or confidence intervals shown in figures or text.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Multiple baselines: original optimised CUDA Ouroboros, deoptimised CUDA variant, Intel oneAPI on NVIDIA, Adaptive C++ compiler—all compared across six allocator variants.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines are the original Ouroboros (2020, still state-of-art per Winter survey) and contemporary 2025 compilers (Intel oneAPI 2025.1, CUDA 12.8, Adaptive C++).", + "source": "haiku" + }, + "ablation_study": { + "applies": false, + "answer": false, + "justification": "No ablation study. Six allocator variants (page, chunk, virtualised array, virtualised list) are evaluated separately, but these are alternative algorithms, not ablations of one design.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": false, + "justification": "Only allocation/free time (ms) measured. No secondary metrics like memory fragmentation, allocation success rate, energy consumption, or compilation time.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human evaluation of system outputs.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": false, + "answer": false, + "justification": "Not a prediction task—not applicable.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results broken down by allocator type (page, chunk, virtualised variants) and by allocation size / number of simultaneous allocations in each figure.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Adaptive C++ deadlocks and timeouts are mentioned in conclusion ('suffered from timeouts and deadlocks') but not analysed, root-caused, or discussed in depth.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "SYCL 2x slower than CUDA for page allocators is a negative result, but not framed as such. Adaptive C++ failures mentioned but minimised. Negative findings not emphasised.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": false, + "answer": false, + "justification": "Not applicable—no ML models evaluated.", + "source": "haiku" + }, + "prompts_provided": { + "applies": false, + "answer": false, + "justification": "Not applicable—no LLM prompts.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Compiler flags and options fully specified (-fsycl, -fsycl-targets, etc.). Benchmark parameters (10 iterations, allocation sizes, thread counts) documented in Methods.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable—not an agentic system study.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Data collection process described: 10-iteration loop with allocate, write, verify, free cycle. JIT compilation handling explained (report average all runs vs all-but-first).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "Raw benchmark data available via OSF supplementary materials link (https://osf.io/2zwrt/) plus Ravel analysis file for reproducibility.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Benchmark procedure clearly described: driver program iterates 10 times, performs alloc/write/verify/free cycle, computes averages. Hardware and software versions specified.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Pipeline from raw runs to reported metrics documented: 10 iterations per allocator/size/thread combo, average calculated, then separate averages for all-iterations vs subsequent-only (JIT adjustment).", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not applicable—not evaluating model capabilities on benchmarks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "Not applicable.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable—no human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "Allocation latency reported in milliseconds, but no total inference cost, runtime budget, or practical deployment cost mentioned.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "No total computational budget stated (e.g., total GPU hours, cloud compute cost, or timeline for full benchmark suite).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Ouroboros-SYCL successfully ports the CUDA Ouroboros dynamic memory allocator to SYCL", + "evidence": "Code compiles on multiple SYCL backends (oneAPI on NVIDIA, oneAPI on Intel Xe, Adaptive C++), allocations pass correctness verification (data written and read back correctly).", + "supported": "strong" + }, + { + "claim": "SYCL implementation via oneAPI achieves performance within factor of 2 of original CUDA for page allocators", + "evidence": "Figure 1 shows SYCL ~0.01ms vs CUDA ~0.005ms for 1024 allocations at 1000-byte size. No error bars or variance; single measurement.", + "supported": "moderate" + }, + { + "claim": "Chunk allocators perform comparably between SYCL (oneAPI) and original CUDA", + "evidence": "Figure 2 shows overlapping performance curves. Text states 'performance is broadly in line with the original Ouroboros implementation when run on the same hardware'. No statistical test.", + "supported": "moderate" + }, + { + "claim": "Intel Xe graphics performance is competitive with NVIDIA via SYCL", + "evidence": "Figure 2 shows oneAPI on Intel (rs) curves similar to oneAPI on NVIDIA (r) curves for chunk allocator. Limited to one hardware combination.", + "supported": "moderate" + }, + { + "claim": "Adaptive C++ compiler support is problematic, with deadlocks and timeouts as thread count increases", + "evidence": "Conclusion: 'Adaptive C++ unfortunately suffered from timeouts and deadlocks, which may limit the use of this code with this compiler'. Observed but not root-caused.", + "supported": "strong" + }, + { + "claim": "SYCL language limitations (global nd_item access, warp vote masking) prevent full optimization parity with CUDA", + "evidence": "§2 details six porting challenges (3D thread layout, atomic operations, nd_item access, I/O, nanosleep, warp votes). Proposed fixes for future SYCL standard mentioned.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study" + ], + "key_findings": "Ouroboros-SYCL successfully ports a high-performance CUDA memory allocator to the cross-platform SYCL API, achieving within-factor-of-2 performance on page allocators and comparable performance on chunk allocators when compiled via Intel's oneAPI toolset. The porting identified six language-level limitations in SYCL relative to CUDA (global thread context access, warp voting, I/O, timing primitives), most of which are proposed for future SYCL standards. Adaptive C++ compiler support remains problematic with deadlocks and timeouts.", + "red_flags": [ + { + "flag": "No error bars or statistical variance", + "detail": "Benchmark results report only mean times across 10 iterations with no std dev, confidence intervals, or error bars. Unclear if 2x performance gap is statistically significant or within noise." + }, + { + "flag": "No significance testing", + "detail": "Comparative claims ('about half that of CUDA', 'broadly in line') lack statistical hypothesis tests or p-values to support generality." + }, + { + "flag": "Limited generalization scope", + "detail": "Only two hardware platforms tested (Dell + NVIDIA, Asus + Intel). Results may not generalise to other GPU vendors or hardware configurations. Ouroboros is memory-allocation specific, not representative of broader GPU workloads." + }, + { + "flag": "Deadlock analysis incomplete", + "detail": "Adaptive C++ deadlocks mentioned in conclusion but not investigated. No root cause analysis, potential fixes, or discussion of whether this is fixable by Adaptive C++ maintainers." + }, + { + "flag": "No ablation study", + "detail": "Does not isolate which SYCL language differences (1D vs 3D layout, atomic references, etc.) are responsible for performance gaps. Porting was manual translation, hard to isolate causes." + }, + { + "flag": "Missing limitations section", + "detail": "No dedicated limitations or threats-to-validity section. Scope boundaries (hardware, workloads, generalizability) are not explicitly discussed." + }, + { + "flag": "No funding or conflict-of-interest disclosure", + "detail": "No statement of funding source or disclosure of financial interests with SYCL/CUDA vendors or Ouroboros authors. Affiliation 'High Performance Coders' is not clarified." + }, + { + "flag": "Single metric evaluation", + "detail": "Only allocation/deallocation latency measured. No memory fragmentation, energy consumption, or compilation time analysis. Real-world applicability unclear." + } + ], + "cited_papers": [ + { + "title": "Ouroboros: virtualized queues for dynamic memory management on GPUs", + "relevance": "Original CUDA library being ported; direct baseline for performance comparison" + }, + { + "title": "Are dynamic memory managers on GPUs slow? a survey and benchmarks", + "relevance": "Prior survey positioning Ouroboros as most performant allocator; motivates choice of base library" + }, + { + "title": "Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL", + "relevance": "SYCL standard reference and programming guide; foundation for porting effort" + }, + { + "title": "CUDA: Scalable parallel programming for high-performance scientific computing", + "relevance": "Original CUDA architecture; comparison baseline for SYCL design decisions" + }, + { + "title": "Using SYCLomatic to migrate CUDA code to oneAPI adapting NVIDIA GPU", + "relevance": "CUDA-to-SYCL automatic translation tool; used as starting point (semi-successful)" + }, + { + "title": "SPIR-V specification", + "relevance": "Intermediate representation used by SYCL JIT compilation pipeline" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "GPU dynamic memory is essential for graph algorithms and agent-based models, and SYCL portability benefits applications targeting multiple GPU vendors. Niche audience of HPC developers." + }, + "surprise_contrarian": { + "score": 1, + "justification": "2x SYCL performance gap is expected given cross-platform abstraction overhead; Adaptive C++ immaturity unsurprising for younger compiler. No challenging conventional wisdom." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety, security, or alignment concerns raised. Pure HPC infrastructure work." + }, + "drama_conflict": { + "score": 0, + "justification": "No controversy, conflict, or drama. Straightforward technical porting effort." + }, + "demo_ability": { + "score": 2, + "justification": "Code is publicly available on GitHub and reproducible with matching hardware; benchmarks can be re-run. Requires NVIDIA or Intel GPU and specific compiler setup." + }, + "brand_recognition": { + "score": 1, + "justification": "Author Russell K. Standish not widely known in ML/AI circles. 'High Performance Coders' affiliation is not a recognizable lab. Paper cites established work (Winter et al. Ouroboros) but author has modest brand." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "43086347", + "title": "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork", + "points": 111, + "comments": 74, + "url": "https://news.ycombinator.com/item?id=43086347", + "created_at": "2025-02-18T05:25:05Z" + }, + { + "hn_id": "46636707", + "title": "Show HN: A-MEM – Memory for Claude Code that links and evolves on its own", + "points": 8, + "comments": 4, + "url": "https://news.ycombinator.com/item?id=46636707", + "created_at": "2026-01-15T18:15:04Z" + }, + { + "hn_id": "43760287", + "title": "Creating benchmarkable components to measure the quality of AI-enhanced devtools", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43760287", + "created_at": "2025-04-22T09:09:48Z" + }, + { + "hn_id": "45357392", + "title": "Personalised Pricing: The Demise of the Fixed Price?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=45357392", + "created_at": "2025-09-24T07:35:21Z" + }, + { + "hn_id": "44324675", + "title": "ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44324675", + "created_at": "2025-06-20T04:10:28Z" + }, + { + "hn_id": "43086430", + "title": "SWE-Lancer: Can LLMs Earn $1M from Real-World Freelance Software Engineering?", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43086430", + "created_at": "2025-02-18T05:40:39Z" + } + ], + "top_points": 111, + "total_points": 127, + "total_comments": 78 + } +} +\ No newline at end of file diff --git a/papers/dynamic-mix-precision-2026/scan-v5.json b/papers/dynamic-mix-precision-2026/scan-v5.json @@ -0,0 +1,531 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Dynamic Mix Precision Routing for Efficient Multi-step LLM Interaction", + "authors": [ + "Yuanzhe Li", + "Jianing Deng", + "Jingtong Hu", + "Tianlong Chen", + "Song Wang" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.02711", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All four abstract claims are supported: (1) LLMs succeed at long-horizon tasks (shown in intro), (2) inference cost is prohibitive (referenced in related work), (3) the framework is described in Section 4, (4) Table 1 demonstrates accuracy-cost improvements over baselines.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation study (Section 5.3, Table 2) isolates KL-ST vs GRPO effects; comparisons against random routing and full-precision baselines support causal claims. However, limited to one benchmark environment.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Paper frames problem broadly ('long-horizon agentic tasks', 'real-world scenarios') but evaluates only on ALFWorld. No explicit scope boundaries stated. Claims like 'effective precision routing does not require a full-capacity model' generalize beyond tested domain without justification.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of alternative explanations. For instance, GRPO shows no improvement on 1.7B model (Table 2) but paper only attributes this to 'inherent limitation' without exploring other factors. No consideration of confounds.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Measured outcomes (success rate, high-precision usage, GHC metric) map directly to intended claims (task performance and inference efficiency). No conflation between proxy and ground truth.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "No dedicated limitations or threats-to-validity section. Conclusion (Section 6) is one paragraph summarizing findings without acknowledging scope boundaries.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats discussed. Single-environment evaluation, limited quantization methods (GPTQ only), ad-hoc hyperparameter selection (KL threshold 'manually selected based on empirical distribution'), and inconsistent GRPO gains across model scales are not acknowledged.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "No explicit boundaries stated. Paper does not acknowledge what the approach does NOT apply to (e.g., single-turn QA, retrieval tasks, or models without quantized variants).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding statement or acknowledgments section visible in paper. No disclosure of funding sources.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations listed at top (University of Arizona, University of Pittsburgh, UNC Chapel Hill, University of Central Florida). Affiliations are disclosed; however, no explicit conflict-of-interest statement provided.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder identified.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing-interests or financial-disclosure statement included. Paper evaluates Qwen and DeepSeek models but does not disclose any potential financial relationships.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Core terms are defined: 'agentic tasks' as long-horizon decision-making requiring tool use and environment interaction (Section 1), 'dynamic mix-precision routing' explained with architecture (Section 4), 'quantization' used consistently with standard meaning.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Three contributions explicitly stated in introduction: (1) dynamic framework for step-level routing, (2) RL optimization of routing policy, (3) lightweight router sufficiency. Clear and distinct contributions.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 2 engages with three areas: LLM routing (FrugalGPT, HybridLLM, RouteLLM, etc.), agentic tasks, and quantization. Paper positions this work as step-level routing (vs. query-level) on agentic tasks, showing how it addresses limitations of prior routing work.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository, GitHub link, or supplementary code provided. Paper is a preprint with no mention of code availability.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "ALFWorld benchmark is publicly available and evaluation follows standard protocol (Yao et al. 2022b unseen test set). However, the 200 training rollouts used for KL-ST supervision are not released.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "No requirements.txt, Dockerfile, or environment specification provided. Training details in Appendix B (batch size, learning rate) but no computational environment or dependency specs.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Experimental setup described in Section 5 and Appendix, but no step-by-step reproduction instructions. No code release makes reproduction difficult or impossible.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Table 1 reports success rates and metrics as point estimates with no confidence intervals, error bars, or standard deviations. Single evaluation run implied; no variance reporting.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests (t-tests, Mann-Whitney, etc.) reported. Comparisons presented descriptively (e.g., 88.8% vs 89.6%) without p-values or significance indicators.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "GHC metric directly measures effect size (gain per high-precision call). Improvements are quantified in percentage points. However, no formal effect-size statistics (Cohen's d, etc.) or confidence bounds reported.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "KL-ST uses 200 episodes (stated in Appendix B), GRPO uses 120 episodes. No power analysis or justification for these choices. Table 3 ablates data scale (100–400) showing variance in results but final choice not justified.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "No variance, std dev, or repeated runs reported. Especially problematic for RL results which are typically high-variance. Only point estimates in tables.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Comparisons include full-precision baseline (BF16), quantized-only baseline (GPTQ), and random routing at multiple thresholds (20%, 40%, 60%, 80%). Strong baseline set.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "BF16 and GPTQ baselines are contemporary. However, no comparison to other learned routing methods mentioned in related work (RouteLLM, BEST-Route, RouterDC, etc.), only random and single-precision baselines.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Section 5.3 compares KL-ST-only, GRPO-only, and KL-ST+GRPO (Figure 5, Table 2). Table 3 ablates training data scale (100–400 episodes). Ablations show both components contribute.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Success rate (primary), high-precision usage ratio, and GHC (unified efficiency metric) reported. Multiple perspectives on performance–cost trade-off.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Not applicable. Task success is objective (environment-based); no human evaluation needed or provided.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Evaluation on 'unseen test task' following Yao et al. (2022b); ALFWorld has standard train/test split.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "Results broken down by model family (Qwen, DeepSeek) and size (8B, 4B, 1.7B) but no breakdown by task type, difficulty, or failure mode within ALFWorld.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Appendix A.2 provides one qualitative case study (cleaning cloth task) showing low-precision failure and router success. No systematic failure analysis or taxonomy of failure modes.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "GRPO-only achieves lower GHC than KL-ST+GRPO (Table 2). GRPO provides no benefit on 1.7B model. Table 3 shows GHC fluctuations with data scale. Some negative/mixed results included.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Model families and sizes specified (Qwen3-8B/4B/1.7B, DeepSeek-R1-Distill-Llama3-8B) but no version snapshots or release dates. For reproducibility, commit hashes or exact model cards needed.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No system prompt or instruction template provided. For ALFWorld this may be standard, but explicit prompt text is absent. Critical for reproducibility of LLM outputs.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "KL-ST hyperparameters provided (batch 64, lr 1e-4, 5 epochs, Appendix B). GRPO hyperparameters provided (lr 1e-6, KL weight 0.02, 120 episodes). However, KL threshold selection described as 'manually selected between 78th and 85th percentiles'—vague and ad-hoc.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Router architecture detailed in Section 4 and Figure 3: 2-layer Transformer encoder with position embeddings, masked attention, and softmax classification. Scaffolding is the routing framework itself and is well-described.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Section 4.1 states step embeddings are 'pre-computed by an external encoder and treated as fixed inputs' but does not specify which encoder, how it works, or its parameters. Critical preprocessing detail missing.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "ALFWorld is publicly available. However, the 200 high-precision rollouts collected for KL-ST training are not released, limiting reproducibility of training data.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Section 4.2 describes trajectory sampling protocol: high-precision rollout strategy, measurement of step-wise KL divergence, retention of successful trajectories only. Well-documented collection process.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human subjects involved.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "Pipeline from trajectory collection → KL divergence computation → KL-ST training → GRPO optimization is documented in Sections 4.2–4.3 and Appendix B with sufficient detail.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "No training cutoff date stated for Qwen3 or DeepSeek-R1 models. These are 2024–2025 models but exact training data cutoff unknown. ALFWorld cutoff not discussed.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of potential overlap between ALFWorld test set and base model training data. With models released in 2025–2026 evaluating a 2021 benchmark, contamination risk is not addressed.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "ALFWorld (2021) may have been included in base model training data. Paper does not acknowledge or test for this risk.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "High-precision usage ratio and GHC metric quantify inference cost trade-offs. However, actual latency, wall-clock time, or GPU hours are NOT reported. Cost is relative, not absolute.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Episode budgets stated (200 for KL-ST, 120 for GRPO) but no compute cost in GPU hours, FLOPs, or dollars. Cannot assess practical compute requirements.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Low-precision quantized LLMs exhibit step-wise diversity in sensitivity: most steps tolerate quantization but a small fraction are 'critical' and require high precision.", + "evidence": "Figure 4 shows step-wise KL divergence distribution is 'highly skewed' with heavy right tail across Qwen3-8B, Qwen3-4B, and DeepSeek models. Section 1 and 4.2 frame this observation.", + "supported": "strong" + }, + { + "claim": "Dynamic step-level mix-precision routing achieves superior accuracy–cost trade-off compared to random routing and full-precision baselines.", + "evidence": "Table 1: Router achieves GHC scores (19.85–43.02) consistently higher than random routing (2–18.67) across all models. Approaches full-precision performance (89.6% vs 88.8% on Qwen3-8B) with 26.7% high-precision usage.", + "supported": "strong" + }, + { + "claim": "KL-divergence-based supervised learning (KL-ST) identifies precision-sensitive steps better than unsupervised methods.", + "evidence": "Figure 5 and Table 2: KL-ST achieves higher GHC than GRPO-only (18.79 vs 10.74 on Qwen3-8B). KL-ST provides 'stable initialization' for downstream RL. Table 3 shows KL-ST data scale correlates with GHC improvement.", + "supported": "moderate" + }, + { + "claim": "Group-Relative Policy Optimization (GRPO) further improves routing decisions over KL-ST alone.", + "evidence": "Figure 5 and Table 2: KL-ST+GRPO (GHC 19.85) outperforms KL-ST-only (18.79) on Qwen3-8B. However, on smaller models (1.7B), GRPO provides NO improvement. Gain correlates with base model capability, not uniform.", + "supported": "moderate" + }, + { + "claim": "A lightweight two-layer Transformer router is sufficient for step-level precision routing without requiring a full-capacity LLM.", + "evidence": "Section 4.1 and Figure 3 describe router architecture (2 layers, position embeddings, last-token pooling). No comparison to larger or smaller routers provided. No justification for architecture choice.", + "supported": "weak" + }, + { + "claim": "The approach generalizes across LLM families and model scales (Qwen, DeepSeek; 1.7B–8B).", + "evidence": "Table 1 shows results on Qwen3-8B/4B/1.7B and DeepSeek-R1-Distill-Llama3-8B. However, all evaluation is on ALFWorld only. No evaluation on other agentic benchmarks (WebArena, ScienceWorld, etc.).", + "supported": "weak" + }, + { + "claim": "The router efficiently reduces high-precision calls without sacrificing task success (e.g., 26.7% high-precision usage achieves 88.8% success vs 100% for full-precision at 89.6%).", + "evidence": "Table 1: Qwen3-8B router uses 26.7% high-precision and achieves 88.8% success vs 100% high-precision at 89.6% success. GHC=19.85 indicates efficient trade-off. Results limited to ALFWorld.", + "supported": "strong" + }, + { + "claim": "Routing decisions based on behavioral divergence (KL) are more effective than random routing at equivalent cost budgets.", + "evidence": "Table 1: Router@26.7% high-precision (GHC 19.85) outperforms Random@20% (GHC 2) and Random@40% (GHC 8.5) on Qwen3-8B. Similar pattern across all models.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "empirical" + ], + "key_findings": "The paper demonstrates that step-wise precision sensitivity in quantized LLMs is heterogeneous—most steps tolerate quantization but critical decision points require full precision. A lightweight two-layer Transformer router trained via KL-divergence supervision and policy optimization can dynamically select between high- and low-precision models, achieving 26.7–88% high-precision usage while maintaining near-full-precision task success (88.8% vs 89.6%) on ALFWorld. The approach consistently outperforms random routing baselines, though improvements diminish on smaller models (1.7B), and has been validated only on a single benchmark environment.", + "red_flags": [ + { + "flag": "Single-benchmark evaluation", + "detail": "All experiments on ALFWorld only. No evaluation on WebArena, ScienceWorld, or other agentic benchmarks limits claim generalization to 'real-world agentic tasks' (Section 1, page 1)." + }, + { + "flag": "No confidence intervals or repeated runs", + "detail": "Table 1 reports point estimates with no error bars, std dev, or repeated evaluations. Single-run results; RL typically exhibits high variance, especially with sparse rewards (problem stated in Section 4.3)." + }, + { + "flag": "Missing external encoder specification", + "detail": "Step embeddings pre-computed by 'external encoder' (Section 4.1, page 3) but encoder type, architecture, and training unknown. Critical preprocessing detail omitted, hindering reproducibility." + }, + { + "flag": "Ad-hoc hyperparameter selection", + "detail": "KL threshold selected 'manually between 78th and 85th percentiles' (Appendix B, page 13). No ablation on threshold value. Selection method not principled." + }, + { + "flag": "Inconsistent GRPO gains by model scale", + "detail": "GRPO improves GHC on Qwen3-4B (26.43→43.02) and 8B (18.79→19.85) but provides zero improvement on 1.7B (Table 2). Attribution to 'inherent router limitation' lacks depth; alternative explanations unexplored." + }, + { + "flag": "No comparison to learned routing baselines", + "detail": "Section 2 cites RouteLLM, BEST-Route, RouterDC, and others, but experiments compare only to random routing and single-precision baselines. No head-to-head comparison to other routers." + }, + { + "flag": "Training data scale variance unexplained", + "detail": "Table 3 shows GHC fluctuations (100 eps: 14.71, 200: 18.79, 300: 14.56, 400: 25.27). High variance not investigated; why 300 episodes worse than 200 and 400?" + }, + { + "flag": "No limitations section or scope boundaries", + "detail": "Paper frames broadly ('long-horizon agentic tasks', 'real-world') but explicitly bounds only to ALFWorld. No discussion of what approach does NOT work for (e.g., single-turn QA, retrieval, open-ended generation)." + }, + { + "flag": "Potential benchmark contamination not addressed", + "detail": "ALFWorld released 2021; Qwen3 and DeepSeek-R1 likely trained 2024–2025. Possibility of test-set overlap in base model training not tested or discussed." + }, + { + "flag": "No code or training data release", + "detail": "Reproduction impossible without code. The 200 high-precision rollouts for KL-ST training are not released. Evaluation uses public ALFWorld but training is unreproducible." + } + ], + "cited_papers": [ + { + "title": "Reflexion: Language agents with verbal reinforcement learning", + "relevance": "Early work on agentic language model scaffolding and trajectory-level optimization; foundational for multi-step reasoning." + }, + { + "title": "ALFWorld: Aligning text and embodied environments for interactive learning", + "relevance": "Primary evaluation benchmark; establishes the agentic task domain (household manipulation via natural language)." + }, + { + "title": "ReAct: Synergizing reasoning and acting in language models", + "relevance": "Core agentic framework combining reasoning and tool use; baseline for step-level decision-making architectures." + }, + { + "title": "RouteLLM: Learning to route llms with preference data", + "relevance": "Prior routing work at query level; this paper advances to step-level routing. Direct predecessor in routing literature." + }, + { + "title": "Can compressed llms truly act? an empirical evaluation of agentic capabilities in LLM compression", + "relevance": "Directly relevant: quantization causes agentic failure on interactive tasks, motivating the precision-routing approach." + }, + { + "title": "BEST-Route: Adaptive LLM routing with test-time optimal compute", + "relevance": "Concurrent work on compute-aware routing; establishes the importance of cost-benefit trade-offs in routing." + }, + { + "title": "Quantization hurts reasoning? an empirical study on quantized reasoning models", + "relevance": "Shows quantization degrades performance on reasoning-heavy tasks; supports motivation for adaptive precision selection." + }, + { + "title": "SWE-bench: Can language models resolve real-world github issues?", + "relevance": "Alternative agentic benchmark (software engineering); highlights generalization question beyond ALFWorld." + }, + { + "title": "WebArena: A realistic web environment for building autonomous agents", + "relevance": "Another agentic benchmark domain (web navigation); would provide generalization evidence if evaluated here." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Addresses real cost-efficiency problem for deployed agentic systems, but requires retraining router per model/benchmark pair and depends on GPTQ quantization availability. Limited immediate applicability." + }, + "surprise_contrarian": { + "score": 2, + "justification": "Finding that quantized models fail at 'critical steps' is intuitive (confirms practitioner belief). Step-level routing is incremental advance over query-level. No surprising or contrarian insight." + }, + "fear_safety": { + "score": 0, + "justification": "No discussion of safety, alignment, or AI risk. Paper is purely about efficiency optimization." + }, + "drama_conflict": { + "score": 0, + "justification": "Straightforward technical contribution; no controversy, heated debates, or conflicting findings." + }, + "demo_ability": { + "score": 1, + "justification": "No code release, so demo requires reimplementation. Figure 1 case study is illustrative but no runnable demo provided. Difficult to try immediately." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors from University of Arizona, Pittsburgh, UNC Chapel Hill, UCF—solid institutions but not top-tier ML labs (no Anthropic, OpenAI, DeepMind, Meta affiliations)." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/early-approaches-adversarial-2025/scan-v5.json b/papers/early-approaches-adversarial-2025/scan-v5.json @@ -0,0 +1,577 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Early Approaches to Adversarial Fine-Tuning for Prompt Injection Defense: A 2022 Study of GPT-3 and Contemporary Models", + "authors": [ + "Gustavo Sandoval", + "Denys Fenchenko", + "Junyao Chen" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2509.14271", + "doi": "10.48550/arXiv.2509.14271" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Key claims (31% baseline attack success, near-zero after fine-tuning on Ada/Babbage/Curie, larger models more vulnerable) are all supported by Tables 1-2 and Figure 5.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper claims fine-tuning 'reduces attack success rates' but provides no ablation study isolating whether success comes from structured delimiters, adversarial examples, or fine-tuning itself. Before/after comparison shows effect exists but not what causes it.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Title frames this as contributing to 'the evolution of modern prompt injection defense research' and claims influence on 'constitutional AI approaches,' but only tested on 2022 models (GPT-3, GPT-2, T-5, OPT). Broader claims about historical influence are speculative.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Paper does not discuss whether improvements could stem from overfitting to adversarial examples, whether delimiters alone (without fine-tuning) would work, or alternative mechanisms for the defense.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Measurement (attack success rate via Levenshtein distance threshold) directly matches the claim being tested. No problematic proxy identified.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section titled 'Contemporary Relevance and Limitations' discusses limitations explicitly, though it frames many as post-hoc (discovered by others in 2024) rather than original study threats.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "Limitations section cites external research (fine-tuning fragility, modern attacks bypassing defenses) but does not address threats to original study validity such as: potential overfitting to test attacks, unclear if held-out test set was used, or generalization to novel attack patterns.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Paper states methodology is '2022 landscape' specific and models are 'now superseded,' but does not explicitly bound what the study does NOT show (e.g., generalization to novel attacks, performance on larger models, long-term robustness).", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding section or funding statement present. Unknown if research was funded or unfunded.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author emails show @nyu.edu affiliation. No undisclosed affiliation with OpenAI (whose API was used extensively) is evident.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding disclosure prevents assessment.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement included.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms clearly defined: 'prompt injection' (malicious instruction injection), 'goal hijacking' vs 'prompt leaking' (with examples), 'adversarial fine-tuning,' model names with versions (text-davinci-003, text-curie-001, etc.).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contributions explicitly stated: (1) explore prompt injection attacks, (2) test LLMs empirically, (3) propose adversarial fine-tuning defense. Reader clearly knows what paper claims to add.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Engages with Perez & Ribeiro 2022 (PromptInject framework) and builds on their work; cites transformer literature and defense strategies. Broader engagement with adversarial robustness literature is limited but immediate context is covered.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "GitHub link 'https://github.com/GusSand/PromptInject' points to the PromptInject framework, not clearly their modifications. Notebook files (dataset_construct.ipynb, fine-tuned models, etc.) are referenced but unclear if released publicly.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "Fine-tuning datasets sourced from public Kaggle datasets (standard), but the adversarial test dataset (1,260 attack variations) is not stated as released or available.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "References 'Google Colab Pro' and 'OpenAI fine-tuning API' but provides no environment specs (requirements.txt, Python version, dependencies, library versions).", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Notebook names referenced but step-by-step instructions for reproduction are not provided. Reader would need to reverse-engineer from text.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 1-2 show attack success rates (e.g., 26%, 0%) as single numbers with no confidence intervals, error bars, or reported variance.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests reported. Differences like '26% → 0%' are presented without p-values or tests of significance.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes reported as percentage-point reductions (e.g., 26% to 0% = 26pp reduction, 31% to 0% = 31pp reduction).", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "Paper tests '1,260 variations of different attacks' derived from 35 prompts × 2 attack categories × 5 variations, but does not justify why this sample size is sufficient or cite power analysis.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Results show single percentages per model/attack type with no reported standard deviation, confidence intervals, or cross-run variance.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Compares fine-tuned models against their non-fine-tuned baselines (Table 1 shows 'Before' and 'After' columns).", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": false, + "justification": "Baselines are the original models themselves, which is appropriate, but paper does not compare against alternative defense methods (only against no defense).", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation study present. Cannot isolate whether improvement comes from structured delimiters alone, adversarial examples alone, or fine-tuning itself.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": false, + "justification": "Single metric used: attack success rate via Levenshtein similarity. No measurement of model utility preservation, output quality on clean inputs, or downstream task performance.", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": false, + "justification": "No human evaluation of fine-tuned model outputs or quality assessment included.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "Paper describes test procedures but does not clearly specify whether held-out novel attacks were tested or if evaluation used the same attack patterns used in fine-tuning (potential overfitting).", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "Results show breakdown by model and attack type (goal hijacking vs prompt leaking) but not by task category (translation, grammar correction, sentiment analysis, summarization) despite mentioning these tasks in fine-tuning.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "Paper notes 'Prompt Leaking 2.86% 2.86%' (no improvement) but does not discuss or analyze cases where defense underperforms.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "Some attacks persist post-fine-tuning (e.g., prompt leaking on multiple models) but results are not emphasized as negative findings; instead framed as minor residuals.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact model versions specified: 'text-davinci-003, text-curie-001, text-babbage-001, and text-ada-001' include version identifiers.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "Example prompts provided (e.g., 'Correct this to standard English: {user input}' and structured prompt format shown in Figure 4), but full set of 35 base prompts is not provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "Paper mentions 'temperature' as a parameter set in JSON configuration but does not report actual temperature values used. Fine-tuning hyperparameters (learning rate, epochs, batch size) are not reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Structured delimiter scaffolding clearly described: PROMPT + <userInput> + USER_INPUT + </userInput> format with detailed explanation of why this helps distinguish instructions from data.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "Paper states datasets were 'augmented with tags and structured into JSONL format' but does not document other preprocessing steps (filtering, deduplication, cleaning, etc.) in detail.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "Adversarial dataset (1,260 attack variations) and fine-tuning dataset are not stated as available. Fine-tuning datasets are from public Kaggle sources, but modifications/structuring not released.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Adversarial dataset construction described: JSON configurations with prompts, attack strings, and parameters; 35 base prompts × 2 attack types × 5 variations. Kaggle dataset sourcing noted.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human subjects involved; N/A.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Pipeline outlined (JSON config → prompt generation → model evaluation → similarity scoring → attack success rate) but lacks detail on intermediate processing, filtering, or data quality checks.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": false, + "answer": false, + "justification": "Not evaluating benchmark contamination; testing adversarial robustness, not model knowledge cutoffs. N/A.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": false, + "answer": false, + "justification": "N/A—not a benchmark evaluation of pre-trained model knowledge.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": false, + "answer": false, + "justification": "N/A—not evaluating pre-training data leakage.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human subjects; N/A.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": false, + "justification": "No inference cost or latency metrics reported for fine-tuned or baseline models.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Paper mentions GPT-3 fine-tuning is 'enormously expensive' and they couldn't afford Davinci, but does not state actual computational budget (cost in dollars, token counts, training time).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Adversarial fine-tuning with structured delimiters (<userInput> tags) reduces prompt injection attack success from 31% to 0% on smaller GPT-3 models (Ada, Babbage, Curie).", + "evidence": "Table 1 shows Goal Hijacking: Babbage 31% → 0%, Ada 26% → 0%, Curie 18% → 0%; Prompt Leaking rates remain low (0-2.86%).", + "supported": "moderate" + }, + { + "claim": "Larger, more capable language models exhibit greater vulnerability to prompt injection attacks.", + "evidence": "Figure 5 shows positive correlation between model size (parameters) and attack success rate; GPT-3 Davinci (175B) 24.28% vs GPT-2 (1.5B) 7.85% goal hijacking success.", + "supported": "strong" + }, + { + "claim": "Model flexibility/capability enables vulnerability; models trained for single tasks are resistant to prompt injection attacks.", + "evidence": "Paper argues that 'smart enough' models to follow arbitrary instructions are vulnerable; GPT-2 generates related responses but doesn't follow adversarial instructions.", + "supported": "moderate" + }, + { + "claim": "Structured input parsing (wrapping user input in delimiter tags) teaches models to distinguish user data from program instructions.", + "evidence": "Proposed method uses <userInput> tags to separate user input; shows effectiveness in Tables 1-2, but mechanism not isolated via ablation.", + "supported": "moderate" + }, + { + "claim": "Prompt injection vulnerabilities exist across multiple LLM architectures: GPT-3, GPT-2, T-5, OPT.", + "evidence": "Table 2 demonstrates attacks succeeded on all tested non-GPT-3 models (GPT-2 7.85% goal hijacking, OPT 45.71%, T-5 8.57%).", + "supported": "strong" + }, + { + "claim": "OpenAI's instruction hierarchy systems and Anthropic's Constitutional AI were influenced by this 2022 fine-tuning research.", + "evidence": "Paper cites that modern approaches 'have since influenced more sophisticated approaches' and lists them, but does not provide detailed evidence of direct influence.", + "supported": "weak" + }, + { + "claim": "Fine-tuning-based defenses show poor generalization to novel/modern attack patterns (many-shot jailbreaking, indirect injection).", + "evidence": "Contemporary Relevance section cites 2024 research showing this limitation, but this is post-hoc external finding, not demonstrated in the paper's experiments.", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "The study demonstrates that adversarial fine-tuning using structured input delimiters can reduce prompt injection attack success rates from 31% baseline to near-zero on smaller GPT-3 models (Ada, Babbage, Curie), with goal hijacking attacks eliminated entirely. A consistent positive correlation is observed between model size and vulnerability to prompt injection across tested architectures (GPT-3, GPT-2, T-5, OPT), suggesting larger, more capable models are inherently more susceptible. However, the approach was only validated on a limited subset of models (cost and computational constraints prevented testing on GPT-3 Davinci and full GPT-2 evaluation), and subsequent research has revealed significant limitations: fine-tuning can reduce safety alignment, modern attacks (many-shot jailbreaking, indirect injection) bypass training-based defenses, and generalization to novel attack patterns is poor.", + "red_flags": [ + { + "flag": "No ablation study", + "detail": "Cannot determine whether improvement comes from structured delimiters, adversarial examples, fine-tuning itself, or their combination. Before/after comparison shows effect exists but not what drives it." + }, + { + "flag": "Limited model coverage for defense", + "detail": "Only fine-tuned 3 of 7 models tested (Ada, Babbage, Curie). Could not afford Davinci due to cost; could not complete GPT-2 due to computational limits. Main scalability questions left unanswered." + }, + { + "flag": "Single evaluation metric", + "detail": "Only attack success rate measured via Levenshtein distance. No assessment of model utility preservation, output quality on clean inputs, or performance degradation on legitimate tasks." + }, + { + "flag": "Unclear train-test separation", + "detail": "Paper does not clearly specify whether test attacks were novel (held-out) or the same attack patterns used during fine-tuning, risking evaluation on memorized patterns." + }, + { + "flag": "No statistical rigor", + "detail": "No confidence intervals, significance tests, standard deviations, or cross-run variance reported. Single-point estimates make it impossible to assess result stability." + }, + { + "flag": "Hyperparameter transparency missing", + "detail": "Fine-tuning hyperparameters (learning rate, epochs, batch size, gradient accumulation) not reported, preventing reproduction." + }, + { + "flag": "Code and data availability unclear", + "detail": "GitHub link points to PromptInject framework, not the authors' modifications. Adversarial test dataset and fine-tuning code availability not explicitly stated." + }, + { + "flag": "Funding and conflicts not disclosed", + "detail": "No funding statement; OpenAI API usage without disclosure of any relationship with OpenAI; no competing interests statement." + }, + { + "flag": "Timing and framing concern", + "detail": "2022 research published in 2025 as 'historical context.' Paper extensively documents limitations discovered by others (2024), which undermines the original contribution rather than clarifying its enduring value." + }, + { + "flag": "Generalization not tested", + "detail": "Fine-tuning on 35 base prompts + attack variations. No evaluation on completely novel attack strategies or task domains not seen during training." + } + ], + "cited_papers": [ + { + "title": "Language Models are Few-Shot Learners", + "relevance": "Foundational GPT-3 paper; establishes the capability baseline and prompting paradigm being attacked." + }, + { + "title": "Ignore Previous Prompt: Attack Techniques For Language Models", + "relevance": "Perez & Ribeiro 2022; introduces PromptInject framework and goal hijacking/prompt leaking attacks that this paper builds directly on." + }, + { + "title": "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing", + "relevance": "Surveys prompt engineering techniques; contextualizes the role of prompts in LLM applications where injection becomes possible." + }, + { + "title": "Generating Textual Adversarial Examples for Deep Learning Models: A Survey", + "relevance": "Broader survey of adversarial attack strategies on NLP models; provides defense strategy framing (adversarial training vs knowledge distillation)." + }, + { + "title": "Language Models are Unsupervised Multitask Learners", + "relevance": "GPT-2 foundational paper; model tested as vulnerability baseline in the study." + }, + { + "title": "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions", + "relevance": "Wallace et al. 2024; cited as modern formalization of the delimiter-based approach this paper pioneered." + }, + { + "title": "Constitutional AI and Harmlessness from AI Feedback", + "relevance": "Anthropic 2024; cited as extending the adversarial training concept to modern systems with preference learning." + }, + { + "title": "SecAlign: Defending Against Prompt Injection with Preference Optimization", + "relevance": "Wang et al. 2024; cited as addressing generalization limitations of fine-tuning-based defenses." + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 1, + "justification": "Methods tested on deprecated GPT-3 models; paper acknowledges limitations make approach unsuitable for modern systems. Practitioners cannot directly apply these findings." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Finding that larger models are more vulnerable is somewhat intuitive in hindsight (more capable = more exploitable). Not surprising or contrarian by current standards." + }, + "fear_safety": { + "score": 2, + "justification": "Raises legitimate concern about LLM security vulnerabilities and the capability-vulnerability tradeoff, but threat landscape has evolved significantly since 2022." + }, + "drama_conflict": { + "score": 0, + "justification": "Straightforward technical paper with no controversy, debate, or dramatic narrative element." + }, + "demo_ability": { + "score": 1, + "justification": "Code and data not clearly released; code references are to Jupyter notebooks on outdated models. Difficult for practitioners to reproduce or try." + }, + "brand_recognition": { + "score": 1, + "justification": "NYU authors with no major lab affiliation or institutional prestige signal. Limited brand recognition compared to papers from OpenAI, Anthropic, DeepMind, etc." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "44784297", + "title": "GHz spiking neuromorphic photonic chip with in-situ training", + "points": 115, + "comments": 18, + "url": "https://news.ycombinator.com/item?id=44784297", + "created_at": "2025-08-04T11:21:05Z" + }, + { + "hn_id": "27945298", + "title": "PettingZoo: Gym for Multi-Agent Reinforcement Learning", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=27945298", + "created_at": "2021-07-24T23:33:19Z" + }, + { + "hn_id": "44650583", + "title": "Safety Evaluations of 20 LLMs", + "points": 1, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=44650583", + "created_at": "2025-07-22T17:41:42Z" + }, + { + "hn_id": "46944301", + "title": "The Case for Contextual Copyleft: Licensing Open Source Training Data and Gener", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46944301", + "created_at": "2026-02-09T11:59:40Z" + }, + { + "hn_id": "44672638", + "title": "Promptomatix: An Automatic Prompt Optimization Framework for LLMs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44672638", + "created_at": "2025-07-24T16:26:59Z" + }, + { + "hn_id": "43587253", + "title": "Generating Medically-Informed Explanations for Depression Detection Using LLMs", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43587253", + "created_at": "2025-04-04T20:23:31Z" + }, + { + "hn_id": "43484067", + "title": "Stealthy Cross-Origin Context Poisoning Attacks Against AI Coding Assistants", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43484067", + "created_at": "2025-03-26T16:38:02Z" + } + ], + "top_points": 115, + "total_points": 122, + "total_comments": 19 + } +} +\ No newline at end of file diff --git a/papers/early-categorization-prompt-2024/scan-v5.json b/papers/early-categorization-prompt-2024/scan-v5.json @@ -0,0 +1,321 @@ +{ + "scan_version": 5, + "paper_type": "survey", + "paper": { + "title": "An Early Categorization of Prompt Injection Attacks on Large Language Models", + "authors": [ + "Sippo Rossi", + "Alisia Marianne Michel", + "Raghava Rao Mukkamala", + "Jason Bennett Thatcher" + ], + "year": 2024, + "venue": "arXiv.org", + "arxiv_id": "2402.00898", + "doi": "10.48550/arXiv.2402.00898" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The abstract claims to provide an overview and categorization of prompt injections and discuss implications — all three are delivered in the paper's body with Tables 2/3 and Sections 5.1–5.3.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": false, + "answer": false, + "justification": "The paper is a descriptive taxonomy with no causal claims of the form 'X causes Y'; no study design question arises.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Authors explicitly state the categorization is 'not exhaustive,' that most attacks were demonstrated on only one or two LLM interfaces (mainly ChatGPT/GPT-3), and that generalizing to other interfaces requires 'moderate to significant altering.'", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": false, + "answer": false, + "justification": "The paper is a taxonomy with no hypothesis testing; there are no empirical findings for which alternative explanations would be relevant.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper documents the existence and structure of attack types; what is measured (documented examples and tests) matches exactly what is claimed (a taxonomy of prompt injection classes).", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 5.5 is explicitly titled 'Limitations' and spans a full paragraph with multiple distinct concerns.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Authors identify specific threats: the categorization is incomplete because new attack types emerge continuously; most examples were demonstrated on ChatGPT/GPT-3 only and may not transfer to other interfaces; some injections could not be verified due to rapid patching.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it does not catalog which injections have been patched, that indirect injections were not empirically tested due to ethical concerns, and that the categorization excludes injections backed by only one source.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "There is no funding acknowledgement or disclosure anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are clearly stated on the first page: Copenhagen Business School and Temple University — both academic institutions with no apparent LLM product connection.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funding is disclosed, so this criterion is not applicable.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no competing interests statement or declaration of financial interests anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "The paper defines 'prompt injection' by analogy to SQL injection in the introduction, and explicitly defines 'direct' vs. 'indirect' prompt injections, as well as each of the 10 subclasses in Tables 2 and 3.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The contribution is stated explicitly in Section 1: '(1) describe, document, and provide a comprehensive list of known types of prompt injections; (2) provide a checklist for developers and end users.'", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The paper's literature review (Section 2.2) directly engages with Perez & Ribeiro (2022), Greshake et al. (2023), Kang et al. (2023), Zou et al. (2023), and Shen et al. (2023), showing how this work extends the prior direct/indirect dichotomy into a finer-grained taxonomy.", + "source": "haiku" + } + } + }, + "type_checklist": { + "survey": { + "search_and_selection": { + "search_strategy_reproducible": { + "applies": true, + "answer": true, + "justification": "The paper describes searching Google Scholar, Google, arXiv, GitHub, Medium, and Twitter/X with the keywords 'prompt injection' and 'jailbreak' over two rounds (May–June 2023 and September 2023), and names specific community sites (jailbreakchat.com, Reddit channels).", + "source": "haiku" + }, + "inclusion_exclusion_explicit": { + "applies": true, + "answer": true, + "justification": "For academic papers: published May 2022–September 2023, in English, discussing prompt injections as adversarial/security threat. For non-academic: must be documented by multiple sources or independently verified by the authors' own tests.", + "source": "haiku" + }, + "prisma_or_structured_protocol": { + "applies": true, + "answer": false, + "justification": "No mention of PRISMA or any other structured review protocol; no flow diagram of paper screening stages is provided.", + "source": "haiku" + }, + "search_terms_provided": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states it used 'prompt injection' and 'jailbreak' as search keywords across all databases.", + "source": "haiku" + }, + "databases_listed": { + "applies": true, + "answer": true, + "justification": "Six databases/platforms are named: Google Scholar, Google, arXiv, GitHub, Medium, and Twitter (X), plus jailbreakchat.com and two Reddit communities.", + "source": "haiku" + }, + "screening_process_documented": { + "applies": true, + "answer": false, + "justification": "The paper reports finding 123 papers in the academic search and then focusing on those discussing prompt injections 'specifically,' but provides no count-by-stage screening table or flow of how 123 papers was reduced to the cited subset.", + "source": "haiku" + }, + "review_scope_justified": { + "applies": true, + "answer": true, + "justification": "The authors justify the temporal scope (from May 2022 onward) by noting that is when prompt injections were first discovered and reported; the topic's novelty and reliance on preprints and non-academic sources is also explained.", + "source": "haiku" + } + }, + "synthesis_quality": { + "conflicting_findings_acknowledged": { + "applies": true, + "answer": false, + "justification": "The paper presents different attack types as complementary categories and never discusses cases where reviewed sources contradicted each other or offered conflicting empirical results.", + "source": "haiku" + }, + "quality_assessment_of_sources": { + "applies": true, + "answer": false, + "justification": "No quality rubric or risk-of-bias assessment is applied to the reviewed papers; the only quality filter for non-academic sources is 'multiple corroborating sources and credible screenshots.'", + "source": "haiku" + }, + "publication_bias_discussed": { + "applies": true, + "answer": false, + "justification": "Publication bias is never mentioned; the paper notes most sources are preprints but does not discuss how that skews the evidence base.", + "source": "haiku" + }, + "quantitative_synthesis_present": { + "applies": true, + "answer": false, + "justification": "The synthesis is purely narrative and taxonomic; no meta-analysis, vote counting, or effect-size aggregation is attempted.", + "source": "haiku" + }, + "recommendations_supported_by_evidence": { + "applies": true, + "answer": true, + "justification": "Developer and end-user recommendations in Sections 5.1–5.2 are directly tied to the identified attack classes (e.g., 'avoid sensitive data in system prompts' follows directly from the documented instruction-manipulation class).", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Prompt injections can be divided into two broad branches: direct (6 classes) and indirect (4 classes), with 17 distinct variations identified.", + "evidence": "Tables 2, 3, and 4 enumerate and define all 17 variations with sources for each.", + "supported": "moderate" + }, + { + "claim": "Direct prompt injections primarily aim to bypass content-filtering safeguards, while indirect injections have broader and more varied cyber-attack-like objectives.", + "evidence": "Objectives column in Tables 2 and 3 consistently shows 'bypass security measures' for direct and data exfiltration/manipulation goals for indirect.", + "supported": "strong" + }, + { + "claim": "Most documented prompt injection attacks have been demonstrated on ChatGPT, GPT-3, or GPT-4, limiting generalizability to other LLM interfaces.", + "evidence": "Target column in Table 4 shows ChatGPT or GPT-3/4 for 14 of 17 examples; acknowledged as limitation in Section 5.5.", + "supported": "strong" + }, + { + "claim": "Developing a fully safe LLM interface against prompt injection is 'difficult if not impossible.'", + "evidence": "Cited as the view of premier AI labs and supported by reference to computational suffix attacks (Zou et al., 2023), but no systematic evidence is presented.", + "supported": "weak" + }, + { + "claim": "Virtual prompt injection can misalign a large share of outputs with a very small number of poisoned training examples.", + "evidence": "Attributed entirely to Yan et al. (2023) without independent verification in this paper.", + "supported": "moderate" + } + ], + "methodology_tags": [ + "qualitative", + "case-study" + ], + "key_findings": "The paper proposes an early taxonomy of 17 prompt injection attack types organized into two branches: 6 classes of direct injections (double character, virtualization, obfuscation, payload splitting, adversarial suffix, instruction manipulation) and 4 classes of indirect injections (active, passive, user-driven, virtual). Direct injections primarily bypass content filters while indirect injections enable data exfiltration, misinformation, social engineering, and training-data poisoning. The review is based on a mixed-method literature survey combining 123 academic papers with non-academic sources (Reddit, jailbreakchat.com), with partial empirical verification on ChatGPT and GPT-3. The authors conclude that fully preventing prompt injection is currently infeasible and recommend defensive design principles analogous to SQL-injection-safe database practices.", + "red_flags": [ + { + "flag": "No PRISMA or structured protocol", + "detail": "The survey methodology is described narratively but follows no recognized systematic review protocol, making it difficult to assess selection bias or reproducibility rigorously." + }, + { + "flag": "Heavy reliance on non-peer-reviewed sources", + "detail": "A significant portion of the evidence base is Reddit posts, blog entries, and jailbreakchat.com, with no quality rubric applied to distinguish reliable from unreliable demonstrations." + }, + { + "flag": "No screening flow or stage counts", + "detail": "123 academic papers are found but no documentation is provided on how many were excluded at each stage, making the final included set opaque." + }, + { + "flag": "Single-platform generalizability", + "detail": "14 of 17 catalogued attack examples target ChatGPT/GPT-3/4 only; the claim that categories apply generally to other LLM interfaces is asserted, not demonstrated." + }, + { + "flag": "No funding disclosure", + "detail": "No acknowledgement or funding statement appears anywhere in the paper." + }, + { + "flag": "No quality assessment of sources", + "detail": "No risk-of-bias or quality rating is applied to reviewed papers, treating a preprint and a peer-reviewed venue paper as equivalent evidence." + } + ], + "cited_papers": [ + { + "title": "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", + "relevance": "Primary academic source for indirect prompt injection taxonomy; directly foundational to the paper's categorization" + }, + { + "title": "Ignore Previous Prompt: Attack Techniques for Language Models", + "relevance": "First systematic academic treatment of goal hijacking and prompt leaking; foundational reference" + }, + { + "title": "Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks", + "relevance": "Introduces payload splitting and obfuscation attacks that form two of the six direct injection classes" + }, + { + "title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", + "relevance": "Source for adversarial suffix attacks — one of the six direct injection classes and a key automated attack vector" + }, + { + "title": "Do Anything Now: Characterizing and Evaluating In-the-Wild Jailbreak Prompts on Large Language Models", + "relevance": "Provides community-level categorization of jailbreak prompts used as prior art for the taxonomy" + }, + { + "title": "Virtual Prompt Injection for Instruction-Tuned Large Language Models", + "relevance": "Sole source for the virtual prompt injection class; empirical evidence that few poisoned examples cause large output shifts" + }, + { + "title": "Multi-Step Jailbreaking Privacy Attacks on ChatGPT", + "relevance": "Prior academic work on jailbreaking used to situate the direct injection categories" + }, + { + "title": "Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection", + "relevance": "Early benchmark for measuring LLM robustness to prompt injection — cited in future-work discussion on standardized tests" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Provides a developer checklist and end-user guidelines directly derived from the taxonomy, though guidance is high-level." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Confirms and organizes known threats rather than presenting surprising or counter-intuitive findings." + }, + "fear_safety": { + "score": 3, + "justification": "Directly concerns AI safety risks: malware generation, data exfiltration, training-data poisoning — all with real demonstrated examples." + }, + "drama_conflict": { + "score": 2, + "justification": "The 'cat-and-mouse' framing and concrete examples (grandma jailbreak, zero-day malware via ChatGPT) carry inherent drama." + }, + "demo_ability": { + "score": 2, + "justification": "Many direct injection examples can be attempted (though patching means success varies); Appendix A lists sources with live prompt examples." + }, + "brand_recognition": { + "score": 1, + "justification": "Authors are from Copenhagen Business School and Temple University — no famous AI lab affiliation, though ChatGPT/GPT-4/Bing AI are named products throughout." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/ecogym-evaluating-llms-2026/scan-v5.json b/papers/ecogym-evaluating-llms-2026/scan-v5.json @@ -0,0 +1,337 @@ +{ + "scan_version": 5, + "paper_type": "benchmark-creation", + "paper": { + "title": "EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies", + "authors": [ + "Xavier Hu", + "Jinxiang Xia", + "Shengze Xu", + "Kangqi Song", + "Yishuo Yuan" + ], + "year": 2026, + "venue": "arXiv", + "arxiv_id": "2602.09514", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "The main abstract claims—no single model dominates across three scenarios, and models show suboptimality in strategy or execution—are directly supported by Table 2 results and the failure mode analysis in Section 4.2.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "The paper makes causal claims (e.g., 'thinking mode catalyzes universal performance elevation', memory modules improve performance) tested via ablations on only 2 models with single runs, insufficient for causal inference given the high variance demonstrated in the stochastic stability analysis.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The conclusion states 'current SOTA LLMs have achieved super-human performance in specific long-horizon economic planning scenarios' based on a single human experiment in one environment (Operation), then broadly frames this as showing 'immense potential for complex economic decision-making' without adequate bounding.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper observes phenomena like inverse scaling (GPT-5-Mini outperforming GPT-5.2 in Freelance) and non-monotonic context window effects without discussing alternative explanations; the failure mode analysis attributes differences to 'strategic prioritization' vs. 'execution efficiency' without considering confounders.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": false, + "justification": "Net Worth, Income, and DAU are used as proxies for 'long-horizon planning capability' without interrogating whether these economic outcomes measure planning specifically versus instruction-following, domain knowledge heuristics, or reactive behavior.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations or threats-to-validity section; the conclusion briefly notes models 'struggle to maintain strategic coherence' but does not systematically address benchmark limitations.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats to validity are discussed; stochastic variance is noted but not framed as a validity threat, and issues like single-run evaluation for Freelance/Operation or limited human sample size are not addressed as threats.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not state explicit scope boundaries about what conclusions cannot be drawn from EcoGym results; no discussion of what EcoGym does NOT measure or settings where results would not transfer.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding disclosure is present; there is no acknowledgments section and no mention of funding sources, despite being a corporate research paper from OPPO.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "The paper is bylined as 'OPPO AI Agent Team' and the corresponding author emails end in @oppo.com, making the corporate affiliation clear.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "OPPO does not appear to evaluate its own proprietary model; all evaluated models are from OpenAI, Google, Anthropic, xAI, Moonshot, MiniMax, and open-weight providers.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is present anywhere in the paper.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": false, + "justification": "'Long-horizon planning' and 'plan-and-execute' are used throughout without precise operational definitions; the formal POMDP task formulation describes mechanics but does not define what constitutes 'planning' as distinct from reactive behavior.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The paper explicitly lists three contributions: an infinite-horizon planning evaluation framework, a utility-guided economic assessment paradigm, and a multi-dimensional empirical analysis of 11 LLMs.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "The related work section explicitly positions EcoGym against Vending Bench v1/v2, HeroBench, GDPval, and broader planning benchmarks, explaining how EcoGym differs (multi-scenario, unified framework, fully open-source).", + "source": "haiku" + } + } + }, + "type_checklist": { + "benchmark-creation": { + "construct_design": { + "construct_validity_argued": { + "applies": true, + "answer": false, + "justification": "The paper argues economic environments require sustained decisions (Principle 2) but does not formally argue why Net Worth/Income/DAU specifically measure 'long-horizon planning capability' versus domain knowledge, memorized heuristics, or short-horizon greedy behavior.", + "source": "haiku" + }, + "difficulty_distribution_characterized": { + "applies": true, + "answer": false, + "justification": "Table 4 tests three inventory-size tiers for Vending (Small/Medium/Large), but no systematic difficulty distribution across benchmark items is characterized and no empirical difficulty calibration is performed.", + "source": "haiku" + }, + "ceiling_floor_effects_checked": { + "applies": true, + "answer": false, + "justification": "Three of eleven models (DeepSeek-v3.2, Grok-4.1-Fast, Kimi-k2) scored exactly 0.00 income in Freelance—a severe floor effect affecting 27% of evaluated models—which is noted via truncated lines in Figure 1 but not analyzed as a discriminability problem.", + "source": "haiku" + }, + "human_baseline_included": { + "applies": true, + "answer": false, + "justification": "Human baselines are provided only for the Operation environment (average DAU 1,404) due to time constraints; Vending and Freelance have no human baselines, leaving the benchmark incompletely grounded, and the human sample size is not reported.", + "source": "haiku" + }, + "scoring_rubric_justified": { + "applies": true, + "answer": false, + "justification": "Net Worth, Income, and DAU are defined with formulas in Appendix B but their choice over alternative metrics (e.g., strategic consistency, recovery rate, action efficiency) is not justified, and edge cases in scoring are not addressed.", + "source": "haiku" + } + }, + "robustness": { + "contamination_resistance_designed": { + "applies": true, + "answer": true, + "justification": "The Freelance environment implements 'Logic Mutation' (refactoring numerical values and variables) and 'Scenario Injection' to prevent memorization, documented in Section 3.2.2; this is a concrete, described anti-contamination measure.", + "source": "haiku" + }, + "temporal_robustness_discussed": { + "applies": true, + "answer": false, + "justification": "The paper describes the benchmark as 'open and extensible' but provides no discussion of whether it will be gamed or obsoleted as models improve, nor any versioning or update plan.", + "source": "haiku" + }, + "failure_modes_discussed": { + "applies": true, + "answer": false, + "justification": "Section 4.2 discusses failure modes of agents within the benchmark but does not discuss failure modes of the benchmark itself—e.g., how agents could game economic metrics without genuine planning, or what capabilities EcoGym fails to measure.", + "source": "haiku" + }, + "baseline_implementations_provided": { + "applies": true, + "answer": true, + "justification": "Code is released at https://github.com/OPPO-PersonalAI/EcoGym and Table 2 provides numerical results for 11 baseline models with specific version identifiers in Appendix A, enabling reproduction.", + "source": "haiku" + } + }, + "documentation": { + "dataset_documentation_complete": { + "applies": true, + "answer": false, + "justification": "Data collection is described at a high level (Perplexity for Vending product data, 8 aggregated datasets for Freelance) but no formal data cards, complete preprocessing documentation, or collection methodology for all environments are provided.", + "source": "haiku" + }, + "licensing_and_access_clear": { + "applies": true, + "answer": false, + "justification": "The code is on GitHub and described as 'open,' but no specific license is stated in the paper, and terms of use for the benchmark are not specified.", + "source": "haiku" + }, + "intended_use_specified": { + "applies": true, + "answer": false, + "justification": "The benchmark is framed for evaluating 'long-horizon plan-and-execute in interactive economies' but the paper does not specify what conclusions should NOT be drawn from EcoGym scores or what use cases are out of scope.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "No single model consistently achieves superior performance across all three EcoGym scenarios", + "evidence": "Table 2 shows Gemini-3-Pro leads in Vending (11274.73), GPT-5-Mini leads in Freelance (2990.72), and Claude-Sonnet-4.5 leads in Operation (1572.49)", + "supported": "strong" + }, + { + "claim": "Models exhibit significant suboptimality in either high-level strategy or efficient action execution", + "evidence": "Failure mode analysis in Section 4.2 identifies strategic prioritization gaps in Operation and execution inefficiency in Vending/Freelance via differential trajectory analysis", + "supported": "moderate" + }, + { + "claim": "Top-tier LLMs have achieved super-human performance in specific long-horizon economic planning scenarios", + "evidence": "Human experts averaged DAU 1,404 in Operation while Claude-Sonnet-4.5 (1572.49), Gemini-3-Pro (1280.75), and others are compared; human sample size not reported and testing limited to ~45 minutes", + "supported": "weak" + }, + { + "claim": "Extending context window beyond 128k does not yield consistent performance gains", + "evidence": "Figure 4 shows Gemini-3-Pro performance degrading monotonically from k=128 to k=1024, while Gemini-3-Flash shows volatile non-monotonic behavior across the same range", + "supported": "moderate" + }, + { + "claim": "Thinking mode universally improves performance across models in long-horizon tasks", + "evidence": "Figure 6 shows DAU improvements for both Gemini-3-Flash (1196→1398) and Gemini-3-Pro (1398→1511) with thinking enabled in Operation, but tested on only 2 models in 1 environment", + "supported": "weak" + }, + { + "claim": "No single memory architecture dominates across models and environments", + "evidence": "Table 3 shows episodic memory best for Gemini-3-Pro (18939 vs. 11274 baseline) but working memory best for Gemini-3-Flash (10099) in Vending", + "supported": "moderate" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "EcoGym introduces three long-horizon economic simulation environments (Vending, Freelance, Operation) for benchmarking LLM agent planning across 11 models, finding that no single model dominates across all scenarios. Floor effects are severe in Freelance (3/11 models score zero income), raising serious discriminability concerns that go unaddressed. Selected top-tier models marginally exceed a limited human baseline in the Operation environment only. Diagnostic studies on context length, thinking mode, and memory modules yield inconsistent and model-dependent results, suggesting no universal architectural solution for long-horizon planning emerges from the study.", + "red_flags": [ + { + "flag": "Severe floor effects in Freelance", + "detail": "3 of 11 models (DeepSeek-v3.2, Grok-4.1-Fast, Kimi-k2) scored exactly 0.00 income in Freelance—27% complete failure rate—which is a serious discriminability problem noted via truncated lines in Figure 1 but never analyzed as such." + }, + { + "flag": "Single-run evaluation for two of three environments", + "detail": "Freelance and Operation main results are reported from a single run each, despite the stochastic stability analysis showing high variance; single-run results in stochastic environments are unreliable as a benchmark basis." + }, + { + "flag": "Incomplete and under-reported human baseline", + "detail": "Human baselines are collected only for Operation (excluded from Vending/Freelance due to time constraints); the Operation human sample size is never reported, and the 'super-human performance' claim rests on this thin evidence." + }, + { + "flag": "No limitations section", + "detail": "The paper contains no dedicated limitations or threats-to-validity section despite multiple methodological choices (single runs, limited human sample, uncharacterized floor effects) that warrant systematic discussion." + }, + { + "flag": "Overbroad super-human performance claim", + "detail": "The conclusion claims LLMs demonstrate 'super-human performance in specific long-horizon economic planning scenarios' based on one environment, undisclosed number of human participants tested for ~45 minutes, with no variance reporting." + }, + { + "flag": "No funding disclosure from corporate authors", + "detail": "This is a corporate paper from OPPO AI Agent Team with no funding disclosure, no acknowledgments section, and no competing interests statement." + } + ], + "cited_papers": [ + { + "title": "Vending-bench: A benchmark for long-term coherence of autonomous agents", + "relevance": "Direct methodological predecessor; EcoGym explicitly builds on and extends Vending Bench's evaluation approach and failure mode analysis methodology" + }, + { + "title": "GDPval: Evaluating AI model performance on real-world economically valuable tasks", + "relevance": "Related economic evaluation benchmark; EcoGym positions itself as complementary with interactive simulated environments vs. GDPval's real-world task approach" + }, + { + "title": "HeroBench: A benchmark for long-horizon planning and structured reasoning in virtual worlds", + "relevance": "Related long-horizon planning benchmark with competitive economic dynamics used for comparison" + }, + { + "title": "RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts", + "relevance": "Long-horizon expert task benchmark with human baselines; cited as representative effort to quantify macroeconomic potential of agents" + }, + { + "title": "Generative Agents: Interactive simulacra of human behavior", + "relevance": "Early work on persistent memory and planning in long-horizon social/economic interactions; foundational to micro-economic execution agent taxonomy" + }, + { + "title": "SWE-bench: Can language models resolve real-world GitHub issues?", + "relevance": "Source dataset used to construct Freelance software development tasks" + }, + { + "title": "LiveCodeBench: Holistic and contamination-free evaluation of large language models for code", + "relevance": "Source dataset for Freelance coding tasks; contamination-free methodology directly relevant to benchmark design choices" + }, + { + "title": "Remote Labor Index: Measuring AI automation of remote work", + "relevance": "Related economic capability benchmark cited to motivate economic grounding of agent evaluation" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Open-source benchmark with runnable code and economic planning scenarios (vending, freelancing, platform ops) that map to real business use cases for LLM agents." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The inverse scaling finding (smaller GPT-5-Mini beating larger GPT-5.2 in Freelance) and non-monotonic context window effects are mildly surprising, but 'no universal winner' is expected given task diversity." + }, + "fear_safety": { + "score": 1, + "justification": "The paper invokes economic impact of autonomous agents and the 'super-human' framing could attract concern, but safety implications are not engaged with in any depth." + }, + "drama_conflict": { + "score": 1, + "justification": "The claim that LLMs surpass human performance in economic planning has mild drama value, but evidence is limited and the paper's framing is measured." + }, + "demo_ability": { + "score": 2, + "justification": "Code released on GitHub with three runnable environments; practitioners could evaluate their own models against the 11 reported baselines." + }, + "brand_recognition": { + "score": 1, + "justification": "OPPO is a major consumer electronics brand but not a leading AI research institution; the evaluated models include recognizable names but OPPO itself has low research brand recognition." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "47149111", + "title": "Security Risks of AI Agents Hiring Humans: An Empirical Marketplace Study", + "points": 1, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=47149111", + "created_at": "2026-02-25T09:00:39Z" + } + ], + "top_points": 1, + "total_points": 1, + "total_comments": 1 + } +} +\ No newline at end of file diff --git a/papers/economics-ai-inference-2025/scan-v5.json b/papers/economics-ai-inference-2025/scan-v5.json @@ -0,0 +1,530 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Beyond Benchmarks: The Economics of AI Inference", + "authors": [ + "Boqin Zhuang", + "Jiacheng Qiao", + "Mingqian Liu", + "Mingxing Yu", + "Ping Hong", + "Rui Li", + "Xiaoxia Song", + "Xiangjun Xu", + "Xu Chen", + "Yaoyao Ma", + "Yujie Gao" + ], + "year": 2025, + "venue": "arXiv", + "arxiv_id": "2510.26136", + "doi": null + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All main abstract claims (framework introduced, cost/quality/performance analyzed, production frontier constructed) are supported by empirical data from WiNEval-3.0 evaluation across 9 models.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": false, + "justification": "Paper makes causal claims (e.g., 'increasing concurrency reduces completion time', 'output token volume causes WiNGPT-3.0 cost') without ablations or controls. Single-environment observational design cannot isolate causality.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "Title 'Beyond Benchmarks' and claims of 'high portability' suggest broad applicability, but evaluation limited to one benchmark (WiNEval-3.0), medical domain, and specific hardware (A800). Generalizations not explicitly bounded to tested setting.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "Each cost difference explained with single narrative (WiNGPT-3.0 = thinking model, Mistral-Small = poor tokenizer for Chinese) without exploring plausible alternatives.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Paper distinguishes WiNEval-3.0 score (measured) from actual clinical performance (claimed). Limitations explicitly state benchmark is 'not entirely equivalent to model's final performance in specific specialized clinical scenarios.'", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 7 'Limitations' provides 5 numbered points addressing training costs exclusion, hardware dependency, benchmark proxy nature, statistical confidence gaps, and upfront CAPEX.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats identified: dependency on 'specific software/hardware stack,' 'proxy nature of benchmark scores,' and 'lack of statistical confidence analysis' requiring future confidence intervals and sensitivity testing.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "Limitations section identifies constraints but main narrative (introduction, conclusion) does not explicitly bound results to medical domain, WiNEval-3.0, or A800 hardware. Generic disclaimers insufficient.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding acknowledgment or financial support statement. Appears to be internal company research without explicit funding disclosure.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": false, + "justification": "Authors affiliated with 'Winning Health AI Research' and evaluate WiNGPT-3.5, WiNGPT-3.0, WiNGPT-2.7 (their own models). This conflict is not disclosed or acknowledged.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": false, + "justification": "If company-funded, funder directly benefits from positive evaluation of WiNGPT models. Result that WiNGPT-3.5 is 'overall leader' directly serves company interests.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement. No declaration of patents, equity stakes, or financial interests. Standard disclosure language absent.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms defined: 'economics of inference' as production function (Section 2), 'quality' as WiNEval-3.0 score, 'cost' via explicit formula, 'performance' as three metrics (Section 5).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Contribution explicitly stated: 'introduces a quantitative economics of inference framework' (abstract), 'proposes systematic framework for quantifying inference costs' (Section 1), with decision-making tool (Section 8).", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": false, + "justification": "Introduction cites prior work (accuracy focus [3], carbon [4], scaling [5]) but does not substantively discuss how this work differs from or builds on them. No dedicated related work section.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository, GitHub link, or code availability statement. Methodology described but no reproducible implementation provided.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": false, + "justification": "WiNEval-3.0 benchmark not publicly released. Paper presents aggregated results only; raw benchmark data unavailable for independent verification.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Hardware specified (A800 80G × 2) but no requirements.txt, Dockerfile, or dependency versions. Vague reference to 'inference services' only; insufficient for reproducibility.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "No step-by-step reproduction guide. Paper explains framework and methodology but not sufficient instructions for independent replication.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "All results are point estimates. Paper acknowledges 'inherent randomness' and dynamic batching variations but does not report confidence intervals or error bars. Explicitly listed as limitation #4.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Comparative claims made without statistical significance testing ('WiNGPT-3.5 is overall leader'). No p-values or hypothesis tests.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": false, + "justification": "Absolute values reported (dollars, scores) but without confidence intervals these cannot be reliably interpreted as effect sizes. Variance not quantified.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "9 models tested, 2,993 requests in WiNEval-3.0. No justification for adequacy. No power analysis provided.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Appendix B shows different concurrency levels but for optimal configuration (Table 2), variance/std dev not reported. Randomness acknowledged but not quantified.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": false, + "justification": "Nine models compared against each other but no external baseline (industry standard, established reference, human expert performance) for medical QA.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Models tested (Llama 2 2023, GLM-4, Qwen3, Mistral-Small) are contemporary and reflect current landscape.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "No ablation study. Tests different concurrency levels but does not isolate component contributions (e.g., prove output volume causes cost).", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Three evaluation dimensions: performance (time, TTFT, throughput), quality (WiNEval score), cost (dollars). Multiple metrics across all dimensions.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "WiNEval-3.0 appears automated; no human evaluation of outputs mentioned. Not applicable to this cost-performance study.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": false, + "justification": "WiNEval-3.0 is evaluation set but no explicit statement it is held out from training. For proprietary models (WiNGPT), training data unknown; potential contamination not addressed.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": false, + "justification": "WiNEval-3.0 covers 10 medical scenarios but results reported as aggregate only. No per-category (exam vs diagnosis vs QC) breakdown.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": false, + "justification": "No failure cases shown or discussed. All models presented as acceptable; no qualitative error examples.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": false, + "justification": "No negative results reported. 'Outliers' framed positively ('thinking model', 'cost-effective'). No genuine negative findings.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": false, + "justification": "Model names given (WiNGPT-3.5, Qwen3-30B) but no snapshot dates, API versions, or commit hashes. Only 'GLM-4-32B-0414' includes date code. Insufficient for reproducibility.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "No actual prompts or system instructions provided. WiNEval-3.0 described as medical QA but prompt templates not shared.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": false, + "justification": "No temperature, top-p, max_tokens, or generation hyperparameters reported. Concurrency (8, 16, 32) is infrastructure parameter, not model hyperparameter.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding apparent. Direct model evaluation without agents or complex pipelines.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "No documentation of preprocessing, filtering, or normalization steps. How 2,993 requests prepared from 10 medical scenarios unexplained.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": false, + "justification": "WiNEval-3.0 not released publicly. Raw inference outputs and performance logs unavailable.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": false, + "justification": "WiNEval-3.0 described as 'derived from real clinical applications' but data collection procedure not detailed. Source and annotation process unknown.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; benchmark evaluation, no human participants.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "Pipeline from raw clinical data to WiNEval-3.0 not documented. Request formatting and processing not explained.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Training data cutoffs not stated for any model. For proprietary and commercial models, cutoff unknown. Critical for medical benchmark validation.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "Potential train/test overlap not discussed. WiNEval-3.0 'derived from real clinical applications' may overlap with publicly available medical Q&A in training corpora.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether WiNEval-3.0 examples were publicly available before model training cutoffs. Medical benchmarks often present in training data.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "Not applicable; no human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Inference cost is primary focus. Reported in dollars per test set and per-unit cost. Tables 2-4 show cost for each model.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Individual model costs reported but total computational budget for entire evaluation not provided.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "WiNGPT-3.5 is the overall leader, providing highest quality (76.2% score) at lowest cost ($0.34)", + "evidence": "Table 2 directly compares all models; WiNGPT-3.5 leads on both score and cost metrics", + "supported": "strong" + }, + { + "claim": "Increasing concurrency from 8 to 48 reduces WiNGPT-3.5 completion time from 2034s to 774s", + "evidence": "Appendix B Table 4 shows concurrency 8→2034s, concurrency 48→774.11s for WiNGPT-3.5", + "supported": "strong" + }, + { + "claim": "Each model has optimal concurrency beyond which overhead and marginal cost-benefit decline", + "evidence": "Section 6.1 and Appendix B show performance inflection points; WiNGPT-3.5 optimal at 48, degrades at 64+", + "supported": "moderate" + }, + { + "claim": "WiNGPT-3.0's high cost ($3.47) results from massive output token volume (4-8x other models)", + "evidence": "Table 2 shows WiNGPT-3.0 output 3.44M tokens vs others 350-800K; attributed to 'thinking model with chains of thought'", + "supported": "moderate" + }, + { + "claim": "Mistral-Small's 2.11M input tokens (vs 1.3M for others) due to less efficient tokenizer for Chinese", + "evidence": "Table 2 data compared; inference made without direct tokenizer testing", + "supported": "weak" + }, + { + "claim": "WiNEval-3.0 exhibits long-tail distribution representative of real-world medical application loads", + "evidence": "Section 4 states this property but provides no quantitative evidence (histogram, Zipf analysis, etc.)", + "supported": "weak" + }, + { + "claim": "Framework enables shift from gut-feeling to data-driven model selection decisions", + "evidence": "Section 8 concludes framework provides 'quantifiable decision-making tool' for GPU investment and model selection", + "supported": "strong" + }, + { + "claim": "Framework is highly portable and can adapt to different hardware platforms by adjusting cost parameters", + "evidence": "Section 8 claims 'high portability: by adjusting core parameters like hourly GPU cost, framework easily adapted'", + "supported": "weak" + } + ], + "methodology_tags": [ + "benchmark-eval", + "observational" + ], + "key_findings": "Paper constructs a cost-quality-performance framework for LLM inference on WiNEval-3.0 (medical benchmark, 2,993 requests). Key findings: WiNGPT-3.5 achieves best cost-effectiveness ($0.34 for 76.2% accuracy); inference time scales non-linearly with concurrency, plateauing after 48 concurrent requests with diminishing returns; output token volume is primary cost driver (WiNGPT-3.0's reasoning overhead costs 10x more than fast baselines). Framework enables data-driven model selection based on business constraints (cost, latency, throughput requirements).", + "red_flags": [ + { + "flag": "Undisclosed conflict of interest", + "detail": "Authors (WiNGPT Team, Winning Health AI Research) evaluate three of their own models (WiNGPT-3.5, 3.0, 2.7) without disclosing this conflict. The conflict is not mentioned; WiNGPT-3.5 declared 'overall leader.'" + }, + { + "flag": "No code or data release", + "detail": "WiNEval-3.0 benchmark not publicly available; no code repository for framework implementation. Evaluation not independently reproducible." + }, + { + "flag": "No statistical confidence intervals", + "detail": "All results reported as point estimates. Paper acknowledges 'inherent randomness' and 'dynamic batching variations' but provides no confidence intervals, error bars, or variance quantification. Listed as acknowledged limitation #4." + }, + { + "flag": "No contamination analysis", + "detail": "Training data cutoffs unknown for most models. Medical benchmarks may overlap with publicly available medical Q&A in training corpora. No discussion of potential data leakage." + }, + { + "flag": "Single benchmark evaluation", + "detail": "Results limited to WiNEval-3.0 (medical domain only). Generalization to other domains, languages, or task types unknown despite title 'Beyond Benchmarks.'" + }, + { + "flag": "No ablation studies", + "detail": "Cannot isolate causes of cost differences. Claim that output token volume causes WiNGPT-3.0's cost is inferred from correlation, not proven causally." + }, + { + "flag": "Overclaimed novelty", + "detail": "Claims 'first LLM Inference Production Frontier' for WiNEval-3.0 only. Framework (cost analysis, Pareto frontiers) uses standard economics; three principles presented are textbook economics concepts, not novel insights." + }, + { + "flag": "Limited hardware/infrastructure scope", + "detail": "Evaluation on single hardware configuration (A800 80G × 2) and presumably vLLM inference engine. Framework claimed 'portable' but not demonstrated on different GPUs, cloud platforms, or inference engines." + } + ], + "cited_papers": [ + { + "title": "Language Models are Few-Shot Learners (GPT-3)", + "relevance": "Foundational LLM work establishing baseline capability and scaling relationships" + }, + { + "title": "Llama 2: Open Foundation and Fine-tuned Chat Models", + "relevance": "Contemporary baseline LLM for cost-quality comparison; reference model for evaluation" + }, + { + "title": "Judging LLM-as-a-Judge with MT-Bench and ChatBot Arena", + "relevance": "LLM evaluation methodology; informs quality metric selection and benchmarking approach" + }, + { + "title": "Carbon Emissions and Large Neural Network Training", + "relevance": "Infrastructure cost and energy analysis; directly relevant to inference cost economics" + }, + { + "title": "Scaling Laws for Neural Language Models", + "relevance": "Establishes relationship between model size, performance, and compute requirements" + }, + { + "title": "Training Compute-Optimal Large Language Models (Chinchilla)", + "relevance": "Compute efficiency and scaling; foundation for cost-performance trade-off analysis" + }, + { + "title": "Efficient Memory Management for LLM Serving with PagedAttention (vLLM)", + "relevance": "Inference optimization technology; likely underlying implementation of evaluation infrastructure" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly applicable to production inference decisions. Addresses real business constraints (cost, latency, throughput) that practitioners face when selecting models and hardware." + }, + "surprise_contrarian": { + "score": 1, + "justification": "Validates known economic trade-offs; no contrarian findings. 'First frontier' claim weak. Mostly confirms industry expectations rather than challenging assumptions." + }, + "fear_safety": { + "score": 0, + "justification": "No safety or alignment concerns raised or addressed. Paper focuses purely on cost-benefit analysis, ignoring robustness or security considerations." + }, + "drama_conflict": { + "score": 1, + "justification": "'Impossible trinity' framing presents standard engineering trade-off language as conflict. Minor narrative drama in WiNGPT-3.0 as 'specialized thinking model' but overall low emotional engagement." + }, + "demo_ability": { + "score": 2, + "justification": "Results demonstrated in tables but no code or data released. Framework explained but replication requires private WiNEval-3.0 benchmark. Limited hands-on demo potential." + }, + "brand_recognition": { + "score": 1, + "justification": "WiNGPT team not affiliated with major academic lab or recognized AI product brand (vs. OpenAI, Anthropic, DeepSeek, Meta). Limited institutional halo effect." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "46714925", + "title": "SlimEdge: Lightweight Distributed DNN Deployment on Constrained Hardware", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46714925", + "created_at": "2026-01-22T03:27:40Z" + } + ], + "top_points": 1, + "total_points": 1, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/edge-memorization-diffusion-2025/scan-v5.json b/papers/edge-memorization-diffusion-2025/scan-v5.json @@ -0,0 +1,370 @@ +{ + "scan_version": 5, + "paper_type": "theoretical", + "paper": { + "title": "On the Edge of Memorization in Diffusion Models", + "authors": [ + "Sam Buchanan", + "Druv Pai", + "Yi Ma", + "Valentin De Bortoli" + ], + "year": 2025, + "venue": "arXiv.org", + "arxiv_id": "2508.17689", + "doi": "10.48550/arXiv.2508.17689" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All abstract claims are backed by paper content: the laboratory is introduced in Section 2, the crossover point is characterized in Theorems 3.1/3.2, and Section 4 validates the phase transition prediction with error < 2×10⁻⁴.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims about model parameterization M causing memorization are tested in a fully controlled synthetic experimental setting where M is systematically varied while all other parameters are held constant, making causal inference appropriate.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Results are explicitly scoped to Gaussian mixture model data and N = poly(d) regime throughout; the conclusion acknowledges extensions needed for 'intrinsic dimensionality or partial data replication' and claims are appropriately modest.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": true, + "justification": "Section 5 explicitly discusses competing theories including landscape-based explanations (Wu et al., Vastola) and implicit bias approaches (Kamb & Ganguli, Niedoba et al.), noting the current work 'disentangles the competing factors' and is 'complementary' to these.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "Memorization is formally defined in Definition 2.2 using a nearest-neighbor distance ratio with explicit constant c = 1/9; the memorization ratio (fraction of memorized samples) is consistently used as the experimental outcome, exactly matching the formal definition.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "There is no dedicated limitations section; Section 6 (Conclusion) contains a brief paragraph on future extensions but reads as future work rather than a systematic limitations assessment.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "The conclusion only vaguely mentions extending to 'larger and more realistic datasets' without specifically discussing why the Gaussian mixture assumption may fail or how phase transition behavior might differ in architecturally realistic settings (U-Nets, attention).", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "Section 2 formally proves that N = poly(d) is the correct scaling regime for distinguishing memorization from generalization, and explicitly shows the regime N = exp[d log d] collapses the distinction via Wasserstein distance arguments.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "Only Yi Ma's funding is disclosed (Simons Foundation-NSF, ONR, NSF, HKU startup); funding for Buchanan, Pai, and De Bortoli is not disclosed despite all being at funded research institutions.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "Author affiliations are disclosed on the title page: TTIC (Buchanan), UC Berkeley (Pai), UC Berkeley/HKU (Yi Ma), and Google DeepMind (De Bortoli).", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "Disclosed funders (NSF, ONR, Simons Foundation) are independent government/academic sources; De Bortoli's Google DeepMind affiliation is a potential undisclosed industry interest but is not a funder.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "There is no competing interests statement; De Bortoli's Google DeepMind affiliation represents a commercially relevant undisclosed interest given DeepMind's stake in generative model research and the paper's copyright/privacy implications.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are formally defined: 'memorization' (Definition 2.2 with explicit nearest-neighbor ratio criterion c=1/9), 'generalization' (Appendix A, statistical learning terms), 'crossover point' (equation 14), 'partially memorizing denoiser' (equation 10), with notation systematically presented in Tables 1 and 2.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The introduction explicitly lists five contributions: (1) memorization laboratory, (2) gradient-descent hypothesis, (3) partially memorizing denoiser construction, (4) theoretical characterization of crossover point, (5) validated predictive model for phase transition.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 5 and Appendix G extensively situate the work relative to statistical physics approaches (Biroli et al.), creativity/generalization theories (Kamb & Ganguli, Niedoba et al.), and empirical memorization detection literature (Zhang et al., Carlini et al.), with specific discussions of similarities and differences.", + "source": "haiku" + } + } + }, + "type_checklist": { + "theoretical": { + "formal_quality": { + "assumptions_stated_explicitly": { + "applies": true, + "answer": true, + "justification": "All theorems explicitly state required assumptions: N = poly(d), minimum cluster separation γ = Θ(d^{1/2}), maximum mean norm = Θ(d), σ²⋆ = Θ(1); Tables 1–2 define all notation; coupling conditions are explicitly stated in each lemma.", + "source": "haiku" + }, + "proofs_complete_or_sketched": { + "applies": true, + "answer": true, + "justification": "The paper states 'All proofs are included in the appendices'; Appendices A–F provide complete proofs of all main results including Lemmas E.1–E.6, Theorems F.1 and F.3, and Propositions F.2 and F.4.", + "source": "haiku" + }, + "bounds_tight_or_discussed": { + "applies": true, + "answer": true, + "justification": "Theorem 3.2 explicitly notes the leading-order coefficient is between 1 and 2; the crossover formula includes constant C ∈ [1, 2] that is acknowledged as a range; Figure 1 shows empirical agreement with approximations validating tightness at moderate dimensions.", + "source": "haiku" + }, + "counterexamples_explored": { + "applies": true, + "answer": true, + "justification": "Section 2 formally analyzes the degenerate regime N = exp[d log d] where memorization and generalization become indistinguishable; Section 4.2 tests the framework on a more complex low-rank Gaussian image model to probe limits of the isotropic theory.", + "source": "haiku" + }, + "notation_consistent": { + "applies": true, + "answer": true, + "justification": "Tables 1 and 2 define all notation systematically; hatted quantities (e.g., L̂_{N,t}) consistently denote theoretical approximations; the same symbols are used consistently across main paper and appendices without overloading.", + "source": "haiku" + }, + "constructive_vs_existence_noted": { + "applies": true, + "answer": true, + "justification": "The crossover point M* is constructively computed in closed form in equation (14) as a linear function of N; both the generalizing and memorizing denoisers are explicitly constructed via Lemma 2.1 and equations (7) and (10) respectively.", + "source": "haiku" + } + }, + "connections": { + "connection_to_practice_discussed": { + "applies": true, + "answer": true, + "justification": "Section 1 explicitly motivates from copyright infringement and data privacy issues in commercial deployments (citing DALL-E 2); the derived M* ≈ (4/5)N formula provides practitioners a concrete threshold for predicting memorization onset.", + "source": "haiku" + }, + "relationship_to_prior_work_clear": { + "applies": true, + "answer": true, + "justification": "Section 5 provides detailed comparisons with statistical physics approaches (Biroli et al. [2024]), implicit bias/creativity theories (Kamb & Ganguli, Niedoba et al., Vastola), and empirical work, with explicit statements about how this work extends, complements, or differs from each.", + "source": "haiku" + }, + "computational_complexity_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not formally analyze computational complexity of training or prediction; Appendix H notes experiments used A100 GPUs and full-batch Adam but provides no complexity analysis of the proposed procedures.", + "source": "haiku" + }, + "limitations_of_formal_model_stated": { + "applies": true, + "answer": true, + "justification": "Section 6 explicitly notes the model needs extension to 'intrinsic dimensionality or partial data replication'; Appendix A discusses why GMMs are used and acknowledges that real image denoisers (U-Nets) fall outside the parameterized GMM class studied.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "Memorization vs. generalization behavior of a trained diffusion model is determined by whether the training loss of a partially memorizing denoiser is lower than that of the generalizing denoiser", + "evidence": "Hypothesis formalized in Section 3 and tested via controlled experiments in Section 4; phase transition location is predicted by the theoretical crossover point with train/test error < 2×10⁻⁴ across 64 (N, d, K) configurations", + "supported": "strong" + }, + { + "claim": "The memorization phase transition crossover point M* is approximately (4/5)N — a linear function of training set size", + "evidence": "Equation (14) derives this analytically; Figure 3 confirms M_pt ≈ (4/5)N empirically across a sweep of (N, d, K) tuples from [50,200]×[30,60]×[3,12]", + "supported": "strong" + }, + { + "claim": "Theoretical loss approximations (Theorems 3.1 and 3.2) agree well with empirical losses even at moderate dimensions", + "evidence": "Figure 1 demonstrates 'a remarkable degree of agreement' between approximations and empirical losses at d=50, K=12, N=200 across the full range of M/N", + "supported": "strong" + }, + { + "claim": "Memorization and generalization are indistinguishable (equivalent) when N = exp[d log d] in high dimensions", + "evidence": "Proven formally in Section 2 via the Weed-Bach theorem: W₂(π⋆, πᴺ⋆) ≤ C₀/√d → 0 in this regime", + "supported": "strong" + }, + { + "claim": "The generalization-to-memorization phase transition persists qualitatively in a low-rank Gaussian model designed to resemble natural images", + "evidence": "Figure 5 shows identical qualitative phase transition behavior in the colored FashionMNIST-template model; however quantitative prediction accuracy in this setting is not reported", + "supported": "moderate" + }, + { + "claim": "Memorization in diffusion models is fundamentally different from classical benign overfitting and double descent", + "evidence": "Appendix G argues the distinction: in diffusion there is exactly one minimal-loss model (the memorizing denoiser) regardless of parameter count, unlike double descent where many interpolating solutions exist at high parameterization", + "supported": "moderate" + } + ], + "methodology_tags": [ + "theoretical" + ], + "key_findings": "The paper introduces a mathematical laboratory using Gaussian mixture model data and denoisers to study memorization vs. generalization in diffusion models, deriving tight theoretical approximations for training losses of memorizing and generalizing denoisers. The central result is that a phase transition from generalization to memorization occurs as model capacity M increases, with the transition point M* ≈ (4/5)N—a linear function of training set size—accurately predicted by a theoretically-derived loss crossover criterion (prediction error < 2×10⁻⁴ across 64 configurations). The framework disentangles model capacity M, data complexity K, problem dimension d, and sample size N, and demonstrates that the qualitative phase transition persists in a more complex low-rank Gaussian model mimicking natural image structure, suggesting the theory captures essential mechanisms beyond the isotropic case.", + "red_flags": [ + { + "flag": "No dedicated limitations section", + "detail": "The paper has no dedicated limitations section; Section 6 briefly mentions future extensions but does not systematically assess threats to validity, failure modes of the theoretical framework, or conditions under which the phase transition prediction would break down." + }, + { + "flag": "Synthetic-only experimental validation", + "detail": "All experimental validation uses synthetic Gaussian mixture data on A100 GPUs; the connection to real diffusion models (U-Nets, attention-based architectures trained on ImageNet or LAION) is asserted but never empirically tested, leaving practical applicability unverified." + }, + { + "flag": "Partially undisclosed funding and interests", + "detail": "Only Yi Ma's funding is disclosed; Buchanan, Pai, and De Bortoli have no disclosed funding. De Bortoli's Google DeepMind affiliation represents an undisclosed commercial interest in research with direct copyright and privacy implications for deployed generative models." + }, + { + "flag": "Unresolved constant C ∈ [1, 2] in crossover formula", + "detail": "The key crossover formula (equation 14) contains an unresolved constant C whose value affects the precision of memorization threshold predictions; while bounded to [1, 2], the specific value is not determined theoretically and is fit from experiments." + } + ], + "cited_papers": [ + { + "title": "Dynamical regimes of diffusion models", + "relevance": "Prior theoretical work on phase transitions in diffusion models using statistical physics (Biroli et al.); most directly related theoretical predecessor that this paper extends with a predictive crossover characterization" + }, + { + "title": "An analytic theory of creativity in convolutional diffusion models", + "relevance": "Complementary theory of generalization in diffusion models (Kamb & Ganguli); the current paper's hypothesis is framed partly in contrast and extension of this approach" + }, + { + "title": "Extracting training data from diffusion models", + "relevance": "Key empirical work on memorization and copyright/privacy concerns (Carlini et al.) that motivates the theoretical study and defines related notions of memorization" + }, + { + "title": "The emergence of reproducibility and generalizability in diffusion models", + "relevance": "Prior empirical work on memorization vs. generalization (Zhang et al.) whose central observations this theory replicates and explains theoretically" + }, + { + "title": "Diffusion probabilistic models generalize when they fail to memorize", + "relevance": "Provides the memorization definition (Definition 2.2) adopted in this paper and key experimental observations the theory must account for (Yoon et al.)" + }, + { + "title": "Learning mixtures of gaussians using the DDPM objective", + "relevance": "Prior theoretical work on Gaussian mixture model diffusion training that this paper directly builds upon (Shah et al.)" + }, + { + "title": "Generalization through variance: how noise shapes inductive biases in diffusion models", + "relevance": "Concurrent theoretical work (Vastola) with different implicit-bias approach to same phenomenon; explicitly compared and contrasted, and a rebuttal paper is cited" + }, + { + "title": "Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance", + "relevance": "Used to prove the key scaling result (Section 2) that N = poly(d) is the correct regime for distinguishing memorization from generalization (Weed & Bach)" + }, + { + "title": "Towards a mechanistic explanation of diffusion model generalization", + "relevance": "Complementary mechanistic approach to generalization in diffusion models (Niedoba et al.), discussed as related work in Section 5" + }, + { + "title": "Denoising score matching with random features: Insights on diffusion models from precise learning curves", + "relevance": "Concurrent theoretical work (George et al.) on trained denoisers in Gaussian settings; directly acknowledged as complementary in Section 5" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 2, + "justification": "Directly addresses copyright and privacy concerns in commercial diffusion model deployments with a concrete prediction formula for memorization onset, but the Gaussian mixture setting limits immediate applicability to practitioners." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The precise linear threshold M* ≈ (4/5)N is a non-obvious quantitative finding, but the qualitative conclusion that larger models memorize more confirms existing intuition; the surprise is mathematical precision, not a reversal of expectations." + }, + "fear_safety": { + "score": 2, + "justification": "Directly addresses copyright infringement and training data privacy in deployed diffusion models (DALL-E 2, Stable Diffusion), issues with active legal and regulatory implications at the time of publication." + }, + "drama_conflict": { + "score": 1, + "justification": "The paper critiques existing heuristic memorization metrics as scientifically inadequate, and Section 5 implicitly positions against competing theories, but there is no prominent controversy framing." + }, + "demo_ability": { + "score": 1, + "justification": "Code is available at github.com/DruvPai/diffusion_mem_gen, but experiments require A100 GPUs and specialized synthetic data generation; not accessible for casual reproduction." + }, + "brand_recognition": { + "score": 2, + "justification": "UC Berkeley and Google DeepMind are high-recognition institutions; Valentin De Bortoli's DeepMind affiliation adds industry credibility to theoretical claims about real deployed models." + } + }, + "hn_data": { + "threads": [ + { + "hn_id": "37367951", + "title": "Transformers as Support Vector Machines", + "points": 251, + "comments": 156, + "url": "https://news.ycombinator.com/item?id=37367951", + "created_at": "2023-09-03T05:30:10Z" + }, + { + "hn_id": "46665309", + "title": "Reverse Engineering the ESP32-C3 Wi-Fi Drivers for Static Worst-Case Analysis", + "points": 8, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=46665309", + "created_at": "2026-01-18T06:27:12Z" + }, + { + "hn_id": "43391891", + "title": "Transformers as Support Vector Machines (2023)", + "points": 3, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43391891", + "created_at": "2025-03-17T19:22:55Z" + }, + { + "hn_id": "43723352", + "title": "The Imitation Game According to Turing", + "points": 2, + "comments": 1, + "url": "https://news.ycombinator.com/item?id=43723352", + "created_at": "2025-04-17T23:28:44Z" + }, + { + "hn_id": "44718857", + "title": "Cascade: LLM-Powered JavaScript Deobfuscator", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44718857", + "created_at": "2025-07-29T03:52:42Z" + }, + { + "hn_id": "43790761", + "title": "User Profiles: The Achilles' Heel of Web Browsers", + "points": 2, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=43790761", + "created_at": "2025-04-25T06:32:45Z" + }, + { + "hn_id": "44184713", + "title": "Polymer: Development Workflows as Software", + "points": 1, + "comments": 0, + "url": "https://news.ycombinator.com/item?id=44184713", + "created_at": "2025-06-04T19:43:49Z" + } + ], + "top_points": 251, + "total_points": 269, + "total_comments": 157 + } +} +\ No newline at end of file diff --git a/papers/editflow-benchmarking-optimizing-2026/scan-v5.json b/papers/editflow-benchmarking-optimizing-2026/scan-v5.json @@ -0,0 +1,509 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows", + "authors": [ + "Chenyan Liu", + "Yun Lin", + "Jiaxin Chang", + "Jiawei Liu", + "Binhang Qi", + "Bo Jiang", + "Zhiyong Huang", + "Jin Song Dong" + ], + "year": 2026, + "venue": "Proc. ACM Program. Lang. (OOPSLA)", + "arxiv_id": "2602.21697", + "doi": "10.1145/3798249" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "All major abstract claims (63.81% accuracy improvement, 75% violation reduction, 66.99% precision boost, 25.11% task speedup) are traced to specific tables (Tables 3–7) and experiments in the paper.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Causal claims are supported by controlled digital twin simulation (identical commit inputs, original vs. w/EditFlow configurations) and a controlled user study comparing treatment vs. control groups across 3 tasks with statistical tests for the user study.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": true, + "justification": "Section 9 (Threats to Validity) explicitly bounds external validity to Python commits on GitHub repositories, acknowledging that generalization to other languages and workflows requires further work.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper discusses failure modes but does not consider alternative explanations for EditFlow's benefit (e.g., simple suggestion reduction reducing cognitive overload regardless of flow reasoning, or selection bias in the annotated dataset).", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper distinguishes flow-aware metrics (Keep/Jump/Revert/Break), flow-independent metrics (Precision/Recall/F0.5), resource metrics, and user study task-completion time as separate measurement levels aligned to different claims.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": true, + "justification": "Section 9 is a dedicated Threats to Validity section covering external, construct, and internal validity in detail.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": true, + "justification": "Specific threats named include: Python-only benchmark, single observed trajectory per commit biasing violation metrics, 1-context sensitivity in EditFlow filtering, LLM stochasticity in order inference, and digital twin's assumption that developers always make correct decisions.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": true, + "justification": "The paper explicitly states scope is limited to Python, GitHub-sourced commits, and that the industrial dataset cannot be released; it notes findings may not generalize to other languages or non-GitHub workflows.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": true, + "justification": "Acknowledgments list specific grants: NSFC (62572300), Singapore MOE (MOE-T2EP20124-0017, MOET32020-0004), NRF, DSO National Laboratories (AISG2-GC-2023-008-1B), and Cyber Security Agency of Singapore.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All author affiliations are disclosed in the header: Shanghai Jiao Tong University, National University of Singapore, and Bytedance Network Technology for co-author Bo Jiang.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": true, + "answer": true, + "justification": "All listed funders are government agencies (China NSFC, Singapore MOE/NRF/DSO) independent of the code editing tools evaluated (Cursor, Claude Code, CoEdPilot); Bytedance is an author affiliation, not a funder.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests or financial interests statement is provided; Bytedance co-author Bo Jiang's potential interest in AI coding tools is not disclosed.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "Key terms are formally defined: Edit Hunk (Def. 1), Pairwise Edit Order with labels {≺, ≻, ∼, ⊥} (Def. 2), Mental Flow Graph (Def. 3), One-Hop Successor (Def. 4), and mental flow (cited from Csikszentmihalyi 1990).", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "Five explicit contributions are enumerated in Section 1: the mental flow framing, prompt auto-tuning strategy, digital twin evaluation framework, empirical demonstration, and VS Code extension implementation.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Section 10 situates EditFlow against static analysis methods (CCDemon, Overwatch, Pyevolve), LLM-based editors (CoditT5, GrACE, SARGAM, CoEdPilot), and developer productivity frameworks (SPACE, DevEx), showing explicit differentiation.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": true, + "justification": "The Data-Availability Statement and repeated references to [3] confirm source code, auto-tuned prompt, dataset, and results are available at sites.google.com/view/editflow (not 'upon request').", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The annotated dataset (100 commits, 2,030 training + 871 test samples) is available at the anonymous website; the industrial dataset is explicitly withheld due to compliance restrictions.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Model version (Claude-Sonnet-4-20250514) and hyperparameters are specified but no requirements.txt, Dockerfile, or dependency list is provided for reproducing the experimental environment.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "The paper refers readers to the anonymous website for the learned prompt and algorithms but provides no step-by-step instructions within the paper sufficient to reproduce experiments without guessing.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Tables 3–6 report point estimates only; the user study (Table 7) reports p-values and effect sizes but no confidence intervals or error bars anywhere in the paper.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "Statistical tests (Mann-Whitney U with permutation testing) are used only for the user study (RQ4); the main technical evaluations in RQ1–RQ3 make comparative claims without any significance testing.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Effect sizes (r derived from Mann-Whitney U statistic) are reported for all user study comparisons; percentage improvements with baseline context are reported for RQ1–RQ3.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The 32-participant user study and 100-commit annotated dataset are not justified with power analysis or sample size rationale; 8 participants per group is not defended.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Tables 3–6 report means with no standard deviation or variance; Table 7 shows individual times enabling variance computation but the paper only reports group averages.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "RQ1 compares against zero-shot, few-shot, hand-crafted prompt, and DSPy; RQ3 compares Cursor/Claude Code/CoEdPilot original vs. w/EditFlow on two benchmarks.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "Baselines include DSPy (2024), Claude Code (v1.0.113), Cursor CLI (2025.09.18), and CoEdPilot (ISSTA 2024), all contemporary and representative systems.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": false, + "justification": "The paper compares original vs. w/EditFlow end-to-end but does not ablate individual EditFlow components (prompt auto-tuning alone, filtering alone, re-ranking alone) to isolate their contributions.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Four metric categories are used: flow-aware (Keep/Jump/Revert/Break), flow-independent (Precision/Recall/F0.5), resource usage (latency/tokens/cost), and user study (task completion time, Mann-Whitney statistics).", + "source": "haiku" + }, + "human_evaluation": { + "applies": true, + "answer": true, + "justification": "RQ4 is a controlled user study with 32 participants completing 3 real-world editing tasks, measuring task completion time and perceived recommendation quality.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "For RQ1, the annotated dataset is split 7:3 at the commit level to prevent intra-commit data leakage, with 871 held-out test samples used for final evaluation.", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Results are broken down by system (Cursor, Claude Code, CoEdPilot) in Tables 5–6 and by task (T1, T2, T3) in Table 7, with per-participant breakdowns provided.", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "Section 8 is entirely devoted to failure analysis, presenting two concrete failure modes (false rejection due to k-context sensitivity, false acceptance of locally coherent but incorrect edits) with specific examples and metric implications.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "Task 1 (p=0.1966) and Task 3 (p=0.2186) show no statistically significant improvement from EditFlow; the paper analyzes why for each task rather than dismissing the null results.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "Exact versions are given: Claude-Sonnet-4-20250514 for prompt tuning, Claude Code Version 1.0.113, Cursor CLI Version 2025.09.18-7ae6800.", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": true, + "justification": "The auto-tuned prompt is available at the anonymous website [3] (sites.google.com/view/editflow); an example edit hunk representation (Table 2) and algorithm pseudocode (Algorithm 1) are provided in the paper.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "Temperature (0.7), max output length (4096), number of epochs (5), and batch size (32) are all reported for the prompt auto-tuning experiment in Section 7.1.1.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": true, + "answer": true, + "justification": "Section 7.3.3 describes how the digital twin interacts with each system: Claude Code SDK and Cursor CLI in headless mode, specific formatting for CoEdPilot's discriminator/locator/generator pipeline.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": true, + "justification": "Commit selection criteria are enumerated (5–10 hunks, ≥2 source files, ASCII-only, no merge commits, no filename changes); annotation process documented (2 independent annotators, 20 min/commit, consensus resolution, 77 person-hours).", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The annotated dataset (commits, edit hunks, pairwise labels) is available at the anonymous website; the industrial dataset is withheld for compliance reasons, noted explicitly.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "Data collection is documented for the annotated set (100 commits from 45 most-starred Python repos, pairwise annotation with inter-annotator agreement) and industrial set (500 commits Jun–Aug 2025, employee consent, anonymization, secure environment).", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": true, + "answer": false, + "justification": "Participants are described as recruited from 2 universities but no recruitment method (posting, course credit, snowball, etc.), compensation, or selection process is stated.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": true, + "justification": "The pipeline from GitHub commit selection → edit hunk extraction → pairwise annotation → train/test split → prompt optimization → digital twin evaluation is described step-by-step across Sections 6–7.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "Claude-Sonnet-4-20250514's training cutoff is not stated; the paper uses the model for both prompt optimization and evaluation on GitHub commits without addressing whether those commits predate the cutoff.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "The paper splits commits to avoid intra-dataset leakage but does not discuss whether the GitHub repository commits used for benchmarking were included in Claude's pretraining data.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "The benchmark draws from top-starred GitHub Python repositories (e.g., kovidgoyal/kitty, getsentry/sentry) that are almost certainly in Claude's training corpus; this is not acknowledged.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": true, + "answer": false, + "justification": "No pre-registration is mentioned for the 32-participant user study.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": true, + "answer": false, + "justification": "No IRB or ethics approval is mentioned despite conducting a human subjects study at two universities.", + "source": "haiku" + }, + "demographics_reported": { + "applies": true, + "answer": true, + "justification": "Age range (20–30), educational level (undergraduate to PhD in CS), programming frequency (4.5 days/week), and prior AI tool experience (90%) are reported in Section 7.4.2.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": true, + "answer": false, + "justification": "Requirements include CS enrollment and completing a pre-study questionnaire, but no formal inclusion/exclusion criteria are stated (e.g., minimum Python experience threshold, familiarity with the tools).", + "source": "haiku" + }, + "randomization_described": { + "applies": true, + "answer": false, + "justification": "The assignment of participants to the four groups (CG1/EG1/CG2/EG2) is never described; it is unknown whether random assignment was used.", + "source": "haiku" + }, + "blinding_described": { + "applies": true, + "answer": false, + "justification": "No blinding is described; participants clearly know whether they are using EditFlow-wrapped or original systems given the VS Code extension interface.", + "source": "haiku" + }, + "attrition_reported": { + "applies": true, + "answer": true, + "justification": "Table 7 shows all 32 participants (P1–P32) with complete data for all 3 tasks, implying no attrition, though dropout is not explicitly addressed.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Tables 5 and 6 report per-query latency (seconds), token usage (K), and monetary cost ($) for each system configuration including the EditFlow overhead.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Per-query costs are reported but total compute budget for running the full set of experiments (500 commits × multiple systems × multiple RQs) is not stated.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "68.81% of AI code edit recommendations from Cursor and Claude Code disrupt developers' mental flow", + "evidence": "Empirical study on 50 manually annotated Python commits using the digital twin framework (Section 5, Table 1): Keep edits are 28.23% (Cursor) and 34.16% (Claude Code), with Break edits dominant at 55.48% and 51.23% respectively", + "supported": "moderate" + }, + { + "claim": "Auto-tuned prompt achieves 63.81% relative improvement in edit order recovery accuracy over best baseline", + "evidence": "Table 3: auto-tuned prompt achieves 87.26% accuracy vs. DSPy's 53.39% best baseline (63.81% relative improvement); consistent across precision (88.01%) and F1 (87.54%)", + "supported": "strong" + }, + { + "claim": "EditFlow reduces flow violations on real-world industrial data by over 75% compared to best baseline", + "evidence": "Table 4: auto-tuned prompt yields 30 violations vs. 121 for hand-crafted prompt (best baseline) on 500 industrial commits from a 60K+ employee IT company", + "supported": "strong" + }, + { + "claim": "EditFlow improves edit recommendation precision by 66.99% on average across systems and benchmarks", + "evidence": "Tables 5–6: Cursor improves from 33.02%→42.42% and 44.05%→53.53%; Claude Code from 40.54%→50.45% and 39.68%→48.96%; CoEdPilot from 14.78%→35.50% and 10.00%→26.39%", + "supported": "strong" + }, + { + "claim": "EditFlow leads to 25.11% faster task completion in a controlled user study with 32 developers", + "evidence": "Table 7: aggregate average across groups and tasks; statistically significant on Task 2 (p=0.0004, r=0.788 for EG1 vs CG1; p=0.0004, r=0.840 for EG2 vs CG2) but not Task 1 or Task 3", + "supported": "moderate" + }, + { + "claim": "EditFlow is effective specifically for complex tasks requiring deep codebase understanding, not simple refactoring", + "evidence": "Task 2 (hard, cross-file cascading change): strong significant improvement; Task 3 (uniform refactoring): no significant improvement (p=0.2186); explicitly analyzed as boundary conditions in Section 7.4.6", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval", + "case-study", + "observational" + ], + "key_findings": "EditFlow addresses the disconnect between AI code editing accuracy and developer productivity by framing the problem as mental flow alignment. A prompt auto-tuning strategy achieves 87.26% accuracy in recovering pairwise edit orders, outperforming zero-shot, few-shot, hand-crafted, and DSPy approaches by 63.81% relative. Wrapping existing AI coding assistants (Cursor, Claude Code, CoEdPilot) with EditFlow's flow-aware filter improves recommendation precision by 66.99% on average and reduces flow violations by 75%+ on industrial data. A 32-participant user study confirms 25.11% faster task completion, with strongest gains on complex multi-file tasks and no significant benefit on uniform refactoring tasks where any edit order is cognitively valid.", + "red_flags": [ + { + "flag": "Tiny user study groups", + "detail": "8 participants per group (4 groups, 32 total) is far too small for reliable subgroup analysis or generalization; the overall 25.11% speedup conflates results from asymmetric task difficulties and heterogeneous systems." + }, + { + "flag": "Randomization not described", + "detail": "Section 7.4.2 does not describe how participants were assigned to the four groups (CG1/EG1/CG2/EG2), making it impossible to assess selection bias." + }, + { + "flag": "No IRB or ethics disclosure", + "detail": "A human subjects study at two universities with screen recordings and interaction logging is conducted without any mention of ethics review or participant consent beyond the industrial data note." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "The benchmark uses top-starred GitHub Python repos (kovidgoyal/kitty, getsentry/sentry, etc.) almost certainly included in Claude's pretraining corpus; Claude is also used to infer ground-truth edit orders for evaluation." + }, + { + "flag": "No confidence intervals or significance tests for main results", + "detail": "Tables 3–6 report point estimates only; the large precision improvements (e.g., CoEdPilot: 14.78%→35.50%) have no variance, CI, or statistical test, making their reliability unassessable." + }, + { + "flag": "No component ablation", + "detail": "EditFlow has three interacting components (prompt auto-tuning, digital twin, flow-aware filtering); the paper only evaluates the full system vs. original, making it impossible to attribute gains to individual components." + }, + { + "flag": "Digital twin as ground truth", + "detail": "For RQ3, the auto-tuned prompt itself generates the flow graph used as ground truth for evaluation, creating circularity: EditFlow uses the same prompt for filtering that defines what 'Keep' means in the evaluation metrics." + } + ], + "cited_papers": [ + { + "title": "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity", + "relevance": "Key citation establishing the 19% productivity slowdown with AI assistance that motivates the entire EditFlow framing; RCT by Becker et al. 2025" + }, + { + "title": "CoEdPilot: Recommending Code Edits with Learned Prior Edit Relevance, Project-wise Awareness, and Interactive Nature", + "relevance": "Primary academic baseline system for subsequent edit recommendation; prior work by the same first author (Liu et al. 2024, ISSTA)" + }, + { + "title": "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines", + "relevance": "Baseline prompt optimization framework compared against EditFlow's auto-tuning approach" + }, + { + "title": "The SPACE of Developer Productivity: There's more to it than you think", + "relevance": "Industry-standard framework positioning Flow as one of five dimensions of developer productivity, used to ground the mental flow construct" + }, + { + "title": "DevEX: What actually drives productivity?", + "relevance": "Framework identifying flow state as a core driver of developer productivity, used alongside SPACE to justify the mental flow framing" + }, + { + "title": "The cost of interrupted work: more speed and stress", + "relevance": "Empirical evidence that interruptions require 23 minutes 15 seconds recovery time, providing quantitative grounding for why flow disruptions matter" + }, + { + "title": "Grace: Language Models Meet Code Edits", + "relevance": "Prior work on incorporating prior edits as context for code edit recommendation, directly related to EditFlow's problem space" + }, + { + "title": "CodePlan: Repository-level coding using LLMs and planning", + "relevance": "Related approach using LLMs with static analysis for reasoning over code changes across files" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Implements an actual VS Code extension wrapping Cursor, Claude Code, and CoEdPilot; addresses the real gap between benchmark accuracy and developer productivity that affects everyday coding tool use." + }, + "surprise_contrarian": { + "score": 3, + "justification": "Directly challenges the assumption that higher benchmark accuracy leads to better developer outcomes, citing a controlled trial showing 19% slowdown; reframes the problem from accuracy to cognitive flow." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety or risk concerns raised; the paper is focused on productivity optimization." + }, + "drama_conflict": { + "score": 2, + "justification": "Explicitly demonstrates that Cursor and Claude Code—the dominant commercial tools—disrupt mental flow in the majority of recommendations, which challenges their marketing claims." + }, + "demo_ability": { + "score": 3, + "justification": "VS Code extension is implemented and available with demonstration videos at the anonymous website; practitioners can immediately install and try EditFlow with their existing Cursor or Claude Code setup." + }, + "brand_recognition": { + "score": 2, + "justification": "Directly evaluates Claude Code and Cursor (the two most prominent AI coding tools in 2025–2026) as the systems under test, lending immediate relevance to the broader developer community." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/papers/effective-lora-adapter-2026/scan-v5.json b/papers/effective-lora-adapter-2026/scan-v5.json @@ -0,0 +1,506 @@ +{ + "scan_version": 5, + "paper_type": "empirical", + "paper": { + "title": "Effective LoRA Adapter Routing using Task Representations", + "authors": [ + "Akash Dhasade", + "Anne-Marie Kermarrec", + "Igor Pavlovic", + "Diana Petrescu", + "Rafael Pires", + "Mathis Randl", + "Martijn de Vos" + ], + "year": 2026, + "venue": "arXiv.org", + "arxiv_id": "2601.21795", + "doi": "10.48550/arXiv.2601.21795" + }, + "checklist": { + "claims_and_evidence": { + "abstract_claims_supported": { + "applies": true, + "answer": true, + "justification": "Abstract claims of 101.2% Oracle performance and +5.2-point OOD improvement over LORARETRIEVER are directly supported by Figure 2 and Table 6; the 1500+ adapter scaling result is confirmed in Table 8.", + "source": "haiku" + }, + "causal_claims_justified": { + "applies": true, + "answer": true, + "justification": "Ablation studies in Table 2 isolate the retrieval and composition components independently, providing adequate support for causal claims about which components drive improvements.", + "source": "haiku" + }, + "generalization_bounded": { + "applies": true, + "answer": false, + "justification": "The paper makes broad claims about 'scalable routing for open-ended LoRA serving' but evaluates only on LLaMA2-7B/13B with a single FLANV2-derived benchmark; generalization to other base models, modalities, or task distributions is not empirically validated.", + "source": "haiku" + }, + "alternative_explanations_discussed": { + "applies": true, + "answer": false, + "justification": "The paper does not consider whether gains stem from the sentence encoder quality, the specific benchmark structure, or the particular adapter training setup rather than the task-level routing paradigm itself.", + "source": "haiku" + }, + "proxy_outcome_distinction": { + "applies": true, + "answer": true, + "justification": "The paper uses task-specific metrics (EM, BLEU, ROUGE) appropriate to each task type and employs oracle-normalized aggregation rather than conflating these into a single undifferentiated score.", + "source": "haiku" + } + }, + "limitations_and_scope": { + "limitations_section_present": { + "applies": true, + "answer": false, + "justification": "The paper has an 'Impact Statement' section discussing broader societal impacts but no dedicated limitations or threats-to-validity section addressing technical constraints.", + "source": "haiku" + }, + "threats_to_validity_specific": { + "applies": true, + "answer": false, + "justification": "No specific threats are discussed—the small test set of 50 samples per task, the potential contamination of FLAN data in LLaMA2 pretraining, and the restriction to a single benchmark are not acknowledged.", + "source": "haiku" + }, + "scope_boundaries_stated": { + "applies": true, + "answer": false, + "justification": "The paper does not state explicit scope boundaries, such as that results apply only to LLaMA2-class models, only to NLP tasks, or only to FLAN-style benchmarks.", + "source": "haiku" + } + }, + "conflicts_of_interest": { + "funding_disclosed": { + "applies": true, + "answer": false, + "justification": "No funding source is mentioned anywhere in the paper.", + "source": "haiku" + }, + "affiliations_disclosed": { + "applies": true, + "answer": true, + "justification": "All authors are listed as affiliated with EPFL, Lausanne, Switzerland, disclosed in the author line.", + "source": "haiku" + }, + "funder_independent_of_outcome": { + "applies": false, + "answer": false, + "justification": "No funder is disclosed, so independence cannot be assessed.", + "source": "haiku" + }, + "financial_interests_declared": { + "applies": true, + "answer": false, + "justification": "No competing interests statement or declaration of financial interests (patents, equity, consulting) is present.", + "source": "haiku" + } + }, + "scope_and_framing": { + "key_terms_defined": { + "applies": true, + "answer": true, + "justification": "LoRA, adapter routing, task representations, non-OOD, OOD, and semi-OOD are all formally defined in Sections 2 and 3.", + "source": "haiku" + }, + "intended_contribution_clear": { + "applies": true, + "answer": true, + "justification": "The contributions are explicitly enumerated: training-free black-box routing, O(T) efficiency via task-level routing, and Successive Halving for adapter selection.", + "source": "haiku" + }, + "engagement_with_prior_work": { + "applies": true, + "answer": true, + "justification": "Table 1 provides a structured comparison of LORAUTER against five prior routing approaches along key dimensions, and Section 5 situates the work within MoE, model routing, and task-representation literature.", + "source": "haiku" + } + } + }, + "type_checklist": { + "empirical": { + "artifacts": { + "code_released": { + "applies": true, + "answer": false, + "justification": "No code repository is linked or mentioned in the paper; only the sentence encoder from HuggingFace (https://huggingface.co/Styxxxx/lora_retriever) is cited as a reused artifact.", + "source": "haiku" + }, + "data_released": { + "applies": true, + "answer": true, + "justification": "The evaluation uses publicly available FLANV2 benchmark data and HuggingFace public adapters (1567 retrieved from the wild), both standard public resources used unmodified.", + "source": "haiku" + }, + "environment_specified": { + "applies": true, + "answer": false, + "justification": "Only bfloat16 precision and LoRA rank/alpha hyperparameters are mentioned; no requirements file, Docker image, or full dependency specification is provided.", + "source": "haiku" + }, + "reproduction_instructions": { + "applies": true, + "answer": false, + "justification": "Algorithm 1 provides pseudocode for Successive Halving but no end-to-end instructions for reproducing experiments including data preparation, adapter training, or evaluation pipeline.", + "source": "haiku" + } + }, + "statistical_methodology": { + "confidence_intervals_or_error_bars": { + "applies": true, + "answer": false, + "justification": "Standard deviation is reported only for the SH efficiency comparison (Figure 10) across 100 runs; main comparison results in Figure 2 and Table 6 report no uncertainty estimates.", + "source": "haiku" + }, + "significance_tests": { + "applies": true, + "answer": false, + "justification": "No statistical significance tests are applied to any of the comparative results despite the paper making multiple ranking claims across methods.", + "source": "haiku" + }, + "effect_sizes_reported": { + "applies": true, + "answer": true, + "justification": "Normalized performance percentages and point differences (e.g., +5.2 points over LORARETRIEVER in OOD) are reported throughout with explicit baseline context.", + "source": "haiku" + }, + "sample_size_justified": { + "applies": true, + "answer": false, + "justification": "The test set of 50 samples per task is adopted from Zhao et al. (2024) without discussion of whether this is sufficient for reliable per-task estimates.", + "source": "haiku" + }, + "variance_reported": { + "applies": true, + "answer": false, + "justification": "Variance across runs is reported only for the SH budget experiment (Figure 10); main results tables contain point estimates only.", + "source": "haiku" + } + }, + "evaluation_design": { + "baselines_included": { + "applies": true, + "answer": true, + "justification": "Four baselines are included: LORAHUB, LORARETRIEVER, ARROW, and SpectR, plus an oracle task-aligned upper bound.", + "source": "haiku" + }, + "baselines_contemporary": { + "applies": true, + "answer": true, + "justification": "All baselines are recent (ICLR 2024, COLM 2024, FindingsACL 2024, ICML 2024, COLM 2025), representing the current state of the field.", + "source": "haiku" + }, + "ablation_study": { + "applies": true, + "answer": true, + "justification": "Table 2 ablates retrieval and composition components independently by swapping LORARETRIEVER and LORAUTER components; Table 3 ablates K=1 vs K=3 fusion.", + "source": "haiku" + }, + "multiple_metrics": { + "applies": true, + "answer": true, + "justification": "Task-appropriate metrics are used: EM for classification, BLEU for translation, ROUGE-1/2/L for generation tasks, aggregated via oracle-normalized average.", + "source": "haiku" + }, + "human_evaluation": { + "applies": false, + "answer": false, + "justification": "Human evaluation is not relevant for adapter routing on established NLP benchmarks with automated metrics.", + "source": "haiku" + }, + "held_out_test_set": { + "applies": true, + "answer": true, + "justification": "Routing uses small validation sets (up to 200 samples) while final evaluation uses disjoint held-out test sets of 50 samples per task, consistent with Zhao et al. (2024).", + "source": "haiku" + }, + "per_category_breakdown": { + "applies": true, + "answer": true, + "justification": "Tables 11-18 provide per-task breakdowns across all 48 tasks grouped by category (struct-to-text, translation, commonsense, sentiment, reading comp, NLI, etc.).", + "source": "haiku" + }, + "failure_cases_discussed": { + "applies": true, + "answer": true, + "justification": "The paper discusses that selection-based methods 'collapse' in OOD/Semi-OOD settings, and notes spectral routing methods perform worse because parameter values carry insufficient routing signal.", + "source": "haiku" + }, + "negative_results_reported": { + "applies": true, + "answer": true, + "justification": "The paper reports that using too many or too few K-Means clusters degrades performance, that K=2 outperforms K=3 on some metrics, and that the HF-only adapter pool reduces performance vs. curated adapters.", + "source": "haiku" + } + }, + "setup_transparency": { + "model_versions_specified": { + "applies": true, + "answer": true, + "justification": "LLaMA2-7B and LLaMA2-13B are specified with HuggingFace reference (meta-llama/Llama-2-7b-hf); the sentence encoder URL is provided (https://huggingface.co/Styxxxx/lora_retriever).", + "source": "haiku" + }, + "prompts_provided": { + "applies": true, + "answer": false, + "justification": "The embedding instruction is quoted ('Represent the sentence for similar task retrieval') and Alpaca format is referenced, but full prompts used for task evaluation are not provided.", + "source": "haiku" + }, + "hyperparameters_reported": { + "applies": true, + "answer": true, + "justification": "LoRA rank r=6, scaling α=12, softmax temperature τ=0.2, K=3 adapters for fusion, and SH parameters (η, γ, R, warmup k) are reported.", + "source": "haiku" + }, + "scaffolding_described": { + "applies": false, + "answer": false, + "justification": "No agentic scaffolding is involved; this is standard inference with composed adapter weights.", + "source": "haiku" + }, + "data_preprocessing_documented": { + "applies": true, + "answer": false, + "justification": "The paper states it uses FLANV2 tasks and Alpaca instruction format but does not document the full preprocessing pipeline for constructing the 48-task evaluation set from FLANV2.", + "source": "haiku" + } + }, + "data_integrity": { + "raw_data_available": { + "applies": true, + "answer": true, + "justification": "The underlying benchmark (FLANV2 subset from Zhao et al. 2024) uses publicly available datasets; the 1567 HuggingFace adapters are publicly accessible.", + "source": "haiku" + }, + "data_collection_described": { + "applies": true, + "answer": true, + "justification": "The benchmark construction is described: 48 tasks from FLANV2, 200 validation samples per task, 50 held-out test samples; HF adapters filtered by rank ≤64 for LLaMA2-7B.", + "source": "haiku" + }, + "recruitment_methods_described": { + "applies": false, + "answer": false, + "justification": "No human participants; benchmark data uses standard NLP datasets.", + "source": "haiku" + }, + "data_pipeline_documented": { + "applies": true, + "answer": false, + "justification": "The evaluation pipeline is described conceptually but the full FLANV2 → 48-task subset derivation, adapter training procedure, and validation/test split construction are not fully documented.", + "source": "haiku" + } + }, + "contamination": { + "training_cutoff_stated": { + "applies": true, + "answer": false, + "justification": "LLaMA2's training data cutoff is not stated, though the model's pretraining on FLAN-style data could affect evaluation on FLANV2-derived tasks.", + "source": "haiku" + }, + "train_test_overlap_discussed": { + "applies": true, + "answer": false, + "justification": "No discussion of whether FLANV2 tasks or their test splits were included in LLaMA2's pretraining corpus, which is a real concern for exact-match evaluation tasks.", + "source": "haiku" + }, + "benchmark_contamination_addressed": { + "applies": true, + "answer": false, + "justification": "FLANV2 tasks were publicly available before LLaMA2's training cutoff; potential contamination is not acknowledged or addressed.", + "source": "haiku" + } + }, + "human_studies": { + "pre_registered": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "irb_or_ethics_approval": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "demographics_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "inclusion_exclusion_criteria": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "randomization_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "blinding_described": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + }, + "attrition_reported": { + "applies": false, + "answer": false, + "justification": "No human participants.", + "source": "haiku" + } + }, + "cost_and_practicality": { + "inference_cost_reported": { + "applies": true, + "answer": true, + "justification": "Table 1 reports routing overhead complexity (O(T) vs O(N) vs O(NL)), and Section 4.5 and Figure 5 quantify the compute budget (adapter evaluations) for adapter selection under SH.", + "source": "haiku" + }, + "compute_budget_stated": { + "applies": true, + "answer": false, + "justification": "Total GPU hours or compute cost for running all experiments is not reported.", + "source": "haiku" + } + } + } + }, + "claims": [ + { + "claim": "LORAUTER achieves 101.2% of Oracle task-aligned performance in non-OOD settings on LLaMA2-7B, effectively matching the upper bound of always selecting the perfect adapter.", + "evidence": "Figure 2 and Table 6 show normalized average performance of 101.2% for LORAUTER vs 100% oracle on LLaMA2-7B non-OOD; confirmed by Table 11 per-task results.", + "supported": "strong" + }, + { + "claim": "LORAUTER outperforms the strongest baseline (LORARETRIEVER) by +5.2 points in OOD settings on LLaMA2-7B.", + "evidence": "Figure 2 shows 88.4% (LORAUTER) vs 83.2% (LORARETRIEVER) in OOD on LLaMA2-7B; similar gap on LLaMA2-13B (86.8% vs 85.9%).", + "supported": "strong" + }, + { + "claim": "Task-level routing scales more efficiently than adapter-level routing, with O(T) complexity where T < N.", + "evidence": "Table 1 compares complexity across methods; empirically demonstrated by maintaining competitive performance with 1500+ adapters where O(N) methods become infeasible.", + "supported": "moderate" + }, + { + "claim": "Successive Halving reduces the adapter evaluation budget by more than 2x compared to uniform selection with negligible performance loss.", + "evidence": "Figure 5 and Figure 10 show SH reaches near-peak normalized performance (~0.95) at roughly half the evaluation budget of uniform selection, across 100 independent runs with std. dev. reported.", + "supported": "strong" + }, + { + "claim": "LORAUTER scales to 1500+ heterogeneous 'wild' adapters from HuggingFace, achieving 85.7% normalized performance (vs 88.4% with curated adapters) in OOD settings.", + "evidence": "Table 7 and Table 8 report per-task and aggregate results for HF-only and HF+48 adapter pools, showing competitive performance despite no curated adapters.", + "supported": "strong" + }, + { + "claim": "Both the retrieval and composition components of LORAUTER independently contribute to performance gains over LORARETRIEVER.", + "evidence": "Table 2 shows: LR retrieval + LA composition = 98.6% (non-OOD 7B); LA retrieval + LR composition = 96.7%; both together = 101.2%, vs LR+LR baseline of 92.9%.", + "supported": "strong" + } + ], + "methodology_tags": [ + "benchmark-eval" + ], + "key_findings": "LORAUTER is a training-free LoRA adapter routing framework that routes queries through task representations rather than directly to adapters, requiring only small validation sets and no adapter training data. In non-OOD settings it matches or slightly exceeds oracle performance (101.2%) by composing complementary task-relevant adapters with input-aware weighted fusion. In OOD settings it outperforms the best prior method (LORARETRIEVER) by 5.2 percentage points on LLaMA2-7B. The Successive Halving strategy reduces adapter evaluation cost by more than 2x while maintaining near-peak selection quality, and the framework remains effective when scaled to pools of 1500+ heterogeneous public HuggingFace adapters.", + "red_flags": [ + { + "flag": "No significance tests", + "detail": "All comparative claims are presented as point estimates without statistical significance testing; given 50-sample test sets, many differences may not be statistically distinguishable." + }, + { + "flag": "No code release", + "detail": "No repository or implementation is shared, making independent reproduction impossible beyond the algorithmic description." + }, + { + "flag": "No limitations section", + "detail": "The paper has no dedicated limitations or threats-to-validity section; the Impact Statement discusses societal concerns but not methodological constraints." + }, + { + "flag": "Single benchmark", + "detail": "All experiments use the same FLANV2-derived 48-task benchmark from Zhao et al. (2024); generalization to other domains, modalities, or base models is unvalidated." + }, + { + "flag": "Benchmark contamination unaddressed", + "detail": "FLANV2 tasks were available before LLaMA2's training cutoff; potential overlap between training data and evaluation benchmarks is not acknowledged." + }, + { + "flag": "Small per-task test sets", + "detail": "With only 50 held-out samples per task, individual task results (e.g., EM scores that change by 2-4 points) may reflect noise rather than true method differences." + } + ], + "cited_papers": [ + { + "title": "LoraRetriever: Input-aware LoRA Retrieval and Composition for Mixed Tasks in the Wild", + "relevance": "Primary baseline and benchmark source; LORAUTER directly compares against and extends this work on adapter routing" + }, + { + "title": "LoRA: Low-Rank Adaptation of Large Language Models", + "relevance": "Foundational method that LORAUTER builds upon for parameter-efficient fine-tuning" + }, + { + "title": "LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition", + "relevance": "Key baseline for adapter composition using learned fusion weights" + }, + { + "title": "Towards Modular LLMs by Building and Reusing a Library of LoRAs (ARROW)", + "relevance": "Spectral routing baseline; representative of parameter-space routing approaches" + }, + { + "title": "Mixture of LoRA Experts (MoLE)", + "relevance": "MoE-style baseline requiring training data for routing" + }, + { + "title": "AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models", + "relevance": "Baseline approach for adapter composition via weight averaging" + }, + { + "title": "Finetuned Language Models are Zero-Shot Learners (FLAN)", + "relevance": "Source of the evaluation benchmark used throughout experiments" + }, + { + "title": "SpectR: Dynamically Composing LM Experts with Spectral Routing", + "relevance": "Recent spectral routing baseline evaluated as a competitor" + }, + { + "title": "Non-stochastic Best Arm Identification and Hyperparameter Optimization (Successive Halving)", + "relevance": "Core algorithm adopted for efficient adapter selection within LORAUTER" + } + ], + "engagement_factors": { + "practical_relevance": { + "score": 3, + "justification": "Directly applicable to any practitioner using the 2300+ public LoRA adapters on HuggingFace without access to training data." + }, + "surprise_contrarian": { + "score": 1, + "justification": "The task-level routing insight is logical and incrementally novel rather than surprising; the >oracle result (101.2%) is mildly interesting." + }, + "fear_safety": { + "score": 0, + "justification": "No AI safety concerns raised beyond a brief mention of inherited biases in the Impact Statement." + }, + "drama_conflict": { + "score": 1, + "justification": "Mild competitive framing against LORARETRIEVER with clear margin claims, but no broader controversy." + }, + "demo_ability": { + "score": 2, + "justification": "The framework could be tried with public HuggingFace adapters, though no code is released to lower the barrier." + }, + "brand_recognition": { + "score": 1, + "justification": "EPFL is a well-regarded research institution but not a top-tier AI lab; no famous authors or product associations." + } + }, + "hn_data": { + "threads": [], + "top_points": 0, + "total_points": 0, + "total_comments": 0 + } +} +\ No newline at end of file diff --git a/scripts/build-explorer-data.py b/scripts/build-explorer-data.py @@ -514,34 +514,42 @@ def build(): # --- Registry pipeline stats --- reg_total = len(registry) - reg_status = Counter(e.get("status", "unknown") for e in registry.values()) - has_text = sum(1 for e in registry.values() - if (PAPERS_DIR / e["id"] / "paper.txt").exists()) - v1_count = 0 - for scan_path in PAPERS_DIR.glob("*/scan.json"): - pid = scan_path.parent.name - if pid not in {p["id"] for p in papers_full}: - with open(scan_path) as f: - s = json.load(f) - if s.get("scan_version", 1) < 2: - v1_count += 1 - - total_scanned = len(papers_full) - non_empirical_count = total_scanned - total_papers - v3_count = sum(1 for p in papers_full if p.get("engagement_factors") is not None) - v2_only = total_papers - v3_count + v5_opus = 0 + v5_haiku = 0 + deprecated_scan = 0 + not_scanned = 0 + for e in registry.values(): + pid = e["id"] + v5_path = PAPERS_DIR / pid / "scan-v5.json" + old_path = PAPERS_DIR / pid / "scan.json" + if v5_path.exists(): + with open(v5_path) as f: + v5 = json.load(f) + # Check if any answers have source="opus" + has_opus = False + for cat_data in v5.get("checklist", {}).values(): + if isinstance(cat_data, dict): + for qd in cat_data.values(): + if isinstance(qd, dict) and qd.get("source") == "opus": + has_opus = True + break + if has_opus: + break + if has_opus: + v5_opus += 1 + else: + v5_haiku += 1 + elif old_path.exists(): + deprecated_scan += 1 + else: + not_scanned += 1 pipeline = { "registry_total": reg_total, - "v2_scanned": total_scanned, - "empirical": total_papers, - "non_empirical": non_empirical_count, - "v3_scanned": v3_count, - "v2_only": v2_only, - "v1_needs_rescan": v1_count, - "has_text_no_scan": has_text - total_scanned - v1_count, - "no_text": reg_total - has_text, - "excluded": reg_status.get("excluded", 0), + "v5_opus": v5_opus, + "v5_haiku": v5_haiku, + "deprecated_scan": deprecated_scan, + "not_scanned": not_scanned, } dashboard = { diff --git a/scripts/run-scan-v5-haiku.py b/scripts/run-scan-v5-haiku.py @@ -49,7 +49,7 @@ def build_questions_text(category_obj): def build_prompt(paper_type, paper_text, paper_id, registry_entry, hn_data): - """Build the v4 Haiku scan prompt.""" + """Build the v5 Haiku scan prompt.""" core_cats = SCHEMA["properties"]["checklist"]["properties"] type_mod = SCHEMA["properties"]["type_checklist"]["properties"].get(paper_type, {}) type_cats = type_mod.get("properties", {}) @@ -189,7 +189,7 @@ def load_registry(): def scan_one(paper_id, registry, force=False): - """Run v4 Haiku scan on one paper. Returns (paper_id, ok, reason, stats).""" + """Run v5 Haiku scan on one paper. Returns (paper_id, ok, reason, stats).""" v5_path = PAPERS_DIR / paper_id / "scan-v5.json" if v5_path.exists() and not force: return paper_id, True, "already scanned", {} @@ -231,6 +231,7 @@ def scan_one(paper_id, registry, force=False): ) if result.returncode != 0: + stderr_hint = result.stderr.strip()[:200] if result.stderr else "" # Retry with sonnet if haiku failed if model == "haiku": model = "sonnet" @@ -241,9 +242,10 @@ def scan_one(paper_id, registry, force=False): cwd=str(ROOT), ) if result.returncode != 0: - return paper_id, False, f"claude exit {result.returncode} (sonnet fallback)", {} + stderr_hint2 = result.stderr.strip()[:200] if result.stderr else "" + return paper_id, False, f"claude exit {result.returncode} (sonnet fallback): {stderr_hint2 or stderr_hint}", {} else: - return paper_id, False, f"claude exit {result.returncode}", {} + return paper_id, False, f"claude exit {result.returncode}: {stderr_hint}", {} output = result.stdout.strip() json_start = output.find("{") @@ -277,7 +279,7 @@ def scan_one(paper_id, registry, force=False): if isinstance(qd, dict) and "applies" in qd and "source" not in qd: qd["source"] = scan_model - # Write v4 scan — pure Haiku/Sonnet output, NO merge with Opus. + # Write v5 scan — pure Haiku/Sonnet output, NO merge with Opus. # The build pipeline will overlay Opus answers at read time when both exist. # Keeping them separate preserves the ability to compare per-question. with open(v5_path, "w") as f: @@ -333,7 +335,7 @@ def main(): print("No papers to scan.") return - print(f"V4 Haiku scan: {len(candidates)} papers" + print(f"V5 Haiku scan: {len(candidates)} papers" f"{f' (parallel={parallel})' if parallel > 1 else ''}:\n") ok_count = 0 @@ -346,6 +348,8 @@ def main(): pid, ok, reason, stats = future.result() if ok: ok_count += 1 + if ok_count % 10 == 0: + print(f" ... {ok_count} OK so far") else: fail_count += 1 print(f" FAIL: {pid} — {reason}")

Impressum · Datenschutz