scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan-v4.json (23798B)
      1 {
      2   "scan_version": 4,
      3   "paper_type": "benchmark-creation",
      4   "paper": {
      5     "title": "Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks",
      6     "authors": [
      7       "Linyuan Gong",
      8       "Sida Wang",
      9       "Mostafa Elhoushi",
     10       "Alvin Cheung"
     11     ],
     12     "year": 2024,
     13     "venue": "International Conference on Machine Learning",
     14     "arxiv_id": "2403.04814",
     15     "doi": "10.48550/arXiv.2403.04814"
     16   },
     17   "checklist": {
     18     "claims_and_evidence": {
     19       "abstract_claims_supported": {
     20         "applies": true,
     21         "answer": true,
     22         "justification": "Abstract claims are supported: FIM pretraining enhancing FIM and L2R inference is shown in Table 2 (Section 6.1), pretraining methods mattering more than model size is demonstrated in Table 4/Figure 3 (Section 6.3), SAFIM providing fair comparisons is evidenced by the truncation ablation (Table 3) and prompt comparison (Table 2).",
     23         "source": "opus"
     24       },
     25       "causal_claims_justified": {
     26         "applies": true,
     27         "answer": false,
     28         "justification": "The abstract claims 'FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference' — a causal claim based on observational comparison across model families with different architectures, data, and training procedures. While the paper acknowledges in Section 7 that 'our conclusions are drawn from comparisons across various model families trained with different paradigms, rather than from controlled experiments,' the abstract presents these causal claims without caveats.",
     29         "source": "opus"
     30       },
     31       "generalization_bounded": {
     32         "applies": true,
     33         "answer": true,
     34         "justification": "The title scopes to 'Syntax-Aware Code Fill-in-the-Middle Tasks.' Claims use hedging language ('our findings suggest'). The paper acknowledges cross-family comparisons are not controlled experiments (Section 7) and that results may vary by programming language (Appendix A.7). Scope is mostly bounded to SAFIM's three task types and four languages.",
     35         "source": "opus"
     36       },
     37       "alternative_explanations_discussed": {
     38         "applies": true,
     39         "answer": true,
     40         "justification": "The paper discusses multiple alternative explanations: data contamination (Appendix A.9 with a dedicated experiment), differences in pretraining environments confounding cross-family comparisons (Section 1, Section 7), programming language coding style affecting results (Appendix A.7), and training data cutoff date overlap (Table 1).",
     41         "source": "opus"
     42       },
     43       "proxy_outcome_distinction": {
     44         "applies": true,
     45         "answer": true,
     46         "justification": "The paper measures Pass@1 on execution-based tests and syntactical matching, and frames results in terms of 'code completion proficiency' and 'FIM performance.' The measurements match the granularity of claims — no broader framing of 'developer productivity' or similar proxy gaps exists.",
     47         "source": "opus"
     48       }
     49     },
     50     "limitations_and_scope": {
     51       "limitations_section_present": {
     52         "applies": true,
     53         "answer": true,
     54         "justification": "Section 7 contains substantive limitations discussion: 'We acknowledge a key limitation in our study: our conclusions are drawn from comparisons across various model families trained with different paradigms, rather than from controlled experiments altering pretraining paradigms within the same model.' An Impact Statement section also discusses broader concerns.",
     55         "source": "opus"
     56       },
     57       "threats_to_validity_specific": {
     58         "applies": true,
     59         "answer": true,
     60         "justification": "The paper discusses specific threats: data contamination from training data overlap (Table 1, Appendix A.9 with a dedicated experiment), non-controlled cross-family comparisons (Section 7), programming language coding style affecting results (Appendix A.7), and prompt sensitivity affecting model rankings (Section 6.1).",
     61         "source": "opus"
     62       },
     63       "scope_boundaries_stated": {
     64         "applies": true,
     65         "answer": true,
     66         "justification": "Section 7 explicitly states what the results do NOT show: 'our conclusions are drawn from comparisons across various model families trained with different paradigms, rather than from controlled experiments' and proposes 'future work in pretraining such models under the same environment to validate these observations further.' The paper also scopes to decoder-only models, excluding encoder-decoder models (Section 2).",
     67         "source": "opus"
     68       }
     69     },
     70     "conflicts_of_interest": {
     71       "funding_disclosed": {
     72         "applies": true,
     73         "answer": true,
     74         "justification": "The Acknowledgements section lists: 'gift from Meta, the U.S. National Science Foundation through grants IIS-1955488, IIS-2027575, ARO W911NF2110339, ONR N00014-21-1-2724, and DOE awards DE-SC0016260, DE-SC0021982.'",
     75         "source": "opus"
     76       },
     77       "affiliations_disclosed": {
     78         "applies": true,
     79         "answer": true,
     80         "justification": "Author affiliations are listed: Linyuan Gong and Alvin Cheung at University of California at Berkeley, Sida Wang and Mostafa Elhoushi at 'AI at Meta.' Two authors are at Meta, whose CodeLLaMa model is evaluated.",
     81         "source": "opus"
     82       },
     83       "funder_independent_of_outcome": {
     84         "applies": true,
     85         "answer": false,
     86         "justification": "Meta provided a gift fund and two authors are Meta employees. Meta develops CodeLLaMa, one of the evaluated models. While CodeLLaMa does not consistently win (DeepSeekCoder outperforms it), the funder has a financial interest in LLM evaluation outcomes.",
     87         "source": "opus"
     88       },
     89       "financial_interests_declared": {
     90         "applies": true,
     91         "answer": false,
     92         "justification": "No competing interests or financial interests statement is present in the paper. Two Meta-affiliated authors evaluate Meta's CodeLLaMa, but no explicit conflict-of-interest disclosure is provided.",
     93         "source": "opus"
     94       }
     95     },
     96     "scope_and_framing": {
     97       "key_terms_defined": {
     98         "applies": true,
     99         "answer": true,
    100         "justification": "Key terms are well-defined: FIM (Fill-in-the-Middle) is explained with all five prompt variants (L2R, PSM, SPM, IPF, 1S), 'syntax-aware' is defined as AST-based masking, and each of the three benchmark splits (algorithmic block, control-flow expression, API function call) is precisely characterized with examples.",
    101         "source": "haiku"
    102       },
    103       "intended_contribution_clear": {
    104         "applies": true,
    105         "answer": true,
    106         "justification": "The paper clearly states three contributions: (1) the SAFIM benchmark dataset with 17,720 examples, (2) multiple prompt designs enabling fair cross-model comparisons, and (3) a novel syntax-aware truncation algorithm for post-processing.",
    107         "source": "haiku"
    108       },
    109       "engagement_with_prior_work": {
    110         "applies": true,
    111         "answer": true,
    112         "justification": "Section 2 provides substantive engagement with prior FIM benchmarks (HumanEval-Infilling, HumanEval, MBPP), code LLMs, and repository-level benchmarks, explicitly identifying gaps SAFIM addresses: scale, multilingual coverage, execution-based evaluation, and syntax-awareness.",
    113         "source": "haiku"
    114       }
    115     }
    116   },
    117   "type_checklist": {
    118     "benchmark-creation": {
    119       "construct_design": {
    120         "construct_validity_argued": {
    121           "applies": true,
    122           "answer": true,
    123           "justification": "The paper argues that AST-based syntax-aware masking better represents real development tasks than randomly masked lines or character spans, contrasting explicitly with HumanEval-Infilling's approach. Each split is motivated by a distinct code comprehension capability (algorithm design, control-flow understanding, API knowledge).",
    124           "source": "haiku"
    125         },
    126         "difficulty_distribution_characterized": {
    127           "applies": true,
    128           "answer": false,
    129           "justification": "The paper provides statistics on example sizes and average completion lengths (Tables 5–6, Figure 4) but does not characterize difficulty distribution into tiers (easy/medium/hard) or empirically measure per-example difficulty.",
    130           "source": "haiku"
    131         },
    132         "ceiling_floor_effects_checked": {
    133           "applies": true,
    134           "answer": false,
    135           "justification": "No explicit ceiling or floor effect analysis is conducted. The observed Pass@1 range of ~23%–69% suggests reasonable discrimination, but this is observed post-hoc rather than verified by design.",
    136           "source": "haiku"
    137         },
    138         "human_baseline_included": {
    139           "applies": true,
    140           "answer": false,
    141           "justification": "No human performance baseline is reported. The API split mentions examples are 'solvable by humans based on the given context' but no human performance measurement is provided for any split.",
    142           "source": "haiku"
    143         },
    144         "scoring_rubric_justified": {
    145           "applies": true,
    146           "answer": true,
    147           "justification": "Execution-based Pass@1 is justified for algorithmic/control-flow tasks (unit tests available), syntactic match for API calls is justified by the impossibility of executing external API side effects, and the use of single Pass@1 rather than multi-sample averaging is justified by the dataset's large scale (17,720 examples).",
    148           "source": "haiku"
    149         }
    150       },
    151       "robustness": {
    152         "contamination_resistance_designed": {
    153           "applies": true,
    154           "answer": true,
    155           "justification": "SAFIM uses a post-April 2022 cutoff to avoid overlap with The Stack (cutoff March 2022) and GPT-3.5/4 training data (cutoff September 2021). Appendix A.9 empirically validates that even for models with overlapping cutoffs (CodeLLaMa, DeepSeekCoder), performance is essentially unchanged on a clean post-April 2023 held-out dataset.",
    156           "source": "haiku"
    157         },
    158         "temporal_robustness_discussed": {
    159           "applies": true,
    160           "answer": false,
    161           "justification": "The paper does not discuss whether or how the benchmark will remain discriminative as model capabilities advance, or provide a plan for updating the benchmark when model training cutoffs surpass the data sources.",
    162           "source": "haiku"
    163         },
    164         "failure_modes_discussed": {
    165           "applies": true,
    166           "answer": false,
    167           "justification": "The paper briefly notes that syntactic match evaluation for API calls is imperfect and that cross-family comparisons aren't controlled, but does not systematically enumerate benchmark failure modes or describe what behaviors SAFIM cannot measure or what strategies could game it.",
    168           "source": "haiku"
    169         },
    170         "baseline_implementations_provided": {
    171           "applies": true,
    172           "answer": true,
    173           "justification": "The full evaluation toolkit and dataset are released at github.com/gonglinyuan/safim; Appendix A.3 (Table 7) provides exact model identifiers for every evaluated model on Hugging Face and the OpenAI API, and generation hyperparameters (top-p=0.95, temperature=0.2) are specified.",
    174           "source": "haiku"
    175         }
    176       },
    177       "documentation": {
    178         "dataset_documentation_complete": {
    179           "applies": true,
    180           "answer": true,
    181           "justification": "Dataset construction is documented with source description (Codeforces and GitHub), date ranges (April 2022–January 2023), filtering criteria (unit test pass rate, CodeBLEU deduplication, GitHub quality filters), and per-split and per-language statistics in Tables 5–6 and Figure 4.",
    182           "source": "haiku"
    183         },
    184         "licensing_and_access_clear": {
    185           "applies": true,
    186           "answer": false,
    187           "justification": "The paper states the dataset is available at GitHub and lists a leaderboard URL but specifies no license for the benchmark data or evaluation toolkit, creating uncertainty about legal reuse and redistribution.",
    188           "source": "haiku"
    189         },
    190         "intended_use_specified": {
    191           "applies": true,
    192           "answer": true,
    193           "justification": "Intended use is clearly stated: evaluating code LLMs on FIM tasks and comparing pretraining paradigms. The paper also explicitly cautions that cross-family comparisons 'should be interpreted with caution' and that controlled experiments are needed to validate causal conclusions.",
    194           "source": "haiku"
    195         }
    196       }
    197     }
    198   },
    199   "claims": [
    200     {
    201       "claim": "FIM pretraining enhances both FIM task performance and Left-to-Right (L2R) inference capabilities.",
    202       "evidence": "Table 2 shows FIM-pretrained StarCoder (15.5B) outperforms L2R-only CodeGen-16B in L2R mode (29.3% vs 24.6%); CodeLLaMa-13B (FIM+L2R) outperforms CodeLLaMa-34B (L2R only) on FIM tasks despite smaller size.",
    203       "supported": "moderate"
    204     },
    205     {
    206       "claim": "Pretraining methods and data quality matter more than sheer model size for code task performance.",
    207       "evidence": "StarCoder (15.5B) matches GPT-4 (53.3% vs 55.5% avg Pass@1); DeepSeekCoder-1.3B (52.6%) far exceeds CodeGen-16B (31.0%). Within-family scaling gains are modest; cross-family differences are large.",
    208       "supported": "moderate"
    209     },
    210     {
    211       "claim": "Syntax-aware truncation significantly improves Pass@1 and reduces compilation errors versus no truncation.",
    212       "evidence": "Table 3: CodeLLaMa-13B jumps from 16.4% to 41.4% Pass@1 with CErr% dropping from 64.6% to 10.9%; CodeGen-16B goes from 0.0% to 25.9%. Improvements are consistent across all models.",
    213       "supported": "strong"
    214     },
    215     {
    216       "claim": "Using a limited prompt range skews FIM benchmark comparisons; multiple prompt types are required for fair evaluation.",
    217       "evidence": "Table 2 shows CodeGen-16B scores 15.2% on IPF but 25.9% on SPM, reversing its apparent ranking versus InCoder-6B (25.2% on PSM) — exactly replicating the skewed comparison methodology of Fried et al. 2023.",
    218       "supported": "strong"
    219     },
    220     {
    221       "claim": "Data contamination (date-range overlap) has negligible impact on SAFIM evaluation results.",
    222       "evidence": "Appendix A.9 Table 17 shows no significant performance decrease on a clean post-April 2023 dataset for CodeLLaMa or DeepSeekCoder, with several models slightly improving on newer data.",
    223       "supported": "moderate"
    224     },
    225     {
    226       "claim": "Repository-level pretraining context is the key factor for API function call completion performance.",
    227       "evidence": "StarCoder (68.1%) and DeepSeekCoder (75.2%) lead the API split; the paper attributes this to their repo-level pretraining (GitHub issues/commits and topological dependency ordering respectively). But this is inferred from observational model comparisons without ablation.",
    228       "supported": "weak"
    229     }
    230   ],
    231   "methodology_tags": [
    232     "benchmark-eval"
    233   ],
    234   "key_findings": "SAFIM introduces a 17,720-example multilingual code FIM benchmark using AST-based masking across three task types (algorithmic block, control-flow, API function call). Evaluation of 15+ LLMs shows DeepSeekCoder-33B achieves the highest performance (69.0% avg Pass@1) while the novel syntax-aware truncation algorithm dramatically improves scores for models lacking explicit stop tokens (e.g., CodeLLaMa-13B: 16.4%→41.4%). Observationally, FIM-pretrained smaller models match or outperform larger L2R-only models, and prompt selection has a large impact on apparent model rankings — prior benchmarks using a single prompt type produced misleading comparisons. The contamination analysis found negligible impact even for models with overlapping training date ranges.",
    235   "red_flags": [
    236     {
    237       "flag": "No human baseline",
    238       "detail": "No human performance is measured on any SAFIM split, making it impossible to calibrate benchmark difficulty or know whether top models approach human-level performance."
    239     },
    240     {
    241       "flag": "Meta conflict of interest unacknowledged",
    242       "detail": "Two authors are employed at 'AI at Meta' and the work is funded by a Meta gift, yet the paper evaluates CodeLLaMa (Meta's model) without any competing interests declaration or mitigation."
    243     },
    244     {
    245       "flag": "Causal claims from uncontrolled comparisons",
    246       "detail": "The abstract frames 'FIM pretraining improves L2R inference' and 'pretraining > model size' as findings, but these are observational cross-family comparisons confounded by training data, compute, and architecture differences — the paper acknowledges this only in the conclusion body."
    247     },
    248     {
    249       "flag": "API split very small (310 examples)",
    250       "detail": "The API function call completion split has only 310 examples from GitHub versus 8,781 and 8,629 in the other two splits. Conclusions about API performance are based on a much weaker statistical foundation."
    251     },
    252     {
    253       "flag": "No licensing information",
    254       "detail": "The paper provides no license for the benchmark data or toolkit, creating legal ambiguity about use and redistribution for others who wish to build on SAFIM."
    255     },
    256     {
    257       "flag": "No difficulty distribution or ceiling/floor analysis",
    258       "detail": "Benchmark items are not characterized by difficulty tier, and no explicit ceiling/floor effect check is performed — though the observed score range (23%–69%) suggests adequate discrimination in practice."
    259     }
    260   ],
    261   "cited_papers": [
    262     {
    263       "title": "Evaluating Large Language Models Trained on Code (HumanEval / Codex)",
    264       "relevance": "The primary prior benchmark SAFIM extends and critiques for limited scope (164 Python functions, standalone generation only, no FIM)"
    265     },
    266     {
    267       "title": "Efficient Training of Language Models to Fill in the Middle (Bavarian et al. 2022)",
    268       "relevance": "Establishes FIM pretraining methodology and the HumanEval-Infilling benchmark that SAFIM supersedes"
    269     },
    270     {
    271       "title": "Code Llama: Open Foundation Models for Code",
    272       "relevance": "Key evaluated model; also the source of skewed prior evaluation methodology that SAFIM's multi-prompt approach corrects"
    273     },
    274     {
    275       "title": "DeepSeek-Coder: When the Large Language Model Meets Programming",
    276       "relevance": "Top-performing evaluated model; demonstrates benefits of repo-level pretraining for code tasks"
    277     },
    278     {
    279       "title": "StarCoder: May the Source Be with You!",
    280       "relevance": "Evaluated model incorporating GitHub issues/commits in pretraining; demonstrates FIM pretraining benefits for both FIM and L2R"
    281     },
    282     {
    283       "title": "InCoder: A Generative Model for Code Infilling and Synthesis",
    284       "relevance": "Pioneer of FIM pretraining for decoder-only models; SAFIM directly extends and critiques the HumanEval-Infilling benchmark introduced here"
    285     },
    286     {
    287       "title": "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples",
    288       "relevance": "Motivates SAFIM's contamination-resistance design via temporal cutoffs"
    289     },
    290     {
    291       "title": "xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval",
    292       "relevance": "Provides the ExecEval execution framework used for SAFIM's execution-based evaluation"
    293     },
    294     {
    295       "title": "RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems",
    296       "relevance": "Related repository-level code completion benchmark; contextualizes SAFIM's API function call split"
    297     },
    298     {
    299       "title": "SWE-Bench: Can Language Models Resolve Real-World Github Issues?",
    300       "relevance": "Contextualizes SAFIM in the broader landscape of real-world software engineering benchmarks"
    301     }
    302   ],
    303   "engagement_factors": {
    304     "practical_relevance": {
    305       "score": 2,
    306       "justification": "Provides a publicly available benchmark toolkit and dataset for evaluating code LLMs on FIM tasks, useful for model developers but not directly usable by most practitioners."
    307     },
    308     "surprise_contrarian": {
    309       "score": 1,
    310       "justification": "The finding that smaller models can match larger ones and FIM helps L2R is mildly surprising but aligns with a growing body of evidence on data quality over model size."
    311     },
    312     "fear_safety": {
    313       "score": 0,
    314       "justification": "No safety or security concerns raised; the Impact Statement mentions risks of automated code generation but this is generic."
    315     },
    316     "drama_conflict": {
    317       "score": 0,
    318       "justification": "No controversy — a straightforward benchmark paper with balanced evaluation across multiple model families."
    319     },
    320     "demo_ability": {
    321       "score": 2,
    322       "justification": "Code, dataset, and leaderboard (safimbenchmark.com) are publicly available; researchers can evaluate their own models on SAFIM."
    323     },
    324     "brand_recognition": {
    325       "score": 2,
    326       "justification": "Published at ICML 2024 (top venue), evaluates well-known models (GPT-4, CodeLLaMa, DeepSeekCoder), two authors from Meta AI."
    327     }
    328   },
    329   "hn_data": {
    330     "threads": [
    331       {
    332         "hn_id": "40881654",
    333         "title": "LLM Agents can Autonomously Exploit One-day Vulnerabilities [pdf]",
    334         "points": 4,
    335         "comments": 1,
    336         "url": "https://news.ycombinator.com/item?id=40881654"
    337       },
    338       {
    339         "hn_id": "40138889",
    340         "title": "LLM Agents Can Autonomously Exploit One-Day Vulnerabilities",
    341         "points": 4,
    342         "comments": 1,
    343         "url": "https://news.ycombinator.com/item?id=40138889"
    344       },
    345       {
    346         "hn_id": "40633364",
    347         "title": "LLM Agents Can Autonomously Exploit One-Day Vulnerabilities",
    348         "points": 3,
    349         "comments": 1,
    350         "url": "https://news.ycombinator.com/item?id=40633364"
    351       },
    352       {
    353         "hn_id": "41128425",
    354         "title": "Things Come from Having Many Good Models",
    355         "points": 2,
    356         "comments": 0,
    357         "url": "https://news.ycombinator.com/item?id=41128425"
    358       },
    359       {
    360         "hn_id": "40756286",
    361         "title": "Solving Maxwell's Equations with Non-Trainable Graph Neural Network",
    362         "points": 2,
    363         "comments": 0,
    364         "url": "https://news.ycombinator.com/item?id=40756286"
    365       },
    366       {
    367         "hn_id": "40679472",
    368         "title": "Discovering Optimization Algorithms With And For Large Language Models",
    369         "points": 2,
    370         "comments": 0,
    371         "url": "https://news.ycombinator.com/item?id=40679472"
    372       },
    373       {
    374         "hn_id": "40666270",
    375         "title": "Discovering Preference Optimization Algorithms with Large Language Models",
    376         "points": 2,
    377         "comments": 0,
    378         "url": "https://news.ycombinator.com/item?id=40666270"
    379       },
    380       {
    381         "hn_id": "40085930",
    382         "title": "LLM Agents Can Autonomously Exploit One-Day Vulnerabilities with 87% Success",
    383         "points": 2,
    384         "comments": 0,
    385         "url": "https://news.ycombinator.com/item?id=40085930"
    386       },
    387       {
    388         "hn_id": "39765229",
    389         "title": "Quantifying Contamination in Code Generation Capabilities of Language Models",
    390         "points": 1,
    391         "comments": 0,
    392         "url": "https://news.ycombinator.com/item?id=39765229"
    393       },
    394       {
    395         "hn_id": "39737870",
    396         "title": "LSTM-Based Machine Learning for Enhancing Storm Surge Forecasting Accuracy",
    397         "points": 1,
    398         "comments": 0,
    399         "url": "https://news.ycombinator.com/item?id=39737870"
    400       }
    401     ],
    402     "top_points": 4,
    403     "total_points": 23,
    404     "total_comments": 3
    405   }
    406 }
	git clone https://git.shiptheloop.com/ai-research-survey.git