scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

      1 {
      2   "scan_version": 4,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair",
      6     "authors": [
      7       "Han Zheng",
      8       "Ilia Shumailov",
      9       "Tianqi Fan",
     10       "Aiden Hall",
     11       "Mathias Payer"
     12     ],
     13     "year": 2025,
     14     "venue": "arXiv.org",
     15     "arxiv_id": "2505.13103",
     16     "doi": "10.48550/arXiv.2505.13103"
     17   },
     18   "checklist": {
     19     "claims_and_evidence": {
     20       "abstract_claims_supported": {
     21         "applies": true,
     22         "answer": false,
     23         "justification": "The abstract claims the combined pipeline achieves '73.5% (+29.6%)' fixing rate, but 195 (CodeRover-S) + 60 additional = 255/358 = 71.2%, not 73.5%. More critically, the Conclusion (Section 7) swaps the two key numbers: 'reduces token usage by 29.6% and improves the fixing rate by 45.9%', contradicting the abstract's '45.9% cost reduction' and '29.6% fixing rate improvement.' These internal inconsistencies undermine the abstract claims.",
     24         "source": "opus"
     25       },
     26       "causal_claims_justified": {
     27         "applies": true,
     28         "answer": true,
     29         "justification": "The main causal claims ('crash-site repair reduces cost while maintaining effectiveness') are supported by a controlled comparison: same benchmark (358 ARVO bugs), same LLM (gpt-4o-2024-08-06) for the head-to-head, with WILLIAMT tested against CodeRover-S, Agentless, and VulMaster under equivalent conditions.",
     30         "source": "opus"
     31       },
     32       "generalization_bounded": {
     33         "applies": true,
     34         "answer": false,
     35         "justification": "The title 'Fixing 7,400 Bugs for 1$' is not supported anywhere in the paper body: the number 7,400 never appears in the text. The evaluation covers 358 bugs of 4 types (HBO, GBO, SBO, UAF) on one benchmark (ARVO). At the reported GPT-4o cost of $0.0026/bug, $1 covers ~385 bugs, not 7,400. The claim in the title is an unsupported extrapolation.",
     36         "source": "opus"
     37       },
     38       "alternative_explanations_discussed": {
     39         "applies": true,
     40         "answer": false,
     41         "justification": "Section 6 discusses failure modes of WILLIAMT but does not consider alternative explanations for its performance. For example, there is no discussion of whether ARVO bugs are particularly amenable to crash-site repair, whether the specific bug types selected bias results, or whether LLMs may have seen the fixes in training.",
     42         "source": "opus"
     43       },
     44       "proxy_outcome_distinction": {
     45         "applies": true,
     46         "answer": true,
     47         "justification": "Appendix A (Table 1, Figure 9) extensively distinguishes between the 'plausible fix' proxy metric and actual correctness. The authors show that the plausible metric 'does not ensure that the program's functionality is preserved' and demonstrate that only 56/165 plausible fixes survive manual review. This is an unusually thorough proxy-outcome analysis.",
     48         "source": "opus"
     49       }
     50     },
     51     "limitations_and_scope": {
     52       "limitations_section_present": {
     53         "applies": true,
     54         "answer": true,
     55         "justification": "Section 6 ('Discussion') serves as a dedicated limitations section, discussing three categories of limitations: incorrect crash site analysis, semantically disruptive patch insertion, and imprecise plausible metric. Each is discussed substantively.",
     56         "source": "opus"
     57       },
     58       "threats_to_validity_specific": {
     59         "applies": true,
     60         "answer": true,
     61         "justification": "The threats are specific to this study: 'the code snippet provided to the LLM may lack the correct variable due to limited context' (specific to their context window), '\"No Patch\" failures account for approximately 37% of all bugs, with many caused by such semantic violations' (quantified failure mode), and the plausible metric limitation with concrete data (only 56/165 are truly correct).",
     62         "source": "opus"
     63       },
     64       "scope_boundaries_stated": {
     65         "applies": true,
     66         "answer": true,
     67         "justification": "The paper explicitly scopes to crash-site repair (not root cause), four memory corruption types (HBO, GBO, SBO, UAF), and acknowledges crash-site fixes are temporary: 'giving developers time to implement a permanent solution' (Section 3). Section 6 states the plausible metric 'may overestimate the number of truly correct fixes.'",
     68         "source": "opus"
     69       }
     70     },
     71     "conflicts_of_interest": {
     72       "funding_disclosed": {
     73         "applies": true,
     74         "answer": false,
     75         "justification": "No funding source, acknowledgments section, or grant information is disclosed anywhere in the paper.",
     76         "source": "opus"
     77       },
     78       "affiliations_disclosed": {
     79         "applies": true,
     80         "answer": true,
     81         "justification": "Author affiliations are clearly listed: Han Zheng and Mathias Payer at EPFL; Ilia Shumailov at Google DeepMind; Tianqi Fan at Google; Aiden Hall at Google.",
     82         "source": "opus"
     83       },
     84       "funder_independent_of_outcome": {
     85         "applies": true,
     86         "answer": false,
     87         "justification": "Three of five authors are affiliated with Google/DeepMind. The evaluation relies on Google infrastructure (OSS-Fuzz, ClusterFuzz, ARVO). Google has a stake in demonstrating effective APR for OSS-Fuzz bugs. Because no funding is disclosed, independence cannot be assessed.",
     88         "source": "opus"
     89       },
     90       "financial_interests_declared": {
     91         "applies": true,
     92         "answer": false,
     93         "justification": "No competing interests statement, no patent disclosures, and no financial interests declaration appears in the paper.",
     94         "source": "opus"
     95       }
     96     },
     97     "scope_and_framing": {
     98       "key_terms_defined": {
     99         "applies": true,
    100         "answer": true,
    101         "justification": "Key terms are defined precisely: 'crash-site repair' vs. root-cause repair is explained with an example (Figure 3), 'plausible fix' is defined operationally (patched program does not crash on PoC), and memory corruption categories (spatial/temporal) are defined in Section 2.2.",
    102         "source": "haiku"
    103       },
    104       "intended_contribution_clear": {
    105         "applies": true,
    106         "answer": true,
    107         "justification": "Section 1 ends with a four-bullet contribution list explicitly identifying: crash-site repair proposal, template-guided patch generation technique, WILLIAMT implementation/evaluation, and open-source promise.",
    108         "source": "haiku"
    109       },
    110       "engagement_with_prior_work": {
    111         "applies": true,
    112         "answer": true,
    113         "justification": "Section 2 situates the work relative to existing APR agents (CodeRover-S, Agentless, VulMaster) with direct comparisons in the evaluation; the paper contrasts its crash-site approach with root-cause-oriented methods and explains why root-cause analysis fails in practice (Figure 3 example).",
    114         "source": "haiku"
    115       }
    116     }
    117   },
    118   "type_checklist": {
    119     "empirical": {
    120       "artifacts": {
    121         "code_released": {
    122           "applies": true,
    123           "answer": false,
    124           "justification": "The paper explicitly states 'We promise to fully release WILLIAMT upon paper acceptance to support open science' (Section 1, repeated in Section 7). A promise of future release does not count as released code.",
    125           "source": "opus"
    126         },
    127         "data_released": {
    128           "applies": true,
    129           "answer": true,
    130           "justification": "The evaluation uses ARVO, a publicly available benchmark (reference [30]). The paper states ARVO 'reconstructs and curates a reproducible dataset of OSS bugs specifically tailored for APR evaluation' and all bugs originate from OSS-Fuzz.",
    131           "source": "opus"
    132         },
    133         "environment_specified": {
    134           "applies": true,
    135           "answer": false,
    136           "justification": "Hardware is mentioned ('Ubuntu 22.04 server equipped with AMD EPYC 7302P and 64GB RAM', 'RTX 4090 GPU', 'Mac Mini M4') but no software dependency specifications, requirements.txt, Dockerfile, or library versions are provided. The ARVO Docker images provide reproducibility for the benchmark, but the WILLIAMT tool's own environment is not specified.",
    137           "source": "opus"
    138         },
    139         "reproduction_instructions": {
    140           "applies": true,
    141           "answer": false,
    142           "justification": "No step-by-step reproduction instructions, README, or scripts are provided. Code is not yet released, making reproduction impossible.",
    143           "source": "opus"
    144         }
    145       },
    146       "statistical_methodology": {
    147         "confidence_intervals_or_error_bars": {
    148           "applies": true,
    149           "answer": false,
    150           "justification": "All results are reported as single point estimates (e.g., '54.5%', '46.1%', '$0.0026'). No confidence intervals, error bars, or uncertainty measures appear in any table or figure.",
    151           "source": "opus"
    152         },
    153         "significance_tests": {
    154           "applies": true,
    155           "answer": false,
    156           "justification": "Comparative claims like 'WILLIAMT reduces token cost by 45.9%' and 'increases the bug-fixing rate to 73.5%' are based solely on comparing raw numbers without any statistical significance test.",
    157           "source": "opus"
    158         },
    159         "effect_sizes_reported": {
    160           "applies": true,
    161           "answer": true,
    162           "justification": "Effect sizes are given with baseline context throughout: '45.9% cost reduction', '29.6% improvement in fixing rate', '357 times more bugs per dollar', '99.7% token cost reduction while preserving over 86.7% performance' (Sections 5.1-5.3). Absolute numbers and percentages are provided for both systems.",
    163           "source": "opus"
    164         },
    165         "sample_size_justified": {
    166           "applies": true,
    167           "answer": false,
    168           "justification": "The 358-bug sample is defined by selection criteria ('all HOF, SOF, UAF and GOF bugs that can be compiled within 15 minutes') but there is no justification for why this sample size is adequate for the claims being made, nor any power analysis.",
    169           "source": "opus"
    170         },
    171         "variance_reported": {
    172           "applies": true,
    173           "answer": false,
    174           "justification": "All results appear to be from single runs. WILLIAMT is described as 'one-shot' (one trial, one patch), and no standard deviation, variance, or spread measures are reported across any experimental condition.",
    175           "source": "opus"
    176         }
    177       },
    178       "evaluation_design": {
    179         "baselines_included": {
    180           "applies": true,
    181           "answer": true,
    182           "justification": "Three SoTA baselines are compared: CodeRover-S [60], Agentless [54], and VulMaster [64]. Results are presented in Figure 4 and the Venn diagram in Figure 5c.",
    183           "source": "opus"
    184         },
    185         "baselines_contemporary": {
    186           "applies": true,
    187           "answer": true,
    188           "justification": "All baselines are from 2024: CodeRover-S (arXiv 2411.03346, 2024), Agentless (ISSTA 2024), and VulMaster (ICSE 2024). These represent the current state of the art in LLM-based APR.",
    189           "source": "opus"
    190         },
    191         "ablation_study": {
    192           "applies": true,
    193           "answer": false,
    194           "justification": "WILLIAMT has two main components (regex-based context retrieval and template-guided patch generation) but no ablation study removes individual components to measure their contribution. The multi-LLM comparison (Figure 6) varies the backend but not the system architecture.",
    195           "source": "opus"
    196         },
    197         "multiple_metrics": {
    198           "applies": true,
    199           "answer": true,
    200           "justification": "Multiple metrics are reported: plausible fix rate, token cost (USD), execution time, CPU usage, and a detailed fix classification (Plausible, Multiple, Compiles, No Patch, No Code). Appendix A additionally reports actual fix ratio after manual review.",
    201           "source": "opus"
    202         },
    203         "human_evaluation": {
    204           "applies": true,
    205           "answer": true,
    206           "justification": "Appendix A (Figure 9) presents manual evaluation of all 165 plausible patches from WILLIAMT-GPT-4o: automated validation of execution consistency, then manual review of robustness across diverse inputs. 56 of 165 are found truly correct.",
    207           "source": "opus"
    208         },
    209         "held_out_test_set": {
    210           "applies": true,
    211           "answer": false,
    212           "justification": "All 358 bugs are used for evaluation. There is no separation into development and test sets, and no discussion of whether prompts or templates were tuned on any subset of the data.",
    213           "source": "opus"
    214         },
    215         "per_category_breakdown": {
    216           "applies": true,
    217           "answer": true,
    218           "justification": "Results are broken down by fix outcome (Plausible/Multiple/Compiles/No Patch/No Code in Figures 4 and 6) and by bug type (HBO 52.3%, UAF 11.3%, SBO 8.9%, GBO 4.6% in Figure 11). The Venn diagram (Figure 5c) shows overlap across tools.",
    219           "source": "opus"
    220         },
    221         "failure_cases_discussed": {
    222           "applies": true,
    223           "answer": true,
    224           "justification": "Section 6 discusses three failure modes: incorrect crash site analysis (LLM returns inaccurate variables), semantically disruptive patch insertion (breaking control flow), and the imprecise plausible metric. Appendix A provides a detailed failure funnel (Figure 9): 165→95→56 through successive filtering.",
    225           "source": "opus"
    226         },
    227         "negative_results_reported": {
    228           "applies": true,
    229           "answer": true,
    230           "justification": "Reasoning models (DeepSeek-R1, o3-mini) are shown to cost 5.7-5.9× more than Claude35-Haiku 'without providing notable improvements over non-reasoning LLMs' (Section 5.3). Gemma3:1b fixes only 10 bugs. VulMaster 'resolves only 5 bugs in total.' The manual review shows only 56/165 plausible patches are truly correct.",
    231           "source": "opus"
    232         }
    233       },
    234       "setup_transparency": {
    235         "model_versions_specified": {
    236           "applies": true,
    237           "answer": false,
    238           "justification": "Only GPT-4o is specified with a snapshot date ('gpt-4o-2024-08-06'). All other models use marketing names without API versions or snapshot dates: 'DeepSeek (V3, R1)', 'Claude (3.5-Haiku, 3.7-Sonnet)', 'o3-mini', 'Gemma3 (27B, 12B, 4B, 1B)'. Per the schema, marketing names without snapshot dates do not count.",
    239           "source": "opus"
    240         },
    241         "prompts_provided": {
    242           "applies": true,
    243           "answer": true,
    244           "justification": "Figure 15 (Appendix C) provides the full prompt text used for crash site analysis, including the system instruction, variable format specifications, examples, and the <issue> tag structure. The patch templates are fully specified in Figures 12 and 13.",
    245           "source": "opus"
    246         },
    247         "hyperparameters_reported": {
    248           "applies": true,
    249           "answer": false,
    250           "justification": "No LLM hyperparameters (temperature, top-p, max tokens) are reported for any of the models tested. These settings significantly affect output and are not mentioned.",
    251           "source": "opus"
    252         },
    253         "scaffolding_described": {
    254           "applies": true,
    255           "answer": true,
    256           "justification": "The WILLIAMT workflow is described in detail: Figure 1 shows the pipeline, Section 4 describes regex-based context retrieval and template-guided patch generation, Appendix B provides the vulnerability templates, and Appendix C details the regex parsing and prompt preparation.",
    257           "source": "opus"
    258         },
    259         "data_preprocessing_documented": {
    260           "applies": true,
    261           "answer": true,
    262           "justification": "Bug selection criteria are stated: 'all HOF, SOF, UAF and GOF bugs (358 bugs) that can be compiled within 15 minutes following the recommended practice' from ARVO. Appendix C documents the regex-based preprocessing of AddressSanitizer reports including stack trace parsing, bug type classification, and code context extraction.",
    263           "source": "opus"
    264         }
    265       },
    266       "data_integrity": {
    267         "raw_data_available": {
    268           "applies": true,
    269           "answer": false,
    270           "justification": "No raw data is released — no patch outputs, LLM responses, per-bug results, or detailed logs. Only aggregate results are presented. The ARVO benchmark is public but the experimental outputs are not.",
    271           "source": "opus"
    272         },
    273         "data_collection_described": {
    274           "applies": true,
    275           "answer": true,
    276           "justification": "Data collection is described: ARVO benchmark with 5,000+ bugs, filtered to 4 bug types (HBO, GBO, SBO, UAF) that compile within 15 minutes, yielding 358 bugs. Selection follows 'the recommended practice' from [60].",
    277           "source": "opus"
    278         },
    279         "recruitment_methods_described": {
    280           "applies": false,
    281           "answer": false,
    282           "justification": "No human participants. The data source is a standard public benchmark (ARVO).",
    283           "source": "opus"
    284         },
    285         "data_pipeline_documented": {
    286           "applies": true,
    287           "answer": false,
    288           "justification": "The pipeline from ARVO (5,000+ bugs) to 358 evaluation bugs lacks intermediate counts. How many HBO, GBO, SBO, UAF bugs exist in ARVO before the 15-minute compile filter? How many were excluded by the compile filter? These intermediate steps are not documented.",
    289           "source": "opus"
    290         }
    291       },
    292       "contamination": {
    293         "training_cutoff_stated": {
    294           "applies": true,
    295           "answer": false,
    296           "justification": "No training data cutoff date is stated for any of the LLMs used (GPT-4o, DeepSeek, Claude, Gemma3). The ARVO bugs come from publicly accessible OSS-Fuzz reports and open-source git histories that could appear in training data.",
    297           "source": "opus"
    298         },
    299         "train_test_overlap_discussed": {
    300           "applies": true,
    301           "answer": false,
    302           "justification": "No discussion of whether the LLMs may have seen ARVO bug reports, crash traces, or ground-truth patches during training. The OSS-Fuzz reports and open-source commit histories containing the fixes are publicly available and likely in LLM training corpora.",
    303           "source": "opus"
    304         },
    305         "benchmark_contamination_addressed": {
    306           "applies": true,
    307           "answer": false,
    308           "justification": "ARVO bugs originate from OSS-Fuzz, with ground-truth fixes in public git repositories. These fixes were committed before most models' training cutoffs, creating significant contamination risk. This is not discussed or addressed.",
    309           "source": "opus"
    310         }
    311       },
    312       "human_studies": {
    313         "pre_registered": {
    314           "applies": false,
    315           "answer": false,
    316           "justification": "No human participants in this study.",
    317           "source": "opus"
    318         },
    319         "irb_or_ethics_approval": {
    320           "applies": false,
    321           "answer": false,
    322           "justification": "No human participants in this study.",
    323           "source": "opus"
    324         },
    325         "demographics_reported": {
    326           "applies": false,
    327           "answer": false,
    328           "justification": "No human participants in this study.",
    329           "source": "opus"
    330         },
    331         "inclusion_exclusion_criteria": {
    332           "applies": false,
    333           "answer": false,
    334           "justification": "No human participants in this study.",
    335           "source": "opus"
    336         },
    337         "randomization_described": {
    338           "applies": false,
    339           "answer": false,
    340           "justification": "No human participants in this study.",
    341           "source": "opus"
    342         },
    343         "blinding_described": {
    344           "applies": false,
    345           "answer": false,
    346           "justification": "No human participants in this study.",
    347           "source": "opus"
    348         },
    349         "attrition_reported": {
    350           "applies": false,
    351           "answer": false,
    352           "justification": "No human participants in this study.",
    353           "source": "opus"
    354         }
    355       },
    356       "cost_and_practicality": {
    357         "inference_cost_reported": {
    358           "applies": true,
    359           "answer": true,
    360           "justification": "Detailed per-bug inference costs are reported: $0.0026/bug for WILLIAMT vs $0.93/bug for CodeRover-S (Figure 5a). Figure 7 breaks down cost across all frontier LLMs, with DeepSeek-V3 at <0.03 cents/bug and Claude35-Haiku at ~0.09 cents/bug.",
    361           "source": "opus"
    362         },
    363         "compute_budget_stated": {
    364           "applies": true,
    365           "answer": true,
    366           "justification": "Hardware is specified (AMD EPYC 7302P, 64GB RAM, RTX 4090, Mac Mini M4). Execution times are reported: WILLIAMT <1 minute vs CodeRover-S ~43.5 minutes. Figure 8 provides detailed time breakdown for preprocessing and each LLM backend. Total cost for all 358 bugs: <$0.68 for WILLIAMT.",
    367           "source": "opus"
    368         }
    369       },
    370       "experimental_rigor": {
    371         "seed_sensitivity_reported": {
    372           "applies": true,
    373           "answer": false,
    374           "justification": "No seed sensitivity analysis. WILLIAMT uses a one-shot design and all results appear to be single-run. LLM output stochasticity is not addressed.",
    375           "source": "opus"
    376         },
    377         "number_of_runs_stated": {
    378           "applies": true,
    379           "answer": true,
    380           "justification": "WILLIAMT is explicitly described as one-shot: 'WILLIAMT performs a single trial and applies only one patch throughout the entire fixing pipeline' (Section 5.1). CodeRover-S is stated to use 'up to three trials per bug' with 'up to 18 attempts.'",
    381           "source": "opus"
    382         },
    383         "hyperparameter_search_budget": {
    384           "applies": true,
    385           "answer": false,
    386           "justification": "No discussion of how the prompt template, context window size (2 lines before/after), or patch templates were developed or tuned. No search budget reported.",
    387           "source": "opus"
    388         },
    389         "best_config_selection_justified": {
    390           "applies": true,
    391           "answer": false,
    392           "justification": "No explanation of how the specific prompt design, template structure, or context window size were selected. The paper presents one configuration without discussing alternatives tried.",
    393           "source": "opus"
    394         },
    395         "multiple_comparison_correction": {
    396           "applies": false,
    397           "answer": false,
    398           "justification": "No statistical tests are performed, so multiple comparison correction is not applicable.",
    399           "source": "opus"
    400         },
    401         "self_comparison_bias_addressed": {
    402           "applies": true,
    403           "answer": false,
    404           "justification": "The authors compare their own WILLIAMT against SoTA tools. While they import CodeRover-S results from its original paper for fairness, they do not acknowledge the general bias of evaluating one's own system or discuss how this might affect template/prompt design choices.",
    405           "source": "opus"
    406         },
    407         "compute_budget_vs_performance": {
    408           "applies": true,
    409           "answer": true,
    410           "justification": "Performance is explicitly analyzed as a function of cost: Figures 5a-5b compare cost and time between WILLIAMT and CodeRover-S, Figure 7 shows cost vs performance across LLM backends, and Section 5.3 discusses cost-performance tradeoffs for reasoning vs non-reasoning models.",
    411           "source": "opus"
    412         },
    413         "benchmark_construct_validity": {
    414           "applies": true,
    415           "answer": false,
    416           "justification": "The paper describes ARVO as a reliable benchmark and notes it 'ensures that all bugs are both ground-truth and reproducible' but does not question whether fixing 358 ARVO bugs of 4 types generalizes to real-world APR capability. No discussion of whether ARVO's specific bug distribution represents real-world vulnerability patterns.",
    417           "source": "opus"
    418         },
    419         "scaffold_confound_addressed": {
    420           "applies": true,
    421           "answer": true,
    422           "justification": "When comparing LLM backends (Figure 6), all models use the same WILLIAMT scaffold, isolating the model variable. The SoTA comparison (Figure 4) compares complete systems as intended — the scaffold IS the thing being tested.",
    423           "source": "opus"
    424         }
    425       },
    426       "data_leakage": {
    427         "temporal_leakage_addressed": {
    428           "applies": true,
    429           "answer": false,
    430           "justification": "Not discussed. ARVO bugs have ground-truth fixes committed to public repositories before the LLMs' training cutoffs. Models may have learned the exact patches from training data.",
    431           "source": "opus"
    432         },
    433         "feature_leakage_addressed": {
    434           "applies": true,
    435           "answer": false,
    436           "justification": "Not discussed. The crash-site analysis prompt (Figure 15) provides sanitizer output and source context, which is the intended input. However, the paper offers no analysis of whether the LLM might be retrieving memorized fixes rather than reasoning about the crash site.",
    437           "source": "opus"
    438         },
    439         "non_independence_addressed": {
    440           "applies": true,
    441           "answer": false,
    442           "justification": "Not discussed. Multiple bugs may come from the same project or share similar code patterns. No analysis of whether results are driven by a few projects or are independent across bugs.",
    443           "source": "opus"
    444         },
    445         "leakage_detection_method": {
    446           "applies": true,
    447           "answer": false,
    448           "justification": "No leakage detection or prevention method is used. No canary strings, membership inference tests, temporal splits, or decontamination procedures.",
    449           "source": "opus"
    450         }
    451       }
    452     }
    453   },
    454   "claims": [
    455     {
    456       "claim": "WILLIAMT reduces token cost by 99.7% compared to CodeRover-S while retaining over 86.7% of CodeRover-S's bug-fixing performance",
    457       "evidence": "WILLIAMT uses GPT-4o and achieves 46.1% plausible fixing rate vs CodeRover-S's 54.5%, at $0.0026/bug vs $0.93/bug — a 357x cost reduction (Section 5.1, Figure 5a)",
    458       "supported": "strong"
    459     },
    460     {
    461       "claim": "Combining WILLIAMT with CodeRover-S achieves 29.6% more bugs fixed and 45.9% cost reduction compared to CodeRover-S alone",
    462       "evidence": "The combined pipeline fixes 60 additional plausible bugs (73.5% total) while reducing cost by 45.9%; supported by Section 5.2 and Figure 5c Venn diagram showing largely disjoint fix sets",
    463       "supported": "strong"
    464     },
    465     {
    466       "claim": "Non-reasoning models outperform reasoning models on crash-site repair due to template guidance eliminating the need for heavy reasoning",
    467       "evidence": "Claude35-Haiku (non-reasoning) achieves the highest fixing rate at 47.5%; DeepSeek-R1 and o3-mini are 5.7x-5.9x more expensive without better performance (Section 5.3, Figures 6-7)",
    468       "supported": "moderate"
    469     },
    470     {
    471       "claim": "Local LLMs running on consumer hardware can achieve reasonable APR performance: Gemma3:27B reaches 96.4% of GPT-4o's fixing performance on an RTX 4090 or Mac Mini M4",
    472       "evidence": "Figure 6 shows Gemma3:27B fixes 163/358 bugs vs GPT-4o's 165/358; Mac Mini M4 test with Gemma3:4b 'performs on par to RTX 4090' (Section 5.3)",
    473       "supported": "moderate"
    474     },
    475     {
    476       "claim": "The standard 'plausible fix' metric overstates repair quality: only 34% of plausible patches (56/165) are genuinely correct across a broader range of inputs",
    477       "evidence": "Appendix A: of 165 plausible patches, 95 avoid early exit on PoC, 56 are manually verified correct for other inputs; 39 block valid inputs (Figure 9)",
    478       "supported": "strong"
    479     },
    480     {
    481       "claim": "WILLIAMT and CodeRover-S fix largely disjoint sets of bugs, indicating complementary rather than overlapping capabilities",
    482       "evidence": "Figure 5c: WILLIAMT fixes 50 unique bugs, CodeRover-S fixes 52 unique bugs, with only 46+15 in the overlap region; described in Section 5.2",
    483       "supported": "strong"
    484     }
    485   ],
    486   "methodology_tags": [
    487     "benchmark-eval"
    488   ],
    489   "key_findings": "WILLIAMT achieves 46.1% plausible bug-fixing rate on 358 ARVO memory-corruption bugs at $0.0026/bug — a 357x cost reduction vs CodeRover-S ($0.93/bug) while retaining 86.7% of its performance. Combining WILLIAMT (first pass) with CodeRover-S (second pass on failures) achieves 73.5% overall fixing rate with 45.9% cost reduction, exploiting the largely complementary fix sets of the two approaches. Template guidance reduces LLM reasoning requirements: non-reasoning Claude35-Haiku outperforms all reasoning models on this task, and local Gemma3:27B achieves 96.4% of GPT-4o performance. However, manual review reveals that the standard plausible-fix metric substantially overstates actual repair quality: only 56/165 plausible patches (34%) are genuinely correct across inputs beyond the original PoC, yielding a true repair rate of ~15.6%.",
    490   "red_flags": [
    491     {
    492       "flag": "Title inflation / misleading cost claim",
    493       "detail": "The title '7,400 bugs for $1' derives from the cheapest possible configuration (DeepSeek-V3 at ~0.0135 cents/bug), not the primary GPT-4o evaluation ($0.0026/bug ≈ 384 bugs per dollar); the headline claim is the best case, not the representative case."
    494     },
    495     {
    496       "flag": "Plausible metric greatly overstates correctness",
    497       "detail": "Manual review of WILLIAMT-GPT-4o shows only 56/165 plausible patches (34%) are correct for inputs beyond the PoC — yielding a true repair rate of ~15.6% vs the reported 46.1% plausible rate. This analysis was not performed for CodeRover-S or other baselines, making comparisons asymmetric."
    498     },
    499     {
    500       "flag": "Code not released",
    501       "detail": "WILLIAMT source code is promised only 'upon paper acceptance'; reproduction is not possible from the paper alone."
    502     },
    503     {
    504       "flag": "Benchmark contamination unaddressed",
    505       "detail": "ARVO bugs are real-world vulnerabilities from public OSS-Fuzz with published fixes in public repositories; LLMs may have been trained on the exact bug-fix commits, inflating all models' performance. This risk is never mentioned."
    506     },
    507     {
    508       "flag": "No ablation of system components",
    509       "detail": "The paper attributes cost reduction to both regex-based retrieval AND template-guided generation, but no ablation tests either component in isolation; individual contribution of each mechanism is unknown."
    510     },
    511     {
    512       "flag": "No statistical significance testing",
    513       "detail": "All comparative claims (WILLIAMT vs. baselines, LLM-to-LLM comparisons) are made without significance tests; differences of a few percentage points between LLMs (e.g., 46.1% vs 47.5%) are treated as meaningful without uncertainty quantification."
    514     },
    515     {
    516       "flag": "Google conflict of interest undisclosed",
    517       "detail": "Three co-authors are from Google/Google DeepMind; the evaluation extensively benchmarks Google's Gemma3 series with favorable results; no competing interests disclosure appears."
    518     },
    519     {
    520       "flag": "No per-bug-category fixing rate breakdown",
    521       "detail": "Bug type distribution is shown (HBO 52.3%, UAF 11.3%, etc.) but fixing rates are never broken down by category, making it impossible to assess whether crash-site repair works uniformly or only for specific vulnerability types."
    522     }
    523   ],
    524   "cited_papers": [
    525     {
    526       "title": "Fixing Security Vulnerabilities with AI in OSS-Fuzz (AutoCodeRover-S)",
    527       "relevance": "Primary baseline; the paper directly compares to and combines with CodeRover-S, making this the central comparison point for APR performance claims"
    528     },
    529     {
    530       "title": "Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 each using ChatGPT (Agentless)",
    531       "relevance": "Second primary baseline and directly comparable cost-efficiency claim in the same domain"
    532     },
    533     {
    534       "title": "ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software",
    535       "relevance": "The benchmark used for all evaluation; critical for understanding the dataset's scope and reproducibility guarantees"
    536     },
    537     {
    538       "title": "Out of Sight, Out of Mind: Better Automatic Vulnerability Repair by Broadening Input Ranges and Sources (VulMaster)",
    539       "relevance": "Third baseline system evaluated; performs poorly (5 plausible fixes), providing contrast for WILLIAMT's approach"
    540     },
    541     {
    542       "title": "Template-guided Program Repair in the Era of Large Language Models",
    543       "relevance": "Prior work on template-guided APR that motivates WILLIAMT's approach"
    544     },
    545     {
    546       "title": "AutoCodeRover: Autonomous Program Improvement",
    547       "relevance": "Earlier version of primary baseline; understanding the progression from AutoCodeRover to CodeRover-S is important for evaluating the competitive landscape"
    548     },
    549     {
    550       "title": "AddressSanitizer: A Fast Address Sanity Checker",
    551       "relevance": "Core infrastructure: WILLIAMT relies on ASan reports as input; understanding ASan capabilities determines scope of crash-site repair approach"
    552     },
    553     {
    554       "title": "Less Training, More Repairing Please: Revisiting Automated Program Repair via Zero-Shot Learning",
    555       "relevance": "Establishes zero-shot LLM-based APR paradigm that WILLIAMT builds upon"
    556     }
    557   ],
    558   "engagement_factors": {
    559     "practical_relevance": {
    560       "score": 3,
    561       "justification": "Directly addresses a real developer pain point (fuzzer bug backlog), designed for local deployment on consumer hardware (Mac Mini M4), and shows concrete cost savings."
    562     },
    563     "surprise_contrarian": {
    564       "score": 2,
    565       "justification": "Challenges the assumption that root-cause analysis and expensive frontier models are necessary for effective program repair; shows crash-site patches can be useful."
    566     },
    567     "fear_safety": {
    568       "score": 1,
    569       "justification": "Addresses security vulnerabilities (memory corruption) but proposes mitigations rather than raising new concerns."
    570     },
    571     "drama_conflict": {
    572       "score": 1,
    573       "justification": "Implicitly argues expensive agent-based APR approaches are overkill for many bugs, but framed constructively as complementary rather than adversarial."
    574     },
    575     "demo_ability": {
    576       "score": 0,
    577       "justification": "Code is not released; promised only upon acceptance. No live demo or installable tool available."
    578     },
    579     "brand_recognition": {
    580       "score": 2,
    581       "justification": "Three authors are from Google/Google DeepMind and two from EPFL: recognizable institutions, but the paper does not concern a flagship Google product."
    582     }
    583   },
    584   "hn_data": {
    585     "threads": [
    586       {
    587         "hn_id": "45444062",
    588         "title": "Machine Learnability as a Measure of Order in Aperiodic Sequences",
    589         "points": 48,
    590         "comments": 5,
    591         "url": "https://news.ycombinator.com/item?id=45444062"
    592       },
    593       {
    594         "hn_id": "46697408",
    595         "title": "WildCAT3D: Appearance-Aware Multi-View Diffusion in the Wild",
    596         "points": 3,
    597         "comments": 0,
    598         "url": "https://news.ycombinator.com/item?id=46697408"
    599       },
    600       {
    601         "hn_id": "43401539",
    602         "title": "CriteoPrivateAd: RealWorld Bidding Dataset to Design Private Advertising Systems",
    603         "points": 2,
    604         "comments": 1,
    605         "url": "https://news.ycombinator.com/item?id=43401539"
    606       },
    607       {
    608         "hn_id": "43516923",
    609         "title": "UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation",
    610         "points": 2,
    611         "comments": 0,
    612         "url": "https://news.ycombinator.com/item?id=43516923"
    613       },
    614       {
    615         "hn_id": "43496516",
    616         "title": "UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation",
    617         "points": 2,
    618         "comments": 0,
    619         "url": "https://news.ycombinator.com/item?id=43496516"
    620       },
    621       {
    622         "hn_id": "36016970",
    623         "title": "Visual Question Answering: Techniques and Common Trends in Recent Literature",
    624         "points": 2,
    625         "comments": 0,
    626         "url": "https://news.ycombinator.com/item?id=36016970"
    627       },
    628       {
    629         "hn_id": "44686218",
    630         "title": "The Heteronomy of Algorithms",
    631         "points": 1,
    632         "comments": 0,
    633         "url": "https://news.ycombinator.com/item?id=44686218"
    634       },
    635       {
    636         "hn_id": "47380252",
    637         "title": "Show HN: Karpathy's Autoresearch with Evolutionary Database",
    638         "points": 1,
    639         "comments": 0,
    640         "url": "https://news.ycombinator.com/item?id=47380252"
    641       },
    642       {
    643         "hn_id": "40515506",
    644         "title": "Evaluating AI-Generated Code for C++, Fortran, Go, Java, Julia, Matlab, etc.",
    645         "points": 1,
    646         "comments": 2,
    647         "url": "https://news.ycombinator.com/item?id=40515506"
    648       },
    649       {
    650         "hn_id": "43104988",
    651         "title": "Aide: AI-Driven Exploration in the Space of Code (Arxiv)",
    652         "points": 1,
    653         "comments": 1,
    654         "url": "https://news.ycombinator.com/item?id=43104988"
    655       }
    656     ],
    657     "top_points": 48,
    658     "total_points": 63,
    659     "total_comments": 9
    660   }
    661 }