calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (24351B)
{
  "paper_slug": "accelerating-automatic-program-2025",
  "total_questions": 50,
  "agreement_count": 45,
  "disagreement_count": 5,
  "agreement_rate": 0.90,
  "disagreements": [
    {
      "category": "artifacts",
      "question": "code_released",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "Appendix H provides a link to https://anonymous.4open.science/r/SelRepair-5F1D/. However, this is an anonymous review platform link (anonymous.4open.science), which is ephemeral and used during peer review — not a permanent repository or archive. The schema requires 'a working URL or archive.' Anonymous review links are temporary and may be taken down after the review process. Under strict interpretation, this does not constitute a reliable, permanent code release. The paper is also an arXiv preprint, suggesting it is still under review.",
      "sonnet_justification": "The paper provides a URL in Appendix H: 'We make our approach available at https://anonymous.4open.science/r/SelRepair-5F1D/'. This is a working link to an anonymous repository, not merely a promise of future release.",
      "direction": "sonnet_generous"
    },
    {
      "category": "statistical_methodology",
      "question": "effect_sizes_reported",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "The paper reports percentage improvements with full baseline context throughout. For example, Table 1 shows SelRepair at 26.29% EM vs RAP-Gen at 24.80% EM, and the text reports percentage improvements like '6.01%' improvement over RAP-Gen. The schema explicitly states: 'A paper that says 12% improvement over baseline (from 45% to 57%) provides enough context for YES.' Both absolute scores and relative improvements are consistently provided across all comparisons in Tables 1-3, giving readers sufficient context to assess the magnitude of effects. Formal effect sizes like Cohen's d are not required by the schema.",
      "sonnet_justification": "Percentage improvements are reported (e.g., 'outperforms GPT-3.5 by 918.99%') but these are relative improvements without statistical context. Effect sizes in the formal sense (Cohen's d, etc.) are not reported.",
      "direction": "opus_generous"
    },
    {
      "category": "evaluation_design",
      "question": "human_evaluation",
      "sonnet_applies": false,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "Human evaluation is relevant to APR — automated metrics like EM only capture exact string matches and miss semantically correct alternative fixes. BLEU and CodeBLEU measure surface similarity, not semantic correctness. Human evaluation of patch quality would strengthen claims about repair effectiveness. The schema says applies=false only 'if human evaluation is clearly irrelevant to the claims.' For code repair, it is not clearly irrelevant — the paper itself acknowledges in the threats section that EM and BLEU may not fully reflect repair correctness. The paper could reasonably be expected to include human evaluation.",
      "sonnet_justification": "Human evaluation is not applicable to this paper. The evaluation task (automated program repair) uses automated metrics (EM, BLEU-4, CodeBLEU) and pass/fail on test cases, which is the standard and appropriate method for assessing patch correctness.",
      "direction": "applies_boundary"
    },
    {
      "category": "setup_transparency",
      "question": "prompts_provided",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "Section 3.2 shows the code-only prompt format as a template: '[BUG] RBC1 [FIX] RFC1 ... [BUG] BC [FIX]' where RBC1, RFC1, BC are placeholders for actual code. Figure 5 shows the GPT-3.5/GPT-4o prompt template with placeholder sections ('Target Buggy Code', 'Retrieved Bug-fix Pairs'). The schema explicitly states: 'A prompt TEMPLATE with placeholders (e.g., [Task Description]) does NOT count unless the actual fill values are also provided — the reader must be able to reconstruct every prompt sent to the model.' While Figure 6 shows one concrete example, the reader cannot reconstruct the full set of prompts used across thousands of test samples. The prompt structure is shown but not the actual prompts.",
      "sonnet_justification": "Figure 5 in Appendix E.4 shows the actual GPT-3.5 and GPT-4o prompt template with the system prompt, target buggy code placement, retrieved bug-fix pairs section, and end prompt text. The code-only prompt format for SelRepair is shown explicitly in Section 3.2: '[BUG] RBC1 [FIX] RFC1 [BUG] RBC2 [FIX] RFC2 ... [BUG] BC [FIX]'.",
      "direction": "sonnet_generous"
    },
    {
      "category": "data_integrity",
      "question": "raw_data_available",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "While the Tufano and VulRepair benchmarks are publicly available as original datasets, the paper's specific processed versions (exact random splits, RAG codebase selection of 1,000/2,000 samples) are not verifiably released. The anonymous repository link (anonymous.4open.science) is ephemeral. The enterprise dataset of 200 samples is explicitly not released ('we intend to open-source this benchmark in the future'). For independent verification of the reported results, one needs the exact data splits and enterprise data used — the original public benchmarks alone are insufficient since random sampling was used to create splits.",
      "sonnet_justification": "The primary training and test data (Tufano dataset, VulRepair dataset) are public benchmarks with known sources. The anonymous repository link (Appendix H) presumably makes the preprocessed data available. However, the enterprise dataset of 200 samples is not publicly available.",
      "direction": "sonnet_generous"
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": false,
        "justification": "Appendix H provides a link to https://anonymous.4open.science/r/SelRepair-5F1D/. However, this is an anonymous review platform link (anonymous.4open.science), which is ephemeral and used during peer review. The schema requires 'a working URL or archive' — anonymous review links are temporary and may not persist. Under strict interpretation, an anonymized review link is not a permanent release."
      },
      "data_released": {
        "applies": true,
        "answer": true,
        "justification": "The paper uses two public datasets: the Tufano dataset (Tufano et al., 2019) collected from GitHub commit records and VulRepair (Fu et al., 2022). These are publicly available benchmarks. The enterprise dataset is not released but the primary evaluation benchmarks are public."
      },
      "environment_specified": {
        "applies": true,
        "answer": true,
        "justification": "Appendix E.3 specifies: '4 GPUs of NVIDIA GeForce RTX 3090', base model StarCoder2-7B, optimizer Adam, learning rate 5e-5, context window sizes (512, 1024, 1500 tokens), and fine-tuning epochs (3 for large LLM, 50 for CodeT5). This provides sufficient detail to recreate the environment."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "No step-by-step reproduction instructions are provided in the paper. While hyperparameters and code are referenced via an anonymous link, there is no README, 'Reproducing Results' section, or scripts described for replicating the main experiments."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "No confidence intervals or error bars are reported. All results in Tables 1-5 are point estimates (e.g., '26.29% EM') with no uncertainty quantification."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "No statistical significance tests are used. Claims like 'SelRepair achieves new SoTA performance... outperforming other SoTA LLMs' are based on comparing point estimates without p-values, t-tests, or any other significance testing."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": true,
        "justification": "The paper reports percentage improvements with baseline context throughout: e.g., SelRepair at 26.29% EM vs RAP-Gen at 24.80% EM (Table 1). The schema states 'A paper that says 12% improvement over baseline (from 45% to 57%) provides enough context for YES.' Both absolute scores and relative improvements are provided for all comparisons, giving readers sufficient context to judge magnitude."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The test set sizes (5,735, 6,447, 821, 200 samples) come from the public dataset splits or enterprise data without any justification for why these sizes are adequate for the claims made. No power analysis is discussed."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "No variance, standard deviation, or spread measures are reported across experimental runs. All results appear to be single-run point estimates with no mention of multiple runs or seed variation."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "Multiple baselines are compared: GPT-3.5, GPT-4o, DeepSeek-R1-Distill, RAP-Gen, and several variants of SelRepair (SelRepairLlama, SelRepairT5, SelRepairLoRA). Results in Tables 1 and 5."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "Baselines include GPT-4o, DeepSeek-R1-Distill (2025), and RAP-Gen (ESEC/FSE 2023). These are contemporary and include both general-purpose LLMs and specialized APR approaches."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "Table 2 (Section 4.3, RQ2) presents a thorough ablation: w/o RAG & Fine-tuning, w/o Fine-tuning, w/o RAG, w/o SR, w/o SSDR. Each component's contribution is measured across all three datasets."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "Three evaluation metrics are used: Exact Match (EM), BLEU-4, and CodeBLEU, reported consistently across all experiments in Tables 1-3."
      },
      "human_evaluation": {
        "applies": true,
        "answer": false,
        "justification": "No human evaluation is performed. All evaluation uses automated metrics (EM, BLEU-4, CodeBLEU). Human evaluation of patch quality would be relevant — EM only captures exact matches and BLEU-based metrics only measure surface similarity, so semantically correct but syntactically different patches are missed. The schema states NA only 'if human evaluation is clearly irrelevant to the claims' — for APR, human evaluation of patch correctness is relevant."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": true,
        "justification": "The paper explicitly describes 80/10/10 train/validation/test splits (Appendix E.1): 'we split 80% of the dataset as a training set, 10% as a validation set, and 10% as a test set.' Results are reported on the held-out test sets."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Results are broken down by dataset (Tufano Subset 1 for <50 tokens, Tufano Subset 2 for 50-100 tokens, VulRepair for C/C++), by threshold setting (Table 3), and by component ablation (Table 2). Defects4J results further broken into V1.2 and V2.0 (Table 5)."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": true,
        "justification": "Section F.2 presents a case study discussing failures of compared approaches: SelRepairT5 and SelRepairLoRA generate same code as buggy, SelRepairLlama misunderstands the method, GPT-3.5/4o make invalid modifications, DeepSeek-R1-Distill outputs excessively long reasoning without generating repaired code."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "The ablation study shows 'w/o RAG & Ft' and 'w/o Ft' achieve 0.00 EM. SelRepairLlama performs poorly (5.96 EM). Table 3 shows certain threshold settings hurt performance. SelRepair achieves close to but does not exceed RAP-Gen on Defects4J V2.0 (11 vs 12 correct patches)."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims '26.29% and 17.64% in terms of exact match (EM)' are confirmed in Table 1. The claim of 'reducing inference time by at least 6.42%' is confirmed in Table 3 (6.42%, 29.68%, 15.96% reductions across datasets)."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "Causal claims about component contributions are supported by ablation studies (Table 2) that systematically remove individual components (w/o RAG, w/o SR, w/o SSDR, w/o Fine-tuning). This controlled single-variable manipulation is adequate for component-level causal inference per the schema guidance."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "The title 'Accelerating Automatic Program Repair' and abstract use broad language, but evaluation is limited to Java and C/C++ at the method level. The threats-to-validity section acknowledges untested generalization to Python/JavaScript, but the abstract and title do not adequately bound the scope of claims."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The threats to validity section (Appendix G) discusses generic limitations (language coverage, metric adequacy) but does not consider specific alternative explanations for the observed results, such as whether the training data overlap or dataset characteristics (rather than the method itself) explain the performance differences."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": false,
        "justification": "StarCoder2-7B is specified for the fine-tuned model. However, 'GPT-3.5' and 'GPT-4o' are used as baselines without version numbers, snapshot dates, or API version identifiers. The schema requires specific model names with version (e.g., 'gpt-4-0613'). Marketing names like 'GPT-4o' without snapshot date do NOT count as specified versions per the schema."
      },
      "prompts_provided": {
        "applies": true,
        "answer": false,
        "justification": "Section 3.2 shows the code-only prompt format as a template: '[BUG] RBC1 [FIX] RFC1 ... [BUG] BC [FIX]' where RBC1, RFC1, BC are placeholders for actual code. Figure 5 shows the GPT-3.5/GPT-4o prompt template with placeholder sections. The schema explicitly states: 'A prompt TEMPLATE with placeholders does NOT count unless the actual fill values are also provided.' While Figure 6 shows one concrete example, the reader cannot reconstruct the full set of prompts used across thousands of test samples."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Appendix E.3 reports: fine-tuning epochs (3 for large LLM, 50 for CodeT5), context window sizes (512, 1024, 1500 tokens), learning rate (5e-5), optimizer (Adam), RAG thresholds (0.9, 0.8), beam size (10 for Defects4J), and hardware (4x NVIDIA RTX 3090)."
      },
      "scaffolding_described": {
        "applies": true,
        "answer": true,
        "justification": "The dual RAG architecture is described in detail: Section 3.1 covers the hybrid retriever with SR and SSDR components, Algorithm 1 (Appendix C) formalizes the procedure, the RAG selection gate mechanism is described with threshold logic, and Algorithm 2 (Appendix D) documents AST traversal."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Appendix E.1 documents: random sampling of 1,000 samples for RAG codebase (2,000 for VulRepair), 80/10/10 train/validation/test split, filtering of 'invalid samples, such as samples that were null' from VulRepair. Table 4 reports final dataset sizes for each split."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "The paper has a dedicated 'Limitations' section after the conclusion discussing method-level scope constraints and LLM parameter size challenges. Additionally, Appendix G presents 'Threats to Validity' covering internal, external, and construct validity."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": true,
        "justification": "Appendix G identifies specific threats: (1) errors in bug-fix dataset and overfitting during fine-tuning, (2) coarse-grained threshold settings for different code lengths, (3) untested generalization to Python/JavaScript, (4) reliance on EM and BLEU-4 metrics. These are reasonably specific to this study."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": true,
        "justification": "The Limitations section states the method is 'limited by datasets focused on individual methods' and the threats section acknowledges 'effectiveness with other languages like Python or JavaScript is untested.' These are specific scope boundaries about what was not tested."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "While the Tufano and VulRepair benchmarks are publicly available as original datasets, the paper's specific processed versions (exact random splits, RAG codebase selection of 1,000/2,000 samples) are not verifiably released. The anonymous repository link (anonymous.4open.science) is ephemeral. The enterprise dataset of 200 samples is explicitly not released ('we intend to open-source this benchmark in the future'). For independent verification of the reported results, one needs the exact data splits and enterprise data used."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Appendix E.1 describes data sources: Tufano dataset 'collected from fix commit records from GitHub', VulRepair 'consists of bug-fix pairs combined by CVE-Fixes and BigVul', enterprise dataset 'consists of 200 semantic bug-fix pairs caused by enterprise developers in real development scenarios.'"
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants in this study. Data sources are public benchmarks and enterprise code. NA for this criterion."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "The pipeline is documented: random sampling for RAG codebase (1,000 for Tufano, 2,000 for VulRepair), 80/10/10 splits for remaining samples, filtering of null/invalid samples from VulRepair. Table 4 reports final sizes for each split."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No acknowledgments or funding disclosure section is present in the paper. The absence of any funding statement is NO per the schema."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "Author affiliations are listed: Xiaoheng Xie, Peng Di, and Yu Zhang are from Ant Group. The academic affiliations (Sun Yat-sen University, Hong Kong Baptist University, Hong Kong Metropolitan University) are also listed."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "Three authors are from Ant Group, and the enterprise dataset appears to originate from Ant Group. Without funding disclosure, independence cannot be verified. The enterprise evaluation (RQ4) may represent a conflict where the company's data is used to validate the company's researchers' method."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests statement or financial interest disclosure is present in the paper. The schema states: 'absence of disclosure is not the same as absence of conflict.'"
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": true,
        "answer": false,
        "justification": "No training data cutoff dates are stated for GPT-3.5, GPT-4o, or StarCoder2-7B. The Tufano dataset (published 2019) is widely used and likely in the training data of these models."
      },
      "train_test_overlap_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss potential train-test overlap for GPT-3.5, GPT-4o, or StarCoder2-7B with the benchmark datasets. Section F.3 notes RAP-Gen's data leakage concern on Defects4J but does not apply the same scrutiny to SelRepair's own models."
      },
      "benchmark_contamination_addressed": {
        "applies": true,
        "answer": false,
        "justification": "The Tufano dataset (2019) was almost certainly in the training data of GPT-3.5, GPT-4o, and StarCoder2-7B (all trained on large code corpora post-2019). The paper does not address this contamination risk for its own approach."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No human participants in this study. All evaluation uses automated metrics on code datasets."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. IRB approval is not applicable."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Demographics not applicable."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Inclusion/exclusion criteria for participants not applicable."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Randomization of participants not applicable."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Blinding not applicable."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Attrition not applicable."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": true,
        "justification": "Table 3 reports inference time reductions (6.42%, 13.77%, 9.95%, 29.68%, 15.96%) and average input token lengths for different threshold configurations. This constitutes practical efficiency reporting."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "Hardware is stated (4x NVIDIA RTX 3090) but total GPU hours, training time, or total compute budget are not quantified. The paper does not report how long fine-tuning took or the total compute cost."
      }
    }
  }
}
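
The summary fields at the top of the file can be cross-checked against the rest of the record. The following is a minimal sketch, not part of the repository: the function name `check_calibration` and the trimmed `sample` record are hypothetical, and the meaning of each field is inferred from the field names alone.

```python
import math


def check_calibration(record: dict) -> list[str]:
    """Return a list of internal inconsistencies (empty if none are found)."""
    problems = []
    total = record["total_questions"]
    agree = record["agreement_count"]
    disagree = record["disagreement_count"]

    # Agreements and disagreements should partition the question set.
    if agree + disagree != total:
        problems.append("counts do not sum to total_questions")
    if not math.isclose(record["agreement_rate"], agree / total):
        problems.append("agreement_rate != agreement_count / total_questions")
    if len(record["disagreements"]) != disagree:
        problems.append("disagreements length != disagreement_count")

    # The checklist should hold one entry per question.
    checklist = record["opus_checklist"]
    if sum(len(questions) for questions in checklist.values()) != total:
        problems.append("checklist size != total_questions")

    # Each disagreement should repeat the verdict stored in the checklist.
    for d in record["disagreements"]:
        entry = checklist.get(d["category"], {}).get(d["question"])
        if entry is None:
            problems.append(f"{d['question']}: missing from checklist")
        elif (entry["applies"], entry["answer"]) != (d["opus_applies"], d["opus_answer"]):
            problems.append(f"{d['question']}: verdict differs from checklist")
    return problems


# Hypothetical two-question record with the same shape as the file above.
sample = {
    "total_questions": 2,
    "agreement_count": 1,
    "disagreement_count": 1,
    "agreement_rate": 0.5,
    "disagreements": [
        {"category": "artifacts", "question": "code_released",
         "opus_applies": True, "opus_answer": False},
    ],
    "opus_checklist": {
        "artifacts": {
            "code_released": {"applies": True, "answer": False},
            "data_released": {"applies": True, "answer": True},
        },
    },
}

print(check_calibration(sample))
```

To run the same checks on the real file, load it with `json.load` and pass the resulting dict to `check_calibration`.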