calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (22003B)
{
  "paper_slug": "adapting-knowledge-prompt-2025",
  "total_questions": 50,
  "agreement_count": 46,
  "disagreement_count": 4,
  "agreement_rate": 0.92,
  "disagreements": [
    {
      "category": "evaluation_design",
      "question": "human_evaluation",
      "sonnet_applies": false,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "Human evaluation IS relevant for APR. Exact Match only catches patches identical to a reference fix; there can be multiple valid fixes. Human evaluation of generated patches would strengthen the claims about repair effectiveness. The paper itself acknowledges EM may not capture the full fix ratio (Construct Validity section). The criterion is applicable — the paper could reasonably have included human evaluation of patches — but chose not to. The schema says applies=false only when 'human evaluation is clearly irrelevant to the claims.' For APR, where correctness of patches matters and automated metrics are known to be incomplete, human evaluation is relevant.",
      "sonnet_justification": "Human evaluation is clearly not applicable here. The paper evaluates automated program repair using automated metrics (EM, SC, CodeBLEU), and human evaluation of generated patches is not necessary for the claims being made about relative performance.",
      "direction": "applies_boundary"
    },
    {
      "category": "claims_and_evidence",
      "question": "alternative_explanations_discussed",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "The Threats to Validity section (Section VI.B) discusses generic threats: randomness in training (Internal Validity), EM metric limitations (Construct Validity), and limited generalizability (External Validity). However, these are methodological limitations, not alternative explanations for the observed results. The paper does not discuss specific alternative explanations such as: whether prompt tuning's advantage is simply due to having more trainable parameters per effective sample, whether the improvement is driven by the prompt providing a stronger inductive bias rather than domain knowledge per se, or whether the 87.33% improvement figure is inflated by the choice of data scarcity threshold. The Code Refinement variable anonymization discussion is a dataset-specific observation, not an alternative explanation for the main findings. Per the schema: 'A threats-to-validity section counts only if it discusses specific alternative explanations for the observed results, not just generic methodological limitations.'",
      "sonnet_justification": "The Threats to Validity section (Section VI.B) discusses specific alternative explanations: randomness in training (Internal Validity), that EM may not capture the full fix ratio, and that results may not generalize to other models/datasets. The discussion also considers why Code Refinement shows lower improvement (variable anonymization hypothesis).",
      "direction": "sonnet_generous"
    },
    {
      "category": "setup_transparency",
      "question": "model_versions_specified",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "The paper specifies model names and parameter counts (CodeT5+ 220M, CodeT5+ 770M, GPT-Neo 1.3B) and states they are loaded from 'official versions available on Hugging Face.' However, it does not specify exact Hugging Face model identifiers (e.g., Salesforce/codet5p-220m) or commit hashes. Per the schema: 'Marketing names like Gemini-2.5 or GPT-4o without a snapshot date or API version do NOT count as specified versions.' CodeT5+ 220M is a model family name with size, not a precise version — model weights on Hugging Face can be updated without changing the model name. These are open-weight models with specific sizes, which makes them more identifiable than API models, but the schema explicitly requires version-level specificity.",
      "sonnet_justification": "Model sizes and versions are specified: 'CodeT5+ 220M (220 million parameters) and CodeT5+ 770M (770 million parameters)' and 'GPT-Neo 1.3B (1.3 billion parameters)'. The paper states models are loaded from official Hugging Face versions.",
      "direction": "sonnet_generous"
    },
    {
      "category": "setup_transparency",
      "question": "prompts_provided",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "Tables I and III show prompt TEMPLATES with placeholders ([X], [mask], [bugType], [repairAction], [repairPattern], [SOFT], etc.). Per the schema: 'A prompt TEMPLATE with placeholders (e.g., [Task Description]) does NOT count unless the actual fill values are also provided — the reader must be able to reconstruct every prompt sent to the model.' While [X] (buggy code) and [mask] (output slot) are structural model placeholders, the knowledge prompt templates use domain knowledge placeholders like [bugType], [repairAction], [repairPattern], [AST], [errorMessage], [tags] whose actual fill values are dataset-derived and not fully enumerated. The paper shows one example (Fig. 4) but does not provide the actual domain knowledge values used across all experimental instances. A reader cannot reconstruct every prompt sent to the model.",
      "sonnet_justification": "Full prompt templates are provided in Tables I and III, including all seven basic prompt templates (BP1-BP7) in both hard and soft variants, and example knowledge prompt templates for various domain knowledge types. The templates show exact structure with placeholder positions.",
      "direction": "sonnet_generous"
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": true,
        "justification": "The Conclusion states: 'Our source code and experimental data are publicly available at: https://github.com/Cxm211/k-prompt'. A working GitHub URL is provided."
      },
      "data_released": {
        "applies": true,
        "answer": true,
        "justification": "The paper uses six publicly available APR benchmark datasets (BugsInPy, Code Refinement, Defects4J, ManySStuBs4J-SStuBs, TFix, xCodeEval-APR-C). These are standard public benchmarks with citations and the GitHub repo claims to include experimental data."
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions OpenPrompt and PyTorch frameworks, and hardware (NVIDIA A40 48GB GPU and H100 80GB GPU), but does not provide a requirements.txt, Dockerfile, or specific library version numbers sufficient to recreate the environment."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "The paper provides no step-by-step reproduction instructions, no README walkthrough, and no scripts to replicate experiments. Source code is available on GitHub, but the paper text does not include reproduction guidance."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "All results (Tables VI, VII, VIII, IX, X) are reported as single point estimates with no confidence intervals, error bars, or plus-minus notation. Results are averaged over three runs but no uncertainty is reported."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "No statistical significance tests are reported anywhere in the paper. Claims of 'outperforms' and 'improvement' are based solely on comparing raw numbers without p-values, t-tests, or any statistical test."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper reports relative improvement percentages (e.g., '87.33% on average') but no standardized effect sizes (Cohen's d, etc.). While tables show both baseline and treatment numbers providing some context, the headline claims use relative percentages without standardized measures."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The paper selects 1% of large datasets and uses datasets with 260-1082 training instances but provides no justification for why these sample sizes are sufficient. No power analysis is presented."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper explicitly states it runs experiments three times with different seeds and averages results, but never reports standard deviations, confidence intervals, or any spread measure. Only mean values appear in all tables."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "Fine-tuning is the primary baseline, and a 'Naive Copy' baseline (copying buggy code as fix) is also included. Results are compared across all six datasets."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "The baseline is fine-tuning of the same pre-trained models (CodeT5+, GPT-Neo), which is the standard contemporary approach for APR. This is the appropriate and competitive baseline for evaluating prompt tuning."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "RQ2 compares hard vs. soft prompts and different prompt templates (BP1-BP7). RQ3 ablates individual types of domain knowledge (repair action, repair pattern, bug type, AST, error message, tags) and their combinations. These serve as ablation studies."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "Three evaluation metrics are used: Exact Match (EM), Syntactically Correct Patch (SC), and CodeBLEU. Section IV.D describes all three."
      },
      "human_evaluation": {
        "applies": true,
        "answer": false,
        "justification": "Human evaluation IS applicable for APR — the paper itself acknowledges that EM may not capture all valid fixes (Construct Validity). Human evaluation of generated patches would have strengthened claims about repair quality. No human evaluation was conducted; all evaluation is automated."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": true,
        "justification": "The paper explicitly splits data into training, validation, and test sets in an 8:1:1 ratio (Table IV). Results are reported on held-out test sets. RQ4 uses a fixed 500-instance test set."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Results are broken down across all six individual datasets, four programming languages, and three models. Tables VI-X provide detailed per-dataset, per-model breakdowns."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": true,
        "justification": "The paper discusses failure cases: Code Refinement's variable anonymization limiting performance, xCodeEval where noisy domain knowledge reduces effectiveness, and cases where combining multiple knowledge types hurts performance."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "Several negative results: combining multiple domain knowledge types does not lead to incremental gains (RQ3/Table IX), certain prompt templates consistently underperform for GPT-Neo, Code Refinement shows limited improvement, xCodeEval knowledge prompts sometimes reduce performance below baseline."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims '87.33% average improvement over fine-tuning in data scarcity scenarios' which is supported by Table VI results. The claim about comprehensive evaluation across three LLMs, six datasets, and four languages is confirmed by the experimental design."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "The paper makes causal claims (prompt tuning 'improves' performance). The ablation design is adequate — they control for model architecture, datasets, and evaluation protocol, varying only the tuning method in a controlled single-variable manner."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": true,
        "justification": "The paper explicitly acknowledges: 'Our experimental results may be applicable to specific datasets and models only' (Section VI.B). The conclusion states evaluation is limited to specific datasets, models, and prompt templates."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The Threats to Validity section discusses generic methodological limitations (randomness, EM limitations, generalizability) but does not discuss specific alternative explanations for why prompt tuning outperforms fine-tuning. Per the schema, a threats-to-validity section that discusses only generic methodological limitations does not count. The paper does not consider whether the improvement is due to additional inductive bias, effective parameter count differences, or other confounds."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper specifies 'CodeT5+ 220M', 'CodeT5+ 770M', and 'GPT-Neo 1.3B' with parameter counts. It states models are loaded from 'official versions available on Hugging Face' but does not provide exact Hugging Face model identifiers or commit hashes. Per the schema, marketing names without version/snapshot identifiers do not count as fully specified versions."
      },
      "prompts_provided": {
        "applies": true,
        "answer": false,
        "justification": "Tables I and III provide prompt TEMPLATES with placeholders ([X], [mask], [bugType], [repairAction], etc.). Per the schema, 'A prompt TEMPLATE with placeholders does NOT count unless the actual fill values are also provided.' The domain knowledge placeholders' actual values are not provided for all experiments. A reader cannot reconstruct every prompt sent to the model."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Table V provides a complete hyperparameter table: optimizer (AdamW), Adam epsilon (1e-8), learning rate (5e-5), LR scheduler (Linear), training epochs (10), temperature (1.0), top-p (0.9), beam number (5), repetition penalty (1.0)."
      },
      "scaffolding_described": {
        "applies": false,
        "answer": false,
        "justification": "No agentic scaffolding is used. This is standard prompt tuning and fine-tuning of LLMs without any agent framework, tool use, or complex scaffolding."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Section IV.C documents the preprocessing pipeline: single-hunk fix extraction, 8:1:1 train/val/test split, 1% sampling from large datasets using three seeds with averaging. Table IV shows final dataset statistics."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section VI.B 'Threats to Validity' contains three substantive subsections: Construct Validity, Internal Validity, and External Validity."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": true,
        "justification": "The threats are specific to this study: EM may not capture all valid fixes for THIS paper's APR evaluation, randomness in the 1% sampling of THIS paper's datasets, and results may only apply to the specific datasets and models tested."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": true,
        "justification": "The conclusion explicitly states scope: evaluation is limited to 'four datasets, three relatively small pre-trained code models, and limited prompt templates.' External Validity also acknowledges results may be dataset/model specific."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": true,
        "justification": "The benchmark datasets are publicly available (BugsInPy, Code Refinement, Defects4J, ManySStuBs4J, TFix, xCodeEval) and the GitHub repository claims to include experimental data. Raw data is independently verifiable."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Section IV.C describes dataset selection (six commonly used APR benchmarks), processing (single-hunk extraction, 8:1:1 split, 1% sampling), and Table IV provides statistics."
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. This is a purely computational study using pre-existing benchmark datasets. No participant recruitment involved."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "The data pipeline is documented: benchmark datasets to single-hunk extraction to 8:1:1 split to 1% sampling (for large datasets) using three seeds to averaging. Table IV shows final counts."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "The Acknowledgments section states: 'This research is supported by the Ministry of Education, Singapore under its Academic Research Fund Tier 3 (Award ID: MOET32020-0004).'"
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "Both authors are affiliated with Singapore Management University, clearly stated on the title page. No commercial affiliations are present."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": true,
        "justification": "The funder is the Ministry of Education, Singapore — a government academic funding body with no financial interest in prompt tuning vs. fine-tuning outcomes."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests statement or declaration of financial interests is present in the paper. Absence of disclosure is not the same as absence of conflict per the schema."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not state training data cutoff dates for CodeT5+ or GPT-Neo. These models were trained on web-scraped code data, and the cutoff dates are relevant for assessing benchmark contamination."
      },
      "train_test_overlap_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No discussion of potential overlap between the pre-training data of CodeT5+ or GPT-Neo and the APR benchmark test sets. Public benchmarks like Defects4J could plausibly be in training corpora."
      },
      "benchmark_contamination_addressed": {
        "applies": true,
        "answer": false,
        "justification": "Benchmarks like Defects4J (2014) and Code Refinement (2019) were published well before CodeT5+ and GPT-Neo training cutoffs, making contamination plausible. The paper does not address this risk."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. This is a computational study evaluating LLMs on APR benchmark datasets."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Purely computational study."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Purely computational study."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Purely computational study."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Purely computational study."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Purely computational study."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants. Purely computational study."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "No inference cost, latency, tokens consumed, or cost-per-example reported. Hardware is mentioned (A40 and H100 GPUs) but no cost or timing data."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "Hardware is specified (NVIDIA A40 48GB and H100 80GB) but total GPU hours, training time, or total compute budget are not reported despite extensive experimentation."
      }
    }
  }
}
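
The summary fields at the top of the record (`agreement_count`, `disagreement_count`, `agreement_rate`) can be cross-checked against the `disagreements` list. The following is a minimal sketch of such a check, not the survey's own tooling: the function name `summarize_calibration` and the `by_direction` tally are assumptions introduced for illustration, and only the `question` and `direction` values are copied from the record above.

```python
from collections import Counter


def summarize_calibration(total_questions, disagreements):
    """Recompute summary fields from a list of disagreement entries.

    Hypothetical helper: assumes every question not listed as a
    disagreement counts as an agreement.
    """
    disagreement_count = len(disagreements)
    agreement_count = total_questions - disagreement_count
    return {
        "agreement_count": agreement_count,
        "disagreement_count": disagreement_count,
        "agreement_rate": round(agreement_count / total_questions, 2),
        # Tally of the "direction" labels, e.g. how often one rater was more generous.
        "by_direction": dict(Counter(d["direction"] for d in disagreements)),
    }


# Question/direction pairs copied from the "disagreements" array in the record.
disagreements = [
    {"question": "human_evaluation", "direction": "applies_boundary"},
    {"question": "alternative_explanations_discussed", "direction": "sonnet_generous"},
    {"question": "model_versions_specified", "direction": "sonnet_generous"},
    {"question": "prompts_provided", "direction": "sonnet_generous"},
]

summary = summarize_calibration(50, disagreements)
print(summary)
```

With the values in this record, the recomputed counts (46 agreements, 4 disagreements, rate 0.92) match the stored summary fields.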