calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (19472B)
{
  "calibration": {
    "paper_slug": "agentic-programming-survey-2025",
    "sonnet_scan_date": "2026-02-28",
    "opus_calibration_date": "2026-02-28",
    "agreement_rate": 0.98,
    "total_questions": 50,
    "agreements": 49,
    "disagreements": 1,
    "disagreement_details": [
      {
        "category": "claims_and_evidence",
        "question": "generalization_bounded",
        "sonnet": {"applies": true, "answer": true},
        "opus": {"applies": true, "answer": false},
        "direction": "sonnet_generous",
        "explanation": "The paper states 'We focus primarily on LLM-driven agentic systems for software development' which is a topic statement, not a scope boundary on what the results do NOT show. The schema requires explicit statements about 'what was not tested, what populations/settings are excluded, what claims the authors are NOT making.' The paper provides no search date cutoff, does not state which adjacent areas were excluded, does not state whether non-English papers were excluded, and does not specify what databases were omitted. However, re-reading the schema description again: 'Are generalizations bounded to the tested setting?' For a survey, the 'tested setting' is the reviewed corpus. The paper does say it focuses on LLM-driven agentic systems for software development and notes Python bias. This is borderline. After careful reconsideration, the paper's scoping statement combined with specific qualifications like noting Python bias in benchmarks (Section 6.3) and the explicit focus statement in the Introduction may constitute adequate bounding for a survey. However, the title 'AI Agentic Programming: A Survey' is quite broad and does not bound to the 152-paper corpus or any time period. Opus answer: false, because the broad title and claims are not bounded to the specific corpus. Sonnet answer: true, crediting the focus statement and Python-bias qualification."
      }
    ],
    "opus_checklist": {
      "artifacts": {
        "code_released": {
          "applies": true,
          "answer": false,
          "justification": "No repository URL, code archive, or analysis scripts are provided anywhere in the paper. The survey could have released its search/screening tools, data extraction scripts, or classification code but did not."
        },
        "data_released": {
          "applies": true,
          "answer": false,
          "justification": "The corpus of 152 reviewed papers is not released as a structured dataset. No supplementary data files, appendix listing the full corpus, or data archive link is provided."
        },
        "environment_specified": {
          "applies": false,
          "answer": false,
          "justification": "This is a systematic literature review with no software experiments or computational pipeline requiring an environment specification."
        },
        "reproduction_instructions": {
          "applies": true,
          "answer": false,
          "justification": "No step-by-step instructions for reproducing the systematic review are provided. Section 3 describes the methodology at a high level but lacks sufficient detail (e.g., exact queries per database, access dates, disagreement resolution criteria) for independent replication."
        }
      },
      "statistical_methodology": {
        "confidence_intervals_or_error_bars": {
          "applies": false,
          "answer": false,
          "justification": "This is a qualitative survey paper with no primary statistical analysis. Confidence intervals are not applicable."
        },
        "significance_tests": {
          "applies": false,
          "answer": false,
          "justification": "No hypothesis tests are conducted. The paper is a literature survey with taxonomic classification and descriptive statistics only."
        },
        "effect_sizes_reported": {
          "applies": false,
          "answer": false,
          "justification": "No effect sizes are applicable. The paper does not conduct meta-analytic or comparative statistical analyses."
        },
        "sample_size_justified": {
          "applies": false,
          "answer": false,
          "justification": "No statistical sample requiring power analysis. The paper reviews papers using SLR methodology, not a statistical sampling design."
        },
        "variance_reported": {
          "applies": false,
          "answer": false,
          "justification": "No repeated experiments or measurements. Variance reporting is not applicable to this survey."
        }
      },
      "evaluation_design": {
        "baselines_included": {
          "applies": true,
          "answer": false,
          "justification": "The survey does not structurally compare itself against prior surveys on the topic (e.g., Liu et al. 2024 on LLM-based agents for SE, Hou et al. 2024 on LLMs for SE). While it mentions related surveys exist, no systematic comparison of coverage, methodology, or conclusions is provided."
        },
        "baselines_contemporary": {
          "applies": false,
          "answer": false,
          "justification": "No experimental baselines are used. This criterion applies to comparing against competitive baselines in experiments, which a survey does not have."
        },
        "ablation_study": {
          "applies": false,
          "answer": false,
          "justification": "No system or model is evaluated. Ablation studies are structurally inapplicable to a survey paper."
        },
        "multiple_metrics": {
          "applies": false,
          "answer": false,
          "justification": "The survey does not use evaluation metrics for its own analysis. It describes benchmarks and metrics used by others but does not apply any to its own work."
        },
        "human_evaluation": {
          "applies": false,
          "answer": false,
          "justification": "No system outputs are produced that require human evaluation. The survey produces a taxonomy and synthesis, which are not evaluated by human raters."
        },
        "held_out_test_set": {
          "applies": false,
          "answer": false,
          "justification": "No prediction or classification model is trained. Held-out test sets are structurally inapplicable."
        },
        "per_category_breakdown": {
          "applies": true,
          "answer": true,
          "justification": "The survey provides multiple per-category breakdowns: Table 5 compares systems across behavioral dimensions, Table 8 breaks down benchmarks by source/language/task/difficulty, Figure 6 shows temporal distribution of papers by year, and Figure 9 shows per-benchmark model performance."
        },
        "failure_cases_discussed": {
          "applies": true,
          "answer": true,
          "justification": "Section 5 (Challenges) discusses failure modes and limitations of existing agentic systems, including benchmark inadequacies (5.1), communication protocol deficiencies (5.2), domain-specific weaknesses (5.3), and safety failures (5.4). Figure 9 shows where models fail on software optimization."
        },
        "negative_results_reported": {
          "applies": true,
          "answer": true,
          "justification": "Section 5 and 6.3 explicitly report negative findings: all models perform poorly on software optimization (GSO), benchmarks are biased toward Python, existing frameworks lack multi-turn evaluation support, and there is no standard taxonomy or evaluation methodology."
        }
      },
      "claims_and_evidence": {
        "abstract_claims_supported": {
          "applies": true,
          "answer": true,
          "justification": "The abstract claims the paper provides 'a taxonomy of agent behaviors and system architectures' (delivered in Section 4), 'examining relevant techniques' (Section 2), 'challenges' (Section 5), and 'opportunities' (Section 6). All abstract claims are descriptive and appropriately supported."
        },
        "causal_claims_justified": {
          "applies": false,
          "answer": false,
          "justification": "The survey makes no causal empirical claims from its own evidence. It describes, categorizes, and synthesizes existing work without asserting causal relationships."
        },
        "generalization_bounded": {
          "applies": true,
          "answer": false,
          "justification": "The paper states 'We focus primarily on LLM-driven agentic systems for software development' but the title 'AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities' is quite broad. No search date cutoff is stated, no explicit boundaries on what was excluded (non-English papers, specific databases, adjacent domains), and no statement of what the survey's results do NOT show. The scoping is a topic statement rather than explicit scope boundaries per the schema's requirement."
        },
        "alternative_explanations_discussed": {
          "applies": false,
          "answer": false,
          "justification": "This is a pure survey/taxonomy paper with no empirical results of its own. The schema states 'NA only for papers that present no empirical results (e.g., pure surveys or taxonomies).'"
        }
      },
      "setup_transparency": {
        "model_versions_specified": {
          "applies": false,
          "answer": false,
          "justification": "The survey does not deploy or use any LLM. It reviews papers that use LLMs but does not run any models itself."
        },
        "prompts_provided": {
          "applies": false,
          "answer": false,
          "justification": "The survey does not use any prompting. It is a human-conducted literature review."
        },
        "hyperparameters_reported": {
          "applies": false,
          "answer": false,
          "justification": "No models or algorithms are run by the survey authors. Hyperparameter reporting is not applicable."
        },
        "scaffolding_described": {
          "applies": false,
          "answer": false,
          "justification": "No agentic scaffolding is used by the survey authors. This is a human-conducted literature review."
        },
        "data_preprocessing_documented": {
          "applies": true,
          "answer": false,
          "justification": "Section 3 describes the filtering pipeline with counts (7,700 -> 395 -> 141 -> 152) and lists inclusion/exclusion criteria (Section 3.2). However, per the schema: 'describing the filtering pipeline stages with counts is YES only if the actual filtering CRITERIA at each stage are also stated.' The criteria are listed globally but not differentiated per stage. The criteria for citation chaining additions (adding 11 papers from 141 to 152) are not specified. Disagreement resolution is described only as 'through discussion' without operationalized rules."
        }
      },
      "limitations_and_scope": {
        "limitations_section_present": {
          "applies": true,
          "answer": false,
          "justification": "There is no dedicated 'Limitations' or 'Threats to Validity' section about the survey itself. Section 5 (Challenges) discusses limitations of the field/technology, not the survey's methodology. Section 7 (Conclusion) does not include substantive limitations discussion."
        },
        "threats_to_validity_specific": {
          "applies": true,
          "answer": false,
          "justification": "No threats-to-validity discussion exists for the survey. There is no mention of potential biases in paper selection, publication bias, search string limitations, database coverage gaps, or limitations of the inclusion/exclusion criteria."
        },
        "scope_boundaries_stated": {
          "applies": true,
          "answer": false,
          "justification": "The paper says 'We focus primarily on LLM-driven agentic systems for software development' but does not explicitly state what it does NOT cover: no search date range, no list of excluded adjacent areas, no statement about non-English papers, and no explicit negative scope boundaries. The schema requires 'explicit statements about what was not tested... what claims the authors are NOT making.'"
        }
      },
      "data_integrity": {
        "raw_data_available": {
          "applies": true,
          "answer": false,
          "justification": "The list of 152 reviewed papers is not released as a structured dataset. No appendix or supplementary material lists the full corpus. The bibliography covers cited papers, which likely includes the corpus but is not identified as such."
        },
        "data_collection_described": {
          "applies": true,
          "answer": true,
          "justification": "Section 3 describes the search strategy: databases searched (Google Scholar, ACM DL, IEEE Xplore, SpringerLink, arXiv), the Boolean search string with three term clusters, and the three-stage selection process with counts at each stage."
        },
        "recruitment_methods_described": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. This is a literature review where the 'data' is published papers. The criterion about recruitment methods applies to human subjects research."
        },
        "data_pipeline_documented": {
          "applies": true,
          "answer": false,
          "justification": "Pipeline counts (7,700 -> 395 -> 141 -> 152) are given in Section 3.3, but: (1) criteria at each stage are not differentiated, (2) the increase from 141 to 152 via citation chaining lacks explanation of selection criteria for those 11 papers, (3) inter-rater reliability for the two-researcher screening is not reported, and (4) disagreement resolution is described only as 'through discussion.'"
        }
      },
      "conflicts_of_interest": {
        "funding_disclosed": {
          "applies": true,
          "answer": false,
          "justification": "No acknowledgments section or funding disclosure is present. There is no mention of grants, institutional support, or funding agencies anywhere in the paper."
        },
        "affiliations_disclosed": {
          "applies": true,
          "answer": true,
          "justification": "All five authors list their affiliation as University of Leeds (UK) on the title page with individual email addresses. Affiliations are clearly disclosed."
        },
        "funder_independent_of_outcome": {
          "applies": false,
          "answer": false,
          "justification": "No funding is disclosed, so funder independence cannot be assessed. The schema says 'NA if unfunded.' Without a funding disclosure, there is no funder to evaluate for independence."
        },
        "financial_interests_declared": {
          "applies": true,
          "answer": false,
          "justification": "No competing interests statement is present. There is no declaration that authors hold (or do not hold) patents, equity, or financial interests related to the tools and systems reviewed."
        }
      },
      "contamination": {
        "training_cutoff_stated": {
          "applies": false,
          "answer": false,
          "justification": "This is a survey paper. The authors do not train or evaluate any pre-trained models on benchmarks. Contamination is structurally inapplicable."
        },
        "train_test_overlap_discussed": {
          "applies": false,
          "answer": false,
          "justification": "Survey paper with no model evaluation. Train/test overlap is not applicable."
        },
        "benchmark_contamination_addressed": {
          "applies": false,
          "answer": false,
          "justification": "Survey paper with no model evaluation by the authors. Contamination is not applicable to the survey itself."
        }
      },
      "human_studies": {
        "pre_registered": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. This is a literature review of published papers."
        },
        "irb_or_ethics_approval": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. IRB approval is not applicable to a systematic literature review."
        },
        "demographics_reported": {
          "applies": false,
          "answer": false,
          "justification": "No human participants in this survey."
        },
        "inclusion_exclusion_criteria": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. The inclusion/exclusion criteria in Section 3.2 refer to papers, not human subjects."
        },
        "randomization_described": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. This is not an experimental study."
        },
        "blinding_described": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. Blinding is not applicable."
        },
        "attrition_reported": {
          "applies": false,
          "answer": false,
          "justification": "No human participants. Attrition is not applicable."
        }
      },
      "cost_and_practicality": {
        "inference_cost_reported": {
          "applies": false,
          "answer": false,
          "justification": "This is a survey paper. The authors do not run any LLM inference. The schema states 'NA if cost is clearly irrelevant (e.g., theoretical paper, survey paper).' The paper reports costs of systems it reviews (Table 6) but not the cost of the survey's own method."
        },
        "compute_budget_stated": {
          "applies": false,
          "answer": false,
          "justification": "This is a survey paper with no computational experiments. No compute budget is required or applicable."
        }
      }
    },
    "summary": "Very high agreement (98%, 49/50 questions) between Sonnet and Opus on this survey paper. Both raters agreed on the applies field for all 50 questions. The single disagreement is on generalization_bounded: Sonnet credited the paper's focus statement and Python-bias qualification as adequate scope bounding, while Opus found the broad title and absence of explicit negative scope boundaries (no search date range, no list of excluded areas) insufficient. This is a borderline interpretive disagreement. The paper is a well-structured systematic literature review that follows standard SLR methodology but lacks a limitations section about the survey itself, does not release its corpus data, and has no funding or competing interests disclosures.",
    "category_breakdown": {
      "artifacts": {"agreements": 4, "disagreements": 0},
      "statistical_methodology": {"agreements": 5, "disagreements": 0},
      "evaluation_design": {"agreements": 9, "disagreements": 0},
      "claims_and_evidence": {"agreements": 3, "disagreements": 1},
      "setup_transparency": {"agreements": 5, "disagreements": 0},
      "limitations_and_scope": {"agreements": 3, "disagreements": 0},
      "data_integrity": {"agreements": 4, "disagreements": 0},
      "conflicts_of_interest": {"agreements": 4, "disagreements": 0},
      "contamination": {"agreements": 3, "disagreements": 0},
      "human_studies": {"agreements": 7, "disagreements": 0},
      "cost_and_practicality": {"agreements": 2, "disagreements": 0}
    }
  }
}
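
The headline figures in this file (agreement_rate, total_questions, agreements, disagreements) can be cross-checked against category_breakdown. The following is a minimal sketch of such a check, not part of the repository: the function name check_calibration and the trimmed SAMPLE document are hypothetical, and only the field names and counts are taken from the file above.

```python
import json


def check_calibration(doc: dict) -> dict:
    """Recompute the headline agreement figures from category_breakdown.

    Hypothetical helper: assumes the field layout shown in calibration.json.
    """
    cal = doc["calibration"]
    breakdown = cal["category_breakdown"]
    agreements = sum(c["agreements"] for c in breakdown.values())
    disagreements = sum(c["disagreements"] for c in breakdown.values())
    total = agreements + disagreements
    rate = round(agreements / total, 2)
    return {
        "agreements": agreements,
        "disagreements": disagreements,
        "total_questions": total,
        "agreement_rate": rate,
        # True only if every stored headline field matches the recomputed one.
        "consistent": (
            agreements == cal["agreements"]
            and disagreements == cal["disagreements"]
            and total == cal["total_questions"]
            and rate == cal["agreement_rate"]
            and len(cal["disagreement_details"]) == disagreements
        ),
    }


# Trimmed copy of the file: headline fields and category_breakdown only.
# The disagreement_details entry is reduced to its two identifying keys.
SAMPLE = json.loads("""
{
  "calibration": {
    "agreement_rate": 0.98,
    "total_questions": 50,
    "agreements": 49,
    "disagreements": 1,
    "disagreement_details": [
      {"category": "claims_and_evidence", "question": "generalization_bounded"}
    ],
    "category_breakdown": {
      "artifacts": {"agreements": 4, "disagreements": 0},
      "statistical_methodology": {"agreements": 5, "disagreements": 0},
      "evaluation_design": {"agreements": 9, "disagreements": 0},
      "claims_and_evidence": {"agreements": 3, "disagreements": 1},
      "setup_transparency": {"agreements": 5, "disagreements": 0},
      "limitations_and_scope": {"agreements": 3, "disagreements": 0},
      "data_integrity": {"agreements": 4, "disagreements": 0},
      "conflicts_of_interest": {"agreements": 4, "disagreements": 0},
      "contamination": {"agreements": 3, "disagreements": 0},
      "human_studies": {"agreements": 7, "disagreements": 0},
      "cost_and_practicality": {"agreements": 2, "disagreements": 0}
    }
  }
}
""")

if __name__ == "__main__":
    print(check_calibration(SAMPLE))
```

To run the same check against the real file, replace SAMPLE with the result of json.load on an open handle to calibration.json. The per-category counts sum to 49 agreements and 1 disagreement, which matches the stored totals.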