calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (23833B)
{
  "paper_slug": "agentbased-evaluation-framework-2025",
  "calibrated_by": "opus",
  "calibration_date": "2026-02-28",
  "total_questions": 50,
  "agreement_count": 42,
  "disagreement_count": 8,
  "agreement_rate": 0.84,
  "disagreements": [
    {
      "category": "evaluation_design",
      "question": "human_evaluation",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits the human annotation of the benchmark as 'human evaluation,' but the schema specifies 'human ratings, manual inspection, user studies, expert review of the system's OUTPUTS. The humans must be evaluating what the system produced.' The human experts here scored the LLM-generated code to create ground truth labels for the benchmark -- they were NOT evaluating CodeVisionary's outputs. They evaluated LLM code generation quality, not the evaluation framework's outputs. Manual classification of the benchmark itself does not count per the schema description."
    },
    {
      "category": "evaluation_design",
      "question": "held_out_test_set",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet acknowledges the conflation of tuning and testing ('this does conflate tuning and testing') but still answers true. The schema says 'If unclear whether the reported numbers are on data used for any selection decisions, NO.' RQ3 uses the same 363-sample set for hyperparameter selection (number of judges, rounds), and the final reported results in Tables II and III are on the same data. This is clearly not a held-out test set."
    },
    {
      "category": "evaluation_design",
      "question": "failure_cases_discussed",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits the failure analysis of the ICE-Score baseline in Section I as failure case discussion for CodeVisionary. However, the schema asks 'Are failure cases shown or discussed? Look for: error analysis, qualitative examples of failures, discussion of where the approach breaks down. If the paper only shows successes, NO.' The paper discusses failures of the BASELINE (ICE-Score) to motivate CodeVisionary, not failures of CodeVisionary itself. There is no discussion of where CodeVisionary fails or breaks down. The case study in Figure 5 shows a success case."
    },
    {
      "category": "human_studies",
      "question": "pre_registered",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Sonnet treats the expert annotation study as having human participants. However, the annotators are providing ground truth labels for a benchmark -- they are expert raters doing professional work, not research participants. The schema says 'NA if no human participants. Mining public repositories or analyzing public data does NOT make participants.' The annotation work is analogous to professional labeling, not a human subjects study. The pre_registered criterion (and most human_studies criteria) are for studies OF human participants, not studies that USE human annotators as tools."
    },
    {
      "category": "human_studies",
      "question": "irb_or_ethics_approval",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Same reasoning as pre_registered. Expert annotators providing benchmark labels are not human research participants. IRB review is required for research ON human subjects, not for using human annotators as part of a data pipeline. This is a benchmark evaluation paper, not a human subjects study."
    },
    {
      "category": "human_studies",
      "question": "demographics_reported",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Same reasoning. The annotators are expert raters creating ground truth labels, not participants in a human study. Demographic reporting requirements apply to studies of human participants, not to studies that employ expert annotators for benchmark construction."
    },
    {
      "category": "human_studies",
      "question": "inclusion_exclusion_criteria",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Sonnet treats the expert annotators as human participants and marks applies=true (answer=false since no criteria stated). Opus marks applies=false because expert annotators creating benchmark labels are instruments for data collection, not human research participants. Inclusion/exclusion criteria requirements apply to participant recruitment in human subjects research, not professional annotator selection."
    },
    {
      "category": "human_studies",
      "question": "blinding_described",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Sonnet credits the paper's statement that 'All experts remain unaware of the identity of the LLM that generates each response' as blinding in a human subjects study. Opus marks applies=false because this is not a human subjects study -- the annotators are professional raters, not research participants. While blinding annotators to LLM identity is good annotation practice, the human_studies.blinding_described criterion is about blinding participants in experimental studies, which does not apply here."
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": true,
        "justification": "The abstract provides a GitHub URL: 'The resources of CodeVisionary are available at https://github.com/Eshe0922/CodeVisionary.' This is a working URL provided in the paper."
      },
      "data_released": {
        "applies": true,
        "answer": false,
        "justification": "The constructed benchmark of 363 samples with human annotations is not released. Only the framework code repository is referenced. The paper provides no download link for the benchmark data."
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions Docker containers for code execution and describes tools used, but does not provide a requirements.txt, Dockerfile for the framework itself, or a detailed environment setup section with library versions needed to run CodeVisionary."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "No step-by-step reproduction instructions are provided in the paper. The GitHub repository is referenced but the paper itself contains no 'Reproducing Results' section or specific commands to run experiments."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "All results in Tables II, III, and V are point estimates (Pearson, Spearman, Kendall-Tau coefficients) with no confidence intervals, error bars, or uncertainty quantification."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "The paper claims CodeVisionary outperforms baselines by comparing numeric coefficients but performs no statistical significance tests (no p-values, t-tests, bootstrap tests, etc.)."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": true,
        "justification": "The paper reports absolute improvements with context: 'outperforming the best baseline with average improvements of 0.217, 0.163, and 0.141 in Pearson, Spearman, and Kendall-Tau coefficients.' Both baseline and method values are provided in Table II, giving readers enough context to assess magnitude."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The benchmark uses 363 samples (121 tasks x 3 LLMs). No power analysis or justification for why this sample size is sufficient is provided."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "Section V-D states 'we perform multiple trials and take the average as the experimental results' but no standard deviation, IQR, or any spread measure is reported across these trials. Only averaged values appear in result tables."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "Three baselines are compared: VANILLA (direct LLM prompting), ICE-Score (EACL 2024), and CODEJUDGE (EMNLP 2024)."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "ICE-Score (EACL 2024) and CODEJUDGE (EMNLP 2024) are both recent contemporary baselines representing the state of the art in LLM-based code evaluation."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "Section IV-B (RQ2) presents ablation studies removing the RMCD stage (w/o RMCD), the FSAS stage (w/o FSAS), and individual information types (w/o RT, w/o UI/UX, w/o ST). Results are in Table III."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "Three correlation metrics are used: Pearson (rp), Spearman (rs), and Kendall-Tau (tau) coefficients."
      },
      "human_evaluation": {
        "applies": true,
        "answer": false,
        "justification": "The human expert annotations create ground truth labels for the benchmark. They evaluate the LLM-generated code quality, NOT CodeVisionary's outputs. Per the schema: 'The humans must be evaluating what the system produced -- manual classification of the benchmark or dataset itself does not count.' No humans evaluated CodeVisionary's evaluation outputs (scores, reports). Evaluation of the system is entirely automated via correlation metrics."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": false,
        "justification": "RQ3 explores hyperparameter choices (number of judges 2-5, rounds 2-5) on the same 363-sample test set that final results are reported on. The schema states 'If unclear whether the reported numbers are on data used for any selection decisions, NO.' The same dataset is used for both hyperparameter selection and evaluation."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Figure 4 provides per-coding-scenario breakdowns (5 categories) and per-programming-language breakdowns (5 languages), showing Spearman correlation for each."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper discusses failure cases of the BASELINES (ICE-Score) in Section I to motivate the work. However, it does not discuss where CodeVisionary itself fails or breaks down. The case study (Figure 5) shows a success case. The schema requires 'discussion of where the approach breaks down.'"
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section V-C tests CodeVisionary on CoNaLa and reports it does not outperform ICE-Score on rp (0.644 vs 0.655). This is an honest negative result. The ablation in Table III shows performance degradation when components are removed."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims 'average improvements of 0.217, 0.163, and 0.141 in Pearson, Spearman, and Kendall-Tau coefficients.' Table II confirms CodeVisionary achieves avg 0.301/0.272/0.241 vs best baseline 0.084/0.109/0.100. The improvements match."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": false,
        "justification": "The paper makes causal claims via ablation studies (e.g., 'The RMCD stage boosts rp, rs, and tau by 27.0%, 32.0%, and 32.4%'). However, the ablation design removes entire multi-component stages, not individual variables. The same dataset is used for hyperparameter selection (RQ3) and final evaluation, creating potential circularity."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "The paper claims to be 'the first agent-based evaluation framework for complex code generation' -- a broad framing. It tests on one benchmark (CodeArena hard tasks, 363 samples) and one simpler dataset (CoNaLa). The title and abstract are unbounded; the limitations section acknowledges 'our benchmark may not cover all coding scenarios and programming languages' but does not constrain the claims accordingly."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "Section V-D mentions only benchmark coverage and LLM randomness as threats. The paper does not discuss alternative explanations such as whether improvements stem from compute asymmetry (40 interactions vs. single-call baselines) rather than architectural design, or whether the multi-judge consensus simply averages out noise."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper uses 'GPT-4o', 'GPT-3.5-turbo', and 'Claude-3.5-Sonnet' throughout without specific API versions or snapshot dates. References point to product announcement blog posts. Per the schema, 'Marketing names like Gemini-2.5 or GPT-4o without a snapshot date or API version do NOT count.'"
      },
      "prompts_provided": {
        "applies": true,
        "answer": false,
        "justification": "The paper describes prompts in natural language and provides one example of clarity evaluation criteria. Full prompt text is not included. The paper says 'remaining aspects and criteria are available in our repository' but the actual prompts sent to GPT-4o for the RMCD and FSAS stages are not reproduced in the paper."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section III-D reports: temperature=0.2 for RMCD stage, temperature=0.7 for FSAS stage, max interactions=40, number of judges=3, max negotiation rounds=4."
      },
      "scaffolding_described": {
        "applies": true,
        "answer": true,
        "justification": "The paper extensively describes the agentic scaffolding: Docker container setup, 8 tool types (Dynamic Execution, Static Linter, Unit Tests, Screenshot, Interaction, Web Browsing, General Semantic, Bash Command), the thought/action/observation loop, Execute/Analyze state alternation, multi-judge negotiation protocol with formal definitions."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Section III-A documents the benchmark construction pipeline: filter CodeArena to 'hard' tasks, exclude platform-specific tasks (MATLAB, Verilog), generate responses using 3 LLMs, manual scoring by two experts with Kappa>80% and third-expert adjudication. Counts are provided (397 -> 121 tasks, 363 samples)."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section V-D is titled 'Threats and Limitations' and provides dedicated discussion."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": false,
        "justification": "Section V-D mentions only two threats: (1) benchmark may not cover all scenarios and (2) LLM randomness causes variation. These are generic disclaimers. No specific threats like compute asymmetry, hyperparameter tuning on test data, or GPT-4o evaluating GPT-4o-generated code are discussed."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": false,
        "justification": "The limitations section notes 'may not cover all coding scenarios' but does not explicitly state what the results do NOT show. It does not bound results to GPT-4o as evaluator, to 'hard' complexity tasks specifically, or state that results may not generalize to other evaluator models."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "The 363-sample benchmark with human annotations is not publicly available. Only the framework code repository is referenced. The human-annotated ground truth needed to reproduce correlation results is not accessible."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Section III-A describes data collection: filtering CodeArena to hard tasks, excluding platform-specific tasks, generating responses via 3 LLMs, human annotation with inter-rater reliability (Kappa>80%) and adjudication process."
      },
      "recruitment_methods_described": {
        "applies": true,
        "answer": false,
        "justification": "The paper states annotators have 'over five years of expertise in the relevant programming languages' but does not describe how these experts were recruited, selected, or compensated, or whether they could introduce bias."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "The pipeline from CodeArena (397 hard tasks) to final benchmark (121 tasks x 3 LLM responses = 363 samples) is documented with filtering criteria, response generation procedure, and annotation process. Counts are consistent."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "The Acknowledgment section discloses funding: NSFC grants (No. 62472126, 62276075), Guangdong Province Natural Science Foundation (No. 2023A1515011959), and Shenzhen-Hong Kong Jointly Funded Project (No. SGDX20230116091246007)."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "Author affiliations are listed: Xinchen Wang and Ruida Hu from HIT Shenzhen, Pengfei Gao and Chao Peng from ByteDance. A footnote notes 'Work done during an internship at ByteDance.'"
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": true,
        "justification": "Funders are Chinese government research foundations (NSFC, Guangdong Province, Shenzhen-Hong Kong) with no financial interest in CodeVisionary's performance. ByteDance is an affiliate of some authors but not listed as a funder."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "There is no competing interests statement or declaration regarding patents, equity, or financial interests. Absence of such a statement means this criterion is not met."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": true,
        "answer": false,
        "justification": "GPT-4o is used as the evaluator model but its training data cutoff is not stated. GPT-3.5-turbo, Claude-3.5-Sonnet, and GPT-4o are used for response generation without training cutoff dates. The paper evaluates model-generated code against human labels, so contamination of the evaluation benchmark by the models is relevant."
      },
      "train_test_overlap_discussed": {
        "applies": true,
        "answer": false,
        "justification": "CodeArena was published in 2024 (arXiv:2412.05210) and could have been in the training data of GPT-4o. The paper does not discuss whether the tasks or generated code could have appeared in model training data."
      },
      "benchmark_contamination_addressed": {
        "applies": true,
        "answer": false,
        "justification": "CodeArena tasks were collected from various sources and published in late 2024. GPT-4o's training cutoff is unknown from the paper. No contamination risk assessment is provided for either the benchmark tasks or the evaluation process."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "The expert annotators are providing ground truth labels for a benchmark -- they are professional raters, not research participants in a human subjects study. Pre-registration applies to studies OF human participants, not studies that use human annotators as instruments. This is a benchmark evaluation paper."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "Expert annotators creating benchmark labels are not human research participants. IRB review is for research ON human subjects, not for employing human annotators in a data pipeline. This is a benchmark evaluation paper."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "The annotators are expert raters creating ground truth labels, not participants in a human study. Demographic reporting requirements apply to studies of human participants."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "Same reasoning -- expert annotators are instruments for benchmark construction, not human research participants. Inclusion/exclusion criteria requirements are for participant recruitment in human subjects research."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human subjects study is conducted. The expert annotation is an observational scoring task, not an experimental study."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human subjects study is conducted. While the paper does note that experts 'remain unaware of the identity of the LLM,' this is good annotation practice, not blinding in a human subjects study context."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human subjects study is conducted. Attrition is not applicable to expert annotation of a benchmark."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "CodeVisionary uses GPT-4o with up to 40 interactions in the RMCD stage plus additional FSAS calls. Table IV reports average 6.12 actions per instance, but no API cost, token count, or monetary cost is reported."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "No total computational budget (GPU hours, API spend, wall-clock time) is stated for running the 363-sample evaluation. Given the multi-interaction approach with GPT-4o, this is a non-trivial cost that is not quantified."
      }
    }
  }
}
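
The header fields of this record are redundant with its "disagreements" list, so they can be cross-checked. What follows is a minimal sketch in Python, not part of the survey's tooling: the helper names check_calibration and direction_counts are hypothetical, and the record is inlined with each disagreement reduced to its "question" and "direction" fields so the example runs without reading the file.

```python
from collections import Counter

# Header values and (question, direction) pairs copied from calibration.json.
# Everything else in the record is omitted for brevity.
record = {
    "total_questions": 50,
    "agreement_count": 42,
    "disagreement_count": 8,
    "agreement_rate": 0.84,
    "disagreements": [
        {"question": "human_evaluation", "direction": "sonnet_generous"},
        {"question": "held_out_test_set", "direction": "sonnet_generous"},
        {"question": "failure_cases_discussed", "direction": "sonnet_generous"},
        {"question": "pre_registered", "direction": "applies_boundary"},
        {"question": "irb_or_ethics_approval", "direction": "applies_boundary"},
        {"question": "demographics_reported", "direction": "applies_boundary"},
        {"question": "inclusion_exclusion_criteria", "direction": "applies_boundary"},
        {"question": "blinding_described", "direction": "applies_boundary"},
    ],
}


def check_calibration(rec):
    """Return a list of inconsistencies between header fields and the list.

    Hypothetical helper: an empty list means the header is self-consistent.
    """
    problems = []
    total = rec["total_questions"]
    agree = rec["agreement_count"]
    disagree = rec["disagreement_count"]
    if agree + disagree != total:
        problems.append("agreement_count + disagreement_count != total_questions")
    if len(rec["disagreements"]) != disagree:
        problems.append("disagreement_count does not match list length")
    # Compare with a tolerance because agreement_rate is stored as a float.
    if abs(rec["agreement_rate"] - agree / total) > 1e-9:
        problems.append("agreement_rate != agreement_count / total_questions")
    return problems


def direction_counts(rec):
    """Count disagreements by their 'direction' label."""
    return Counter(d["direction"] for d in rec["disagreements"])


problems = check_calibration(record)
counts = direction_counts(record)
print(problems)
print(dict(counts))
```

Run against the values above, the check reports no inconsistencies, and the tally shows three "sonnet_generous" disagreements against five "applies_boundary" ones. To apply it to the real file, the inline dict could be replaced with json.load on calibration.json.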
	git clone https://git.shiptheloop.com/ai-research-survey.git