calibration.json (22649B)
{
  "paper_slug": "adaptevolve-improving-efficiency-2026",
  "total_questions": 50,
  "agreement_count": 45,
  "disagreement_count": 5,
  "agreement_rate": 0.90,
  "disagreements": [
    {
      "category": "artifacts",
      "question": "environment_specified",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "The schema requires 'requirements.txt, Dockerfile, conda environment file, or a detailed Environment Setup section listing library versions.' The paper lists hyperparameters in Table 5 and mentions hardware (8x AMD MI250 GPUs), but does not provide a requirements.txt, Dockerfile, or library version listing. Citing scikit-learn and River without version numbers is insufficient. Hyperparameters are not the same as environment/dependency specifications.",
      "sonnet_justification": "Table 5 (Appendix A.4) specifies exact hyperparameters including model names (Qwen3-32B, Qwen3-4B), temperature (0.6), top-p (0.95), max tokens (20000), and hardware (8x AMD Instinct MI250 GPUs). The River package (Montiel et al., 2021) and scikit-learn (Buitinck et al., 2013) are cited. This provides sufficient detail for environment reconstruction.",
      "direction": "sonnet_generous"
    },
    {
      "category": "statistical_methodology",
      "question": "effect_sizes_reported",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "The schema says 'A paper that says 12% improvement over baseline (from 45% to 57%) provides enough context for YES.' The paper reports exactly this pattern: 'retains 97.9% of the 32B upper-bound accuracy (73.6% vs 75.2%) while reducing compute cost by 34.4%', '2.4 points (73.6% vs 71.2%)', and '41.5% while maintaining 97.1% of peak accuracy.' These provide percentage improvements with explicit baseline context, which satisfies the schema criterion.",
      "sonnet_justification": "The paper reports raw differences (e.g., '2.4 points', '34.4% cost reduction') but no standardized effect sizes such as Cohen's d or odds ratios.",
      "direction": "opus_generous"
    },
    {
      "category": "claims_and_evidence",
      "question": "causal_claims_justified",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "The schema states 'ablation studies (removing component X reduces performance by Y%) ARE causal claims — check whether the ablation design is adequate (controlled single-variable manipulation counts as YES).' The paper does controlled single-variable comparisons: static tree vs. adaptive tree (Table 1), threshold vs. tree (Table 4), and various routing strategies vs. pure baselines. Each comparison changes one variable while holding the benchmark, models, and other settings constant. This is adequate controlled experimental design for the causal claims made. The lack of statistical power (single run) is captured by other checklist items (significance_tests, variance_reported).",
      "sonnet_justification": "The paper claims confidence-driven routing 'reduces total inference compute by 37.9% while retaining 97.5% accuracy' and that 'intrinsic confidence provides a robust signal for safe agentic acceleration.' These are causal claims about what the method achieves, but without multiple seeds or statistical tests, alternative explanations (lucky single run, benchmark-specific effects) are not ruled out.",
      "direction": "opus_generous"
    },
    {
      "category": "conflicts_of_interest",
      "question": "funding_disclosed",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": true,
      "opus_answer": true,
      "opus_justification": "The footnote on the first page states 'Work done during Internship at Advanced Micro Devices, Inc. (AMD).' This explicitly identifies AMD as the institution that supported the work. While there is no formal 'Acknowledgments' section with grant numbers, the funding source (AMD via internship) is clearly disclosed to the reader. The schema asks 'Is the funding source disclosed?' and it is.",
      "sonnet_justification": "The paper acknowledges '*Work done during Internship at Advanced Micro Devices, Inc. (AMD)' in a footnote, but there is no formal funding disclosure, acknowledgments section, or grant information provided.",
      "direction": "opus_generous"
    },
    {
      "category": "cost_and_practicality",
      "question": "compute_budget_stated",
      "sonnet_applies": true,
      "sonnet_answer": true,
      "opus_applies": true,
      "opus_answer": false,
      "opus_justification": "The schema asks 'Is the total computational budget stated? Look for: GPU hours, total API spend, hardware used, training time.' While Section 3.1 states the hardware (8x AMD MI250 GPUs) and Table 5 lists configuration parameters, the paper never states the TOTAL compute consumed — no GPU-hours, no wall-clock time for experiments, no total training/inference time. Listing hardware is necessary but not sufficient; the total budget (how long the hardware was used) is not quantified.",
      "sonnet_justification": "Section 3.1 states experiments were executed on 'a single node with 8x AMD Instinct MI250 GPUs.' Table 5 reports max iterations (8), evaluation timeout (1200s per evaluation), and parallel evaluations (32), enabling estimation of the compute budget.",
      "direction": "sonnet_generous"
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": true,
        "justification": "The abstract provides a GitHub URL: 'Our code is available at https://github.com/raypretam/adaptive_llm_selection'. This is a working repository link, satisfying the criterion."
      },
      "data_released": {
        "applies": true,
        "answer": true,
        "justification": "The paper uses two publicly available standard benchmarks: LiveCodeBench v5 (Jain et al., 2024) with 880 samples and MBPP (Austin et al., 2021) with 974 samples. Neither benchmark was modified by the authors."
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper lists hyperparameters in Table 5 and mentions hardware (8x AMD MI250 GPUs), but does not provide a requirements.txt, Dockerfile, conda environment file, or a detailed Environment Setup section with library versions. Citing scikit-learn and River without version numbers is insufficient to recreate the environment."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "No step-by-step reproduction instructions are provided in the paper. While code is released and hyperparameters are listed in Table 5, there is no 'Reproducing Results' section or README-like instructions within the paper itself."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "All results in Tables 1, 2, and 4 are point estimates only. No confidence intervals, error bars, or +/- notation appear anywhere in the paper."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "The paper claims AdaptEvolve 'strictly outperforms' and 'significantly surpasses' baselines but provides no p-values, t-tests, or any statistical significance tests to support these comparative claims."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": true,
        "justification": "Per the schema, 'A paper that says 12% improvement over baseline (from 45% to 57%) provides enough context for YES.' The paper reports: 'retains 97.9% of the 32B upper-bound accuracy (73.6% vs 75.2%) while reducing compute cost by 34.4%' and '41.5% cost reduction while maintaining 97.1% of peak accuracy (91.3/94.0).' These provide percentage improvements with explicit baseline context."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The warm-up set of N=50 is described as 'minimal' but no justification or power analysis explains why 50 samples is sufficient to bootstrap a reliable routing classifier. No justification for benchmark sample sizes either."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "Table 3 reports standard deviations for confidence metrics in calibration analysis, but the main experimental results (Tables 1, 2) are single-run point estimates with no variance across seeds or runs. The schema specifies: 'If the paper reports single-run numbers only, NO.'"
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "The paper compares against multiple baselines: Pure 4B (lower bound), Pure 32B (upper bound), Random Routing, Static Decision Tree, and the Cascading baseline from Chen et al. (2023). See Tables 1 and 2."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "The Cascading baseline (Chen et al., 2023) is recent and represents a standard approach. Pure model bounds and random routing are appropriate reference points. While RouteLLM (2024) is cited but not compared, the included baselines are reasonable for the evolutionary agent setting."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "Table 1 compares Static Decision Tree vs. Adaptive Hoeffding Tree on both benchmarks, isolating the contribution of online adaptation. Appendix A.3 (Table 4) compares the Decision Tree router against a threshold-based switch, testing the non-linear routing hypothesis."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "The paper reports three metrics across all configurations: Accuracy, Compute Cost (normalized units), and Efficiency (Accuracy/Cost), plus the Small:Large usage ratio. Tables 1 and 2 present all metrics."
      },
      "human_evaluation": {
        "applies": false,
        "answer": false,
        "justification": "The paper evaluates code generation using automated pass/fail test cases on benchmarks. Human evaluation is clearly irrelevant to claims about routing efficiency in an automated evolutionary system."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": true,
        "justification": "The main evaluation uses the full LiveCodeBench v5 (880 samples) and MBPP (974 samples) benchmarks, which are standard held-out test sets not seen during model training. The N=50 warm-up phase is part of the method's operation rather than hyperparameter tuning."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": false,
        "justification": "Results are reported only as single aggregate numbers per benchmark (Table 1). No per-difficulty, per-category, or per-task breakdown is provided for either LiveCodeBench or MBPP."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No failure cases or error analysis is presented. The paper does not discuss instances where the router made incorrect routing decisions, nor does it provide qualitative examples of failures."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": false,
        "justification": "Every experiment presented shows AdaptEvolve performing favorably. No configurations where the method performed worse than baselines or failed to improve are reported."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims '37.9% cost reduction' and '97.5% accuracy retention.' Table 1 and Section 3.6 support these: LiveCodeBench shows 34.4% cost reduction at 97.9% accuracy retention; MBPP shows 41.5% at 97.1%. The averages match the abstract claims."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "The paper makes causal claims through ablation comparisons. Per the schema, 'controlled single-variable manipulation counts as YES.' The paper compares routing strategies while holding benchmarks, models, and other settings constant (e.g., static tree vs. adaptive tree in Table 1; threshold vs. tree in Table 4). This controlled experimental design is adequate for the specific causal claims about routing component contributions."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "The conclusion claims 'intelligent resource allocation is a viable pathway for scalable agentic reasoning' — a broad generalization from two Qwen3 models on two Python coding benchmarks within one framework (OpenEvolve). The paper does not bound claims to the specific tested setting."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No alternative explanations for the observed results are discussed. The paper does not consider whether efficiency gains could stem from benchmark-specific properties, the Qwen3 model family characteristics, warm-up set selection effects, or other confounds."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": true,
        "justification": "Section 3.2 specifies 'Qwen3-4B' and 'Qwen3-32B' from the Qwen3 family with citation to the technical report (Yang et al., 2025). For open-weight models, the model name plus parameter count identifies specific model releases."
      },
      "prompts_provided": {
        "applies": true,
        "answer": false,
        "justification": "Section 3.1 mentions that 'previous high-scoring generations serve as few-shot exemplars' but no actual prompt text sent to the models is provided anywhere in the paper or appendix."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Table 5 (Appendix A.4) provides a comprehensive hyperparameters table covering LLM settings (temperature=0.6, top-p=0.95, max tokens=20000), evolutionary parameters (population size=8, archive size=3), and confidence calculation windows (2048)."
      },
      "scaffolding_described": {
        "applies": true,
        "answer": true,
        "justification": "The adaptive routing scaffold is described in detail: Section 2 explains the decision tree router, confidence metric computation, and routing logic. Figure 1 provides a workflow diagram. Table 6 shows the labeling logic. Appendix A.1 details confidence computation formulas."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": false,
        "justification": "No data preprocessing steps are documented. Section 3.3 states sample counts (880 for LiveCodeBench, 974 for MBPP) but provides no details on any filtering, preprocessing, or transformations applied to benchmark inputs before use in the evolutionary framework."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section 5 is titled 'Limitations' and substantively discusses that the method requires access to token-level log-probabilities, restricting applicability to open-weight models or APIs that expose logprobs."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": false,
        "justification": "The limitations section discusses only a deployment constraint (logprob access requirement), which is a scope limitation rather than a threat to the validity of the reported results. No specific threats to validity — such as single-run result reliability, benchmark contamination risk, generalizability to other model families, or warm-up sample selection effects — are discussed."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not explicitly state what the results do NOT show. There are no statements bounding the results to only Qwen3 models, only Python coding tasks, only OpenEvolve-style frameworks, or only the two tested benchmarks. The logprob limitation is about method applicability, not about what the results demonstrate."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "Only aggregated results appear in tables. No raw per-sample predictions, routing decisions, confidence scores, or intermediate evolutionary outputs are released or available for independent verification."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Section 3.3 describes the benchmarks used (LiveCodeBench v5 with 880 samples, MBPP with 974 samples). Section 3.5 describes the warm-up data collection (N=50) with the labeling logic detailed in Table 6."
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved. Data comes from standard public benchmarks (LiveCodeBench, MBPP), so participant recruitment does not apply."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": false,
        "justification": "The confidence computation pipeline is described (Appendix A.1), but the full data pipeline from benchmark input to final evaluation output is incompletely documented. Details on how evolutionary iterations proceed, how outputs are aggregated, and how final accuracy numbers are computed from the evolutionary population are not fully specified."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "The footnote on the first page states 'Work done during Internship at Advanced Micro Devices, Inc. (AMD)', clearly identifying AMD as the institution that supported and funded the work."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "Author affiliations are listed on the title page: Pretam Ray (IIT Kharagpur) and Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum (AMD). The footnote explicitly states the work was performed during an AMD internship."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "AMD funded the work via internship, three of four authors are AMD employees, and experiments were conducted on AMD MI250 GPUs. AMD has commercial interest in demonstrating efficient inference on its hardware. The funder is not independent of the outcome."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests or financial interests statement appears anywhere in the paper. AMD affiliations are disclosed but no formal declaration of conflicts of interest is made."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper uses Qwen3-4B and Qwen3-32B on code benchmarks but does not state the training data cutoff date for either model. The Qwen3 technical report is cited but the cutoff date is not mentioned."
      },
      "train_test_overlap_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No discussion of potential train/test overlap. MBPP (2021) is particularly at risk of being in Qwen3 training data given the multi-year gap, but this is not addressed."
      },
      "benchmark_contamination_addressed": {
        "applies": true,
        "answer": false,
        "justification": "MBPP (Austin et al., 2021) predates Qwen3 by several years and is widely used in LLM training. LiveCodeBench is designed to be contamination-free but this property is not discussed in the paper. Neither benchmark's contamination risk is addressed."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this benchmark evaluation study."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants are involved in this study."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section 3.6 and Tables 1-2 report inference cost in normalized compute units (one 32B call = 1.0, one 4B call = 0.125), with total costs per configuration reported across both benchmarks."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "Section 3.1 states the hardware (8x AMD MI250 GPUs) and Table 5 lists configuration parameters, but the total compute consumed is never stated — no GPU-hours, wall-clock time, or total experiment duration. Listing available hardware is not the same as quantifying the total computational budget used."
      }
    }
  }
}