calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (22812B)
      1 {
      2   "paper_slug": "agentbench-evaluating-llms-2023",
      3   "calibrated_by": "opus",
      4   "calibration_date": "2026-02-28",
      5   "total_questions": 50,
      6   "agreement_count": 46,
      7   "disagreement_count": 4,
      8   "agreement_rate": 0.92,
      9   "disagreements": [
     10     {
     11       "category": "artifacts",
     12       "question": "environment_specified",
     13       "sonnet": {"applies": true, "answer": true},
     14       "opus": {"applies": true, "answer": false},
     15       "direction": "sonnet_generous",
     16       "explanation": "Sonnet credits Docker mentions (Ubuntu Docker for OS, Ubuntu 20.04 for DCG) as sufficient environment specification. However, the schema requires 'requirements.txt, Dockerfile, conda environment file, or a detailed Environment Setup section listing library versions' — mentioning that tasks run in Ubuntu Docker is not the same as providing dependency specifications. The paper does not include a requirements.txt, Dockerfile definition, or library version listings. Mentioning Docker images by name without specifying their contents or dependencies does not meet the schema standard."
     17     },
     18     {
     19       "category": "evaluation_design",
     20       "question": "ablation_study",
     21       "sonnet": {"applies": true, "answer": false},
     22       "opus": {"applies": false, "answer": false},
     23       "direction": "applies_boundary",
     24       "explanation": "AgentBench is a benchmark, not a proposed system with modular components. The benchmark itself does not have separable components that could be ablated — it consists of 8 distinct tasks, none of which is a 'component' of the others. The paper does not propose a novel method with removable parts. Comparing different models or model families (CodeLlama vs Llama-2) is model evaluation, not ablation of the benchmark. The schema states applies=false when 'the system has only one component.' AgentBench as a benchmark artifact is a single evaluation instrument, not a multi-component system."
     25     },
     26     {
     27       "category": "evaluation_design",
     28       "question": "human_evaluation",
     29       "sonnet": {"applies": false, "answer": false},
     30       "opus": {"applies": true, "answer": false},
     31       "direction": "applies_boundary",
     32       "explanation": "The schema says applies=false only when 'Human evaluation is clearly irrelevant to the claims.' For a benchmark evaluating LLM agent capabilities across 8 diverse tasks including lateral thinking puzzles, web browsing, and game playing, human evaluation of LLM outputs would be relevant — for instance, evaluating whether LLM reasoning traces are sensible, whether partial solutions have value, or whether the automated metrics capture the full picture. The paper itself acknowledges (Appendix F.2) that automatic evaluation sometimes differs from human evaluation for LTP, suggesting human evaluation IS relevant. The paper chose not to include it (answer=false), but it applies."
     33     },
     34     {
     35       "category": "setup_transparency",
     36       "question": "model_versions_specified",
     37       "sonnet": {"applies": true, "answer": true},
     38       "opus": {"applies": true, "answer": false},
     39       "direction": "sonnet_generous",
     40       "explanation": "While Table 1 lists version numbers for some models (gpt-4: 0613, claude-instant: v1.1, vicuna-13b: v1.5), several models lack version/snapshot information: claude-2 has VER '-', chat-bison-001 has VER '-', glm-4 has VER '-'. The schema states: 'Marketing names like Gemini-2.5 or GPT-4o without a snapshot date or API version do NOT count as specified versions.' claude-2 and glm-4 without version IDs are marketing names without snapshots. Since multiple evaluated models lack version specification, the paper does not consistently specify exact model versions."
     41     }
     42   ],
     43   "opus_checklist": {
     44     "artifacts": {
     45       "code_released": {
     46         "applies": true,
     47         "answer": true,
     48         "justification": "The abstract explicitly states: 'Datasets, environments, and an integrated evaluation package for AGENTBENCH are released at https://github.com/THUDM/AgentBench.' A working repository URL is provided."
     49       },
     50       "data_released": {
     51         "applies": true,
     52         "answer": true,
     53         "justification": "Section 4.1 states 'All datasets are publicly available' and datasets are part of the GitHub release. Multiple tasks use publicly available benchmarks (ALFWorld, WebShop, Mind2Web) and newly created datasets are also released."
     54       },
     55       "environment_specified": {
     56         "applies": true,
     57         "answer": false,
     58         "justification": "The paper mentions Docker images (Ubuntu Docker for OS) and the Server-Client architecture, but does not provide a requirements.txt, Dockerfile definition, conda environment, or detailed library version listings. Mentioning that tasks run inside Docker is not the same as specifying the environment dependencies. The schema requires 'enough detail to recreate the environment.'"
     59       },
     60       "reproduction_instructions": {
     61         "applies": true,
     62         "answer": true,
     63         "justification": "The GitHub repository is provided, and the paper describes the Server-Client toolkit architecture (Appendix A) and per-task evaluation procedures (Appendices B-I), and states that researchers need to 'set up a model server accessible via the HTTP protocol.' Combined with the Docker images, this provides reproduction instructions."
     64       }
     65     },
     66     "statistical_methodology": {
     67       "confidence_intervals_or_error_bars": {
     68         "applies": true,
     69         "answer": false,
     70         "justification": "All results in Table 3 are single point estimates. No confidence intervals, error bars, or uncertainty quantification is provided for any benchmark score across any of the 29 models."
     71       },
     72       "significance_tests": {
     73         "applies": true,
     74         "answer": false,
     75         "justification": "The paper makes comparative claims (e.g., 'gpt-4 presents the best performance on 6 out of 8 datasets,' 'significant disparity in performance') without any statistical significance tests. No p-values or other tests are reported."
     76       },
     77       "effect_sizes_reported": {
     78         "applies": true,
     79         "answer": false,
     80         "justification": "While raw scores are reported (4.01 for gpt-4 vs 0.51 average for OSS), no formal effect size measures (Cohen's d, etc.) are computed. The percentage context (e.g., '78% success rate on House Holding') provides some relative information but does not constitute formal effect size reporting."
     81       },
     82       "sample_size_justified": {
     83         "applies": true,
     84         "answer": false,
     85         "justification": "The paper mentions setting Test size to 1,014 to roughly match MMLU call counts, which is an efficiency argument, not a statistical power justification. No power analysis or formal sample size justification is provided."
     86       },
     87       "variance_reported": {
     88         "applies": true,
     89         "answer": false,
     90         "justification": "Temperature=0 (greedy decoding) was used for all evaluations, meaning single deterministic runs only. No variance, standard deviation, or multi-run results are reported. Even with greedy decoding, API-based models can vary across calls, but this is not addressed."
     91       }
     92     },
     93     "evaluation_design": {
     94       "baselines_included": {
     95         "applies": true,
     96         "answer": true,
     97         "justification": "29 LLMs are compared against each other, and for the Digital Card Game task, two naive strategies (random and greedy) are included as explicit baselines (Section E.1). The multi-model comparison itself serves as a baseline framework."
     98       },
     99       "baselines_contemporary": {
    100         "applies": true,
    101         "answer": true,
    102         "justification": "The evaluation includes the most current models available at time of publication: gpt-4 (0613), claude-2, llama-2 (70b), codellama-34b. These were state-of-the-art at the time. claude-3 (opus) was added later, further updating the baselines."
    103       },
    104       "ablation_study": {
    105         "applies": false,
    106         "answer": false,
    107         "justification": "AgentBench is a benchmark, not a proposed system with removable components. The paper evaluates models on fixed tasks; there are no system components to ablate. Comparing model families (CodeLlama vs Llama-2) is observational model comparison, not ablation."
    108       },
    109       "multiple_metrics": {
    110         "applies": true,
    111         "answer": true,
    112         "justification": "Multiple task-specific metrics are used: Success Rate (OS, DB, HH), F1 (KG), Reward (DCG, WS), Game Progress (LTP), Step SR (WB). The paper also reports outcome categories (Completed, IF, IA, TLE, CLE) as additional evaluation dimensions."
    113       },
    114       "human_evaluation": {
    115         "applies": true,
    116         "answer": false,
    117         "justification": "The paper uses entirely automated evaluation. Appendix F.2 acknowledges that for LTP, automatic evaluation sometimes differs from human evaluation, suggesting human evaluation would be relevant. No human evaluation of LLM agent outputs was conducted despite its relevance to claims about reasoning and decision-making quality."
    118       },
    119       "held_out_test_set": {
    120         "applies": true,
    121         "answer": true,
    122         "justification": "Table 2 shows explicit Dev/Test splits for all 8 tasks (e.g., OS: 26 dev / 144 test, DB: 60 dev / 300 test). Table 3 is titled 'Test set (standard) results,' confirming main results are on held-out test data."
    123       },
    124       "per_category_breakdown": {
    125         "applies": true,
    126         "answer": true,
    127         "justification": "Table 3 provides per-environment scores for all 29 LLMs across 8 environments. Table 4 provides per-task execution outcome breakdowns. Appendix J provides additional per-model validity analysis."
    128       },
    129       "failure_cases_discussed": {
    130         "applies": true,
    131         "answer": true,
    132         "justification": "Appendix J.2 provides detailed failure case analysis: gpt-3.5-turbo repetition loops (J.2.2), gpt-4 invalid format errors (J.2.1), repetition as primary TLE cause (J.2.4), and qualitative examples with model outputs."
    133       },
    134       "negative_results_reported": {
    135         "applies": true,
    136         "answer": true,
    137         "justification": "Section 4.3 explicitly reports that code training has 'ambivalent impacts' — improving Web Shopping but harming Digital Card Game performance. The unexpected similar performance of llama-2-13b and llama-2-70b is also reported as a negative/surprising finding."
    138       }
    139     },
    140     "claims_and_evidence": {
    141       "abstract_claims_supported": {
    142         "applies": true,
    143         "answer": true,
    144         "justification": "Abstract claims about significant performance disparity (supported by Table 3 averages: 2.32 vs 0.51), poor reasoning/decision-making (supported by Table 4 TLE dominance), and ambivalent code training impact (supported by Section 4.3 analysis) are all backed by results in the paper."
    145       },
    146       "causal_claims_justified": {
    147         "applies": true,
    148         "answer": false,
    149         "justification": "The paper makes causal claims: 'training on code present ambivalent impacts' and 'high-quality alignment training...could also help improve LLM agents.' These are based on comparing model families (CodeLlama vs Llama-2, vicuna vs llama-2) that differ on multiple dimensions beyond the single factor of interest. No controlled experiments isolate the causal factor."
    150       },
    151       "generalization_bounded": {
    152         "applies": true,
    153         "answer": false,
    154         "justification": "The conclusion states AGENTBENCH will 'serve as a cornerstone for subsequent LLM agent research.' The title 'Evaluating LLMs as Agents' is broad. Section 4.3 generalizes about code training's impact on agents broadly. Claims are not bounded to the specific 8 English-language, text-only tasks and the tested model set."
    155       },
    156       "alternative_explanations_discussed": {
    157         "applies": true,
    158         "answer": false,
    159         "justification": "The paper does not discuss alternative explanations for observed patterns. The OSS vs commercial gap could reflect data quantity, RLHF quality, parameter count, or instruction tuning differences — these are not systematically considered. The only hint is the speculation that llama-2-70b received insufficient pre-training, which is offered as the authors' interpretation without considering alternatives."
    160       }
    161     },
    162     "setup_transparency": {
    163       "model_versions_specified": {
    164         "applies": true,
    165         "answer": false,
    166         "justification": "Table 1 lists versions for some models (gpt-4: 0613, gpt-3.5-turbo: 0613, claude-instant: v1.1, vicuna-13b: v1.5) but several key models lack version/snapshot information: claude-2 has '-', glm-4 has '-', chat-bison-001 has '-'. Per the schema, marketing names without snapshot dates do not count as specified versions. The evaluation is inconsistent in version specification."
    167       },
    168       "prompts_provided": {
    169         "applies": true,
    170         "answer": true,
    171         "justification": "Full prompt texts are provided in Appendices B through I for all 8 environments, including complete system instructions, CoT examples, and interaction formats. For example, Appendix B.3 shows the full OS instruction prompt, Appendix D.2 shows the KG prompt with API descriptions."
    172       },
    173       "hyperparameters_reported": {
    174         "applies": true,
    175         "answer": true,
    176         "justification": "Section 4.1 states 'we set temperature=0 (i.e., greedy decoding) in the inference on all tasks.' The context window management strategy is also described (minimum r such that token count <= 3500). These are the key inference hyperparameters."
    177       },
    178       "scaffolding_described": {
    179         "applies": true,
    180         "answer": true,
    181         "justification": "The evaluation scaffolding is described: CoT prompting strategy (Section 2), Server-Client framework with max-flow algorithm (Appendix A), per-round interaction management with context window handling, and task-specific interaction loops in Appendices B-I."
    182       },
    183       "data_preprocessing_documented": {
    184         "applies": true,
    185         "answer": true,
    186         "justification": "Each task appendix documents data construction: OS data from Stack Overflow filtering + gpt-4 generation with unit test filtering (Appendix B); DB data augmentation with gpt-3.5-turbo + validity filtering (Appendix C); KG data sourced from GrailQA/ComplexWebQuestions with 5+ tool invocation criteria (Appendix D). A bias study for augmented data is included (Appendix C.4)."
    187       }
    188     },
    189     "limitations_and_scope": {
    190       "limitations_section_present": {
    191         "applies": true,
    192         "answer": false,
    193         "justification": "There is no dedicated limitations or threats-to-validity section. The paper goes directly from Related Work (Section 5) to Conclusion (Section 6) without discussing limitations. The conclusion mentions future directions but does not address what the benchmark does NOT show."
    194       },
    195       "threats_to_validity_specific": {
    196         "applies": true,
    197         "answer": false,
    198         "justification": "No specific threats to validity are discussed anywhere in the paper. Issues such as greedy-only decoding, single-run evaluation, English-only text-only tasks, potential benchmark contamination, and the representativeness of the 8 chosen tasks are not addressed."
    199       },
    200       "scope_boundaries_stated": {
    201         "applies": true,
    202         "answer": false,
    203         "justification": "The paper does not explicitly state scope boundaries. It does not clarify that results apply only to text-only LLMs, only to English tasks, only to the specific 8 task types, or what agent capabilities are outside AgentBench's coverage. The broad title and conclusion ('cornerstone for subsequent LLM agent research') suggest unbounded generalization."
    204       }
    205     },
    206     "data_integrity": {
    207       "raw_data_available": {
    208         "applies": true,
    209         "answer": true,
    210         "justification": "All datasets, environments, and evaluation code are released at the GitHub repository (https://github.com/THUDM/AgentBench). This enables independent verification of results by running the same evaluations."
    211       },
    212       "data_collection_described": {
    213         "applies": true,
    214         "answer": true,
    215         "justification": "Each task appendix (B-I) provides detailed data collection descriptions: OS from Stack Overflow + gpt-4 generation with unit tests; DB from existing sources + gpt-3.5-turbo augmentation; KG from GrailQA/ComplexWebQuestions/GraphQuestions with filtering criteria; LTP from web-based puzzles with manual simplification."
    216       },
    217       "recruitment_methods_described": {
    218         "applies": false,
    219         "answer": false,
    220         "justification": "This is a benchmark evaluation paper with no human participants. While annotators are mentioned for OS task construction (8 annotators from Stack Overflow), these are dataset creators not study participants. The NA criterion applies as the data source is benchmark construction, not human subject research."
    221       },
    222       "data_pipeline_documented": {
    223         "applies": true,
    224         "answer": true,
    225         "justification": "The data pipeline from collection to final evaluation is documented per task: DB shows source datasets -> gpt-3.5-turbo augmentation -> validity filtering -> 300 final entries categorized by type. OS shows Stack Overflow collection -> annotator filtering -> gpt-4 generation -> unit test filtering. Checking pipelines are also described."
    226       }
    227     },
    228     "conflicts_of_interest": {
    229       "funding_disclosed": {
    230         "applies": true,
    231         "answer": true,
    232         "justification": "The acknowledgment section lists NSFC grants (62276148, 61825602), Ministry of Science and Technology of China grant (2022ZD0118600), Tsinghua University program, New Cornerstone Science Foundation, and Zhipu AI covering 'all GPU and API cost consumed in this study' plus a 'research fund from Zhipu AI.'"
    233       },
    234       "affiliations_disclosed": {
    235         "applies": true,
    236         "answer": true,
    237         "justification": "Author affiliations are listed: Tsinghua University (majority), Ohio State University, and UC Berkeley. The Tsinghua/Zhipu AI connection is implicit through shared authors (Zeng et al., 2022; Du et al., 2022 for both chatglm-6b and glm-4) and the funding acknowledgment."
    238       },
    239       "funder_independent_of_outcome": {
    240         "applies": true,
    241         "answer": false,
    242         "justification": "Zhipu AI covered 'all GPU and API cost' and provided a research fund. Zhipu AI's glm-4 is evaluated in the benchmark and ranked 3rd overall (score 2.89), ahead of claude-2 and gpt-3.5-turbo. Zhipu AI has a direct financial interest in glm-4 performing well, making the funder non-independent of the outcome."
    243       },
    244       "financial_interests_declared": {
    245         "applies": true,
    246         "answer": false,
    247         "justification": "There is no competing interests statement or financial interests declaration. The combination of Zhipu AI funding/affiliation and the evaluation of glm-4 is not explicitly flagged as a potential conflict."
    248       }
    249     },
    250     "contamination": {
    251       "training_cutoff_stated": {
    252         "applies": true,
    253         "answer": false,
    254         "justification": "No training data cutoff dates are stated for any of the 29 evaluated models. Models are listed with version identifiers but no training data temporal bounds, making contamination assessment impossible."
    255       },
    256       "train_test_overlap_discussed": {
    257         "applies": true,
    258         "answer": false,
    259         "justification": "No systematic train/test overlap analysis is performed. The DB appendix notes data augmentation was done partly to 'avoid leakage from the dataset' (Section C.1), but no contamination analysis is conducted for any of the 8 tasks, especially the adapted ones."
    260       },
    261       "benchmark_contamination_addressed": {
    262         "applies": true,
    263         "answer": false,
    264         "justification": "WebShop (2022) and ALFWorld (2020) are publicly available benchmarks adapted for AgentBench. These were available well before most models' training data was collected. The paper does not discuss whether these benchmark solutions could be in the training data. The KG datasets (GrailQA, ComplexWebQuestions) are also pre-existing public datasets."
    265       }
    266     },
    267     "human_studies": {
    268       "pre_registered": {
    269         "applies": false,
    270         "answer": false,
    271         "justification": "This is a benchmark evaluation paper with no human participants. Pre-registration is not applicable."
    272       },
    273       "irb_or_ethics_approval": {
    274         "applies": false,
    275         "answer": false,
    276         "justification": "This is a benchmark evaluation paper with no human participants. IRB approval is not applicable."
    277       },
    278       "demographics_reported": {
    279         "applies": false,
    280         "answer": false,
    281         "justification": "This is a benchmark evaluation paper with no human participants. Demographics are not applicable."
    282       },
    283       "inclusion_exclusion_criteria": {
    284         "applies": false,
    285         "answer": false,
    286         "justification": "This is a benchmark evaluation paper with no human participants. Inclusion/exclusion criteria for participants are not applicable."
    287       },
    288       "randomization_described": {
    289         "applies": false,
    290         "answer": false,
    291         "justification": "This is a benchmark evaluation paper with no human participants. Randomization of participant assignment is not applicable."
    292       },
    293       "blinding_described": {
    294         "applies": false,
    295         "answer": false,
    296         "justification": "This is a benchmark evaluation paper with no human participants. Blinding is not applicable."
    297       },
    298       "attrition_reported": {
    299         "applies": false,
    300         "answer": false,
    301         "justification": "This is a benchmark evaluation paper with no human participants. Attrition reporting is not applicable."
    302       }
    303     },
    304     "cost_and_practicality": {
    305       "inference_cost_reported": {
    306         "applies": true,
    307         "answer": false,
    308         "justification": "The paper does not report API costs, inference latency, or per-example cost for running the benchmark. The acknowledgment notes Zhipu AI covered 'all GPU and API cost' but does not quantify these costs. Appendix K analyzes token/round distributions for completed trajectories but does not translate these into cost estimates."
    309       },
    310       "compute_budget_stated": {
    311         "applies": true,
    312         "answer": false,
    313         "justification": "No total computational budget is stated. Running 29 models across 8 environments on 1,014+ test examples each represents substantial compute (estimated ~11k inference calls per model), but GPU hours, total API spend, and hardware specifications are not quantified."
    314       }
    315     }
    316   }
    317 }
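
The header counts and the `direction` labels in the file above can be cross-checked mechanically. The following is a minimal sketch, not part of the survey tooling: the header fields and the four disagreement entries are copied from the file (explanations omitted), while `infer_direction`, `check_calibration`, and the `"other"` label are hypothetical names introduced here. The labelling rule is an assumption inferred from the four entries, not a documented schema.

```python
import math

# Header fields and disagreement entries copied from calibration.json
# (the "category" and "explanation" fields are omitted for brevity).
calibration = {
    "total_questions": 50,
    "agreement_count": 46,
    "disagreement_count": 4,
    "agreement_rate": 0.92,
    "disagreements": [
        {"question": "environment_specified",
         "sonnet": {"applies": True, "answer": True},
         "opus": {"applies": True, "answer": False},
         "direction": "sonnet_generous"},
        {"question": "ablation_study",
         "sonnet": {"applies": True, "answer": False},
         "opus": {"applies": False, "answer": False},
         "direction": "applies_boundary"},
        {"question": "human_evaluation",
         "sonnet": {"applies": False, "answer": False},
         "opus": {"applies": True, "answer": False},
         "direction": "applies_boundary"},
        {"question": "model_versions_specified",
         "sonnet": {"applies": True, "answer": True},
         "opus": {"applies": True, "answer": False},
         "direction": "sonnet_generous"},
    ],
}


def infer_direction(sonnet, opus):
    """Hypothetical rule inferred from the four entries above (an assumption)."""
    if sonnet["applies"] != opus["applies"]:
        return "applies_boundary"
    if sonnet["answer"] and not opus["answer"]:
        return "sonnet_generous"
    return "other"  # placeholder label; no such case appears in this file


def check_calibration(cal):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    total = cal["total_questions"]
    if cal["agreement_count"] + cal["disagreement_count"] != total:
        problems.append("counts do not sum to total_questions")
    if len(cal["disagreements"]) != cal["disagreement_count"]:
        problems.append("disagreement_count does not match list length")
    if not math.isclose(cal["agreement_rate"], cal["agreement_count"] / total):
        problems.append("agreement_rate does not equal agreement_count / total")
    for entry in cal["disagreements"]:
        if infer_direction(entry["sonnet"], entry["opus"]) != entry["direction"]:
            problems.append(f"direction mismatch for {entry['question']}")
    return problems


print(check_calibration(calibration))
```

For this file the check is expected to report no inconsistencies: 46 + 4 = 50, 46 / 50 = 0.92, and each stored `direction` matches the inferred label.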