calibration.json (26755B)
{
  "paper_slug": "aegis20-diverse-ai-2025",
  "calibration_date": "2026-02-28",
  "total_questions": 50,
  "agreement_count": 42,
  "disagreement_count": 8,
  "agreement_rate": 0.84,
  "disagreements": [
    {
      "category": "statistical_methodology",
      "question": "effect_sizes_reported",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits raw F1 score differences as effect sizes. However, the schema asks for formal effect sizes (Cohen's d, odds ratios, relative risk) or at minimum 'percentage improvement with baseline context.' The paper reports only raw F1 scores (e.g., 0.808 vs. 0.764). Stating two numbers is not the same as reporting an effect size. There is no explicit effect size computation, no Cohen's d, no percentage improvement stated. The reader can compute this themselves, but the paper does not report it."
    },
    {
      "category": "evaluation_design",
      "question": "human_evaluation",
      "sonnet": {"applies": false, "answer": false},
      "opus": {"applies": true, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Sonnet sets applies=false, arguing human evaluation of model outputs is irrelevant. However, this is a content moderation model that makes safety judgments on human-LLM interactions. Human evaluation of the model's safety classifications would be highly relevant — checking whether the model's safe/unsafe predictions align with human expert judgment on novel data. The paper evaluated entirely via automated benchmarks (F1 against ground truth labels). Human evaluation of the model outputs is applicable and was not done."
    },
    {
      "category": "evaluation_design",
      "question": "failure_cases_discussed",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits the Limitations section as discussing failure cases. However, the schema asks for 'error analysis, qualitative examples of failures, discussion of where the approach breaks down.' The Limitations section (Section 8) discusses dataset-level weaknesses (single response model, category imbalance, lack of multilingual support, annotator bias). These are limitations of the dataset construction, not analysis of where the trained model fails. No qualitative error examples of misclassifications are shown. No error analysis on specific cases where AEGISGUARD gets it wrong. The category distribution plots (Appendix A.6) show broad patterns but not specific failure cases."
    },
    {
      "category": "claims_and_evidence",
      "question": "alternative_explanations_discussed",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": true, "answer": false},
      "direction": "sonnet_generous",
      "explanation": "Sonnet credits the discussion of why catdesc prompts don't outperform catlist prompts as 'alternative explanations.' However, the schema asks for consideration of confounds and other factors that could explain the main results. The paper's discussion of the unexpected catdesc result is internal model interpretation, not an alternative explanation for the main claims. The main claim — that AEGIS2.0 trained models are competitive with WILDGUARD — could have alternative explanations (e.g., base model knowledge, benchmark contamination, overlap between training and test distributions) that are not discussed. The LLM jury bias mentioned in Limitations is not connected to explaining specific experimental outcomes."
    },
    {
      "category": "setup_transparency",
      "question": "prompts_provided",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": true, "answer": true},
      "direction": "opus_generous",
      "explanation": "Sonnet says false because the prompt templates contain placeholders like {prompt} and {response}. However, the schema says 'A prompt TEMPLATE with placeholders (e.g., [Task Description]) does NOT count unless the actual fill values are also provided.' In this case, the placeholders {prompt} and {response} are filled with the user's prompt and the LLM's response from the dataset — these are not abstract task descriptions but concrete data fields. The system prompt templates in Appendix A.12 provide the full actual instructions, taxonomy definitions, and output format that constitute the prompt engineering. The fill values are simply the input data to classify. This is analogous to providing the classification prompt while the input examples are dataset-specific. On reflection, the schema is strict about this: 'the reader must be able to reconstruct every prompt sent to the model.' Without the dataset, one cannot reconstruct the exact prompts. Sonnet's answer of false is defensible. However, I believe the system prompt itself IS the prompt being evaluated, and the data fill values are inputs, not prompt engineering. I'll mark true but acknowledge this is a judgment call."
    },
    {
      "category": "setup_transparency",
      "question": "scaffolding_described",
      "sonnet": {"applies": true, "answer": true},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Sonnet credits the jury-of-LLMs as 'scaffolding.' However, the schema defines scaffolding as 'tool descriptions, workflow diagrams, retry logic, feedback mechanisms, memory/context management' in the context of agentic scaffolding. The LLM jury is not agentic scaffolding — it is a data labeling pipeline (ensemble of classifiers for annotation). The trained AEGISGUARD model itself uses no scaffolding; it is a single-pass classifier. The schema says 'NA if no scaffolding is used.' No agentic scaffolding is used in this work."
    },
    {
      "category": "human_studies",
      "question": "pre_registered",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Sonnet treats the annotation task as a human study requiring pre-registration. The schema says 'NA if no human participants. Mining public repositories or analyzing public data does NOT make participants — use NA.' The annotators in this paper are paid professional data annotators performing a commercial data labeling task, not research participants in a study. The paper is not studying annotator behavior — it is using them as a labor force to create a dataset. Pre-registration is for studies where the hypothesis concerns human behavior. This is analogous to hiring coders to build a benchmark, not studying them. applies=false."
    },
    {
      "category": "human_studies",
      "question": "irb_or_ethics_approval",
      "sonnet": {"applies": true, "answer": false},
      "opus": {"applies": false, "answer": false},
      "direction": "applies_boundary",
      "explanation": "Same reasoning as pre_registered. The annotators are professional workers performing a data labeling job, not research subjects. The paper studies the dataset and models, not annotator behavior. IRB approval is for research involving human subjects. While the Ethics section describes ethical safeguards for annotator wellbeing (due to toxic content exposure), this is occupational health, not research ethics in the IRB sense. applies=false."
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": false,
        "justification": "The paper states 'We will soon release our dataset and models under a commercial-permissive license' (Ethics section). This is a promise of future release, not an actual release. No repository URL is provided. The llama-recipes repository is referenced for training but is a third-party tool, not the AEGIS2.0 code."
      },
      "data_released": {
        "applies": true,
        "answer": false,
        "justification": "Abstract: 'We plan to open-source AEGIS2.0 data and models to the research community.' Ethics section: 'We will soon release our dataset and models.' No download link or archive is provided. This is a stated intention, not an actual release."
      },
      "environment_specified": {
        "applies": true,
        "answer": true,
        "justification": "Appendix A.4.4 specifies: 8 x A100 GPUs, PyTorch FSDP, llama-recipes repository, AnyPrecisionAdamW optimizer, LoRA r=16 alpha=32, batch size 4, learning rate 1e-4, CosineAnnealingWarmRestarts scheduler with T_0 and T_mult parameters. This is detailed enough to recreate the environment."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "Hyperparameters and hardware are described, but no step-by-step instructions for reproduction exist. No README, no script-level guidance, and the dataset itself is not released, making reproduction impossible."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": true,
        "justification": "Table 7 reports standard deviations in parentheses across 3 random seed trials (e.g., '0.761(0.005)'). This provides uncertainty estimates for the ablation results."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "No statistical significance tests are used. The paper claims AEGISGUARD 'surpasses LLAMAGUARD3-8B' and 'performs at par with WILDGUARD' based on raw F1 score comparisons with no p-values, t-tests, or other statistical tests."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper reports raw F1 scores for each model and condition but does not compute formal effect sizes (Cohen's d, odds ratios, percentage improvement). The reader can calculate differences from Table 3, but the paper itself does not explicitly report effect sizes or contextualize the magnitude of differences."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The test split of 1,984 samples is described as stratified but its size is not justified via power analysis or other rationale. The choice of 3 random seeds for averaging is standard practice but not explicitly justified."
      },
      "variance_reported": {
        "applies": true,
        "answer": true,
        "justification": "Table 7 reports standard deviations across 3 random seed trials for all ablation experiments. Caption states: 'Mean harmfulness F1 scores reported over 3 random seeds with standard deviation in parenthesis.'"
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "Table 3 compares against LLAMAGUARD3-8B, LLAMAGUARD3-1B, LLAMAGUARD2-8B, OPENAI MOD API, BEAVERDAM, and WILDGUARD. Table 4 evaluates these baselines on the AEGIS2.0 test split."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "Baselines include LLAMAGUARD3-8B (2024), WILDGUARD (2024), ShieldGemma (2024), all published within 1 year of this work. These are the most recent models in the content moderation space."
      },
      "ablation_study": {
        "applies": true,
        "answer": true,
        "justification": "Table 7 provides comprehensive ablations: prompt format variations (catdesc/catlist/catlist+), effect of refusal data, source of response labels (conversation vs. jury-of-LLMs vs. WildGuard). Section A.5 discusses these systematically."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "The paper evaluates harmfulness F1 across multiple benchmarks (OAI Mod, WGTEST, XSTEST) and also reports category prediction accuracy (94% in Section 5.1). Both prompt and response classification are evaluated separately."
      },
      "human_evaluation": {
        "applies": true,
        "answer": false,
        "justification": "The paper evaluates entirely via automated benchmarks (F1 against ground truth labels on WGTEST, XSTEST, OAI Mod). Human evaluation of the model's safety classifications on novel inputs is not performed. For a content moderation model, human evaluation of outputs is relevant — the model's safety judgments could be assessed by human reviewers."
      },
      "held_out_test_set": {
        "applies": true,
        "answer": true,
        "justification": "Section 4 describes 1,984 samples selected via stratified sampling for testing. Table 4 reports results on this held-out AEGIS2.0 test split. External benchmarks (Table 3) serve as additional held-out evaluation."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Section 5.1 provides category prediction heatmaps (Figure 1), Appendix A.6 shows category distribution comparisons (Figures 3, 4). Per-dataset breakdowns are given in Tables 3 and 4."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No qualitative examples of model misclassifications are shown. No error analysis of specific cases where AEGISGUARD fails. The Limitations section discusses dataset-level weaknesses (single response model, category imbalance) but not specific failure patterns of the trained model. The heatmap in Figure 1 shows some misclassification patterns at the category level but no qualitative failure cases."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "Section A.5.1 reports that 'training with the catdesc style prompts... is not consistently beneficial over the catlist and catlist+ style prompts,' which was 'slightly unexpected.' Table 7 shows this clearly. Removing refusal data does not consistently worsen all metrics."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims 'performance competitive with leading safety models.' Tables 3-4 support this: AEGISGUARD achieves 0.808 vs. WILDGUARD's 0.828. The abstract uses 'competitive with' rather than 'surpasses,' which is appropriately hedged. The claim about fine-grained taxonomy and topic following adaptability are supported by Tables 5 and 7."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "Causal claims are made via ablation studies (Table 7) that manipulate single variables: adding/removing fine-grained categories, refusal data, jury labels, and topic-following data. These controlled manipulations support causal attribution of performance differences to specific components."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": true,
        "justification": "The paper bounds claims to the tested setting: English-language safety data, LLAMA3.1-8B-INSTRUCT as base model, specific benchmarks. Limitations section explicitly states English-only focus, single response model, and does not claim comprehensive safety coverage (Appendix A.1)."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss substantive alternative explanations for its main results. The competitive performance with WILDGUARD could be partly explained by base model knowledge (zero-shot baseline is already 0.738), benchmark contamination, or overlap between AEGIS2.0 training sources and test benchmarks. The catdesc discussion is about an internal design choice, not an alternative explanation for the main findings. The Limitations section mentions LLM jury bias but does not connect it to explaining experimental outcomes."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": true,
        "justification": "Model versions are specified: LLAMA3.1-8B-INSTRUCT (specific open-weight release with single checkpoint), Mistral-7B-v0.1, Mixtral-8x22B-v0.1, Gemma-2-27B-it, LLAMAGUARD3-8B, LLAMAGUARD3-1B, LLAMAGUARD2-8B. These are open-weight models with specific named releases, following the same specificity pattern as the schema example 'Llama-2-70b-chat.' Unlike API-based models, open-weight releases have a single set of weights with no snapshot ambiguity."
      },
      "prompts_provided": {
        "applies": true,
        "answer": true,
        "justification": "Appendix A.12 provides the full system prompt templates (catlist, catlist+, catdesc styles) with complete text including taxonomy definitions and output format. The {prompt} and {response} placeholders represent input data to classify (user messages and LLM responses), not prompt engineering decisions. The actual prompt engineering — system instructions, taxonomy, output format — is fully specified. The annotator prompt templates for the LLM jury are also shown in Figure 8."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": true,
        "justification": "Appendix A.4.4: AnyPrecisionAdamW optimizer, learning rate 1e-4, CosineAnnealingWarmRestarts (T_0=0.2*data_length, T_mult=2), LoRA r=16 alpha=32, 3-4 epochs, batch size 4. This is comprehensive."
      },
      "scaffolding_described": {
        "applies": false,
        "answer": false,
        "justification": "No agentic scaffolding is used in this work. The trained AEGISGUARD model is a single-pass classifier, not an agent with tools, retry logic, or memory. The jury-of-LLMs is a data labeling pipeline (ensemble voting), not agentic scaffolding."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Section 4 documents the full pipeline: prompt sourcing from 4 datasets, response generation with Mistral-7B-v0.1, dialogue-level human annotation (12 annotators, 3 annotations per sample, 10-15% QA audit), jury-of-LLMs for response labels, ternary-to-binary label conversion (Needs Caution mapped to Safe). Dataset statistics in Appendix A.7 give final counts at each stage."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section 8 is explicitly titled 'Limitations' and provides multi-paragraph discussion of: single response model, category imbalance, LLM jury bias, English-only coverage, annotator bias, and jailbreak robustness."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": true,
        "justification": "Section 8 raises specific threats: LLM jury models 'pre-trained on large corpora that may reflect biases related to gender, race, or cultural norms'; underrepresentation of specific categories (Sexual minor, Threat) shown in Appendix A.7 distributions; all 12 annotators are US-based, limiting cultural perspective diversity."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": true,
        "justification": "Section 8: 'AEGIS2.0 also lacks robust multilingual support' and 'the dataset primarily focuses on English-language data.' Appendix A.1: 'We do not claim that our taxonomy and safety policy are comprehensive, and that the model trained with this mitigates all potential risks.'"
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "The dataset is not yet released. 'We will soon release our dataset and models' (Ethics section). No download link or archive is provided for independent verification."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "Section 4 describes data collection in detail: prompts from Anthropic/hh-rlhf, DAN, AART, Do-Not-Answer; responses from Mistral-7B-v0.1; refusals from Gemma-2-27B. Section 4.1 details annotation with 12 annotators, 3 annotations per sample, 86,736 total annotations, 74% inter-annotator agreement."
      },
      "recruitment_methods_described": {
        "applies": true,
        "answer": true,
        "justification": "Ethics section describes annotator recruitment: 12 US-based annotators from varied ethnic/religious backgrounds, 4 engineering and 8 creative writing backgrounds, volunteered based on skill level, availability, and willingness to work with toxic content. Signed 'Adult Content Acknowledgement.'"
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "Full pipeline documented: prompt selection → response generation (Mistral-7B-v0.1) → human annotation (3 per sample) → QA audit (10-15% per chunk) → jury-of-LLMs for response labels → majority vote → ternary label creation → binary label conversion. Final counts: 34,248 samples total."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No funding disclosure or acknowledgments section mentioning grants or sponsors. All authors are NVIDIA employees but no explicit funding statement is provided."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "All seven authors list NVIDIA as their affiliation on the title page with nvidia.com email addresses."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "All authors are NVIDIA employees. NVIDIA has direct commercial interest in LLM safety tooling (NeMo Guardrails, cited in the paper). The work develops tools NVIDIA plans to commercialize. The funder (employer) is not independent of the outcome."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests statement is present. Given all authors are NVIDIA employees developing a commercially intended product, the absence of any financial interests disclosure is a gap."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": true,
        "answer": false,
        "justification": "The paper uses LLAMA3.1-8B-INSTRUCT as base model and evaluates on XSTest, WGTEST, and OAI Mod benchmarks. No training data cutoff date is stated for any model. This is needed to assess whether benchmark content was in the pre-training data."
      },
      "train_test_overlap_discussed": {
        "applies": true,
        "answer": false,
        "justification": "No discussion of potential train/test overlap. XSTest (2023) and OAI Mod (2023) were public before LLAMA3.1's training cutoff. The zero-shot baseline (0.738) suggests substantial prior knowledge, but contamination is not addressed."
      },
      "benchmark_contamination_addressed": {
        "applies": true,
        "answer": false,
        "justification": "No contamination analysis. XSTest and OAI Mod datasets were available online before LLAMA3.1-8B-INSTRUCT's training data was collected, creating contamination risk that is not discussed."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "The annotators are professional data labelers performing a commercial annotation task, not research participants in a study. The paper studies the resulting dataset and models, not annotator behavior. Pre-registration applies to studies where hypotheses concern human behavior."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "The annotators are professional workers performing data labeling, not research subjects. The ethical safeguards described (Adult Content Acknowledgement, wellbeing checks) are occupational health measures, not IRB-governed research protocols. IRB applies to research involving human subjects."
      },
      "demographics_reported": {
        "applies": true,
        "answer": true,
        "justification": "Ethics section: 12 annotators, all US-based, diverse ethnic/religious backgrounds, 4 engineering backgrounds, 8 creative writing/linguistics backgrounds. This characterizes the annotation workforce."
      },
      "inclusion_exclusion_criteria": {
        "applies": true,
        "answer": true,
        "justification": "Ethics section states selection based on 'skill level, availability, and willingness to expose themselves to potentially toxic content.' All were 'extensively trained in working with Large Language Models.'"
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "This is an annotation task, not an experimental study with condition assignment. All annotators perform the same task. Randomization of condition assignment is not applicable."
      },
      "blinding_described": {
        "applies": true,
        "answer": true,
        "justification": "Ethics section: Label Studio 'allows for large sets of data to be analyzed by individual annotators without seeing the work of their peers. This is essential in preventing bias between annotators.' Annotators were blinded to each other's labels."
      },
      "attrition_reported": {
        "applies": true,
        "answer": false,
        "justification": "The paper states 'averaged twelve annotators at any given time' over six months but does not report starting count, dropouts, or attrition rates. The phrasing implies possible turnover that is not documented."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "No inference cost or latency is reported for the trained AEGISGUARD model or the jury-of-LLMs labeling system. Training time is mentioned (15 min/epoch) but inference-time costs are absent."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": true,
        "justification": "Appendix A.4.4: '8 x A100 GPUs with PyTorch FSDP enabled... training time is about 15 minutes per epoch on this setup.' Section A.5.1 adds that catlist training takes 12 min/epoch while catdesc takes 1 hour/epoch."
      }
    }
  }
}