# Calibration Summary: Sonnet vs Opus on Boolean Checklist

**Date**: 2026-02-28
**Papers calibrated**: 8 of 19 (11 remaining due to rate limits)
**Instrument**: 50-question boolean checklist (scan.schema.json)

## Overall Agreement

| Metric | Value |
|--------|-------|
| Total question-pairs evaluated | 400 (8 papers x 50 questions) |
| Agreements | 373 |
| **Overall agreement rate** | **93.2%** |
| Disagreements | 27 |

## Per-Paper Results

| Paper | Agreement | Rate | Disagreements |
|-------|-----------|------|---------------|
| agentic-adoption-github-2026 | 50/50 | 100% | 0 |
| adoption-generative-artificial-2026 | 49/50 | 98% | 1 |
| adaptive-test-generation-2023 | 48/50 | 96% | 2 |
| agentless-2024 | 48/50 | 96% | 2 |
| ai-code-not-reproducible-2025 | 48/50 | 96% | 2 |
| agentic-programming-survey-2025 | 44/50 | 88% | 6 |
| adaptive-attacks-bypass-defenses-2025 | 43/50 | 86% | 7 |
| agent-developer-practices-2025 | 43/50 | 86% | 7 |

## Disagreement Direction

When Sonnet and Opus disagree, which model is stricter?

| Direction | Count | % of disagreements |
|-----------|-------|--------------------|
| Sonnet=YES, Opus=NO | 11 | 40.7% |
| Sonnet=NO, Opus=NA | 7 | 25.9% |
| Sonnet=YES, Opus=NA | 4 | 14.8% |
| Sonnet=NA, Opus=YES | 2 | 7.4% |
| Sonnet=NA, Opus=NO | 2 | 7.4% |
| Sonnet=NO, Opus=YES | 1 | 3.7% |

**Key finding**: Sonnet is more generous than Opus. In 11 of 27 disagreements (41%), Sonnet said YES where Opus said NO. Sonnet also over-applies YES/NO where Opus considers the question NA: this accounts for another 11 of 27 disagreements (41%).
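The agreement and direction tallies above can be recomputed from paired answers. The sketch below is a minimal illustration, not the script used for this calibration: the function names and the sample answers are hypothetical, and it assumes each model's answers are stored as `YES`/`NO`/`NA` strings keyed by question id. The `REPORTED_DIRECTIONS` counts are copied from the direction table.

```python
from collections import Counter


def compare_ratings(sonnet, opus):
    """Compare two answer maps (question id -> "YES"/"NO"/"NA").

    Returns (agreements, total, directions), where directions counts each
    (sonnet_answer, opus_answer) pair on which the two models differ.
    """
    shared = sorted(set(sonnet) & set(opus))
    directions = Counter(
        (sonnet[q], opus[q]) for q in shared if sonnet[q] != opus[q]
    )
    agreements = len(shared) - sum(directions.values())
    return agreements, len(shared), directions


def direction_shares(directions):
    """Express each disagreement direction as a percentage of all disagreements."""
    total = sum(directions.values())
    return {pair: round(100 * n / total, 1) for pair, n in directions.items()}


# Hypothetical answers for a four-question excerpt (illustration only).
sonnet_answers = {"q1": "YES", "q2": "NO", "q3": "YES", "q4": "NA"}
opus_answers = {"q1": "YES", "q2": "NA", "q3": "NO", "q4": "NA"}
agreements, total, directions = compare_ratings(sonnet_answers, opus_answers)

# Counts copied from the direction table, keyed as (Sonnet, Opus).
REPORTED_DIRECTIONS = Counter({
    ("YES", "NO"): 11,
    ("NO", "NA"): 7,
    ("YES", "NA"): 4,
    ("NA", "YES"): 2,
    ("NA", "NO"): 2,
    ("NO", "YES"): 1,
})
shares = direction_shares(REPORTED_DIRECTIONS)

# Disagreements where Opus answered NA and Sonnet answered YES or NO.
opus_na = sum(n for (s, o), n in REPORTED_DIRECTIONS.items() if o == "NA")
```

Keying the counter on the ordered pair keeps the direction of each disagreement, which a plain agree/disagree flag would discard.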

## Most Contentious Questions

| Question | Disagreements | Pattern |
|----------|---------------|---------|
| `claims_and_evidence.causal_claims_justified` | 3 | Mixed: Sonnet sometimes marks NA when Opus sees causal claims (ablation studies), sometimes NO when Opus sees adequate justification |
| `setup_transparency.prompts_provided` | 2 | Sonnet YES, Opus NO. Sonnet counts described prompts; Opus requires actual prompt text |
| `claims_and_evidence.alternative_explanations_discussed` | 2 | Sonnet YES, Opus NO. Different thresholds for what counts as "discussed" |

## Disagreement Categories

### 1. NA Boundary Disputes (13/27 = 48%)
The most common disagreement type. These are questions that do not clearly apply to a paper type:
- **contamination** questions on non-benchmark papers (Sonnet says NO, Opus says NA)
- **human_studies** items on interview/qualitative studies (Sonnet says NO, Opus says NA)
- Survey papers: Sonnet answers artifact questions (code/data released) as NA, Opus says NO

**Recommendation**: Tighten NA guidance in schema descriptions. For surveys, "code_released" should be NO (not NA), because a survey could release its analysis scripts. For contamination questions on non-LLM-evaluation papers, NA is correct.

### 2. Strictness Gaps (11/27 = 41%)
Sonnet says YES where Opus says NO. Common in:
- `setup_transparency`: Sonnet counts partial/described information; Opus requires the actual artifacts (full prompt text, specific version strings)
- `claims_and_evidence`: Sonnet credits vague mentions; Opus requires substantive discussion
- `limitations_and_scope`: Sonnet counts generic statements; Opus requires study-specific threats

**Recommendation**: The schema descriptions need sharper examples of what counts and what does not. Consider adding "This counts: ..." and "This does NOT count: ..." examples to the most ambiguous items.

### 3. Genuine Interpretive Differences (3/27 = 11%)
Reasonable disagreements where both answers are defensible:
- Whether manual classification of benchmark problems counts as "human evaluation"
- Whether ablation studies constitute "causal claims"

**Assessment**: These are inherent to the instrument and unlikely to be eliminated. 3 of 400 question-pairs (0.75%) is acceptable.

## Conclusion

93.2% inter-model agreement on a 50-question instrument is strong. For context:
- Inter-rater reliability studies in medicine commonly treat raw agreement above 80% as acceptable. The 93.2% figure is raw percent agreement and is not corrected for chance.
- The boolean format is expected to achieve higher agreement than a 0-3 Likert scale would
- Most disagreements are systematic (NA boundaries, strictness thresholds) and can be reduced with schema refinements

**Actionable improvements**:
1. Add explicit NA guidance to contamination, human_studies, and artifacts categories
2. Add "counts / does not count" examples to setup_transparency and claims_and_evidence items
3. Consider making Sonnet the primary rater, with the understanding that it disagrees with Opus on about 7% of answers (27 of 400) and is usually the more generous of the two. This is a known, systematic bias that can be disclosed in the paper

**Remaining work**: 11 more papers need calibration to strengthen these findings (rate-limited, will resume later).
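
The agreement rate reported in this summary is raw percent agreement. A chance-corrected complement that is common in inter-rater reliability work is Cohen's kappa, which was not computed for this calibration. The sketch below shows the calculation on hypothetical labels only; the function name and the sample data are assumptions, and the calibration's own kappa cannot be derived from the tables here because they do not break the agreeing answers down by category.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement by chance, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six checklist items (not calibration data).
sonnet_labels = ["YES", "YES", "NO", "NA", "YES", "NO"]
opus_labels = ["YES", "NO", "NO", "NA", "YES", "NA"]
kappa = cohens_kappa(sonnet_labels, opus_labels)
```

In this example the raters agree on 4 of 6 items (0.667 observed) against 0.333 expected by chance, so kappa is about 0.5, noticeably lower than the raw agreement rate.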