ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

calibration-summary.md (4684B)


# Calibration Summary: Sonnet vs Opus on Boolean Checklist

**Date**: 2026-02-28
**Papers calibrated**: 8 of 19 (11 remaining due to rate limits)
**Instrument**: 50-question boolean checklist (scan.schema.json)

## Overall Agreement

| Metric | Value |
|--------|-------|
| Total question-pairs evaluated | 400 (8 papers x 50 questions) |
| Agreements | 373 |
| **Overall agreement rate** | **93.2%** |
| Disagreements | 27 |

## Per-Paper Results

| Paper | Agreement | Rate | Disagreements |
|-------|-----------|------|---------------|
| agentic-adoption-github-2026 | 50/50 | 100% | 0 |
| adoption-generative-artificial-2026 | 49/50 | 98% | 1 |
| adaptive-test-generation-2023 | 48/50 | 96% | 2 |
| agentless-2024 | 48/50 | 96% | 2 |
| ai-code-not-reproducible-2025 | 48/50 | 96% | 2 |
| agentic-programming-survey-2025 | 44/50 | 88% | 6 |
| adaptive-attacks-bypass-defenses-2025 | 43/50 | 86% | 7 |
| agent-developer-practices-2025 | 43/50 | 86% | 7 |

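As a quick cross-check, the overall figures can be recomputed from the per-paper rows. The sketch below copies the agreement counts from the table; the variable names are illustrative and not taken from the repository.

```python
# Agreements out of 50 questions, copied from the per-paper table.
QUESTIONS_PER_PAPER = 50
agreements = {
    "agentic-adoption-github-2026": 50,
    "adoption-generative-artificial-2026": 49,
    "adaptive-test-generation-2023": 48,
    "agentless-2024": 48,
    "ai-code-not-reproducible-2025": 48,
    "agentic-programming-survey-2025": 44,
    "adaptive-attacks-bypass-defenses-2025": 43,
    "agent-developer-practices-2025": 43,
}

total_pairs = QUESTIONS_PER_PAPER * len(agreements)
total_agreements = sum(agreements.values())
total_disagreements = total_pairs - total_agreements
agreement_rate = total_agreements / total_pairs

print(f"pairs={total_pairs} agree={total_agreements} disagree={total_disagreements}")
print(f"agreement rate = {agreement_rate:.2%}")
```

The rows sum to 373 agreements and 27 disagreements over 400 question-pairs, i.e. 93.25%, which the summary reports as 93.2%.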
## Disagreement Direction

When Sonnet and Opus disagree, who is stricter?

| Direction | Count | % of disagreements |
|-----------|-------|--------------------|
| Sonnet=YES, Opus=NO | 11 | 40.7% |
| Sonnet=NO, Opus=NA | 7 | 25.9% |
| Sonnet=YES, Opus=NA | 4 | 14.8% |
| Sonnet=NA, Opus=YES | 2 | 7.4% |
| Sonnet=NA, Opus=NO | 2 | 7.4% |
| Sonnet=NO, Opus=YES | 1 | 3.7% |

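**Key finding**: Sonnet is more generous than Opus. In 41% of disagreements (11/27), Sonnet said YES where Opus said NO; the reverse happened only once. Sonnet also answers YES or NO where Opus considers the question NA in another 41% of disagreements (11/27). Counting both directions, 15 of 27 disagreements (56%) involve an NA answer from one of the models.
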
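The shares quoted in the key finding follow directly from the direction table. A minimal sketch of that tally, with the counts copied from the table and the dictionary layout chosen only for illustration:

```python
# (sonnet_answer, opus_answer) -> number of disagreements, from the direction table.
directions = {
    ("YES", "NO"): 11,
    ("NO", "NA"): 7,
    ("YES", "NA"): 4,
    ("NA", "YES"): 2,
    ("NA", "NO"): 2,
    ("NO", "YES"): 1,
}

total = sum(directions.values())

# Sonnet gave a YES/NO answer where Opus judged the question not applicable.
opus_na = sum(n for (sonnet, opus), n in directions.items() if opus == "NA")

# Either model answered NA.
any_na = sum(n for (sonnet, opus), n in directions.items() if "NA" in (sonnet, opus))

sonnet_more_generous = directions[("YES", "NO")]
opus_more_generous = directions[("NO", "YES")]

print(f"total disagreements: {total}")
print(f"Sonnet YES / Opus NO: {sonnet_more_generous / total:.1%}")
print(f"Opus NA, Sonnet YES or NO: {opus_na / total:.1%}")
print(f"any NA involved: {any_na / total:.1%}")
```

Keeping the two NA figures separate matters: 11 disagreements have Opus answering NA, while 15 involve NA from either model.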
## Most Contentious Questions

| Question | Disagreements | Pattern |
|----------|---------------|---------|
| `claims_and_evidence.causal_claims_justified` | 3 | Mixed: Sonnet sometimes marks NA when Opus sees causal claims (ablation studies), sometimes NO when Opus sees adequate justification |
| `setup_transparency.prompts_provided` | 2 | Sonnet YES, Opus NO. Sonnet counts described prompts; Opus requires actual prompt text |
| `claims_and_evidence.alternative_explanations_discussed` | 2 | Sonnet YES, Opus NO. Different thresholds for what counts as "discussed" |

## Disagreement Categories

### 1. NA Boundary Disputes (13/27 = 48%)
The most common disagreement type. These arise on questions that don't clearly apply to a paper type:
- **contamination** questions on non-benchmark papers (Sonnet says NO, Opus says NA)
- **human_studies** items on interview/qualitative studies (Sonnet says NO, Opus says NA)
- Survey papers: Sonnet answers artifact questions (code/data released) as NA, Opus says NO

**Recommendation**: Tighten NA guidance in schema descriptions. For surveys, "code_released" should be NO (not NA), because a survey could release its analysis scripts. For contamination questions on non-LLM-evaluation papers, NA is correct.

### 2. Strictness Gaps (11/27 = 41%)
Sonnet says YES where Opus says NO. Common in:
- `setup_transparency`: Sonnet counts partial/described information; Opus requires the actual artifacts (full prompt text, specific version strings)
- `claims_and_evidence`: Sonnet credits vague mentions; Opus requires substantive discussion
- `limitations_and_scope`: Sonnet counts generic statements; Opus requires study-specific threats

**Recommendation**: These gaps suggest the schema descriptions need sharper examples of what counts and what doesn't. Consider adding "This counts: ..." and "This does NOT count: ..." examples to the most ambiguous items.

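One way the "counts / does not count" recommendation could look in practice is sketched below. The layout of scan.schema.json is not shown in this summary, so the JSON Schema keywords (`type`, `enum`, `description`), the helper `build_description`, and the question wording are all hypothetical; only the two examples are taken from the `prompts_provided` disagreement pattern described earlier.

```python
import json


def build_description(question: str, counts: list[str], does_not_count: list[str]) -> str:
    """Append explicit positive and negative examples to a question's description."""
    lines = [question]
    lines += [f"This counts: {example}" for example in counts]
    lines += [f"This does NOT count: {example}" for example in does_not_count]
    return "\n".join(lines)


# Hypothetical schema entry; the real item in scan.schema.json may be shaped differently.
prompts_provided = {
    "type": "string",
    "enum": ["YES", "NO", "NA"],
    "description": build_description(
        "Are the prompts used in the study provided?",
        counts=["the actual prompt text, in the paper or a linked artifact"],
        does_not_count=["a prose description of what the prompt asked for"],
    ),
}

print(json.dumps({"prompts_provided": prompts_provided}, indent=2))
```

Keeping the examples inside the description string means both raters see the same boundary cases without any change to the answer format.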
### 3. Genuine Interpretive Differences (3/27 = 11%)
Reasonable disagreements where both answers are defensible:
- Whether manual classification of benchmark problems counts as "human evaluation"
- Whether ablation studies constitute "causal claims"

**Assessment**: These are inherent to the instrument and unlikely to be eliminated. 3 out of 400 is acceptable.

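## Conclusion

93.2% inter-model agreement on a 50-question instrument is strong. For context:
- Raw agreement above 80% is often treated as acceptable in inter-rater reliability work. The label "substantial agreement" comes from the Landis and Koch scale for Cohen's kappa (0.61 to 0.80), which corrects for chance agreement; this summary reports raw agreement only
- The boolean format likely achieves higher agreement than a 0-3 Likert scale would, since it leaves fewer adjacent categories to disagree over
- Most disagreements are systematic (NA boundaries, strictness thresholds) and can be reduced with schema refinements
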
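The 93.2% figure is raw percent agreement. Reliability studies often also report Cohen's kappa, which discounts the agreement two raters would reach by chance given how often each uses every label. The sketch below shows the calculation on made-up toy answers; the per-question answers for the 373 agreeing pairs are not included in this summary, so no kappa for the real data is implied.

```python
from collections import Counter


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data for illustration only, not taken from the calibration run.
sonnet = ["YES", "YES", "NO", "NA", "YES", "NO", "NO", "NA", "YES", "NO"]
opus = ["YES", "NO", "NO", "NA", "YES", "NO", "NA", "NA", "YES", "NO"]

kappa = cohen_kappa(sonnet, opus)
print(f"kappa = {kappa:.3f}")
```

On this toy input the raters agree on 8 of 10 items, yet kappa comes out lower than 0.8 because some of that agreement is expected by chance.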
**Actionable improvements**:
1. Add explicit NA guidance to contamination, human_studies, and artifacts categories
2. Add "counts / does not count" examples to setup_transparency and claims_and_evidence items
3. Consider making Sonnet the primary rater, with the understanding that it disagrees with Opus on ~7% of question-pairs (27/400), most often in the more generous direction. This is a known, systematic bias that can be disclosed in the paper

**Remaining work**: 11 more papers need calibration to strengthen these findings (rate-limited, will resume later).
