scan-category-b.md (1726B)
# Scan Category B: Statistical Methodology + Evaluation Design

**Model: Opus**

You are a category evaluator. Answer ONLY the questions in your assigned categories.

## Your categories (14 questions)

### Statistical Methodology (5q)
- `confidence_intervals_or_error_bars` — CIs or error bars on main results?
- `significance_tests` — statistical tests for comparative claims?
- `effect_sizes_reported` — effect sizes, not just p-values or raw differences?
- `sample_size_justified` — sample size justified or power analysis?
- `variance_reported` — variance/std dev across experimental runs?

### Evaluation Design (9q)
- `baselines_included` — baseline comparisons included?
- `baselines_contemporary` — baselines recent and competitive?
- `ablation_study` — ablation showing which components matter?
- `multiple_metrics` — multiple evaluation metrics used?
- `human_evaluation` — human evaluation of system outputs?
- `held_out_test_set` — results on held-out test set?
- `per_category_breakdown` — per-category/per-task breakdowns?
- `failure_cases_discussed` — failure cases shown or discussed?
- `negative_results_reported` — things that didn't work reported?

## Input

1. Paper text: `papers/<SLUG>/paper.txt`
2. Triage applicability flags: `papers/<SLUG>/triage.json`

## Output

Write to stdout a JSON object with `statistical_methodology` and `evaluation_design` keys, each containing checklist items with `applies`, `answer`, `justification`.

## Rules

- Read schema descriptions in `schema/scan.schema.json` for detailed criteria.
- Use `applies` flags from triage.json.
- Be strict. Follow answer rules from `agents/scan-agent.md`.
- Cite specific sections/pages in justifications.
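
## Output shape (illustrative sketch)

As a rough illustration of the object described under Output, the Python sketch below builds an empty skeleton for the 14 questions and checks that a result has the expected keys. It assumes each section is a mapping from question ID to an item, which this file does not specify; `schema/scan.schema.json` remains the authority on layout and allowed values. The helper names `empty_item`, `build_skeleton`, and `check_shape` are hypothetical, and the `None` values are placeholders, not valid answers.

```python
import json

# Question IDs copied from the category lists in this file.
STATISTICAL_METHODOLOGY = [
    "confidence_intervals_or_error_bars",
    "significance_tests",
    "effect_sizes_reported",
    "sample_size_justified",
    "variance_reported",
]
EVALUATION_DESIGN = [
    "baselines_included",
    "baselines_contemporary",
    "ablation_study",
    "multiple_metrics",
    "human_evaluation",
    "held_out_test_set",
    "per_category_breakdown",
    "failure_cases_discussed",
    "negative_results_reported",
]
EXPECTED = {
    "statistical_methodology": STATISTICAL_METHODOLOGY,
    "evaluation_design": EVALUATION_DESIGN,
}
ITEM_FIELDS = ("applies", "answer", "justification")


def empty_item():
    # Placeholders only: real values come from triage.json (applies) and
    # from the answer rules in agents/scan-agent.md (answer).
    return {"applies": None, "answer": None, "justification": ""}


def build_skeleton():
    # ASSUMPTION: items are keyed by question ID inside each section.
    return {
        section: {question: empty_item() for question in questions}
        for section, questions in EXPECTED.items()
    }


def check_shape(result):
    """Return a list of structural problems; an empty list means the shape is OK."""
    problems = []
    for section, questions in EXPECTED.items():
        items = result.get(section)
        if not isinstance(items, dict):
            problems.append(f"missing section: {section}")
            continue
        for question in questions:
            item = items.get(question)
            if not isinstance(item, dict):
                problems.append(f"missing item: {section}.{question}")
                continue
            for field in ITEM_FIELDS:
                if field not in item:
                    problems.append(f"missing field: {section}.{question}.{field}")
    return problems


if __name__ == "__main__":
    # Print the empty skeleton to stdout, as the Output section requires for results.
    print(json.dumps(build_skeleton(), indent=2))
```

A check like `check_shape` only confirms that every question and field is present. It does not validate answer values or justification quality, which the schema and `agents/scan-agent.md` govern.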