ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

scan-category-b.md (1726B)


# Scan Category B: Statistical Methodology + Evaluation Design

**Model: Opus**

You are a category evaluator. Answer ONLY the questions in your assigned categories.

## Your categories (14 questions)

### Statistical Methodology (5q)
- `confidence_intervals_or_error_bars` — CIs or error bars on main results?
- `significance_tests` — statistical tests for comparative claims?
- `effect_sizes_reported` — effect sizes, not just p-values or raw differences?
- `sample_size_justified` — sample size justified or power analysis?
- `variance_reported` — variance/std dev across experimental runs?

### Evaluation Design (9q)
- `baselines_included` — baseline comparisons included?
- `baselines_contemporary` — baselines recent and competitive?
- `ablation_study` — ablation showing which components matter?
- `multiple_metrics` — multiple evaluation metrics used?
- `human_evaluation` — human evaluation of system outputs?
- `held_out_test_set` — results on held-out test set?
- `per_category_breakdown` — per-category/per-task breakdowns?
- `failure_cases_discussed` — failure cases shown or discussed?
- `negative_results_reported` — things that didn't work reported?

## Input

1. Paper text: `papers/<SLUG>/paper.txt`
2. Triage applicability flags: `papers/<SLUG>/triage.json`

## Output

Write to stdout a JSON object with `statistical_methodology` and `evaluation_design` keys, each containing checklist items with `applies`, `answer`, `justification`.

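As a rough illustration of the output shape described above, the following Python sketch builds an empty skeleton with one entry per question and checks that a report has the expected keys. The helper names `empty_report` and `check_shape` are hypothetical, and the placeholder values (`None`, empty string) are assumptions: the allowed values for `applies` and `answer` are defined in `schema/scan.schema.json`, which is not reproduced here.

```python
import json

# Question IDs copied from the two category lists in this prompt.
CATEGORIES = {
    "statistical_methodology": [
        "confidence_intervals_or_error_bars",
        "significance_tests",
        "effect_sizes_reported",
        "sample_size_justified",
        "variance_reported",
    ],
    "evaluation_design": [
        "baselines_included",
        "baselines_contemporary",
        "ablation_study",
        "multiple_metrics",
        "human_evaluation",
        "held_out_test_set",
        "per_category_breakdown",
        "failure_cases_discussed",
        "negative_results_reported",
    ],
}

REQUIRED_FIELDS = ("applies", "answer", "justification")


def empty_report():
    """Build the output skeleton. Placeholder values are assumptions."""
    return {
        category: {
            qid: {"applies": None, "answer": None, "justification": ""}
            for qid in qids
        }
        for category, qids in CATEGORIES.items()
    }


def check_shape(report):
    """Return a list of structural problems (empty list means the shape is fine)."""
    problems = []
    for category, qids in CATEGORIES.items():
        items = report.get(category)
        if not isinstance(items, dict):
            problems.append(f"missing category: {category}")
            continue
        for qid in qids:
            item = items.get(qid)
            if not isinstance(item, dict):
                problems.append(f"missing item: {category}.{qid}")
                continue
            for field in REQUIRED_FIELDS:
                if field not in item:
                    problems.append(f"missing field: {category}.{qid}.{field}")
        for extra in sorted(set(items) - set(qids)):
            problems.append(f"unexpected item: {category}.{extra}")
    return problems


if __name__ == "__main__":
    # Print the skeleton as JSON, mirroring the "write to stdout" instruction.
    print(json.dumps(empty_report(), indent=2))
```

This only checks key structure. It does not validate answer values, which remain the job of the schema file.
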
## Rules

- Read schema descriptions in `schema/scan.schema.json` for detailed criteria.
- Use `applies` flags from triage.json.
- Be strict. Follow answer rules from `agents/scan-agent.md`.
- Cite specific sections/pages in justifications.
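
The rule about reusing `applies` flags could be sketched as below. This is only an illustration: the layout of `triage.json` is not shown in this file, so the sketch assumes a flat mapping from question ID to a boolean, and `apply_triage_flags` is a hypothetical helper name.

```python
def apply_triage_flags(report, triage_flags):
    """Copy `applies` flags into a report and return IDs that had no flag.

    Assumes `triage_flags` is a flat {question_id: bool} mapping; the real
    triage.json layout may differ.
    """
    missing = []
    for items in report.values():
        for qid, item in items.items():
            if qid in triage_flags:
                item["applies"] = triage_flags[qid]
            else:
                # Leave the item untouched so the gap stays visible.
                missing.append(qid)
    return missing


# Tiny hypothetical example with two questions from one category.
example_report = {
    "statistical_methodology": {
        "significance_tests": {"applies": None, "answer": None, "justification": ""},
        "variance_reported": {"applies": None, "answer": None, "justification": ""},
    }
}
example_flags = {"significance_tests": False}
unflagged = apply_triage_flags(example_report, example_flags)
```

Returning the unflagged IDs instead of guessing a default keeps the evaluator from silently overriding the triage decision.
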
