ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

audit.md (5217B)


Run an Opus calibration audit on recent Sonnet scans.

Arguments: $ARGUMENTS
- A number (e.g., `5`, `10`) sets how many papers to audit
- `all` audits every paper that has scan.json but no calibration.json
- `status` prints current audit progress without running
- `summary` aggregates existing calibration.json files into a report
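
The four argument forms above could be normalized as in the following minimal sketch. The function name `parse_audit_argument` and the `(mode, limit)` return shape are hypothetical and are not part of the command file.

```python
def parse_audit_argument(raw):
    """Map the raw argument string to a (mode, limit) pair.

    mode is "audit", "status", or "summary" (hypothetical labels);
    limit is a positive int, or None when there is no cap (`all`).
    """
    arg = raw.strip().lower()
    if arg in ("status", "summary"):
        return (arg, None)
    if arg == "all":
        return ("audit", None)
    if arg.isdigit() and int(arg) > 0:
        return ("audit", int(arg))
    raise ValueError(f"unrecognized argument: {raw!r}")
```

Rejecting anything else up front avoids launching agents on a mistyped argument.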

## Instructions

### 1. Check status

```bash
# Count scans, calibrations, and scans that still lack a calibration.
echo "scan.json:" && find papers -name "scan.json" | wc -l
echo "calibration.json:" && find papers -name "calibration.json" | wc -l
# Pass each path as "$1" rather than splicing {} into the script text, so unusual filenames stay safe.
echo "Unaudited:" && find papers -name "scan.json" -exec sh -c 'test ! -f "$(dirname "$1")/calibration.json" && echo "$1"' _ {} \; | wc -l
```

Print the status summary. If the argument is `status`, stop here.

### 2. If the argument is `summary`, skip to step 6.

### 3. Get list of papers to audit

Find papers with scan.json but no calibration.json, capped at the requested number `<N>` (omit the `head` stage when the argument is `all`):
```bash
find papers -name "scan.json" -exec sh -c 'test ! -f "$(dirname "$1")/calibration.json" && basename "$(dirname "$1")"' _ {} \; | sort | head -n <N>
```
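
For reference, the same selection can be sketched in Python. This assumes every paper lives exactly one level down, at `papers/<slug>/scan.json`, whereas the `find` command searches recursively. The function name `unaudited_slugs` is hypothetical.

```python
from pathlib import Path


def unaudited_slugs(papers_dir="papers", limit=None):
    """Return sorted slugs that have scan.json but no calibration.json.

    limit=None corresponds to the `all` argument; an int caps the list.
    """
    root = Path(papers_dir)
    slugs = sorted(
        scan.parent.name
        for scan in root.glob("*/scan.json")
        if not (scan.parent / "calibration.json").exists()
    )
    return slugs if limit is None else slugs[:limit]
```

Sorting before truncating keeps the selection deterministic across runs, matching the `sort | head` order of the shell pipeline.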

### 4. Launch Opus calibration agents in parallel batches

Launch **5 sub-agents at a time** using the Task tool with `model: "opus"` and `run_in_background: true`.

For each sub-agent, use this prompt (fill in the slug):

---

You are a calibration agent. Your job is to independently evaluate a research paper using the same boolean checklist that a Sonnet agent already completed, then compare your answers.

**Step 1: Read these files:**
1. `/root/projects/ai-research-survey/schema/scan.schema.json`: the checklist schema with evaluation criteria
2. `/root/projects/ai-research-survey/agents/scan-agent.md`: evaluation instructions
3. `/root/projects/ai-research-survey/papers/<SLUG>/paper.txt`: the paper

Do NOT open `/root/projects/ai-research-survey/papers/<SLUG>/scan.json` (Sonnet's answers) yet. You will read it in Step 3.

**Step 2: Independently answer all 50 boolean checklist questions.**
For each question, read the schema description carefully and search the paper for evidence. Answer with `applies` (boolean) and `answer` (boolean) plus a justification. Form your own answer for every question before looking at Sonnet's answers.

**Step 3: Compare your answers with Sonnet's scan.json.**
Now read `/root/projects/ai-research-survey/papers/<SLUG>/scan.json`. For each of the 50 questions, compare both fields (`applies` and `answer`). A disagreement is when either field differs. For each disagreement, explain why your answer differs.

**Step 4: Write a calibration JSON file** to `/root/projects/ai-research-survey/papers/<SLUG>/calibration.json` with this structure:
```json
{
  "paper_slug": "<SLUG>",
  "total_questions": 50,
  "agreement_count": <number of questions where both applies AND answer match>,
  "disagreement_count": <number of questions where either field differs>,
  "agreement_rate": <agreement_count / total_questions, float 0-1>,
  "disagreements": [
    {
      "category": "<category_name>",
      "question": "<question_id>",
      "sonnet_applies": <bool>,
      "sonnet_answer": <bool>,
      "opus_applies": <bool>,
      "opus_answer": <bool>,
      "opus_justification": "<your reasoning>",
      "sonnet_justification": "<sonnet's reasoning from scan.json>",
      "direction": "<sonnet_generous|opus_generous|applies_boundary|interpretive>"
    }
  ],
  "opus_checklist": { <your full checklist in same format as scan.json checklist> }
}
```

The "direction" field should be:
- "sonnet_generous" if Sonnet has answer=true and you have answer=false (both applies=true)
- "opus_generous" if you have answer=true and Sonnet has answer=false (both applies=true)
- "applies_boundary" if the applies field differs between Sonnet and Opus
- "interpretive" for any other disagreement (in practice, both have applies=false but the answer values differ)

Be strict and follow the schema descriptions exactly. Do not be generous.

---
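
The comparison and direction rules in the prompt above can be expressed as a short Python sketch, which may help when spot-checking an agent's `calibration.json`. It assumes a flattened checklist of the form `{question_id: {"applies": bool, "answer": bool}}`; the real scan.json layout is defined by the schema and may be nested by category. The function names are hypothetical.

```python
def classify_direction(sonnet, opus):
    """Return the direction label for one question, or None if both fields match."""
    if sonnet["applies"] == opus["applies"] and sonnet["answer"] == opus["answer"]:
        return None
    if sonnet["applies"] != opus["applies"]:
        return "applies_boundary"
    if sonnet["applies"] and opus["applies"]:
        # Answers differ here, so exactly one side answered true.
        return "sonnet_generous" if sonnet["answer"] else "opus_generous"
    # Both applies=false but the answers differ.
    return "interpretive"


def compare_checklists(sonnet_checklist, opus_checklist):
    """Compare two flattened checklists (assumed layout, see lead-in)."""
    disagreements = []
    for question_id, opus_entry in opus_checklist.items():
        direction = classify_direction(sonnet_checklist[question_id], opus_entry)
        if direction is not None:
            disagreements.append({"question": question_id, "direction": direction})
    total = len(opus_checklist)
    agreement_count = total - len(disagreements)
    return {
        "total_questions": total,
        "agreement_count": agreement_count,
        "disagreement_count": len(disagreements),
        "agreement_rate": agreement_count / total if total else 0.0,
        "disagreements": disagreements,
    }
```

By construction `agreement_count + disagreement_count` equals `total_questions`, which is a cheap consistency check to run against each file an agent writes.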

### 5. Wait for each batch, then launch next

After each batch of 5 completes, report:
- How many succeeded (wrote calibration.json)
- How many failed
- How many remain

Continue until the limit is reached or all papers are audited.
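
The batching and the per-batch report can be sketched as follows. This is illustrative only: the sub-agents themselves are launched through the Task tool, not from Python, and the helper names `batches` and `batch_report` are hypothetical. Success is judged the way the step describes it, by whether `calibration.json` now exists for the slug.

```python
from pathlib import Path


def batches(slugs, size=5):
    """Yield consecutive groups of at most `size` slugs."""
    for start in range(0, len(slugs), size):
        yield slugs[start:start + size]


def batch_report(batch, remaining, papers_dir="papers"):
    """Count which slugs in a finished batch produced calibration.json."""
    succeeded = [
        slug for slug in batch
        if (Path(papers_dir) / slug / "calibration.json").exists()
    ]
    return {
        "succeeded": len(succeeded),
        "failed": len(batch) - len(succeeded),
        "remaining": remaining,
    }
```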

### 6. Aggregate and summarize

After all agents complete (or if the argument is `summary`), read all calibration.json files and produce a summary:

```python
import glob
import json

# Load every per-paper calibration file, in slug order.
calibrations = []
for f in sorted(glob.glob('papers/*/calibration.json')):
    with open(f) as fh:
        calibrations.append(json.load(fh))
```

Print a table:

| Paper | Agreement | Disagreements |
|-------|-----------|---------------|
| slug  | XX%       | N (breakdown) |

Then print overall statistics:
- **Overall agreement**: X/Y (Z%), where X sums `agreement_count` and Y sums `total_questions` across papers
- **Disagreement breakdown**: applies_boundary: N, sonnet_generous: N, opus_generous: N, interpretive: N
- **Most disagreed questions**: list questions with 2+ disagreements across papers
- **Comparison with previous rounds** (if analysis/calibration-summary.json exists; read it before overwriting it)

Write the summary to `analysis/calibration-summary.json` and print it for the user.
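
A minimal sketch of the overall statistics, taking a list of parsed calibration.json dicts with the fields from the Step 4 structure. The function name `summarize` and the key names of the returned summary are hypothetical, and the sketch assumes question ids are unique across categories.

```python
from collections import Counter


def summarize(calibrations):
    """Aggregate per-paper calibration dicts into overall statistics."""
    agreed = sum(c["agreement_count"] for c in calibrations)
    total = sum(c["total_questions"] for c in calibrations)
    directions = Counter()
    per_question = Counter()
    for c in calibrations:
        for d in c["disagreements"]:
            directions[d["direction"]] += 1
            per_question[d["question"]] += 1
    # Questions disputed in 2+ papers, most disputed first, ties by id.
    most_disagreed = sorted(
        (q for q, n in per_question.items() if n >= 2),
        key=lambda q: (-per_question[q], q),
    )
    return {
        "papers": len(calibrations),
        "agreement_count": agreed,
        "total_questions": total,
        "agreement_rate": agreed / total if total else 0.0,
        "direction_breakdown": dict(directions),
        "most_disagreed_questions": most_disagreed,
    }
```

Computing the rate from summed counts, rather than averaging per-paper rates, keeps the figure consistent with the X/Y form of the overall agreement line.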

### 7. Structural validation

Also run structural validation on all audited scan.json files:
- All 50 questions present with `applies` (bool), `answer` (bool), `justification` (string)
- No `applies=false, answer=true` violations
- All claims have `claim`, `evidence`, `supported` (with valid enum)
- `methodology_tags` use only allowed enum values
- Flag any schema violations found

Report any validation failures to the user.
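
The checks listed above could look roughly like this in Python. The layout `scan["checklist"][category][question_id]` and the top-level `claims` and `methodology_tags` keys are assumptions about scan.json. The enum values are not given here, so they are passed in as parameters and should be taken from `schema/scan.schema.json`. The function name `validate_scan` is hypothetical.

```python
def validate_scan(scan, supported_values, allowed_tags, expected_questions=50):
    """Return a list of human-readable violations for one parsed scan.json."""
    errors = []
    # Assumed layout: scan["checklist"][category][question_id] = {...}
    entries = [
        (f"{category}.{qid}", entry)
        for category, questions in scan.get("checklist", {}).items()
        for qid, entry in questions.items()
    ]
    if len(entries) != expected_questions:
        errors.append(f"expected {expected_questions} questions, found {len(entries)}")
    for name, entry in entries:
        if not isinstance(entry.get("applies"), bool):
            errors.append(f"{name}: applies must be a boolean")
        if not isinstance(entry.get("answer"), bool):
            errors.append(f"{name}: answer must be a boolean")
        if not isinstance(entry.get("justification"), str):
            errors.append(f"{name}: justification must be a string")
        if entry.get("applies") is False and entry.get("answer") is True:
            errors.append(f"{name}: applies=false with answer=true")
    for i, claim in enumerate(scan.get("claims", [])):
        for field in ("claim", "evidence", "supported"):
            if field not in claim:
                errors.append(f"claims[{i}]: missing {field}")
        if "supported" in claim and claim["supported"] not in supported_values:
            errors.append(f"claims[{i}]: invalid supported value {claim['supported']!r}")
    for tag in scan.get("methodology_tags", []):
        if tag not in allowed_tags:
            errors.append(f"methodology_tags: {tag!r} is not allowed")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a single pass report every violation in a file.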
