audit.md (5217B)
Run an Opus calibration audit on recent Sonnet scans.

Arguments: $ARGUMENTS
- A number (e.g., `5`, `10`) sets how many papers to audit
- `all` audits every paper that has scan.json but no calibration.json
- `status` prints current audit progress without running
- `summary` aggregates existing calibration.json files into a report

## Instructions

### 1. Check status

```bash
echo "scan.json:" && find papers -name "scan.json" | wc -l
echo "calibration.json:" && find papers -name "calibration.json" | wc -l
# Pass the path as "$1" rather than embedding {} in the script, so unusual filenames are handled safely
echo "Unaudited:" && find papers -name "scan.json" -exec sh -c 'test ! -f "$(dirname "$1")/calibration.json" && echo "$1"' _ {} \; | wc -l
```

Print the status summary. If the argument is `status`, stop here.

### 2. If the argument is `summary`, skip to step 6.

### 3. Get list of papers to audit

Find papers with scan.json but no calibration.json:
```bash
find papers -name "scan.json" -exec sh -c 'test ! -f "$(dirname "$1")/calibration.json" && basename "$(dirname "$1")"' _ {} \; | sort | head -n <N>
```

Replace `<N>` with the numeric argument. If the argument is `all`, drop the `| head -n <N>` stage.

### 4. Launch Opus calibration agents in parallel batches

Launch **5 sub-agents at a time** using the Task tool with `model: "opus"` and `run_in_background: true`.

For each sub-agent, use this prompt (fill in the slug):

---

You are a calibration agent. Your job is to independently evaluate a research paper using the same boolean checklist that a Sonnet agent already completed, then compare your answers.

**Step 1: Read these files:**
1. `/root/projects/ai-research-survey/schema/scan.schema.json`: the checklist schema with evaluation criteria
2. `/root/projects/ai-research-survey/agents/scan-agent.md`: evaluation instructions
3. `/root/projects/ai-research-survey/papers/<SLUG>/paper.txt`: the paper
4. `/root/projects/ai-research-survey/papers/<SLUG>/scan.json`: Sonnet's answers (set these aside until Step 3)

**Step 2: Independently answer all 50 boolean checklist questions.**
For each question, read the schema description carefully and search the paper for evidence. Answer with `applies` (boolean) and `answer` (boolean) plus a justification. Do NOT look at Sonnet's answers until you have formed your own answer for each question.

**Step 3: Compare your answers with Sonnet's scan.json.**
For each of the 50 questions, compare both fields (`applies` and `answer`). A question counts as a disagreement if either field differs. For each disagreement, explain why your answer differs.

**Step 4: Write a calibration JSON file** to `/root/projects/ai-research-survey/papers/<SLUG>/calibration.json` with this structure:
```json
{
  "paper_slug": "<SLUG>",
  "total_questions": 50,
  "agreement_count": <number of questions where both applies AND answer match>,
  "disagreement_count": <number of questions where either field differs>,
  "agreement_rate": <float 0-1>,
  "disagreements": [
    {
      "category": "<category_name>",
      "question": "<question_id>",
      "sonnet_applies": <bool>,
      "sonnet_answer": <bool>,
      "opus_applies": <bool>,
      "opus_answer": <bool>,
      "opus_justification": "<your reasoning>",
      "sonnet_justification": "<sonnet's reasoning from scan.json>",
      "direction": "<sonnet_generous|opus_generous|applies_boundary|interpretive>"
    }
  ],
  "opus_checklist": { <your full checklist in same format as scan.json checklist> }
}
```

The "direction" field should be:
- "sonnet_generous" if Sonnet has answer=true and you have answer=false (both applies=true)
- "opus_generous" if you have answer=true and Sonnet has answer=false (both applies=true)
- "applies_boundary" if the applies field differs between Sonnet and Opus
- "interpretive" for other disagreements

Be strict and follow the schema descriptions exactly. Do not be generous.


---

### 5. Wait for each batch, then launch next

After each batch of 5 completes, report:
- How many succeeded (wrote calibration.json)
- How many failed
- How many remain

Continue until the limit is reached or all papers are audited.

### 6. Aggregate and summarize

After all agents complete (or if argument is `summary`), read all calibration.json files and produce a summary:

```python
import json, glob

calibrations = []
for f in sorted(glob.glob('papers/*/calibration.json')):
    with open(f) as fh:
        calibrations.append(json.load(fh))
```

Print a table:

| Paper | Agreement | Disagreements |
|-------|-----------|---------------|
| slug | XX% | N (breakdown) |

Then print overall statistics:
- **Overall agreement**: X/Y (Z%)
- **Disagreement breakdown**: applies_boundary: N, sonnet_generous: N, opus_generous: N, interpretive: N
- **Most disagreed questions**: list questions with 2+ disagreements across papers
- **Comparison with previous rounds** (if analysis/calibration-summary.json exists)

Write the summary to `analysis/calibration-summary.json` and print it for the user.

### 7. Structural validation

Also run structural validation on all audited scan.json files:
- All 50 questions present with `applies` (bool), `answer` (bool), `justification` (string)
- No `applies=false, answer=true` violations
- All claims have `claim`, `evidence`, `supported` (with valid enum)
- `methodology_tags` use only allowed enum values
- Flag any schema violations found

Report any validation failures to the user.
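
The structural checks listed above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual validator: it assumes scan.json has a top-level `checklist` nested as category, then question ID (inferred from the `category` and `question` fields in the disagreement records), plus top-level `claims` and `methodology_tags` lists. The function names and the `supported_enum` and `allowed_tags` parameters are hypothetical. The real enum values live in `schema/scan.schema.json` and are not reproduced here, so the caller has to supply them.

```python
import json
from pathlib import Path

EXPECTED_QUESTIONS = 50  # question count stated in the audit instructions


def validate_scan(scan, supported_enum, allowed_tags):
    """Return a list of human-readable violations for one parsed scan.json."""
    errors = []

    # Assumed layout: checklist[category][question_id] -> answer record.
    questions = {}
    for category, group in scan.get("checklist", {}).items():
        if not isinstance(group, dict):
            errors.append(f"{category}: expected an object of questions")
            continue
        for qid, record in group.items():
            questions[f"{category}.{qid}"] = record

    if len(questions) != EXPECTED_QUESTIONS:
        errors.append(
            f"expected {EXPECTED_QUESTIONS} questions, found {len(questions)}"
        )

    for name, record in questions.items():
        if not isinstance(record, dict):
            errors.append(f"{name}: expected an object")
            continue
        for field, kind in (("applies", bool), ("answer", bool), ("justification", str)):
            if not isinstance(record.get(field), kind):
                errors.append(f"{name}: `{field}` missing or not {kind.__name__}")
        # An answer of true makes no sense for a question marked not applicable.
        if record.get("applies") is False and record.get("answer") is True:
            errors.append(f"{name}: applies=false with answer=true")

    for i, claim in enumerate(scan.get("claims", [])):
        for field in ("claim", "evidence", "supported"):
            if field not in claim:
                errors.append(f"claims[{i}]: missing `{field}`")
        if "supported" in claim and claim["supported"] not in supported_enum:
            errors.append(f"claims[{i}]: invalid `supported` value {claim['supported']!r}")

    for tag in scan.get("methodology_tags", []):
        if tag not in allowed_tags:
            errors.append(f"methodology_tags: unknown tag {tag!r}")

    return errors


def validate_audited(root, supported_enum, allowed_tags):
    """Validate each scan.json under `root` that has a sibling calibration.json."""
    report = {}
    for calibration in sorted(Path(root).glob("*/calibration.json")):
        scan_path = calibration.with_name("scan.json")
        scan = json.loads(scan_path.read_text())
        errors = validate_scan(scan, supported_enum, allowed_tags)
        if errors:
            report[calibration.parent.name] = errors
    return report
```

Calling `validate_audited("papers", supported_enum, allowed_tags)` with the enum values copied from the schema would return a mapping from paper slug to its list of violations. An empty mapping means every audited scan passed these checks.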