calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (24976B)
      1 {
      2   "paper_slug": "agents-of-chaos-2026",
      3   "calibration_date": "2026-02-28",
      4   "sonnet_scan_date": "2026-02-28",
      5   "agreement_rate": 0.92,
      6   "total_questions": 50,
      7   "agreements": 46,
      8   "disagreements": 4,
      9   "disagreement_details": [
     10     {
     11       "category": "statistical_methodology",
     12       "question": "sample_size_justified",
     13       "sonnet": {"applies": true, "answer": false},
     14       "opus": {"applies": false, "answer": false},
     15       "direction": "applies_boundary",
     16       "explanation": "Sonnet treats sample size justification as applicable because 20 researchers and 6 agents were involved. Opus treats it as structurally inapplicable because the paper explicitly states its goal is existence-proof methodology ('Our goal was not to statistically estimate failure rates, but to establish the existence of critical vulnerabilities'). For a red-teaming study seeking counterexamples rather than prevalence estimates, sample size justification is analogous to asking a penetration tester why they tested N systems. The criterion is designed for studies making statistical claims; this study makes none."
     17     },
     18     {
     19       "category": "conflicts_of_interest",
     20       "question": "funder_independent_of_outcome",
     21       "sonnet": {"applies": false, "answer": false},
     22       "opus": {"applies": true, "answer": false},
     23       "direction": "applies_boundary",
     24       "explanation": "Sonnet applies the schema's 'NA if unfunded' exemption since no funding is disclosed. Opus argues that the absence of a funding disclosure does not confirm the work is unfunded—the study used substantial compute resources (Claude Opus 4.6 API for two weeks of continuous agent operation, Fly.io VMs for 6 agents, ProtonMail accounts) and involves researchers from multiple well-funded universities. Setting applies=false rewards non-disclosure. The criterion should apply, and the answer is false because funder independence cannot be assessed without a funding statement."
     25     },
     26     {
     27       "category": "human_studies",
     28       "question": "randomization_described",
     29       "sonnet": {"applies": true, "answer": false},
     30       "opus": {"applies": false, "answer": false},
     31       "direction": "applies_boundary",
     32       "explanation": "Sonnet treats randomization as applicable because human participants were involved. Opus treats it as inapplicable because this is an observational/exploratory red-teaming study, not a controlled experiment with treatment conditions. The schema states 'NA if not an experimental study (e.g., cross-sectional surveys, observational studies).' This is an observational case-study methodology where participants freely chose which agents to probe and which attacks to attempt. There are no treatment vs. control conditions to randomize."
     33     },
     34     {
     35       "category": "human_studies",
     36       "question": "attrition_reported",
     37       "sonnet": {"applies": false, "answer": false},
     38       "opus": {"applies": true, "answer": false},
     39       "direction": "applies_boundary",
     40       "explanation": "Sonnet treats attrition as inapplicable because this is not a longitudinal study with formal enrollment. Opus argues that in a two-week voluntary participation study with 20 researchers, the level and duration of each participant's engagement is a meaningful concern—some may have participated minimally or dropped out. The paper reports '20 AI researchers participated' but never clarifies how many participated through the full two weeks, how interaction was distributed across participants, or whether some researchers' contributions were negligible. This is relevant to understanding the breadth and diversity of the red-teaming effort."
     41     }
     42   ],
     43   "opus_checklist": {
     44     "artifacts": {
     45       "code_released": {
     46         "applies": true,
     47         "answer": false,
     48         "justification": "The paper references the OpenClaw open-source framework (https://github.com/openclaw/openclaw) and ClawnBoard, but does not release the study's own experimental code, agent configuration files, or automation scripts. The interactive website (https://agentsofchaos.baulab.info/) hosts logs but is not a code artifact."
     49       },
     50       "data_released": {
     51         "applies": true,
     52         "answer": false,
     53         "justification": "The interactive website hosts Discord conversation logs, but these are not released as a formal, downloadable dataset. Email logs, agent memory states, and system logs are referenced in appendices but not made available for independent analysis."
     54       },
     55       "environment_specified": {
     56         "applies": true,
     57         "answer": false,
     58         "justification": "The paper describes infrastructure at a high level (OpenClaw on Fly.io VMs with 20GB persistent volumes, ProtonMail, Discord) but provides no reproducible environment specifications: no Dockerfile, no requirements.txt, no dependency lists, no versioned OpenClaw configuration."
     59       },
     60       "reproduction_instructions": {
     61         "applies": true,
     62         "answer": false,
     63         "justification": "The paper provides narrative descriptions of setup but no step-by-step reproduction instructions. The paper explicitly acknowledges setup was 'a messy, failure-prone process' (Section 2) but gives no runbook for replication."
     64       }
     65     },
     66     "statistical_methodology": {
     67       "confidence_intervals_or_error_bars": {
     68         "applies": false,
     69         "answer": false,
     70         "justification": "This is a qualitative case-study paper presenting existence proofs of vulnerabilities, with no quantitative outcome measures. Statistical uncertainty quantification is structurally inapplicable."
     71       },
     72       "significance_tests": {
     73         "applies": false,
     74         "answer": false,
     75         "justification": "No comparative quantitative claims are made. The paper explicitly states its goal is to 'establish the existence of critical vulnerabilities,' not to compare effect sizes or failure rates."
     76       },
     77       "effect_sizes_reported": {
     78         "applies": false,
     79         "answer": false,
     80         "justification": "No quantitative effect magnitudes are measured. The study demonstrates existence of vulnerabilities through case studies, not effect measurement."
     81       },
     82       "sample_size_justified": {
     83         "applies": false,
     84         "answer": false,
     85         "justification": "The paper explicitly states 'Our goal was not to statistically estimate failure rates, but to establish the existence of critical vulnerabilities under realistic interaction conditions' (Section 3). For an existence-proof/red-teaming methodology, sample size justification is structurally inapplicable—one successful exploit is sufficient to demonstrate vulnerability."
     86       },
     87       "variance_reported": {
     88         "applies": false,
     89         "answer": false,
     90         "justification": "Qualitative case-study methodology with no repeated experimental runs or quantitative outcome measures from which variance would be computed."
     91       }
     92     },
     93     "evaluation_design": {
     94       "baselines_included": {
     95         "applies": false,
     96         "answer": false,
     97         "justification": "This is an exploratory red-teaming case study, not a comparative evaluation. There is no baseline condition; the goal is existence proof of vulnerabilities, not comparison against a control."
     98       },
     99       "baselines_contemporary": {
    100         "applies": false,
    101         "answer": false,
    102         "justification": "No baselines are included, so contemporaneity of baselines is inapplicable."
    103       },
    104       "ablation_study": {
    105         "applies": false,
    106         "answer": false,
    107         "justification": "Ablation studies are structurally inapplicable to a qualitative case-study red-teaming exercise. There is no system with components to ablate."
    108       },
    109       "multiple_metrics": {
    110         "applies": false,
    111         "answer": false,
    112         "justification": "The study uses qualitative case-study outcomes rather than quantitative metrics. No numerical performance measures are applied."
    113       },
    114       "human_evaluation": {
    115         "applies": false,
    116         "answer": false,
    117         "justification": "The red-teaming itself constitutes human interaction with agents. Human evaluation of system outputs as a separate evaluation methodology is not relevant; the researchers' interactions ARE the evaluation."
    118       },
    119       "held_out_test_set": {
    120         "applies": false,
    121         "answer": false,
    122         "justification": "No train/test split or held-out set concept is applicable to this exploratory red-teaming case study."
    123       },
    124       "per_category_breakdown": {
    125         "applies": true,
    126         "answer": true,
    127         "justification": "The paper presents 11 distinct successful case studies plus 5 failed/hypothetical cases (Section 15, Cases #12-16), each organized by vulnerability category. Section 16 further categorizes failures thematically (social coherence failures, multi-agent amplification, etc.)."
    128       },
    129       "failure_cases_discussed": {
    130         "applies": true,
    131         "answer": true,
    132         "justification": "Section 15, titled 'Hypothetical Cases (What Happened In Practice),' documents five failed attack attempts with detailed explanations of why they failed and what the agents did correctly."
    133       },
    134       "negative_results_reported": {
    135         "applies": true,
    136         "answer": true,
    137         "justification": "Section 15 dedicates five case studies (#12-16) to attacks that failed, including prompt injection via broadcast that was detected, refusal of email spoofing, resistance to data tampering, rejection of social engineering, and inter-agent coordination on suspicious requests. The paper also notes 'A failed attempt doesn't mean it can't happen.'"
    138       }
    139     },
    140     "claims_and_evidence": {
    141       "abstract_claims_supported": {
    142         "applies": true,
    143         "answer": true,
    144         "justification": "Each claim in the abstract (unauthorized compliance, disclosure of sensitive information, destructive system-level actions, denial-of-service, identity spoofing, cross-agent propagation, false completion reports) is substantiated by a dedicated case study with detailed interaction logs in the paper body and appendices."
    145       },
    146       "causal_claims_justified": {
    147         "applies": true,
    148         "answer": false,
    149         "justification": "The paper makes causal claims throughout: 'post-training alignment becomes the mechanism of exploitation' (Section 10 analysis), agents' lack of a 'stakeholder model' causes failures (Section 16.2), 'the agentic layer introduces new failure surfaces' (Section 1). These causal attributions are based on qualitative case observations without controlled manipulation of variables. Alternative causes (specific OpenClaw configuration, specific model behavior, specific prompt wording) are not systematically ruled out."
    150       },
    151       "generalization_bounded": {
    152         "applies": true,
    153         "answer": false,
    154         "justification": "The paper generalizes from 6 agents on one framework (OpenClaw) with two models (Claude Opus 4.6, Kimi K2.5) to broad claims about 'LLM-backed agents' as a class. Section 16.2 states 'Current agentic systems lack an explicit stakeholder model' and 'LLM-based agents process instructions and data as tokens in a context window, making the two fundamentally indistinguishable' as universal claims. The conclusion does not adequately bound these claims to the specific setting tested."
    155       },
    156       "alternative_explanations_discussed": {
    157         "applies": true,
    158         "answer": false,
    159         "justification": "Section 16.3 discusses fundamental vs. contingent failures but does not systematically consider specific alternative explanations for the observed results: e.g., that failures could be artifacts of OpenClaw's specific architecture, the specific system prompts used, the particular model versions, or the non-representative participant group. The paper attributes failures to broad structural properties without ruling out framework-specific confounds."
    160       }
    161     },
    162     "setup_transparency": {
    163       "model_versions_specified": {
    164         "applies": true,
    165         "answer": true,
    166         "justification": "The paper specifies 'Claude Opus 4.6 (proprietary; Anthropic, 2026)' and 'Kimi K2.5 (open-weights; Team et al., 2026)' with citations to the respective system cards/technical reports. While snapshot dates are not provided, these are specific versioned model names rather than generic product names."
    167       },
    168       "prompts_provided": {
    169         "applies": true,
    170         "answer": false,
    171         "justification": "The paper describes the types of workspace configuration files that form the agent's system prompt (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md) and their structural roles, but does not provide the actual text content of these files. Appendix A.1 describes the file structure but not the actual prompts/instructions injected into context."
    172       },
    173       "hyperparameters_reported": {
    174         "applies": true,
    175         "answer": false,
    176         "justification": "No API hyperparameters (temperature, top-p, max tokens, context window settings, thinking/reasoning settings) are reported for either Claude Opus 4.6 or Kimi K2.5. The paper mentions that OpenClaw allows configuring 'thinking' amounts but does not state the settings used."
    177       },
    178       "scaffolding_described": {
    179         "applies": true,
    180         "answer": true,
    181         "justification": "Section 2 and Appendix A.1 describe the OpenClaw scaffold in detail: workspace files (AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, MEMORY.md), file-based memory system (daily logs, curated MEMORY.md, memory_search tool), heartbeat mechanism (30-minute periodic background check-ins), cron job system, and Fly.io deployment architecture. Figure 21 provides an architecture diagram."
    182       },
    183       "data_preprocessing_documented": {
    184         "applies": false,
    185         "answer": false,
    186         "justification": "This is a live observational red-teaming study, not a study involving a preprocessed dataset. There is no data preprocessing step to document."
    187       }
    188     },
    189     "limitations_and_scope": {
    190       "limitations_section_present": {
    191         "applies": true,
    192         "answer": false,
    193         "justification": "There is no dedicated 'Limitations' or 'Threats to Validity' section. Some limitations are mentioned inline—e.g., Section 3 states 'Our experiments were simple (case-study-based) and not robust (without scaling and diversity)' and Section 15 notes 'not all unsuccessful attempts were documented'—but these are scattered and do not constitute a substantive standalone section."
    194       },
    195       "threats_to_validity_specific": {
    196         "applies": true,
    197         "answer": false,
    198         "justification": "No structured threats-to-validity discussion exists. Specific limitations are mentioned in passing (messy setup, undocumented failures, small sample) but there is no systematic analysis of how these threats affect the validity of the conclusions. The paper does not discuss, for example, whether the self-selected researcher pool biases the types of vulnerabilities discovered, or whether OpenClaw-specific bugs are being conflated with fundamental agent limitations."
    199       },
    200       "scope_boundaries_stated": {
    201         "applies": true,
    202         "answer": false,
    203         "justification": "The paper states the goal is 'to establish the existence of critical vulnerabilities' and 'not to statistically estimate failure rates,' which is a methodological scope boundary. However, it does not explicitly bound what the results do NOT show—no equivalent of 'these findings do not generalize to [X].' The discussion sections make broad claims about 'LLM-backed agents' generally that exceed the tested scope of one framework with two models."
    204       }
    205     },
    206     "data_integrity": {
    207       "raw_data_available": {
    208         "applies": true,
    209         "answer": false,
    210         "justification": "An interactive website (https://agentsofchaos.baulab.info/) hosts Discord conversation logs, but this is not a formal data release. Email conversations, agent memory states, system logs, and the full corpus of interactions are not provided as downloadable raw data files for independent verification."
    211       },
    212       "data_collection_described": {
    213         "applies": true,
    214         "answer": true,
    215         "justification": "Section 3 (Evaluation Procedure) describes the data collection procedure: initial structured 'hello world' phase with email greetings, followed by a two-week open exploratory period with 20 AI researchers conducting adversarial and benign interactions via Discord and email. The paper describes the time period, interaction channels, and participant roles."
    216       },
    217       "recruitment_methods_described": {
    218         "applies": true,
    219         "answer": false,
    220         "justification": "The paper states '20 AI researchers participated' who were 'invited—all researchers in the lab and interested collaborators.' No details are provided about which lab, how external collaborators were identified, what selection criteria applied, whether participation was compensated, or whether the self-selected sample introduces bias in which vulnerabilities were discovered."
    221       },
    222       "data_pipeline_documented": {
    223         "applies": true,
    224         "answer": false,
    225         "justification": "The paper acknowledges 'numerous experimental iterations were conducted, and not all unsuccessful attempts were documented' (Section 15). The process for selecting which 11 successful case studies and 5 failed case studies to include out of the total (unquantified) interaction corpus is not described, raising concerns about selective reporting."
    226       }
    227     },
    228     "conflicts_of_interest": {
    229       "funding_disclosed": {
    230         "applies": true,
    231         "answer": false,
    232         "justification": "The Acknowledgments section thanks individuals for discussions and advice but mentions no funding sources—no grants, institutional funding, or corporate sponsorship. Given the significant compute costs involved (Claude Opus 4.6 API, Fly.io VMs, email infrastructure), the absence of any funding disclosure is a transparency gap."
    233       },
    234       "affiliations_disclosed": {
    235         "applies": true,
    236         "answer": true,
    237         "justification": "Author affiliations are listed prominently on the title page: Northeastern University, Stanford, UBC, Harvard, Hebrew University, Max Planck, MIT, Tufts, CMU, Alter, Technion, Vector Institute, and independent researchers."
    238       },
    239       "funder_independent_of_outcome": {
    240         "applies": true,
    241         "answer": false,
    242         "justification": "No funding source is disclosed despite significant resource usage (Claude Opus 4.6 API for two weeks of continuous multi-agent operation, Fly.io cloud VMs for 6 agents). The work involves researchers from multiple well-funded universities, making truly unfunded status unlikely. Funder independence cannot be assessed without a funding statement. Setting applies=false per the 'NA if unfunded' exemption would reward non-disclosure."
    243       },
    244       "financial_interests_declared": {
    245         "applies": true,
    246         "answer": false,
    247         "justification": "No competing interests or financial disclosure statement appears anywhere in the paper. The paper evaluates Claude Opus 4.6 (Anthropic) and Kimi K2.5 (MoonshotAI) and references several author-affiliated products and platforms. One author is affiliated with 'Alter', which may be a commercial entity. The absence of any declaration is itself a transparency gap."
    248       }
    249     },
    250     "contamination": {
    251       "training_cutoff_stated": {
    252         "applies": false,
    253         "answer": false,
    254         "justification": "This is a red-teaming case study testing agent behavior in a live environment, not a benchmark evaluation of model knowledge. Training data cutoff is not relevant to the study's claims about agentic failure modes."
    255       },
    256       "train_test_overlap_discussed": {
    257         "applies": false,
    258         "answer": false,
    259         "justification": "The study does not evaluate models on any standardized benchmark. It tests agent behavior through live adversarial interactions. Training data contamination is structurally inapplicable."
    260       },
    261       "benchmark_contamination_addressed": {
    262         "applies": false,
    263         "answer": false,
    264         "justification": "No benchmark is used. The evaluation consists entirely of live adversarial interactions in a deployed environment."
    265       }
    266     },
    267     "human_studies": {
    268       "pre_registered": {
    269         "applies": true,
    270         "answer": false,
    271         "justification": "The study involves 20 human researchers as participants in a two-week interactive experiment. No pre-registration is mentioned or linked. Given the exploratory nature, pre-registration would have been valuable for distinguishing confirmatory from exploratory findings."
    272       },
    273       "irb_or_ethics_approval": {
    274         "applies": true,
    275         "answer": false,
    276         "justification": "Despite involving 20 human participants in interactive experiments that exposed private information (planted PII in emails), tested adversarial manipulations, and included a gaslighting experiment (Case Study #7), no IRB or ethics board approval is mentioned. The Ethics Statement discusses AI safety philosophy broadly but does not address human subjects review."
    277       },
    278       "demographics_reported": {
    279         "applies": true,
    280         "answer": false,
    281         "justification": "The paper says '20 AI researchers' but provides no demographic information: no gender, seniority level, AI expertise distribution, years of experience, or institutional affiliation breakdown. Individual names appear in case studies but no systematic characterization is given."
    282       },
    283       "inclusion_exclusion_criteria": {
    284         "applies": true,
    285         "answer": false,
    286         "justification": "Participants were described as 'researchers in the lab and interested collaborators' with no stated inclusion or exclusion criteria. Self-selection bias is not addressed."
    287       },
    288       "randomization_described": {
    289         "applies": false,
    290         "answer": false,
    291         "justification": "This is an observational/exploratory red-teaming study, not an experimental study with treatment conditions. Participants self-selected which agents to interact with and which attacks to attempt. Per the schema, randomization is NA for observational studies."
    292       },
    293       "blinding_described": {
    294         "applies": false,
    295         "answer": false,
    296         "justification": "Blinding is structurally inapplicable to an open adversarial red-teaming study where researchers must know they are attempting to exploit the system."
    297       },
    298       "attrition_reported": {
    299         "applies": true,
    300         "answer": false,
    301         "justification": "The paper states '20 AI researchers participated over the two-week period' but does not report how engagement was distributed: how many participated through the full duration, how many contributed minimally, or whether some dropped out after initial interactions. This is relevant to understanding the breadth of the red-teaming effort."
    302       }
    303     },
    304     "cost_and_practicality": {
    305       "inference_cost_reported": {
    306         "applies": true,
    307         "answer": false,
    308         "justification": "One case study mentions 'approximately 60,000 tokens' consumed over nine days (Case Study #4), but no systematic cost reporting is provided. Total API spend across the two-week study, per-case costs, or cost per agent are not reported despite being relevant to understanding the practical feasibility of the setup."
    309       },
    310       "compute_budget_stated": {
    311         "applies": true,
    312         "answer": false,
    313         "justification": "No compute budget is stated. The study used Fly.io VMs for 6 agent deployments and called Claude Opus 4.6 and Kimi K2.5 APIs for two weeks. Total compute cost, hardware specifications, or total API spend are not reported."
    314       }
    315     }
    316   },
    317   "summary": "High agreement (92%) between Opus and Sonnet on this paper. All 4 disagreements are applies-boundary disputes—no disagreements on the answer field when both agree the criterion applies. This pattern is consistent with calibration findings: the two-field (applies/answer) design eliminates most substantive scoring disagreements, with remaining disputes concentrated on edge cases about whether a criterion is structurally inapplicable vs. applicable-but-unsatisfied. Notably, 3 of 4 disagreements involve whether human_studies and statistical_methodology criteria apply to an exploratory qualitative red-teaming study that defies clean categorization: it involves human participants but is not a traditional human subjects experiment, and involves a sample but makes no statistical claims. The funder_independent_of_outcome disagreement reflects a philosophical question about whether absent disclosure should trigger the 'unfunded' exemption or be treated as a transparency failure."
    318 }
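
The summary fields at the top of this record can be cross-checked against the detail entries. The following is a minimal sketch of such a check, not part of the repository: the function names `check_calibration` and `count_questions` are hypothetical, and the `sample` record is a trimmed copy of the values above (one disagreement's checklist entry only, explanations omitted).

```python
def count_questions(checklist: dict) -> int:
    """Count the questions across all categories of a checklist."""
    return sum(len(questions) for questions in checklist.values())


def check_calibration(record: dict) -> list[str]:
    """Return a list of inconsistencies found in a calibration record.

    Hypothetical helper: field names follow the JSON above, but this
    check itself is an illustration, not the project's own tooling.
    """
    problems = []
    total = record["total_questions"]
    agreements = record["agreements"]
    disagreements = record["disagreements"]
    details = record["disagreement_details"]
    checklist = record.get("opus_checklist", {})

    if agreements + disagreements != total:
        problems.append("agreements + disagreements != total_questions")
    if len(details) != disagreements:
        problems.append("disagreement_details length != disagreements")
    # Allow for the stored rate being rounded to two decimals.
    if abs(agreements / total - record["agreement_rate"]) >= 0.005:
        problems.append("agreement_rate != agreements / total_questions")

    for d in details:
        same_applies = d["sonnet"]["applies"] == d["opus"]["applies"]
        # An applies_boundary dispute means the two 'applies' fields differ.
        if d["direction"] == "applies_boundary" and same_applies:
            problems.append(f"{d['question']}: applies fields do not differ")
        # The Opus side of each dispute should match the Opus checklist entry.
        entry = checklist.get(d["category"], {}).get(d["question"])
        if entry is not None:
            for field in ("applies", "answer"):
                if entry[field] != d["opus"][field]:
                    problems.append(f"{d['question']}: opus {field} mismatch")
    return problems


# Trimmed copy of the record above, for illustration only.
sample = {
    "agreement_rate": 0.92,
    "total_questions": 50,
    "agreements": 46,
    "disagreements": 4,
    "disagreement_details": [
        {"category": "statistical_methodology", "question": "sample_size_justified",
         "sonnet": {"applies": True, "answer": False},
         "opus": {"applies": False, "answer": False}, "direction": "applies_boundary"},
        {"category": "conflicts_of_interest", "question": "funder_independent_of_outcome",
         "sonnet": {"applies": False, "answer": False},
         "opus": {"applies": True, "answer": False}, "direction": "applies_boundary"},
        {"category": "human_studies", "question": "randomization_described",
         "sonnet": {"applies": True, "answer": False},
         "opus": {"applies": False, "answer": False}, "direction": "applies_boundary"},
        {"category": "human_studies", "question": "attrition_reported",
         "sonnet": {"applies": False, "answer": False},
         "opus": {"applies": True, "answer": False}, "direction": "applies_boundary"},
    ],
    "opus_checklist": {
        "human_studies": {"attrition_reported": {"applies": True, "answer": False}},
    },
}

print(check_calibration(sample))
```

To run the same check on the full file, load it with `json.load` and pass the resulting dict to `check_calibration`; `count_questions(record["opus_checklist"])` can then be compared with `total_questions`.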
	git clone https://git.shiptheloop.com/ai-research-survey.git