calibration.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

calibration.json (19842B)
{
  "paper_slug": "aart-aiassisted-redteaming-2023",
  "total_questions": 50,
  "agreement_count": 47,
  "disagreement_count": 3,
  "agreement_rate": 0.94,
  "disagreements": [
    {
      "category": "contamination",
      "question": "training_cutoff_stated",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": false,
      "opus_answer": false,
      "opus_justification": "This paper is a red-teaming data generation study. PaLM 2 is used as a generation tool to produce adversarial prompts, not evaluated on a benchmark. The schema description explicitly lists 'red-teaming study that tests defenses rather than model knowledge' as a case where contamination questions are NA. While PaLM 2 may have seen comparison datasets during training, this is an alternative-explanations concern, not a benchmark contamination concern in the schema's intended sense.",
      "sonnet_justification": "The paper uses PaLM 2 (instruction-tuned) as the generation model. No training data cutoff date is stated for PaLM 2. This matters for assessing whether existing red-teaming datasets used for comparison may have been in PaLM 2's training data.",
      "direction": "applies_boundary"
    },
    {
      "category": "contamination",
      "question": "train_test_overlap_discussed",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": false,
      "opus_answer": false,
      "opus_justification": "Same reasoning as training_cutoff_stated: this is a red-teaming data generation study, not a benchmark evaluation of model knowledge. PaLM 2 is the generation instrument, not the subject being evaluated. The comparison is about keyword coverage of generated datasets, not about PaLM 2's performance on a test set. The schema and scan-agent instructions explicitly exclude red-teaming studies from contamination applicability.",
      "sonnet_justification": "The paper compares AART-generated data against existing public datasets (RealToxicityPrompts 2020, ParlAI 2019, BAD 2021, Anthropic 2022) without discussing whether PaLM 2 may have been trained on these datasets, which would confound the quality and novelty comparisons.",
      "direction": "applies_boundary"
    },
    {
      "category": "contamination",
      "question": "benchmark_contamination_addressed",
      "sonnet_applies": true,
      "sonnet_answer": false,
      "opus_applies": false,
      "opus_answer": false,
      "opus_justification": "Same reasoning: the paper does not evaluate PaLM 2 on a benchmark. It uses PaLM 2 to generate adversarial prompts and compares the generated dataset's keyword coverage against other datasets. This is structurally different from a benchmark evaluation where contamination matters. The concern that PaLM 2 trained on comparison datasets is valid but is better captured under alternative_explanations_discussed (where both Sonnet and Opus agree the paper fails) rather than contamination.",
      "sonnet_justification": "The comparison datasets (especially RealToxicityPrompts published 2020, ParlAI 2019) predate PaLM 2's likely training cutoff. The paper does not address whether PaLM 2 was trained on these datasets, which could influence the model's generation patterns.",
      "direction": "applies_boundary"
    }
  ],
  "opus_checklist": {
    "artifacts": {
      "code_released": {
        "applies": true,
        "answer": false,
        "justification": "No code repository is provided. The paper mentions a GitHub link for a demonstration dataset but no code implementing the AART pipeline is released."
      },
      "data_released": {
        "applies": true,
        "answer": false,
        "justification": "The abstract states 'we intend to make available a demonstration dataset' (future tense) at a GitHub URL. This is a promise of future release, not an actual release. As of publication, the dataset is not available."
      },
      "environment_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper mentions using 'PaLM API' (Vertex LLM PaLM 2 API) but provides no library versions, requirements file, Docker configuration, or environment setup details."
      },
      "reproduction_instructions": {
        "applies": true,
        "answer": false,
        "justification": "No step-by-step reproduction instructions are provided. Appendix A shows the prompts used and Appendix B shows dataset filters, but there is no README or instructions that would allow end-to-end reproduction without significant guesswork."
      }
    },
    "statistical_methodology": {
      "confidence_intervals_or_error_bars": {
        "applies": true,
        "answer": false,
        "justification": "Table 2 reports keyword coverage rates as single point estimates (e.g., 0.384 for Policy Concepts in AART). Query length is reported with mean and standard deviation, but the main coverage metrics have no confidence intervals or error bars."
      },
      "significance_tests": {
        "applies": true,
        "answer": false,
        "justification": "The paper compares AART keyword coverage rates against five other datasets (Table 2) and claims 'promising results,' but no statistical significance tests are applied to any of these comparisons."
      },
      "effect_sizes_reported": {
        "applies": true,
        "answer": false,
        "justification": "Coverage ratios are compared numerically (e.g., AART's 0.384 vs. Perez adaptation's 0.210) but no standardized effect sizes (Cohen's d, odds ratios, etc.) are reported."
      },
      "sample_size_justified": {
        "applies": true,
        "answer": false,
        "justification": "The qualitative analysis uses n=120 prompts (Section 4) with no justification for why this sample size was chosen or whether it is sufficient to support the percentages reported."
      },
      "variance_reported": {
        "applies": true,
        "answer": false,
        "justification": "Keyword coverage rates in Table 2 are single-run numbers with no variance. The qualitative analysis percentages are from a single pass with no inter-rater reliability or spread measures. Query length std dev is reported but this is not the main result."
      }
    },
    "evaluation_design": {
      "baselines_included": {
        "applies": true,
        "answer": true,
        "justification": "Table 2 compares AART against four human red-teaming datasets (RealToxicityPrompts, ParlAI Dialogue Safety, BAD, Anthropic) and one automated method (Perez et al. 2022 adapted to the same scenario)."
      },
      "baselines_contemporary": {
        "applies": true,
        "answer": true,
        "justification": "Baselines include Perez et al. (2022) and the Anthropic red-teaming dataset (2022), which are the most relevant contemporary comparators for automated red-teaming of LLMs."
      },
      "ablation_study": {
        "applies": true,
        "answer": false,
        "justification": "No ablation study is conducted. The paper presents the full AART pipeline without isolating individual components' contributions (e.g., effect of structured CoT, geographic dimensions, or the multi-step pipeline vs. a simpler approach)."
      },
      "multiple_metrics": {
        "applies": true,
        "answer": true,
        "justification": "The paper uses quantitative keyword-match metrics across three dimensions (Table 2: Policy Concepts, Task Formats, Geographic Regions) and qualitative human review of n=120 prompts assessing quality, concept reference, and format reference."
      },
      "human_evaluation": {
        "applies": true,
        "answer": true,
        "justification": "Section 4 describes a qualitative analysis on n=120 sampled prompts assessing quality (92.5%), policy concept reference (99.2%), geographic region reference (96.7%), and task format reference (85.0%). This constitutes human review of the system's outputs."
      },
      "held_out_test_set": {
        "applies": false,
        "answer": false,
        "justification": "AART is a data generation pipeline, not a trained model. The evaluation compares generated datasets using keyword coverage metrics; a held-out test set is structurally inapplicable."
      },
      "per_category_breakdown": {
        "applies": true,
        "answer": true,
        "justification": "Table 2 breaks down keyword coverage across three separate dimensions: Policy Concepts, Task Formats, and Geographic Regions. The qualitative analysis also reports separate percentages for each dimension."
      },
      "failure_cases_discussed": {
        "applies": true,
        "answer": true,
        "justification": "Section 4 discusses over-representation of top-5 regions (53.4%), imbalanced task format coverage, and missing formats. Appendix C.1 provides a dedicated list of worst-case examples from the qualitative analysis."
      },
      "negative_results_reported": {
        "applies": true,
        "answer": true,
        "justification": "Task Format is described as 'an area of relative weakness' (85.0%), geographic region imbalances are reported, several task formats had zero representation, and the paper acknowledges keyword-based evaluation underestimates actual coverage."
      }
    },
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "The abstract claims AART 'shows promising results in terms of concept coverage and data quality.' Table 2 shows higher keyword coverage than baselines, and qualitative analysis finds 92.5% quality. The hedged language ('promising results,' 'compared to some state-of-the-art tools') is appropriate for the limited evidence."
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": false,
        "justification": "The paper makes implicit causal claims that recipe-guided structured generation causes higher diversity than ad-hoc methods. These are supported only by descriptive comparisons across datasets of different sizes, origins, and purposes, without controlling for any confounds."
      },
      "generalization_bounded": {
        "applies": true,
        "answer": false,
        "justification": "Section 6 states 'AART enabled us to launch several products with improved safety measures,' generalizing from a single English-language demonstration scenario to broad product safety claims. No results are shown for other applications or languages."
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "The paper does not discuss that higher coverage of AART-defined keywords in AART-generated data is expected by construction, since AART's prompts are parameterized by those very concepts. The comparison datasets were designed for different contexts with different keyword vocabularies."
      }
    },
    "setup_transparency": {
      "model_versions_specified": {
        "applies": true,
        "answer": false,
        "justification": "The paper refers to 'instruction tuned LLM (PALM API)' and cites the PaLM 2 technical report, but does not specify an exact model version, size, or snapshot date. PaLM 2 encompasses multiple model sizes (Gecko, Otter, Bison, Unicorn)."
      },
      "prompts_provided": {
        "applies": true,
        "answer": true,
        "justification": "Appendix A provides the full prompt text for all three pipeline steps: Step 1 prompts for generating policy concepts, task formats, and geographic regions; the complete Step 3 structured generation prompt with all placeholders; and Appendix A.2 provides the actual keyword lists used as fill values."
      },
      "hyperparameters_reported": {
        "applies": true,
        "answer": false,
        "justification": "Temperature 0.7 is stated only for the Perez et al. adaptations (Appendix B). No hyperparameters (temperature, top-p, max tokens) are reported for AART's own generation steps."
      },
      "scaffolding_described": {
        "applies": true,
        "answer": true,
        "justification": "The AART pipeline (Steps 1-4) is described in detail in Section 3.1 with Figure 2 providing an overview diagram. The structured generation process, how dimensions are combined, and the CoT-style instruction creation are documented clearly enough for re-implementation."
      },
      "data_preprocessing_documented": {
        "applies": true,
        "answer": true,
        "justification": "Appendix B documents how comparison datasets were filtered (e.g., RealToxicityPrompts: 'challenging=True or toxicity >= 0.8'; ParlAI: 'first exchanges, adversarial, labeled not ok'). JSON parsing errors are reported (144 lines discarded, 95.8% valid)."
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Section 5 is titled 'Limitations' and provides substantive discussion of long-tail coverage gaps, LLM generation biases, ambiguity in defining 'adversarial,' low keyword-based coverage rates, and computational expense."
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": true,
        "justification": "Section 5 identifies specific threats: keyword-based evaluation underestimates concept presence; LLM-generated data may exhibit systematic biases; the method may miss emerging attack patterns (prompt injection, jailbreaking); the qualitative analysis is limited to n=120."
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": false,
        "justification": "The limitations section does not explicitly state what the results do NOT generalize to. The paper does not bound claims to the specific English-language, text-generation, dangerous-activities demonstration scenario. The conclusion broadly claims real-world product launches."
      }
    },
    "data_integrity": {
      "raw_data_available": {
        "applies": true,
        "answer": false,
        "justification": "The generated adversarial dataset is promised as a future release but not available at publication time. Independent verification of the reported quality metrics (92.5%, etc.) is not possible."
      },
      "data_collection_described": {
        "applies": true,
        "answer": true,
        "justification": "The data generation procedure is described in detail across Section 3 and Appendix A: specific prompts, parameter choices, number of runs per policy concept (twice), sampling strategy for task formats and geographic regions, and JSON parsing step."
      },
      "recruitment_methods_described": {
        "applies": false,
        "answer": false,
        "justification": "The paper uses automated LLM generation and existing public datasets with no external human participant recruitment. The qualitative analysis was conducted by the paper's authors."
      },
      "data_pipeline_documented": {
        "applies": true,
        "answer": true,
        "justification": "The full pipeline from Step 1 (concept generation) through Step 2 (scoping/filtering) to Step 3 (structured generation) to Step 4 (review) is documented. JSON parsing step with 144 discarded lines is reported. Comparison dataset filtering is documented in Appendix B."
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No funding source is disclosed. The acknowledgments section thanks individual colleagues at Google but includes no formal funding statement or grant acknowledgment."
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "All four authors are listed as 'Google Research' on the title page. The affiliation is clearly disclosed."
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "The authors are Google Research employees using Google's PaLM 2 API. Google has a financial interest in demonstrating effective AI safety tools and promoting responsible use of its LLMs. The funder is not independent of the outcome."
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests statement is included in the paper. Absence of disclosure is not the same as absence of conflict."
      }
    },
    "contamination": {
      "training_cutoff_stated": {
        "applies": false,
        "answer": false,
        "justification": "This is a red-teaming data generation study. PaLM 2 is used as a generation tool to produce adversarial prompts, not evaluated on a benchmark for its knowledge or capabilities. The schema and scan-agent instructions explicitly list red-teaming studies as cases where contamination questions are NA."
      },
      "train_test_overlap_discussed": {
        "applies": false,
        "answer": false,
        "justification": "Same reasoning: the paper does not evaluate a pre-trained model on a benchmark. PaLM 2 is the generation instrument for producing adversarial prompts, not the subject being evaluated on test data."
      },
      "benchmark_contamination_addressed": {
        "applies": false,
        "answer": false,
        "justification": "The paper does not evaluate PaLM 2 on any benchmark. It uses PaLM 2 to generate data and then compares that data's keyword coverage against other datasets. Contamination in the traditional sense is structurally inapplicable."
      }
    },
    "human_studies": {
      "pre_registered": {
        "applies": false,
        "answer": false,
        "justification": "No external human participants. The qualitative analysis (n=120) was conducted by the paper's authors as an internal quality assessment."
      },
      "irb_or_ethics_approval": {
        "applies": false,
        "answer": false,
        "justification": "No external human participants were involved. The qualitative analysis was conducted by the authors themselves."
      },
      "demographics_reported": {
        "applies": false,
        "answer": false,
        "justification": "No external human participants. The qualitative analysis was conducted by the authors."
      },
      "inclusion_exclusion_criteria": {
        "applies": false,
        "answer": false,
        "justification": "No external human participants were recruited."
      },
      "randomization_described": {
        "applies": false,
        "answer": false,
        "justification": "No human subjects study with randomization. The qualitative analysis uses a convenience sample of 20 prompts per dimension category."
      },
      "blinding_described": {
        "applies": false,
        "answer": false,
        "justification": "No human subjects study requiring blinding. The authors knew which method generated each prompt."
      },
      "attrition_reported": {
        "applies": false,
        "answer": false,
        "justification": "No human participants; attrition is not applicable."
      }
    },
    "cost_and_practicality": {
      "inference_cost_reported": {
        "applies": true,
        "answer": false,
        "justification": "Section 5 notes 'the computational expense of using LLMs is high' but provides no quantitative figures: no API call counts, token counts, dollar costs, or latency measurements for generating the 3,269-example demonstration dataset."
      },
      "compute_budget_stated": {
        "applies": true,
        "answer": false,
        "justification": "No compute budget is stated. Total API calls, generation time, or cloud compute costs are not reported anywhere in the paper."
      }
    }
  }
}
	git clone https://git.shiptheloop.com/ai-research-survey.git