ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

commit 6e63d899afcec26a7a1e6668f9197bfad25b53f0
parent fd2ab321110f363123c295dbaa3862329aec7709
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Sat, 28 Feb 2026 06:55:19 +0100

Tighten scan instrument based on Opus calibration (93.2% agreement)

Calibration of 8 papers found two systematic Sonnet failure modes:
- NA boundary errors (56% of disagreements): added explicit "NA when:"
  guidance to contamination, human_studies, artifacts, cost categories
- Generosity bias (44%): added "does NOT count" examples to
  prompts_provided, variance_reported, model_versions_specified, etc.

Schema: 14 question descriptions updated with sharper criteria.
Agent prompt: added "When to use NA" section, "Common traps" list,
and per-paper-type NA/NO guidance for surveys, mining studies, benchmarks.
Methodology context rewritten to reflect boolean checklist (was still
describing old 0-3 rubric).

All 30 scan.json and 8 calibration.json removed for re-run with
improved instrument. Calibration round 1 results preserved in
analysis/calibration-summary.{json,md}.

Added /scan project command for running scan pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
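
The failure-mode shares quoted in this message can be re-derived from the direction counts recorded in analysis/calibration-summary.json. A minimal Python sketch, assuming "NA boundary" means any disagreement in which either rater answered NA and that every other disagreement is counted under generosity. That grouping and the helper name involves_na are illustrative assumptions, not code from the repository:

```python
# Counts copied from "disagreement_direction" in analysis/calibration-summary.json.
# Each key reads "<sonnet answer>_to_<opus answer>".
direction_counts = {
    "yes_to_no": 11,
    "no_to_na": 7,
    "yes_to_na": 4,
    "na_to_yes": 2,
    "na_to_no": 2,
    "no_to_yes": 1,
}


def involves_na(direction: str) -> bool:
    """Hypothetical helper: True when either rater answered NA."""
    sonnet, opus = direction.split("_to_")
    return "na" in (sonnet, opus)


total = sum(direction_counts.values())
na_boundary = sum(n for d, n in direction_counts.items() if involves_na(d))
other = total - na_boundary  # assumed here to be the "generosity" bucket

na_share = na_boundary / total
other_share = other / total

# Overall agreement: 8 papers x 50 questions, 373 matching answers.
question_pairs = 8 * 50
agreement_rate = 373 / question_pairs

print(f"disagreements: {total}")
print(f"NA boundary: {na_boundary} ({na_share:.1%})")
print(f"other: {other} ({other_share:.1%})")
print(f"agreement rate: {agreement_rate:.4f}")
```

Under that grouping, 15 of the 27 disagreements involve an NA answer and 12 do not, which rounds to the 56% and 44% quoted above.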

Diffstat:
A .claude/commands/scan.md | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M agents/scan-agent.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++--------
A analysis/calibration-summary.json | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A analysis/calibration-summary.md | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
M context/methodology.md | 102 ++++++++++++++++++++++++++++++-------------------------------------------------
M schema/scan.schema.json | 36 ++++++++++++++++++------------------
6 files changed, 636 insertions(+), 90 deletions(-)

diff --git a/.claude/commands/scan.md b/.claude/commands/scan.md
@@ -0,0 +1,76 @@
+Run the scan pipeline on papers that have paper.txt but no scan.json.
+
+Arguments: $ARGUMENTS
+- A number (e.g., `5`, `20`) sets the batch limit
+- `all` or no argument runs unlimited until all papers are scanned
+- `status` just prints current progress without scanning
+
+## Instructions
+
+### 1. Check status first
+
+Run `python3 scripts/claim.py status` to see how many papers are available.
+
+Also count total papers with paper.txt but no scan.json:
+```bash
+find papers -name "paper.txt" -exec sh -c 'test ! -f "$(dirname {})/scan.json" && echo {}' \; | wc -l
+```
+
+Print the status summary to the user.
+
+If the argument is `status`, stop here.
+
+### 2. Determine batch size
+
+Parse `$ARGUMENTS`:
+- If it's a number, scan that many papers maximum
+- If it's `all` or empty, scan everything available
+
+### 3. Get the list of papers to scan
+
+```bash
+python3 scripts/claim.py list --limit <N>
+```
+
+This returns slugs of papers that have paper.txt, no scan.json, and no active claim.
+
+### 4. Launch scan sub-agents in parallel batches
+
+Launch **5 sub-agents at a time** using the Task tool with `model: "sonnet"` and `run_in_background: true`.
+
+For each sub-agent, use this prompt (fill in the slug):
+
+---
+
+You are a scan agent. Your job is to evaluate a single research paper using a boolean checklist.
+
+**Read these files first:**
+1. `/root/projects/ai-research-survey/schema/scan.schema.json` — the checklist schema with evaluation criteria
+2. `/root/projects/ai-research-survey/agents/scan-agent.md` — your full instructions
+3. `/root/projects/ai-research-survey/papers/<SLUG>/paper.txt` — the paper to evaluate
+
+**Then:**
+1. Claim the paper: run `python3 /root/projects/ai-research-survey/scripts/claim.py take <SLUG>`. If it prints "taken", stop immediately — another agent is working on it.
+2. Follow the instructions in scan-agent.md to evaluate the paper.
+3. Write the result to `/root/projects/ai-research-survey/papers/<SLUG>/scan.json`
+4. Release the claim: run `python3 /root/projects/ai-research-survey/scripts/claim.py done <SLUG>`
+
+Do NOT write to registry.jsonl. Only write scan.json.
+
+---
+
+### 5. Wait for each batch to complete before launching the next
+
+After each batch of 5 completes, report:
+- How many succeeded (wrote scan.json)
+- How many failed
+- How many remain
+
+Then launch the next batch of 5. Continue until the limit is reached or all papers are scanned.
+
+### 6. Final summary
+
+After all batches complete, print:
+- Total papers scanned this run
+- Total papers now scanned overall
+- Any failures (papers where scan.json was not written)
diff --git a/agents/scan-agent.md b/agents/scan-agent.md
@@ -36,11 +36,38 @@ The checklist has 50 yes/no/na questions across 11 categories. For each question
 **Rules:**
 - **yes** = the paper clearly satisfies the criterion. You must be able to point to where.
 - **no** = the paper does not satisfy the criterion, or you cannot find evidence that it does. Absence of evidence is NO, not NA.
-- **na** = the criterion genuinely does not apply to this type of paper (e.g., human_studies questions for a benchmark paper with no human participants, or blinding for an observational study where blinding is infeasible).
+- **na** = the criterion genuinely does not apply to this type of paper. See "When to use NA" below.
 
 **Do not guess.** If you cannot find the information in the paper, the answer is NO with a justification like "No mention of [X] found in the paper."
 
-**Do not be generous.** A vague mention does not count. "We used GPT-4" without a version is NO for model_versions_specified. "Code will be released" is NO for code_released. Apply the criteria as written in the schema descriptions.
+**Do not be generous.** A vague mention does not count. Apply the criteria as written in the schema descriptions. Common traps:
+- "We used GPT-4" without a version → NO for `model_versions_specified` (need e.g., "gpt-4-0613")
+- "Code will be released" or "available upon request" → NO for `code_released`
+- Prompt *templates* with placeholders → NO for `prompts_provided` (need actual prompt text used)
+- Reporting medians across runs without std dev or IQR → NO for `variance_reported`
+- Describing what a prompt does in natural language → NO for `prompts_provided` (need the actual text)
+- A threats-to-validity section with only generic disclaimers → NO for `threats_to_validity_specific`
+- Abstract mentions of alternative factors → NO for `alternative_explanations_discussed` (need substantive discussion)
+
+### When to use NA
+
+NA means "this question is structurally inapplicable to this paper type." It does NOT mean "the paper didn't do it." If a paper *could have* done it but didn't, that's NO.
+
+**Use NA when:**
+- `human_studies.*` — The paper has no human participants at all (mining GitHub repos, running benchmarks on code, etc.). A survey OF papers is not a human subjects study.
+- `contamination.*` — The paper does not evaluate a pre-trained model's capability on any benchmark (e.g., a mining study, interview study, survey paper, or red-teaming study that tests defenses rather than model knowledge).
+- `evaluation_design.ablation_study` — The system has only one component.
+- `evaluation_design.human_evaluation` — Human evaluation is clearly irrelevant to the claims.
+- `setup_transparency.scaffolding_described` — No agentic scaffolding is used, OR the paper evaluates third-party tools as black boxes (it cannot describe their internal scaffolding).
+- `setup_transparency.prompts_provided` — The paper does not use prompting at all.
+- `cost_and_practicality.*` — The paper is purely theoretical or is a survey.
+- `claims_and_evidence.causal_claims_justified` — The paper genuinely makes no causal claims (but check carefully — ablation studies and "X improves Y" language ARE causal claims).
+
+**Do NOT use NA when:**
+- A survey paper could have released analysis code/data but didn't → `code_released` = NO, `data_released` = NO
+- A paper could have reported costs but chose not to → NO, not NA
+- A question is difficult to answer → still answer YES or NO
+- The paper is weak in an area → NO is the correct answer, not NA
 
 ### 3. Extract Claims
 
@@ -96,17 +123,28 @@ If there are no red flags, return an empty array.
 ## Handling Different Paper Types
 
 ### Empirical papers (most common)
-Answer all applicable checklist items. Most items will be applicable.
+Answer all applicable checklist items. Most items will be applicable. NA is rare.
+
+### Benchmark evaluation papers
+Answer all items. contamination items are especially important. human_studies items are NA unless the paper includes a user study.
 
 ### Survey / review papers
-Many evaluation_design and statistical_methodology items will be NA. Focus on:
-- artifacts: Did they document search strategy and provide data?
-- data_integrity: Is the review methodology transparent and reproducible?
-- claims_and_evidence: Do conclusions follow from the papers reviewed?
-- limitations_and_scope: Does it acknowledge selection bias, publication bias?
+- `artifacts`: Answer YES/NO, not NA. A survey *can* release its search corpus, analysis scripts, or extracted data. If it didn't, that's NO.
+- `statistical_methodology`: Most items NA (surveys don't run experiments). Exception: if the survey does meta-analysis with statistical aggregation, answer these.
+- `evaluation_design`: Most items NA. Exception: `baselines_included` and `per_category_breakdown` can apply (does the survey compare papers systematically?).
+- `claims_and_evidence`: Answer all. Do conclusions follow from the papers reviewed?
+- `setup_transparency`: `data_preprocessing_documented` applies (was the paper selection pipeline documented with counts at each filtering stage?). Most others NA.
+- `data_integrity`: Answer all. Is the review methodology transparent and reproducible?
+- `contamination`, `human_studies`: NA.
+- `cost_and_practicality`: NA (surveys don't have inference costs — and if a survey reports costs of systems it reviews, those are the reviewed systems' costs, not the survey's).
 
 A survey that just collects and summarizes without structured quality assessment is laundering the signal-to-noise ratio of its sources. Flag this in red_flags.
 
+### Mining / repository studies (no human participants)
+- `human_studies`: All NA (mining public repositories is not a human subjects study).
+- `contamination`: NA unless the study evaluates an LLM on a benchmark.
+- All other categories: answer YES/NO as applicable.
+
 ### Theoretical / position papers
 Most empirical checklist items will be NA. Focus on claims_and_evidence and limitations_and_scope.
 
diff --git a/analysis/calibration-summary.json b/analysis/calibration-summary.json
@@ -0,0 +1,369 @@
+{
+  "calibration_date": "2026-02-28",
+  "papers_calibrated": 8,
+  "papers_remaining": 11,
+  "total_questions_evaluated": 400,
+  "total_agreements": 373,
+  "overall_agreement_rate": 0.9325,
+  "mean_applicable_agreement_rate": 0.9033,
+  "per_paper": [
+    {
+      "slug": "adaptive-attacks-bypass-defenses-2025",
+      "agreement_rate": 0.86,
+      "applicable_agreement_rate": 0.88,
+      "disagreement_count": 7
+    },
+    {
+      "slug": "adaptive-test-generation-2023",
+      "agreement_rate": 0.96,
+      "applicable_agreement_rate": 0.952,
+      "disagreement_count": 2
+    },
+    {
+      "slug": "adoption-generative-artificial-2026",
+      "agreement_rate": 0.98,
+      "applicable_agreement_rate": 0.97,
+      "disagreement_count": 1
+    },
+    {
+      "slug": "agent-developer-practices-2025",
+      "agreement_rate": 0.86,
+      "applicable_agreement_rate": 0.912,
+      "disagreement_count": 7
+    },
+    {
+      "slug": "agentic-adoption-github-2026",
+      "agreement_rate": 1.0,
+      "applicable_agreement_rate": 1.0,
+      "disagreement_count": 0
+    },
+    {
+      "slug": "agentic-programming-survey-2025",
+      "agreement_rate": 0.88,
+      "applicable_agreement_rate": 0.625,
+      "disagreement_count": 6
+    },
+    {
+      "slug": "agentless-2024",
+      "agreement_rate": 0.96,
+      "applicable_agreement_rate": 0.9375,
+      "disagreement_count": 2
+    },
+    {
+      "slug": "ai-code-not-reproducible-2025",
+      "agreement_rate": 0.96,
+      "applicable_agreement_rate": 0.95,
+      "disagreement_count": 2
+    }
+  ],
+  "disagreement_direction": {
+    "yes_to_no": 11,
+    "no_to_na": 7,
+    "yes_to_na": 4,
+    "na_to_yes": 2,
+    "na_to_no": 2,
+    "no_to_yes": 1
+  },
+  "questions_with_disagreements": [
+    {
+      "question": "claims_and_evidence.causal_claims_justified",
+      "disagreement_count": 3,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "na",
+          "opus": "yes",
+          "opus_justification": "The paper does make implicit causal claims: that adaptive attacks CAUSE defense failure, and that the weakness of prior evaluations explains the false sense of security. However, these claims are justified by direct demonstration — they construct the attacks and show they work. This is the strongest possible causal evidence (interventional). The study design is adequate: they directly intervene by applying adaptive attacks and observe the outcome. YES is more appropriate than NA because the paper's core thesis is causal (adaptive attacks break defenses), and the study design supports it."
+        },
+        {
+          "paper": "adaptive-test-generation-2023",
+          "sonnet": "no",
+          "opus": "yes",
+          "opus_justification": "The paper's causal claims are about which prompt components contribute to test effectiveness (e.g., 'all five components are essential'). These claims are supported by a systematic ablation study (RQ5) where each component is disabled one at a time and the impact is measured with Wilcoxon tests and effect sizes. Ablation studies are a standard and accepted methodology for causal claims about component contributions in SE and ML research. The claim 'removing component X degrades metric Y' is directly testable by ablation, and the paper does this correctly. Sonnet's objection about order effects and interactions is overly strict — the paper does not claim to have isolated interaction effects, and the causal claims are appropriately scoped to the ablation evidence provided."
+        },
+        {
+          "paper": "agentless-2024",
+          "sonnet": "na",
+          "opus": "yes",
+          "opus_justification": "Sonnet marked this NA ('no causal claims made'), but the paper does make causal claims through its ablation studies. For example: 'the most significant performance improvement was achieved by incorporating additional filtering based on the generated reproduction tests' (Section 5.2.3), and 'This demonstrates the importance of our hierarchical localization design' (Section 5.2.1). These are causal claims about which components cause performance changes. The ablation study design (varying one component while holding others constant) is an adequate controlled-experiment design for these engineering causal claims. Therefore the answer should be YES -- causal claims are made and the study design supports them."
+        }
+      ]
+    },
+    {
+      "question": "setup_transparency.prompts_provided",
+      "disagreement_count": 2,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The schema asks whether 'prompts or system instructions used in experiments' are provided, and clarifies that 'if prompts are described only in natural language without the actual text, NO.' The paper shows examples of successful attack prompts (outputs of the optimization), but the system prompts and evaluation prompts used to run the experiments — such as the mutator system prompt, the critic LLM prompt, the scorer instructions, and the defense system prompts — are not fully reproduced. Section D says the mutator system prompt was 'generated by another LLM and then manually edited' and describes its sections conceptually but does not provide the actual text. Showing attack outputs is not the same as providing the experimental prompts."
+        },
+        {
+          "paper": "ai-code-not-reproducible-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "Figure 1 shows only the prompt TEMPLATE with placeholders like [Task Description], [Language], and [Environment]. The actual 100 task descriptions — which are a critical part of the input determining the generated code — are not provided in the paper, appendix, or a linked repository. The schema requires 'full prompt text in the paper or appendix, or a link to a repository containing prompts.' Without the 100 specific task descriptions, a researcher cannot reproduce the prompts and thus the experiment. The template alone is insufficient."
+        }
+      ]
+    },
+    {
+      "question": "claims_and_evidence.alternative_explanations_discussed",
+      "disagreement_count": 2,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "Sonnet credits the threats-to-validity section and the MetaGPT popularity explanation, but these do not constitute discussion of alternative explanations for the main findings. The threats section discusses methodology weaknesses (GPT-4 bias, tag inaccuracy), not alternative explanations for observed patterns. For example, the finding that '96% of top-starred projects use multiple frameworks' could be explained by selection bias (multi-framework projects may be more complex and garner more stars), but this is not discussed. The MetaGPT explanation is a single interpretation, not a consideration of alternatives. No robustness checks are performed."
+        },
+        {
+          "paper": "agentic-programming-survey-2025",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "Sonnet marked NO, arguing the paper presents a single taxonomy without discussing alternatives. I marked NA because the paper does not present experimental findings that need alternative explanations — it is a survey offering a taxonomic framework. The schema says 'consideration of confounds, other factors that could explain the results.' A survey does not have 'results' in the experimental sense. However, Sonnet's point about alternative taxonomic frameworks is valid — the paper could have discussed whether different categorization schemes would change conclusions. This is a borderline call; NO is defensible if you interpret the criterion broadly."
+        }
+      ]
+    },
+    {
+      "question": "evaluation_design.held_out_test_set",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "yes",
+          "opus": "na",
+          "opus_justification": "This paper is a security evaluation (red-teaming) paper, not a traditional ML train/test evaluation. The concept of a 'held-out test set' does not apply in the same way: they are not tuning a system on dev data and then evaluating on test data. They use public benchmarks as evaluation targets, but there is no train/dev/test split relevant to their attack methodology. The schema asks about 'results reported on a held-out test set (not the dev/validation set used for tuning)' — but the attack optimization (RL, search) is done per-defense and per-sample, not on a shared validation set. NA is more appropriate than YES."
+        }
+      ]
+    },
+    {
+      "question": "setup_transparency.model_versions_specified",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The schema requires 'specific model names with version (e.g., gpt-4-0613, Claude 3.5 Sonnet, Llama-2-70b-chat).' The paper uses names like 'Gemini-2.5 Pro', 'GPT-5 Mini', 'Llama-3.3-70B' but without snapshot dates or exact API version identifiers. For the RL attacker, they use 'a closed-source base model with no safety alignment' without even naming it. The schema explicitly says 'If the paper says just GPT-4 or Claude without a version or snapshot date, NO — model behavior changes across versions.' While 'Gemini-2.5 Pro' is more specific than just 'Gemini', it still lacks a snapshot date, and the unnamed RL base model is a clear gap."
+        }
+      ]
+    },
+    {
+      "question": "setup_transparency.hyperparameters_reported",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The schema specifies 'temperature, top-p, max tokens, learning rate, etc.' and says 'If the paper uses an LLM API without stating temperature/sampling settings, NO — these significantly affect output.' The paper uses multiple LLM APIs (Gemini-2.5 Pro as mutator, various target models, a critic LLM) but never reports temperature or sampling parameters for any of them. While some hyperparameters are reported (32 sessions, 5 rounds, 8 candidates per mutation, 800 queries budget), the critical LLM inference hyperparameters (temperature, top-p) are missing throughout."
+        }
+      ]
+    },
+    {
+      "question": "data_integrity.recruitment_methods_described",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The paper says they ran an 'online red-teaming competition with over 500 participants' with $20,000 in prizes, but does not describe HOW participants were recruited — what platforms were used to advertise the competition, what communities were targeted, what channels were used. The schema asks about 'how participants were found, what channels were used, whether recruitment could introduce bias.' Knowing it was a competition with prizes tells us the incentive structure, not the recruitment method. The Wakefield parallel in the schema (recruited through anti-vaccine activists, biasing the sample) is exactly the kind of recruitment bias that goes unexamined here — competition participants self-select from security/ML communities, which could bias the assessment of how difficult these attacks are."
+        }
+      ]
+    },
+    {
+      "question": "contamination.benchmark_contamination_addressed",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adaptive-attacks-bypass-defenses-2025",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "This paper evaluates the robustness of defenses against adaptive attacks, not the baseline capability of models on benchmarks. Benchmark contamination is primarily a concern when measuring model capability — if a model has memorized HumanEval solutions, its coding score is inflated. Here, the paper tests whether defenses can withstand novel attacks. Even if model utility scores are contamination-inflated, the core finding (that adaptive attacks bypass defenses at >90% ASR) is about the defense mechanism, not model capability. The attacks are generated fresh and novel, so contamination is irrelevant to the main claims. NA better reflects that the criterion does not meaningfully apply to this paper's central contribution."
+        }
+      ]
+    },
+    {
+      "question": "statistical_methodology.variance_reported",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adaptive-test-generation-2023",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The paper states 'we run all experiments 10 times' and reports medians over those 10 runs, but it does NOT report standard deviation, variance, or any measure of spread across runs. The schema requires: 'If it explicitly states averaged over K runs with std dev YES.' Reporting medians without any variance/spread measure does not satisfy this criterion. The reader cannot tell how much the results vary across the 10 runs."
+        }
+      ]
+    },
+    {
+      "question": "human_studies.randomization_described",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "adoption-generative-artificial-2026",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "The schema states 'NA if not an experimental study or no human participants.' This is a cross-sectional survey study with no treatment conditions or experimental groups. Randomization is a concept for experimental studies where participants are assigned to conditions. Since this is a descriptive survey study (not an experiment), randomization of participants to conditions is not applicable. Sonnet's 'no' is defensible but inconsistent with marking 'blinding_described' as NA for the same reason — both are experimental design features irrelevant to a survey study."
+        }
+      ]
+    },
+    {
+      "question": "claims_and_evidence.generalization_bounded",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The paper's title ('An Empirical Study of Agent Developer Practices in AI Agent Frameworks') and abstract make broad claims about 'agent developer practices' without qualifying that findings apply only to GitHub open-source projects with >10 stars and >5 forks from 2022-2025. While Section 7.3 acknowledges the GitHub-only sample as a limitation, the findings (e.g., Finding 1-13) and abstract do not carry scope qualifiers. The schema asks whether generalizations are bounded to the tested setting; the main claims are presented as general truths about agent development rather than bounded to the specific GitHub sample."
+        }
+      ]
+    },
+    {
+      "question": "limitations_and_scope.scope_boundaries_stated",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The schema references METR's Table 2 ('What the evidence does not show') as the gold standard. This paper does not have any such explicit statement of what the results do NOT show. Section 7.3 acknowledges the GitHub-only limitation and Section 4.1.1 notes the ten frameworks 'do not encompass all existing frameworks,' but these are qualifications buried in methodology and threats sections, not prominent scope boundary statements. The findings sections (Findings 1-13) present conclusions without scope qualifications. There is no explicit 'what we do not claim' framing."
+        }
+      ]
+    },
+    {
+      "question": "contamination.training_cutoff_stated",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "The schema states this is applicable when the paper evaluates a pre-trained model: 'NA if the paper does not evaluate a pre-trained model.' This paper does not evaluate GPT-4/GPT-4o as subjects of study; it uses them as analysis tools for text summarization and classification. The subject of evaluation is the agent frameworks and developer practices, not model performance. While GPT-4's training data could theoretically bias the analysis, this is a methodology concern (addressed under red_flags), not a contamination concern in the schema's intended sense."
+        }
+      ]
+    },
+    {
+      "question": "contamination.train_test_overlap_discussed",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "The schema asks about potential train/test overlap when 'the paper uses a public benchmark with a model that could have trained on it.' This paper does not use any benchmark to evaluate model performance. The developer discussions are not a test set for GPT-4; they are data being processed by GPT-4 as an analysis tool. The contamination category is designed for the 'you measured it wrong' problem of benchmark evaluation, which does not apply here."
+        }
+      ]
+    },
+    {
+      "question": "human_studies.pre_registered",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "The schema explicitly states 'NA if no human participants.' This study mines publicly available GitHub repository data and discussion threads. There are no human participants enrolled in any experimental or survey protocol. The two expert annotators who reviewed classifications are part of the research team's quality assurance process, not study participants. Sonnet's NO answer incorrectly applies a human-studies criterion to a mining-based empirical study."
+        }
+      ]
+    },
+    {
+      "question": "human_studies.blinding_described",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agent-developer-practices-2025",
+          "sonnet": "no",
+          "opus": "na",
+          "opus_justification": "The schema states 'NA if blinding is not feasible or no human participants.' This study has no human participants in the experimental sense. The expert annotators are co-researchers performing quality checks on GPT-4 output, not blinded evaluators in a human-subjects experiment. Sonnet's concern about annotators knowing which framework each discussion pertains to is reasonable as a methodology concern, but the human_studies category is explicitly scoped to papers with human participants, which this is not."
+        }
+      ]
+    },
+    {
+      "question": "artifacts.code_released",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agentic-programming-survey-2025",
+          "sonnet": "na",
+          "opus": "no",
+          "opus_justification": "Sonnet marked NA because this is a survey paper. However, the schema asks 'Is source code released?' — the answer should be NO, not NA. Survey papers can still release code (e.g., analysis scripts, search automation, taxonomy visualization tools). The criterion asks a factual question about whether code exists, and here it does not. NA should be reserved for cases where releasing code is categorically irrelevant (e.g., a pure theoretical proof). A systematic literature review could release replication scripts for the search and screening process."
+        }
+      ]
+    },
+    {
+      "question": "artifacts.data_released",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agentic-programming-survey-2025",
+          "sonnet": "na",
+          "opus": "no",
+          "opus_justification": "Sonnet marked NA because no dataset was 'created.' However, the corpus of 152 papers IS the data of this survey, and it was not released in any structured format. The schema says 'If they collected proprietary data and did not release it, NO.' The authors assembled a curated list of papers through systematic search — this is a dataset they collected but did not release. A proper SLR should release its paper list for verification. NO is more appropriate than NA because releasing the corpus is expected for replicability."
+        }
+      ]
+    },
+    {
+      "question": "evaluation_design.negative_results_reported",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agentic-programming-survey-2025",
+          "sonnet": "yes",
+          "opus": "na",
+          "opus_justification": "Sonnet marked YES citing Section 6.3 and Fig. 9 where the paper notes 'all models still perform poorly on software optimization tasks.' On reflection, Sonnet has a point — the paper does report negative results about the state of the field (poor benchmark performance, inadequate benchmarks). However, these are not negative results from experiments conducted by the authors; they are observations synthesized from other work. The schema asks about 'things that didn't work' in the context of the paper's own experiments. Since this is a survey with no experiments, I originally marked NA. However, Sonnet's interpretation that the survey's identification of field-level failures counts is reasonable. I lean NA because the question assumes experimental work, but this is a borderline case."
+        }
+      ]
+    },
+    {
+      "question": "setup_transparency.data_preprocessing_documented",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agentic-programming-survey-2025",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "Sonnet marked YES citing the three-stage selection process in Section 3. I marked NO because while the pipeline stages and counts are documented, the actual preprocessing details are thin: we do not know the exact database queries used (only high-level term clusters), the specific disagreement resolution process, how many papers each reviewer independently accepted/rejected, or what 'discussion' entailed for resolving conflicts. The schema says 'If the paper goes from we collected data to here are the results without describing intermediate processing, NO.' Section 3 describes the process at a moderate level — enough for a broad understanding but not enough to actually replicate the screening decisions. On reflection, Sonnet's YES is defensible for a survey context where the 'preprocessing' is paper selection, and the three stages with counts do describe the pipeline. This is genuinely borderline."
+        }
+      ]
+    },
+    {
+      "question": "cost_and_practicality.inference_cost_reported",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agentic-programming-survey-2025",
+          "sonnet": "yes",
+          "opus": "na",
+          "opus_justification": "Sonnet marked YES citing Table 6 and Section 4.4 which detail LLM token pricing. However, these costs are for the systems surveyed, not for the survey methodology itself. The survey authors did not run any inference — they compiled pricing information from providers. The schema asks 'Is inference cost or latency reported?' in the context of the paper's own methodology. Since no inference was conducted by the authors, NA is more appropriate. That said, Sonnet's interpretation has merit: the paper does provide cost analysis as a contribution, even if it is surveyed rather than measured. This question's intent is somewhat ambiguous for survey papers that discuss costs as a topic rather than incurring them."
+        }
+      ]
+    },
+    {
+      "question": "evaluation_design.human_evaluation",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "agentless-2024",
+          "sonnet": "yes",
+          "opus": "no",
+          "opus_justification": "The schema asks whether 'human evaluation' of the system's outputs is included, listing 'human ratings, manual inspection, user studies, expert review of outputs' as indicators. Sonnet counted the manual classification of SWE-bench Lite problems (Section 6.1) as human evaluation. However, that classification characterizes the benchmark's properties (description quality, solution presence, location information) -- it is not a human evaluation of AGENTLESS's outputs, patches, or performance. No human raters assessed whether AGENTLESS's patches were correct, useful, or well-formed. The evaluation of AGENTLESS itself is entirely automated (pass/fail on test suites). Therefore the answer should be NO."
+        }
+      ]
+    },
+    {
+      "question": "setup_transparency.scaffolding_described",
+      "disagreement_count": 1,
+      "details": [
+        {
+          "paper": "ai-code-not-reproducible-2025",
+          "sonnet": "yes",
+          "opus": "na",
+          "opus_justification": "The paper uses three commercial coding agents (Claude Code, Codex, Gemini) as black-box evaluation subjects — it does not build or use agentic scaffolding as part of its own methodology. The evaluation pipeline (Algorithm 1, SciUnit capture, environment reset) describes the experimental infrastructure, not agentic scaffolding. The schema asks 'If the approach uses agentic scaffolding, is it described in detail?' and says 'NA if no scaffolding is used.' Since the authors' approach does not employ scaffolding — the agents are the objects of study, not tools the authors built or orchestrated — NA is the correct answer. Sonnet appears to have conflated the evaluation methodology with scaffolding."
+ }
+ ]
+ }
+ ]
+}
diff --git a/analysis/calibration-summary.md b/analysis/calibration-summary.md
@@ -0,0 +1,89 @@
+# Calibration Summary: Sonnet vs Opus on Boolean Checklist
+
+**Date**: 2026-02-28
+**Papers calibrated**: 8 of 19 (11 remaining due to rate limits)
+**Instrument**: 50-question boolean checklist (scan.schema.json)
+
+## Overall Agreement
+
+| Metric | Value |
+|--------|-------|
+| Total question-pairs evaluated | 400 (8 papers x 50 questions) |
+| Agreements | 373 |
+| **Overall agreement rate** | **93.2%** |
+| Disagreements | 27 |
+
+## Per-Paper Results
+
+| Paper | Agreement | Rate | Disagreements |
+|-------|-----------|------|---------------|
+| agentic-adoption-github-2026 | 50/50 | 100% | 0 |
+| adoption-generative-artificial-2026 | 49/50 | 98% | 1 |
+| adaptive-test-generation-2023 | 48/50 | 96% | 2 |
+| agentless-2024 | 48/50 | 96% | 2 |
+| ai-code-not-reproducible-2025 | 48/50 | 96% | 2 |
+| agentic-programming-survey-2025 | 44/50 | 88% | 6 |
+| adaptive-attacks-bypass-defenses-2025 | 43/50 | 86% | 7 |
+| agent-developer-practices-2025 | 43/50 | 86% | 7 |
+
+## Disagreement Direction
+
+When Sonnet and Opus disagree, who is stricter?
+
+| Direction | Count | % of disagreements |
+|-----------|-------|--------------------|
+| Sonnet=YES, Opus=NO | 11 | 40.7% |
+| Sonnet=NO, Opus=NA | 7 | 25.9% |
+| Sonnet=YES, Opus=NA | 4 | 14.8% |
+| Sonnet=NA, Opus=YES | 2 | 7.4% |
+| Sonnet=NA, Opus=NO | 2 | 7.4% |
+| Sonnet=NO, Opus=YES | 1 | 3.7% |
+
+**Key finding**: Sonnet is more generous than Opus. In 41% of disagreements, Sonnet said YES where Opus said NO. Sonnet also over-applies YES/NO where Opus thinks the question is NA (41% of disagreements involve NA boundary).
+
+## Most Contentious Questions
+
+| Question | Disagreements | Pattern |
+|----------|---------------|---------|
+| `claims_and_evidence.causal_claims_justified` | 3 | Mixed: Sonnet sometimes marks NA when Opus sees causal claims (ablation studies), sometimes NO when Opus sees adequate justification |
+| `setup_transparency.prompts_provided` | 2 | Sonnet YES, Opus NO. Sonnet counts described prompts; Opus requires actual prompt text |
+| `claims_and_evidence.alternative_explanations_discussed` | 2 | Sonnet YES, Opus NO. Different thresholds for what counts as "discussed" |
+
+## Disagreement Categories
+
+### 1. NA Boundary Disputes (13/27 = 48%)
+The most common disagreement type. Questions that don't clearly apply to a paper type:
+- **contamination** questions on non-benchmark papers (Sonnet says NO, Opus says NA)
+- **human_studies** items on interview/qualitative studies (Sonnet says NO, Opus says NA)
+- Survey papers: Sonnet answers artifact questions (code/data released) as NA, Opus says NO
+
+**Recommendation**: Tighten NA guidance in schema descriptions. For surveys, "code_released" should be NO (not NA) — a survey could release its analysis scripts. For contamination questions on non-LLM-evaluation papers, NA is correct.
+
+### 2. Strictness Gaps (11/27 = 41%)
+Sonnet says YES where Opus says NO. Common in:
+- `setup_transparency`: Sonnet counts partial/described information; Opus requires the actual artifacts (full prompt text, specific version strings)
+- `claims_and_evidence`: Sonnet credits vague mentions; Opus requires substantive discussion
+- `limitations_and_scope`: Sonnet counts generic statements; Opus requires study-specific threats
+
+**Recommendation**: These suggest the schema descriptions need sharper examples of what counts vs. what doesn't. Consider adding "This counts: ..." and "This does NOT count: ..." examples to the most ambiguous items.
+
+### 3. Genuine Interpretive Differences (3/27 = 11%)
+Reasonable disagreements where both answers are defensible:
+- Whether manual classification of benchmark problems counts as "human evaluation"
+- Whether ablation studies constitute "causal claims"
+
+**Assessment**: These are inherent to the instrument and unlikely to be eliminated. 3 out of 400 is acceptable.
+
+## Conclusion
+
+93.2% inter-model agreement on a 50-question instrument is strong. For context:
+- Medical inter-rater reliability studies consider >80% "substantial agreement"
+- The boolean format achieves much higher agreement than a 0-3 Likert scale would
+- Most disagreements are systematic (NA boundaries, strictness thresholds) and can be reduced with schema refinements
+
+**Actionable improvements**:
+1. Add explicit NA guidance to contamination, human_studies, and artifacts categories
+2. Add "counts / does not count" examples to setup_transparency and claims_and_evidence items
+3. Consider making Sonnet the primary rater with the understanding that it is ~7% more generous than Opus — this is a known, systematic bias that can be disclosed in the paper
+
+**Remaining work**: 11 more papers need calibration to strengthen these findings (rate-limited, will resume later).
diff --git a/context/methodology.md b/context/methodology.md
@@ -11,87 +11,61 @@ This project adapts systematic review methodology from medical research, particu
 The key adaptation is that we are reviewing *methodological quality*, not synthesizing effect sizes. We are not doing a meta-analysis of "how much does AI help"; we are asking "how well did each study support its claims."

-## Scoring Rubric
+## Quality Assessment Instrument

-Six dimensions, each scored 0-3.
+### Design: 50-question boolean checklist

-### 1. Artifacts & Reproducibility (0-3)
+We use a 50-question yes/no/na checklist rather than subjective Likert-style scores. This was a deliberate design decision:

-Can someone reproduce this work?
+- **Verifiable**: each question has a factually correct answer checkable against the paper +- **Auditable**: justification text cites specific sections/quotes +- **High inter-rater reliability**: yes/no has much less variance than 0-3 scores +- **Fast human calibration**: checking 50 booleans takes ~15 min, not hours +- **Derived scores**: composite scores computed deterministically from boolean counts +- **The questions are findings**: "only 34% of papers release code" is concrete and publishable -| Score | Meaning | -|-------|---------| -| 0 - Absent | No code, no data, no artifacts released | -| 1 - Weak | Code or data mentioned but not available, or available but incomplete | -| 2 - Adequate | Code and data released, reasonably documented | -| 3 - Strong | Full reproduction package: code, data, environment specs, and instructions that a competent researcher could follow | +Previous design (discarded): 6-dimension 0-3 rubric (Artifacts, Statistical Rigor, Benchmark Quality, Claim-to-Evidence Ratio, Setup Transparency, Limitations Discussion). Discarded because LLM-assigned subjective scores are hard to defend in a paper about methodological rigor. Inter-rater reliability would be poor, construct validity unproven, and the instrument itself would be a methodology weakness. -**Rationale**: Reproducibility is the foundation of empirical science. In a field where benchmark scores can vary 10x based on scaffolding alone (GPT-4 SWE-bench: 2.7% to 28.3%), releasing artifacts is not optional. +### Categories (11 groups, 50 questions) -### 2. Statistical Rigor (0-3) +1. **Artifacts** (4q): code released, data released, environment specs, reproduction instructions +2. **Statistical methodology** (5q): CIs/error bars, significance tests, effect sizes, sample size justification, variance +3. **Evaluation design** (9q): baselines, contemporary baselines, ablation, multiple metrics, human eval, held-out test, breakdowns, failure cases, negative results +4. 
**Claims & evidence** (4q): abstract supported, causal claims justified, generalization bounded, alternatives discussed +5. **Setup transparency** (5q): model versions, prompts, hyperparameters, scaffolding, data preprocessing +6. **Limitations & scope** (3q): limitations section, specific threats, scope boundaries +7. **Data integrity** (4q): raw data available, collection described, recruitment described, pipeline documented +8. **Conflicts of interest** (4q): funding disclosed, affiliations disclosed, funder independent, financial interests declared +9. **Contamination** (3q): training cutoff stated, train/test overlap discussed, contamination addressed +10. **Human studies** (7q): pre-registered, IRB, demographics, inclusion/exclusion, randomization, blinding, attrition +11. **Cost & practicality** (2q): inference cost, compute budget -Are the statistical methods appropriate for the claims made? +Data integrity and conflicts of interest categories inspired by the Wakefield MMR case — "Is raw data available for independent verification?" would have caught the fabrication years earlier. -| Score | Meaning | -|-------|---------| -| 0 - Absent | No statistical analysis; raw numbers only | -| 1 - Weak | Basic descriptive statistics but no uncertainty quantification | -| 2 - Adequate | Appropriate tests, confidence intervals, or effect sizes reported | -| 3 - Strong | Pre-registered analysis plan, multiple comparisons correction, sensitivity analysis | +Full schema with evaluation criteria for each question: `schema/scan.schema.json` -**Rationale**: Many papers in this space report "X% improvement" without confidence intervals, significance tests, or acknowledgment of variance. This is especially problematic for small-N studies. +### Answer rules -### 3. 
Benchmark Quality (0-3) +- **yes** = the paper clearly satisfies the criterion; you can point to where +- **no** = the paper does not satisfy the criterion, or evidence is absent (absence of evidence is NO, not NA) +- **na** = the criterion is structurally inapplicable to this paper type (e.g., human_studies questions for a benchmark paper, contamination questions for a mining study) -Are the benchmarks appropriate for the claims being made? +Each answer includes a 1-3 sentence justification citing specific paper sections. -| Score | Meaning | -|-------|---------| -| 0 - Absent | No benchmark or evaluation; claims unsupported | -| 1 - Weak | Benchmarks used but inappropriate for the claim (e.g., HumanEval for agent capability) | -| 2 - Adequate | Appropriate benchmarks with known limitations acknowledged | -| 3 - Strong | Multiple complementary benchmarks, contamination checks, real-world validation | +### Model assignment -**Rationale**: Benchmark choice determines what is actually measured. HumanEval measures single-function completion; SWE-bench measures multi-step repo tasks. Using one to claim the other is a category error. +- **Primary rater**: Sonnet (boolean checklist is factual lookup, not subjective judgment) +- **Calibration rater**: Opus (independent re-evaluation of subset to measure agreement) -### 4. Claim-to-Evidence Ratio (0-3) +### Calibration results (round 1, 2026-02-28) -Do the claims stay within what the evidence supports? +8 papers calibrated (Sonnet vs Opus). Overall agreement: **93.2%** (373/400 questions). 
-| Score | Meaning | -|-------|---------| -| 0 - Absent | Major claims with no supporting evidence | -| 1 - Weak | Claims significantly overreach the evidence (e.g., "AI will replace developers" from a benchmark study) | -| 2 - Adequate | Claims mostly supported, with minor overreach in discussion | -| 3 - Strong | Claims precisely scoped to what was measured; limitations clearly stated | +Two systematic issues identified and corrected in the schema: +1. **NA boundary errors** (56% of disagreements): Sonnet marked NO when questions didn't apply, or NA when they did. Fixed by adding explicit "NA when:" guidance to schema descriptions for contamination, human_studies, artifacts, cost_and_practicality categories. +2. **Generosity bias** (44% of disagreements): Sonnet said YES where Opus said NO, especially on setup_transparency (credited partial information) and claims_and_evidence (credited vague mentions). Fixed by adding "does NOT count" examples to schema descriptions. -**Rationale**: The gap between "what was measured" and "what is claimed" is where most misleading narratives originate. The Stanford study measured git activity with Copilot autocomplete; headlines said "AI makes developers 30% faster." - -### 5. Setup Transparency (0-3) - -Is the experimental setup described well enough to understand what was actually tested? - -| Score | Meaning | -|-------|---------| -| 0 - Absent | Setup not described | -| 1 - Weak | High-level description only (e.g., "we used GPT-4") | -| 2 - Adequate | Model, prompts, parameters, and tools described | -| 3 - Strong | Full setup including scaffolding, system prompts, tool configurations, and any post-processing | - -**Rationale**: In agentic AI research, the scaffolding around the model often matters more than the model itself. A paper that says "we used Claude" without describing the scaffold is not describing a reproducible experiment. - -### 6. 
Limitations Discussion (0-3) - -Does the paper honestly discuss what it does *not* show? - -| Score | Meaning | -|-------|---------| -| 0 - Absent | No limitations section or acknowledgment | -| 1 - Weak | Boilerplate limitations section without substance | -| 2 - Adequate | Genuine limitations discussed, including threats to validity | -| 3 - Strong | Limitations are specific, actionable, and inform the reader about exactly when the results do and do not apply | - -**Rationale**: Honest limitations discussion is a strong signal of methodological maturity. Papers that acknowledge what they did not measure are more trustworthy than papers that don't. +Schema and agent prompt updated based on calibration findings. All scans removed for re-run with improved instrument. Calibration round 2 pending. ## Paper Selection diff --git a/schema/scan.schema.json b/schema/scan.schema.json @@ -104,7 +104,7 @@ }, "variance_reported": { "$ref": "#/$defs/checklist_item", - "description": "Is variance or standard deviation reported across experimental runs? Look for: std dev in tables, variance across seeds, multiple-run results. If the paper reports single-run numbers only, NO. If it explicitly states 'averaged over K runs with std dev' YES." + "description": "Is variance or standard deviation reported across experimental runs? Look for: std dev in tables, variance across seeds, interquartile range, multiple-run results with spread measures. If the paper reports single-run numbers only, NO. If it explicitly states 'averaged over K runs with std dev' YES. Reporting medians across runs WITHOUT any spread measure (std dev, IQR, min/max range) is NO — the reader cannot assess result stability." } } }, @@ -141,7 +141,7 @@ }, "human_evaluation": { "$ref": "#/$defs/checklist_item", - "description": "Is human evaluation included (not just automated metrics)? Look for: human ratings, manual inspection, user studies, expert review of outputs. If evaluation is entirely automated, NO. 
NA if human evaluation is clearly irrelevant to the claims." + "description": "Is human evaluation included (not just automated metrics)? Look for: human ratings, manual inspection, user studies, expert review of the system's OUTPUTS. The humans must be evaluating what the system produced — manual classification of the benchmark or dataset itself does not count. If evaluation of the system is entirely automated (e.g., pass/fail on test suites), NO. NA if human evaluation is clearly irrelevant to the claims." }, "held_out_test_set": { "$ref": "#/$defs/checklist_item", @@ -177,15 +177,15 @@ }, "causal_claims_justified": { "$ref": "#/$defs/checklist_item", - "description": "If the paper makes causal claims, is the study design adequate for causal inference? Look for: RCT, natural experiment, instrumental variables, difference-in-differences, or other causal identification strategies. If the paper says 'X improves Y' from observational data without addressing confounds, NO. NA if no causal claims are made." + "description": "If the paper makes causal claims, is the study design adequate for causal inference? Look for: RCT, natural experiment, instrumental variables, difference-in-differences, or other causal identification strategies. If the paper says 'X improves Y' from observational data without addressing confounds, NO. NA if no causal claims are made. Note: ablation studies ('removing component X reduces performance by Y%') ARE causal claims — check whether the ablation design is adequate (controlled single-variable manipulation counts as YES). Language like 'improves', 'causes', 'leads to', 'enables' signals causal claims." }, "generalization_bounded": { "$ref": "#/$defs/checklist_item", - "description": "Are generalizations bounded to the tested setting? Look for: claims that extend beyond the tested models, languages, tasks, or populations. If the paper tests on Python and claims results for 'code generation' generally, NO. 
If it says 'on Python tasks with GPT-4' YES." + "description": "Are generalizations bounded to the tested setting? Look for: claims that extend beyond the tested models, languages, tasks, or populations. If the paper tests on Python and claims results for 'code generation' generally, NO. If it says 'on Python tasks with GPT-4' YES. Check the title and abstract — broad titles like 'LLM-based Software Engineering' when results are on a single benchmark in a single language is NO." }, "alternative_explanations_discussed": { "$ref": "#/$defs/checklist_item", - "description": "Are alternative explanations for the results discussed? Look for: consideration of confounds, other factors that could explain the results, robustness checks. If the paper presents one interpretation without considering alternatives, NO." + "description": "Are alternative explanations for the results discussed? Look for: consideration of confounds, other factors that could explain the results, robustness checks. If the paper presents one interpretation without considering alternatives, NO. A threats-to-validity section counts only if it discusses specific alternative explanations for the observed results, not just generic methodological limitations. NA only for papers that present no empirical results (e.g., pure surveys or taxonomies)." } } }, @@ -202,11 +202,11 @@ "properties": { "model_versions_specified": { "$ref": "#/$defs/checklist_item", - "description": "Are exact model versions or sizes specified? Look for: specific model names with version (e.g., 'gpt-4-0613', 'Claude 3.5 Sonnet', 'Llama-2-70b-chat'). If the paper says just 'GPT-4' or 'Claude' without a version or snapshot date, NO — model behavior changes across versions." + "description": "Are exact model versions or sizes specified? Look for: specific model names with version (e.g., 'gpt-4-0613', 'Claude 3.5 Sonnet', 'Llama-2-70b-chat'). 
If the paper says just 'GPT-4' or 'Claude' without a version or snapshot date, NO — model behavior changes across versions. Marketing names like 'Gemini-2.5' or 'GPT-4o' without a snapshot date or API version do NOT count as specified versions." }, "prompts_provided": { "$ref": "#/$defs/checklist_item", - "description": "Are the prompts or system instructions used in experiments provided? Look for: full prompt text in the paper or appendix, or a link to a repository containing prompts. If prompts are described only in natural language ('we asked the model to...') without the actual text, NO. NA if the paper does not use prompting." + "description": "Are the prompts or system instructions used in experiments provided? Look for: full prompt text in the paper or appendix, or a link to a repository containing prompts. If prompts are described only in natural language ('we asked the model to...') without the actual text, NO. A prompt TEMPLATE with placeholders (e.g., '[Task Description]') does NOT count unless the actual fill values are also provided — the reader must be able to reconstruct every prompt sent to the model. NA if the paper does not use prompting." }, "hyperparameters_reported": { "$ref": "#/$defs/checklist_item", @@ -214,11 +214,11 @@ }, "scaffolding_described": { "$ref": "#/$defs/checklist_item", - "description": "If the approach uses agentic scaffolding, is it described in detail? Look for: tool descriptions, workflow diagrams, retry logic, feedback mechanisms, memory/context management. If the paper says 'we used an agent' without describing the scaffold, NO. NA if no scaffolding is used." + "description": "If the approach uses agentic scaffolding, is it described in detail? Look for: tool descriptions, workflow diagrams, retry logic, feedback mechanisms, memory/context management. If the paper says 'we used an agent' without describing the scaffold, NO. NA if no scaffolding is used. 
Also NA if the paper evaluates third-party tools (e.g., Claude Code, Copilot) as black boxes — the authors cannot be expected to describe internal scaffolding they have no access to." }, "data_preprocessing_documented": { "$ref": "#/$defs/checklist_item", - "description": "Are data preprocessing and filtering steps documented? Look for: how raw data was cleaned, filtered, or transformed before use. If the paper goes from 'we collected data' to 'here are the results' without describing intermediate processing, NO." + "description": "Are data preprocessing and filtering steps documented? Look for: how raw data was cleaned, filtered, or transformed before use. If the paper goes from 'we collected data' to 'here are the results' without describing intermediate processing, NO. For survey papers: describing the filtering pipeline stages with counts (e.g., '500 initial results → 200 after title screening → 80 after full-text review') is YES only if the actual filtering CRITERIA at each stage are also stated. Listing stages without criteria is NO." } } }, @@ -241,7 +241,7 @@ }, "scope_boundaries_stated": { "$ref": "#/$defs/checklist_item", - "description": "Are scope boundaries explicitly stated (what the results do NOT show)? Look for: explicit statements about what was not tested, what populations/settings are excluded, what claims the authors are NOT making. The METR paper's Table 2 ('What the evidence does not show') is the gold standard." + "description": "Are scope boundaries explicitly stated (what the results do NOT show)? Look for: explicit statements about what was not tested, what populations/settings are excluded, what claims the authors are NOT making. The METR paper's Table 2 ('What the evidence does not show') is the gold standard. Generic limitations like 'our results may not generalize' do NOT count — the paper must state specific things it did NOT test or claim." 
} } }, @@ -265,7 +265,7 @@ }, "recruitment_methods_described": { "$ref": "#/$defs/checklist_item", - "description": "Are participant or sample recruitment methods described? Look for: how participants were found, what channels were used, whether recruitment could introduce bias. Wakefield recruited through anti-vaccine activists, biasing the sample. If participants/samples were selected without description of the selection process, NO. NA if no human participants and data source is a standard benchmark." + "description": "Are participant or sample recruitment methods described? Look for: how participants were found, what channels were used, whether recruitment could introduce bias. Wakefield recruited through anti-vaccine activists, biasing the sample. If participants/samples were selected without description of the selection process, NO. For crowd-sourced events (competitions, red-teaming), simply stating 'we ran a competition' is not enough — describe how participants were recruited and whether this introduces selection bias. NA if no human participants and data source is a standard benchmark." }, "data_pipeline_documented": { "$ref": "#/$defs/checklist_item", @@ -312,15 +312,15 @@ "properties": { "training_cutoff_stated": { "$ref": "#/$defs/checklist_item", - "description": "Is the model's training data cutoff date stated? Look for: explicit mention of when the training data ends. This is necessary to assess whether test examples could have been in the training set. If the paper uses a model without stating when its training data was collected, NO. NA if the paper does not evaluate a pre-trained model." + "description": "Is the model's training data cutoff date stated? Look for: explicit mention of when the training data ends. This is necessary to assess whether test examples could have been in the training set. If the paper uses a model without stating when its training data was collected, NO. 
NA if the paper does not evaluate a pre-trained model's capability on any benchmark (e.g., mining studies, interview studies, surveys, or studies that test defenses/tools rather than model knowledge)." }, "train_test_overlap_discussed": { "$ref": "#/$defs/checklist_item", - "description": "Is potential train/test overlap discussed? Look for: any analysis of whether test examples appeared in the training data. Canary strings, membership inference, or temporal splits all count. If the paper uses a public benchmark with a model that could have trained on it and doesn't address this, NO." + "description": "Is potential train/test overlap discussed? Look for: any analysis of whether test examples appeared in the training data. Canary strings, membership inference, or temporal splits all count. If the paper uses a public benchmark with a model that could have trained on it and doesn't address this, NO. NA if the paper does not evaluate a pre-trained model on any benchmark (same NA rule as training_cutoff_stated)." }, "benchmark_contamination_addressed": { "$ref": "#/$defs/checklist_item", - "description": "Were benchmark examples available online before the model's training cutoff? Look for: whether the benchmark was published before the model's training data was collected. HumanEval was published in 2021; any model trained after 2021 may have seen it. If the paper uses such a benchmark without discussing contamination risk, NO. NA if using a benchmark created after the model's training cutoff." + "description": "Were benchmark examples available online before the model's training cutoff? Look for: whether the benchmark was published before the model's training data was collected. HumanEval was published in 2021; any model trained after 2021 may have seen it. If the paper uses such a benchmark without discussing contamination risk, NO. 
NA if using a benchmark created after the model's training cutoff, OR if the paper does not evaluate a pre-trained model on any benchmark (same NA rule as training_cutoff_stated)." } } }, @@ -339,7 +339,7 @@ "properties": { "pre_registered": { "$ref": "#/$defs/checklist_item", - "description": "Is the study pre-registered? Look for: a link to a pre-registration (OSF, AsPredicted, ClinicalTrials.gov, AEA registry). Pre-registration commits the researchers to their analysis plan before seeing the data, preventing p-hacking and outcome switching. Very rare in CS but standard in medicine. NA if no human participants." + "description": "Is the study pre-registered? Look for: a link to a pre-registration (OSF, AsPredicted, ClinicalTrials.gov, AEA registry). Pre-registration commits the researchers to their analysis plan before seeing the data, preventing p-hacking and outcome switching. Very rare in CS but standard in medicine. NA if no human participants. Mining public repositories or analyzing public data does NOT make participants — use NA." }, "irb_or_ethics_approval": { "$ref": "#/$defs/checklist_item", @@ -355,11 +355,11 @@ }, "randomization_described": { "$ref": "#/$defs/checklist_item", - "description": "Is the randomization procedure described (if applicable)? Look for: how participants were assigned to conditions, whether randomization was stratified, what tool was used. If the paper compares treatment vs. control without explaining how assignment worked, NO. NA if not an experimental study or no human participants." + "description": "Is the randomization procedure described (if applicable)? Look for: how participants were assigned to conditions, whether randomization was stratified, what tool was used. If the paper compares treatment vs. control without explaining how assignment worked, NO. NA if not an experimental study (e.g., cross-sectional surveys, observational studies, repository mining) or no human participants." 
}, "blinding_described": { "$ref": "#/$defs/checklist_item", - "description": "Is blinding described (if applicable)? Look for: whether participants knew which condition they were in, whether evaluators knew which outputs came from which system. If applicable and not mentioned, NO. NA if blinding is not feasible or no human participants." + "description": "Is blinding described (if applicable)? Look for: whether participants knew which condition they were in, whether evaluators knew which outputs came from which system. If applicable and not mentioned, NO. NA if blinding is not feasible, no human participants, or not an experimental study (e.g., cross-sectional surveys, observational studies)." }, "attrition_reported": { "$ref": "#/$defs/checklist_item", @@ -377,7 +377,7 @@ "properties": { "inference_cost_reported": { "$ref": "#/$defs/checklist_item", - "description": "Is inference cost or latency reported? Look for: API costs, tokens consumed, wall-clock time, cost per example. If the paper proposes a method that calls GPT-4 100 times per example without mentioning cost, NO. NA if cost is clearly irrelevant (e.g., theoretical paper)." + "description": "Is inference cost or latency reported? Look for: API costs, tokens consumed, wall-clock time, cost per example. If the paper proposes a method that calls GPT-4 100 times per example without mentioning cost, NO. NA if cost is clearly irrelevant (e.g., theoretical paper, survey paper). If a survey reports costs of systems it reviews, that does NOT count — this question asks about the cost of the paper's own method." }, "compute_budget_stated": { "$ref": "#/$defs/checklist_item",
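Reviewer note: the headline figures in `analysis/calibration-summary.md` can be re-derived from the "Disagreement Direction" table alone. The sketch below is not part of the commit; the function name `summarize` and the dictionary layout are hypothetical. The grouping into "NA boundary" (either rater answered NA) versus "YES/NO flip" is an assumption that happens to reproduce the 56% / 44% split quoted in the commit message, which differs from the 48% / 41% / 11% hand-categorization in the summary.

```python
# Counts copied from the "Disagreement Direction" table; keys are (sonnet, opus).
DISAGREEMENTS = {
    ("yes", "no"): 11,
    ("no", "na"): 7,
    ("yes", "na"): 4,
    ("na", "yes"): 2,
    ("na", "no"): 2,
    ("no", "yes"): 1,
}
PAPERS = 8
QUESTIONS = 50


def summarize(disagreements, papers, questions):
    """Derive agreement statistics from per-direction disagreement counts."""
    total_pairs = papers * questions
    n_disagree = sum(disagreements.values())
    # Assumed grouping: a disagreement is an "NA boundary" case when either
    # rater answered NA; everything else is a plain YES/NO flip.
    na_boundary = sum(
        count for (sonnet, opus), count in disagreements.items()
        if "na" in (sonnet, opus)
    )
    return {
        "total_pairs": total_pairs,
        "disagreements": n_disagree,
        "agreements": total_pairs - n_disagree,
        "agreement_rate": (total_pairs - n_disagree) / total_pairs,
        "na_boundary_share": na_boundary / n_disagree,
        "yes_no_flip_share": (n_disagree - na_boundary) / n_disagree,
    }


if __name__ == "__main__":
    for key, value in summarize(DISAGREEMENTS, PAPERS, QUESTIONS).items():
        print(f"{key}: {value}")
```

Keeping the counts keyed by direction, rather than storing the percentages, means round 2 of calibration only needs the table updated; every derived figure follows from it.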
