ai-research-survey

Systematic scan of agentic development research. What's signal, what's noise.
git clone https://git.shiptheloop.com/ai-research-survey.git

commit a0bf4b555a37ff061384f55a6807ac4235ed17bf
parent b4be3d6dbb04eddb183b2a99dad0327adbe38b39
Author: Brian Graham <brian@buildingbetterteams.de>
Date:   Fri, 27 Feb 2026 22:31:36 +0100

Replace subjective 0-3 rubric with 50-question boolean checklist

Redesigned the scan instrument for verifiability and auditability:
- 50 yes/no/na questions across 11 categories (artifacts, statistical
  methodology, evaluation design, claims & evidence, setup transparency,
  limitations, data integrity, conflicts of interest, contamination,
  human studies, cost & practicality)
- Each question has detailed evaluation guidance in the schema
  description explaining exactly what to look for
- Each answer requires a justification citing specific paper sections
- Inspired by Wakefield case: added data_integrity and
  conflicts_of_interest categories to catch fabrication and undisclosed
  conflicts
- Changed model assignment from Opus to Sonnet (booleans are factual
  lookups, not subjective judgment)

Old rubric (6 dimensions, 0-3 scores) removed. Composite scores are
now derived deterministically from boolean counts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
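
The commit message says composite scores are now derived deterministically from boolean counts, but the derivation itself is not part of this diff. A minimal sketch of one way it could work is given below. The function names `category_score` and `composite_scores`, the choice to exclude `na` answers from the denominator, and the sample justifications are all assumptions made for illustration, not the repository's actual scoring code.

```python
from collections import Counter
from typing import Dict


def category_score(items: Dict[str, dict]) -> Dict[str, object]:
    """Count yes/no/na answers in one checklist category (hypothetical helper)."""
    counts = Counter(item["answer"] for item in items.values())
    applicable = counts["yes"] + counts["no"]
    # Assumption: "na" items are left out of the denominator, so a category
    # with no applicable items has no score rather than a score of zero.
    score = counts["yes"] / applicable if applicable else None
    return {
        "yes": counts["yes"],
        "no": counts["no"],
        "na": counts["na"],
        "score": score,
    }


def composite_scores(checklist: Dict[str, Dict[str, dict]]) -> Dict[str, Dict[str, object]]:
    """Apply category_score to every category of a scan's checklist object."""
    return {name: category_score(items) for name, items in checklist.items()}


# Illustrative input only: the justifications are made up for this example.
example_checklist = {
    "artifacts": {
        "code_released": {"answer": "yes", "justification": "Repository URL in footnote 1."},
        "data_released": {"answer": "yes", "justification": "Uses a public benchmark (Section 3)."},
        "environment_specified": {"answer": "no", "justification": "No mention of dependencies found in the paper."},
        "reproduction_instructions": {"answer": "no", "justification": "No mention of reproduction steps found in the paper."},
    },
    "cost_and_practicality": {
        "inference_cost_reported": {"answer": "na", "justification": "Theoretical paper."},
        "compute_budget_stated": {"answer": "na", "justification": "Theoretical paper."},
    },
}

scores = composite_scores(example_checklist)
print(scores["artifacts"]["score"])              # 0.5
print(scores["cost_and_practicality"]["score"])  # None
```

Because the same answers always produce the same counts, two scans that agree on every item agree on every score, which is the auditability property the commit message is after.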

Diffstat:
M agents/scan-agent.md    |  90 ++++++++++++++++++++++++++++++++++++++-----------------------------------------
M schema/scan.schema.json | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 404 insertions(+), 72 deletions(-)

diff --git a/agents/scan-agent.md b/agents/scan-agent.md
@@ -1,8 +1,8 @@
 # Scan Agent
 
-**Model: Opus** (requires judgment on methodology quality)
+**Model: Sonnet** (boolean checklist is factual lookup, not subjective judgment. Opus used for calibration subset only.)
 
-You are a research paper scan agent. Your job is to read a research paper and produce a structured assessment of its methodological quality.
+You are a research paper scan agent. Your job is to read a research paper and answer a structured boolean checklist about its methodological quality.
 
 ## Input
 
@@ -24,37 +24,36 @@ After writing scan.json, update the paper's status in `registry.jsonl` to `"scan
 
 Fill in the `paper` object: title, authors, year, venue, arxiv_id, doi. Use what is stated in the paper itself, not external sources.
 
-### 2. Score Each Rubric Dimension
+### 2. Answer the Boolean Checklist
 
-For each of the six dimensions, assign a score (0-3) and provide evidence from the paper text justifying that score. Be specific: cite section numbers, quote relevant passages, and reference figures or tables where applicable.
+The checklist has 50 yes/no/na questions across 11 categories. For each question:
 
-**Dimensions:**
+1. Read the question's `description` in the schema carefully — it tells you exactly what to look for and what counts as yes vs no.
+2. Search the paper text for the relevant information.
+3. Answer `yes`, `no`, or `na`.
+4. Write a 1-3 sentence justification citing specific sections, pages, or quoting the paper.
 
-- **Artifacts & Reproducibility** (0-3): Can someone reproduce this work? Look for: released code, datasets, environment specifications, reproduction instructions.
-- **Statistical Rigor** (0-3): Are the statistical methods appropriate? Look for: confidence intervals, significance tests, effect sizes, sample size justification, multiple comparisons correction.
-- **Benchmark Quality** (0-3): Are the benchmarks appropriate for the claims? Look for: benchmark relevance to claims, known limitations acknowledged, contamination checks, real-world validation.
-- **Claim-to-Evidence Ratio** (0-3): Do the claims stay within the evidence? Look for: overgeneralization, hedging vs. certainty, scope of claims vs. scope of study.
-- **Setup Transparency** (0-3): Is the experimental setup fully described? Look for: model versions, prompt text, scaffolding details, tool configurations, hyperparameters, post-processing steps.
-- **Limitations Discussion** (0-3): Does the paper honestly discuss limitations? Look for: threats to validity, scope limitations, known confounds, what the results do NOT show.
+**Rules:**
+- **yes** = the paper clearly satisfies the criterion. You must be able to point to where.
+- **no** = the paper does not satisfy the criterion, or you cannot find evidence that it does. Absence of evidence is NO, not NA.
+- **na** = the criterion genuinely does not apply to this type of paper (e.g., human_studies questions for a benchmark paper with no human participants, or blinding for an observational study where blinding is infeasible).
 
-**Scoring guide:**
-- 0 (absent): The dimension is not addressed at all
-- 1 (weak): Minimal or inadequate treatment
-- 2 (adequate): Reasonable treatment with minor gaps
-- 3 (strong): Thorough, specific, and exemplary treatment
+**Do not guess.** If you cannot find the information in the paper, the answer is NO with a justification like "No mention of [X] found in the paper."
+
+**Do not be generous.** A vague mention does not count. "We used GPT-4" without a version is NO for model_versions_specified. "Code will be released" is NO for code_released. Apply the criteria as written in the schema descriptions.
 
 ### 3. Extract Claims
 
 Identify the paper's key empirical claims. For each claim:
 - State the claim as written or closely paraphrased
 - Note the evidence provided (with section/page references)
-- Rate how well the evidence supports the claim: `strong`, `moderate`, `weak`, or `unsupported`
+- Rate support level: `strong`, `moderate`, `weak`, or `unsupported`
 
 Focus on empirical claims (things that can be verified), not opinions or motivations.
 
 ### 4. Assign Methodology Tags
 
-Assign one or more tags describing the research methodology:
+Assign one or more tags:
 - `rct` - Randomized controlled trial
 - `observational` - Observational study
 - `benchmark-eval` - Benchmark evaluation
@@ -69,65 +68,62 @@ Write a 2-4 sentence summary of the paper's most important findings. Be factual
 
 ### 6. Extract Cited Papers
 
-Scan the paper's references for other papers that fall within the survey scope (AI/LLM capability, productivity, safety, code generation, agentic workflows). For each relevant cited paper, extract:
-
-- **title**: As it appears in the references section
-- **authors**: If listed (at least first author)
+Scan the references for papers relevant to the survey scope (AI/LLM capability, productivity, safety, code generation, agentic workflows). For each, extract:
+- **title**: As it appears in the references
+- **authors**: At least first author if listed
 - **year**: If available
-- **arxiv_id**: If an arXiv URL or ID appears in the reference
+- **arxiv_id**: If visible in the reference
 - **doi**: If available
-- **relevance**: One sentence on why this paper belongs in the survey
-
-Do NOT include every reference. Only include papers that make empirical claims about AI/LLM capability, productivity, safety, or code generation. A typical paper might cite 30-60 references; you should extract 3-15 relevant ones.
+- **relevance**: One sentence on why it belongs in the survey
 
-These cited papers feed a citation-chasing pipeline: `scripts/harvest-citations.py` reads them from all scan.json files and proposes new registry entries.
+Extract 3-15 relevant references, not all of them.
 
 ### 7. Flag Red Flags
 
-Note any methodological concerns, including but not limited to:
+Note any methodological concerns:
 - Cherry-picked results or selective reporting
 - Benchmark gaming or contamination risk
-- Conflicts of interest (e.g., company evaluating its own product)
+- Conflicts of interest (company evaluating its own product)
 - Missing baselines or unfair comparisons
 - Claims that significantly outrun the evidence
 - Tiny sample sizes for the claims being made
 - No error bars or uncertainty quantification
+- Data that seems too clean or too good to be true
+- Recruitment bias (non-representative sample presented as general)
 
 If there are no red flags, return an empty array.
 
 ## Handling Different Paper Types
 
 ### Empirical papers (most common)
-Score all six dimensions normally. These are the core of the survey.
+Answer all applicable checklist items. Most items will be applicable.
 
 ### Survey / review papers
-These are in scope. Score them, but calibrate appropriately:
-- **Artifacts & Reproducibility**: Did they document search strategy, inclusion criteria, and extraction process? A rigorous survey is reproducible.
-- **Statistical Rigor**: Did they do quantitative synthesis, or just narrative summary? A survey that just lists papers without structured quality assessment scores low.
-- **Benchmark Quality**: N/A for most surveys — score based on how well they evaluated the benchmarks they discuss.
-- **Claim-to-Evidence Ratio**: Do the survey's conclusions follow from the papers reviewed, or do they overgeneralize?
-- **Setup Transparency**: Is the review methodology clear? Search terms, databases, date ranges, screening process?
-- **Limitations Discussion**: Does it acknowledge selection bias, publication bias, scope limitations?
+Many evaluation_design and statistical_methodology items will be NA. Focus on:
+- artifacts: Did they document search strategy and provide data?
+- data_integrity: Is the review methodology transparent and reproducible?
+- claims_and_evidence: Do conclusions follow from the papers reviewed?
+- limitations_and_scope: Does it acknowledge selection bias, publication bias?
 
-A survey that just collects and summarizes without quality assessment is laundering the signal-to-noise ratio of its sources. Score accordingly.
+A survey that just collects and summarizes without structured quality assessment is laundering the signal-to-noise ratio of its sources. Flag this in red_flags.
 
 ### Theoretical / position papers
-Score what applies. Statistical rigor may be N/A. Claim-to-evidence ratio still applies — are the theoretical claims well-argued?
+Most empirical checklist items will be NA. Focus on claims_and_evidence and limitations_and_scope.
 
 ## Validation
 
-Your output must be valid JSON conforming to `schema/scan.schema.json`. Specifically:
-- All six rubric dimensions must have `score` (integer 0-3) and `evidence` (string)
-- Each claim must have `claim`, `evidence`, and `supported` (one of: strong, moderate, weak, unsupported)
-- `methodology_tags` must use only the allowed values
+Your output must be valid JSON conforming to `schema/scan.schema.json`:
+- All 11 checklist categories must be present with all required items
+- Each checklist item must have `answer` (yes/no/na) and `justification` (string)
+- Each claim must have `claim`, `evidence`, and `supported`
+- `methodology_tags` must use only the allowed enum values
 - `cited_papers` must each have at least `title` and `relevance`
 - `red_flags` must each have `flag` and `detail`
 
 ## Guidelines
 
-- Be fair but rigorous. A low score is not an insult; it is information.
+- Be fair but strict. A NO is not an insult; it is information.
 - Quote the paper directly when possible.
-- If information is genuinely absent (not just hard to find), score it 0. Do not guess.
-- If you are uncertain about a score, err toward the lower score and explain your uncertainty in the evidence field.
 - Do not hallucinate content that is not in the paper.
-- A paper can be important and influential while still scoring poorly on methodology. Score what's there, not what you think should be there.
+- A paper can be important and influential while still having many NOs. Score what's there, not what you think should be there.
+- The checklist is the instrument. Follow the schema descriptions precisely.
diff --git a/schema/scan.schema.json b/schema/scan.schema.json
@@ -2,11 +2,11 @@
   "$schema": "https://json-schema.org/draft/2020-12/schema",
   "$id": "scan.schema.json",
   "title": "Paper Scan Result",
-  "description": "Structured output from the scan agent for a single research paper.",
+  "description": "Structured output from the scan agent for a single research paper. Quality assessment uses boolean checklist questions (verifiable, auditable) rather than subjective scores.",
   "type": "object",
   "required": [
     "paper",
-    "rubric",
+    "checklist",
     "claims",
     "methodology_tags",
     "key_findings",
@@ -30,29 +30,366 @@
         "doi": { "type": "string" }
       }
     },
-    "rubric": {
+    "checklist": {
       "type": "object",
-      "description": "Scores across six quality dimensions.",
+      "description": "Boolean quality checklist. Each question has a verifiable yes/no/na answer plus justification text. These replace subjective 0-3 scores with factual, auditable checks.",
       "required": [
-        "artifacts_reproducibility",
-        "statistical_rigor",
-        "benchmark_quality",
-        "claim_to_evidence",
+        "artifacts",
+        "statistical_methodology",
+        "evaluation_design",
+        "claims_and_evidence",
         "setup_transparency",
-        "limitations_discussion"
+        "limitations_and_scope",
+        "data_integrity",
+        "conflicts_of_interest",
+        "contamination",
+        "human_studies",
+        "cost_and_practicality"
       ],
       "properties": {
-        "artifacts_reproducibility": { "$ref": "#/$defs/rubric_dimension" },
-        "statistical_rigor": { "$ref": "#/$defs/rubric_dimension" },
-        "benchmark_quality": { "$ref": "#/$defs/rubric_dimension" },
-        "claim_to_evidence": { "$ref": "#/$defs/rubric_dimension" },
-        "setup_transparency": { "$ref": "#/$defs/rubric_dimension" },
-        "limitations_discussion": { "$ref": "#/$defs/rubric_dimension" }
+        "artifacts": {
+          "type": "object",
+          "description": "Can someone reproduce this work from what was released?",
+          "required": [
+            "code_released",
+            "data_released",
+            "environment_specified",
+            "reproduction_instructions"
+          ],
+          "properties": {
+            "code_released": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is source code released (e.g., GitHub link, Zenodo archive)? Look for: repository URLs in the paper, footnotes, or abstract. A promise of future release counts as NO. Code 'available upon request' counts as NO. Only YES if a working URL or archive is provided."
+            },
+            "data_released": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the dataset released or publicly available? Look for: dataset download links, references to public datasets used (e.g., 'we use the publicly available SWE-bench dataset' = YES). If they collected proprietary data and did not release it, NO. If the data is a standard public benchmark they didn't modify, YES."
+            },
+            "environment_specified": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are environment or dependency specifications provided? Look for: requirements.txt, Dockerfile, conda environment file, or a detailed 'Environment Setup' section listing library versions. Mentioning 'Python 3.x' alone is NOT enough — there must be enough detail to recreate the environment."
+            },
+            "reproduction_instructions": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are step-by-step reproduction instructions included? Look for: a README with commands to run, a 'Reproducing Results' section, or scripts that replicate the main experiments. The instructions must be specific enough that a competent researcher could follow them without guessing."
+            }
+          }
+        },
+        "statistical_methodology": {
+          "type": "object",
+          "description": "Are the numbers treated with appropriate rigor?",
+          "required": [
+            "confidence_intervals_or_error_bars",
+            "significance_tests",
+            "effect_sizes_reported",
+            "sample_size_justified",
+            "variance_reported"
+          ],
+          "properties": {
+            "confidence_intervals_or_error_bars": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are confidence intervals or error bars reported for main results? Look for: CI notation (e.g., '95% CI [x, y]'), error bars on figures, ± notation in tables. If the paper reports only point estimates (e.g., '43.2% accuracy') with no uncertainty, NO."
+            },
+            "significance_tests": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are statistical significance tests used where claims of difference are made? Look for: p-values, t-tests, Mann-Whitney U, chi-squared, ANOVA, bootstrap tests, permutation tests. If the paper claims 'X outperforms Y' based solely on comparing two numbers without any test, NO. NA if the paper makes no comparative claims."
+            },
+            "effect_sizes_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are effect sizes reported (not just p-values or raw differences)? Look for: Cohen's d, odds ratios, relative risk, percentage improvement with baseline context. A paper that says 'p < 0.05' without indicating the magnitude of the effect is NO. A paper that says '12% improvement over baseline (from 45% to 57%)' provides enough context for YES."
+            },
+            "sample_size_justified": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the sample size justified or is a power analysis discussed? Look for: explicit justification for why N participants/examples were chosen, power analysis, or acknowledgment that the sample may be too small for certain claims. If N is small and no justification is given, NO. NA for theoretical papers."
+            },
+            "variance_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is variance or standard deviation reported across experimental runs? Look for: std dev in tables, variance across seeds, multiple-run results. If the paper reports single-run numbers only, NO. If it explicitly states 'averaged over K runs with std dev' YES."
+            }
+          }
+        },
+        "evaluation_design": {
+          "type": "object",
+          "description": "Is the evaluation designed to actually test the claims?",
+          "required": [
+            "baselines_included",
+            "baselines_contemporary",
+            "ablation_study",
+            "multiple_metrics",
+            "human_evaluation",
+            "held_out_test_set",
+            "per_category_breakdown",
+            "failure_cases_discussed",
+            "negative_results_reported"
+          ],
+          "properties": {
+            "baselines_included": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are baseline comparisons included? Look for: comparison against prior work, naive baselines, or ablated versions. A paper that only reports its own system's numbers with no comparison is NO. NA for papers that define a new task with no prior work."
+            },
+            "baselines_contemporary": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are the baselines contemporary and competitive? Look for: whether the baselines are recent and represent the state of the art, or whether they are suspiciously old/weak. If the newest baseline is 3+ years old when newer alternatives exist, NO. If the paper justifies why older baselines are appropriate, YES."
+            },
+            "ablation_study": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is there an ablation study showing which components matter? Look for: experiments that remove or modify individual components to measure their contribution. NA if the system has only one component."
+            },
+            "multiple_metrics": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are multiple evaluation metrics used? Look for: at least two different metrics (e.g., accuracy AND F1, or Pass@1 AND Pass@10). If the paper reports only a single metric, NO."
+            },
+            "human_evaluation": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is human evaluation included (not just automated metrics)? Look for: human ratings, manual inspection, user studies, expert review of outputs. If evaluation is entirely automated, NO. NA if human evaluation is clearly irrelevant to the claims."
+            },
+            "held_out_test_set": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are results reported on a held-out test set (not the dev/validation set used for tuning)? Look for: explicit separation of dev and test splits. If unclear whether the reported numbers are on data used for any selection decisions, NO."
+            },
+            "per_category_breakdown": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are per-category or per-task breakdowns provided (not just overall averages)? Look for: tables showing performance on individual tasks, categories, or splits. A single aggregate number hides important variation — if a system scores 80% overall but 20% on hard cases, the average is misleading."
+            },
+            "failure_cases_discussed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are failure cases shown or discussed? Look for: error analysis, qualitative examples of failures, discussion of where the approach breaks down. If the paper only shows successes, NO."
+            },
+            "negative_results_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are negative results reported (things that didn't work)? Look for: ablations that hurt performance, approaches that were tried and abandoned, configurations that failed. If every experiment shows improvement, be skeptical — NO unless the paper explicitly addresses this."
+            }
+          }
+        },
+        "claims_and_evidence": {
+          "type": "object",
+          "description": "Do the claims stay within what the evidence supports?",
+          "required": [
+            "abstract_claims_supported",
+            "causal_claims_justified",
+            "generalization_bounded",
+            "alternative_explanations_discussed"
+          ],
+          "properties": {
+            "abstract_claims_supported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are all claims in the abstract supported by results in the paper? Read the abstract and check each empirical claim against the results section. If the abstract says 'our method achieves state-of-the-art' but the results show it's second-best, NO. If the abstract hedges appropriately, YES."
+            },
+            "causal_claims_justified": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "If the paper makes causal claims, is the study design adequate for causal inference? Look for: RCT, natural experiment, instrumental variables, difference-in-differences, or other causal identification strategies. If the paper says 'X improves Y' from observational data without addressing confounds, NO. NA if no causal claims are made."
+            },
+            "generalization_bounded": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are generalizations bounded to the tested setting? Look for: claims that extend beyond the tested models, languages, tasks, or populations. If the paper tests on Python and claims results for 'code generation' generally, NO. If it says 'on Python tasks with GPT-4' YES."
+            },
+            "alternative_explanations_discussed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are alternative explanations for the results discussed? Look for: consideration of confounds, other factors that could explain the results, robustness checks. If the paper presents one interpretation without considering alternatives, NO."
+            }
+          }
+        },
+        "setup_transparency": {
+          "type": "object",
+          "description": "Is the experimental setup described well enough to understand what was actually tested?",
+          "required": [
+            "model_versions_specified",
+            "prompts_provided",
+            "hyperparameters_reported",
+            "scaffolding_described",
+            "data_preprocessing_documented"
+          ],
+          "properties": {
+            "model_versions_specified": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are exact model versions or sizes specified? Look for: specific model names with version (e.g., 'gpt-4-0613', 'Claude 3.5 Sonnet', 'Llama-2-70b-chat'). If the paper says just 'GPT-4' or 'Claude' without a version or snapshot date, NO — model behavior changes across versions."
+            },
+            "prompts_provided": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are the prompts or system instructions used in experiments provided? Look for: full prompt text in the paper or appendix, or a link to a repository containing prompts. If prompts are described only in natural language ('we asked the model to...') without the actual text, NO. NA if the paper does not use prompting."
+            },
+            "hyperparameters_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are hyperparameters reported (temperature, top-p, max tokens, learning rate, etc.)? Look for: a hyperparameters table or section. If the paper uses an LLM API without stating temperature/sampling settings, NO — these significantly affect output."
+            },
+            "scaffolding_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "If the approach uses agentic scaffolding, is it described in detail? Look for: tool descriptions, workflow diagrams, retry logic, feedback mechanisms, memory/context management. If the paper says 'we used an agent' without describing the scaffold, NO. NA if no scaffolding is used."
+            },
+            "data_preprocessing_documented": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are data preprocessing and filtering steps documented? Look for: how raw data was cleaned, filtered, or transformed before use. If the paper goes from 'we collected data' to 'here are the results' without describing intermediate processing, NO."
+            }
+          }
+        },
+        "limitations_and_scope": {
+          "type": "object",
+          "description": "Does the paper honestly discuss what it does not show?",
+          "required": [
+            "limitations_section_present",
+            "threats_to_validity_specific",
+            "scope_boundaries_stated"
+          ],
+          "properties": {
+            "limitations_section_present": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is there a limitations or threats-to-validity section? Look for: a dedicated section or subsection titled 'Limitations', 'Threats to Validity', or similar. A single sentence buried in the conclusion does not count — there must be substantive discussion."
+            },
+            "threats_to_validity_specific": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are specific threats to validity discussed (not just boilerplate)? Look for: threats that are specific to THIS study, not generic disclaimers like 'our results may not generalize.' Good: 'Our sample of 16 developers is too small for subgroup analysis.' Bad: 'More research is needed.' If the limitations are all generic, NO."
+            },
+            "scope_boundaries_stated": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are scope boundaries explicitly stated (what the results do NOT show)? Look for: explicit statements about what was not tested, what populations/settings are excluded, what claims the authors are NOT making. The METR paper's Table 2 ('What the evidence does not show') is the gold standard."
+            }
+          }
+        },
+        "data_integrity": {
+          "type": "object",
+          "description": "Can the underlying data be verified? Inspired by cases like the Wakefield MMR paper where fabricated data went undetected for 12 years because no one could check it.",
+          "required": [
+            "raw_data_available",
+            "data_collection_described",
+            "recruitment_methods_described",
+            "data_pipeline_documented"
+          ],
+          "properties": {
+            "raw_data_available": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is raw data available for independent verification? Look for: data downloads, supplementary data files, database access. If only processed/aggregated results are shown with no way to verify the underlying data, NO. This is the check that would have caught Wakefield — if the raw medical records had been available, fabrication would have been detected immediately."
+            },
+            "data_collection_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the data collection procedure described in detail? Look for: how data was gathered, what instruments were used, what time period, what inclusion/exclusion criteria. If the paper says 'we collected N examples' without explaining how, NO."
+            },
+            "recruitment_methods_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are participant or sample recruitment methods described? Look for: how participants were found, what channels were used, whether recruitment could introduce bias. Wakefield recruited through anti-vaccine activists, biasing the sample. If participants/samples were selected without description of the selection process, NO. NA if no human participants and data source is a standard benchmark."
+            },
+            "data_pipeline_documented": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the full data pipeline from collection to final analysis documented? Look for: each transformation step, filtering criteria and how many examples were removed at each stage, any manual annotation steps. If there are unexplained jumps (e.g., 'we started with 1000 examples' then results show 500 with no explanation), NO."
+            }
+          }
+        },
+        "conflicts_of_interest": {
+          "type": "object",
+          "description": "Are potential biases from funding, affiliation, or financial interest disclosed?",
+          "required": [
+            "funding_disclosed",
+            "affiliations_disclosed",
+            "funder_independent_of_outcome",
+            "financial_interests_declared"
+          ],
+          "properties": {
+            "funding_disclosed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the funding source disclosed? Look for: an acknowledgments section listing grants, corporate sponsors, or funding agencies. If there is no mention of funding at all, NO. NA only if it's clearly unfunded work (e.g., a solo independent researcher)."
+            },
+            "affiliations_disclosed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are author affiliations with the evaluated product or company disclosed? Look for: authors who work at the company whose product is being tested. If Google employees evaluate Gemini, or OpenAI employees evaluate GPT, this must be prominent. If affiliations are listed but the conflict is not explicitly acknowledged, still YES for this question (the conflict-of-interest flag is separate)."
+            },
+            "funder_independent_of_outcome": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the funder independent of the outcome? Look for: whether the entity paying for the research has a financial interest in a particular result. Wakefield was secretly paid by lawyers suing vaccine makers. A paper funded by OpenAI evaluating GPT-4 has a non-independent funder. YES if the funder has no stake in the results, NO if they do, NA if unfunded."
+            },
+            "financial_interests_declared": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Do any authors hold patents, equity, or other financial interests related to the findings? Look for: competing interests statements, patent disclosures, author-affiliated startups. If there is no competing interests statement at all, NO — absence of disclosure is not the same as absence of conflict."
+            }
+          }
+        },
+        "contamination": {
+          "type": "object",
+          "description": "Could the model have seen the test data during training? This is the 'you measured it wrong' problem — if the benchmark is in the training data, the results are meaningless.",
+          "required": [
+            "training_cutoff_stated",
+            "train_test_overlap_discussed",
+            "benchmark_contamination_addressed"
+          ],
+          "properties": {
+            "training_cutoff_stated": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the model's training data cutoff date stated? Look for: explicit mention of when the training data ends. This is necessary to assess whether test examples could have been in the training set. If the paper uses a model without stating when its training data was collected, NO. NA if the paper does not evaluate a pre-trained model."
+            },
+            "train_test_overlap_discussed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is potential train/test overlap discussed? Look for: any analysis of whether test examples appeared in the training data. Canary strings, membership inference, or temporal splits all count. If the paper uses a public benchmark with a model that could have trained on it and doesn't address this, NO."
+            },
+            "benchmark_contamination_addressed": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Were benchmark examples available online before the model's training cutoff? Look for: whether the benchmark was published before the model's training data was collected. HumanEval was published in 2021; any model trained after 2021 may have seen it. If the paper uses such a benchmark without discussing contamination risk, NO. NA if using a benchmark created after the model's training cutoff."
+            }
+          }
+        },
+        "human_studies": {
+          "type": "object",
+          "description": "For papers involving human participants. All items NA if the paper has no human subjects.",
+          "required": [
+            "pre_registered",
+            "irb_or_ethics_approval",
+            "demographics_reported",
+            "inclusion_exclusion_criteria",
+            "randomization_described",
+            "blinding_described",
+            "attrition_reported"
+          ],
+          "properties": {
+            "pre_registered": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the study pre-registered? Look for: a link to a pre-registration (OSF, AsPredicted, ClinicalTrials.gov, AEA registry). Pre-registration commits the researchers to their analysis plan before seeing the data, preventing p-hacking and outcome switching. Very rare in CS but standard in medicine. NA if no human participants."
+            },
+            "irb_or_ethics_approval": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is IRB or ethics board approval mentioned? Look for: 'This study was approved by [institution] IRB' or equivalent. If the study collects data from human participants without mentioning ethics review, NO. NA if no human participants."
+            },
+            "demographics_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are participant demographics reported? Look for: experience level, years of experience, gender, geographic distribution, programming languages known, company size. If the paper says 'N developers' without characterizing them, NO. NA if no human participants."
+            },
+            "inclusion_exclusion_criteria": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Are inclusion and exclusion criteria for participants stated? Look for: who was eligible, who was excluded and why, any screening process. If participants just 'were recruited' with no selection criteria, NO. NA if no human participants."
+            },
+            "randomization_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the randomization procedure described (if applicable)? Look for: how participants were assigned to conditions, whether randomization was stratified, what tool was used. If the paper compares treatment vs. control without explaining how assignment worked, NO. NA if not an experimental study or no human participants."
+            },
+            "blinding_described": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is blinding described (if applicable)? Look for: whether participants knew which condition they were in, whether evaluators knew which outputs came from which system. If applicable and not mentioned, NO. NA if blinding is not feasible or no human participants."
+            },
+            "attrition_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is participant attrition or dropout reported? Look for: how many participants started vs. finished, reasons for dropout, whether intention-to-treat analysis was used. If participants are mentioned at the start but the final N is smaller with no explanation, NO. NA if no human participants."
+            }
+          }
+        },
+        "cost_and_practicality": {
+          "type": "object",
+          "description": "Is the practical cost of the approach reported? Important for assessing real-world applicability.",
+          "required": [
+            "inference_cost_reported",
+            "compute_budget_stated"
+          ],
+          "properties": {
+            "inference_cost_reported": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is inference cost or latency reported? Look for: API costs, tokens consumed, wall-clock time, cost per example. If the paper proposes a method that calls GPT-4 100 times per example without mentioning cost, NO. NA if cost is clearly irrelevant (e.g., theoretical paper)."
+            },
+            "compute_budget_stated": {
+              "$ref": "#/$defs/checklist_item",
+              "description": "Is the total computational budget stated? Look for: GPU hours, total API spend, hardware used, training time. If the approach required significant compute and this is not quantified, NO."
+            }
+          }
+        }
       }
     },
     "claims": {
       "type": "array",
-      "description": "Key claims extracted from the paper with supporting evidence.",
+      "description": "Key empirical claims extracted from the paper with supporting evidence.",
       "items": {
         "type": "object",
         "required": ["claim", "evidence", "supported"],
@@ -149,19 +486,18 @@
     }
   },
   "$defs": {
-    "rubric_dimension": {
+    "checklist_item": {
       "type": "object",
-      "required": ["score", "evidence"],
+      "required": ["answer", "justification"],
       "properties": {
-        "score": {
-          "type": "integer",
-          "minimum": 0,
-          "maximum": 3,
-          "description": "0=absent, 1=weak, 2=adequate, 3=strong"
+        "answer": {
+          "type": "string",
+          "enum": ["yes", "no", "na"],
+          "description": "yes = the paper satisfies this criterion. no = it does not. na = the criterion does not apply to this type of paper."
         },
-        "evidence": {
+        "justification": {
           "type": "string",
-          "description": "Justification for the score, citing specific sections or text from the paper."
+          "description": "1-3 sentences explaining the answer. Cite specific sections, pages, or quote the paper where possible. For NO answers, state what is missing. For NA answers, state why the criterion does not apply."
         }
       }
     }
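
As an illustration only (not part of this commit), the `checklist_item` definition in the diff can be checked with a few lines of standard-library Python. The sketch mirrors the schema's `required` fields and its `answer` enum. The helper names `validate_item` and `tally` and the sample answers are hypothetical. The tally is a plain count per answer, not the repository's composite score, whose formula does not appear in this diff.

```python
from collections import Counter

# The enum from $defs/checklist_item in the diff.
ALLOWED_ANSWERS = ("yes", "no", "na")


def validate_item(item):
    """Return a list of problems found in one checklist item (empty if valid)."""
    if not isinstance(item, dict):
        return ["item is not an object"]
    problems = []
    # The schema marks both fields as required.
    for key in ("answer", "justification"):
        if key not in item:
            problems.append(f"missing required field: {key}")
    if "answer" in item and item["answer"] not in ALLOWED_ANSWERS:
        problems.append(f"answer must be one of {ALLOWED_ANSWERS}")
    if "justification" in item and not isinstance(item["justification"], str):
        problems.append("justification must be a string")
    return problems


def tally(category):
    """Count yes/no/na answers in one category object (hypothetical helper)."""
    counts = Counter(item["answer"] for item in category.values())
    return {answer: counts.get(answer, 0) for answer in ALLOWED_ANSWERS}


# Hypothetical answers for the conflicts_of_interest category of some paper.
conflicts_of_interest = {
    "funding_disclosed": {
        "answer": "yes",
        "justification": "The acknowledgments section lists a public grant.",
    },
    "affiliations_disclosed": {
        "answer": "yes",
        "justification": "Author affiliations are listed on page 1.",
    },
    "funder_independent_of_outcome": {
        "answer": "yes",
        "justification": "The funder has no product evaluated in the paper.",
    },
    "financial_interests_declared": {
        "answer": "no",
        "justification": "No competing interests statement is present.",
    },
}

for name, item in conflicts_of_interest.items():
    for problem in validate_item(item):
        print(f"{name}: {problem}")

print(tally(conflicts_of_interest))
```

A full validator would load `schema/scan.schema.json` and check every category against its `required` list; the hand-written checks here only cover the two fields of a single item.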
