scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

      1 {
      2   "scan_version": 4,
      3   "paper_type": "empirical",
      4   "paper": {
      5     "title": "On The Fragility of Benchmark Contamination Detection in Reasoning Models",
      6     "authors": [
      7       "Han Wang",
      8       "Haoyu Li",
      9       "Brian Ko",
     10       "Huan Zhang"
     11     ],
     12     "year": 2025,
     13     "venue": "arXiv.org",
     14     "arxiv_id": "2510.02386",
     15     "doi": "10.48550/arXiv.2510.02386"
     16   },
     17   "checklist": {
     18     "claims_and_evidence": {
     19       "abstract_claims_supported": {
     20         "applies": true,
     21         "answer": true,
     22         "justification": "The abstract claims that (1) GRPO conceals contamination signals, supported by Table 2, which shows average AUROC drops of 14 to 20 points; (2) PPO-style clipping is the root cause, supported by Theorem 3.1 and the ablation in Table 3; and (3) detection methods perform near random on advanced LRMs, supported by Table 5, where AUROC averages ~50-55%.",
     23         "source": "opus"
     24       },
     25       "causal_claims_justified": {
     26         "applies": true,
     27         "answer": true,
     28         "justification": "The causal claim that 'PPO-style importance sampling and clipping objectives are the root cause of detection concealment' is supported by (1) theoretical analysis (Theorem 3.1), (2) controlled ablation removing clipping from RAFT++/GRPO (Table 3), and (3) comparison of RAFT (no clipping, no concealment) vs RAFT++ (with clipping, concealment).",
     29         "source": "opus"
     30       },
     31       "generalization_bounded": {
     32         "applies": true,
     33         "answer": false,
     34         "justification": "The title claims fragility of 'benchmark contamination detection in reasoning models' broadly, but experiments use only 7-8B parameter models (Qwen2.5-7B-Instruct, DeepSeek-R1-Distill-Llama-8B/Qwen-7B) on 6 math/science benchmarks. No results on larger models, coding benchmarks, or other reasoning domains. The paper does not bound its claims to the tested scale or domains.",
     35         "source": "opus"
     36       },
     37       "alternative_explanations_discussed": {
     38         "applies": true,
     39         "answer": true,
     40         "justification": "Section 3.1 explicitly rules out alternative explanations: (1) 'simply training with more clean samples' by comparing continued SFT vs GRPO (Figure 2, Table 14); (2) 'further training makes models forget contamination' by showing Pass@1 inflation persists (Table 1). Section 4 discusses the alternative that LRMs generalize rather than memorize.",
     41         "source": "opus"
     42       },
     43       "proxy_outcome_distinction": {
     44         "applies": true,
     45         "answer": true,
     46         "justification": "The paper measures AUROC for detection and Pass@1 for benchmark performance — both directly match the claimed constructs (detection effectiveness and performance inflation). No proxy gap exists between measurements and claims.",
     47         "source": "opus"
     48       }
     49     },
     50     "limitations_and_scope": {
     51       "limitations_section_present": {
     52         "applies": true,
     53         "answer": true,
     54         "justification": "Appendix A contains a dedicated 'LIMITATIONS' section discussing the scope of the findings and acknowledging that no new detection algorithm is proposed.",
     55         "source": "opus"
     56       },
     57       "threats_to_validity_specific": {
     58         "applies": true,
     59         "answer": false,
     60         "justification": "The limitations section (Appendix A) discusses the implications of their findings but does not address specific threats to their own study's validity, such as whether results generalize beyond 7-8B models, whether the small benchmark sizes (15 member items in AIME) introduce noise, or whether base model pre-training contamination could confound results.",
     61         "source": "opus"
     62       },
     63       "scope_boundaries_stated": {
     64         "applies": true,
     65         "answer": false,
     66         "justification": "The paper does not explicitly state what the results do NOT show. It generalizes from a few 7-8B models to 'LRMs' broadly without bounding claims to the tested scale, model families, or benchmark domains.",
     67         "source": "opus"
     68       }
     69     },
     70     "conflicts_of_interest": {
     71       "funding_disclosed": {
     72         "applies": true,
     73         "answer": false,
     74         "justification": "No funding, acknowledgments, or grants section is present in the paper.",
     75         "source": "opus"
     76       },
     77       "affiliations_disclosed": {
     78         "applies": true,
     79         "answer": true,
     80         "justification": "Author affiliations are clearly listed: University of Illinois Urbana-Champaign and University of Washington. None of the authors are affiliated with the model providers being evaluated (DeepSeek, Qwen/Alibaba).",
     81         "source": "opus"
     82       },
     83       "funder_independent_of_outcome": {
     84         "applies": true,
     85         "answer": false,
     86         "justification": "Funder independence cannot be assessed because no funding source is disclosed.",
     87         "source": "opus"
     88       },
     89       "financial_interests_declared": {
     90         "applies": true,
     91         "answer": false,
     92         "justification": "No competing interests or financial interests statement appears in the paper.",
     93         "source": "opus"
     94       }
     95     },
     96     "scope_and_framing": {
     97       "key_terms_defined": {
     98         "applies": true,
     99         "answer": true,
    100         "justification": "Key terms are defined precisely: 'benchmark contamination' is defined in the introduction, 'SFT contamination' and 'RL contamination' are formally defined in Section 3, and 'member'/'non-member' sets are explicitly defined with the detection setup.",
    101         "source": "haiku"
    102       },
    103       "intended_contribution_clear": {
    104         "applies": true,
    105         "answer": true,
    106         "justification": "The paper explicitly states 'we present the first systematic study of benchmark contamination in LRMs' and structures contributions around two scenarios; the contribution (vulnerability analysis + theoretical explanation) is unambiguous.",
    107         "source": "haiku"
    108       },
    109       "engagement_with_prior_work": {
    110         "applies": true,
    111         "answer": true,
    112         "justification": "Section 2 provides substantive related work covering LRMs, contamination detection methods (categorized into generation-based, perturbation-based, reference-based, reference-free), and contamination concealment prior work, explaining how this work differs from and extends each.",
    113         "source": "haiku"
    114       }
    115     }
    116   },
    117   "type_checklist": {
    118     "empirical": {
    119       "artifacts": {
    120         "code_released": {
    121           "applies": true,
    122           "answer": true,
    123           "justification": "The abstract states 'Our code is available at https://github.com/ASTRAL-Group/LRM_Conta_Detection_Arena.git' — a working repository URL is provided.",
    124           "source": "opus"
    125         },
    126         "data_released": {
    127           "applies": true,
    128           "answer": true,
    129           "justification": "All evaluation benchmarks are publicly available (AIME 2024/2025, AMC 2023, GPQA Diamond, OlympiadBench, Minerva Math). Clean training data comes from public datasets (OpenThoughts3, DeepMath-103K).",
    130           "source": "opus"
    131         },
    132         "environment_specified": {
    133           "applies": true,
    134           "answer": false,
    135           "justification": "Appendix F lists hardware (9× NVIDIA L40S, CUDA 12.8, Ubuntu 22.04) and mentions frameworks (LLaMA-Factory, Verl, FlashAttention-2, DeepSpeed ZeRO-1, Liger kernels, vLLM), but no requirements.txt, Dockerfile, or versioned dependency list is provided. Framework versions are cited only by arXiv paper references.",
    136           "source": "opus"
    137         },
    138         "reproduction_instructions": {
    139           "applies": true,
    140           "answer": false,
    141           "justification": "The paper provides detailed experimental setup in Appendix D (contamination pipelines, hyperparameters, prompt templates) and a code repository URL, but no step-by-step reproduction instructions (e.g., 'run this command') are included in the paper itself.",
    142           "source": "opus"
    143         }
    144       },
    145       "statistical_methodology": {
    146         "confidence_intervals_or_error_bars": {
    147           "applies": true,
    148           "answer": false,
    149           "justification": "Tables 1-5 report point estimates for AUROC and Pass@1 with no confidence intervals or error bars. Results are 'averaged over detection scores from 8 rollouts' but no uncertainty measures are provided.",
    150           "source": "opus"
    151         },
    152         "significance_tests": {
    153           "applies": true,
    154           "answer": false,
    155           "justification": "Claims that GRPO 'conceals' contamination and that detection methods 'perform near random guess' are based on comparing raw AUROC numbers without any statistical significance tests (no p-values, t-tests, or bootstrap tests).",
    156           "source": "opus"
    157         },
    158         "effect_sizes_reported": {
    159           "applies": true,
    160           "answer": true,
    161           "justification": "Tables 2, 3, and 5 include a Δ column reporting absolute AUROC changes relative to the baseline (e.g., -14.22, -16.42 for GRPO concealment). Table 1 reports Pass@1 with absolute differences across conditions, providing magnitude context.",
    162           "source": "opus"
    163         },
    164         "sample_size_justified": {
    165           "applies": true,
    166           "answer": false,
    167           "justification": "No justification is given for using these particular benchmarks or their sizes. Some benchmarks are very small (AIME 2024/2025 have only 30 problems each, yielding 15-member evaluation sets), and no power analysis or sample size rationale is provided.",
    168           "source": "opus"
    169         },
    170         "variance_reported": {
    171           "applies": true,
    172           "answer": false,
    173           "justification": "AUROC is averaged over 8 rollouts and Pass@1 over 3-10 rollouts, but no standard deviations, interquartile ranges, or other spread measures are reported for any results.",
    174           "source": "opus"
    175         }
    176       },
    177       "evaluation_design": {
    178         "baselines_included": {
    179           "applies": true,
    180           "answer": true,
    181           "justification": "10 representative contamination detection methods are evaluated spanning generation-based, perturbation-based, reference-based, and reference-free approaches (Table 2). Multiple training regimes (SFT only, SFT+GRPO, SFT+RAFT, SFT+RAFT++) serve as comparison conditions.",
    182           "source": "opus"
    183         },
    184         "baselines_contemporary": {
    185           "applies": true,
    186           "answer": true,
    187           "justification": "Detection methods range from 2018-2025, including recent work like Min-K%++ (Zhang et al., 2024), DICE (Tu et al., 2024), CDD (Dong et al., 2024), and Verbatim (Wu et al., 2025). The RL algorithms (GRPO, RAFT++) are also contemporary.",
    188           "source": "opus"
    189         },
    190         "ablation_study": {
    191           "applies": true,
    192           "answer": true,
    193           "justification": "Table 3 presents a carefully designed ablation isolating the effect of importance sampling and clipping. RAFT (no clipping), RAFT++ (with/without clipping), and GRPO (with/without clipping) are compared, directly testing the theoretical prediction.",
    194           "source": "opus"
    195         },
    196         "multiple_metrics": {
    197           "applies": true,
    198           "answer": true,
    199           "justification": "The paper reports both AUROC for detection performance and Pass@1 for benchmark performance/contamination inflation, providing complementary views of the contamination problem.",
    200           "source": "opus"
    201         },
    202         "human_evaluation": {
    203           "applies": true,
    204           "answer": false,
    205           "justification": "No human evaluation is included. All evaluation is automated (AUROC computation, Pass@1 with automated answer checking). Human expert review of detection failures or contamination evidence could have strengthened the analysis.",
    206           "source": "opus"
    207         },
    208         "held_out_test_set": {
    209           "applies": true,
    210           "answer": true,
    211           "justification": "For each benchmark, 'we randomly sample half of the questions as the member set (used for contamination) and leave the remaining half as the non-member set (for detection evaluation)' (Section 3). The member/non-member split provides clear separation.",
    212           "source": "opus"
    213         },
    214         "per_category_breakdown": {
    215           "applies": true,
    216           "answer": true,
    217           "justification": "Results are broken down by each of the 6 benchmarks (OlympiadBench, GPQA, AIME25, AIME24, Minerva, AMC23) and by each detection method category (generation-based, perturbation-based, reference-based, reference-free) in all tables.",
    218           "source": "opus"
    219         },
    220         "failure_cases_discussed": {
    221           "applies": true,
    222           "answer": true,
    223           "justification": "Extensive failure analysis is provided: log-probability distributions show why detection fails (Figures 3-4), the discussion in Section 4 explains why LRMs generalize rather than memorize, and embedding visualizations (Appendix E.6, Figures 10-12) show member/non-member indistinguishability.",
    224           "source": "opus"
    225         },
    226         "negative_results_reported": {
    227           "applies": true,
    228           "answer": true,
    229           "justification": "RAFT does NOT conceal contamination (Table 3, Δ=+2.03), RL contamination provides 'no significant difference compared to using a clean RL training set' (Section 3.1), and further SFT does NOT conceal contamination (Figure 2, Table 14). These are explicitly reported negative results.",
    230           "source": "opus"
    231         }
    232       },
    233       "setup_transparency": {
    234         "model_versions_specified": {
    235           "applies": true,
    236           "answer": true,
    237           "justification": "Specific model checkpoints are named: 'Qwen2.5-7B-Instruct', 'DeepSeek-R1-Distill-Llama-8B', 'DeepSeek-R1-Distill-Qwen-7B', 'QwQ-32B' for distillation, and 'bespokelabs/Bespoke-Stratos-7B' as the reference model. These are identifiable checkpoints.",
    238           "source": "opus"
    239         },
    240         "prompts_provided": {
    241           "applies": true,
    242           "answer": true,
    243           "justification": "Appendix D.4 provides the actual prompt templates used: the math reasoning template ('{question}\\nPlease reason step by step, and put your final answer within \\boxed{}.') and the multiple-choice template, with concrete examples showing how they are instantiated.",
    244           "source": "opus"
    245         },
    246         "hyperparameters_reported": {
    247           "applies": true,
    248           "answer": true,
    249           "justification": "Appendix D.4 provides comprehensive hyperparameter tables for SFT (batch size 128, LR 4e-5, 5 epochs, cosine scheduler, etc.) and RL (batch size 64, ε=0.2, LR 1e-6, rollout num 4, temp 0.6) training, plus inference settings (temperature=0.6, top_p=0.95, max_new_tokens=32768).",
    250           "source": "opus"
    251         },
    252         "scaffolding_described": {
    253           "applies": false,
    254           "answer": false,
    255           "justification": "No agentic scaffolding is used. The paper trains and evaluates language models directly without any scaffolding layer.",
    256           "source": "opus"
    257         },
    258         "data_preprocessing_documented": {
    259           "applies": true,
    260           "answer": true,
    261           "justification": "Appendix D.1 describes the contamination pipeline: 10K clean samples from OpenThoughts3, distillation with QwQ-32B using rejection sampling (64 rollouts), 3× replication of member data. Deduplication uses '13-gram overlap deduplication' against evaluation benchmarks. Table 6 shows the proportion of questions solved after distillation rollouts.",
    262           "source": "opus"
    263         }
    264       },
    265       "data_integrity": {
    266         "raw_data_available": {
    267           "applies": true,
    268           "answer": false,
    269           "justification": "While code is released and benchmarks are public, the paper does not release raw detection scores, contaminated model checkpoints, or intermediate experimental data needed for independent verification of the reported AUROC and Pass@1 values.",
    270           "source": "opus"
    271         },
    272         "data_collection_described": {
    273           "applies": true,
    274           "answer": true,
    275           "justification": "Appendix D describes data collection in detail: contamination pipeline construction (D.1), detection method implementations (D.2), benchmark descriptions with sizes (D.3), and SFT/RL implementation specifics (D.4).",
    276           "source": "opus"
    277         },
    278         "recruitment_methods_described": {
    279           "applies": false,
    280           "answer": false,
    281           "justification": "No human participants. All data sources are standard public benchmarks (AIME, AMC, GPQA Diamond, OlympiadBench, Minerva Math) and public training datasets.",
    282           "source": "opus"
    283         },
    284         "data_pipeline_documented": {
    285           "applies": true,
    286           "answer": true,
    287           "justification": "The full pipeline is documented: base model → SFT contamination (10K clean + 1,866 member samples after 3× replication = 11,866 samples) → RL training (4,096 clean + optional members). Detection pipeline: 8 rollouts per question → compute detection score per rollout → average. The deduplication step is documented.",
    288           "source": "opus"
    289         }
    290       },
    291       "contamination": {
    292         "training_cutoff_stated": {
    293           "applies": true,
    294           "answer": false,
    295           "justification": "No training data cutoff dates are stated for the base models (Qwen2.5-7B-Instruct, DeepSeek-R1-Distill models). While the paper controls its own contamination, the base models may have already been exposed to the evaluation benchmarks during pre-training.",
    296           "source": "opus"
    297         },
    298         "train_test_overlap_discussed": {
    299           "applies": true,
    300           "answer": true,
    301           "justification": "Train/test overlap is the central topic. The paper explicitly splits benchmarks into member/non-member halves and uses 13-gram deduplication on clean training data against evaluation benchmarks to prevent unintended overlap.",
    302           "source": "opus"
    303         },
    304         "benchmark_contamination_addressed": {
    305           "applies": true,
    306           "answer": true,
    307           "justification": "Benchmark contamination is the entire subject of the paper. They control contamination as an experimental variable and explicitly address it through member/non-member splits and deduplication of clean data.",
    308           "source": "opus"
    309         }
    310       },
    311       "human_studies": {
    312         "pre_registered": {
    313           "applies": false,
    314           "answer": false,
    315           "justification": "No human participants in this study.",
    316           "source": "opus"
    317         },
    318         "irb_or_ethics_approval": {
    319           "applies": false,
    320           "answer": false,
    321           "justification": "No human participants in this study.",
    322           "source": "opus"
    323         },
    324         "demographics_reported": {
    325           "applies": false,
    326           "answer": false,
    327           "justification": "No human participants in this study.",
    328           "source": "opus"
    329         },
    330         "inclusion_exclusion_criteria": {
    331           "applies": false,
    332           "answer": false,
    333           "justification": "No human participants in this study.",
    334           "source": "opus"
    335         },
    336         "randomization_described": {
    337           "applies": false,
    338           "answer": false,
    339           "justification": "No human participants in this study.",
    340           "source": "opus"
    341         },
    342         "blinding_described": {
    343           "applies": false,
    344           "answer": false,
    345           "justification": "No human participants in this study.",
    346           "source": "opus"
    347         },
    348         "attrition_reported": {
    349           "applies": false,
    350           "answer": false,
    351           "justification": "No human participants in this study.",
    352           "source": "opus"
    353         }
    354       },
    355       "cost_and_practicality": {
    356         "inference_cost_reported": {
    357           "applies": true,
    358           "answer": false,
    359           "justification": "No inference costs, API costs, or per-example wall-clock times are reported. The paper generates many rollouts per question (8 for detection, 3-10 for Pass@1 across 6 benchmarks) but does not quantify the computational cost.",
    360           "source": "opus"
    361         },
    362         "compute_budget_stated": {
    363           "applies": true,
    364           "answer": false,
    365           "justification": "Appendix F states hardware specs (9× NVIDIA L40S GPUs, 48 GiB each) but does not report total GPU hours, wall-clock training time, or total compute spent across all experiments.",
    366           "source": "opus"
    367         }
    368       },
    369       "experimental_rigor": {
    370         "seed_sensitivity_reported": {
    371           "applies": true,
    372           "answer": false,
    373           "justification": "No results across multiple random seeds are reported. The member/non-member split is performed once ('we randomly sample half of the questions') with no sensitivity analysis across different random splits.",
    374           "source": "opus"
    375         },
    376         "number_of_runs_stated": {
    377           "applies": true,
    378           "answer": true,
    379           "justification": "The paper states 'Each AUROC is averaged over detection scores from 8 rollouts' and 'we evaluate pass@1 and run 10 rollouts on AIME 2024 & 2025, AMC 2023, and 3 rollouts on OlympiadBench, GPQA Diamond, and Minerva Math' (Appendix D.4).",
    380           "source": "opus"
    381         },
    382         "hyperparameter_search_budget": {
    383           "applies": true,
    384           "answer": false,
    385           "justification": "No hyperparameter search is reported. The paper adopts SFT hyperparameters 'suggested by OpenThought3' and uses fixed RL hyperparameters without stating whether alternatives were explored.",
    386           "source": "opus"
    387         },
    388         "best_config_selection_justified": {
    389           "applies": true,
    390           "answer": true,
    391           "justification": "The paper uses recommended hyperparameters from prior work ('we adopt the SFT hyperparameters suggested by OpenThought3 for medium dataset scales', Appendix D.4), providing transparent justification for configuration choices rather than cherry-picking.",
    392           "source": "opus"
    393         },
    394         "multiple_comparison_correction": {
    395           "applies": false,
    396           "answer": false,
    397           "justification": "No statistical tests are performed, so multiple comparison correction is inapplicable.",
    398           "source": "opus"
    399         },
    400         "self_comparison_bias_addressed": {
    401           "applies": true,
    402           "answer": false,
    403           "justification": "The authors re-implement 10 detection methods from other groups but do not acknowledge the potential bias of their own implementations of these baselines. No independent evaluation or verification of implementation correctness is discussed.",
    404           "source": "opus"
    405         },
    406         "compute_budget_vs_performance": {
    407           "applies": true,
    408           "answer": true,
    409           "justification": "Figure 2 and Tables 11-12 show AUROC as a function of RL training steps (0, 64, 110, 156 steps), providing performance-vs-compute curves that reveal the monotonic decline in detection with more training.",
    410           "source": "opus"
    411         },
    412         "benchmark_construct_validity": {
    413           "applies": true,
    414           "answer": false,
    415           "justification": "The paper uses AUROC as the sole measure of detection effectiveness without discussing whether it adequately captures real-world detection utility (e.g., at what false positive rate a method becomes useful). It also does not discuss whether a random member/non-member split on small benchmarks is a valid construct for evaluating practical detectability.",
    416           "source": "opus"
    417         },
    418         "scaffold_confound_addressed": {
    419           "applies": false,
    420           "answer": false,
    421           "justification": "No scaffolding is used. Models are trained and evaluated directly.",
    422           "source": "opus"
    423         }
    424       },
    425       "data_leakage": {
    426         "temporal_leakage_addressed": {
    427           "applies": true,
    428           "answer": false,
    429           "justification": "The paper does not discuss whether the base models (Qwen2.5-7B-Instruct, DeepSeek-R1-Distill) may have already been exposed to the evaluation benchmarks (AIME, GPQA, etc.) during pre-training. AIME 2024 problems may have appeared in 2025-era model training data.",
    430           "source": "opus"
    431         },
    432         "feature_leakage_addressed": {
    433           "applies": true,
    434           "answer": false,
    435           "justification": "The paper does not discuss whether the evaluation setup inadvertently leaks information. For example, non-members drawn from the same benchmark may share structural features with members, so they are not truly 'unseen' in distribution.",
    436           "source": "opus"
    437         },
    438         "non_independence_addressed": {
    439           "applies": true,
    440           "answer": false,
    441           "justification": "Members and non-members are randomly split from the same benchmarks but share strong distributional similarity (same competition, same difficulty distribution). This non-independence is not discussed and could affect AUROC measurements.",
    442           "source": "opus"
    443         },
    444         "leakage_detection_method": {
    445           "applies": true,
    446           "answer": true,
    447           "justification": "The paper uses '13-gram overlap deduplication' on clean training datasets against evaluation benchmarks 'to ensure conclusive results' (Appendix D.4). This is a concrete decontamination method applied to their own training data.",
    448           "source": "opus"
    449         }
    450       }
    451     }
    452   },
    453   "claims": [
    454     {
    455       "claim": "SFT contamination of the base model is initially detectable by existing contamination detection methods (e.g., LiRA achieves 89.13% AUROC before RL)",
    456       "evidence": "Table 2 shows AUROC values before RL: LiRA 89.13%, Min-K% 74.96%, Max-K% 69.83% on average across six benchmarks",
    457       "supported": "strong"
    458     },
    459     {
    460       "claim": "Subsequent GRPO training with clean data markedly conceals SFT contamination signals across all 10 detection methods",
    461       "evidence": "Table 2 shows consistent AUROC decreases after GRPO: Max-K% drops 19.84pp, Loss drops 16.68pp, Min-K% drops 16.42pp on average",
    462       "supported": "strong"
    463     },
    464     {
    465       "claim": "PPO-style importance sampling/clipping is the root cause of contamination concealment, not simply more training",
    466       "evidence": "Table 3 ablation: removing clipping from GRPO restores Loss AUROC from 61.26% to 73.28%; RAFT (no clipping) maintains AUROC at 77.51% vs RAFT++ (with clipping) dropping to 57.58%",
    467       "supported": "strong"
    468     },
    469     {
    470       "claim": "Contamination inflation mainly comes from SFT, not RL training",
    471       "evidence": "Table 1: SFT contamination adds +8.82pp Pass@1 on average while RL contamination shows no significant difference vs clean RL training",
    472       "supported": "strong"
    473     },
    474     {
    475       "claim": "Extensive SFT contamination with CoT on advanced LRMs leaves minimal evidence for existing detection methods (near random AUROC)",
    476       "evidence": "Table 5: detection methods applied to Stage II contaminated models average 46-63% AUROC across all three LRMs and six benchmarks, compared to 50% random baseline",
    477       "supported": "strong"
    478     },
    479     {
    480       "claim": "LRMs generalize from contaminated member data to distributionally similar non-members, explaining why Stage II contamination is undetectable",
    481       "evidence": "Figures 4, 8, and 9 show that the log-probabilities of both members and non-members increase by similar margins after extensive SFT contamination with CoT",
    482       "supported": "moderate"
    483     },
    484     {
    485       "claim": "Further SFT or RL training does not cause models to forget contamination — performance inflation persists while detection evidence disappears",
    486       "evidence": "Table 14 shows continued performance inflation after 4 additional clean SFT epochs; Table 1 shows contaminated models retain 7.14pp average inflation after GRPO",
    487       "supported": "strong"
    488     }
    489   ],
    490   "methodology_tags": [
    491     "benchmark-eval",
    492     "theoretical"
    493   ],
    494   "key_findings": "Existing benchmark contamination detection methods are critically fragile against LRMs. In Stage I, SFT-induced contamination that is initially detectable (LiRA 89.13% AUROC) is effectively concealed by only 64-156 steps of GRPO training on clean data, with AUROC falling toward random for most methods. The root cause is PPO-style importance sampling/clipping, which contracts the log-probability gap between member and non-member samples; both ablation experiments and theoretical analysis confirm this effect. In Stage II, extensive SFT contamination with CoT on advanced LRMs (DeepSeek-R1-Distill variants) produces 10-12pp performance inflation while all detection methods fail (near 50% AUROC). The reason is that LRMs generalize from contaminated training data to distributionally similar unseen samples, which undermines the memorization assumption underlying all existing detectors. The paper characterizes the vulnerability but does not propose new detection methods.",
    495   "red_flags": [
    496     {
    497       "flag": "No statistical significance testing",
    498       "detail": "AUROC differences are reported without significance tests; a 5pp drop with n=30 benchmark items (AIME) could easily be noise, yet results are treated as conclusive."
    499     },
    500     {
    501       "flag": "No confidence intervals or error bars",
    502       "detail": "All AUROC and Pass@1 results are point estimates averaged over 8 rollouts with no variance reported, making it impossible to assess reliability of results."
    503     },
    504     {
    505       "flag": "RL steps far below real-world scale",
    506       "detail": "The paper uses max 156 GRPO steps and explicitly acknowledges this 'is still far fewer than the steps used in some advanced open-sourced reasoning models,' limiting the strength of generalization claims."
    507     },
    508     {
    509       "flag": "Math and science reasoning benchmarks only",
    510       "detail": "All experiments use mathematical and scientific reasoning benchmarks (AIME, GPQA, Olympiad, Minerva, AMC); generalization to code, language understanding, or other domains is untested."
    511     },
    512     {
    513       "flag": "Base model pre-training contamination not addressed",
    514       "detail": "The paper does not verify that Qwen2.5-7B-Instruct or DeepSeek-R1-Distill models had not already been exposed to these benchmarks during pre-training, which could confound member/non-member comparisons."
    515     },
    516     {
    517       "flag": "Minimal limitations section",
    518       "detail": "Appendix A consists of a single short paragraph acknowledging the lack of a proposed solution, without discussing threats to validity such as model selection, benchmark coverage, or RL scale."
    519     }
    520   ],
    521   "cited_papers": [
    522     {
    523       "title": "Detecting Pretraining Data from Large Language Models (Min-K%)",
    524       "relevance": "Core contamination detection baseline method used throughout as a comparison point"
    525     },
    526     {
    527       "title": "Extracting Training Data from Large Language Models (Carlini et al., 2021)",
    528       "relevance": "Foundational membership inference work; LOSS, Ref, and Zlib detection methods all derive from this paper"
    529     },
    530     {
    531       "title": "An Empirical Analysis of Memorization in Fine-Tuned Autoregressive Language Models (LiRA)",
    532       "relevance": "Reference-based contamination detection method that performs best in Stage I but fails in Stage II"
    533     },
    534     {
    535       "title": "Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models",
    536       "relevance": "Improved contamination detection baseline tested in both experimental stages"
    537     },
    538     {
    539       "title": "LLM Dataset Inference: Did You Train on My Dataset? (Max-K%)",
    540       "relevance": "Key detection baseline method showing the largest AUROC drop after GRPO"
    541     },
    542     {
    543       "title": "Evading Data Contamination Detection for Language Models is (Too) Easy",
    544       "relevance": "Prior work on contamination concealment via benchmark augmentation — contrasted with this paper's algorithmic-level analysis"
    545     },
    546     {
    547       "title": "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning",
    548       "relevance": "Source of the DeepSeek-R1-Distill models used in Stage II experiments; its training uses the GRPO algorithm being analyzed"
    549     },
    550     {
    551       "title": "Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models (CDD)",
    552       "relevance": "Key contamination detection method; demonstrates the prior art assumes memorization, which this paper shows breaks down for LRMs"
    553     },
    554     {
    555       "title": "How Much Can We Forget about Data Contamination? (Bordt et al., 2024)",
    556       "relevance": "Related contamination concealment work from the training dynamics perspective, contrasted with this paper's algorithmic-level analysis"
    557     }
    558   ],
    559   "engagement_factors": {
    560     "practical_relevance": {
    561       "score": 1,
    562       "justification": "Findings inform contamination detection researchers and leaderboard operators but are not directly usable as a tool by practitioners."
    563     },
    564     "surprise_contrarian": {
    565       "score": 3,
    566       "justification": "Directly challenges the widely-held assumption that contamination detection methods work, showing they are trivially evaded by standard RL training."
    567     },
    568     "fear_safety": {
    569       "score": 2,
    570       "justification": "Raises alarm about the integrity of LLM leaderboards and the ease of undetectable benchmark gaming, undermining trust in model evaluations."
    571     },
    572     "drama_conflict": {
    573       "score": 3,
    574       "justification": "Strong 'leaderboards are fake' angle — demonstrates that model developers could easily contaminate LRMs to achieve inflated rankings while evading all 10 tested detection methods."
    575     },
    576     "demo_ability": {
    577       "score": 1,
    578       "justification": "Code is released on GitHub but this is a research pipeline requiring significant compute (9× L40S GPUs), not something easily tried."
    579     },
    580     "brand_recognition": {
    581       "score": 1,
    582       "justification": "University researchers (UIUC, UW) without major lab affiliation. The paper uses DeepSeek models, which have moderate recognition."
    583     }
    584   },
    585   "hn_data": {
    586     "threads": [
    587       {
    588         "hn_id": "45693591",
    589         "title": "ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference",
    590         "points": 96,
    591         "comments": 8,
    592         "url": "https://news.ycombinator.com/item?id=45693591"
    593       },
    594       {
    595         "hn_id": "45235119",
    596         "title": "Instruction-Following Pruning for Large Language Models",
    597         "points": 5,
    598         "comments": 0,
    599         "url": "https://news.ycombinator.com/item?id=45235119"
    600       },
    601       {
    602         "hn_id": "37921371",
    603         "title": "Quantum Computing: Principles and Applications",
    604         "points": 5,
    605         "comments": 0,
    606         "url": "https://news.ycombinator.com/item?id=37921371"
    607       },
    608       {
    609         "hn_id": "45843183",
    610         "title": "Mathematical Exploration and Discovery at Scale",
    611         "points": 4,
    612         "comments": 2,
    613         "url": "https://news.ycombinator.com/item?id=45843183"
    614       },
    615       {
    616         "hn_id": "45862519",
    617         "title": "Mathematical exploration and discovery at scale – Terence Tao et al.",
    618         "points": 4,
    619         "comments": 1,
    620         "url": "https://news.ycombinator.com/item?id=45862519"
    621       },
    622       {
    623         "hn_id": "45721820",
    624         "title": "The Fragility of Benchmark Contamination Detection in Reasoning Models",
    625         "points": 2,
    626         "comments": 0,
    627         "url": "https://news.ycombinator.com/item?id=45721820"
    628       },
    629       {
    630         "hn_id": "45837025",
    631         "title": "Mathematical Exploration and Discovery at Scale",
    632         "points": 2,
    633         "comments": 0,
    634         "url": "https://news.ycombinator.com/item?id=45837025"
    635       },
    636       {
    637         "hn_id": "41801438",
    638         "title": "Comprehensive Survey of Mamba Architectures for Medical Image Analysis,Beyond",
    639         "points": 2,
    640         "comments": 0,
    641         "url": "https://news.ycombinator.com/item?id=41801438"
    642       },
    643       {
    644         "hn_id": "45607220",
    645         "title": "Conceptualizing/Modeling Communication-Based Cyberattacks on Automated Vehicles",
    646         "points": 1,
    647         "comments": 0,
    648         "url": "https://news.ycombinator.com/item?id=45607220"
    649       }
    650     ],
    651     "top_points": 96,
    652     "total_points": 121,
    653     "total_comments": 11
    654   }
    655 }