scan-v4.json - ai-research-survey - Systematic scan of agentic development research. What's signal, what's noise.

scan-v4.json (22421B)
{
  "scan_version": 4,
  "paper_type": "benchmark-creation",
  "paper": {
    "title": "Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators",
    "authors": [
      "Yilun Zhou",
      "Austin Xu",
      "Peifeng Wang",
      "Caiming Xiong",
      "Shafiq Joty"
    ],
    "year": 2025,
    "venue": "International Conference on Machine Learning",
    "arxiv_id": "2504.15253",
    "doi": "10.48550/arXiv.2504.15253"
  },
  "checklist": {
    "claims_and_evidence": {
      "abstract_claims_supported": {
        "applies": true,
        "answer": true,
        "justification": "Abstract claims are supported: 'judges are competitive with outcome reward models in reranking' (Fig. 4, Best RM comparison), 'consistently worse than process reward models in beam search' (Fig. 9, QPRM comparison), 'natural language critiques are currently ineffective' (Fig. 11, all δ(Eff) < 1.0).",
        "source": "opus"
      },
      "causal_claims_justified": {
        "applies": true,
        "answer": true,
        "justification": "Claims are appropriately hedged as observational: 'judge-specific finetuning seems to primarily boost instruction-following evaluation abilities' (Sec. 4.2), 'This may suggest an intrinsic pliability' (Sec. 4.4). Controlled comparisons (same setup, varying one factor) support the comparative claims adequately.",
        "source": "opus"
      },
      "generalization_bounded": {
        "applies": true,
        "answer": true,
        "justification": "Claims are bounded by 'current' temporal qualifier and refer to the tested judges. The paper tests 10 judges across 3 domains and 8 generators, providing broad but bounded coverage. The title correctly identifies this as 'The JETTS Benchmark' rather than making universal claims.",
        "source": "opus"
      },
      "alternative_explanations_discussed": {
        "applies": true,
        "answer": false,
        "justification": "Limited systematic discussion of alternatives. For the main finding that critiques are ineffective, the paper identifies failure modes (style over substance) but doesn't consider whether the refinement prompt design (Fig. 18), generator capabilities, or other factors could explain the results. The JETTS vs. RewardBench difficulty gap discussion (Sec. 1) is a notable exception.",
        "source": "opus"
      },
      "proxy_outcome_distinction": {
        "applies": true,
        "answer": true,
        "justification": "Metrics match claims well. Normalized helpfulness (Eq. 1) directly measures what is claimed (judge improvement over greedy baseline, relative to oracle). The paper does not conflate benchmark performance with broader capabilities — claims are about judge performance in specific test-time scaling settings.",
        "source": "opus"
      }
    },
    "limitations_and_scope": {
      "limitations_section_present": {
        "applies": true,
        "answer": true,
        "justification": "Sec. 5 ('Conclusion and Future Work') identifies specific limitations with corresponding research directions: the pairwise performance-efficiency dilemma (O(N²)), insufficient CoT reasoning for judgment and critiques, and the need for better reasoning capabilities. This constitutes substantive discussion.",
        "source": "opus"
      },
      "threats_to_validity_specific": {
        "applies": true,
        "answer": false,
        "justification": "No dedicated threats-to-validity discussion. Some specific methodological concerns are noted in passing (CHAMP oracle overestimation in App. B.2, positional bias mitigation in App. B.1), but these are not framed as threats to validity and no systematic analysis of how these could affect conclusions is provided.",
        "source": "opus"
      },
      "scope_boundaries_stated": {
        "applies": true,
        "answer": false,
        "justification": "No explicit 'what the results do NOT show' statements. The paper does not explicitly state what settings, populations, or claims are excluded from its scope. The practitioner note (Sec. 3.5) offers practical guidance but does not delineate scope boundaries.",
        "source": "opus"
      }
    },
    "conflicts_of_interest": {
      "funding_disclosed": {
        "applies": true,
        "answer": false,
        "justification": "No funding disclosure or acknowledgments section in the paper. All authors are from Salesforce AI Research but no funding sources are mentioned.",
        "source": "opus"
      },
      "affiliations_disclosed": {
        "applies": true,
        "answer": true,
        "justification": "All five authors are listed as affiliated with 'Salesforce AI Research' on the first page.",
        "source": "opus"
      },
      "funder_independent_of_outcome": {
        "applies": true,
        "answer": false,
        "justification": "Salesforce AI Research developed SFR-Judge (Wang et al., 2024a), one of the evaluated judge models. SFR-Judge-70B achieves the highest reranking and beam search performance among judges in the leaderboard (Fig. 1), and Salesforce has a commercial interest in demonstrating its judge model's effectiveness.",
        "source": "opus"
      },
      "financial_interests_declared": {
        "applies": true,
        "answer": false,
        "justification": "No competing interests or financial disclosure statement appears in the paper.",
        "source": "opus"
      }
    },
    "scope_and_framing": {
      "key_terms_defined": {
        "applies": true,
        "answer": true,
        "justification": "Key terms are defined: 'LLM-as-judge', 'test-time scaling', 'normalized helpfulness' (Eq. 1), and 'effective improvement ratio' (Eq. 3) are all formally defined with mathematical notation.",
        "source": "haiku"
      },
      "intended_contribution_clear": {
        "applies": true,
        "answer": true,
        "justification": "The paper explicitly states it proposes 'the first systematic benchmark of LLM-judges for model's test-time scaling' with three specific tasks, and clearly differentiates its contribution from existing benchmarks like RewardBench.",
        "source": "haiku"
      },
      "engagement_with_prior_work": {
        "applies": true,
        "answer": true,
        "justification": "Section 2 provides a substantive comparison against RewardBench, PPE, JudgeBench, ProcessBench, and critique evaluation benchmarks, explaining specifically how JETTS differs (simulating actual test-time scenarios vs. fixed pairwise test sets) rather than merely listing related papers.",
        "source": "haiku"
      }
    }
  },
  "type_checklist": {
    "benchmark-creation": {
      "construct_design": {
        "construct_validity_argued": {
          "applies": true,
          "answer": true,
          "justification": "The paper argues that existing benchmarks like RewardBench form pairs from different generators, allowing judges to exploit stylistic shortcuts; JETTS requires comparing responses from the same generator, specifically measuring utility in realistic test-time compute scenarios.",
          "source": "haiku"
        },
        "difficulty_distribution_characterized": {
          "applies": true,
          "answer": false,
          "justification": "The benchmark uses multiple datasets of varying difficulty (GSM8k vs. MATH Level 5 vs. CHAMP), but difficulty tiers are not formally characterized or measured; the spread across datasets is implicit rather than systematically analyzed.",
          "source": "haiku"
        },
        "ceiling_floor_effects_checked": {
          "applies": true,
          "answer": false,
          "justification": "The paper observes that some judges perform below the greedy baseline (negative normalized helpfulness) and that single-rating judges are over-lenient, but does not conduct a formal ceiling/floor analysis or assess whether the benchmark discriminates adequately across the range of evaluated models.",
          "source": "haiku"
        },
        "human_baseline_included": {
          "applies": true,
          "answer": false,
          "justification": "No human baseline performance is reported for any of the benchmark tasks; the paper compares only AI models against random and oracle baselines.",
          "source": "haiku"
        },
        "scoring_rubric_justified": {
          "applies": true,
          "answer": true,
          "justification": "Normalized helpfulness (Eq. 1) is defined and justified as measuring improvement beyond greedy with oracle as natural upper bound; the effective improvement ratio (Eq. 3) is justified for refinement where oracle upper bound is unavailable.",
          "source": "haiku"
        }
      },
      "robustness": {
        "contamination_resistance_designed": {
          "applies": true,
          "answer": false,
          "justification": "The benchmark relies entirely on existing public datasets (GSM8k, MATH, HumanEval+, etc.) that are widely present in LLM training data; no contamination resistance measures such as temporal splits, canary strings, or dynamic generation are employed.",
          "source": "haiku"
        },
        "temporal_robustness_discussed": {
          "applies": true,
          "answer": false,
          "justification": "The paper does not discuss whether the benchmark will be gamed as models improve or provide plans for dataset refreshing; no analysis of how quickly benchmark utility might decay is offered.",
          "source": "haiku"
        },
        "failure_modes_discussed": {
          "applies": true,
          "answer": true,
          "justification": "The paper explicitly discusses failure modes: oracle accuracy over-estimation for CHAMP math (App. B.2), positional bias (addressed via consistency checks, App. B.1), and qualitative judge critique failure modes (false positives from stylistic focus, false negatives from over-scrutiny, Figs. 27-28).",
          "source": "haiku"
        },
        "baseline_implementations_provided": {
          "applies": true,
          "answer": true,
          "justification": "Pre-computed model responses are released as part of the benchmark, prompt templates are provided in the appendix (Figs. 15-18), and code is available at the GitHub repository listed in the abstract.",
          "source": "haiku"
        }
      },
      "documentation": {
        "dataset_documentation_complete": {
          "applies": true,
          "answer": false,
          "justification": "Table 1 lists the datasets with sizes and metrics, and Appendix A.1 describes evaluation procedures, but no formal data card is provided; documentation for the pre-computed response cache (collection methodology, model generation details) is minimal.",
          "source": "haiku"
        },
        "licensing_and_access_clear": {
          "applies": true,
          "answer": false,
          "justification": "A GitHub repository URL is provided, but the paper does not state a license for the benchmark materials; it is unclear under what terms others may use or redistribute the pre-computed responses.",
          "source": "haiku"
        },
        "intended_use_specified": {
          "applies": true,
          "answer": false,
          "justification": "A practitioner note advises using reranking as a proxy for beam search, but the paper does not explicitly specify what cannot be concluded from JETTS results or delineate appropriate vs. inappropriate use cases for the benchmark.",
          "source": "haiku"
        }
      }
    }
  },
  "claims": [
    {
      "claim": "LLM-judges are competitive with outcome reward models in response reranking",
      "evidence": "SFR-70B (h=0.171) and Skywork-Critic-70B (h=0.177) exceed the best RM baseline (h=0.113) on reranking in Fig. 1",
      "supported": "strong"
    },
    {
      "claim": "LLM-judges are consistently worse than process reward models in beam search",
      "evidence": "Fig. 9 shows Qwen2.5-Math-PRM-7B achieves h=0.195 in beam search while the best judge (SFR-70B) achieves h=0.138; Table 14 confirms QPRM outperforms all judges across most dataset/generator combinations",
      "supported": "strong"
    },
    {
      "claim": "Natural language critiques from LLM-judges are currently ineffective at guiding generators to produce better responses",
      "evidence": "Fig. 11 shows no judge achieves effective improvement ratio >1.0 across all task categories; the seed response (index 0) is most likely to be selected after refinement (Fig. 14)",
      "supported": "strong"
    },
    {
      "claim": "RewardBench fails to reveal capability differences between small and large judges that matter for test-time scaling",
      "evidence": "Fig. 2 shows Skywork-Critic 8B and 70B perform similarly on RewardBench but diverge substantially on JETTS reranking and beam search tasks",
      "supported": "strong"
    },
    {
      "claim": "Larger judge-to-generator size ratio significantly increases helpfulness for math but not for code generation",
      "evidence": "Fig. 5 shows regression coefficient 0.16*** for math vs. 0.00 n.s. for code in reranking; Fig. 8 confirms the same pattern in beam search (Math coef: 0.09***, Code coef: 0.01 n.s.)",
      "supported": "strong"
    },
    {
      "claim": "Judge critiques fail because they focus on stylistic features rather than response correctness",
      "evidence": "Qualitative case study of 100+ critique-response pairs (App. B.3, Figs. 27-28) finds two modes: false positives (judges miss errors, praise formatting) and false negatives (judges over-penalize verbose but correct responses)",
      "supported": "moderate"
    }
  ],
  "methodology_tags": [
    "benchmark-eval"
  ],
  "key_findings": "JETTS reveals that LLM-judges are competitive with outcome reward models (ORMs) in response reranking but substantially lag the 7B Qwen2.5-Math-PRM in step-level beam search, undermining the case for using judges as drop-in RM replacements in test-time compute pipelines. Natural language critiques — considered a defining advantage of LLM-judges over scalar RMs — are currently ineffective: no judge achieves an effective improvement ratio above 1.0 in critique-based refinement across any task category. Existing judge benchmarks like RewardBench mask this capability gap by forming pairs from different generators, allowing stylistic shortcuts; JETTS's within-generator comparison exposes that small judges (8B) provide negligible improvement over a 70B generator even when the 8B model scores similarly to a 70B judge on RewardBench.",
  "red_flags": [
    {
      "flag": "Self-evaluation conflict",
      "detail": "All authors are from Salesforce AI Research, which developed SFR-Judge — one of the evaluated judge families. SFR-Judge appears prominently in all three benchmark tasks with no conflict-of-interest disclosure."
    },
    {
      "flag": "No contamination controls",
      "detail": "The benchmark uses entirely pre-existing public datasets (GSM8k, MATH, HumanEval+, MBPP+, BigCodeBench, AlpacaEval, IFEval) that are widely included in judge model training data, potentially inflating judge performance on familiar problems."
    },
    {
      "flag": "No human baseline",
      "detail": "The benchmark does not establish human performance on any of its tasks, making it impossible to contextualize how close LLM-judges are to human-level evaluation quality."
    },
    {
      "flag": "No dedicated limitations section",
      "detail": "Limitations are folded into the conclusion section without formal enumeration; threats to validity such as training data overlap between judges and test tasks are not addressed."
    },
    {
      "flag": "No license for benchmark artifacts",
      "detail": "Pre-computed responses and benchmark materials are released on GitHub with no stated license, creating ambiguity for downstream use and reproducibility."
    }
  ],
  "cited_papers": [
    {
      "title": "RewardBench: Evaluating Reward Models for Language Modeling",
      "relevance": "Primary comparison benchmark; JETTS explicitly compares against RewardBench to show it reveals judge capability differences that RewardBench masks"
    },
    {
      "title": "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena",
      "relevance": "Foundational LLM-as-judge work that established the paradigm JETTS evaluates in test-time scaling contexts"
    },
    {
      "title": "JudgeBench: A Benchmark for Evaluating LLM-Based Judges",
      "relevance": "Direct related work on judge evaluation; JETTS differentiates itself by simulating actual test-time scaling rather than fixed pairwise tests"
    },
    {
      "title": "ProcessBench: Identifying Process Errors in Mathematical Reasoning",
      "relevance": "Related benchmark for process reward model evaluation; JETTS extends to judge models in beam search"
    },
    {
      "title": "Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters",
      "relevance": "Key reference establishing the test-time scaling paradigm that JETTS benchmarks judges against"
    },
    {
      "title": "How to Evaluate Reward Models for RLHF (PPE)",
      "relevance": "Related evaluator benchmark covering Best-of-N settings; JETTS extends to three distinct task types and judge-specific architectures"
    },
    {
      "title": "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models",
      "relevance": "One of the evaluated judge models; represents the fine-grained evaluation with natural language feedback approach"
    },
    {
      "title": "Direct Judgement Preference Optimization (SFR-Judge)",
      "relevance": "Salesforce-developed judge model central to the evaluation; training methodology relevant to understanding judge capabilities"
    }
  ],
  "engagement_factors": {
    "practical_relevance": {
      "score": 3,
      "justification": "Directly actionable for ML practitioners choosing between LLM-judges and reward models for inference-time compute; benchmark and pre-computed responses are released for immediate use."
    },
    "surprise_contrarian": {
      "score": 2,
      "justification": "The finding that natural language critiques — widely touted as judges' key advantage — are currently useless for refinement challenges the prevailing optimism about LLM-as-judge capabilities."
    },
    "fear_safety": {
      "score": 1,
      "justification": "Minor safety relevance: unreliable evaluators in test-time scaling could propagate errors or reinforce incorrect responses at scale, but no explicit safety framing is made."
    },
    "drama_conflict": {
      "score": 1,
      "justification": "Implicitly challenges the validity of RewardBench as a proxy for real-world judge utility, but the disagreement is framed technically rather than polemically."
    },
    "demo_ability": {
      "score": 3,
      "justification": "Full benchmark with pre-computed responses and code released on GitHub; researchers can immediately reproduce and extend the evaluation with new judge models."
    },
    "brand_recognition": {
      "score": 2,
      "justification": "Salesforce AI Research is a recognizable lab, and the paper appears at ICML 2025; evaluates well-known models (Llama, Qwen, DeepSeek) which increases visibility."
    }
  },
  "hn_data": {
    "threads": [
      {
        "hn_id": "40160728",
        "title": "CatLIP: Clip Vision Accuracy with 2.7x Faster Pre-Training on Web-Scale Data",
        "points": 48,
        "comments": 4,
        "url": "https://news.ycombinator.com/item?id=40160728",
        "created_at": "2024-04-25T17:46:04Z"
      },
      {
        "hn_id": "43686458",
        "title": "NPB-Rust: NAS Parallel Benchmarks in Rust",
        "points": 6,
        "comments": 1,
        "url": "https://news.ycombinator.com/item?id=43686458",
        "created_at": "2025-04-14T21:21:43Z"
      },
      {
        "hn_id": "41517885",
        "title": "Towards Large Language Models as Copilots for Theorem Proving in Lean",
        "points": 3,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=41517885",
        "created_at": "2024-09-12T05:34:47Z"
      },
      {
        "hn_id": "40086186",
        "title": "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing",
        "points": 3,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=40086186",
        "created_at": "2024-04-19T12:51:23Z"
      },
      {
        "hn_id": "43781749",
        "title": "A Comprehensive Benchmark for C-to-Safe-Rust Transpilation",
        "points": 2,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=43781749",
        "created_at": "2025-04-24T12:08:53Z"
      },
      {
        "hn_id": "44327775",
        "title": "Approximating Language Model Training Data from Weights",
        "points": 2,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=44327775",
        "created_at": "2025-06-20T13:56:11Z"
      },
      {
        "hn_id": "44086818",
        "title": "Gen2seg: Generative Models Enable Generalizable Instance Segmentation",
        "points": 2,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=44086818",
        "created_at": "2025-05-25T10:20:25Z"
      },
      {
        "hn_id": "40139677",
        "title": "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing",
        "points": 2,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=40139677",
        "created_at": "2024-04-24T02:10:20Z"
      },
      {
        "hn_id": "40116933",
        "title": "Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing",
        "points": 2,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=40116933",
        "created_at": "2024-04-22T18:02:16Z"
      },
      {
        "hn_id": "45349444",
        "title": "Seeing Is Deceiving:Mirror-Based Lidar Spoofing for Autonomous Vehicle Deception",
        "points": 1,
        "comments": 0,
        "url": "https://news.ycombinator.com/item?id=45349444",
        "created_at": "2025-09-23T16:39:48Z"
      }
    ],
    "top_points": 48,
    "total_points": 71,
    "total_comments": 5
  }
}
	git clone https://git.shiptheloop.com/ai-research-survey.git