paper_type.json (259B)
{
  "paper_type": "benchmark-creation",
  "reason": "The paper introduces JETTS, a new benchmark for evaluating LLM-judges as test-time scaling evaluators, and reports comprehensive baseline results comparing different judge variants against reward models."
}