paper_type.json (253B)
{
  "paper_type": "benchmark-creation",
  "reason": "Introduces PythonSaga, a new 185-problem code benchmark designed to address documented biases in HumanEval and MBPP, with baseline evaluations demonstrating improved concept and difficulty balance."
}