paper_type.json (277B)
{
  "paper_type": "benchmark-creation",
  "reason": "Introduces GitTaskBench, a new 54-task benchmark for evaluating code agents across 7 domains; the benchmark itself is the primary contribution, with experimental evaluation of baselines serving to demonstrate its utility."
}