paper_type.json (272B)
{
  "paper_type": "benchmark-creation",
  "reason": "The primary contribution is HCAST, a new benchmark dataset of 189 autonomy tasks calibrated against human performance baselines; agent evaluations are provided as validation of the benchmark's difficulty calibration."
}