paper_type.json (237B)
{
  "paper_type": "benchmark-creation",
  "reason": "Introduces WebArena, a new realistic web environment with 812 long-horizon tasks across multiple domains, with baseline evaluations to establish benchmark difficulty and properties."
}