paper_type.json (272B)
{
  "paper_type": "benchmark-creation",
  "reason": "Proposes Task2Quiz (T2Q), a new two-stage evaluation paradigm and framework for studying environment understanding in LLM agents, with 30 procedurally-generated TextWorld environments as the benchmark infrastructure."
}