paper_type.json (236B)
{
  "paper_type": "benchmark-creation",
  "reason": "Introduces WebGuard, the first large-scale action-level dataset for web agent guardrail evaluation (4,939 human-annotated actions), with empirical results validating the benchmark."
}