38.
AgentWorldBench: 7-domain benchmark with ground-truth observations from real environments, constructed from 5 frontier model trajectories on 9 established benchmarks. Results: Qwen-AgentWorld-397B-A17B achieves the highest overall score (