TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks Paper • 2412.14161 • Published Dec 18, 2024 • 50
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks Paper • 2410.19100 • Published Oct 24, 2024 • 6
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published Sep 12, 2024 • 44