Tracks performance of LLMs, VLMs, and agents on web navigation tasks
Submit and evaluate models on the GAIA benchmark