Think AGI is just around the corner? Not so fast.
When OpenAI released its Computer-Using Agent (CUA) API, I happened to be playing Wordle 🧩 and thought, why not see how the model handles it?
Spoiler: Wordle turned out to be a surprisingly effective benchmark.
So Romain Cosentino Ph.D. and I dug in and analyzed the results of several hundred runs.
Takeaways
1️⃣ Even the best computer-using models struggle with simple, context-dependent tasks.
2️⃣ Visual perception and reasoning remain major hurdles for multimodal agents.
3️⃣ Real-world use cases reveal significant gaps between hype and reality. Perception accuracy drops to near zero by the last turn.
Read our arXiv article for more details: https://www.arxiv.org/abs/2504.15434