shekkizh
posted an update 3 days ago
Think AGI is just around the corner? Not so fast.

When OpenAI released its Computer-Using Agent (CUA) API, I happened to be playing Wordle 🧩 and thought, why not see how the model handles it?
Spoiler: Wordle turned out to be a surprisingly effective benchmark.
So Romain Cosentino Ph.D. and I dug in and analyzed the results of several hundred runs.

🔑 Takeaways
1️⃣ Even the best computer-using models struggle with simple, context-dependent tasks.
2️⃣ Visual perception and reasoning remain major hurdles for multimodal agents.
3️⃣ Real-world use cases reveal significant gaps between hype and reality. Perception accuracy drops to near zero by the last turn 📉

🔗 Read our arXiv article for more details: https://www.arxiv.org/abs/2504.15434

I wonder if it's just bad colour perception, bad reasoning, unexpectedly bad prompting, or some combination of those.

For example, if you could somehow give the agent accurate colours for each letter in each row, would it do better? (I don't know whether that's possible with OpenAI CUA.)
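To make the "accurate colours for each letter" idea concrete, here is a minimal sketch of how ground-truth Wordle feedback could be computed and turned into text, assuming you control the game and know the answer. The function names `wordle_feedback` and `describe_row` are made up for illustration, and nothing here touches the OpenAI CUA API; whether the agent can be given this kind of text alongside its screenshots is exactly the open question.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> list[str]:
    """Return one colour per letter of `guess`, using standard Wordle rules.

    Hypothetical helper for illustration; not part of any agent API.
    """
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    colours = ["grey"] * len(guess)

    # Letters of the answer that were NOT matched exactly; only these
    # can still be "used up" by yellow tiles.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)

    # First pass: exact position matches are green.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            colours[i] = "green"

    # Second pass: left to right, a non-green letter is yellow only while
    # unmatched copies of it remain in the answer.
    for i, g in enumerate(guess):
        if colours[i] != "green" and remaining[g] > 0:
            colours[i] = "yellow"
            remaining[g] -= 1

    return colours


def describe_row(guess: str, answer: str) -> str:
    """Format one row as plain text that could be pasted into a prompt."""
    pairs = zip(guess.upper(), wordle_feedback(guess, answer))
    return ", ".join(f"{letter}={colour}" for letter, colour in pairs)


print(describe_row("speed", "abide"))
# S=grey, P=grey, E=yellow, E=grey, D=yellow
```

Comparing the agent's play with and without rows like this would help separate colour perception errors from reasoning errors.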

Also, if the problem is the image tokenization, it sounds like a CNN would perceive the whole grid better, if there were such a model capable of playing Wordle.