@shekkizh on Hugging Face: "Think AGI is just around the corner? Not so fast. When OpenAI released its…"

shekkizh

posted an update Apr 23

Think AGI is just around the corner? Not so fast.

When OpenAI released its Computer-Using Agent (CUA) API, I happened to be playing Wordle 🧩 and thought, why not see how the model handles it?
Spoiler: Wordle turned out to be a surprisingly effective benchmark.
So Romain Cosentino Ph.D. and I dug in and analyzed the results of several hundred runs.

🔑 Takeaways
1️⃣ Even the best computer-using models struggle with simple, context-dependent tasks.
2️⃣ Visual perception and reasoning remain major hurdles for multimodal agents.
3️⃣ Real-world use cases reveal significant gaps between hype and reality. Perception accuracy drops to near zero by the last turn 📉

🔗 Read our arXiv article for more details: https://www.arxiv.org/abs/2504.15434

agentlans

Apr 25

I wonder if it's just bad colour perception, bad reasoning, unexpectedly bad prompting, or some combination of those.

Like if you can somehow give accurate colours for each letter in each row, can the agent do better? (I don't know whether that's possible with OpenAI CUA)
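As a rough illustration of what "accurate colours for each letter" would mean, here is a minimal sketch of the standard Wordle scoring rule, which could serve as ground truth to feed an agent or to check its perception against. The function name `wordle_feedback` and the colour labels are made up for this example and are not part of the CUA API or the paper.

```python
from collections import Counter


def wordle_feedback(guess: str, target: str) -> list[str]:
    """Return one colour per letter of `guess` using the usual Wordle rules.

    Hypothetical helper for illustration only.
    """
    if len(guess) != len(target):
        raise ValueError("guess and target must have the same length")
    guess, target = guess.lower(), target.lower()

    colours = ["gray"] * len(guess)
    remaining = Counter()  # target letters not matched by a green

    # Pass 1: mark exact-position matches as green.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            colours[i] = "green"
        else:
            remaining[t] += 1

    # Pass 2: mark misplaced letters as yellow, respecting duplicate counts.
    for i, g in enumerate(guess):
        if colours[i] != "green" and remaining[g] > 0:
            colours[i] = "yellow"
            remaining[g] -= 1

    return colours


print(wordle_feedback("paper", "apple"))
print(wordle_feedback("floor", "robot"))
```

Comparing these labels with what the agent reports for each row would separate perception errors from reasoning errors.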

Also, if the problem is with the image tokenization, then it sounds like a CNN would be able to perceive the whole grid better, if there were such a model capable of playing Wordle.

shekkizh

Apr 26

Images are split into patches and each patch is tokenized: the tokenization maps each patch into a feature dimension and quantizes it. This probably already involves a CNN and/or attention. The issue is that the model is not able to reason about both color and text in the tokenized space.
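To make the patch-then-quantize idea concrete, here is a toy sketch in NumPy. Everything in it is an assumption for illustration: the random `projection` stands in for whatever learned encoder (CNN and/or attention) a real model uses, the random `codebook` stands in for a learned one, and the sizes are arbitrary. It is not how the CUA model or any specific vision tokenizer is implemented.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                  # stand-in for a screenshot
projection = rng.normal(size=(8 * 8 * 3, 16))    # assumed: stand-in encoder
codebook = rng.normal(size=(64, 16))             # assumed: stand-in codebook

patches = patchify(image, patch=8)               # 16 patches of 192 values
features = patches @ projection                  # 16 feature vectors
tokens = quantize(features, codebook)            # 16 discrete token ids

print(patches.shape, features.shape, tokens.shape)
```

In this toy setup a patch is reduced to a single token id, so a tile's colour and the letter drawn on it end up entangled in one discrete code.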

We ran about 1000 experiments: different prompting, tool calls to a different model for recognition, and several other techniques. The results still hold. The paper is a small part of the analysis.🤷‍♂️
