4b model with an 84.2 MMLU-Redux score?

#2
by phil111

A 4b dense model with an MMLU-Redux score on par with far larger top models is very hard to believe.

And after playing with this model for just a few minutes, it's clear something's not right. It only performs on par with models scoring ~65 on MMLU-Redux.

Perhaps multiple-choice testing is largely to blame. Real-world use requires the full and accurate retrieval of the desired information in response to prompts with highly variable wording and context, which is much harder than simply picking the correct answer out of a provided lineup.
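To make that distinction concrete, here's a minimal sketch of the two framings. `ask_model` is a hypothetical stand-in for whatever inference client you use, and the question is an invented example; the point is only that the multiple-choice framing hands the model the answer in a four-option lineup (25% random baseline), while the open-ended framing demands full recall:

```python
def ask_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real inference client."""
    return "C"  # placeholder response for illustration

question = "Which element has the atomic number 26?"
gold = "iron"

# Multiple-choice framing (MMLU-style): recognition, not recall.
# The correct answer is sitting in the prompt; the model only has
# to pick it out of the lineup.
mc_prompt = (
    f"{question}\n"
    "A. copper\nB. zinc\nC. iron\nD. nickel\n"
    "Answer with a single letter."
)
mc_correct = ask_model(mc_prompt).strip().upper().startswith("C")

# Open-ended framing (SimpleQA-style): no lineup. The model must
# produce the answer itself and is scored by matching the gold string.
oe_prompt = f"{question} Answer with just the name of the element."
oe_correct = gold in ask_model(oe_prompt).strip().lower()

print(f"multiple-choice correct: {mc_correct}, open-ended correct: {oe_correct}")
```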

But in many cases something else is going on. For example, an 80.2 ZebraLogic score is nonsense. There's nothing remotely special about this model's logical reasoning; in fact, I can trip it up without even trying. Its score should be ~10-15. And the stated 54 English SimpleQA score for your Qwen3 235b model is equally insane: it answers comparably esoteric questions in the same domains the test covers on par with models scoring ~15-17.

Even if this were due to accidental test contamination, why report what you know to be absurdly high test results in papers and model cards? It's as if you're deliberately trying to make comparing LLM performance via standardized tests meaningless.
