The SimpleQA score of the model is WAY off.

by phil111 - opened Jul 2

Jul 2

According to your blog the SimpleQA score of this model is 45.9, which is even higher than GPT4.1's score of 40.2, and far higher than DeepSeek V3's score of 27.3.

This is not just WAY off, it's absolutely insane. The English SimpleQA test consists of esoteric full recall questions (non-multiple choice) across a broad spectrum of domains, and across these domains DS V3 has vastly more knowledge than this model, and GPT4.1 has vastly more than DS V3. I estimate that the true SimpleQA score of this model, sans test contamination, is ~12, and I find it very hard to believe that the makers of this model aren't aware of this.

The same thing happened with the smaller Ernie 4.5 21b, discussed at the link below. It has a true SimpleQA score of ~3, yet Baidu claimed 24.2, which would put it well above Llama 3 70b, the previous leader among models of 70b or smaller, which has an English SimpleQA score of only 20.

I've tested ~100 models and this is the first time I've come across such an egregious disparity between a test result and the true performance.

https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-PT/discussions/1#686292d583da5b66b019d0f0

Aver0

8 days ago


edited 8 days ago

Thanks for confirming my suspicions. I was testing ERNIE locally on my own questions and use cases, and I got mediocre results (much worse than Qwen3-235B-Instruct), but since I could only run an IQ4XS quant on my hardware, I had doubts. Now I see this 300B model really isn't worth its size.

phil111

8 days ago

@Aver0 While it's true that Qwen3-235B-Instruct-2507 has more broad English knowledge than this model, it still has FAR less than Kimi K2.

For example, Kimi K2 got an English SimpleQA score of 31, which is consistent with my testing. It might sound low, but the SimpleQA test is very hard and composed of esoteric questions across a broad spectrum of domains. And despite Qwen3 2507 performing far worse than Kimi K2 across all domains of knowledge, it claimed an English SimpleQA score of 54. Its true score is only ~15, moderately better than the previous version's 13.

15 is still an OK score and is higher than models like Mistral Small and Gemma 3 27b, which score ~11-12, but it's concerning how egregiously Baidu and Alibaba cheated and then lied about their SimpleQA scores.

Aver0

8 days ago

Too bad Kimi K2 is so ridiculously huge. Even DeepSeek V3 was manageable (barely), but Kimi K2 just isn't for us mortals.
There's always a catch with these guys.
