Hallucinates more than Mistral 7b
Mistral Small does very well for its size on my popular world knowledge test (82.6).
And Mistral 7b is among the leaders in its size range (69.3).
But Ministral 8b performed notably worse (51.3), and was even beaten by Llama 3.2 3b (62.1).
This doesn't appear to be due to a configuration error on my part, because I got the same results from multiple online servers, including LMSYS.
I noticed a very similar thing with Qwen2.5 72b. Its score dropped from 85.9 (Qwen2 72b) to 68.4 (Qwen2.5 72b).
I then noticed that both Qwen2.5 72b and Ministral 8b were also hallucinating more on the STEM knowledge covered by the MMLU, but only when forced to fully retrieve the desired information rather than simply identify the correct answer from a list (multiple choice).
It really appears to me (and I may be way off base) that both you and the Qwen team, when making Ministral & Qwen2.5, undertrained on a larger corpus. That would leave the resulting LLMs too scrambled to retrieve information in response to natural language questions nearly as accurately as the previous versions (Mistral 7b & Qwen2), yet still able to identify more correct answers when their memories are jogged by seeing the answer in a list (hence the higher scores on multiple choice tests like the MMLU).
Example: Who played the comic store owner on the TV show The Big Bang Theory?
Response: Kevin Sizemore (the correct answer is Kevin Sussman; scrambled, but close enough to pick the right option if it were multiple choice)
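To make the gap concrete, here's a rough Python sketch of how the same scrambled response fares under the two scoring styles. This is not my actual test harness: the function names and the distractor options are made up for illustration, and plain string similarity is only a crude stand-in for a model having its "memory jogged" by seeing the answer in a list.

```python
from difflib import SequenceMatcher

def exact_match(response: str, answer: str) -> bool:
    # Open-ended scoring: the model has to produce the name itself.
    return response.strip().lower() == answer.strip().lower()

def closest_option(response: str, options: list[str]) -> str:
    # Multiple-choice stand-in (an assumption, not how an LLM really picks):
    # choose the listed option most similar to what the model half-remembers.
    return max(
        options,
        key=lambda o: SequenceMatcher(None, response.lower(), o.lower()).ratio(),
    )

answer = "Kevin Sussman"
response = "Kevin Sizemore"  # the scrambled free-form response from the example
# Hypothetical distractors, chosen only for illustration.
options = ["Kevin Sussman", "Simon Helberg", "Kunal Nayyar", "John Ross Bowie"]

open_ended_correct = exact_match(response, answer)
multiple_choice_pick = closest_option(response, options)

print("open-ended correct:", open_ended_correct)
print("multiple-choice pick:", multiple_choice_pick)
```

The free-form response counts as a miss, while the same fuzzy memory is enough to land on the right option once it's shown in a list.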
Assuming I'm right, then loosely holding more information (hence scoring higher on multiple choice knowledge tests) isn't even remotely worth the huge spike in hallucinations when trying to retrieve the same information using natural language questions, since users almost never give the LLM possibilities to choose from in the real world.
Anyways, I'm a big fan of your work, and Mistral 7b, Small and Large are great. Just thought I'd share my theory with you in case there's any truth to it.