Temp 0.7 is too high.
This model, coupled with the settings you recommend, reliably goes off the rails on easy questions that much smaller models handle without trouble. Bringing the temp down to 0.3 and setting minP to 0.6 helps, but it still vomits hallucinations across most popular domains of human knowledge, outside a select few like coding, math, and STEM, compared to similarly sized models from Google, Meta, and Mistral.
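For reference, this is roughly how I'm running it; a minimal sketch assuming the GGUF is loaded through llama-cpp-python (the model path, prompt, and context size are placeholders, and the 1.08 repeat penalty is the one mentioned further down):

```
from llama_cpp import Llama

# Model path is a placeholder; any Qwen3 30b Q4_K_M GGUF should behave the same.
llm = Llama(model_path="Qwen3-30B-A3B-Q4_K_M.gguf", n_ctx=8192)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Who were the main cast members of Two and a Half Men?"}],
    temperature=0.3,      # down from the recommended 0.7
    min_p=0.6,
    repeat_penalty=1.08,  # the repeat penalty mentioned below
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```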
For example, when asked for the main cast of Two and a Half Men, one of the most watched TV shows in history, which ran for 12 years, your Q4_K_M quant with your recommended settings returned the following.
Charlie Sheen as Charlie Harper
Jon Cryer as Alan Aldridge
Jennifer Taylor as Selma
Amanda Tepe as Lil' Danny
Lucy Liu as Denise Richards
Jonathan Groff as Jake Morris
Even some tiny 1b models did better on these kinds of questions than Qwen3 30b. Most of these names are just random famous people, like Alan Aldridge and Lucy Liu, who have nothing to do with the show. And the fact that Alan and Jake Harper became Alan Aldridge and Jake Morris, even though the model knows Alan was Charlie Harper's brother and Jake was his nephew, illustrates this model's lack of weight separation. The Harper token should have had a far higher statistical probability than Aldridge or Morris (note: the repeat penalty was set to 1.08).
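If anyone wants to check the probability claim directly, here's a rough sketch of how you could peek at the top next-token candidates, again via llama-cpp-python (the prompt is just an illustration; logprobs support needs the model loaded with logits_all=True):

```
from llama_cpp import Llama

# logits_all=True is required for logprobs in llama-cpp-python; path is a placeholder.
llm = Llama(model_path="Qwen3-30B-A3B-Q4_K_M.gguf", n_ctx=4096, logits_all=True)

out = llm(
    "In Two and a Half Men, Jon Cryer plays Charlie's brother, Alan",
    max_tokens=1,
    temperature=0.0,
    logprobs=10,  # return the ten most likely next tokens with their log-probabilities
)
# If the weights were well separated, " Harper" should dominate this list.
print(out["choices"][0]["logprobs"]["top_logprobs"][0])
```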
Here's another example: "Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?"
Qwen3 30b: "The song “Dear Future Husband” was sung by Tara Reid. It's a modern pop song that gained some attention, though it wasn't a massive chart-topping hit.
Artist: Tara Reid
Album: Glamour & Chaos
Release Year: 2013
It's worth noting that while “Dear Future Husband” was somewhat of a viral or internet hit, it didn't achieve widespread mainstream success and isn't typically considered one of her major releases. The track has been more associated with social media trends than traditional radio play."
Tara Reid isn't even a singer; she's an actress. There was a male singer named Tara Reid, but he never sang a song like “Dear Future Husband” or released an album called Glamour & Chaos, and the paragraph that followed was nothing but made-up nonsense.
These aren't isolated examples. All Qwen3 models, including the massive full-float 235b one, have less weight separation across nearly every popular domain of knowledge, apart from a select few like coding, math, and STEM, where they do better than other similarly sized models. Qwen3 is the most overfit model family I've ever come across.
I have other prompts specifically designed to detect poor weight separation, the kind that usually only very small models struggle with. For example, I asked for 9 synonyms for "extra". With your recommended settings, Qwen3 30b not only output "extra" twice, it was the very word I was asking for synonyms of. With my settings (temp 0.3, minP 0.6) it didn't make that mistake, but it did repeat "additional". The smallest model to pass this test was Llama 3.2 3b, and comparably sized models like Gemma 2/3 27b, Mistral Small 2503, and even the old and poor-at-instruction-following Mixtral 8x7b ace it over and over again at the same quantization and settings.
Anyway, Qwen3 is unusually good for its size at things like coding and math, and is competitive with models like Gemma 3 27b on STEM. But its broad knowledge and abilities (e.g. poem writing) are astonishingly bad.
Western models aren't showing the same pattern. When their general Chinese knowledge is low (e.g. a low Chinese SimpleQA score), their Chinese MMLU scores are also low, and vice versa. Yet with every generation your models are becoming more and more overfit to coding, math, and the English MMLU. For example, going from Qwen2 72b to Qwen2.5 72b brought an English MMLU boost coupled with a MASSIVE drop in general knowledge on both my test and the English SimpleQA, down to a score of only 10/100. And Qwen3 235b overfit to coding, math, and MMLU even harder, scoring only 11/100 on the English SimpleQA. I suspect even that small gain was partly due to test contamination, because it got more of my own test questions wrong.