Terrible instruction following
Now I'm using qwen3:4b-thinking-2507-q4_K_M. As far as I'm concerned, it has completely lost the ability to follow instructions, at least in the context of structured output.
For example, in my case I give it the task of decomposing a user query into steps, and I pass a Pydantic schema that specifies the description, title, types, and other parameters for each field. Additionally, for reassurance, I specify in the prompt what structure to return: `{"steps": {"id": int, "step": str, "depends_on": List[int]}}`.
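To make the setup concrete, here is a minimal sketch of the kind of schema I mean (the field names come from the type hint above; the exact titles and descriptions are illustrative rather than my real ones, and I read `steps` as a list of step objects):

```python
# Minimal sketch of the schema; descriptions are illustrative, not my real ones.
from typing import List

from pydantic import BaseModel, Field


class Step(BaseModel):
    id: int = Field(description="Sequential identifier of the step")
    step: str = Field(description="What the agent should do at this step")
    depends_on: List[int] = Field(default_factory=list,
                                  description="IDs of steps this step depends on")


class Plan(BaseModel):
    steps: List[Step] = Field(description="Minimal decomposition of the user query")


# The generated JSON schema (titles, descriptions, types) is what gets passed
# to the model as the required structure for its answer.
print(Plan.model_json_schema())
```

Validating the raw model output is then just `Plan.model_validate_json(raw)`, and that is exactly what fails on responses like the ones below.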
However, in 50-70% of cases the model performs very badly and returns something like this:
```
<think>
...
</think>
The question requires splitting into minimal stages that an AI agent can process. After analyzing the user's query, the minimal stages are:
1. Clarify the driver's name
2. Check the season status and collect Verstappen's results
3. Gather recent Formula 1 news
These stages ensure that the AI can process the query efficiently while addressing both parts of the user's question (season status and results for Verstappen, and recent news). The minimal number of stages is **3**.
### Final Answer
\{boxed{3}
```
One more example:
```
<think>
...
</think>
{ "steps": [
{ "id": 1, "title": “bla bla bla bla bla” },
{ “id”: 2, “title”: “bla bla bla bla ” },
{ ‘id’: 3, “title”: "bla bla bla bla }
]
} }
```
That is, two things can be seen:
- The model clearly has problems with structured output
- The model does not understand the essence of the task, even though it is formulated unambiguously and simply
I realize one might think the system or user prompt is simply written extremely poorly. However, that's not quite true, if only because the previous model, qwen3:4b-q4_K_M, handled this prompt just perfectly.
Meanwhile, I can see from the responses that the model seems clearly smarter, but it just doesn't understand what to do or how to follow instructions.
This gives rise to several hypotheses:
- Either the model actually follows instructions worse and understands them worse
- Or the model is fundamentally different from the previous version and therefore requires a completely different approach to prompting
What I'm saying applies to both the new Thinking version and the Instruct version. Maybe this problem is exclusively related to the quantized versions; I've only tried q4_K_M so far, so a comparison roughly like the sketch below is what I'd want to run.
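For anyone who wants to check the quantization hypothesis, this is roughly the comparison I have in mind, via the ollama Python client. I'm assuming a higher-precision tag like q8_0 is published for this model in the Ollama library, which I haven't verified myself:

```python
# Rough sketch: run the same structured-output prompt against two quantizations.
# The q8_0 tag is an assumption; I've only verified q4_K_M.
import ollama

PROMPT = (
    "Decompose the user query into minimal steps and return ONLY JSON of the form "
    '{"steps": [{"id": int, "step": str, "depends_on": [int]}]}.\n\n'
    "Query: What is Verstappen's current season status, and what is the latest F1 news?"
)

for tag in ["qwen3:4b-thinking-2507-q4_K_M", "qwen3:4b-thinking-2507-q8_0"]:
    resp = ollama.chat(
        model=tag,
        messages=[{"role": "user", "content": PROMPT}],
        format="json",  # ask Ollama to constrain the output to valid JSON
    )
    print(tag, "->", resp["message"]["content"][:300])
```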
Any thoughts on this?
I want to clarify that this is strange to me primarily because, in my personal experience, qwen3:4b-q4_K_M followed instructions much better than llama3.1:8b-instruct-q4_K_M, even though LLaMA is larger. Yes, it lagged behind LLaMA in some knowledge domains, but that's also reasonable. But it is genuinely a mystery to me how the Instruct version could end up following instructions worse than the base version, and even more so when the base version used mixed-mode thinking...
It appears to be second-guessing. That is, rather than just blindly following the user's instructions, it's checking for things like veracity, ambiguity, intent, sarcasm...
Since this tiny 4b model is dumb as dirt, this is rarely beneficial and mostly just negatively impacts instruction following and performance.
For example, when asked "What is the condition called when blood sugar briefly falls too far after eating?" it initially responded with 'blood sugar doesn't fall after eating, it rises, even if diabetic', but got it right when I ran it again.
To any medically trained human the correct answer is obviously reactive or postprandial hypoglycemia. The ambiguity in how long after eating the blood sugar falls doesn't trip up humans, but it caused this model to ignore the obvious intent of my instruction to question whether blood sugar can fall after eating.
An example of how this can be beneficial is Qwen3 30b 2507's response to the prompt "Tom Cruise was married twice. Which two women did he marry?", to which it correctly responded 'Actually Tom Cruise was married three times...'
So yes, I agree with you, both performance and instruction following have been negatively impacted, but primarily because they tried to force this dumb 4b model to do smart things instead of just blindly following instructions and taking them at face value.
@phil111 I absolutely agree with all the examples that you provided, but I take the position that even 4b models can be extremely useful; the main thing is being able to get what you need out of them. And that's where a multi-agent design comes in handy.
So what I mean is that if you give even a tiny model a very strictly and precisely defined instruction and a narrow area of responsibility, and then build a multi-agent system that way, it will work well (roughly as in the sketch below). So the skill of following instructions is incredibly important for tiny models.
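Roughly what I mean, as a sketch: one narrowly scoped "decomposer" agent with a single strict instruction, schema validation, and a retry on failure. `call_llm` here is a hypothetical stand-in for whatever backend you actually use:

```python
# Sketch of a narrowly scoped "decomposer" agent: one strict instruction,
# one schema, validate-and-retry. `call_llm` is a hypothetical stand-in
# for whatever client you actually use (Ollama, an OpenAI-compatible API, etc.).
from typing import Callable, List

from pydantic import BaseModel, Field, ValidationError


class Step(BaseModel):
    id: int
    step: str
    depends_on: List[int] = Field(default_factory=list)


class Plan(BaseModel):
    steps: List[Step]


SYSTEM = (
    "You decompose the user's request into the minimal number of steps. "
    "Return ONLY JSON matching this schema, with no extra text: "
    '{"steps": [{"id": int, "step": str, "depends_on": [int]}]}'
)


def decompose(task: str, call_llm: Callable[[str, str], str], max_retries: int = 3) -> Plan:
    prompt = task
    for _ in range(max_retries):
        raw = call_llm(SYSTEM, prompt)  # hypothetical: (system, user) -> raw model text
        try:
            return Plan.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation error back; small models often fix it on retry.
            prompt = f"{task}\n\nYour previous output was invalid:\n{err}\nReturn ONLY valid JSON."
    raise RuntimeError("Model failed to produce valid structured output")
```

With one agent responsible only for decomposition, another only for executing a single step, and so on, each prompt stays small and unambiguous enough that even a 4b model has a realistic chance of following it.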
Whatever anyone says about qwen3:4b, that model really is strong at following instructions, but what happened with this release is just appalling...
And about your examples: indeed, such models memorize world facts poorly and accordingly can't spot the trick in a question. But that can still be worked around.
I hold the view that we expect too much from language neural networks. We humans don't think with a bare neural net; we have higher abstractions for that, and accordingly, in the context of LLMs, we have to build higher abstractions as well. The key role of the LLM here is to learn a world model; that's what makes LLMs so cool and effective. When a model is just memorizing facts, that's closer to overfitting to the current state of the world. It isn't the knowledge of facts but the understanding of how the world works that makes models truly effective.
When a model has a solid world model, it's able to follow instructions well. And when a model follows instructions well, it can be made into as capable an agent as your own instruction-writing skills allow.
This, by the way, is completely consistent with your statement that "they tried to force this dumb 4b model to do smart things instead of just blindly following instructions and taking them at face value."
@denisalpino The reason I'm not a fan of most models under 7b, including this one, is that they have astonishingly little weight separation.
For example, bringing the temperature down to 0.3 kept this model from producing a synonym list without numerous repeats for a word with a ton of synonyms.
Prompt: Make a simple list of 9 single word synonyms for extra.
Response: Sure! Here's a simple list of 9 single-word synonyms for “extra”:
Additional (used 3x)
Extra (used 2x, plus it's the word I asked for a synonym of)
Further
More
Additional
Spare
Bonus
Extra
Additional
This doesn't have to be the case. For example, the smaller Llama 3.2 3b at the same temperature created 9 valid synonyms and didn't use the word asked for.
These GROSSLY overfit models like Qwen3 4b are annoyingly incompetent. According to test scores like Creative Writing they score very well, but in the real world they can't write stories worth shit. If all you give it is a simple story prompt like "Write a story about fishing." it can regurgitate a quality story from its training data, hence the high scores on poorly designed benchmarks. But when you force it to write an original story by adding a couple of inclusions and exclusions, the stories are not only bad, but filled with errors and contradictions. Plus they lack variation and include the same phrases, story elements, mistakes... in response to wildly different story prompts.
In short, this 4b model's real-world performance is about 20 points lower than what many of the benchmarks indicate. Just looking at the benchmarks, like a Creative Writing score of 83.5, an MMLU Redux of 84.2, and an Instruction Following score of 83.4, suggests a very powerful model matching and exceeding the performance of larger top performers like Gemma 3 27b, but in reality it performs FAR worse at creative writing tasks, at answering questions that overlap the domains covered by the MMLU, and at understanding what prompts are asking and following their instructions.