Terrible instruction following

#2
by denisalpino - opened

I'm now using qwen3:4b-thinking-2507-q4_K_M, and as far as I can tell it has almost completely lost the ability to follow instructions, at least when it comes to structured output.

For example, in my case the task is to decompose a user query into steps, and I pass a Pydantic schema that specifies the description, title, type, and other parameters for each field. Additionally, for reassurance, I spell out in the prompt exactly what shape to return:
{"steps": {"id": int, "step": str, "depends_on": List[int]}}.
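Roughly, the setup looks like this (a simplified sketch, not my exact code: the field names and descriptions are placeholders, and I model `steps` as a list, which is the shape the output is supposed to have):

```python
from typing import List
from pydantic import BaseModel, Field


class Step(BaseModel):
    # Placeholder descriptions; the real schema is more detailed
    id: int = Field(description="Sequential step number")
    step: str = Field(description="What the agent should do at this step")
    depends_on: List[int] = Field(default_factory=list,
                                  description="IDs of the steps this one depends on")


class Plan(BaseModel):
    steps: List[Step]


# Plan.model_json_schema() is what gets passed to the runtime as the
# structured-output constraint; a valid reply should parse cleanly:
valid = '{"steps": [{"id": 1, "step": "Clarify which driver is meant", "depends_on": []}]}'
print(Plan.model_validate_json(valid))
```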

However, in 50-70% of cases the model performs very poorly and returns something like this:

<think>
...
</think>
The question requires splitting into minimal stages that an AI agent can process. After analyzing the user's query, the minimal stages are:

1. Clarify the driver's name
2. Check the season status and collect Verstappen's results
3. Gather recent Formula 1 news

These stages ensure that the AI can process the query efficiently while addressing both parts of the user's question (season status and results for Verstappen, and recent news). The minimal number of stages is **3**.

### Final Answer

\boxed{3}

One more example:

<think>
...
</think>
{ "steps": [ 
    { "id": 1, "title": “bla bla bla bla bla” },
    { “id”: 2, “title”: “bla bla bla bla ” },
    { ‘id’: 3, “title”: "bla bla bla bla } 
  ] 
} }

So two things stand out:

  1. The model clearly has problems with structured output (see the quick check below)
  2. The model does not grasp the essence of the task, even though it is formulated unambiguously and simply
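To make point 1 concrete: the second reply isn't even parseable JSON, let alone schema-conformant. A quick check (the string below is a shortened stand-in for the actual output):

```python
import json

# Shortened stand-in for the second reply above: curly quotes, an unterminated
# string, and an extra closing brace make it invalid JSON, so it fails before
# any schema validation is even attempted.
broken = '{ "steps": [ { "id": 1, "title": “bla bla” }, { ‘id’: 3, "title": "bla bla } ] } }'
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    print("invalid JSON:", exc)
```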

I realize one might think the system or user prompt is simply poorly written. However, that can't be the whole story, if only because the previous model, qwen3:4b-q4_K_M, handled this exact prompt perfectly.

At the same time, judging by the responses, the model does seem smarter; it just doesn't understand what to do or how to follow the instructions.

This leads to a couple of hypotheses:

  1. Either the model genuinely follows and understands instructions worse than before
  2. Or the model is fundamentally different from the previous version and therefore requires a completely different approach to prompting

Everything I'm describing applies to both the new thinking version and the instruct version. Maybe the problem is specific to the quantized builds; so far I've only tried q4_K_M.

Any thoughts on this?

This happens with every version; I tested each one and they all failed in the same way. The entire prompt even shows up in the reasoning, LOL. The non-reasoning model is just as dumb, if not worse! The Gemma 1B reasoning version or even Qwen3 0.6B thinks better.
