`<think>` and `</think>` missing
does this occur with the Q8 quant as well?
I'm using the Q4_K_M quant because that's the largest that fits in my 32 GB. Not sure about Q8_0.
It looks like a user on r/LocalLLaMA has run into the same issue with the Q4_K_M quant.
I can confirm that on the llama.cpp server, when using `--jinja --reasoning-format none`, the starting `<think>` tag is missing (it works without `--jinja`). I believe the issue lies in the Jinja chat template inside the GGUF file, particularly in the last lines:
```jinja
...
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
```
It seems to work when I change it to:

```jinja
...
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
```
@redeemer We directly utilized Qwen3's thinking chat template. You need to use `--jinja`, since it adds the think token; otherwise you need to set the reasoning format to qwen3, not none.
For LM Studio, you can try copying and pasting the chat template from Qwen3-30B-A3B and see if that works, but I think that's an LM Studio issue.
@danielhanchen I'm using `--reasoning-format none` on purpose, to let my own app split the content into reasoning and non-reasoning by itself.
Providing the following Jinja template, with the fix mentioned above, to the llama.cpp server via `--chat-template-file <file>` seems to fix it:
```jinja
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = ((content.split('</think>')|first).rstrip('\n').split('<think>')|last).lstrip('\n') %}
{%- set content = (content.split('</think>')|last).lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
```
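
For reference, here is a rough sketch of the kind of client-side splitting mentioned above when running with `--reasoning-format none` (illustrative only; it assumes the reply closes its reasoning with `</think>`, while the opening `<think>` may be absent because the chat template injects it into the prompt):

```python
# Illustrative sketch: split a raw completion into reasoning and answer parts
# when llama.cpp runs with --reasoning-format none (no server-side splitting).
# Assumes the model closes its reasoning with </think>; the opening <think>
# may be missing when the chat template pre-inserts it.
def split_reasoning(text: str) -> tuple[str, str]:
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        reasoning = reasoning.replace("<think>", "", 1).strip()
        return reasoning, answer.lstrip()
    return "", text  # no reasoning block found


print(split_reasoning("<think>\nchain of thought\n</think>\n\nFinal answer."))
print(split_reasoning("chain of thought\n</think>\n\nFinal answer."))  # missing opening tag
```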
`</think>` is missing for me on Q4_K_M on the latest llama.cpp as well.
> @redeemer We directly utilized Qwen3's thinking chat template. You need to use jinja since it adds the think token. Otherwise you need to set reasoning format to qwen3 not none.
> For lmstudio, you can try copying and pasting the chat template for Qwen3-30B-A3B and see if that works but I think that's an lmstudio issue
By the way, there is no `--reasoning-format qwen3`. The only available choices are `none` or `deepseek`.
I think it's important to remember what the Qwen team said in the model card:

> Additionally, to enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.

So removing the think tag from the chat template (as in the modified template above) fixes llama.cpp's behavior, but only because the model then adds the think tag itself when it starts to respond, and llama.cpp recognizes it as a proper think tag.
We should follow the official advice and force the think tag in the template (which Unsloth is already correctly doing in their template), and llama.cpp should recognize it and treat a think tag inside the template the same as a think tag generated in the model's response.
So right now I'm thinking the correct fix would be on the llama.cpp side.
So some of you probably tried the bartowski quant... any difference?
Let me investigate this a bit further today and I'll update you guys!
Just to clarify, the missing opening `<think>` tag is a problem only in specific cases, when using the llama.cpp server with `--jinja --reasoning-format none`.
If you don't use `--reasoning-format none`, it falls back to the default `<think>` and `</think>` tags (called deepseek in llama.cpp), and the llama.cpp server properly splits the reasoning content on its own - the server's API endpoint returns it under the "reasoning_content" key, while the "normal" content is returned under the "content" key.
@YearZero: Also, I don't think it's fixable in llama.cpp, because the whole point of `--reasoning-format none` is for the llama.cpp server to pass everything through as-is (as "normal" content) - so llama.cpp doesn't know what the reasoning open/close tags are.
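
For anyone who wants to check this from their own client, a minimal sketch of reading both keys from the server's OpenAI-compatible endpoint (assuming llama-server runs locally on port 8080 with `--jinja` and the default deepseek-style reasoning format; field names as described above):

```python
import requests

# Minimal sketch: query a local llama.cpp server and print the split fields.
# Assumes the server runs at http://localhost:8080 with --jinja and the default
# (deepseek-style) reasoning format, so the reasoning is returned separately.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 512,
    },
    timeout=600,
)
msg = resp.json()["choices"][0]["message"]
print("reasoning_content:", msg.get("reasoning_content"))  # the <think> ... </think> part
print("content:", msg["content"])                          # the final answer
```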
So when I use "--reasoning-format deepseek" (and also --jinja) it no longer outputs the thinking tokens to the screen in llama-server, but it hides them inside of 3 dots (...) without allowing me to expand/collapse to see the thinking tokens at all. Is that intended behavior? Is there a way to make llamacpp allow me to expand/collapse the reasoning parts in this case?
Would adding "--reasoning-format qwen" be a viable solution for llamacpp devs, since "none" and "deepseek" don't provide the desired behavior for these models?
@YearZero I might be wrong, but I think `--reasoning-format deepseek` (the default) is correct for Qwen3 - they use the same `<think>` and `</think>` tokens. However, whether it is displayed depends on the inference app or script and if it properly handles the split between `reasoning_content` and `content`. I'm using my own custom inference app, which does not. I also just checked the built-in web UI of the llama.cpp server - and interestingly, it seems that it's not showing `reasoning_content` either. I'm attaching a screenshot where you can see both `reasoning_content` and `content` being returned from the server (in the Chrome Web Developer Tools network tab), but `reasoning_content` isn't showing up anywhere. The model is "thinking" internally - it's just not being displayed.
As an update, I reuploaded the model.
@johnbean393 @netroy @soulhacker @redeemer @urtuuuu - we verified that removing the `<think>` is fine, since the model's probability of producing the think token seems to be nearly 100% anyway.
This should make llama.cpp / LM Studio inference work! Please redownload the weights or, as @redeemer mentioned, simply delete the `<think>` token in the chat template, i.e. change the below:
```jinja
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
```
to:
```jinja
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
```
See https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF?chat_template=default or https://huggingface.co/unsloth/Qwen3-30B-A3B-Thinking-2507/raw/main/chat_template.jinja
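
As a side note, a rough way to sanity-check the "nearly 100%" claim with transformers - this is an illustrative sketch, not the exact verification that was run; it assumes the model fits in local memory and strips the pre-inserted `<think>` if the template still adds it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: estimate the probability that the model's first generated
# token is <think> when the chat template does NOT pre-insert it.
model_id = "unsloth/Qwen3-30B-A3B-Thinking-2507"  # large; any Qwen3 thinking model shows the idea
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    tokenize=False,
)
# If the template still appends "<think>\n", strip it so we measure the model's own choice.
prompt = prompt.removesuffix("<think>\n")

inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)

think_id = tok.convert_tokens_to_ids("<think>")
print(f"P(next token = <think>) = {probs[think_id].item():.4f}")
```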
Thanks @danielhanchen for linking the template. That was so maddening I tried to get Qwen to diagnose her own template. But she died once she typed `<|im_end|>`.
Can confirm that updating the Jinja template in LM Studio to the one linked above fixes the thinking issue in LM Studio.
Looking good for me too!
I couldn't handle that template any longer. If you want a denser thinking section, do the following:
Replace this line:
```jinja
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
```
with this one:
```jinja
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>' + content.lstrip('\n') }}
```