To Think or Not to Think: A Router for Hybrid LLMs

Published November 16, 2025


This is the project I worked on during the summer, and I figured it might be useful to turn it into a blog post explaining what I did and what the results were.

OpenAI released o1 in 2024, which marked a big shift in how LLMs work by introducing the idea of test-time compute: instead of always spending a fixed number of tokens on an answer, a model can spend more tokens (“think”) on harder questions, leading to better and more accurate outputs [2].

Later, in 2025, we got DeepSeek-R1, one of the first large-scale open reasoning models [3]. Then came the Qwen3 family, among the first hybrid models: you can flip a flag to switch thinking mode on or off [4]. Around the same time, OpenAI was releasing o3, GPT-4.5, o4-mini, o4-mini-high, and many other variants, which made it increasingly hard to decide which model to use for a given task.

A lot of the time, using o3 was great — but for some tasks I just wanted the answer as fast as possible. Using a slow, reasoning-heavy model for a simple task quickly became annoying.

That’s when I started working on this project: a router that predicts whether a given task requires thinking or no thinking. Intuitively, this makes a lot of sense: instead of manually deciding whether to turn thinking mode on or off, you could have an Auto option and let a classifier decide. This is especially useful for hybrid models (like Qwen3’s think/no-think flag), and builds on previous work on LLM routing such as RouteLLM [1].


Problem Setup

To build this router, I first needed a way to get paired data: for each user query, I wanted

  • a response generated with thinking (reasoning mode on), and
  • a response generated without thinking (reasoning mode off),

plus a label indicating which mode is better (or whether thinking is unnecessary).

Concretely, given a query $x$, I want to train a classifier that predicts a label:

$$y \in \{\text{think}, \text{no\_think}\}$$

such that:

  • think is used when a reasoning-style response massively improves quality, and
  • no_think is used when the extra reasoning does not justify the extra tokens.

To scale this up, I wanted the data to be mostly synthetic. I focused on Qwen3 models (especially Qwen3-8B), since it’s one of the models I regularly use for local tasks.

The key idea was:

  1. For a given query $x$,
  2. use the same base model (the same underlying policy $\pi$) with:
    • thinking mode on → sample $a_{\text{think}}$
    • thinking mode off → sample $a_{\text{no\_think}}$
  3. score the two outputs against each other using a scorer or reward model, and
  4. convert that into a supervision signal for the router.

This setup is much cleaner when both outputs come from the same underlying model $\pi$, just with different test-time compute.
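As a concrete illustration, here is a minimal sketch of generating such a pair with Hugging Face transformers, toggling Qwen3's thinking mode through the chat template. This is a simplified stand-in for the project's actual generation pipeline; the exact sampling presets used are listed in the appendix.

```python
# Sketch: sample a think / no-think pair from the same policy (Qwen3-8B).
# Simplified illustration; the project's exact sampling presets are in the appendix.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

def sample(query: str, thinking: bool) -> str:
    """Generate one response, with Qwen3's thinking mode toggled via the chat template."""
    messages = [{"role": "user", "content": query}]
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # the hybrid think / no-think switch
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=True)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

query = "What is the sum of the first 100 positive integers?"
a_think = sample(query, thinking=True)      # response produced in thinking mode
a_no_think = sample(query, thinking=False)  # direct response, no reasoning trace
```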


Data Collection

I needed datasets that capture two main aspects:

  1. Real-world queries: diverse user questions
  2. Clearly reasoning-heavy tasks: e.g., math or coding problems

I ended up splitting everything into open-ended vs closed-ended datasets.

Open-Ended Datasets

These included:

  • WildChat (filtered)
  • Nectar

They contain real user chats, which makes them realistic but tricky to score.

Due to budget constraints, I filtered aggressively using:

  • only English
  • only single-turn interactions
  • only samples scored highly by HuggingFaceFW/fineweb-edu-classifier
  • only samples where the model was GPT-4 (heuristic to keep modern, harder queries)
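For the quality filter in particular, a minimal sketch of scoring queries with the fineweb-edu classifier could look like the following; the cutoff of 2.5 is an illustrative assumption, not necessarily the threshold used in the project.

```python
# Sketch: keep only queries the fineweb-edu classifier scores highly.
# The 2.5 cutoff is illustrative; the project's actual threshold may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

clf_name = "HuggingFaceFW/fineweb-edu-classifier"
clf_tokenizer = AutoTokenizer.from_pretrained(clf_name)
clf_model = AutoModelForSequenceClassification.from_pretrained(clf_name)

def edu_score(text: str) -> float:
    """Return the classifier's educational-quality score (roughly a 0-5 scale)."""
    inputs = clf_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf_model(**inputs).logits.squeeze(-1)
    return logits.item()

queries = [
    "Explain how attention works in a transformer and derive its complexity.",
    "hiii",
]
filtered_queries = [q for q in queries if edu_score(q) >= 2.5]
```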

For each query, I generated:

  1. a think-mode output
  2. a no-think-mode output

These were later scored by a reward model.

Closed-Ended Datasets

I wanted tasks where:

  • correctness is easy to verify
  • non-thinking models perform noticeably worse

I used:

For these datasets, each question has a known ground-truth answer $y^*$.

Extracting final answers allowed me to compute accuracy flags:

  • $\text{acc}_{\text{think}}(x) = 1$ if the think-mode answer matches $y^*$, else 0
  • $\text{acc}_{\text{no\_think}}(x) = 1$ if the no-think answer matches $y^*$, else 0

These binary values are used during labeling.
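A rough sketch of how these flags can be computed, assuming the model is prompted to report its final answer in a `\boxed{...}` wrapper (an assumption about the prompt format; the project's extraction logic may differ):

```python
# Sketch: extract a final answer and turn it into accuracy flags.
# Assumes answers are reported inside \boxed{...}; the real extraction may differ.
import re

def extract_final_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...}, or None if there is none."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def accuracy_flag(response: str, ground_truth: str) -> int:
    """1 if the extracted final answer matches the ground truth, else 0."""
    answer = extract_final_answer(response)
    return int(answer is not None and answer == ground_truth.strip())

acc_think = accuracy_flag("... so the answer is \\boxed{5050}.", "5050")  # -> 1
acc_no_think = accuracy_flag("The answer is 5050.", "5050")               # -> 0 (no boxed answer)
```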


Data Labeling

Closed-Ended Data

Labeling is straightforward:

  • If think-mode is correct and no-think-mode is wrong → label think
  • If no-think is correct and think-mode is wrong → label no_think

Edge cases:

  • If both are correct → label no_think (no need for extra tokens)
  • If both are wrong → label think (future larger models may solve it with more reasoning)
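Putting the two rules and the edge cases together, the closed-ended labeling reduces to a small function over the two accuracy flags (a sketch of exactly the logic described above):

```python
# Sketch: closed-ended labeling from the two accuracy flags.
def label_closed_ended(acc_think: int, acc_no_think: int) -> str:
    if acc_think and not acc_no_think:
        return "think"     # reasoning was needed to get the answer right
    if acc_no_think and not acc_think:
        return "no_think"  # the direct answer was already correct
    if acc_think and acc_no_think:
        return "no_think"  # both correct: no need for the extra tokens
    return "think"         # both wrong: keep it as a hard, reasoning-worthy example

assert label_closed_ended(1, 0) == "think"
assert label_closed_ended(1, 1) == "no_think"
```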

Open-Ended Data

The open-ended think / no-think pairs were scored with the Skywork-Reward-V2-Llama-3.1-8B reward model [7].

Let:

  • $r_{\text{think}} = r(f_{\text{think}}(x))$
  • $r_{\text{no\_think}} = r(f_{\text{no\_think}}(x))$

I remove the <think>...</think> chain-of-thought before scoring.

If the absolute difference between the rewards is small, i.e.

$$\left| r_{\text{think}} - r_{\text{no\_think}} \right| \le \varepsilon,$$

I label the sample as no_think.

Otherwise, I choose the mode with the higher reward:

$$y = \begin{cases} \text{think} & \text{if } r_{\text{think}} > r_{\text{no\_think}}, \\ \text{no\_think} & \text{otherwise}. \end{cases}$$

This encourages the model to choose thinking mode only when there is a clear benefit.
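A minimal sketch of this open-ended labeling step is shown below. It follows the usual sequence-classification reward-model pattern for the Skywork model (check the model card for exact usage), and the ε value is illustrative rather than the threshold actually used in the project.

```python
# Sketch: score both responses with the reward model and apply the epsilon rule.
# EPSILON is an illustrative value, not the project's actual threshold.
import re
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
rm = AutoModelForSequenceClassification.from_pretrained(
    rm_name, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

EPSILON = 0.5  # small reward gaps are treated as "no clear benefit from thinking"

def strip_cot(response: str) -> str:
    """Drop the <think>...</think> chain-of-thought before scoring."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

def score(prompt: str, response: str) -> float:
    conv = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    input_ids = rm_tokenizer.apply_chat_template(conv, tokenize=True, return_tensors="pt").to(rm.device)
    with torch.no_grad():
        return rm(input_ids).logits[0][0].item()

def label_open_ended(prompt: str, resp_think: str, resp_no_think: str) -> str:
    r_think = score(prompt, strip_cot(resp_think))
    r_no_think = score(prompt, resp_no_think)
    if abs(r_think - r_no_think) <= EPSILON:
        return "no_think"  # rewards are close: thinking is not worth the extra tokens
    return "think" if r_think > r_no_think else "no_think"
```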


Training

After labeling, I had about 70k samples:

Dataset:
https://huggingface.co/datasets/AmirMohseni/reasoning-router-data-v2

Each sample includes:

  • a query
  • a label (think / no_think)
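The labeled data is on the Hub, so it can be pulled directly with the datasets library; the split and column names below are assumptions based on the description above.

```python
# Sketch: load the labeled router data from the Hub.
# Split and column names ("train", "query", "label") are assumed from the description above.
from datasets import load_dataset

ds = load_dataset("AmirMohseni/reasoning-router-data-v2")
print(ds)              # available splits and their sizes
print(ds["train"][0])  # e.g. {"query": "...", "label": "think" or "no_think"}
```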

This is a pure classification task:

$$y \in \{\text{think}, \text{no\_think}\}$$

I tested several architectures:

  • BERT variants (encoder-only)
  • Qwen3-0.6B (decoder-only)

I ultimately selected:

  • Qwen3-0.6B
  • mmBERT-small
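Training itself is standard sequence classification with transformers. The sketch below uses mmBERT-small as the backbone; the checkpoint id, hyperparameters, and the assumption that the label column holds think / no_think strings are all illustrative rather than the project's exact configuration.

```python
# Sketch: fine-tune a small encoder as the think / no-think router.
# Checkpoint id and hyperparameters are illustrative, not the project's exact setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

label2id = {"no_think": 0, "think": 1}
backbone = "jhu-clsp/mmBERT-small"  # assumed repo id for the mmBERT-small backbone

tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(
    backbone,
    num_labels=2,
    label2id=label2id,
    id2label={v: k for k, v in label2id.items()},
)

ds = load_dataset("AmirMohseni/reasoning-router-data-v2")

def preprocess(batch):
    enc = tokenizer(batch["query"], truncation=True, max_length=512)
    enc["labels"] = [label2id[label] for label in batch["label"]]
    return enc

tokenized = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="reasoning-router-mmbert-small",
        per_device_train_batch_size=32,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```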

You can also try the router model yourself on the Hugging Face Space:
https://huggingface.co/spaces/AmirMohseni/Reasoning-Router
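Once trained, dropping the router in front of a hybrid model takes only a few lines with the text-classification pipeline. The model path below is a placeholder for your own fine-tuned checkpoint (or one of the released router checkpoints from the collection in the appendix).

```python
# Sketch: use a trained router to decide, per query, whether to enable thinking.
# "path/to/reasoning-router" is a placeholder, not a published repo id.
from transformers import pipeline

router = pipeline("text-classification", model="path/to/reasoning-router")

def should_think(query: str) -> bool:
    pred = router(query)[0]  # e.g. {"label": "think", "score": 0.93}
    return pred["label"] == "think"

query = "Prove that the sum of two even integers is even."
enable_thinking = should_think(query)  # feed this into the chat template / generation call
```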

Training Logs

Below are the key training curves for the router model, including train/eval loss, F1 score, and accuracy.


You can view the W&B dashboard here:
https://api.wandb.ai/links/amirmohseni-maastricht-university/zmldcnzd

Results

After training, it was time to test the trained router. I evaluated it on the test splits of the open-ended datasets and on some of the newest math benchmarks from 2025 to measure how well it decides when to use thinking mode and when to skip it.

WildChat (Test)


Summary Table for WildChat-filtered-Qwen3-8B-Splits

(based on WildChat-filtered-Qwen3-8B-Scored)

| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 22.6868 | - | - | 0.0 |
| Think (Baseline) | 24.7362 | - | - | 100.0 |
| reasoning-router-0.6b | 24.0576 | +1.3708 | -0.6787 | 42.9% |
| reasoning-router-mmbert-small | 24.1236 | +1.4368 | -0.6126 | 46.3% |
| routellm-bert | 23.3694 | +0.6826 | -1.3668 | 33.5% |
| routellm-mf | 22.6865 | -0.0004 | -2.0498 | 0.2% |

Nectar (Test)


Summary Table for Nectar-Qwen3-8B-Splits

(based on Nectar-Qwen3-8B)

| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 8.6090 | - | - | 0.0 |
| Think (Baseline) | 9.9649 | - | - | 100.0 |
| reasoning-router-0.6b | 9.6866 | +1.0776 | -0.2784 | 28.5% |
| reasoning-router-mmbert-small | 8.9183 | +0.3093 | -1.0466 | 9.9% |
| routellm-bert | 9.9603 | +1.3513 | -0.0046 | 99.7% |
| routellm-mf | 8.5713 | -0.0377 | -1.3936 | 1.9% |

Overall, the routers perform significantly better than the no-think baseline while using fewer tokens than full thinking mode, and they strike a better quality/cost trade-off than the RouteLLM baselines: routellm-mf almost never chooses thinking, and on Nectar routellm-bert only matches the think baseline by routing 99.7% of queries to thinking mode.

AIME 2025

Qwen3-8B

(Figures: accuracy comparison and token count comparison.)

Summary Table for AIME25 – Qwen3-8B

(based on AIME-1983-2024-Qwen3-8B)

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1267 | - | - | 3411.5 | 0.0 |
| Think (Baseline) | 0.2400 | - | - | 12835.2 | 100.0 |
| reasoning-router-0.6b | 0.1933 | +0.0667 | -0.0467 | 11815.0 | 90.0% |
| reasoning-router-mmbert-small | 0.2267 | +0.1000 | -0.0133 | 12322.1 | 93.3% |
| routellm-bert | 0.2333 | +0.1067 | -0.0067 | 12553.6 | 96.7% |
| routellm-mf | 0.1267 | +0.0000 | -0.1133 | 3411.5 | 0.0% |

Qwen3-30B

(Figures: accuracy comparison and token count comparison.)

Summary Table for AIME25 – Qwen3-30B

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1933 | - | - | 1178.2 | 0.0 |
| Think (Baseline) | 0.6400 | - | - | 11773.5 | 100.0 |
| reasoning-router-0.6b | 0.5667 | +0.3733 | -0.0733 | 11112.8 | 90.0% |
| reasoning-router-mmbert-small | 0.6067 | +0.4133 | -0.0333 | 11477.9 | 93.3% |
| routellm-bert | 0.6133 | +0.4200 | -0.0267 | 11593.9 | 96.7% |
| routellm-mf | 0.1933 | +0.0000 | -0.4467 | 1178.2 | 0.0% |

HMMT 2025

Qwen3-8B

(Figures: accuracy comparison and token count comparison.)

Summary Table for HMMT Feb 2025 – Qwen3-8B

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0467 | - | - | 5210.1 | 0.0 |
| Think (Baseline) | 0.0800 | - | - | 13628.4 | 100.0 |
| reasoning-router-0.6b | 0.0533 | +0.0067 | -0.0267 | 10425.3 | 60.0% |
| reasoning-router-mmbert-small | 0.0867 | +0.0400 | +0.0067 | 12360.6 | 83.3% |
| routellm-bert | 0.0800 | +0.0333 | +0.0000 | 13262.2 | 96.7% |
| routellm-mf | 0.0467 | +0.0000 | -0.0333 | 5210.1 | 0.0% |

Qwen3-30B

(Figures: accuracy comparison and token count comparison.)

Summary Table for HMMT Feb 2025 – Qwen3-30B

| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0867 | - | - | 1038.2 | 0.0 |
| Think (Baseline) | 0.4600 | - | - | 11718.4 | 100.0 |
| reasoning-router-0.6b | 0.3133 | +0.2267 | -0.1467 | 7597.4 | 60.0% |
| reasoning-router-mmbert-small | 0.3933 | +0.3067 | -0.0667 | 9801.3 | 83.3% |
| routellm-bert | 0.4267 | +0.3400 | -0.0333 | 11281.4 | 96.7% |
| routellm-mf | 0.0867 | +0.0000 | -0.3733 | 1038.2 | 0.0% |

On the math-competition benchmarks, the reasoning routers consistently improve over the non-thinking versions of the Qwen3 models and recover much of the accuracy of full thinking mode. They also generalize reasonably well to Qwen3-30B, although larger models were not evaluated in these experiments.

Limitations

There are several limitations in this work that I plan to address in future iterations:

  • Model Architecture Constraints
    The router was trained only on transformer-based architectures (encoder-only and decoder-only), without exploring alternatives such as mixture-of-experts or lightweight attention variants.

  • Reward Model Limitations
    For open-ended tasks, scoring relied on Skywork-Reward-V2-Llama-3.1-8B [7]. Reward models can be imperfect proxies for human judgment, which may introduce systematic labeling biases.

  • Data Diversity
    The dataset did not include:

    • multilingual queries
    • coding tasks
    • multi-turn conversations
    • multimodal inputs (e.g., images)

    These exclusions were mainly due to compute and time constraints.

  • Model Size Constraints
    Most evaluations were performed on Qwen3-8B, with only limited testing on Qwen3-30B. Larger and more recent hybrid models remain unexplored.

  • Modern Hybrid Models
    Newer models—such as OpenAI’s o3/o4-mini and GPT-OSS reasoning-effort variants—support multiple tiers of reasoning effort. Extending the router beyond a binary “think / no-think” decision to multi-level reasoning selection is a promising direction for future work.

Conclusions

In this blog post, I walked through the experiments and methodology used to train a reasoning-effort router on primarily synthetic data and tasks with verifiable answers. The results show that, even without the explicit human preference data required by works like RouteLLM [1], it is possible to build a router that reliably predicts when thinking mode is beneficial. By the time I finished training, OpenAI had released GPT-5 [5], which includes a built-in router for automatically selecting reasoning mode, further validating the motivation behind this project. Overall, this work demonstrates a promising, lightweight approach to test-time compute allocation for hybrid LLMs.

References

[1] RouteLLM – Learning to Route LLMs with Preference Data
https://arxiv.org/abs/2406.18665

[2] OpenAI o1 – Learning to Reason with LLMs
https://openai.com/index/learning-to-reason-with-llms/

[3] DeepSeek-R1 – Reinforcement Learning for Reasoning
https://arxiv.org/abs/2501.12948

[4] Qwen3 Technical Report – Hybrid reason/no-reason models
https://arxiv.org/abs/2505.09388

[5] OpenAI – Introducing GPT-5
https://openai.com/index/introducing-gpt-5/

[6] DeepSeek V3.1 – Hybrid Reasoning Announcement
https://api-docs.deepseek.com/news/news250821

[7] Skywork Reward Model – Skywork-Reward-V2-Llama-3.1-8B
https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B

Appendix

Hugging Face Collections

The full collection of datasets, models, and router artifacts used in this project is available here:

Reasoning Router Collection
https://huggingface.co/collections/AmirMohseni/reasoning-router


Datasets

The main datasets used to generate think / no-think pairs and to evaluate the router (including reasoning-router-data-v2 and the scored WildChat and Nectar splits referenced in the results) are part of the collection linked above.


Code Repository

The repository containing all training scripts, evaluation pipelines, and data-processing utilities can be found here:

LLM-Router (GitHub)
https://github.com/Amir-Mohseni/LLM-Router


Qwen3-8B Sampling Parameters

These were the exact sampling presets used when generating paired think / no-think outputs for Qwen3-8B, following the recommended settings from the official Qwen3 documentation.

# Sampling presets (recommended Qwen3 settings)

# Thinking mode: enable_thinking=True in the chat template
THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": True},
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
}

# Non-thinking mode: enable_thinking=False, with slightly different sampling settings
NON_THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": False},
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}

Citation

If you’d like to cite this work, you can use the following BibTeX entry:

@misc{mohseni2025reasoningrouter,
  author       = {Mohseni, Amir},
  title        = {To Think or Not to Think: A Router for Hybrid LLMs},
  howpublished = {Hugging Face Blog Post},
  month        = {November},
  year         = {2025},
  url          = {https://huggingface.co/blog/AmirMohseni/reasoning-router}
}
