To Think or Not to Think: A Router for Hybrid LLMs
This is the project I worked on during the summer, and I figured it might be useful to turn it into a blog post explaining what I did and what the results were.
OpenAI released o1 in 2024, which marked a big shift in how LLMs work by popularizing test-time compute: instead of spending a roughly fixed number of tokens on every question, a model can “think” longer on harder ones, producing better and more accurate answers [2].
In early 2025, we got DeepSeek-R1, one of the first large-scale open reasoning models [3]. Then came the Qwen3 family, which included some of the first hybrid models: you can flip a flag to turn thinking mode on or off [4]. Around the same time, OpenAI was releasing o3, GPT-4.5, o4-mini, o4-mini-high, and many other variants, which made it increasingly hard to decide which model to use for a given task.
A lot of the time, using o3 was great — but for some tasks I just wanted the answer as fast as possible. Using a slow, reasoning-heavy model for a simple task quickly became annoying.
That’s when I started working on this project: a router that predicts whether a given task requires thinking or no thinking. Intuitively, this makes a lot of sense: instead of manually deciding whether to turn thinking mode on or off, you could have an Auto option and let a classifier decide. This is especially useful for hybrid models (like Qwen3’s think/no-think flag), and builds on previous work on LLM routing such as RouteLLM [1].
Problem Setup
To build this router, I first needed a way to get paired data: for each user query, I wanted
- a response generated with thinking (reasoning mode on), and
- a response generated without thinking (reasoning mode off),
plus a label indicating which mode is better (or whether thinking is unnecessary).
Concretely, given a query $x$, I want to train a classifier $f$ that predicts a label

$$f(x) \in \{\texttt{think}, \texttt{no\_think}\}$$

such that:
- `think` is used when a reasoning-style response massively improves quality, and
- `no_think` is used when the extra reasoning does not justify the extra tokens.
To scale this up, I wanted the data to be mostly synthetic. I focused on Qwen3 models (especially Qwen3-8B), since it’s one of the models I regularly use for local tasks.
The key idea was:
For a given query $x$:
1. Use the same base model (the same underlying policy $\pi_\theta$) with:
   - thinking mode on → sample $y_{\text{think}} \sim \pi_\theta(\cdot \mid x, \text{think})$
   - thinking mode off → sample $y_{\text{no\_think}} \sim \pi_\theta(\cdot \mid x, \text{no\_think})$
2. Score these two outputs against each other using a scorer or reward model, and
3. Convert that into a supervision signal for the router.

This setup is much cleaner when both outputs come from the same underlying model $\pi_\theta$, just with different test-time compute.
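To make the pairing concrete, here is a minimal sketch of how one such pair can be generated with Hugging Face `transformers` and Qwen3's `enable_thinking` chat-template flag. It is illustrative only; the generation settings actually used are listed in the appendix, and the real pipeline may batch generations differently.

```python
# Sketch: generate a think / no-think pair for one query with Qwen3-8B (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def generate(query: str, enable_thinking: bool, **sampling) -> str:
    messages = [{"role": "user", "content": query}]
    # The Qwen3 chat template accepts `enable_thinking` to toggle reasoning mode.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=enable_thinking
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8192, **sampling)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

query = "What is the sum of the first 100 positive integers?"
y_think = generate(query, True, do_sample=True, temperature=0.6, top_p=0.95, top_k=20)
y_no_think = generate(query, False, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
```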
Data Collection
I needed datasets that capture two main aspects:
- Real-world queries: diverse user questions
- Clearly reasoning-heavy tasks: e.g., math or coding problems
I ended up splitting everything into open-ended vs closed-ended datasets.
Open-Ended Datasets
These included:
- WildChat (filtered real-world user chats)
- Nectar (open-ended chats)

They contain real user chats, which makes them realistic but tricky to score.
Due to budget constraints, I filtered aggressively (a rough sketch of this filtering is shown below), keeping:
- only English queries
- only single-turn interactions
- only samples scored highly by `HuggingFaceFW/fineweb-edu-classifier`
- only samples where the original model was GPT-4 (a heuristic to keep modern, harder queries)
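As a rough illustration, the filtering above could look like this with the `datasets` library. The WildChat column names (`language`, `turn`, `model`, `conversation`) and the score threshold are assumptions for the sketch, not the exact values used.

```python
# Sketch of the open-ended filtering step (column names and threshold are assumptions).
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ds = load_dataset("allenai/WildChat-1M", split="train")

# Keep English, single-turn, GPT-4 conversations.
ds = ds.filter(
    lambda ex: ex["language"] == "English"
    and ex["turn"] == 1
    and "gpt-4" in ex["model"]
)

clf_id = "HuggingFaceFW/fineweb-edu-classifier"
clf_tok = AutoTokenizer.from_pretrained(clf_id)
clf = AutoModelForSequenceClassification.from_pretrained(clf_id)

@torch.no_grad()
def edu_score(text: str) -> float:
    # The classifier outputs a single regression logit (roughly a 0-5 "educational" score).
    inputs = clf_tok(text, return_tensors="pt", truncation=True)
    return clf(**inputs).logits.squeeze().item()

# Keep queries the classifier rates highly (threshold is illustrative).
ds = ds.filter(lambda ex: edu_score(ex["conversation"][0]["content"]) >= 2.5)
```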
For each query, I generated:
- a think-mode output
- a no-think-mode output
These were later scored by a reward model.
Closed-Ended Datasets
I wanted tasks where:
- correctness is easy to verify
- non-thinking models perform noticeably worse
I used:
- AIME-1983-2024-Qwen3-8B (competition math)
- Big-Math-RL-Qwen3-8B
For these datasets, each question has a known ground-truth answer $y^{*}$.
Extracting final answers allowed me to compute accuracy flags:
- $a_{\text{think}} = 1$ if the think-mode answer matches $y^{*}$, else $0$
- $a_{\text{no\_think}} = 1$ if the no-think answer matches $y^{*}$, else $0$
These binary values are used during labeling.
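Here is a small sketch of what the answer extraction and accuracy flags can look like. The real pipeline may use a more robust math parser, so treat the regex and helper names as illustrative.

```python
# Sketch: extract the final answer from a completion and compute an accuracy flag.
import re

def extract_boxed(text: str) -> str | None:
    # Take the last \boxed{...} occurrence as the final answer (no nested braces handled).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy_flag(completion: str, ground_truth: str) -> int:
    answer = extract_boxed(completion)
    return int(answer is not None and answer == str(ground_truth).strip())

a_think = accuracy_flag("... so the answer is \\boxed{5050}.", "5050")  # -> 1
a_no_think = accuracy_flag("The answer is 5049.", "5050")               # -> 0
```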
Data Labeling
Closed-Ended Data
Labeling is straightforward:
- If think-mode is correct and no-think-mode is wrong → label `think`
- If no-think is correct and think-mode is wrong → label `no_think`

Edge cases:
- If both are correct → label `no_think` (no need for extra tokens)
- If both are wrong → label `think` (future, larger models may solve it with more reasoning)
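Translated directly into a helper function (the names are mine), the rule looks like this:

```python
def label_closed_ended(a_think: int, a_no_think: int) -> str:
    """Label a closed-ended sample from the two accuracy flags."""
    if a_think == 1 and a_no_think == 0:
        return "think"       # reasoning was needed to get it right
    if a_no_think == 1 and a_think == 0:
        return "no_think"    # reasoning added nothing
    if a_think == 1 and a_no_think == 1:
        return "no_think"    # both correct: skip the extra tokens
    return "think"           # both wrong: keep it as a hard, reasoning-style query
```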
Open-Ended Data
Open-ended samples were scored by the Skywork-Reward-V2-Llama-3.1-8B reward model [7].

Let

$$r_{\text{think}} = R(x, y_{\text{think}}), \qquad r_{\text{no\_think}} = R(x, y_{\text{no\_think}})$$

where $R$ is the reward model; I remove the `<think>...</think>` chain-of-thought from $y_{\text{think}}$ before scoring.

If the absolute difference between the rewards is small, i.e.

$$|r_{\text{think}} - r_{\text{no\_think}}| < \tau,$$

I label the sample as `no_think`. Otherwise, I choose the mode with the higher reward:

$$y = \begin{cases} \texttt{think} & \text{if } r_{\text{think}} > r_{\text{no\_think}} \\ \texttt{no\_think} & \text{otherwise.} \end{cases}$$
This encourages the model to choose thinking mode only when there is a clear benefit.
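A sketch of this scoring-and-thresholding step is below. It assumes the Skywork reward model can be loaded as a standard sequence-classification head, and the threshold value shown is illustrative rather than the one actually used.

```python
# Sketch of the open-ended labeling step (threshold tau is illustrative).
import re
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
rm_tok = AutoTokenizer.from_pretrained(rm_id)
rm = AutoModelForSequenceClassification.from_pretrained(
    rm_id, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

def strip_think(text: str) -> str:
    # Remove the <think>...</think> chain-of-thought before scoring.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

@torch.no_grad()
def reward(query: str, response: str) -> float:
    conv = [{"role": "user", "content": query}, {"role": "assistant", "content": response}]
    input_ids = rm_tok.apply_chat_template(conv, tokenize=True, return_tensors="pt").to(rm.device)
    return rm(input_ids).logits[0][0].item()

def label_open_ended(query: str, y_think: str, y_no_think: str, tau: float = 1.0) -> str:
    r_think = reward(query, strip_think(y_think))
    r_no_think = reward(query, y_no_think)
    if abs(r_think - r_no_think) < tau:
        return "no_think"    # no clear benefit from thinking
    return "think" if r_think > r_no_think else "no_think"
```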
Training
After labeling, I had about 70k samples:
Dataset:
https://huggingface.co/datasets/AmirMohseni/reasoning-router-data-v2
Each sample includes:
- a query
- a label (`think` / `no_think`)
This is a pure classification task: the router sees only the query $x$ and predicts $f(x) \in \{\texttt{think}, \texttt{no\_think}\}$.
I tested several architectures:
- BERT variants (encoder-only)
- Qwen3-0.6B (decoder-only)
I ultimately selected the following two (a minimal fine-tuning sketch follows the list):
- Qwen3-0.6B
- mmBERT-small
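The sketch below shows how the Qwen3-0.6B variant could be fine-tuned as a sequence classifier. The dataset column names (`query`, `label`), split handling, and hyperparameters are assumptions; check the dataset card and the GitHub repository for the exact training setup.

```python
# Minimal fine-tuning sketch for the router (schema and hyperparameters are assumptions).
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

base_model = "Qwen/Qwen3-0.6B"
label2id = {"no_think": 0, "think": 1}
id2label = {v: k for k, v in label2id.items()}

tok = AutoTokenizer.from_pretrained(base_model)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

def preprocess(ex):
    enc = tok(ex["query"], truncation=True, max_length=512)
    enc["labels"] = label2id[ex["label"]]
    return enc

raw = load_dataset("AmirMohseni/reasoning-router-data-v2", split="train")
data = raw.map(preprocess, remove_columns=raw.column_names).train_test_split(test_size=0.1, seed=42)

model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=2, id2label=id2label, label2id=label2id
)
model.config.pad_token_id = tok.pad_token_id  # decoder-only models need an explicit pad token

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="reasoning-router-0.6b",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=2,
        eval_strategy="epoch",
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorWithPadding(tok),
    compute_metrics=compute_metrics,
)
trainer.train()
```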
You can also try the router model yourself on the Hugging Face Space:
https://huggingface.co/spaces/AmirMohseni/Reasoning-Router
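You can also load it locally. The snippet below assumes the router is published as a standard text-classification checkpoint; the model id is a guess based on the names used in the results tables, so check the collection in the appendix for the released checkpoints.

```python
# Using the trained router at inference time (model id is an assumption).
from transformers import pipeline

router = pipeline("text-classification", model="AmirMohseni/reasoning-router-0.6b")

query = "Prove that the sum of two even integers is even."
decision = router(query)[0]["label"]      # expected: "think" or "no_think"
enable_thinking = decision == "think"     # feed this into the hybrid model's thinking flag
print(decision, enable_thinking)
```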
Training Logs
The key training curves for the router model (train/eval loss, F1 score, and accuracy) are available on the W&B dashboard:
https://api.wandb.ai/links/amirmohseni-maastricht-university/zmldcnzd
Results
After training, I evaluated the router on the open-ended test sets and on some of the newest math benchmarks from 2025, to measure how well it decides when to use thinking mode and when to skip it.
WildChat (Test)
Summary Table for WildChat-filtered-Qwen3-8B-Splits
(based on WildChat-filtered-Qwen3-8B-Scored)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 22.6868 | - | - | 0.0% |
| Think (Baseline) | 24.7362 | - | - | 100.0% |
| reasoning-router-0.6b | 24.0576 | +1.3708 | -0.6787 | 42.9% |
| reasoning-router-mmbert-small | 24.1236 | +1.4368 | -0.6126 | 46.3% |
| routellm-bert | 23.3694 | +0.6826 | -1.3668 | 33.5% |
| routellm-mf | 22.6865 | -0.0004 | -2.0498 | 0.2% |
Nectar (Test)
Summary Table for Nectar-Qwen3-8B-Splits
(based on Nectar-Qwen3-8B)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 8.6090 | - | - | 0.0% |
| Think (Baseline) | 9.9649 | - | - | 100.0% |
| reasoning-router-0.6b | 9.6866 | +1.0776 | -0.2784 | 28.5% |
| reasoning-router-mmbert-small | 8.9183 | +0.3093 | -1.0466 | 9.9% |
| routellm-bert | 9.9603 | +1.3513 | -0.0046 | 99.7% |
| routellm-mf | 8.5713 | -0.0377 | -1.3936 | 1.9% |
Overall, the routers score well above the no-think baseline while triggering thinking mode on fewer than half of the queries, and they offer a better reward-per-token trade-off than the RouteLLM baselines: routellm-mf almost never routes to thinking, and routellm-bert only approaches the think baseline on Nectar by routing 99.7% of queries to thinking mode.
AIME 2025
Qwen3-8B
(Figures: accuracy comparison and token count comparison)
Summary Table for AIME25 – Qwen3-8B
(based on AIME-1983-2024-Qwen3-8B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1267 | - | - | 3411.5 | 0.0% |
| Think (Baseline) | 0.2400 | - | - | 12835.2 | 100.0% |
| reasoning-router-0.6b | 0.1933 | +0.0667 | -0.0467 | 11815.0 | 90.0% |
| reasoning-router-mmbert-small | 0.2267 | +0.1000 | -0.0133 | 12322.1 | 93.3% |
| routellm-bert | 0.2333 | +0.1067 | -0.0067 | 12553.6 | 96.7% |
| routellm-mf | 0.1267 | +0.0000 | -0.1133 | 3411.5 | 0.0% |
Qwen3-30B
(Figures: accuracy comparison and token count comparison)
Summary Table for AIME25 – Qwen3-30B
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1933 | - | - | 1178.2 | 0.0% |
| Think (Baseline) | 0.6400 | - | - | 11773.5 | 100.0% |
| reasoning-router-0.6b | 0.5667 | +0.3733 | -0.0733 | 11112.8 | 90.0% |
| reasoning-router-mmbert-small | 0.6067 | +0.4133 | -0.0333 | 11477.9 | 93.3% |
| routellm-bert | 0.6133 | +0.4200 | -0.0267 | 11593.9 | 96.7% |
| routellm-mf | 0.1933 | +0.0000 | -0.4467 | 1178.2 | 0.0% |
HMMT 2025
Qwen3-8B
(Figures: accuracy comparison and token count comparison)
Summary Table for HMMT Feb 2025 – Qwen3-8B
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0467 | - | - | 5210.1 | 0.0% |
| Think (Baseline) | 0.0800 | - | - | 13628.4 | 100.0% |
| reasoning-router-0.6b | 0.0533 | +0.0067 | -0.0267 | 10425.3 | 60.0% |
| reasoning-router-mmbert-small | 0.0867 | +0.0400 | +0.0067 | 12360.6 | 83.3% |
| routellm-bert | 0.0800 | +0.0333 | +0.0000 | 13262.2 | 96.7% |
| routellm-mf | 0.0467 | +0.0000 | -0.0333 | 5210.1 | 0.0% |
Qwen3-30B
(Figures: accuracy comparison and token count comparison)
Summary Table for HMMT Feb 2025 – Qwen3-30B
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0867 | - | - | 1038.2 | 0.0% |
| Think (Baseline) | 0.4600 | - | - | 11718.4 | 100.0% |
| reasoning-router-0.6b | 0.3133 | +0.2267 | -0.1467 | 7597.4 | 60.0% |
| reasoning-router-mmbert-small | 0.3933 | +0.3067 | -0.0667 | 9801.3 | 83.3% |
| routellm-bert | 0.4267 | +0.3400 | -0.0333 | 11281.4 | 96.7% |
| routellm-mf | 0.0867 | +0.0000 | -0.3733 | 1038.2 | 0.0% |
On the math-competition benchmarks, the reasoning routers consistently improve over the non-thinking versions of the Qwen3 models while staying close to full thinking mode. They also generalize reasonably well to Qwen3-30B, although larger models were not evaluated in these experiments.
Limitations
There are several limitations in this work that I plan to address in future iterations:
1. Model Architecture Constraints – The router was trained only on transformer-based architectures (encoder-only and decoder-only), without exploring alternatives such as mixture-of-experts or lightweight attention variants.
2. Reward Model Limitations – For open-ended tasks, scoring relied on Skywork-Reward-V2-Llama-3.1-8B [7]. Reward models can be imperfect proxies for human judgment, which may introduce systematic labeling biases.
3. Data Diversity – The dataset did not include:
   - multilingual queries
   - coding tasks
   - multi-turn conversations
   - multimodal inputs (e.g., images)
   These exclusions were mainly due to compute and time constraints.
4. Model Size Constraints – Most evaluations were performed on Qwen3-8B, with only limited testing on Qwen3-30B. Larger and more recent hybrid models remain unexplored.
5. Modern Hybrid Models – Newer models, such as OpenAI's o3/o4-mini and GPT-OSS reasoning-effort variants, support multiple tiers of reasoning effort. Extending the router beyond a binary "think / no-think" decision to multi-level reasoning selection is a promising direction for future work.
Conclusions
In this blog post, I walked through the methodology and experiments used to train a reasoning-effort router from primarily synthetic data and tasks with verifiable answers. The results show that, even without the explicit human preference data required by works like RouteLLM [1], it is possible to build a router that reliably predicts when thinking mode is beneficial. By the time I finished training, OpenAI had released GPT-5 [5], which ships with a built-in router for automatically selecting reasoning mode, further validating the motivation behind this project. Overall, this work demonstrates a promising and lightweight approach to test-time compute allocation for hybrid LLMs.
References
[1] RouteLLM – Learning to Route LLMs with Preference Data
https://arxiv.org/abs/2406.18665
[2] OpenAI o1 – Learning to Reason with LLMs
https://openai.com/index/learning-to-reason-with-llms/
[3] DeepSeek-R1 – Reinforcement Learning for Reasoning
https://arxiv.org/abs/2501.12948
[4] Qwen3 Technical Report – Hybrid reason/no-reason models
https://arxiv.org/abs/2505.09388
[5] OpenAI – Introducing GPT-5
https://openai.com/index/introducing-gpt-5/
[6] DeepSeek V3.1 – Hybrid Reasoning Announcement
https://api-docs.deepseek.com/news/news250821
[7] Skywork Reward Model – Skywork-Reward-V2-Llama-3.1-8B
https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B
Appendix
Hugging Face Collections
The full collection of datasets, models, and router artifacts used in this project is available here:
Reasoning Router Collection
https://huggingface.co/collections/AmirMohseni/reasoning-router
Datasets
These are the main datasets used to generate think / no-think pairs and evaluate the router:
- AIME competition math – AIME-1983-2024-Qwen3-8B
- Nectar open-ended chats – Nectar-Qwen3-8B
- Large-scale math RL-style data – Big-Math-RL-Qwen3-8B
- Filtered real-world WildChat data – WildChat-filtered-Qwen3-8B-Scored
Code Repository
The repository containing all training scripts, evaluation pipelines, and data-processing utilities can be found here:
LLM-Router (GitHub)
https://github.com/Amir-Mohseni/LLM-Router
Qwen3-8B Sampling Parameters
These were the exact sampling presets used when generating paired think / no-think outputs for Qwen3-8B, following the recommended settings from the official Qwen3 documentation.
```python
# Sampling presets for Qwen3-8B (recommended settings from the official Qwen3 docs)

# Thinking mode: lower temperature, wider nucleus
THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": True},
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
}

# Non-thinking mode: slightly higher temperature, tighter nucleus
NON_THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": False},
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}
```
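For reference, here is one way these presets could be passed to an OpenAI-compatible endpoint, such as a local vLLM server hosting Qwen3-8B. The base URL, model name, and use of `extra_body` are assumptions about the serving setup, not necessarily how the data was actually generated.

```python
# Sketch: applying the presets through an OpenAI-compatible endpoint (e.g., local vLLM).
# Assumes THINKING_PARAMS / NON_THINKING_PARAMS from the block above are in scope.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def chat(query: str, params: dict) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": query}],
        temperature=params["temperature"],
        top_p=params["top_p"],
        # top_k, min_p, and the thinking flag are not standard OpenAI fields,
        # so they go through extra_body (supported by vLLM).
        extra_body={
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": params["chat_template_kwargs"],
        },
    )
    return resp.choices[0].message.content

answer = chat("List three prime numbers.", NON_THINKING_PARAMS)
```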
Citation
If you’d like to cite this work, you can use the following BibTeX entry:
```bibtex
@misc{mohseni2025reasoningrouter,
  author       = {Mohseni, Amir},
  title        = {To Think or Not to Think: A Router for Hybrid LLMs},
  howpublished = {Hugging Face Blog Post},
  month        = {November},
  year         = {2025},
  url          = {https://huggingface.co/blog/AmirMohseni/reasoning-router}
}
```

