To Think or Not to Think: A Router for Hybrid LLMs
This is the project I worked on during the summer, and I figured it might be useful to turn it into a blog post explaining what I did and what the results were.
OpenAI released o1 in 2024, which marked a big shift in how LLMs work by popularizing test-time compute: instead of spending a roughly fixed number of tokens on every question, a model can “think” longer on harder ones, producing better and more accurate answers [2].
In early 2025, we got DeepSeek-R1, one of the first large-scale open reasoning models [3]. Then came the Qwen3 family, which included some of the first hybrid models: you can flip a flag to turn thinking mode on or off [4]. Around the same time, OpenAI was releasing o3, GPT-4.5, o4-mini, o4-mini-high, and many other variants, which made it increasingly hard to decide which model to use for a given task.
A lot of the time, using o3 was great — but for some tasks I just wanted the answer as fast as possible. Using a slow, reasoning-heavy model for a simple task quickly became annoying.
That’s when I started working on this project: a router that predicts whether a given task requires thinking or no thinking. Intuitively, this makes a lot of sense: instead of manually deciding whether to turn thinking mode on or off, you could have an Auto option and let a classifier decide. This is especially useful for hybrid models (like Qwen3’s think/no-think flag), and builds on previous work on LLM routing such as RouteLLM [1].
Problem Setup
To build this router, I first needed a way to get paired data: for each user query, I wanted
- a response generated with thinking (reasoning mode on), and
- a response generated without thinking (reasoning mode off),
plus a label indicating which mode is better (or whether thinking is unnecessary).
Concretely, given a query $x$, I want to train a classifier $f$ that predicts a label

$$f(x) \in \{\texttt{think}, \texttt{no\_think}\}$$

such that:
- `think` is used when a reasoning-style response massively improves quality, and
- `no_think` is used when the extra reasoning does not justify the extra tokens.
To scale this up, I wanted the data to be mostly synthetic. I focused on Qwen3 models (especially Qwen3-8B), since it’s one of the models I regularly use for local tasks.
The key idea was:
For a given query $x$:
1. Use the same base model (the same underlying policy $\pi_\theta$) with:
   - thinking mode on → sample $y_{\text{think}} \sim \pi_\theta(\cdot \mid x, \text{think})$
   - thinking mode off → sample $y_{\text{no\_think}} \sim \pi_\theta(\cdot \mid x, \text{no\_think})$
2. Score these two outputs against each other using a scorer or reward model, and
3. Convert that into a supervision signal for the router.

This setup is much cleaner when both outputs come from the same underlying model $\pi_\theta$, just with different test-time compute.
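To make the pairing concrete, here is a minimal sketch of how one such pair can be generated with Hugging Face `transformers` and Qwen3's `enable_thinking` chat-template flag. It is illustrative only; the generation settings actually used are listed in the appendix, and the real pipeline may batch generations differently.

```python
# Sketch: generate a think / no-think pair for one query with Qwen3-8B (illustrative).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def generate(query: str, enable_thinking: bool, **sampling) -> str:
    messages = [{"role": "user", "content": query}]
    # The Qwen3 chat template accepts `enable_thinking` to toggle reasoning mode.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=enable_thinking
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8192, **sampling)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

query = "What is the sum of the first 100 positive integers?"
y_think = generate(query, True, do_sample=True, temperature=0.6, top_p=0.95, top_k=20)
y_no_think = generate(query, False, do_sample=True, temperature=0.7, top_p=0.8, top_k=20)
```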
Data Collection
I needed datasets that capture two main aspects:
- Real-world queries: diverse user questions
- Clearly reasoning-heavy tasks: e.g., math or coding problems
I ended up splitting everything into open-ended vs closed-ended datasets.
Open-Ended Datasets
These included:
- WildChat (filtered real-world user chats)
- Nectar (open-ended chats)

They contain real user chats, which makes them realistic but tricky to score.
Due to budget constraints, I filtered aggressively (a rough sketch of this filtering is shown below), keeping:
- only English queries
- only single-turn interactions
- only samples scored highly by `HuggingFaceFW/fineweb-edu-classifier`
- only samples where the original model was GPT-4 (a heuristic to keep modern, harder queries)
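As a rough illustration, the filtering above could look like this with the `datasets` library. The WildChat column names (`language`, `turn`, `model`, `conversation`) and the score threshold are assumptions for the sketch, not the exact values used.

```python
# Sketch of the open-ended filtering step (column names and threshold are assumptions).
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ds = load_dataset("allenai/WildChat-1M", split="train")

# Keep English, single-turn, GPT-4 conversations.
ds = ds.filter(
    lambda ex: ex["language"] == "English"
    and ex["turn"] == 1
    and "gpt-4" in ex["model"]
)

clf_id = "HuggingFaceFW/fineweb-edu-classifier"
clf_tok = AutoTokenizer.from_pretrained(clf_id)
clf = AutoModelForSequenceClassification.from_pretrained(clf_id)

@torch.no_grad()
def edu_score(text: str) -> float:
    # The classifier outputs a single regression logit (roughly a 0-5 "educational" score).
    inputs = clf_tok(text, return_tensors="pt", truncation=True)
    return clf(**inputs).logits.squeeze().item()

# Keep queries the classifier rates highly (threshold is illustrative).
ds = ds.filter(lambda ex: edu_score(ex["conversation"][0]["content"]) >= 2.5)
```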
For each query, I generated:
- a think-mode output
- a no-think-mode output
These were later scored by a reward model.
Closed-Ended Datasets
I wanted tasks where:
- correctness is easy to verify
- non-thinking models perform noticeably worse
I used:
- AIME-1983-2024-Qwen3-8B (competition math)
- Big-Math-RL-Qwen3-8B
For these datasets, each question has a known ground-truth answer $y^{*}$.
Extracting final answers allowed me to compute accuracy flags:
- $a_{\text{think}} = 1$ if the think-mode answer matches $y^{*}$, else $0$
- $a_{\text{no\_think}} = 1$ if the no-think answer matches $y^{*}$, else $0$
These binary values are used during labeling.
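Here is a small sketch of what the answer extraction and accuracy flags can look like. The real pipeline may use a more robust math parser, so treat the regex and helper names as illustrative.

```python
# Sketch: extract the final answer from a completion and compute an accuracy flag.
import re

def extract_boxed(text: str) -> str | None:
    # Take the last \boxed{...} occurrence as the final answer (no nested braces handled).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def accuracy_flag(completion: str, ground_truth: str) -> int:
    answer = extract_boxed(completion)
    return int(answer is not None and answer == str(ground_truth).strip())

a_think = accuracy_flag("... so the answer is \\boxed{5050}.", "5050")  # -> 1
a_no_think = accuracy_flag("The answer is 5049.", "5050")               # -> 0
```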
Data Labeling
Closed-Ended Data
Labeling is straightforward:
- If think-mode is correct and no-think-mode is wrong → label `think`
- If no-think is correct and think-mode is wrong → label `no_think`

Edge cases:
- If both are correct → label `no_think` (no need for extra tokens)
- If both are wrong → label `think` (future, larger models may solve it with more reasoning)
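Translated directly into a helper function (the names are mine), the rule looks like this:

```python
def label_closed_ended(a_think: int, a_no_think: int) -> str:
    """Label a closed-ended sample from the two accuracy flags."""
    if a_think == 1 and a_no_think == 0:
        return "think"       # reasoning was needed to get it right
    if a_no_think == 1 and a_think == 0:
        return "no_think"    # reasoning added nothing
    if a_think == 1 and a_no_think == 1:
        return "no_think"    # both correct: skip the extra tokens
    return "think"           # both wrong: keep it as a hard, reasoning-style query
```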
Open-Ended Data
Open-ended samples were scored by the Skywork-Reward-V2-Llama-3.1-8B reward model [7].

Let

$$r_{\text{think}} = R(x, y_{\text{think}}), \qquad r_{\text{no\_think}} = R(x, y_{\text{no\_think}})$$

where $R$ is the reward model; I remove the `<think>...</think>` chain-of-thought from $y_{\text{think}}$ before scoring.

If the absolute difference between the rewards is small, i.e.

$$|r_{\text{think}} - r_{\text{no\_think}}| < \tau,$$

I label the sample as `no_think`. Otherwise, I choose the mode with the higher reward:

$$y = \begin{cases} \texttt{think} & \text{if } r_{\text{think}} > r_{\text{no\_think}} \\ \texttt{no\_think} & \text{otherwise.} \end{cases}$$
This encourages the model to choose thinking mode only when there is a clear benefit.
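A sketch of this scoring-and-thresholding step is below. It assumes the Skywork reward model can be loaded as a standard sequence-classification head, and the threshold value shown is illustrative rather than the one actually used.

```python
# Sketch of the open-ended labeling step (threshold tau is illustrative).
import re
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
rm_tok = AutoTokenizer.from_pretrained(rm_id)
rm = AutoModelForSequenceClassification.from_pretrained(
    rm_id, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
)

def strip_think(text: str) -> str:
    # Remove the <think>...</think> chain-of-thought before scoring.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

@torch.no_grad()
def reward(query: str, response: str) -> float:
    conv = [{"role": "user", "content": query}, {"role": "assistant", "content": response}]
    input_ids = rm_tok.apply_chat_template(conv, tokenize=True, return_tensors="pt").to(rm.device)
    return rm(input_ids).logits[0][0].item()

def label_open_ended(query: str, y_think: str, y_no_think: str, tau: float = 1.0) -> str:
    r_think = reward(query, strip_think(y_think))
    r_no_think = reward(query, y_no_think)
    if abs(r_think - r_no_think) < tau:
        return "no_think"    # no clear benefit from thinking
    return "think" if r_think > r_no_think else "no_think"
```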
Training
After labeling, I had about 70k samples:
Dataset:
https://huggingface.co/datasets/AmirMohseni/reasoning-router-data-v2
Each sample includes:
- a query
- a label (`think` / `no_think`)
This is a pure classification task: the router sees only the query $x$ and predicts $f(x) \in \{\texttt{think}, \texttt{no\_think}\}$.
I tested several architectures:
- BERT variants (encoder-only)
- Qwen3-0.6B (decoder-only)
I ultimately selected the following two (a minimal fine-tuning sketch follows the list):
- Qwen3-0.6B
- mmBERT-small
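The sketch below shows how the Qwen3-0.6B variant could be fine-tuned as a sequence classifier. The dataset column names (`query`, `label`), split handling, and hyperparameters are assumptions; check the dataset card and the GitHub repository for the exact training setup.

```python
# Minimal fine-tuning sketch for the router (schema and hyperparameters are assumptions).
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

base_model = "Qwen/Qwen3-0.6B"
label2id = {"no_think": 0, "think": 1}
id2label = {v: k for k, v in label2id.items()}

tok = AutoTokenizer.from_pretrained(base_model)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

def preprocess(ex):
    enc = tok(ex["query"], truncation=True, max_length=512)
    enc["labels"] = label2id[ex["label"]]
    return enc

raw = load_dataset("AmirMohseni/reasoning-router-data-v2", split="train")
data = raw.map(preprocess, remove_columns=raw.column_names).train_test_split(test_size=0.1, seed=42)

model = AutoModelForSequenceClassification.from_pretrained(
    base_model, num_labels=2, id2label=id2label, label2id=label2id
)
model.config.pad_token_id = tok.pad_token_id  # decoder-only models need an explicit pad token

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="reasoning-router-0.6b",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        num_train_epochs=2,
        eval_strategy="epoch",
    ),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    data_collator=DataCollatorWithPadding(tok),
    compute_metrics=compute_metrics,
)
trainer.train()
```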
You can also try the router model yourself on the Hugging Face Space:
https://huggingface.co/spaces/AmirMohseni/Reasoning-Router
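You can also load it locally. The snippet below assumes the router is published as a standard text-classification checkpoint; the model id is a guess based on the names used in the results tables, so check the collection in the appendix for the released checkpoints.

```python
# Using the trained router at inference time (model id is an assumption).
from transformers import pipeline

router = pipeline("text-classification", model="AmirMohseni/reasoning-router-0.6b")

query = "Prove that the sum of two even integers is even."
decision = router(query)[0]["label"]      # expected: "think" or "no_think"
enable_thinking = decision == "think"     # feed this into the hybrid model's thinking flag
print(decision, enable_thinking)
```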
Training Logs
The key training curves for the router model (train/eval loss, F1 score, and accuracy) are available on the W&B dashboard:
https://api.wandb.ai/links/amirmohseni-maastricht-university/zmldcnzd
Results
After training, I evaluated the router on the open-ended test sets and on some of the newest math benchmarks from 2025, to measure how well it decides when to use thinking mode and when to skip it.
WildChat (Test)
Summary Table for WildChat-filtered-Qwen3-8B-Splits
(based on WildChat-filtered-Qwen3-8B-Scored)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 22.6868 | - | - | 0.0% |
| Think (Baseline) | 24.7362 | - | - | 100.0% |
| reasoning-router-0.6b | 24.0576 | +1.3708 | -0.6787 | 42.9% |
| reasoning-router-mmbert-small | 24.1236 | +1.4368 | -0.6126 | 46.3% |
| routellm-bert | 23.3694 | +0.6826 | -1.3668 | 33.5% |
| routellm-mf | 22.6865 | -0.0004 | -2.0498 | 0.2% |
Nectar (Test)
Summary Table for Nectar-Qwen3-8B-Splits
(based on Nectar-Qwen3-8B)
| Strategy | Avg Reward | vs No-Think | vs Think | Think % |
|---|---|---|---|---|
| No Think (Baseline) | 8.6090 | - | - | 0.0% |
| Think (Baseline) | 9.9649 | - | - | 100.0% |
| reasoning-router-0.6b | 9.6866 | +1.0776 | -0.2784 | 28.5% |
| reasoning-router-mmbert-small | 8.9183 | +0.3093 | -1.0466 | 9.9% |
| routellm-bert | 9.9603 | +1.3513 | -0.0046 | 99.7% |
| routellm-mf | 8.5713 | -0.0377 | -1.3936 | 1.9% |
Overall, the routers score well above the no-think baseline while triggering thinking mode on fewer than half of the queries, and they offer a better reward-per-token trade-off than the RouteLLM baselines: routellm-mf almost never routes to thinking, and routellm-bert only approaches the think baseline on Nectar by routing 99.7% of queries to thinking mode.
AIME 2025
Qwen3-8B
(Figures: accuracy comparison and token count comparison)
Summary Table for AIME25 – Qwen3-8B
(based on AIME-1983-2024-Qwen3-8B)
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1267 | - | - | 3411.5 | 0.0% |
| Think (Baseline) | 0.2400 | - | - | 12835.2 | 100.0% |
| reasoning-router-0.6b | 0.1933 | +0.0667 | -0.0467 | 11815.0 | 90.0% |
| reasoning-router-mmbert-small | 0.2267 | +0.1000 | -0.0133 | 12322.1 | 93.3% |
| routellm-bert | 0.2333 | +0.1067 | -0.0067 | 12553.6 | 96.7% |
| routellm-mf | 0.1267 | +0.0000 | -0.1133 | 3411.5 | 0.0% |
Qwen3-30B
(Figures: accuracy comparison and token count comparison)
Summary Table for AIME25 – Qwen3-30B
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.1933 | - | - | 1178.2 | 0.0% |
| Think (Baseline) | 0.6400 | - | - | 11773.5 | 100.0% |
| reasoning-router-0.6b | 0.5667 | +0.3733 | -0.0733 | 11112.8 | 90.0% |
| reasoning-router-mmbert-small | 0.6067 | +0.4133 | -0.0333 | 11477.9 | 93.3% |
| routellm-bert | 0.6133 | +0.4200 | -0.0267 | 11593.9 | 96.7% |
| routellm-mf | 0.1933 | +0.0000 | -0.4467 | 1178.2 | 0.0% |
HMMT 2025
Qwen3-8B
(Figures: accuracy comparison and token count comparison)
Summary Table for HMMT Feb 2025 – Qwen3-8B
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0467 | - | - | 5210.1 | 0.0% |
| Think (Baseline) | 0.0800 | - | - | 13628.4 | 100.0% |
| reasoning-router-0.6b | 0.0533 | +0.0067 | -0.0267 | 10425.3 | 60.0% |
| reasoning-router-mmbert-small | 0.0867 | +0.0400 | +0.0067 | 12360.6 | 83.3% |
| routellm-bert | 0.0800 | +0.0333 | +0.0000 | 13262.2 | 96.7% |
| routellm-mf | 0.0467 | +0.0000 | -0.0333 | 5210.1 | 0.0% |
Qwen3-30B
(Figures: accuracy comparison and token count comparison)
Summary Table for HMMT Feb 2025 – Qwen3-30B
| Strategy | Avg Accuracy | vs No-Think | vs Think | Avg Length (tokens) | Think % |
|---|---|---|---|---|---|
| No Think (Baseline) | 0.0867 | - | - | 1038.2 | 0.0% |
| Think (Baseline) | 0.4600 | - | - | 11718.4 | 100.0% |
| reasoning-router-0.6b | 0.3133 | +0.2267 | -0.1467 | 7597.4 | 60.0% |
| reasoning-router-mmbert-small | 0.3933 | +0.3067 | -0.0667 | 9801.3 | 83.3% |
| routellm-bert | 0.4267 | +0.3400 | -0.0333 | 11281.4 | 96.7% |
| routellm-mf | 0.0867 | +0.0000 | -0.3733 | 1038.2 | 0.0% |
On the math-competition benchmarks, the reasoning routers consistently improve over the non-thinking versions of the Qwen3 models while staying close to full thinking mode. They also generalize reasonably well to Qwen3-30B, although larger models were not evaluated in these experiments.
Limitations
There are several limitations in this work that I plan to address in future iterations:
1. Model Architecture Constraints – The router was trained only on transformer-based architectures (encoder-only and decoder-only), without exploring alternatives such as mixture-of-experts or lightweight attention variants.
2. Reward Model Limitations – For open-ended tasks, scoring relied on Skywork-Reward-V2-Llama-3.1-8B [7]. Reward models can be imperfect proxies for human judgment, which may introduce systematic labeling biases.
3. Data Diversity – The dataset did not include:
   - multilingual queries
   - coding tasks
   - multi-turn conversations
   - multimodal inputs (e.g., images)
   These exclusions were mainly due to compute and time constraints.
4. Model Size Constraints – Most evaluations were performed on Qwen3-8B, with only limited testing on Qwen3-30B. Larger and more recent hybrid models remain unexplored.
5. Modern Hybrid Models – Newer models, such as OpenAI's o3/o4-mini and GPT-OSS reasoning-effort variants, support multiple tiers of reasoning effort. Extending the router beyond a binary "think / no-think" decision to multi-level reasoning selection is a promising direction for future work.
Conclusions
In this blog post, I walked through the methodology and experiments used to train a reasoning-effort router from primarily synthetic data and tasks with verifiable answers. The results show that, even without the explicit human preference data required by works like RouteLLM [1], it is possible to build a router that reliably predicts when thinking mode is beneficial. By the time I finished training, OpenAI had released GPT-5 [5], which ships with a built-in router for automatically selecting reasoning mode, further validating the motivation behind this project. Overall, this work demonstrates a promising and lightweight approach to test-time compute allocation for hybrid LLMs.
References
[1] RouteLLM – Learning to Route LLMs with Preference Data
https://arxiv.org/abs/2406.18665
[2] OpenAI o1 – Learning to Reason with LLMs
https://openai.com/index/learning-to-reason-with-llms/
[3] DeepSeek-R1 – Reinforcement Learning for Reasoning
https://arxiv.org/abs/2501.12948
[4] Qwen3 Technical Report – Hybrid reason/no-reason models
https://arxiv.org/abs/2505.09388
[5] OpenAI – Introducing GPT-5
https://openai.com/index/introducing-gpt-5/
[6] DeepSeek V3.1 – Hybrid Reasoning Announcement
https://api-docs.deepseek.com/news/news250821
[7] Skywork Reward Model – Skywork-Reward-V2-Llama-3.1-8B
https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B
Appendix
Hugging Face Collections
The full collection of datasets, models, and router artifacts used in this project is available here:
Reasoning Router Collection
https://huggingface.co/collections/AmirMohseni/reasoning-router
Datasets
These are the main datasets used to generate think / no-think pairs and evaluate the router:
- AIME competition math – AIME-1983-2024-Qwen3-8B
- Nectar open-ended chats – Nectar-Qwen3-8B
- Large-scale math RL-style data – Big-Math-RL-Qwen3-8B
- Filtered real-world WildChat data – WildChat-filtered-Qwen3-8B-Scored
Code Repository
The repository containing all training scripts, evaluation pipelines, and data-processing utilities can be found here:
LLM-Router (GitHub)
https://github.com/Amir-Mohseni/LLM-Router
Qwen3-8B Sampling Parameters
These were the exact sampling presets used when generating paired think / no-think outputs for Qwen3-8B, following the recommended settings from the official Qwen3 documentation.
```python
# Sampling presets for Qwen3-8B (recommended settings from the official Qwen3 docs)

# Thinking mode: lower temperature, wider nucleus
THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": True},
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0,
}

# Non-thinking mode: slightly higher temperature, tighter nucleus
NON_THINKING_PARAMS = {
    "chat_template_kwargs": {"enable_thinking": False},
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0,
}
```
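For reference, here is one way these presets could be passed to an OpenAI-compatible endpoint, such as a local vLLM server hosting Qwen3-8B. The base URL, model name, and use of `extra_body` are assumptions about the serving setup, not necessarily how the data was actually generated.

```python
# Sketch: applying the presets through an OpenAI-compatible endpoint (e.g., local vLLM).
# Assumes THINKING_PARAMS / NON_THINKING_PARAMS from the block above are in scope.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def chat(query: str, params: dict) -> str:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": query}],
        temperature=params["temperature"],
        top_p=params["top_p"],
        # top_k, min_p, and the thinking flag are not standard OpenAI fields,
        # so they go through extra_body (supported by vLLM).
        extra_body={
            "top_k": params["top_k"],
            "min_p": params["min_p"],
            "chat_template_kwargs": params["chat_template_kwargs"],
        },
    )
    return resp.choices[0].message.content

answer = chat("List three prime numbers.", NON_THINKING_PARAMS)
```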
Citation
If you’d like to cite this work, you can use the following BibTeX entry:
```bibtex
@misc{mohseni2025reasoningrouter,
  author       = {Mohseni, Amir},
  title        = {To Think or Not to Think: A Router for Hybrid LLMs},
  howpublished = {Hugging Face Blog Post},
  month        = {November},
  year         = {2025},
  url          = {https://huggingface.co/blog/AmirMohseni/reasoning-router}
}
```

