File size: 3,630 Bytes

67ff38a
a6eab6c
 
a325535
9b1b5d2
 
6e5607e
53ee55a
 
 
67ff38a
5514a74
67ff38a
5514a74
67ff38a
 
 
fb9163c
 
38ae50e
fb9163c
 
67ff38a
011743e
67ff38a
f335f91
67ff38a
 
 
 
 
b71f3e0
67ff38a
f7d6bfa
 
 
 
 
 
 
 
 
d29e0f4
 
ab33ae2
d29e0f4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f7d6bfa
 
ccbda51
d29e0f4
 
ab33ae2
d29e0f4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7b48509
67ff38a
 
 
 
 
434f731
fb9163c

---
datasets:
- PowerInfer/QWQ-LONGCOT-500K
- PowerInfer/LONGCOT-Refine-500K
base_model:
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: text-generation
language:
- en
library_name: transformers
---
# SmallThinker-3B-preview

We introduce **SmallThinker-3B-preview**, a new model fine-tuned from the [Qwen2.5-3b-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) model. 

## Benchmark Performance

| Model | AIME24 | AMC23 | GAOKAO2024_I | GAOKAO2024_II | MMLU_STEM | AMPS_Hard | math_comp |
|---------|--------|-------|--------------|---------------|-----------|-----------|-----------|
| Qwen2.5-3B-Instruct | 6.67 | 45 | 50 | 35.8 | 59.8 | - | - |
| SmallThinker | 16.667 | 57.5 | 64.2 | 57.1 | 68.2 | 70 | 46.8 |
| GPT-4o | 9.3 | - | - | - | 64.2 | 57 | 50 |

Limitation: Due to SmallThinker's current limitations in instruction following, for math_comp we adopt a more lenient evaluation method where only correct answers are required, without constraining responses to follow the specified AAAAA format.

Colab Link: [Colab](https://colab.research.google.com/drive/182q600at0sVw7uX0SXFp6bQI7pyjWXQ2?usp=sharing)
## Intended Use Cases

SmallThinker is designed for the following use cases:

1.  **Edge Deployment:** Its small size makes it ideal for deployment on resource-constrained devices.
2.  **Draft Model for QwQ-32B-Preview:** SmallThinker can serve as a fast and efficient draft model for the larger QwQ-32B-Preview model. From my test, in llama.cpp we can get 70% speedup (from 40 tokens/s to 70 tokens/s).

## Training Details

The model was trained using 8 H100 GPUs with a global batch size of 16. The specific configuration is as follows:

The SFT (Supervised Fine-Tuning) process was conducted in two phases:

1. First Phase:
   - Used only the PowerInfer/QWQ-LONGCOT-500K dataset
   - Trained for 1.5 epochs
```
### model
model_name_or_path: /home/syx/Qwen2.5-3B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: o1-v2
template: qwen
neat_packing: true
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2-01-qat/full/sft
logging_steps: 1
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
```
2. Second Phase:
   - Combined training with PowerInfer/QWQ-LONGCOT-500K and PowerInfer/LONGCOT-Refine datasets
   - Continued training for 2 additional epochs
```
### model
model_name_or_path: saves/qwen2-01-qat/full/sft/checkpoint-24000

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: o1-v2, o1-v3
template: qwen
neat_packing: true
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2-01-qat/full/sft
logging_steps: 1
save_steps: 1000
plot_loss: true
overwrite_output_dir: true
```

## Limitations & Disclaimer

Please be aware of the following limitations:

*   **Language Limitation:** The model has only been trained on English-language datasets, hence its capabilities in other languages are still lacking.
*   **Limited Knowledge:** Due to limited SFT data and the model's relatively small scale, its reasoning capabilities are constrained by its knowledge base.
*   **Unpredictable Outputs:** The model may produce unexpected outputs due to its size and probabilistic generation paradigm. Users should exercise caution and validate the model's responses.
*   **Repetition Issue:** The model tends to repeat itself when answering high-difficulty questions. Please increase the `repetition_penalty` to mitigate this issue.