## Table of Contents
- [Latest Updates](#latest-updates)
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [How to Use](#how-to-use)
- [Install AngelSlim](#install-angelslim)
- [Quick Start](#quick-start)
- [Deployment & Evaluation](#deployment)
- [Benchmark](#benchmark)
- [License](#license)
- [Citation](#citation)
- [Technical Discussion](#technical-discussion)
## Latest Updates
- [25/07/04] We now support quantization for Hunyuan, Qwen2.5, Qwen3, DeepSeek-R1-Distill-Qwen, and other models, covering INT8, FP8, and INT4 algorithms. We have also open-sourced the Eagle3 weights for Qwen3-8B.

Coming soon:
- [ ] Support W4A8 quantization for DeepSeek-R1.
- [ ] Support quantization for multimodal models such as Qwen-VL.
- [ ] Release a new speculative-sampling algorithm.
## Key Features
- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.
## Supported Models
### Quantization
Quantization currently supports the following LLMs: Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen models, and QwQ:
| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| --------------------------------------------------------------------------------------------------------------------------- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/tencent/Hunyuan-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
### Speculative Decoding
Eagle3 weights are now available for the Qwen3 and Hunyuan series models.
| Qwen3 Models | Hunyuan Models |
| ----------|----------|
| ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) | ✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
| ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) | ✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
| ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) | ✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
| ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) | |
| ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) | |
| ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) | |
## How to Use
### Install AngelSlim
We recommend using `pip` to install the latest stable version of `AngelSlim`:
```shell
pip install angelslim
```
Alternatively, you can clone the repository and install from source:
```shell
git clone https://github.com/Tencent/AngelSlim.git
cd AngelSlim && python setup.py install
```
For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).
### Quick Start
After installing `AngelSlim`, you can quickly start by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:
* One-click Start
```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```
This example loads the HuggingFace model, performs activation calibration using the `dataset` specified in the config file, and saves the quantized model weights.
* Code-based Start
To perform dynamic `FP8` quantization on `Qwen3-1.7B`:
```python
from angelslim.engine import Engine
slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B",)
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```
For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).
### Deployment and Testing
#### 1. API Service Deployment
After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:
**vLLM**
Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server; `vllm>=0.8.5.post1` is recommended. For MoE INT8-quantized models, `vllm>=0.9.0` is required.
```shell
bash deploy/run_vllm.sh $MODEL_PATH
```
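For reference, an OpenAI-compatible vLLM launch for the quantized model looks roughly like the sketch below; the host, port, and parallelism flags are assumptions, and the actual arguments used by `deploy/run_vllm.sh` may differ:
```shell
# Illustrative sketch only; check deploy/run_vllm.sh for the actual arguments.
vllm serve "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 8080 \
  --tensor-parallel-size 1
```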
**SGLang**
Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server; `sglang>=0.4.6.post1` is recommended.
```shell
bash deploy/run_sglang.sh $MODEL_PATH
```
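Similarly, an SGLang launch typically resembles the sketch below; the host and port are assumptions, and `deploy/run_sglang.sh` may use different options:
```shell
# Illustrative sketch only; check deploy/run_sglang.sh for the actual arguments.
python -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --host 0.0.0.0 \
  --port 8080
```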
#### 2. Service Invocation
Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction):
```shell
bash deploy/openai.sh $MODEL_PATH
```
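If you prefer to call the service directly, a raw OpenAI-style chat completion request looks roughly like this; the address `127.0.0.1:8080` and the served model name are assumptions and should match your deployment:
```shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"$MODEL_PATH"'",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 128
      }'
```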
#### 3. Performance Evaluation
Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); `lm-eval>=0.4.8` is recommended:
```shell
bash deploy/lm_eval.sh $MODEL_PATH
```
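Under the hood this amounts to a standard lm-eval run against the quantized checkpoint; here is a minimal sketch, where the task list and backend flags are assumptions (see `deploy/lm_eval.sh` for the actual setup):
```shell
# Illustrative sketch only; deploy/lm_eval.sh may use different tasks and backends.
lm_eval --model vllm \
  --model_args pretrained=$MODEL_PATH,tensor_parallel_size=1 \
  --tasks gsm8k \
  --batch_size auto
```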
For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
## Benchmark
### (1) Quantization
The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).
#### Hunyuan Series Models
Benchmark results for the `Hunyuan-A13B-Instruct` model with `FP8` and `INT4-GPTQ` quantization algorithms on datasets including `AIME 2024`, `GSM8K`, `BBH`, and `DROP`:
| Bench | Hunyuan-A13B-Instruct | Hunyuan-A13B-Instruct-FP8 | Hunyuan-A13B-Instruct-Int4-GPTQ |
|:---------:|:---------------------:|:-------------------------:|:-------------------------------:|
| AIME 2024 | 87.3 | 86.7 | 86.7 |
| GSM8K | 94.39 | 94.01 | 94.24 |
| BBH | 89.1 | 88.34 | 87.91 |
| DROP | 91.1 | 91.1 | 91.05 |
#### Qwen3 Series Models
Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|:---------------:|:------------:|:-----:|:-----:|:-----:|:---------:|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| Qwen3-0.6B | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| Qwen3-0.6B | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| Qwen3-0.6B | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| Qwen3-8B | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| Qwen3-8B | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| Qwen3-8B | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| Qwen3-8B | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| Qwen3-8B | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| Qwen3-14B | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| Qwen3-14B | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| Qwen3-14B | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| Qwen3-14B | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| Qwen3-14B | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| Qwen3-32B | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| Qwen3-32B | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| Qwen3-32B | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| Qwen3-32B | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| Qwen3-30B-A3B | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| Qwen3-30B-A3B | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| Qwen3-30B-A3B | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| Qwen3-235B-A22B | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| Qwen3-235B-A22B | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| Qwen3-235B-A22B | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| QwQ-32B | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| QwQ-32B | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| QwQ-32B | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| QwQ-32B | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |
#### Other Models
Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU` and `GSM8K`:
| Model | Quantization | CEVAL | MMLU | GSM8K |
|:----------------------------:|:------------:|:-----:|:-----:|:-----:|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| Qwen2.5-1.5B-Instruct | FP8-Static | 66.27 | 60.23 | - |
| Qwen2.5-1.5B-Instruct | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| Qwen2.5-7B-Instruct | FP8-Static | 81.13 | 74.03 | 79.30 |
| Qwen2.5-7B-Instruct | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| Qwen2.5-7B-Instruct | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| Qwen2.5-7B-Instruct | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| Qwen2.5-32B-Instruct | FP8-Static | 87.59 | 83.08 | 81.58 |
| Qwen2.5-32B-Instruct | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| Qwen2.5-32B-Instruct | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| Qwen2.5-32B-Instruct | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| DeepSeek-R1-Distill-Qwen-7B | FP8-Static | 53.57 | 54.17 | 76.19 |
| DeepSeek-R1-Distill-Qwen-7B | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| DeepSeek-R1-Distill-Qwen-7B | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| DeepSeek-R1-Distill-Qwen-7B | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| DeepSeek-R1-Distill-Qwen-14B | FP8-Static | 77.56 | 74.66 | 86.73 |
| DeepSeek-R1-Distill-Qwen-14B | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| DeepSeek-R1-Distill-Qwen-14B | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| DeepSeek-R1-Distill-Qwen-14B | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| DeepSeek-R1-Distill-Qwen-32B | FP8-Static | 83.43 | 80.90 | 87.57 |
| DeepSeek-R1-Distill-Qwen-32B | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| DeepSeek-R1-Distill-Qwen-32B | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| DeepSeek-R1-Distill-Qwen-32B | INT4-AWQ | 82.84 | 80.15 | 87.19 |
### (2) Speculative Decoding
#### Qwen3 Series Models
Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:
| Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| T=0 | Qwen3-1.7B | 2.05x | 2.81 | 2.07x | 2.93 | 2.11x | 2.98 | 1.93x | 2.69 | 2.04x | 2.85 |
| T=0 | Qwen3-4B | 2.21x | 3.01 | 2.36x | 3.24 | 2.42x | 3.13 | 2.32x | 2.75 | 2.33x | 3.03 |
| T=0 | Qwen3-8B | 2.65x | 3.87 | 2.64x | 3.82 | 2.86x | 4.10 | 2.58x | 3.55 | 2.68x | 3.83 |
| T=0 | Qwen3-14B | 2.42x | 3.38 | 2.57x | 3.58 | 2.75x | 3.77 | 2.27x | 3.11 | 2.50x | 3.46 |
| T=0 | Qwen3-32B | 2.39x | 2.78 | 2.37x | 2.81 | 2.47x | 2.92 | 2.42x | 2.53 | 2.41x | 2.76 |
| T=0 | Qwen3-30B-A3B | 2.84x | 3.63 | 2.27x | 3.09 | 2.64x | 3.42 | 2.83x | 3.56 | 2.64x | 3.42 |
| T=1 | Qwen3-1.7B | 1.74x | 2.53 | 1.86x | 2.70 | 1.82x | 2.69 | 1.72x | 2.46 | 1.93x | 2.60 |
| T=1 | Qwen3-4B | 1.93x | 2.60 | 2.00x | 2.84 | 2.11x | 2.82 | 2.34x | 2.50 | 1.75x | 2.69 |
| T=1 | Qwen3-8B | 1.91x | 2.84 | 2.07x | 3.05 | 2.34x | 3.26 | 2.09x | 2.92 | 2.10x | 3.02 |
| T=1 | Qwen3-14B | 1.81x | 2.58 | 1.96x | 2.81 | 2.16x | 3.09 | 1.76x | 2.49 | 1.92x | 2.74 |
| T=1 | Qwen3-32B | 1.62x | 1.91 | 1.71x | 2.05 | 1.78x | 2.10 | 1.80x | 1.95 | 1.62x | 2.00 |
| T=1 | Qwen3-30B-A3B | 1.91x | 2.46 | 2.00x | 2.64 | 1.90x | 2.53 | 1.80x | 2.32 | 1.90x | 2.48 |
#### Hunyuan Series Models
Benchmark results for Hunyuan series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`:
| Temperature | Model | MT-bench Speedup | MT-bench τ | HumanEval Speedup | HumanEval τ | GSM8K Speedup | GSM8K τ | Alpaca Speedup | Alpaca τ | Mean Speedup | Mean τ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| T=0 | Hunyuan-1.8B-Instruct | 1.97x | 2.90 | 2.58x | 3.73 | 2.61x | 3.71 | 1.71x | 2.43 | 2.22x | 3.19 |
| T=0 | Hunyuan-4B-Instruct | 1.77x | 2.60 | 2.64x | 3.35 | 2.14x | 3.17 | 1.72x | 2.57 | 2.07x | 2.92 |
| T=0 | Hunyuan-7B-Instruct | 2.22x | 3.58 | 3.59x | 5.47 | 2.96x | 4.68 | 1.64x | 2.56 | 2.60x | 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x | 2.36 | 2.35x | 3.56 | 2.23x | 3.38 | 1.26x | 1.87 | 1.86x | 2.79 |
| T=1 | Hunyuan-4B-Instruct | 1.36x | 2.05 | 1.97x | 2.86 | 1.72x | 2.68 | 1.14x | 1.76 | 1.55x | 2.34 |
| T=1 | Hunyuan-7B-Instruct | 1.90x | 3.11 | 3.12x | 5.09 | 2.74x | 4.34 | 1.47x | 2.39 | 2.31x | 3.73 |
## License
The code for this project is open-sourced under the [License for AngelSlim](LICENSE).
## Citation
```
@software{AngelSlim2025,
  title={{AngelSlim}},
  author={Tencent AngelSlim Project Contributors},
  year={2025},
  month={6},
  url={https://github.com/Tencent/AngelSlim},
}
```
## Technical Discussion
* AngelSlim is under continuous development, and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub or join our [WeChat technical discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).