---
tags:
- qwen3
- eagle3
- eagle
---

# AngelSlim

Dedicated to building a more intuitive, comprehensive, and efficient LLM compression toolkit.

📖 Documentation | 🤗 Hugging Face | 🤖 ModelScope | 💬 WeChat

## Table of Contents

- [Latest Updates](#latest-updates)
- [Key Features](#key-features)
- [Supported Models](#supported-models)
- [How to Use](#how-to-use)
  - [Install AngelSlim](#install-angelslim)
  - [Quick Start](#quick-start)
  - [Deployment & Evaluation](#deployment)
- [Benchmark](#benchmark)
- [License](#license)
- [Citation](#citation)
- [Technical Discussion](#technical-discussion)

## 📣Latest Updates

- [25/07/04] We now support quantization for Hunyuan/Qwen2.5/Qwen3/DeepSeek-R1-Distill-Qwen and other models, including INT8/FP8/INT4 algorithms. We also open-source the Eagle3 model weights for Qwen3-8B.

Coming soon:

- [ ] Support W4A8 quantization for DeepSeek-R1.
- [ ] Support quantization for multimodal models like Qwen-VL.
- [ ] Release a new algorithm for speculative sampling.

## 🌟Key Features

- **Highly Integrated**: This toolkit integrates mainstream compression algorithms into a unified framework, offering developers one-click access with exceptional ease of use.
- **Continuous Innovation**: Beyond integrating widely-used industry algorithms, we are continuously researching better compression algorithms, which will be gradually open-sourced in the future.
- **Performance-Driven**: We continuously optimize end-to-end performance in model compression workflows and algorithm deployment, such as enabling quantization of models like Qwen3-235B and DeepSeek-R1 on a single GPU.

## 💼Supported Models

### Quantization

Quantization is currently supported for the following LLMs, including Hunyuan-Dense, Hunyuan-MoE, Qwen3-Dense, Qwen3-MoE, Qwen2.5, DeepSeek-R1-distilled Qwen models, and QwQ:

| Model | FP8-Dynamic | FP8-Static | INT8-Dynamic | INT4-GPTQ | INT4-AWQ |
| ----- | ----------- | ---------- | ------------ | --------- | -------- |
| [Hunyuan-Dense](https://huggingface.co/tencent/Hunyuan-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Hunyuan-MoE](https://huggingface.co/collections/tencent/hunyuan-a13b-685ec38e5b46321e3ea7c4be) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-Dense](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen3-MoE](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen2.5](https://huggingface.co/collections/AngelSlim/qwen2-25-quant-68652d6cbdf5c0d4b1c4499a) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [DeepSeek-R1-Distill-Qwen](https://huggingface.co/collections/AngelSlim/deepseek-r1-distill-quant-68652f16a9c206b030b05f7f) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [QwQ](https://huggingface.co/collections/AngelSlim/qwen3-quant-68652e26da31740739d154f8) | ✅ | ✅ | ✅ | ✅ | ✅ |

### Speculative Decoding

Eagle3 weights are now available for the Qwen3 and Hunyuan series models:
| Qwen3 Models | Hunyuan Models |
| ------------ | -------------- |
| ✅ [Qwen3-1.7B](https://huggingface.co/AngelSlim/Qwen3-1.7B_eagle3) | ✅ [Hunyuan-1.8B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-1.8B-Instruct_eagle3) |
| ✅ [Qwen3-4B](https://huggingface.co/AngelSlim/Qwen3-4B_eagle3) | ✅ [Hunyuan-4B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-4B-Instruct_eagle3) |
| ✅ [Qwen3-8B](https://huggingface.co/AngelSlim/Qwen3-8B_eagle3) | ✅ [Hunyuan-7B-Instruct](https://huggingface.co/AngelSlim/Hunyuan-7B-Instruct_eagle3) |
| ✅ [Qwen3-14B](https://huggingface.co/AngelSlim/Qwen3-14B_eagle3) | |
| ✅ [Qwen3-32B](https://huggingface.co/AngelSlim/Qwen3-32B_eagle3) | |
| ✅ [Qwen3-30B-A3B](https://huggingface.co/AngelSlim/Qwen3-a3B_eagle3) | |

## 🛎️How to Use

### Install AngelSlim

We recommend using `pip` to install the latest stable version of `AngelSlim`:

```shell
pip install angelslim
```

Alternatively, you can clone the repository and install from source:

```shell
cd AngelSlim && python setup.py install
```

For more detailed installation instructions, please refer to the [Installation Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/installation.html).

### Quick Start

After installing `AngelSlim`, you can get started quickly by running the following script to perform static `FP8` quantization on the `Qwen3-1.7B` model:

* One-click Start

```shell
python3 tools/run.py -c configs/qwen3/fp8_static/qwen3-1_7b_fp8_static.yaml
```

This example loads the HuggingFace model, performs activation calibration using the `dataset` specified in the config file, and saves the quantized model weights.

* Code-based Start

To perform dynamic `FP8` quantization on `Qwen3-1.7B`:

```python
from angelslim.engine import Engine

slim_engine = Engine()
# Prepare model
slim_engine.prepare_model(model_name="Qwen", model_path="Qwen/Qwen3-1.7B")
# Initialize compressor
slim_engine.prepare_compressor("PTQ", default_method="fp8_dynamic")
# Compress model
slim_engine.run()
# Save compressed model
slim_engine.save("./output")
```

For more details, please refer to the [Quick Start Documentation](https://angelslim.readthedocs.io/zh-cn/latest/getting_started/quickstrat.html).

### 🖥️ Deployment and Testing

#### 1. API Service Deployment

After specifying the quantized model path `MODEL_PATH`, you can deploy an OpenAI-compatible API service using the following LLM inference frameworks:

**vLLM**

Use the following script to launch a [vLLM](https://github.com/vllm-project/vllm) server, recommended version `vllm>=0.8.5.post1`. For MoE INT8 quantized models, `vllm>=0.9.0` is required.

```shell
bash deploy/run_vllm.sh $MODEL_PATH
```

**SGLang**

Use the following script to launch an [SGLang](https://github.com/sgl-project/sglang) server, recommended version `sglang>=0.4.6.post1`.

```shell
bash deploy/run_sglang.sh $MODEL_PATH
```

#### 2. Service Invocation

Invoke requests via [OpenAI's API format](https://platform.openai.com/docs/api-reference/introduction); a minimal Python client sketch is also provided at the end of this section:

```shell
bash deploy/openai.sh $MODEL_PATH
```

#### 3. Performance Evaluation

Evaluate the performance of the quantized model using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), recommended version `lm-eval>=0.4.8`:

```shell
bash deploy/lm_eval.sh $MODEL_PATH
```

For more details, please refer to the [Deployment Documentation](https://angelslim.readthedocs.io/zh-cn/latest/deployment/deploy.html).
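As a client-side complement to the `deploy/openai.sh` wrapper above, here is a minimal sketch using the official `openai` Python client. The base URL, port, and served model name are assumptions that depend on how your server was launched; adjust them to your deployment.

```python
# Minimal sketch: call an OpenAI-compatible endpoint served by vLLM or SGLang.
# Assumptions: server listening on localhost:8000, and a served model name equal
# to the MODEL_PATH used at launch (the name below is a hypothetical placeholder).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # adjust to your server's host/port
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen3-1.7B-FP8-Static",  # hypothetical; use your served model name
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```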
## 📈 Benchmark

### (1) Quantization

The performance test results for selected models are shown below. For the complete benchmark, refer to the [Benchmark documentation](https://angelslim.readthedocs.io/zh-cn/latest/performance/quantization/benchmarks.html).

#### Hunyuan Series Models

Benchmark results for the `Hunyuan-A13B-Instruct` model with `FP8` and `INT4-GPTQ` quantization algorithms on datasets including `AIME 2024`, `GSM8K`, `BBH`, and `DROP`:

| Bench | Hunyuan-A13B-Instruct | Hunyuan-A13B-Instruct-FP8 | Hunyuan-A13B-Instruct-Int4-GPTQ |
|:---------:|:---------------------:|:-------------------------:|:-------------------------------:|
| AIME 2024 | 87.3 | 86.7 | 86.7 |
| GSM8K | 94.39 | 94.01 | 94.24 |
| BBH | 89.1 | 88.34 | 87.91 |
| DROP | 91.1 | 91.1 | 91.05 |

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
| Model | Quantization | CEVAL | MMLU | GSM8K | HUMANEVAL |
|:------|:-------------|:-----:|:----:|:-----:|:---------:|
| Qwen3-0.6B | BF16 | 45.84 | 47.21 | 42.99 | 19.51 |
| | FP8-Static | 45.99 | 46.87 | 38.06 | 18.90 |
| | FP8-Dynamic | 45.99 | 46.93 | 38.29 | 20.73 |
| | INT8-Dynamic | 45.17 | 46.95 | 41.17 | 21.34 |
| Qwen3-8B | BF16 | 79.27 | 74.78 | 87.79 | 63.41 |
| | FP8-Static | 78.23 | 74.79 | 86.96 | 62.20 |
| | FP8-Dynamic | 78.45 | 74.75 | 87.64 | 62.80 |
| | INT8-Dynamic | 78.01 | 74.84 | 86.96 | 67.07 |
| | INT4-GPTQ | 77.19 | 73.26 | 86.43 | 62.20 |
| | INT4-AWQ | 76.15 | 73.59 | 86.96 | 63.41 |
| Qwen3-14B | BF16 | 83.06 | 78.90 | 88.40 | 55.49 |
| | FP8-Static | 82.62 | 78.57 | 89.46 | 57.32 |
| | FP8-Dynamic | 82.24 | 78.92 | 88.32 | 52.44 |
| | INT8-Dynamic | 81.87 | 78.13 | 86.28 | 56.10 |
| | INT4-GPTQ | 81.05 | 78.02 | 87.34 | 57.93 |
| | INT4-AWQ | 82.02 | 77.68 | 84.23 | 61.59 |
| Qwen3-32B | BF16 | 86.55 | 82.00 | 74.53 | 37.80 |
| | FP8-Static | 86.92 | 81.78 | 70.20 | 39.63 |
| | FP8-Dynamic | 86.55 | 81.89 | 70.43 | 38.41 |
| | INT4-GPTQ | 86.18 | 81.01 | - | 43.29 |
| | INT4-AWQ | 86.18 | 81.54 | - | 36.59 |
| Qwen3-30B-A3B | BF16 | 83.66 | 79.36 | 89.99 | 31.71 |
| | FP8-Static | 83.95 | 79.47 | 89.01 | 31.10 |
| | FP8-Dynamic | 84.10 | 79.40 | 89.16 | 32.93 |
| | INT8-Dynamic | 83.36 | 79.48 | 89.16 | 34.15 |
| Qwen3-235B-A22B | BF16 | 89.60 | 86.28 | 85.29 | 27.44 |
| | FP8-Static | 89.67 | 86.19 | 86.96 | 27.44 |
| | FP8-Dynamic | 89.67 | 86.18 | 85.22 | 28.05 |
| | INT8-Dynamic | 88.93 | 86.20 | 86.20 | 23.78 |
| QwQ-32B | BF16 | 85.74 | 82.03 | 73.31 | 42.68 |
| | FP8-Static | 85.44 | 81.91 | 75.36 | 42.68 |
| | FP8-Dynamic | 85.07 | 81.93 | 75.66 | 42.07 |
| | INT4-GPTQ | 84.03 | 81.26 | 68.23 | 45.73 |
| | INT4-AWQ | 83.58 | 81.01 | 68.69 | 43.29 |
#### Other Models

Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, and `GSM8K`:
| Model | Quantization | CEVAL | MMLU | GSM8K |
|:------|:-------------|:-----:|:----:|:-----:|
| Qwen2.5-1.5B-Instruct | BF16 | 67.01 | 60.05 | 54.28 |
| | FP8-Static | 66.27 | 60.23 | - |
| | FP8-Dynamic | 66.79 | 60.08 | 51.71 |
| Qwen2.5-7B-Instruct | BF16 | 81.20 | 74.55 | 79.98 |
| | FP8-Static | 81.13 | 74.03 | 79.30 |
| | FP8-Dynamic | 80.31 | 74.07 | 79.00 |
| | INT4-GPTQ | 79.05 | 73.05 | 74.75 |
| | INT4-AWQ | 79.35 | 73.22 | 79.38 |
| Qwen2.5-32B-Instruct | BF16 | 87.30 | 83.21 | 81.73 |
| | FP8-Static | 87.59 | 83.08 | 81.58 |
| | FP8-Dynamic | 87.30 | 83.04 | 81.58 |
| | INT4-GPTQ | 86.70 | 82.45 | 82.03 |
| | INT4-AWQ | 87.00 | 82.64 | - |
| DeepSeek-R1-Distill-Qwen-7B | BF16 | 53.49 | 53.80 | 75.74 |
| | FP8-Static | 53.57 | 54.17 | 76.19 |
| | FP8-Dynamic | 52.97 | 54.13 | 74.15 |
| | INT4-GPTQ | 51.86 | 52.44 | 75.89 |
| | INT4-AWQ | 53.49 | 53.70 | - |
| DeepSeek-R1-Distill-Qwen-14B | BF16 | 77.71 | 74.28 | 85.67 |
| | FP8-Static | 77.56 | 74.66 | 86.73 |
| | FP8-Dynamic | 76.82 | 74.63 | 87.11 |
| | INT4-GPTQ | 74.29 | 72.37 | 84.61 |
| | INT4-AWQ | 74.81 | 73.00 | 86.05 |
| DeepSeek-R1-Distill-Qwen-32B | BF16 | 84.18 | 80.89 | 87.41 |
| | FP8-Static | 83.43 | 80.90 | 87.57 |
| | FP8-Dynamic | 83.73 | 81.10 | 86.43 |
| | INT4-GPTQ | 84.10 | 79.80 | 86.73 |
| | INT4-AWQ | 82.84 | 80.15 | 87.19 |
### (2) Speculative Decoding

#### Qwen3 Series Models

Benchmark results for Qwen3 series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`. Each cell reports Speedup / τ, where τ is the average acceptance length:

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
|:-----------:|:------|:--------:|:---------:|:-----:|:------:|:----:|
| T=0 | Qwen3-1.7B | 2.05x / 2.81 | 2.07x / 2.93 | 2.11x / 2.98 | 1.93x / 2.69 | 2.04x / 2.85 |
| | Qwen3-4B | 2.21x / 3.01 | 2.36x / 3.24 | 2.42x / 3.13 | 2.32x / 2.75 | 2.33x / 3.03 |
| | Qwen3-8B | 2.65x / 3.87 | 2.64x / 3.82 | 2.86x / 4.10 | 2.58x / 3.55 | 2.68x / 3.83 |
| | Qwen3-14B | 2.42x / 3.38 | 2.57x / 3.58 | 2.75x / 3.77 | 2.27x / 3.11 | 2.50x / 3.46 |
| | Qwen3-32B | 2.39x / 2.78 | 2.37x / 2.81 | 2.47x / 2.92 | 2.42x / 2.53 | 2.41x / 2.76 |
| | Qwen3-30B-A3B | 2.84x / 3.63 | 2.27x / 3.09 | 2.64x / 3.42 | 2.83x / 3.56 | 2.64x / 3.42 |
| T=1 | Qwen3-1.7B | 1.74x / 2.53 | 1.86x / 2.70 | 1.82x / 2.69 | 1.72x / 2.46 | 1.93x / 2.60 |
| | Qwen3-4B | 1.93x / 2.60 | 2.00x / 2.84 | 2.11x / 2.82 | 2.34x / 2.50 | 1.75x / 2.69 |
| | Qwen3-8B | 1.91x / 2.84 | 2.07x / 3.05 | 2.34x / 3.26 | 2.09x / 2.92 | 2.10x / 3.02 |
| | Qwen3-14B | 1.81x / 2.58 | 1.96x / 2.81 | 2.16x / 3.09 | 1.76x / 2.49 | 1.92x / 2.74 |
| | Qwen3-32B | 1.62x / 1.91 | 1.71x / 2.05 | 1.78x / 2.10 | 1.80x / 1.95 | 1.62x / 2.00 |
| | Qwen3-30B-A3B | 1.91x / 2.46 | 2.00x / 2.64 | 1.90x / 2.53 | 1.80x / 2.32 | 1.90x / 2.48 |
#### Hunyuan Series Models

Benchmark results for Hunyuan series models with the `Eagle3` speculative decoding algorithm on datasets including `MT-bench`, `HumanEval`, `GSM8K`, and `Alpaca`, in the same Speedup / τ format:

| Temperature | Model | MT-bench | HumanEval | GSM8K | Alpaca | Mean |
|:-----------:|:------|:--------:|:---------:|:-----:|:------:|:----:|
| T=0 | Hunyuan-1.8B-Instruct | 1.97x / 2.90 | 2.58x / 3.73 | 2.61x / 3.71 | 1.71x / 2.43 | 2.22x / 3.19 |
| | Hunyuan-4B-Instruct | 1.77x / 2.60 | 2.64x / 3.35 | 2.14x / 3.17 | 1.72x / 2.57 | 2.07x / 2.92 |
| | Hunyuan-7B-Instruct | 2.22x / 3.58 | 3.59x / 5.47 | 2.96x / 4.68 | 1.64x / 2.56 | 2.60x / 4.07 |
| T=1 | Hunyuan-1.8B-Instruct | 1.58x / 2.36 | 2.35x / 3.56 | 2.23x / 3.38 | 1.26x / 1.87 | 1.86x / 2.79 |
| | Hunyuan-4B-Instruct | 1.36x / 2.05 | 1.97x / 2.86 | 1.72x / 2.68 | 1.14x / 1.76 | 1.55x / 2.34 |
| | Hunyuan-7B-Instruct | 1.90x / 3.11 | 3.12x / 5.09 | 2.74x / 4.34 | 1.47x / 2.39 | 2.31x / 3.73 |
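To try these Eagle3 weights end to end, the sketch below pairs a base Qwen3 model with its AngelSlim draft weights via vLLM's speculative decoding interface. This is a hedged example, not an AngelSlim API: it assumes a recent vLLM with EAGLE-3 support, and the exact `speculative_config` keys and draft-token count may differ across vLLM versions.

```python
# Sketch: offline inference with EAGLE-3 speculative decoding in vLLM.
# Assumptions: recent vLLM with EAGLE-3 support; config keys may vary by version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",  # target (base) model
    speculative_config={
        "method": "eagle3",                    # EAGLE-3 drafting
        "model": "AngelSlim/Qwen3-8B_eagle3",  # draft weights from the table above
        "num_speculative_tokens": 3,           # tokens drafted per step (tunable)
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```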
## 📝 License

The code for this project is open-sourced under the [License for AngelSlim](LICENSE).

## 🔗 Citation

```
@software{AngelSlim2025,
    title={{AngelSlim}},
    author={Tencent AngelSlim Project Contributors},
    year={2025},
    month={6},
    url={https://github.com/Tencent/AngelSlim},
}
```

## 💬 Technical Discussion

* AngelSlim is continuously iterating, and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub or join our [WeChat technical discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).