Qwen3-32B-AWQ-Code1080
Qwen3-AWQ Highlights
- Open-source. Calibration data, evaluation tools, and model quantization algorithms are fully open-source.
- Precision. Achieves lower accuracy loss compared to officially quantized models.
- Process. Provides detailed quantization and testing workflows for easy reproducibility.
- Faster. The AutoQuant kernel has been released in vLLM, delivering superior performance compared to the Marlin kernel.
Model Overview
Qwen3-32B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 32.8B
- Number of Parameters (Non-Embedding): 31.2B
- Number of Layers: 64
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: 32,768 natively and 131,072 tokens with YaRN.
- Quantization: AWQ 4-bit
For more details, including benchmark evaluation and inference performance, please refer to our GitHub.
Quantization
- Calibration data
The quantization process uses the Pile dataset for calibration. You can download the calibration data from https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/datasets/code_6in1_1080.jsonl.
- Quantization algorithm
The quantization process employs two algorithms: AWQ and GPTQ. We have modified the AutoAWQ and AutoGPTQ frameworks for this purpose; the modified versions are directly usable.
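For illustration only, the sketch below shows how a calibration file like the one above could drive a standard AWQ 4-bit run. It uses the upstream AutoAWQ API rather than our modified framework, and the JSONL field name `text` as well as all paths are assumptions:

```python
# Minimal sketch of an AWQ 4-bit quantization run using the calibration JSONL.
# Uses the stock AutoAWQ API; paths, the "text" field name, and the output
# directory are placeholders, not the exact pipeline used for this model.
import json
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-32B"          # base model to quantize
calib_path = "code_6in1_1080.jsonl"    # calibration data downloaded above
quant_path = "Qwen3-32B-AWQ-Code1080"  # where to save the quantized weights

# Load calibration samples as a list of plain strings.
with open(calib_path) as f:
    calib_data = [json.loads(line)["text"] for line in f]

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)

# 4-bit weights, group size 128, zero-point GEMM kernel -- matching the
# quantization_config shown later in this card.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```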
Evaluation
For deployment, we use vllm==0.8.5 and create an OpenAI-compatible API endpoint.

Non-thinking mode:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2
```

Thinking mode:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1
```
Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices.
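As a usage reference, a minimal client call against the endpoint started above might look as follows. The thinking-mode values (temperature 0.6, top_p 0.95, top_k 20) follow the linked best-practices section; the port and served model name match the serve commands:

```python
# Minimal sketch: query the OpenAI-compatible endpoint started above with the
# sampling parameters recommended for thinking mode in the Qwen3 best practices.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "Write a Python function that checks whether a number is prime."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20},  # top_k is passed through vLLM's extra sampling params
)
print(response.choices[0].message.content)
```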
To facilitate testing and reproducibility, we use the open-source evalscope tool (v0.17.0) to evaluate the accuracy of both the bfloat16 (BF16) and quantized models:

```bash
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
```
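A rough sketch of pointing an evalscope run at the served endpoint is shown below. The TaskConfig fields used here (eval_type="service", api_url, api_key) follow evalscope's API-service evaluation mode and are assumptions that may need adjusting for your version:

```python
# Rough sketch: evaluate the served model on two of the benchmarks listed below
# via evalscope. Import paths and field names may differ across evalscope versions.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen3-32B",                                     # --served-model-name used above
    api_url="http://localhost:48001/v1/chat/completions",  # OpenAI-compatible endpoint
    api_key="EMPTY",
    eval_type="service",                                   # evaluate through the HTTP API
    datasets=["gsm8k", "ifeval"],
)
run_task(task_cfg=task_cfg)
```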
Performance
Benchmarks
All test results were obtained on the following hardware:
- 4x NVIDIA A100-40G GPUs
- 2x NVIDIA H800-80G GPUs
| model \ benchmarks | think/non-think | math_500 | AIME 2024 | AIME 2025 | MMLU-REDUX | GPQA-Diamond | ceval | gsm8k | ifeval | iquiz | trivia_qa | CMMLU | mmlu |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3-32B-AWQ (paper) | think | \ | 79.4 | \ | 90.8 | 69.0 | \ | \ | \ | \ | \ | \ | \ |
| qwen3-32B-AWQ (paper) | non-think | \ | \ | \ | 85.6 | 53.1 | \ | \ | \ | \ | \ | \ | \ |
| qwen3-32B-AWQ (self-test) | think | 95.2 | 76.67 | 73.33 | 89.09 | 67.68 | 88.41 | 92.04 | 85.35 | 80.83 | 79.63 | 86.74 | 86.2 |
| qwen3-32B-AWQ (self-test) | non-think | 83.2 | 36.67 | 13.33 | 86.26 | 56.57 | 85.66 | 87.49 | 86.74 | 79.17 | 73.69 | 84.53 | 82.49 |
| Qwen3-32B-AWQ-Code1080 | think | 94.4 | 86.67 | 73.34 | 88.18 | 71.72 | 88.34 | 93.56 | 88.21 | 81.67 | 78.62 | 86.36 | 86.43 |
| Qwen3-32B-AWQ-Code1080 | non-think | 83.6 | 26.67 | 26.66 | 85.98 | 57.07 | 84.92 | 89.39 | 87.77 | 79.17 | 72.59 | 84.54 | 82.04 |
Performance
The speed results below were measured on:
- 2x NVIDIA A100-40GB GPUs
- vLLM 0.8.5
"To use AutoQuant, simply modify the config.json
file as shown below:
"quantization_config": {
"bits": 4,
"group_size": 128,
"modules_to_not_convert": null,
"quant_method": "autoquant", // change from "awq" to "autoquant"
"version": "gemm",
"zero_point": true
},
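If you prefer to script the edit, a small helper along these lines performs the same change (the model directory path is a placeholder):

```python
# Small helper to switch the quantization kernel in an existing config.json.
# The model directory path is a placeholder; point it at your local checkpoint.
import json
from pathlib import Path

config_path = Path("/model/config.json")
config = json.loads(config_path.read_text())

# Switch quant_method from "awq" to "autoquant".
config["quantization_config"]["quant_method"] = "autoquant"

config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```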
Throughput and latency were measured with vLLM's benchmark scripts:

```bash
# throughput
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_latency.py --model /model --num-iters-warmup 10 --num-iters 50 --batch-size 16 --input-len 512 --output-len 512 -tp 2
```
- Throughput

| kernel (tokens/s) | type | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
| --- | --- | --- | --- | --- | --- |
| awq_marlin | total | 2153.85 | 1875.67 | 1310.74 | 910.41 |
| awq_marlin | output | 1046.28 | 910.15 | 638.11 | 438.71 |
| autoquant | total | 2453.12 | 2111.43 | 1416.66 | 963.93 |
| autoquant | output | 1198.05 | 1024.29 | 689.29 | 469.88 |
- Latency (average, seconds)

| kernel | batch | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
| --- | --- | --- | --- | --- | --- |
| awq_marlin | 16 | 2.4654 | 10.1091 | 21.3455 | 47.7168 |
| awq_marlin | 64 | 4.8633 | 20.8356 | 47.3302 | 170.8086 |
| autoquant | 16 | 2.3916 | 9.9021 | 21.0006 | 46.9298 |
| autoquant | 64 | 4.7231 | 20.2468 | 46.0811 | 168.4375 |
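As a quick check on the throughput table, the relative gain of autoquant over awq_marlin can be computed directly from the "total" rows:

```python
# Quick arithmetic on the throughput table above: relative gain of autoquant
# over awq_marlin, using the "total" tokens/s rows (values copied verbatim).
awq_marlin_total = {512: 2153.85, 1024: 1875.67, 2048: 1310.74, 4096: 910.41}
autoquant_total  = {512: 2453.12, 1024: 2111.43, 2048: 1416.66, 4096: 963.93}

for in_out, baseline in awq_marlin_total.items():
    gain = (autoquant_total[in_out] / baseline - 1) * 100
    print(f"in/out={in_out}: +{gain:.1f}% total throughput")
# Roughly +13.9%, +12.6%, +8.1%, +5.9% as the sequence length grows.
```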