---
license: apache-2.0
datasets:
- code-search-net/code_search_net
base_model:
- Qwen/Qwen3-32B
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-32B-AWQ-Code1080

## Qwen3-AWQ Highlights

- Open-source. Calibration data, evaluation tools, and model quantization algorithms are fully open-source.
- Precision. Achieves lower accuracy loss than the officially quantized models.
- Process. Provides detailed quantization and testing workflows for easy reproducibility.
- Faster. The AutoQuant kernel has been released in [vLLM](https://github.com/Adlik/vllm/tree/vllm_0.8.5_autoquant), delivering better performance than the Marlin kernel.

## Model Overview

**Qwen3-32B** has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 32.8B
- Number of Parameters (Non-Embedding): 31.2B
- Number of Layers: 64
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: 32,768 tokens natively and [131,072 tokens with YaRN](https://huggingface.co/Qwen/Qwen3-32B-AWQ#processing-long-texts)
- Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our [GitHub](https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/models/Qwen3-32B_quantization_tutorial.md).

## Quantization

- Calibration data. Quantization is calibrated with the `code_6in1_1080.jsonl` dataset; you can download it from https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/datasets/code_6in1_1080.jsonl.
- Quantization algorithm. Two quantization algorithms are used: AWQ and GPTQ. We have modified the [AutoAWQ](https://github.com/Adlik/AutoAWQ/tree/autoawq_qwen3) and [AutoGPTQ](https://github.com/Adlik/AutoGPTQ/tree/qwen3_quant) frameworks for this purpose, and both are directly usable.

## Evaluation

For deployment, we use `vllm==0.8.5` and create an OpenAI-compatible API endpoint.

Non-thinking mode:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2
```

Thinking mode:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1
```

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices; an example request is shown after the installation commands below.

To facilitate testing and reproducibility, we used the open-source [evalscope](https://github.com/modelscope/evalscope) tool to evaluate the accuracy of both the bfloat16 (BF16) and quantized models.

```shell
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
```
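
Once the server is running, a quick request against the endpoint is a convenient way to confirm the sampling setup before launching evalscope. The snippet below is a minimal sketch using the `openai` Python client: the port and served model name come from the serve commands above, `EMPTY` is a placeholder API key, and the sampling values are our reading of the thinking-mode recommendations on the linked best-practices page (temperature 0.6, top_p 0.95, top_k 20, min_p 0), so double-check them against that page.

```python
# Minimal sketch: sanity-check the vLLM OpenAI-compatible endpoint started above.
# Sampling values follow the Qwen3 best-practices page for thinking mode; verify against that page.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")  # port from the serve command

response = client.chat.completions.create(
    model="Qwen3-32B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20, "min_p": 0.0},  # vLLM-specific sampling fields passed via extra_body
)
print(response.choices[0].message.content)
```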

## Performance

### Benchmarks

All test results were obtained on the following hardware:

- 4x NVIDIA A100-40G GPUs
- 2x NVIDIA H800-80G GPUs

| model\benchmarks           | think/non-think | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU  |
| -------------------------- | --------------- | -------- | --------- | --------- | ---------- | ------------ | ------ | ----- | ------ | ----- | -------- | ----- | ----- |
| Qwen3-32B-AWQ (paper)      | think           | \        | 79.4      | \         | 90.8       | 69.0         | \      | \     | \      | \     | \        | \     | \     |
|                            | non-think       | \        | \         | \         | 85.6       | 53.1         | \      | \     | \      | \     | \        | \     | \     |
| Qwen3-32B-AWQ (self-test)  | think           | 95.2     | 76.67     | 73.33     | 89.09      | 67.68        | 88.41  | 92.04 | 85.35  | 80.83 | 79.63    | 86.74 | 86.2  |
|                            | non-think       | 83.2     | 36.67     | 13.33     | 86.26      | 56.57        | 85.66  | 87.49 | 86.74  | 79.17 | 73.69    | 84.53 | 82.49 |
| Qwen3-32B-AWQ-Code1080     | think           | 94.4     | 86.67     | 73.34     | 88.18      | 71.72        | 88.34  | 93.56 | 88.21  | 81.67 | 78.62    | 86.36 | 86.43 |
|                            | non-think       | 83.6     | 26.67     | 26.66     | 85.98      | 57.07        | 84.92  | 89.39 | 87.77  | 79.17 | 72.59    | 84.54 | 82.04 |

### Inference performance

Test environment:

- 2x NVIDIA A100-40GB GPUs
- vLLM 0.8.5

To use the AutoQuant kernel, change `quant_method` from `"awq"` to `"autoquant"` in the `quantization_config` section of `config.json`:

```json
"quantization_config": {
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": null,
  "quant_method": "autoquant",
  "version": "gemm",
  "zero_point": true
},
```

A small helper for applying this edit is sketched at the end of this document.

The throughput and latency numbers below were collected with vLLM's `benchmark_throughput.py` and `benchmark_latency.py` scripts:

```shell
# throughput
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_latency.py --model /model --num-iters-warmup 10 --num-iters 50 --batch-size 16 --input-len 512 --output-len 512 -tp 2
```

- Throughput

| kernel\\(tokens/s) | type   | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
| ------------------ | ------ | ---------- | ----------- | ----------- | ----------- |
| awq_marlin         | total  | 2153.85    | 1875.67     | 1310.74     | 910.41      |
|                    | output | 1046.28    | 910.15      | 638.11      | 438.71      |
| autoquant          | total  | 2453.12    | 2111.43     | 1416.66     | 963.93      |
|                    | output | 1198.05    | 1024.29     | 689.29      | 469.88      |

- Latency (average)

| kernel\\(seconds) | batch | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
| ----------------- | ----- | ---------- | ---------- | ----------- | ----------- |
| awq_marlin        | 16    | 2.4654     | 10.1091    | 21.3455     | 47.7168     |
|                   | 64    | 4.8633     | 20.8356    | 47.3302     | 170.8086    |
| autoquant         | 16    | 2.3916     | 9.9021     | 21.0006     | 46.9298     |
|                   | 64    | 4.7231     | 20.2468    | 46.0811     | 168.4375    |
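
For convenience, the `quantization_config` switch described above can be applied with a short script. This is a minimal sketch that assumes the quantized checkpoint sits in `/model`, matching the serve and benchmark commands in this card; adjust the path to your layout.

```python
# Minimal sketch: switch the quantization kernel to AutoQuant by editing config.json.
# Assumes the quantized model directory is /model, as in the commands above.
import json
from pathlib import Path

config_path = Path("/model/config.json")
config = json.loads(config_path.read_text())

# Change the quant method from "awq" to "autoquant" so vLLM selects the AutoQuant kernel.
config["quantization_config"]["quant_method"] = "autoquant"

config_path.write_text(json.dumps(config, indent=2))
print(config["quantization_config"])
```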