---
license: apache-2.0
datasets:
  - code-search-net/code_search_net
base_model:
  - Qwen/Qwen3-32B
pipeline_tag: text-generation
library_name: transformers
---

Qwen3-32B-AWQ-Code1080

Qwen3-AWQ Highlights

  • Open-source. Calibration data, evaluation tools, and model quantization algorithms are fully open-source.
  • Precision. Achieves lower accuracy loss compared to officially quantized models.
  • Process. Provides detailed quantization and testing workflows for easy reproducibility.
  • Speed. The AutoQuant kernel has been released in vLLM and delivers better performance than the Marlin kernel.

Model Overview

Qwen3-32B has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 32.8B
  • Number of Parameters (Non-Embedding): 31.2B
  • Number of Layers: 64
  • Number of Attention Heads (GQA): 64 for Q and 8 for KV
  • Context Length: 32,768 natively and 131,072 tokens with YaRN.
  • Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our GitHub.
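As a quick usage reference, the quantized checkpoint loads with transformers like any other Qwen3 model. The snippet below is a minimal sketch: the repository id is assumed from this card's title, and enable_thinking is the standard Qwen3 chat-template switch between think and non-think modes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with a local path if you have downloaded the weights.
model_name = "lyg95/Qwen3-32B-AWQ-Code1080"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks whether a number is prime."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set to False for non-think mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```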

Quantization

  • Calibration data

The model quantization process uses a code-domain calibration set, code_6in1_1080.jsonl, drawn from open code corpora (see the datasets listed in the metadata). You can download the data from https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/datasets/code_6in1_1080.jsonl.
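The calibration file is a JSONL of code samples. A minimal loader might look like the sketch below; the per-line field name ("text") is an assumption, so check one line of the file for the actual schema.

```python
import json

def load_calibration_data(path, max_samples=None):
    """Read calibration samples from a JSONL file, one JSON object per line."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # "text" is an assumed field name; adjust to the actual key in code_6in1_1080.jsonl.
            samples.append(record.get("text", ""))
            if max_samples is not None and len(samples) >= max_samples:
                break
    return samples

calib_data = load_calibration_data("code_6in1_1080.jsonl")
print(f"Loaded {len(calib_data)} calibration samples")
```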

  • Quantization algorithm

The model quantization process supports two quantization algorithms, AWQ and GPTQ. We have modified the AutoAWQ and AutoGPTQ frameworks for this purpose; the modified versions can be used directly.
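For orientation, a stock AutoAWQ run with the settings this card reports (4-bit, group size 128, GEMM) looks roughly like the following. It is a sketch of the general workflow, not the exact modified pipeline used for this release.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-32B"
quant_path = "Qwen3-32B-AWQ-Code1080"

# Matches the quantization_config shown later in this card: 4-bit weights, group size 128, GEMM kernel.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Placeholder calibration samples; in practice, pass the code_6in1_1080.jsonl contents.
calib_data = ["def add(a, b):\n    return a + b"] * 128

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```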

Evaluation

For deployment, we use vllm==0.8.5 and create an OpenAI-compatible API endpoint:

non-think:

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust-remote-code --port 48001 --tensor-parallel-size 2

think:

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust-remote-code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices.
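For example, a thinking-mode request against the endpoint above with the recommended sampling parameters (temperature 0.6, top-p 0.95, top-k 20, min-p 0) can be sent with the OpenAI Python client; the port and served model name follow the serve commands, and top_k/min_p are passed through vLLM's extra_body since they are not part of the OpenAI schema.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "Explain what AWQ 4-bit quantization does."}],
    # Thinking-mode settings from the Qwen3 best-practices guide;
    # for non-think mode use temperature=0.7 and top_p=0.8 instead.
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20, "min_p": 0},
)
print(response.choices[0].message.content)
```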

To facilitate testing and reproducibility, we utilized the open-source evalscope tool to evaluate the accuracy of both bfloat16 (BF16) and quantized models.

git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
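A service-mode evaluation run can then be scripted against the vLLM endpoint roughly as follows. The TaskConfig field names below are taken from evalscope's API-evaluation examples and should be treated as assumptions to verify against the v0.17.0 documentation.

```python
from evalscope import TaskConfig, run_task

# Field names follow evalscope's service (OpenAI-API) evaluation examples and may
# differ slightly across versions; check the v0.17.0 docs before running.
task_cfg = TaskConfig(
    model="Qwen3-32B",
    api_url="http://localhost:48001/v1/chat/completions",
    api_key="EMPTY",
    eval_type="service",
    datasets=["gsm8k", "ifeval"],
    eval_batch_size=16,
    generation_config={
        "temperature": 0.6,  # thinking-mode sampling, per the best-practices settings above
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 8192,
    },
)

run_task(task_cfg=task_cfg)
```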

Performance

Benchmarks

All test results were obtained on the following hardware:

  • 4x NVIDIA A100-40G GPUs
  • 2x NVIDIA H800-80G GPUs

| Model | Mode | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-32B-AWQ (paper) | think | - | 79.4 | - | 90.8 | 69.0 | - | - | - | - | - | - | - |
| Qwen3-32B-AWQ (paper) | non-think | - | - | - | 85.6 | 53.1 | - | - | - | - | - | - | - |
| Qwen3-32B-AWQ (self-test) | think | 95.2 | 76.67 | 73.33 | 89.09 | 67.68 | 88.41 | 92.04 | 85.35 | 80.83 | 79.63 | 86.74 | 86.2 |
| Qwen3-32B-AWQ (self-test) | non-think | 83.2 | 36.67 | 13.33 | 86.26 | 56.57 | 85.66 | 87.49 | 86.74 | 79.17 | 73.69 | 84.53 | 82.49 |
| Qwen3-32B-AWQ-Code1080 | think | 94.4 | 86.67 | 73.34 | 88.18 | 71.72 | 88.34 | 93.56 | 88.21 | 81.67 | 78.62 | 86.36 | 86.43 |
| Qwen3-32B-AWQ-Code1080 | non-think | 83.6 | 26.67 | 26.66 | 85.98 | 57.07 | 84.92 | 89.39 | 87.77 | 79.17 | 72.59 | 84.54 | 82.04 |

Inference Performance

  • 2x NVIDIA A100-40G GPUs
  • vLLM 0.8.5

"To use AutoQuant, simply modify the config.json file as shown below:

"quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "autoquant",  // change from "awq" to "autoquant"
    "version": "gemm",
    "zero_point": true
  },
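If you prefer to script the switch instead of editing the file by hand, a small helper along these lines (a convenience sketch, not part of the release) rewrites the field in place:

```python
import json
from pathlib import Path

# Path of the deployed checkpoint, matching the /model path used in the serve commands above.
config_path = Path("/model/config.json")
config = json.loads(config_path.read_text())

# Switch the quantization backend from the default AWQ (Marlin) kernel to AutoQuant.
config["quantization_config"]["quant_method"] = "autoquant"

config_path.write_text(json.dumps(config, indent=2))
print("quant_method set to:", config["quantization_config"]["quant_method"])
```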
# throughput
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_latency.py --model /model --num-iters-warmup 10  --num-iters 50  --batch-size 16 --input-len 512 --output-len 512 -tp 2
  • Throughput (tokens/s)

| Kernel | Type | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
|---|---|---|---|---|---|
| awq_marlin | total | 2153.85 | 1875.67 | 1310.74 | 910.41 |
| awq_marlin | output | 1046.28 | 910.15 | 638.11 | 438.71 |
| autoquant | total | 2453.12 | 2111.43 | 1416.66 | 963.93 |
| autoquant | output | 1198.05 | 1024.29 | 689.29 | 469.88 |
  • Latency (average, seconds)

| Kernel | Batch | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
|---|---|---|---|---|---|
| awq_marlin | 16 | 2.4654 | 10.1091 | 21.3455 | 47.7168 |
| awq_marlin | 64 | 4.8633 | 20.8356 | 47.3302 | 170.8086 |
| autoquant | 16 | 2.3916 | 9.9021 | 21.0006 | 46.9298 |
| autoquant | 64 | 4.7231 | 20.2468 | 46.0811 | 168.4375 |