Qwen3-32B-AWQ-Code1080

Qwen3-AWQ Highlights

  • Open-source. Calibration data, evaluation tools, and model quantization algorithms are fully open-source.
  • Precision. Achieves lower accuracy loss than the officially quantized models.
  • Process. Provides detailed quantization and testing workflows for easy reproducibility.
  • Speed. The AutoQuant kernel, now available in vLLM, outperforms the Marlin kernel.

Model Overview

Qwen3-32B has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 32.8B
  • Number of Parameters (Non-Embedding): 31.2B
  • Number of Layers: 64
  • Number of Attention Heads (GQA): 64 for Q and 8 for KV
  • Context Length: 32,768 tokens natively and 131,072 tokens with YaRN.
  • Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our GitHub.

Quantization

  • Calibration data

The quantization process uses the code_6in1_1080.jsonl dataset for calibration. You can download the data from https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/datasets/code_6in1_1080.jsonl.

  • Quantization algorithm

The quantization process employs two algorithms: AWQ and GPTQ. We modified the AutoAWQ and AutoGPTQ frameworks for this purpose; both modified frameworks are directly usable.
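For illustration, here is a minimal AWQ quantization sketch using the stock AutoAWQ API (the modified Adlik fork may differ); the local calibration file path and the per-record "text" field are assumptions.

import json

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_model = "Qwen/Qwen3-32B"
quant_path = "Qwen3-32B-AWQ-Code1080"

# Load the code calibration set; each JSONL record is assumed to carry a "text" field.
with open("code_6in1_1080.jsonl") as f:
    calib_data = [json.loads(line)["text"] for line in f]

# 4-bit weights, group size 128, zero-point GEMM kernels, matching the
# quantization_config shown later in this card.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)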

Evaluation

For deployment, we use vllm==0.8.5 and create an OpenAI-compatible API endpoint:

Non-thinking mode:

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust-remote-code --port 48001 --tensor-parallel-size 2

Thinking mode:

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust-remote-code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices.
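For reference, here is a minimal client-side sketch using the openai Python package against the endpoint above, with the thinking-mode settings from that guide (temperature 0.6, top_p 0.95, top_k 20, min_p 0); top_k and min_p are passed via vLLM's extra_body extension, since the OpenAI schema does not define them, and the prompt is only a placeholder.

from openai import OpenAI

# Point the client at the vLLM server started above.
client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-32B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20, "min_p": 0.0},
)
print(resp.choices[0].message.content)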

To facilitate testing and reproducibility, we utilized the open-source evalscope tool to evaluate the accuracy of both bfloat16 (BF16) and quantized models.

git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
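The evaluation can then be driven against the served endpoint. Below is a sketch using evalscope's Python interface; the service-eval fields and the dataset list are assumptions based on the evalscope documentation, not this card's exact configuration.

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="Qwen3-32B",  # must match --served-model-name
    api_url="http://localhost:48001/v1/chat/completions",
    api_key="EMPTY",
    eval_type="service",
    datasets=["gsm8k", "ifeval"],
)
run_task(task_cfg=task_cfg)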

Benchmarks

All accuracy results below were obtained on one of the following hardware setups:

  • 4x NVIDIA A100-40G GPUs
  • 2x NVIDIA H800-80G GPUs
| model | mode | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-32B-AWQ (paper) | think | \ | 79.4 | \ | 90.8 | 69.0 | \ | \ | \ | \ | \ | \ | \ |
| Qwen3-32B-AWQ (paper) | non-think | \ | \ | \ | 85.6 | 53.1 | \ | \ | \ | \ | \ | \ | \ |
| Qwen3-32B-AWQ (self-test) | think | 95.2 | 76.67 | 73.33 | 89.09 | 67.68 | 88.41 | 92.04 | 85.35 | 80.83 | 79.63 | 86.74 | 86.2 |
| Qwen3-32B-AWQ (self-test) | non-think | 83.2 | 36.67 | 13.33 | 86.26 | 56.57 | 85.66 | 87.49 | 86.74 | 79.17 | 73.69 | 84.53 | 82.49 |
| Qwen3-32B-AWQ-Code1080 | think | 94.4 | 86.67 | 73.34 | 88.18 | 71.72 | 88.34 | 93.56 | 88.21 | 81.67 | 78.62 | 86.36 | 86.43 |
| Qwen3-32B-AWQ-Code1080 | non-think | 83.6 | 26.67 | 26.66 | 85.98 | 57.07 | 84.92 | 89.39 | 87.77 | 79.17 | 72.59 | 84.54 | 82.04 |

Performance

The kernel speed comparison below was run with:

  • 2 x NVIDIA A100-40GB GPUs
  • vLLM 0.8.5

"To use AutoQuant, simply modify the config.json file as shown below:

"quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "autoquant",  // change from "awq" to "autoquant"
    "version": "gemm",
    "zero_point": true
  },
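If you prefer to script that edit, here is a small sketch (the /model path mirrors the serve commands above):

import json
import pathlib

# Flip the quantization backend in the checkpoint's config.json.
cfg_path = pathlib.Path("/model/config.json")
cfg = json.loads(cfg_path.read_text())
cfg["quantization_config"]["quant_method"] = "autoquant"  # was "awq"
cfg_path.write_text(json.dumps(cfg, indent=2))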
# throughput (benchmark_throughput.py comes from the vLLM repository's benchmarks/ directory)
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency (benchmark_latency.py comes from the same directory)
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_latency.py --model /model --num-iters-warmup 10 --num-iters 50 --batch-size 16 --input-len 512 --output-len 512 -tp 2
  • Throughput (tokens/s)

| kernel | type | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
|---|---|---|---|---|---|
| awq_marlin | total | 2153.85 | 1875.67 | 1310.74 | 910.41 |
| awq_marlin | output | 1046.28 | 910.15 | 638.11 | 438.71 |
| autoquant | total | 2453.12 | 2111.43 | 1416.66 | 963.93 |
| autoquant | output | 1198.05 | 1024.29 | 689.29 | 469.88 |
  • Latency (average, seconds)

| kernel | batch | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
|---|---|---|---|---|---|
| awq_marlin | 16 | 2.4654 | 10.1091 | 21.3455 | 47.7168 |
| awq_marlin | 64 | 4.8633 | 20.8356 | 47.3302 | 170.8086 |
| autoquant | 16 | 2.3916 | 9.9021 | 21.0006 | 46.9298 |
| autoquant | 64 | 4.7231 | 20.2468 | 46.0811 | 168.4375 |