---
license: apache-2.0
datasets:
  - code-search-net/code_search_net
base_model:
  - Qwen/Qwen3-32B
pipeline_tag: text-generation
library_name: transformers
---

Qwen3-32B-AWQ-Code1080

Qwen3-AWQ Highlights

  • Open-source. Calibration data, evaluation tools, and model quantization algorithms are fully open-source.
  • Precision. Achieves lower accuracy loss compared to officially quantized models.
  • Process. Provides detailed quantization and testing workflows for easy reproducibility.
  • Speed. The AutoQuant kernel has been released in vLLM and delivers better performance than the Marlin kernel.

Model Overview

Qwen3-32B has the following features:

  • Type: Causal Language Models
  • Training Stage: Pretraining & Post-training
  • Number of Parameters: 32.8B
  • Number of Parameters (Non-Embedding): 31.2B
  • Number of Layers: 64
  • Number of Attention Heads (GQA): 64 for Q and 8 for KV
  • Context Length: 32,768 natively and 131,072 tokens with YaRN.
  • Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our GitHub.
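As a quick usage reference, the quantized checkpoint loads with transformers like any other Qwen3 model. The snippet below is a minimal sketch: the repository id is assumed from this card's title, and enable_thinking is the standard Qwen3 chat-template switch between think and non-think modes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with a local path if you have downloaded the weights.
model_name = "lyg95/Qwen3-32B-AWQ-Code1080"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks whether a number is prime."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set to False for non-think mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```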

Quantization

  • Calibration data

The model quantization process uses a code-domain calibration set, code_6in1_1080.jsonl, drawn from open code corpora (see the datasets listed in the metadata). You can download the data from https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/datasets/code_6in1_1080.jsonl.
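The calibration file is a JSONL of code samples. A minimal loader might look like the sketch below; the per-line field name ("text") is an assumption, so check one line of the file for the actual schema.

```python
import json

def load_calibration_data(path, max_samples=None):
    """Read calibration samples from a JSONL file, one JSON object per line."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # "text" is an assumed field name; adjust to the actual key in code_6in1_1080.jsonl.
            samples.append(record.get("text", ""))
            if max_samples is not None and len(samples) >= max_samples:
                break
    return samples

calib_data = load_calibration_data("code_6in1_1080.jsonl")
print(f"Loaded {len(calib_data)} calibration samples")
```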

  • Quantization algorithm

The model quantization process supports two quantization algorithms, AWQ and GPTQ. We have modified the AutoAWQ and AutoGPTQ frameworks for this purpose; the modified versions can be used directly.
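For orientation, a stock AutoAWQ run with the settings this card reports (4-bit, group size 128, GEMM) looks roughly like the following. It is a sketch of the general workflow, not the exact modified pipeline used for this release.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-32B"
quant_path = "Qwen3-32B-AWQ-Code1080"

# Matches the quantization_config shown later in this card: 4-bit weights, group size 128, GEMM kernel.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Placeholder calibration samples; in practice, pass the code_6in1_1080.jsonl contents.
calib_data = ["def add(a, b):\n    return a + b"] * 128

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```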

Evaluation

For deployment, we use vllm==0.8.5 and create an OpenAI-compatible API endpoint:

non-think:

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust-remote-code --port 48001 --tensor-parallel-size 2

think:

VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust-remote-code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices.
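For example, a thinking-mode request against the endpoint above with the recommended sampling parameters (temperature 0.6, top-p 0.95, top-k 20, min-p 0) can be sent with the OpenAI Python client; the port and served model name follow the serve commands, and top_k/min_p are passed through vLLM's extra_body since they are not part of the OpenAI schema.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "Explain what AWQ 4-bit quantization does."}],
    # Thinking-mode settings from the Qwen3 best-practices guide;
    # for non-think mode use temperature=0.7 and top_p=0.8 instead.
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20, "min_p": 0},
)
print(response.choices[0].message.content)
```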

To facilitate testing and reproducibility, we utilized the open-source evalscope tool to evaluate the accuracy of both bfloat16 (BF16) and quantized models.

git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
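A service-mode evaluation run can then be scripted against the vLLM endpoint roughly as follows. The TaskConfig field names below are taken from evalscope's API-evaluation examples and should be treated as assumptions to verify against the v0.17.0 documentation.

```python
from evalscope import TaskConfig, run_task

# Field names follow evalscope's service (OpenAI-API) evaluation examples and may
# differ slightly across versions; check the v0.17.0 docs before running.
task_cfg = TaskConfig(
    model="Qwen3-32B",
    api_url="http://localhost:48001/v1/chat/completions",
    api_key="EMPTY",
    eval_type="service",
    datasets=["gsm8k", "ifeval"],
    eval_batch_size=16,
    generation_config={
        "temperature": 0.6,  # thinking-mode sampling, per the best-practices settings above
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 8192,
    },
)

run_task(task_cfg=task_cfg)
```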

Performance

Benchmarks

All test results were obtained on the following hardware:

  • 4x NVIDIA A100-40G GPUs
  • 2x NVIDIA H800-80G GPUs

| Model | Mode | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-32B-AWQ (paper) | think | - | 79.4 | - | 90.8 | 69.0 | - | - | - | - | - | - | - |
| Qwen3-32B-AWQ (paper) | non-think | - | - | - | 85.6 | 53.1 | - | - | - | - | - | - | - |
| Qwen3-32B-AWQ (self-test) | think | 95.2 | 76.67 | 73.33 | 89.09 | 67.68 | 88.41 | 92.04 | 85.35 | 80.83 | 79.63 | 86.74 | 86.2 |
| Qwen3-32B-AWQ (self-test) | non-think | 83.2 | 36.67 | 13.33 | 86.26 | 56.57 | 85.66 | 87.49 | 86.74 | 79.17 | 73.69 | 84.53 | 82.49 |
| Qwen3-32B-AWQ-Code1080 | think | 94.4 | 86.67 | 73.34 | 88.18 | 71.72 | 88.34 | 93.56 | 88.21 | 81.67 | 78.62 | 86.36 | 86.43 |
| Qwen3-32B-AWQ-Code1080 | non-think | 83.6 | 26.67 | 26.66 | 85.98 | 57.07 | 84.92 | 89.39 | 87.77 | 79.17 | 72.59 | 84.54 | 82.04 |

Inference Performance

  • 2x NVIDIA A100-40G GPUs
  • vLLM 0.8.5

"To use AutoQuant, simply modify the config.json file as shown below:

"quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "autoquant",  // change from "awq" to "autoquant"
    "version": "gemm",
    "zero_point": true
  },
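If you prefer to script the switch instead of editing the file by hand, a small helper along these lines (a convenience sketch, not part of the release) rewrites the field in place:

```python
import json
from pathlib import Path

# Path of the deployed checkpoint, matching the /model path used in the serve commands above.
config_path = Path("/model/config.json")
config = json.loads(config_path.read_text())

# Switch the quantization backend from the default AWQ (Marlin) kernel to AutoQuant.
config["quantization_config"]["quant_method"] = "autoquant"

config_path.write_text(json.dumps(config, indent=2))
print("quant_method set to:", config["quantization_config"]["quant_method"])
```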
# throughput
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_latency.py --model /model --num-iters-warmup 10  --num-iters 50  --batch-size 16 --input-len 512 --output-len 512 -tp 2
  • Throughput (tokens/s)

| Kernel | Type | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
|---|---|---|---|---|---|
| awq_marlin | total | 2153.85 | 1875.67 | 1310.74 | 910.41 |
| awq_marlin | output | 1046.28 | 910.15 | 638.11 | 438.71 |
| autoquant | total | 2453.12 | 2111.43 | 1416.66 | 963.93 |
| autoquant | output | 1198.05 | 1024.29 | 689.29 | 469.88 |
  • Latency (average, seconds)

| Kernel | Batch | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
|---|---|---|---|---|---|
| awq_marlin | 16 | 2.4654 | 10.1091 | 21.3455 | 47.7168 |
| awq_marlin | 64 | 4.8633 | 20.8356 | 47.3302 | 170.8086 |
| autoquant | 16 | 2.3916 | 9.9021 | 21.0006 | 46.9298 |
| autoquant | 64 | 4.7231 | 20.2468 | 46.0811 | 168.4375 |