---
license: apache-2.0
datasets:
- code-search-net/code_search_net
base_model:
- Qwen/Qwen3-32B
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-32B-AWQ-Code1080

## Qwen3-AWQ Highlights

- Open-source. Calibration data, evaluation tools, and model quantization algorithms are fully open-source.
- Precision. Achieves lower accuracy loss than the officially quantized models.
- Process. Provides detailed quantization and testing workflows for easy reproducibility.
- Faster. The AutoQuant kernel has been released in [vLLM](https://github.com/Adlik/vllm/tree/vllm_0.8.5_autoquant), delivering better performance than the Marlin kernel.

## Model Overview

**Qwen3-32B** has the following features:

- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 32.8B
- Number of Parameters (Non-Embedding): 31.2B
- Number of Layers: 64
- Number of Attention Heads (GQA): 64 for Q and 8 for KV
- Context Length: 32,768 tokens natively and [131,072 tokens with YaRN](https://huggingface.co/Qwen/Qwen3-32B-AWQ#processing-long-texts)
- Quantization: AWQ 4-bit

For more details, including benchmark evaluation and inference performance, please refer to our [GitHub](https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/models/Qwen3-32B_quantization_tutorial.md).

## Quantization

- Calibration data. Quantization is calibrated with the `code_6in1_1080.jsonl` dataset; you can download it from https://github.com/Adlik/model_zoo/blob/qwen3_quant/LLM/datasets/code_6in1_1080.jsonl.
- Quantization algorithm. Two quantization algorithms are used: AWQ and GPTQ. We have modified the [AutoAWQ](https://github.com/Adlik/AutoAWQ/tree/autoawq_qwen3) and [AutoGPTQ](https://github.com/Adlik/AutoGPTQ/tree/qwen3_quant) frameworks for this purpose, and both are directly usable.

## Evaluation

For deployment, we use `vllm==0.8.5` and create an OpenAI-compatible API endpoint.

Non-thinking mode:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2
```

Thinking mode:

```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0,1 vllm serve /model --gpu-memory-utilization 0.9 --served-model-name Qwen3-32B --trust_remote_code --port 48001 --tensor-parallel-size 2 --enable-reasoning --reasoning-parser deepseek_r1
```

Sampling parameters are set to match https://huggingface.co/Qwen/Qwen3-32B#best-practices; an example request is shown after the installation commands below.

To facilitate testing and reproducibility, we used the open-source [evalscope](https://github.com/modelscope/evalscope) tool to evaluate the accuracy of both the bfloat16 (BF16) and quantized models.

```shell
git clone https://github.com/modelscope/evalscope.git
cd evalscope/
git checkout -b v0.17.0 tags/v0.17.0
pip install -e .
```
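
Once the server is running, a quick request against the endpoint is a convenient way to confirm the sampling setup before launching evalscope. The snippet below is a minimal sketch using the `openai` Python client: the port and served model name come from the serve commands above, `EMPTY` is a placeholder API key, and the sampling values are our reading of the thinking-mode recommendations on the linked best-practices page (temperature 0.6, top_p 0.95, top_k 20, min_p 0), so double-check them against that page.

```python
# Minimal sketch: sanity-check the vLLM OpenAI-compatible endpoint started above.
# Sampling values follow the Qwen3 best-practices page for thinking mode; verify against that page.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:48001/v1", api_key="EMPTY")  # port from the serve command

response = client.chat.completions.create(
    model="Qwen3-32B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,
    extra_body={"top_k": 20, "min_p": 0.0},  # vLLM-specific sampling fields passed via extra_body
)
print(response.choices[0].message.content)
```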

## Performance

### Benchmarks

All test results were obtained on the following hardware:

- 4x NVIDIA A100-40G GPUs
- 2x NVIDIA H800-80G GPUs

| model\benchmarks           | think/non-think | MATH-500 | AIME 2024 | AIME 2025 | MMLU-Redux | GPQA-Diamond | C-Eval | GSM8K | IFEval | IQuiz | TriviaQA | CMMLU | MMLU  |
| -------------------------- | --------------- | -------- | --------- | --------- | ---------- | ------------ | ------ | ----- | ------ | ----- | -------- | ----- | ----- |
| Qwen3-32B-AWQ (paper)      | think           | \        | 79.4      | \         | 90.8       | 69.0         | \      | \     | \      | \     | \        | \     | \     |
|                            | non-think       | \        | \         | \         | 85.6       | 53.1         | \      | \     | \      | \     | \        | \     | \     |
| Qwen3-32B-AWQ (self-test)  | think           | 95.2     | 76.67     | 73.33     | 89.09      | 67.68        | 88.41  | 92.04 | 85.35  | 80.83 | 79.63    | 86.74 | 86.2  |
|                            | non-think       | 83.2     | 36.67     | 13.33     | 86.26      | 56.57        | 85.66  | 87.49 | 86.74  | 79.17 | 73.69    | 84.53 | 82.49 |
| Qwen3-32B-AWQ-Code1080     | think           | 94.4     | 86.67     | 73.34     | 88.18      | 71.72        | 88.34  | 93.56 | 88.21  | 81.67 | 78.62    | 86.36 | 86.43 |
|                            | non-think       | 83.6     | 26.67     | 26.66     | 85.98      | 57.07        | 84.92  | 89.39 | 87.77  | 79.17 | 72.59    | 84.54 | 82.04 |

### Inference performance

Test environment:

- 2x NVIDIA A100-40GB GPUs
- vLLM 0.8.5

To use the AutoQuant kernel, change `quant_method` from `"awq"` to `"autoquant"` in the `quantization_config` section of `config.json`:

```json
"quantization_config": {
  "bits": 4,
  "group_size": 128,
  "modules_to_not_convert": null,
  "quant_method": "autoquant",
  "version": "gemm",
  "zero_point": true
},
```

A small helper for applying this edit is sketched at the end of this document.

The throughput and latency numbers below were collected with vLLM's `benchmark_throughput.py` and `benchmark_latency.py` scripts:

```shell
# throughput
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_throughput.py --model /model --input-len 1024 --output-len 1024 -tp 2 --max-model-len 40960 --num-prompts 100

# latency
CUDA_VISIBLE_DEVICES=4,5 python3 benchmark_latency.py --model /model --num-iters-warmup 10 --num-iters 50 --batch-size 16 --input-len 512 --output-len 512 -tp 2
```

- Throughput

| kernel\\(tokens/s) | type   | in/out=512 | in/out=1024 | in/out=2048 | in/out=4096 |
| ------------------ | ------ | ---------- | ----------- | ----------- | ----------- |
| awq_marlin         | total  | 2153.85    | 1875.67     | 1310.74     | 910.41      |
|                    | output | 1046.28    | 910.15      | 638.11      | 438.71      |
| autoquant          | total  | 2453.12    | 2111.43     | 1416.66     | 963.93      |
|                    | output | 1198.05    | 1024.29     | 689.29      | 469.88      |

- Latency (average)

| kernel\\(seconds) | batch | in/out=128 | in/out=512 | in/out=1024 | in/out=2048 |
| ----------------- | ----- | ---------- | ---------- | ----------- | ----------- |
| awq_marlin        | 16    | 2.4654     | 10.1091    | 21.3455     | 47.7168     |
|                   | 64    | 4.8633     | 20.8356    | 47.3302     | 170.8086    |
| autoquant         | 16    | 2.3916     | 9.9021     | 21.0006     | 46.9298     |
|                   | 64    | 4.7231     | 20.2468    | 46.0811     | 168.4375    |
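
For convenience, the `quantization_config` switch described above can be applied with a short script. This is a minimal sketch that assumes the quantized checkpoint sits in `/model`, matching the serve and benchmark commands in this card; adjust the path to your layout.

```python
# Minimal sketch: switch the quantization kernel to AutoQuant by editing config.json.
# Assumes the quantized model directory is /model, as in the commands above.
import json
from pathlib import Path

config_path = Path("/model/config.json")
config = json.loads(config_path.read_text())

# Change the quant method from "awq" to "autoquant" so vLLM selects the AutoQuant kernel.
config["quantization_config"]["quant_method"] = "autoquant"

config_path.write_text(json.dumps(config, indent=2))
print(config["quantization_config"])
```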