Intro
The AWQ version is quantized using ms-swift. You may refer to our best practices for training and fine-tuning Qwen3 models here.
Note that the AWQ versions of the Qwen3-MoE models are verified to work with Transformers and vLLM. We have not had the chance to test them on other engines.
Inference
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
Quantization
The model has undergone AWQ int4 quantization using the ms-swift framework. Since the model is based on the MoE (Mixture of Experts) architecture, all linear layers except for gate and lm_head have been quantized.
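If you want to double-check how the checkpoint was quantized, one quick sanity check is to inspect the quantization_config stored in its config.json. This is only an illustrative check, not part of the original workflow, and the exact keys depend on how the export was performed.

from modelscope import AutoConfig

config = AutoConfig.from_pretrained("swift/Qwen3-30B-A3B-AWQ")
# Typically reports quant_method "awq" and 4 bits; depending on the exporter it may
# also list the modules that were left unconverted (e.g. gate / lm_head).
print(config.quantization_config)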
If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the quantization scripts provided with ms-swift.
With these scripts, you can easily complete the quantization process for the model.
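As a rough sketch only, an AWQ export with the ms-swift CLI might look like the command below. The flag names, the placeholder checkpoint path, and the calibration dataset are assumptions and may differ across ms-swift versions, so check swift export --help and the ms-swift documentation before running.

swift export \
    --model /path/to/your-finetuned-qwen3-30b-a3b \
    --quant_method awq \
    --quant_bits 4 \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-en#512' \
    --output_dir ./qwen3-30b-a3b-awq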
Evaluation
We evaluate the quality of this AWQ quantization with EvalScope. For best practices on evaluating Qwen3 models, one may refer to the EvalScope documentation.
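As an illustration of how such an evaluation can be launched, a minimal EvalScope run might look like the sketch below. The dataset choice and sample limit are placeholders, not the actual benchmark collection used for the table that follows; consult the EvalScope documentation for the exact setup.

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["gsm8k"],  # placeholder benchmark; swap in the datasets you need
    limit=100,           # cap the number of samples per dataset to keep the run short
)
run_task(task_cfg=task_cfg)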
The performance of Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark from the Qwen3 Evaluation Collection, with the results listed below:
The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B
task_type | dataset_name | metric | average_score(AWQ) | average_score(without AWQ) | count |
---|---|---|---|---|---|
exam | MMLU-Pro | AverageAccuracy | 0.7655 | 0.7828 | 12032 |
exam | MMLU-Redux | AverageAccuracy | 0.8746 | 0.8872 | 5700 |
exam | C-Eval | AverageAccuracy | 0.844 | 0.8722 | 1346 |
instruction | IFEval | inst_level_strict_acc | 0.8891 | 0.8925 | 541 |
instruction | IFEval | inst_level_loose_acc | 0.9107 | 0.9174 | 541 |
instruction | IFEval | prompt_level_loose_acc | 0.8651 | 0.8651 | 541 |
instruction | IFEval | prompt_level_strict_acc | 0.8373 | 0.8318 | 541 |
math | MATH-500 | AveragePass@1 | 0.944 | 0.938 | 500 |
knowledge | GPQA | AveragePass@1 | 0.596 | 0.601 | 198 |
code | LiveCodeBench | Pass@1 | 0.5275 | 0.5549 | 182 |
exam | iQuiz | AverageAccuracy | 0.6917 | 0.7417 | 120 |
math | AIME 2024 | AveragePass@1 | 0.7333 | 0.8333 | 30 |
math | AIME 2025 | AveragePass@1 | 0.7 | 0.7333 | 30 |
NOTE: For pass@k metrics, considering the time cost of evaluation, we uniformly limit the number of generated responses per query to 1.
Conclusion
As the comparison above shows, evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance.
In fact, on most benchmarks the AWQ version performs roughly on par with the original, except for a few (such as AIME 2024 and iQuiz) where the performance degradation is relatively noticeable.