Intro

This AWQ version is quantized using ms-swift. You may refer to our best practices for training and fine-tuning Qwen3 models here.

Note that the AWQ versions of the Qwen3-MoE models are verified to work with Transformers and vLLM; we have not yet had the chance to test them on other engines. A vLLM usage sketch is included at the end of the Inference section below.

Inference

import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Quantization

The model has undergone AWQ int4 quantization using the ms-swift framework. Since the model is based on the MoE (Mixture of Experts) architecture, all linear layers except for gate and lm_head have been quantized.
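
One way to confirm which modules were left unquantized is to inspect the quantization config shipped with the checkpoint. The sketch below assumes the standard Transformers AWQ config layout (for example, a modules_to_not_convert entry), which may differ for this particular checkpoint.

# Inspect the AWQ quantization settings stored in the checkpoint's config.json.
# Field names follow the standard Transformers AWQ config layout (an assumption here).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("swift/Qwen3-30B-A3B-AWQ")
print(config.quantization_config)  # bits, group size, and any modules kept in full precision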

If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the following quantization scripts:
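
The original scripts are not reproduced here; as a rough sketch, an ms-swift AWQ export for a (possibly fine-tuned) Qwen3-30B-A3B checkpoint might look like the following. The flags follow the swift export CLI but should be checked against the ms-swift documentation; the calibration dataset and output path are placeholders.

# Sketch of an ms-swift AWQ int4 export; adjust the model path, dataset, and output directory.
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --quant_method awq \
    --quant_bits 4 \
    --dataset <calibration-dataset> \
    --output_dir Qwen3-30B-A3B-AWQ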

With a script like this, you can easily complete the quantization process for the model.

Evaluation

We evaluate the quality of this AWQ quantization with EvalScope. For best practices on evaluating Qwen3 models, refer to the EvalScope documentation; a minimal usage sketch follows.
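
The sketch below shows a small EvalScope run against this checkpoint. The dataset choice and sample limit are placeholders for a quick smoke test, not the configuration behind the results reported below.

# A toy EvalScope run (assumes `pip install evalscope`); dataset and limit are placeholders.
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["gsm8k"],  # placeholder benchmark; the card evaluates a mixed Qwen3 collection
    limit=10,            # evaluate only a few samples as a quick sanity check
)
run_task(task_cfg)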

Performance of Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark, the Qwen3 Evaluation Collection, with the results listed below:

The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B

| task_type | dataset_name | metric | average_score (AWQ) | average_score (without AWQ) | count |
|---|---|---|---|---|---|
| exam | MMLU-Pro | AverageAccuracy | 0.7655 | 0.7828 | 12032 |
| exam | MMLU-Redux | AverageAccuracy | 0.8746 | 0.8872 | 5700 |
| exam | C-Eval | AverageAccuracy | 0.844 | 0.8722 | 1346 |
| instruction | IFEval | inst_level_strict_acc | 0.8891 | 0.8925 | 541 |
| instruction | IFEval | inst_level_loose_acc | 0.9107 | 0.9174 | 541 |
| instruction | IFEval | prompt_level_loose_acc | 0.8651 | 0.8651 | 541 |
| instruction | IFEval | prompt_level_strict_acc | 0.8373 | 0.8318 | 541 |
| math | MATH-500 | AveragePass@1 | 0.944 | 0.938 | 500 |
| knowledge | GPQA | AveragePass@1 | 0.596 | 0.601 | 198 |
| code | LiveCodeBench | Pass@1 | 0.5275 | 0.5549 | 182 |
| exam | iQuiz | AverageAccuracy | 0.6917 | 0.7417 | 120 |
| math | AIME 2024 | AveragePass@1 | 0.7333 | 0.8333 | 30 |
| math | AIME 2025 | AveragePass@1 | 0.7 | 0.7333 | 30 |

NOTE: For pass@k metrics, considering the time cost of evaluation, we uniformly limit the number of generated responses per prompt to 1 (i.e., we report pass@1).

Conclusion

As the comparison above shows, evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits only minor fluctuations in model performance.

For most benchmarks, the AWQ version performs on par with the original model; the exceptions are benchmarks such as AIME 2024 and iQuiz, where the performance degradation is relatively noticeable.
