Intro
The AWQ version is quantized using ms-swift. You may refer to our best practices for training and fine-tuning Qwen3 models here.
Note that the AWQ versions of the Qwen3-MoE models are verified to work with Transformers and vLLM. We have not had the chance to test them on other engines.
Inference
import torch
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "swift/Qwen3-30B-A3B-AWQ"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
Quantization
The model has undergone AWQ int4 quantization using the ms-swift framework. Since the model is based on the MoE (Mixture of Experts) architecture, all linear layers except for gate and lm_head have been quantized.
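If you want to double-check how the checkpoint was quantized, one quick sanity check is to inspect the quantization_config stored in its config.json. This is only an illustrative check, not part of the original workflow, and the exact keys depend on how the export was performed.

from modelscope import AutoConfig

config = AutoConfig.from_pretrained("swift/Qwen3-30B-A3B-AWQ")
# Typically reports quant_method "awq" and 4 bits; depending on the exporter it may
# also list the modules that were left unconverted (e.g. gate / lm_head).
print(config.quantization_config)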
If you have fine-tuned the model and wish to quantize the fine-tuned version, you can refer to the quantization scripts provided with ms-swift.
With these scripts, you can easily complete the quantization process for the model.
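As a rough sketch only, an AWQ export with the ms-swift CLI might look like the command below. The flag names, the placeholder checkpoint path, and the calibration dataset are assumptions and may differ across ms-swift versions, so check swift export --help and the ms-swift documentation before running.

swift export \
    --model /path/to/your-finetuned-qwen3-30b-a3b \
    --quant_method awq \
    --quant_bits 4 \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-en#512' \
    --output_dir ./qwen3-30b-a3b-awq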
Evaluation
We evaluate the quality of this AWQ quantization with EvalScope. For best practices on evaluating Qwen3 models, one may refer to the EvalScope documentation.
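As an illustration of how such an evaluation can be launched, a minimal EvalScope run might look like the sketch below. The dataset choice and sample limit are placeholders, not the actual benchmark collection used for the table that follows; consult the EvalScope documentation for the exact setup.

from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model="swift/Qwen3-30B-A3B-AWQ",
    datasets=["gsm8k"],  # placeholder benchmark; swap in the datasets you need
    limit=100,           # cap the number of samples per dataset to keep the run short
)
run_task(task_cfg=task_cfg)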
The performance of Qwen3-30B-A3B-AWQ is evaluated on our mixed benchmark from the Qwen3 Evaluation Collection, with the results listed below:
The performance comparison of Qwen3-30B-A3B-AWQ and Qwen3-30B-A3B
task_type | dataset_name | metric | average_score(AWQ) | average_score(without AWQ) | count |
---|---|---|---|---|---|
exam | MMLU-Pro | AverageAccuracy | 0.7655 | 0.7828 | 12032 |
exam | MMLU-Redux | AverageAccuracy | 0.8746 | 0.8872 | 5700 |
exam | C-Eval | AverageAccuracy | 0.844 | 0.8722 | 1346 |
instruction | IFEval | inst_level_strict_acc | 0.8891 | 0.8925 | 541 |
instruction | IFEval | inst_level_loose_acc | 0.9107 | 0.9174 | 541 |
instruction | IFEval | prompt_level_loose_acc | 0.8651 | 0.8651 | 541 |
instruction | IFEval | prompt_level_strict_acc | 0.8373 | 0.8318 | 541 |
math | MATH-500 | AveragePass@1 | 0.944 | 0.938 | 500 |
knowledge | GPQA | AveragePass@1 | 0.596 | 0.601 | 198 |
code | LiveCodeBench | Pass@1 | 0.5275 | 0.5549 | 182 |
exam | iQuiz | AverageAccuracy | 0.6917 | 0.7417 | 120 |
math | AIME 2024 | AveragePass@1 | 0.7333 | 0.8333 | 30 |
math | AIME 2025 | AveragePass@1 | 0.7 | 0.7333 | 30 |
NOTE: For pass@k metrics, considering the time cost of evaluation, we uniformly limit the number of generated responses per query to 1.
Conclusion
As the comparison above shows, evaluation results across different tasks and datasets suggest that the AWQ-quantized version exhibits minimal fluctuation in model performance.
In fact, on most benchmarks the AWQ version performs roughly on par with the original, except for a few (such as AIME 2024 and iQuiz) where the performance degradation is relatively noticeable.