tags:
- unsloth
base_model:
- Qwen/Qwen3-30B-A3B-Base
language:
- eng
- fra
- por
- deu
- ron
- swe
- dan
- bul
- rus
- ces
- ell
- ukr
- spa
- nld
- slk
- hrv
- pol
- lit
- nob
- nno
- fas
- slv
- guj
- lav
- ita
- oci
- nep
- mar
- bel
- srp
- ltz
- vec
- asm
- cym
- szl
- ast
- hne
- awa
- mai
- bho
- snd
- gle
- fao
- hin
- pan
- ben
- ori
- tgk
- ydd
- lmo
- lij
- scn
- fur
- srd
- glg
- cat
- isl
- als
- lim
- prs
- afr
- mkd
- sin
- urd
- mag
- bos
- hye
- zho
- yue
- mya
- ara
- ars
- apc
- arz
- ary
- acm
- acq
- aeb
- heb
- mlt
- ind
- zsm
- tgl
- ceb
- jav
- sun
- min
- ban
- bjn
- pag
- ilo
- war
- tam
- tel
- kan
- mal
- tur
- azj
- uzn
- kaz
- bak
- tat
- tha
- lao
- fin
- est
- hun
- vie
- khm
- jpn
- kor
- kat
- eus
- hat
- pap
- kea
- tpi
- swa
Qwen3-30B-A3B
Qwen3 Highlights
Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Building upon extensive advancements in training data, model architecture, and optimization techniques, Qwen3 delivers the following key improvements over the previously released Qwen2.5:
- Expanded Higher-Quality Pre-training Corpus: Qwen3 is pre-trained on 36 trillion tokens across 119 languages, tripling the language coverage of Qwen2.5, with a much richer mix of high-quality data, including coding, STEM, reasoning, book, multilingual, and synthetic data.
- Training Techniques and Model Architecture: Qwen3 incorporates a series of training techniques and architectural refinements, including global-batch load balancing loss for MoE models and qk layernorm for all models, leading to improved stability and overall performance (a minimal sketch of qk layernorm follows this list).
- Three-stage Pre-training: Stage 1 focuses on broad language modeling and general knowledge acquisition, Stage 2 improves reasoning skills like STEM, coding, and logical reasoning, and Stage 3 enhances long-context comprehension by extending training sequence lengths up to 32k tokens.
- Scaling Law Guided Hyperparameter Tuning: Through comprehensive scaling law studies across the three-stage pre-training pipeline, Qwen3 systematically tunes critical hyperparameters, such as the learning rate scheduler and batch size, separately for dense and MoE models, resulting in better training dynamics and final performance across different model scales.
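The qk layernorm mentioned in the list above is simple to state in code: each query and key head vector is normalized (here with RMSNorm) before attention scores are computed, which keeps the attention logits well scaled during training. The snippet below is a minimal PyTorch sketch of the idea under simplifying assumptions (equal query and key/value head counts, no rotary embeddings); it is not the actual Qwen3 implementation.

```python
# Minimal sketch of qk layernorm inside a self-attention block (illustrative only).
# Requires a recent PyTorch that provides nn.RMSNorm.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, hidden_size=2048, num_heads=16, eps=1e-6):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # qk layernorm: an RMSNorm over each head's dimension, applied to
        # queries and keys before the dot product.
        self.q_norm = nn.RMSNorm(self.head_dim, eps=eps)
        self.k_norm = nn.RMSNorm(self.head_dim, eps=eps)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)  # normalize per-head queries and keys
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
        )
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# Example: QKNormAttention()(torch.randn(1, 8, 2048)).shape -> torch.Size([1, 8, 2048])
```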
Model Overview
Qwen3-30B-A3B has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 30.5B in total and 3.3B activated
- Number of Parameters (Non-Embedding): 29.9B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 32,768
For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog, GitHub, and Documentation.
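For a quick sanity check of the specification above, the architecture numbers can be read straight from the model config. The snippet below is a small sketch: the repo id comes from the base_model field in this card's metadata, and the attribute names follow common transformers conventions rather than anything stated in this card, hence the guarded getattr.

```python
# Sketch: read the architecture numbers listed above from the model config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-30B-A3B-Base")
for name in (
    "num_hidden_layers",        # expected 48
    "num_attention_heads",      # expected 32 (query heads, GQA)
    "num_key_value_heads",      # expected 4 (key/value heads, GQA)
    "num_experts",              # expected 128
    "num_experts_per_tok",      # expected 8 activated experts
    "max_position_embeddings",  # expected 32,768 context length
):
    print(name, getattr(config, name, "(not present in this config)"))
```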
Requirements
The code for Qwen3-MoE is included in the latest Hugging Face `transformers`, and we advise you to use the latest version of `transformers`. With `transformers<4.51.0`, you will encounter the following error:
KeyError: 'qwen3_moe'
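Once a recent enough `transformers` is installed, the checkpoint loads like any other causal language model. The snippet below is a minimal sketch rather than an official quickstart: the repo id is taken from the base_model field in the metadata above, and the prompt and generation settings are arbitrary.

```python
# Minimal sketch: load the checkpoint and run a short text completion.
# Requires transformers>=4.51.0 (older releases raise KeyError: 'qwen3_moe').
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id taken from the base_model field in the metadata above; substitute
# the id of this repository if you are loading these weights instead.
model_id = "Qwen/Qwen3-30B-A3B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's native dtype
    device_map="auto",   # spread the MoE weights across available devices
)

prompt = "The key advantages of mixture-of-experts language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```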
Evaluation & Performance
Detailed evaluation results are reported in this blog.
Citation
If you find our work helpful, feel free to cite us.
@misc{qwen3,
title = {Qwen3},
url = {https://qwenlm.github.io/blog/qwen3/},
author = {Qwen Team},
month = {April},
year = {2025}
}