# Qwen3-72B-Synthesis
This still doesn't work; I'm trying to fix it.
A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.
## Model Description
Qwen3-72B-Synthesis is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern Qwen3 architecture while inheriting the vast, high-quality knowledge of the 72B-scale Qwen2.5-Instruct model.
This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using MergeKit. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning.
The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."
## Model Details
- Architecture: Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
- Parameters: ~72 billion
- Layers: 80
- Foundation: `Qwen/Qwen3-32B`
- Donor: `Qwen/Qwen2.5-72B-Instruct`
- Tokenizer: `Qwen/Qwen3-32B` tokenizer (vocab_size: 151936)
## Model Creation Process
The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.
### Phase 1: Foundation Upscaling
First, the `Qwen/Qwen3-32B` model (64 layers, hidden dimension 5120) was up-scaled to match the target 72B dimensions. This was done with a self-interpolation script: new dimensions were created by averaging different slices of the existing weights, rather than by simple tiling. The result was `Qwen3-32B-Upscaled`, a 64-layer model with the correct 72B tensor shapes and the Qwen3 architecture.
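The upscaling script itself is not reproduced here, but the following minimal sketch illustrates the self-interpolation idea: a larger weight matrix is built by averaging pairs of evenly spaced slices of the original tensor instead of tiling it. The target width of 8192 (Qwen2.5-72B's hidden size) and the exact blending scheme are assumptions for illustration only.

```python
import torch

def upscale_matrix(w: torch.Tensor, out_rows: int, out_cols: int) -> torch.Tensor:
    """Grow a 2-D weight tensor by averaging pairs of evenly spaced source slices."""
    def blend_indices(src: int, dst: int):
        # Two sets of evenly spaced source indices; averaging the rows/cols they
        # select interpolates between neighbours rather than tiling blocks.
        lo = torch.linspace(0, src - 1, dst).floor().long()
        hi = torch.linspace(0, src - 1, dst).ceil().long()
        return lo, hi

    lo, hi = blend_indices(w.shape[0], out_rows)
    w = 0.5 * (w[lo, :] + w[hi, :])   # grow the row dimension
    lo, hi = blend_indices(w.shape[1], out_cols)
    w = 0.5 * (w[:, lo] + w[:, hi])   # grow the column dimension
    return w

# Example: grow a 5120x5120 Qwen3-32B projection to the 8192x8192 shape
# expected at the 72B scale (8192 is Qwen2.5-72B's hidden size).
proj = torch.randn(5120, 5120)
print(upscale_matrix(proj, 8192, 8192).shape)  # torch.Size([8192, 8192])
```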
### Phase 2: Donor Alignment
The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. This process, illustrated by the sketch after the list, involved:
- Creating an empty 80-layer model shell with the pure Qwen3 architecture.
- Surgically removing all `.bias` tensors from the Qwen2.5 weights.
- Truncating the Qwen2.5 embedding and language-model-head layers from a vocabulary of 152064 to match Qwen3's 151936.
- Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a perfectly compatible donor model.
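A rough sketch of the bias-removal and vocabulary-truncation steps is shown below. It is illustrative only: the file path is hypothetical, and a real 72B checkpoint is sharded across many safetensors files rather than stored in one.

```python
from safetensors.torch import load_file, save_file

QWEN3_VOCAB = 151936  # Qwen3 tokenizer vocabulary size

# Hypothetical single shard; a real checkpoint would be processed shard by shard.
state_dict = load_file("qwen2.5-72b-instruct/model.safetensors")

aligned = {}
for name, tensor in state_dict.items():
    if name.endswith(".bias"):
        continue  # Qwen3 uses no bias terms, so the donor's biases are discarded
    if name.endswith(("embed_tokens.weight", "lm_head.weight")):
        # Keep only the first 151936 token rows (Qwen2.5 has 152064).
        tensor = tensor[:QWEN3_VOCAB, :].clone()
    aligned[name] = tensor

save_file(aligned, "qwen2.5-72b-instruct-aligned/model.safetensors")
```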
### Phase 3: Knowledge Transplant via MergeKit
With two architecturally compatible models, the final merge was performed using MergeKit. A "Knowledge Bridge" strategy was employed to transplant a stable reasoning core from the donor while blending the rest.

The following MergeKit configuration was used:
```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16
slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.5
  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]
  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.5
tokenizer_source: ./Qwen3-32B-Upscaled
```
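Assuming a standard MergeKit installation, a configuration like this is typically executed with the `mergekit-yaml` command-line entry point, pointing it at the saved config file and an output directory.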
## How to Use
This model uses the standard Qwen ChatML prompt format.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cognitivecomputations/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."},
]

# Render the messages with the tokenizer's ChatML template and append the
# assistant generation prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
)
# Strip the prompt tokens so only the newly generated completion is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Intended Use and Limitations
This is an experimental model and should be considered a high-quality checkpoint, not a finished product.
- Fine-tuning is highly recommended. While it inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock its full potential (see the sketch after this list).
- The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
- This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for the output generated.
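One possible way to run the recommended harmonization fine-tune is a parameter-efficient pass with LoRA via the `peft` library. The sketch below is not part of this model card's recipe: the adapter targets, ranks, and training setup are assumptions, and any high-quality instruction dataset and trainer (e.g. `transformers.Trainer` or TRL's `SFTTrainer`) could be substituted.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the merged checkpoint as the base for a harmonization fine-tune.
model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/Qwen3-72B-Synthesis",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Illustrative LoRA settings (assumptions, not the authors' configuration).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train on a high-quality instruction dataset with your preferred trainer.
```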
## Acknowledgements
This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible MergeKit toolkit created by Charles Goddard and Arcee.ai.