# An SVD-Based Distillation of Qwen3-Coder-480B for Better Code Generation
## Model Description
This model is a distilled version of Qwen/Qwen3-Coder-30B-A3B-Instruct, designed to achieve coding and reasoning capabilities approaching those of a much larger teacher model. It is the result of applying a LoRA produced by an SVD-based distillation pipeline and then merging those weights into the base model. The core of the process was to transfer the nuanced knowledge of a 62-layer, 160-expert teacher into the more efficient 48-layer, 128-expert architecture of the Qwen3-Coder-30B-A3B student model.
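Because the layer counts differ, each student layer has to be matched to a (generally fractional) position in the teacher stack. The exact schedule is not documented here; the sketch below assumes a simple uniform spacing and shows how a student layer index would land between two neighbouring teacher layers.

```python
# Hypothetical mapping of student layer indices onto the teacher's layer stack,
# assuming uniform spacing. The real script may use a different schedule.
NUM_STUDENT_LAYERS = 48
NUM_TEACHER_LAYERS = 62

def teacher_position(student_idx: int) -> tuple[int, int, float]:
    """Return the two neighbouring teacher layers and the blend factor for a student layer."""
    pos = student_idx * (NUM_TEACHER_LAYERS - 1) / (NUM_STUDENT_LAYERS - 1)
    lower = int(pos)
    upper = min(lower + 1, NUM_TEACHER_LAYERS - 1)
    return lower, upper, pos - lower

# Example: student layer 10 sits between teacher layers 12 and 13 (blend factor ~0.98).
print(teacher_position(10))
```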
The primary goal was to significantly enhance performance on complex coding tasks, where the specialized knowledge of Mixture-of-Experts (MoE) layers is critical.
## The Distillation Methodology
This model was not trained in the conventional sense. Instead, it was created using a layer-by-layer distillation process implemented in an SVD-based distillation script. The pipeline was designed to ensure maximum precision and knowledge transfer.
### Core Components
- Teacher Model: `Qwen/Qwen3-Coder-480B-A35B-Instruct`
- Student Model: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
- LoRA Rank: A high rank of `r=2048` was used for all modules to capture a very high degree of information from the teacher (see the configuration sketch below).
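For reference, an adapter of this rank over the usual projection modules could be declared with `peft` roughly as follows. This is a sketch only: the adapter here was constructed directly from SVD-projected teacher weights rather than trained, and the target-module list and alpha value are assumptions, not values read from the released adapter.

```python
from peft import LoraConfig

# Illustrative config matching the shape of a rank-2048 adapter.
# target_modules and lora_alpha are assumptions.
lora_config = LoraConfig(
    r=2048,                                      # very high rank to retain teacher detail
    lora_alpha=2048,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP / expert projections
    ],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)
```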
### The Distillation Pipeline
For each student layer and its corresponding teacher layer(s), the following pipeline was executed (a code sketch follows the list):

1. Spherical Linear Interpolation (SLERP): For student layers whose position falls between two teacher layers, SLERP was used to create a smooth, geometrically sound interpolation of the teacher's weights. This avoids the pitfalls of simple linear averaging.
2. Singular Value Decomposition (SVD) Projection: The core of the distillation. The (potentially blended) teacher layer's weight matrix was decomposed into its fundamental components (U, S, V). The 2048 components with the largest singular values were retained and reconstructed to fit the student layer's smaller dimensions. This high-rank projection ensures maximum fidelity.
3. Procrustes Analysis: After projection, the newly created "synthetic" tensor was optimally rotated in high-dimensional space to align with the student's original pre-trained tensor. This minimizes the "distance" between them before the difference is calculated.
4. DARE (Drop and Rescale): The difference tensor (distilled minus aligned student) was then purified using DARE. This step drops a large fraction of the lowest-magnitude values (noise) and rescales the remaining important differences, producing a clean signal for the final LoRA.
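To make the four steps concrete, here is a minimal PyTorch sketch of the per-layer pipeline. It is not the released script: the DARE drop rate, the naive dimension cropping in the SVD projection, and the assumption that the purified delta is then factorised into rank-2048 LoRA matrices are illustrative choices.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two same-shaped teacher weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos = torch.clamp((a / (a.norm() + eps)) @ (b / (b.norm() + eps)), -1 + eps, 1 - eps)
    omega = torch.arccos(cos)
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.reshape(w_a.shape)

def svd_project(teacher_w: torch.Tensor, out_dim: int, in_dim: int, rank: int = 2048) -> torch.Tensor:
    """Rebuild the teacher matrix from its top-`rank` singular components at the student's shape."""
    U, S, Vh = torch.linalg.svd(teacher_w.float(), full_matrices=False)
    k = min(rank, S.numel())
    # Naive crop to the student dimensions; the real pipeline may resample instead.
    return U[:out_dim, :k] @ torch.diag(S[:k]) @ Vh[:k, :in_dim]

def procrustes_align(synthetic: torch.Tensor, student_w: torch.Tensor) -> torch.Tensor:
    """Orthogonal Procrustes: rotate the synthetic tensor to best match the student tensor."""
    U, _, Vh = torch.linalg.svd(synthetic.float().T @ student_w.float(), full_matrices=False)
    return synthetic.float() @ (U @ Vh)

def dare(delta: torch.Tensor, drop_rate: float = 0.9) -> torch.Tensor:
    """Drop the lowest-magnitude entries of the difference and rescale the survivors."""
    k = max(1, int(delta.numel() * drop_rate))
    threshold = delta.abs().flatten().kthvalue(k).values
    mask = delta.abs() > threshold
    return delta * mask / (1.0 - drop_rate)

def distill_layer(teacher_lo, teacher_hi, blend, student_w, rank=2048):
    """SLERP -> SVD projection -> Procrustes -> DARE for one weight matrix."""
    blended = slerp(teacher_lo, teacher_hi, blend)
    synthetic = svd_project(blended, *student_w.shape, rank=rank)
    aligned = procrustes_align(synthetic, student_w)
    delta = dare(aligned - student_w.float())
    # Assumption: the delta is then factorised by truncated SVD into LoRA A/B matrices
    # of rank 2048, which are later merged back into the base weights.
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    k = min(rank, S.numel())
    lora_B = U[:, :k] * S[:k]   # (out_dim, k)
    lora_A = Vh[:k, :]          # (k, in_dim)
    return lora_A, lora_B
```

At real model sizes, rank 2048 and a full `kthvalue` over the flattened tensor are expensive; the sketch is meant to show the order of operations, not an optimised implementation.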
### Mixture-of-Experts (MoE) Distillation
The standout feature of this process is the full distillation of the MoE layers, which are critical for complex reasoning.
- Expert Fingerprinting & Clustering: To map the 160 teacher experts to the 128 student experts, each teacher expert was "fingerprinted." K-Means clustering was then used to group these 160 fingerprints into 128 distinct clusters (see the sketch after this list).
- Expert-to-Expert Distillation: Each of the student's 128 experts was then distilled from a weighted blend of the teacher experts assigned to its cluster. This ensures the specialized knowledge (e.g., recursion, API usage, security patterns) is transferred.
- Router Gate Distillation: The main MoE router gate, which decides which expert to use for a given token, was also distilled to preserve the teacher's intelligent routing logic.
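The sketch below illustrates the fingerprinting and clustering step with scikit-learn's KMeans. The fingerprint definition (per-row norms of each expert's projection matrices), the inverse-distance blend weights, and the placeholder data are assumptions for illustration, not the documented method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder stand-ins for one MoE layer's 160 teacher experts (gate/up/down projections).
teacher_experts = [[rng.standard_normal((512, 256)) for _ in range(3)] for _ in range(160)]

def expert_fingerprint(matrices: list[np.ndarray]) -> np.ndarray:
    """Cheap fingerprint of one expert: concatenated per-row norms of its matrices
    (an assumed heuristic, not the documented one)."""
    return np.concatenate([np.linalg.norm(m, axis=1) for m in matrices])

prints = np.stack([expert_fingerprint(e) for e in teacher_experts])       # (160, features)
kmeans = KMeans(n_clusters=128, n_init=10, random_state=0).fit(prints)    # 160 -> 128 clusters

# Each student expert is distilled from a weighted blend of its cluster's teacher experts;
# here the weights fall off with distance from the cluster centroid.
for student_idx in range(128):
    members = np.where(kmeans.labels_ == student_idx)[0]
    dists = np.linalg.norm(prints[members] - kmeans.cluster_centers_[student_idx], axis=1)
    weights = 1.0 / (dists + 1e-6)
    weights /= weights.sum()
    blended = [sum(w * teacher_experts[i][m] for w, i in zip(weights, members))
               for m in range(3)]
    # `blended` then goes through the same SVD -> Procrustes -> DARE pipeline against the
    # corresponding student expert; the 128-row router gate is rebuilt analogously from
    # the gate rows of each cluster's members.
```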
## Intended Use
This model is intended for code generation. It should perform better than the base model on tasks that require understanding complex logic, algorithms, and software architecture.
- Primary Use: Code generation, refactoring, code explanation (though, as an instruct-tuned coder, it may not explain things perfectly), and debugging (a usage example follows this list).
- Out of Scope: This is not a general-purpose conversational chatbot. While it can follow instructions, its knowledge is specialized for programming tasks.
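A minimal example of prompting the merged model for code generation with `transformers`; the prompt and generation settings are illustrative, and loading the 30B MoE model in bf16 needs roughly 60 GB of memory (or a quantized variant).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BasedBase/Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user",
             "content": "Write a Python function that topologically sorts a DAG given as an adjacency list."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```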