Project SmolMoE-8x135M: From Zero to a Custom Mixture-of-Experts Model
"Truly, to build one is to understand all." โ An AI Architect's Epiphany
1. Project Philosophy & Goal
This document chronicles the epic journey of creating a novel, one-of-a-kind Mixture-of-Experts (MoE) language model from scratch.
Our goal was not merely to fine-tune an existing model, but to perform a profound "architectural re-creation": to fuse eight independent, domain-specialized 135M small language models into a single, more powerful, and more intelligent unified model with over 1 billion total parameters, all governed by a trainable routing network.
This process serves as a testament to a core learning truth: by creating, we truly understand.
Base Model
- HuggingFaceTB/SmolLM2-135M-Instruct
Training Scripts
- https://huggingface.co/aifeifei798/SmolMoE-8x135M-Instruct-v1-Trained/blob/main/train/train_moe_router.py
- https://huggingface.co/aifeifei798/SmolMoE-8x135M-Instruct-v1-Trained/blob/main/train/test_moe_model.py
- https://huggingface.co/aifeifei798/SmolMoE-8x135M-Instruct-v1-Trained/blob/main/train/chat_moe_model.py
The script comments are provided in both Chinese and English.
2. Prerequisites: Assembling Your Avengers
Before you can assemble your team, you must prepare your heroes.
2.1 Hardware & Software Environment
- GPU: An NVIDIA GPU with at least 8 GB of VRAM (this project was successfully validated on a GeForce RTX 3070 with 8 GB).
- Environment: A configured Python virtual environment (e.g., conda or venv).
- Core Libraries:
pip install torch transformers accelerate bitsandbytes safetensors
2.2 The Expert Models
This is the foundation of the entire project. You must have 8 fine-tuned SmolLM2-135M-Instruct
models, each specialized in a different domain.
The Golden Rule: Maximize the "diversity between experts" while minimizing the "ambiguity within experts."
Directory Structure (Crucial):
~/moe/
├── models/
│   ├── SmolLM2-135M-Instruct-Actor/
│   ├── SmolLM2-135M-Instruct-Analyst/
│   ├── SmolLM2-135M-Instruct-Coder/
│   ├── SmolLM2-135M-Instruct-Encyclopedia/
│   ├── SmolLM2-135M-Instruct-Guardian/
│   ├── SmolLM2-135M-Instruct-Summarizer/
│   ├── SmolLM2-135M-Instruct-Thinker/
│   └── SmolLM2-135M-Instruct-Writer/
├── train_moe_router.py
└── test_moe_model.py
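For orientation, this is roughly how such a layout could be mapped to constants inside a training script. The names EXPERT_DOMAINS and EXPERT_PATHS are illustrative only; train_moe_router.py defines its own.
# (Illustrative mapping of the directory layout above to Python constants)
EXPERT_DOMAINS = ["Actor", "Analyst", "Coder", "Encyclopedia",
                  "Guardian", "Summarizer", "Thinker", "Writer"]
EXPERT_PATHS = [f"./models/SmolLM2-135M-Instruct-{domain}" for domain in EXPERT_DOMAINS]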
3. The Core Workflow: Genesis in Four Stages
Our creation process is divided into four core stages, all orchestrated by the master script train_moe_router.py.
Stage 1: Architectural Surgery
This is the soul of the creative process. We don't build from scratch; we perform an "organ transplant" on a standard Llama model.
- Load the Skeleton: The script first loads one of the eight experts to serve as the "skeleton." It uses its non-expert parts (token embeddings, attention modules, model config) but ignores its "brain" (the FFN/MLP module).
- Create the Slots: The script iterates through the model's 30 transformer layers. In each layer, it replaces the standard LlamaMLP module with our custom-designed MoEModule, which contains a brand-new router and 8 empty expert seats (a minimal sketch of such a module follows this list).
- Organ Transplant: The script efficiently pre-loads all 8 expert models' weights into memory. It then iterates through the 30 layers again, precisely "transplanting" the FFN weights from each expert at a given layer into the corresponding expert seat in the MoEModule.
- Freeze the Experts: Once the surgery is complete, all parameters from the expert models are "frozen" (requires_grad = False). Only the newly created, randomly initialized router parameters remain "trainable."
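To make the surgery concrete, here is a minimal, illustrative sketch of what such an MoEModule could look like: a trainable linear router scores the 8 frozen expert FFNs, routes each token to its top-k choices, and records a load-balancing loss. The names MoEModule, most_recent_lb_loss, and top_k mirror how they are referenced elsewhere in this document; the authoritative implementation lives in train_moe_router.py.
# (Illustrative sketch only -- see train_moe_router.py for the real implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEModule(nn.Module):
    """Drop-in replacement for a LlamaMLP: a trainable router dispatches each
    token to the top-k of 8 frozen, transplanted expert FFNs."""
    def __init__(self, config, expert_mlps, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(expert_mlps)  # the 8 transplanted LlamaMLP modules (frozen)
        self.router = nn.Linear(config.hidden_size, len(expert_mlps), bias=False)  # trainable
        self.most_recent_lb_loss = torch.tensor(0.0)

    def forward(self, hidden_states):
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.view(-1, hidden)              # (tokens, hidden)
        logits = self.router(flat)                         # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)    # renormalize the selected weights

        output = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            chosen = (top_idx == e)                        # (tokens, top_k) boolean mask
            token_ids = chosen.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            weight = (top_p * chosen).sum(dim=-1)[token_ids].unsqueeze(-1)
            output[token_ids] += weight * expert(flat[token_ids])

        # Switch-Transformer-style load-balancing loss:
        # fraction of tokens routed to each expert x mean router probability per expert
        frac_tokens = F.one_hot(top_idx[:, 0], num_classes=len(self.experts)).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        self.most_recent_lb_loss = len(self.experts) * (frac_tokens * mean_probs).sum()
        return output.view(batch, seq_len, hidden)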
Stage 2: Router-Specific Training
This is the process of teaching the model how to "think" and "collaborate."
- The Composite KPI: The training objective is a "composite KPI" composed of two parts (a short code sketch of this objective follows the list):
  - Main Task Loss (Main Loss): Measures the model's accuracy in predicting the next token. This is the "How well is the job done?" metric.
  - Load Balancing Loss (LB Loss): Penalizes the router for unfairly distributing work to only a few experts. This is the "Is the management fair?" metric.
- The Training Loop: The script iterates through a training loop. In each iteration:
  - The model performs a full forward pass, calculating the Main Loss.
  - Simultaneously, we collect the Load Balancing Loss from each MoEModule in every layer.
  - The Total Loss is calculated from these two losses, and backpropagation begins.
  - Because the experts are frozen, the gradients only update the weights of the routers.
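As a minimal sketch of one such iteration (variable names follow the main() function shown in Section 4.3; input_ids, labels, and an optimizer built over the router parameters only are assumed to already exist):
# (Sketch of one optimization step; see the full training loop in Section 4.3)
outputs = moe_model(input_ids=input_ids, labels=labels)
main_loss = outputs.loss                                # next-token cross-entropy ("how well is the job done?")
lb_loss = sum(layer.mlp.most_recent_lb_loss             # one load-balancing loss per MoE layer
              for layer in moe_model.model.layers)      # ("is the management fair?")
total_loss = main_loss + LB_LOSS_COEFFICIENT * lb_loss  # the "composite KPI"
total_loss.backward()                                   # experts are frozen, so gradients
optimizer.step()                                        # only reach the router weights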
Stage 3: Model Solidification (Saving)
After training, our unique model is "solidified" onto the disk.
- Update Config: The script adds our custom MoE parameters (like moe_num_experts) to the model's config.json for future identification.
- Save Files: Using the save_pretrained method, the model's weights, the updated config, and the tokenizer files are all saved to a new directory (e.g., ./SmolMoE-8x135M-Instruct-v1-Trained). A short sketch of this step follows.
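The solidification step boils down to a few lines (the attribute names moe_num_experts and moe_top_k come from this document, and the same code appears inside the full main() function of Section 4.3; moe_model and tokenizer are assumed to already exist):
# (Sketch of Stage 3; mirrors the saving code in Section 4.3)
OUTPUT_MODEL_DIR = "./SmolMoE-8x135M-Instruct-v1-Trained"
moe_model.config.moe_num_experts = 8        # record the MoE shape in config.json
moe_model.config.moe_top_k = 2              # so the custom architecture can be rebuilt later
moe_model.save_pretrained(OUTPUT_MODEL_DIR)
tokenizer.save_pretrained(OUTPUT_MODEL_DIR)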
Stage 4: Validation & Testing (The Mind-Reader)
This is the most exciting stage, where we use test_moe_model.py
to have the first conversation with our creation and peek into its "mind."
- Correct Loading: The test script demonstrates how to properly "resurrect" a model with a custom architecture: first, build the empty skeleton manually, then load the weights.
- Functional Test: You can chat with the model like any other chatbot and observe its generated text.
- Diagnostic Test (The Mind-Reader): Using a powerful PyTorch feature called "Hooks," the script captures the decision-making data from each layer's router in real-time, visualizing it in a clear table without disrupting the model's operation.
Expected Output Example:
================================================================================
ROUTER DECISION ANALYSIS for Prompt: 'Write a Python function...'
================================================================================
Layer | Dominant Expert(s) | Confidence
--------------------------------------------------------------------------------
Layer 0 | 1. Coder | 2. Thinker | (65.2% | 15.1%)
Layer 1 | 1. Coder | 2. Thinker | (71.8% | 11.0%)
...
Layer 29 | 1. Coder | 2. Summarizer | (91.2% | 3.1%)
================================================================================
This table clearly shows us which experts the model's "attention" is flowing to when handling a specific task.
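The "Mind-Reader" above relies on PyTorch forward hooks. Below is a minimal, illustrative sketch of the idea, not the exact code from test_moe_model.py: it assumes an already-loaded moe_model whose layers expose .mlp.router (consistent with the sketch in Stage 1), and a hypothetical EXPERT_NAMES list for pretty-printing.
# (Illustrative hook sketch; test_moe_model.py contains the real "Mind-Reader")
import torch

EXPERT_NAMES = ["Actor", "Analyst", "Coder", "Encyclopedia",
                "Guardian", "Summarizer", "Thinker", "Writer"]
router_probs = {}  # layer index -> mean expert probabilities from the last forward pass

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # `output` holds the router logits; average the softmax over all tokens in the batch
        probs = torch.softmax(output.float(), dim=-1).reshape(-1, output.shape[-1])
        router_probs[layer_idx] = probs.mean(dim=0).detach().cpu()
    return hook

# Attach one hook per layer's router (assumes moe_model is already built and loaded)
handles = [layer.mlp.router.register_forward_hook(make_hook(i))
           for i, layer in enumerate(moe_model.model.layers)]

# ... run a normal forward/generate pass here, then print the decision table ...
for i in sorted(router_probs):
    top_p, top_idx = router_probs[i].topk(2)
    print(f"Layer {i:2d} | 1. {EXPERT_NAMES[top_idx[0].item()]:<13} | 2. {EXPERT_NAMES[top_idx[1].item()]:<13} "
          f"| ({top_p[0].item() * 100:.1f}% | {top_p[1].item() * 100:.1f}%)")

for handle in handles:
    handle.remove()  # hooks don't disrupt the model, but always clean them up when done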
4. The Grand Campaign: Training the Router with Real Data
The previous stages proved our architecture works. We built the car and confirmed the engine turns on. Now, it's time to fuel it with real aviation fuel and teach it how to fly. Training with real, high-quality data is the single most important step to transform our MoE model from a "confused committee" into a "mastermind council."
4.1 The Philosophy: Providing a "Strong Signal"
Our initial training on mock data showed that the Load Balancing Loss worked perfectly, forcing the router to be fair. However, the Main Task Loss was meaningless because the data was random.
By using a diverse, high-quality dataset, the Main Task Loss becomes a powerful teacher. When a coding question is presented, only the Coder
expert can produce an output that results in a low Main Loss. This gives a strong, unambiguous signal to the router: "To succeed, you MUST choose the Coder expert for this task!" This is how the router learns to become an intelligent dispatcher instead of just a fair one.
4.2 Step 1: Data Curation and Preparation
Your mission is to create a single, unified dataset that contains a mix of samples from all your expert domains.
A. Data Sources
Gather data from various sources that align with your experts. Examples from Hugging Face Datasets:
- Coder: codeparrot/github-code-clean (Python subset)
- Writer: cnn_dailymail (articles), Abirate/english_quotes
- Thinker: gsm8k, HuggingFaceH4/logic_in_natural_language
- Encyclopedia: wikipedia (20220301.en subset)
- Summarizer: cnn_dailymail (highlights)
- Analyst: wikisql
- Actor: daily_dialog
- Guardian: Data for safety alignment, like filtered parts of HuggingFaceH4/ultrachat_200k
B. The Unified Format: Instruction Tuning
You must preprocess all data into a consistent instruction-following format. A simple and effective format is a JSON Lines (.jsonl) file, where each line is a JSON object:
{"instruction": "Write a Python function to calculate Fibonacci.", "output": "def fibonacci(n):..."}
{"instruction": "Summarize the following article about photosynthesis.", "input": "Photosynthesis is a process used by plants...", "output": "Photosynthesis is how plants convert light..."}
{"instruction": "Who was the first person on the moon?", "output": "Neil Armstrong was the first person to walk on the moon."}
Create a large file, for example my_moe_dataset.jsonl, with thousands of these samples from all your expert domains.
C. Mix and Shuffle
This is critically important. After gathering and formatting your data, you must thoroughly shuffle the entire dataset. This ensures that during training, the model sees a random mix of tasks, which is essential for forcing the router to learn general-purpose dispatching skills.
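As a minimal sketch of steps A through C together (assuming the Hugging Face datasets library is installed; the two sources and their column names below are just examples drawn from the list above, not a fixed recipe), the preparation could look like this:
# (Sketch: build, mix, and shuffle my_moe_dataset.jsonl from two of the example sources above)
import json
import random
from datasets import load_dataset

samples = []

# Thinker-style samples from gsm8k (question -> answer)
for row in load_dataset("gsm8k", "main", split="train[:2000]"):
    samples.append({"instruction": row["question"], "output": row["answer"]})

# Summarizer-style samples from cnn_dailymail (article -> highlights)
for row in load_dataset("cnn_dailymail", "3.0.0", split="train[:2000]"):
    samples.append({
        "instruction": "Summarize the following article.",
        "input": row["article"],
        "output": row["highlights"],
    })

# The critically important mix-and-shuffle step
random.seed(42)
random.shuffle(samples)

with open("my_moe_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")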
4.3 Step 2: Modifying train_moe_router.py
Now, we will modify our master script to use this real dataset. This involves creating a PyTorch Dataset and DataLoader and updating our training loop.
A. Add the CustomMoEDataset Class
Add this class definition to your train_moe_router.py script, right after the MoE class definitions. This class will handle loading and tokenizing your .jsonl data.
# (Add this class to your train_moe_router.py script)
import json

from torch.utils.data import Dataset, DataLoader


class CustomMoEDataset(Dataset):
    """
    A PyTorch Dataset to handle loading our instruction-formatted JSONL file.
    """
    def __init__(self, file_path, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        # Format the instruction (plus the optional "input" field) and output into a chat template.
        # This is a robust way to prepare the data for an instruction-tuned model.
        user_content = item['instruction']
        if item.get('input'):
            user_content += "\n\n" + item['input']
        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": item['output']}
        ]
        full_text = self.tokenizer.apply_chat_template(messages, tokenize=False)
        # Tokenize the full text
        tokenized_output = self.tokenizer(
            full_text,
            max_length=self.max_length,
            padding="max_length",  # Pad to a fixed length
            truncation=True,
            return_tensors="pt"
        )
        # For Causal LM, the input_ids are also the labels;
        # padding positions are set to -100 so the loss ignores them
        input_ids = tokenized_output.input_ids.squeeze(0)
        labels = input_ids.clone()
        labels[tokenized_output.attention_mask.squeeze(0) == 0] = -100
        return {"input_ids": input_ids, "labels": labels}
You will also need to add import json and from torch.utils.data import Dataset, DataLoader to the top of the script (they are included at the top of the snippet above for convenience).
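As a quick, optional sanity check (a hypothetical snippet, assuming my_moe_dataset.jsonl already exists and using 512 as an example sequence length), you can confirm that the dataset returns tensors of the expected shape before wiring it into the training loop:
# (Optional sanity check for the dataset class; not part of train_moe_router.py)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dataset = CustomMoEDataset("./my_moe_dataset.jsonl", tokenizer, max_length=512)
sample = dataset[0]
print(sample["input_ids"].shape, sample["labels"].shape)  # expect torch.Size([512]) for both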
B. Update the main() Function
Replace the entire main() function in train_moe_router.py with the version below. It removes the mock data and implements the real data loading and training loop.
# (This is the new, complete main() function for real training)
def main():
    # Step 1: Assemble the MoE model
    moe_model = create_moe_model()

    # Step 2: Create the optimizer for the routers (only router params have requires_grad=True)
    optimizer = optim.AdamW([p for p in moe_model.parameters() if p.requires_grad], lr=LEARNING_RATE)

    # --- Step 3: Build the "Fuel Line" - The DataLoader ---
    print("\n--- Preparing Real Dataset for Training ---")
    DATASET_PATH = "./my_moe_dataset.jsonl"  # <-- IMPORTANT: Make sure this file exists!

    # We need the tokenizer from the base model to prepare the data
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # Set a pad token if it doesn't exist
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    dataset = CustomMoEDataset(DATASET_PATH, tokenizer, max_length=SEQUENCE_LENGTH)
    data_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

    # --- Step 4: The Training Loop ---
    print("--- Starting Router Training Loop with Real Data ---")
    moe_model.train()

    # Training is often measured in steps for large datasets, not epochs.
    # Let's train for a fixed number of steps.
    num_training_steps = 5000  # Increase this for a full training run
    step_count = 0
    start_time = time.time()

    # We use a while loop to keep training until we reach the desired number of steps
    while step_count < num_training_steps:
        for batch in data_loader:
            if step_count >= num_training_steps:
                break

            optimizer.zero_grad()

            # Move batch to the correct device
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)

            # --- The Forward Pass ---
            outputs = moe_model(input_ids=input_ids, labels=labels)
            main_loss = outputs.loss

            # Collect the load-balancing loss from every MoE layer
            total_lb_loss = 0.0
            for layer in moe_model.model.layers:
                total_lb_loss += layer.mlp.most_recent_lb_loss

            total_loss = main_loss + LB_LOSS_COEFFICIENT * total_lb_loss

            total_loss.backward()
            optimizer.step()
            step_count += 1

            # Print logs periodically (start_time is reset here so the timing covers 10 steps)
            if step_count % 10 == 0:
                elapsed_time = time.time() - start_time
                print(f"Step [{step_count:04d}/{num_training_steps}] | Total Loss: {total_loss.item():.4f} | "
                      f"Main Loss: {main_loss.item():.4f} | "
                      f"Avg LB Loss: {(total_lb_loss.item() / moe_model.config.num_hidden_layers):.4f} | "
                      f"Time/10 steps: {elapsed_time:.2f}s")
                start_time = time.time()

    print("\n--- Router Training Complete! ---")

    # --- Step 5: Saving the final model ---
    print("\n--- Phase 5: Saving the fully trained MoE model to disk ---")
    OUTPUT_MODEL_DIR = "./SmolMoE-8x135M-Instruct-v1-Trained-RealData"
    if os.path.exists(OUTPUT_MODEL_DIR):
        shutil.rmtree(OUTPUT_MODEL_DIR)
    os.makedirs(OUTPUT_MODEL_DIR)

    print("Updating model config with MoE-specific parameters...")
    moe_model.config.moe_num_experts = NUM_EXPERTS
    moe_model.config.moe_top_k = TOP_K

    print(f"Saving model to '{OUTPUT_MODEL_DIR}'...")
    moe_model.save_pretrained(OUTPUT_MODEL_DIR)

    print("Saving tokenizer...")
    tokenizer.save_pretrained(OUTPUT_MODEL_DIR)

    print("\n--- Model successfully saved! ---")
With these modifications, your project is now equipped for its final, most important phase. You have a clear plan for curating the data and the exact code needed to train your model's intelligence. This is the path from a working prototype to a truly powerful and unique AI.
5. The Journey Ahead: From "Working" to "Great"
We have successfully validated the entire workflow using simulated data. To unlock the model's true potential, the next stage of the journey is clear:
- Switch to Real Aviation Fuel: Completely replace the mock_input_ids in the training script. Your task is to collect, process, and build a high-quality, diverse, and mixed dataset containing real examples from all expert domains.
- Build the Fuel Supply Line: Implement a standard PyTorch Dataset and DataLoader to efficiently feed this real data to the model.
- Begin the Interstellar Expedition: Start a true, long-duration deep training run (for thousands or tens of thousands of steps) and patiently watch the Main Loss consistently decrease.
This is the path from being a "Creator" to being a "Great Creator."