Project SmolMoE-8x135M: From Zero to a Custom Mixture-of-Experts Model
"Truly, to build one is to understand all." โ An AI Architect's Epiphany
1. Project Philosophy & Goal
This document chronicles the epic journey of creating a novel, one-of-a-kind Mixture-of-Experts (MoE) language model from scratch.
Our goal was not merely to fine-tune an existing model, but to perform a profound "architectural re-creation": to fuse eight independent, domain-specialized 135M small language models into a single, more powerful, and more intelligent unified model with over 1 billion total parameters, all governed by a trainable routing network.
This process serves as a testament to a core learning truth: by creating, we truly understand.
Base Model
- HuggingFaceTB/SmolLM2-135M-Instruct
Training Scripts
- https://huggingface.co/aifeifei798/SmolMoE-8x135M-Instruct-v1-Trained/blob/main/train/train_moe_router.py
- https://huggingface.co/aifeifei798/SmolMoE-8x135M-Instruct-v1-Trained/blob/main/train/test_moe_model.py
- https://huggingface.co/aifeifei798/SmolMoE-8x135M-Instruct-v1-Trained/blob/main/train/chat_moe_model.py
The script comments are provided in both Chinese and English.
2. Prerequisites: Assembling Your Avengers
Before you can assemble your team, you must prepare your heroes.
2.1 Hardware & Software Environment
- GPU: An NVIDIA GPU with at least 8 GB of VRAM (this project was successfully validated on a GeForce RTX 3070 with 8 GB).
- Environment: A configured Python virtual environment (e.g., conda or venv).
- Core Libraries:
pip install torch transformers accelerate bitsandbytes safetensors
2.2 The Expert Models
This is the foundation of the entire project. You must have 8 fine-tuned SmolLM2-135M-Instruct
models, each specialized in a different domain.
The Golden Rule: Maximize the "diversity between experts" while minimizing the "ambiguity within experts."
Directory Structure (Crucial):
~/moe/
├── models/
│   ├── SmolLM2-135M-Instruct-Actor/
│   ├── SmolLM2-135M-Instruct-Analyst/
│   ├── SmolLM2-135M-Instruct-Coder/
│   ├── SmolLM2-135M-Instruct-Encyclopedia/
│   ├── SmolLM2-135M-Instruct-Guardian/
│   ├── SmolLM2-135M-Instruct-Summarizer/
│   ├── SmolLM2-135M-Instruct-Thinker/
│   └── SmolLM2-135M-Instruct-Writer/
├── train_moe_router.py
└── test_moe_model.py
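For orientation, this is roughly how such a layout could be mapped to constants inside a training script. The names EXPERT_DOMAINS and EXPERT_PATHS are illustrative only; train_moe_router.py defines its own.
# (Illustrative mapping of the directory layout above to Python constants)
EXPERT_DOMAINS = ["Actor", "Analyst", "Coder", "Encyclopedia",
                  "Guardian", "Summarizer", "Thinker", "Writer"]
EXPERT_PATHS = [f"./models/SmolLM2-135M-Instruct-{domain}" for domain in EXPERT_DOMAINS]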
3. The Core Workflow: Genesis in Four Stages
Our creation process is divided into four core stages, all orchestrated by the master script train_moe_router.py.
Stage 1: Architectural Surgery
This is the soul of the creative process. We don't build from scratch; we perform an "organ transplant" on a standard Llama model.
- Load the Skeleton: The script first loads one of the eight experts to serve as the "skeleton." It uses its non-expert parts (token embeddings, attention modules, model config) but ignores its "brain" (the FFN/MLP module).
- Create the Slots: The script iterates through the model's 30 transformer layers. In each layer, it replaces the standard LlamaMLP module with our custom-designed MoEModule, which contains a brand-new router and 8 empty expert seats (a minimal sketch of such a module follows this list).
- Organ Transplant: The script efficiently pre-loads all 8 expert models' weights into memory. It then iterates through the 30 layers again, precisely "transplanting" the FFN weights from each expert at a given layer into the corresponding expert seat in the MoEModule.
- Freeze the Experts: Once the surgery is complete, all parameters from the expert models are "frozen" (requires_grad = False). Only the newly created, randomly initialized router parameters remain "trainable."
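To make the surgery concrete, here is a minimal, illustrative sketch of what such an MoEModule could look like: a trainable linear router scores the 8 frozen expert FFNs, routes each token to its top-k choices, and records a load-balancing loss. The names MoEModule, most_recent_lb_loss, and top_k mirror how they are referenced elsewhere in this document; the authoritative implementation lives in train_moe_router.py.
# (Illustrative sketch only -- see train_moe_router.py for the real implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEModule(nn.Module):
    """Drop-in replacement for a LlamaMLP: a trainable router dispatches each
    token to the top-k of 8 frozen, transplanted expert FFNs."""
    def __init__(self, config, expert_mlps, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.experts = nn.ModuleList(expert_mlps)  # the 8 transplanted LlamaMLP modules (frozen)
        self.router = nn.Linear(config.hidden_size, len(expert_mlps), bias=False)  # trainable
        self.most_recent_lb_loss = torch.tensor(0.0)

    def forward(self, hidden_states):
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.view(-1, hidden)              # (tokens, hidden)
        logits = self.router(flat)                         # (tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)    # renormalize the selected weights

        output = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            chosen = (top_idx == e)                        # (tokens, top_k) boolean mask
            token_ids = chosen.any(dim=-1).nonzero(as_tuple=True)[0]
            if token_ids.numel() == 0:
                continue
            weight = (top_p * chosen).sum(dim=-1)[token_ids].unsqueeze(-1)
            output[token_ids] += weight * expert(flat[token_ids])

        # Switch-Transformer-style load-balancing loss:
        # fraction of tokens routed to each expert x mean router probability per expert
        frac_tokens = F.one_hot(top_idx[:, 0], num_classes=len(self.experts)).float().mean(dim=0)
        mean_probs = probs.mean(dim=0)
        self.most_recent_lb_loss = len(self.experts) * (frac_tokens * mean_probs).sum()
        return output.view(batch, seq_len, hidden)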
Stage 2: Router-Specific Training
This is the process of teaching the model how to "think" and "collaborate."
- The Composite KPI: The training objective is a "composite KPI" composed of two parts (a short code sketch of this objective follows the list):
  - Main Task Loss (Main Loss): Measures the model's accuracy in predicting the next token. This is the "How well is the job done?" metric.
  - Load Balancing Loss (LB Loss): Penalizes the router for unfairly distributing work to only a few experts. This is the "Is the management fair?" metric.
- The Training Loop: The script iterates through a training loop. In each iteration:
  - The model performs a full forward pass, calculating the Main Loss.
  - Simultaneously, we collect the Load Balancing Loss from each MoEModule in every layer.
  - The Total Loss is calculated from these two losses, and backpropagation begins.
  - Because the experts are frozen, the gradients only update the weights of the routers.
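As a minimal sketch of one such iteration (variable names follow the main() function shown in Section 4.3; input_ids, labels, and an optimizer built over the router parameters only are assumed to already exist):
# (Sketch of one optimization step; see the full training loop in Section 4.3)
outputs = moe_model(input_ids=input_ids, labels=labels)
main_loss = outputs.loss                                # next-token cross-entropy ("how well is the job done?")
lb_loss = sum(layer.mlp.most_recent_lb_loss             # one load-balancing loss per MoE layer
              for layer in moe_model.model.layers)      # ("is the management fair?")
total_loss = main_loss + LB_LOSS_COEFFICIENT * lb_loss  # the "composite KPI"
total_loss.backward()                                   # experts are frozen, so gradients
optimizer.step()                                        # only reach the router weights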
Stage 3: Model Solidification (Saving)
After training, our unique model is "solidified" onto the disk.
- Update Config: The script adds our custom MoE parameters (like moe_num_experts) to the model's config.json for future identification.
- Save Files: Using the save_pretrained method, the model's weights, the updated config, and the tokenizer files are all saved to a new directory (e.g., ./SmolMoE-8x135M-Instruct-v1-Trained). A short sketch of this step follows.
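The solidification step boils down to a few lines (the attribute names moe_num_experts and moe_top_k come from this document, and the same code appears inside the full main() function of Section 4.3; moe_model and tokenizer are assumed to already exist):
# (Sketch of Stage 3; mirrors the saving code in Section 4.3)
OUTPUT_MODEL_DIR = "./SmolMoE-8x135M-Instruct-v1-Trained"
moe_model.config.moe_num_experts = 8        # record the MoE shape in config.json
moe_model.config.moe_top_k = 2              # so the custom architecture can be rebuilt later
moe_model.save_pretrained(OUTPUT_MODEL_DIR)
tokenizer.save_pretrained(OUTPUT_MODEL_DIR)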
Stage 4: Validation & Testing (The Mind-Reader)
This is the most exciting stage, where we use test_moe_model.py
to have the first conversation with our creation and peek into its "mind."
- Correct Loading: The test script demonstrates how to properly "resurrect" a model with a custom architecture: first, build the empty skeleton manually, then load the weights.
- Functional Test: You can chat with the model like any other chatbot and observe its generated text.
- Diagnostic Test (The Mind-Reader): Using a powerful PyTorch feature called "Hooks," the script captures the decision-making data from each layer's router in real-time, visualizing it in a clear table without disrupting the model's operation.
Expected Output Example:
================================================================================
ROUTER DECISION ANALYSIS for Prompt: 'Write a Python function...'
================================================================================
Layer | Dominant Expert(s) | Confidence
--------------------------------------------------------------------------------
Layer 0 | 1. Coder | 2. Thinker | (65.2% | 15.1%)
Layer 1 | 1. Coder | 2. Thinker | (71.8% | 11.0%)
...
Layer 29 | 1. Coder | 2. Summarizer | (91.2% | 3.1%)
================================================================================
This table clearly shows us which experts the model's "attention" is flowing to when handling a specific task.
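The "Mind-Reader" above relies on PyTorch forward hooks. Below is a minimal, illustrative sketch of the idea, not the exact code from test_moe_model.py: it assumes an already-loaded moe_model whose layers expose .mlp.router (consistent with the sketch in Stage 1), and a hypothetical EXPERT_NAMES list for pretty-printing.
# (Illustrative hook sketch; test_moe_model.py contains the real "Mind-Reader")
import torch

EXPERT_NAMES = ["Actor", "Analyst", "Coder", "Encyclopedia",
                "Guardian", "Summarizer", "Thinker", "Writer"]
router_probs = {}  # layer index -> mean expert probabilities from the last forward pass

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # `output` holds the router logits; average the softmax over all tokens in the batch
        probs = torch.softmax(output.float(), dim=-1).reshape(-1, output.shape[-1])
        router_probs[layer_idx] = probs.mean(dim=0).detach().cpu()
    return hook

# Attach one hook per layer's router (assumes moe_model is already built and loaded)
handles = [layer.mlp.router.register_forward_hook(make_hook(i))
           for i, layer in enumerate(moe_model.model.layers)]

# ... run a normal forward/generate pass here, then print the decision table ...
for i in sorted(router_probs):
    top_p, top_idx = router_probs[i].topk(2)
    print(f"Layer {i:2d} | 1. {EXPERT_NAMES[top_idx[0].item()]:<13} | 2. {EXPERT_NAMES[top_idx[1].item()]:<13} "
          f"| ({top_p[0].item() * 100:.1f}% | {top_p[1].item() * 100:.1f}%)")

for handle in handles:
    handle.remove()  # hooks don't disrupt the model, but always clean them up when done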
4. The Grand Campaign: Training the Router with Real Data
The previous stages proved our architecture works. We built the car and confirmed the engine turns on. Now, it's time to fuel it with real aviation fuel and teach it how to fly. Training with real, high-quality data is the single most important step to transform our MoE model from a "confused committee" into a "mastermind council."
4.1 The Philosophy: Providing a "Strong Signal"
Our initial training on mock data showed that the Load Balancing Loss worked perfectly, forcing the router to be fair. However, the Main Task Loss was meaningless because the data was random.
By using a diverse, high-quality dataset, the Main Task Loss becomes a powerful teacher. When a coding question is presented, only the Coder
expert can produce an output that results in a low Main Loss. This gives a strong, unambiguous signal to the router: "To succeed, you MUST choose the Coder expert for this task!" This is how the router learns to become an intelligent dispatcher instead of just a fair one.
4.2 Step 1: Data Curation and Preparation
Your mission is to create a single, unified dataset that contains a mix of samples from all your expert domains.
A. Data Sources
Gather data from various sources that align with your experts. Examples from Hugging Face Datasets:
- Coder: codeparrot/github-code-clean (Python subset)
- Writer: cnn_dailymail (articles), Abirate/english_quotes
- Thinker: gsm8k, HuggingFaceH4/logic_in_natural_language
- Encyclopedia: wikipedia (20220301.en subset)
- Summarizer: cnn_dailymail (highlights)
- Analyst: wikisql
- Actor: daily_dialog
- Guardian: Data for safety alignment, like filtered parts of HuggingFaceH4/ultrachat_200k
B. The Unified Format: Instruction Tuning
You must preprocess all data into a consistent instruction-following format. A simple and effective format is a JSON Lines (.jsonl) file, where each line is a JSON object:
{"instruction": "Write a Python function to calculate Fibonacci.", "output": "def fibonacci(n):..."}
{"instruction": "Summarize the following article about photosynthesis.", "input": "Photosynthesis is a process used by plants...", "output": "Photosynthesis is how plants convert light..."}
{"instruction": "Who was the first person on the moon?", "output": "Neil Armstrong was the first person to walk on the moon."}
Create a large file, for example my_moe_dataset.jsonl, with thousands of these samples from all your expert domains.
C. Mix and Shuffle
This is critically important. After gathering and formatting your data, you must thoroughly shuffle the entire dataset. This ensures that during training, the model sees a random mix of tasks, which is essential for forcing the router to learn general-purpose dispatching skills.
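As a minimal sketch of steps A through C together (assuming the Hugging Face datasets library is installed; the two sources and their column names below are just examples drawn from the list above, not a fixed recipe), the preparation could look like this:
# (Sketch: build, mix, and shuffle my_moe_dataset.jsonl from two of the example sources above)
import json
import random
from datasets import load_dataset

samples = []

# Thinker-style samples from gsm8k (question -> answer)
for row in load_dataset("gsm8k", "main", split="train[:2000]"):
    samples.append({"instruction": row["question"], "output": row["answer"]})

# Summarizer-style samples from cnn_dailymail (article -> highlights)
for row in load_dataset("cnn_dailymail", "3.0.0", split="train[:2000]"):
    samples.append({
        "instruction": "Summarize the following article.",
        "input": row["article"],
        "output": row["highlights"],
    })

# The critically important mix-and-shuffle step
random.seed(42)
random.shuffle(samples)

with open("my_moe_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")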
4.3 Step 2: Modifying train_moe_router.py
Now, we will modify our master script to use this real dataset. This involves creating a PyTorch Dataset and DataLoader and updating our training loop.
A. Add the CustomMoEDataset Class
Add this class definition to your train_moe_router.py script, right after the MoE class definitions. This class will handle loading and tokenizing your .jsonl data.
# (Add this class to your train_moe_router.py script)
import json

from torch.utils.data import Dataset, DataLoader


class CustomMoEDataset(Dataset):
    """
    A PyTorch Dataset to handle loading our instruction-formatted JSONL file.
    """
    def __init__(self, file_path, tokenizer, max_length):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        # Format the instruction (plus the optional "input" field) and output into a chat template.
        # This is a robust way to prepare the data for an instruction-tuned model.
        user_content = item['instruction']
        if item.get('input'):
            user_content += "\n\n" + item['input']
        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": item['output']}
        ]
        full_text = self.tokenizer.apply_chat_template(messages, tokenize=False)
        # Tokenize the full text
        tokenized_output = self.tokenizer(
            full_text,
            max_length=self.max_length,
            padding="max_length",  # Pad to a fixed length
            truncation=True,
            return_tensors="pt"
        )
        # For Causal LM, the input_ids are also the labels;
        # padding positions are set to -100 so the loss ignores them
        input_ids = tokenized_output.input_ids.squeeze(0)
        labels = input_ids.clone()
        labels[tokenized_output.attention_mask.squeeze(0) == 0] = -100
        return {"input_ids": input_ids, "labels": labels}
You will also need to add import json and from torch.utils.data import Dataset, DataLoader to the top of the script (they are included at the top of the snippet above for convenience).
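As a quick, optional sanity check (a hypothetical snippet, assuming my_moe_dataset.jsonl already exists and using 512 as an example sequence length), you can confirm that the dataset returns tensors of the expected shape before wiring it into the training loop:
# (Optional sanity check for the dataset class; not part of train_moe_router.py)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dataset = CustomMoEDataset("./my_moe_dataset.jsonl", tokenizer, max_length=512)
sample = dataset[0]
print(sample["input_ids"].shape, sample["labels"].shape)  # expect torch.Size([512]) for both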
B. Update the main() Function
Replace the entire main() function in train_moe_router.py with the version below. It removes the mock data and implements the real data loading and training loop.
# (This is the new, complete main() function for real training)
def main():
    # Step 1: Assemble the MoE model
    moe_model = create_moe_model()

    # Step 2: Create the optimizer for the routers (only router params have requires_grad=True)
    optimizer = optim.AdamW([p for p in moe_model.parameters() if p.requires_grad], lr=LEARNING_RATE)

    # --- Step 3: Build the "Fuel Line" - The DataLoader ---
    print("\n--- Preparing Real Dataset for Training ---")
    DATASET_PATH = "./my_moe_dataset.jsonl"  # <-- IMPORTANT: Make sure this file exists!

    # We need the tokenizer from the base model to prepare the data
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # Set a pad token if it doesn't exist
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    dataset = CustomMoEDataset(DATASET_PATH, tokenizer, max_length=SEQUENCE_LENGTH)
    data_loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

    # --- Step 4: The Training Loop ---
    print("--- Starting Router Training Loop with Real Data ---")
    moe_model.train()

    # Training is often measured in steps for large datasets, not epochs.
    # Let's train for a fixed number of steps.
    num_training_steps = 5000  # Increase this for a full training run
    step_count = 0
    start_time = time.time()

    # We use a while loop to keep training until we reach the desired number of steps
    while step_count < num_training_steps:
        for batch in data_loader:
            if step_count >= num_training_steps:
                break

            optimizer.zero_grad()

            # Move batch to the correct device
            input_ids = batch['input_ids'].to(device)
            labels = batch['labels'].to(device)

            # --- The Forward Pass ---
            outputs = moe_model(input_ids=input_ids, labels=labels)
            main_loss = outputs.loss

            # Collect the load-balancing loss from every MoE layer
            total_lb_loss = 0.0
            for layer in moe_model.model.layers:
                total_lb_loss += layer.mlp.most_recent_lb_loss

            total_loss = main_loss + LB_LOSS_COEFFICIENT * total_lb_loss

            total_loss.backward()
            optimizer.step()
            step_count += 1

            # Print logs periodically (start_time is reset here so the timing covers 10 steps)
            if step_count % 10 == 0:
                elapsed_time = time.time() - start_time
                print(f"Step [{step_count:04d}/{num_training_steps}] | Total Loss: {total_loss.item():.4f} | "
                      f"Main Loss: {main_loss.item():.4f} | "
                      f"Avg LB Loss: {(total_lb_loss.item() / moe_model.config.num_hidden_layers):.4f} | "
                      f"Time/10 steps: {elapsed_time:.2f}s")
                start_time = time.time()

    print("\n--- Router Training Complete! ---")

    # --- Step 5: Saving the final model ---
    print("\n--- Phase 5: Saving the fully trained MoE model to disk ---")
    OUTPUT_MODEL_DIR = "./SmolMoE-8x135M-Instruct-v1-Trained-RealData"
    if os.path.exists(OUTPUT_MODEL_DIR):
        shutil.rmtree(OUTPUT_MODEL_DIR)
    os.makedirs(OUTPUT_MODEL_DIR)

    print("Updating model config with MoE-specific parameters...")
    moe_model.config.moe_num_experts = NUM_EXPERTS
    moe_model.config.moe_top_k = TOP_K

    print(f"Saving model to '{OUTPUT_MODEL_DIR}'...")
    moe_model.save_pretrained(OUTPUT_MODEL_DIR)

    print("Saving tokenizer...")
    tokenizer.save_pretrained(OUTPUT_MODEL_DIR)

    print("\n--- Model successfully saved! ---")
With these modifications, your project is now equipped for its final, most important phase. You have a clear plan for curating the data and the exact code needed to train your model's intelligence. This is the path from a working prototype to a truly powerful and unique AI.
5. The Journey Ahead: From "Working" to "Great"
We have successfully validated the entire workflow using simulated data. To unlock the model's true potential, the next stage of the journey is clear:
- Switch to Real Aviation Fuel: Completely replace the mock_input_ids in the training script. Your task is to collect, process, and build a high-quality, diverse, and mixed dataset containing real examples from all expert domains.
- Build the Fuel Supply Line: Implement a standard PyTorch Dataset and DataLoader to efficiently feed this real data to the model.
- Begin the Interstellar Expedition: Start a true, long-duration deep training run (for thousands or tens of thousands of steps) and patiently watch the Main Loss consistently decrease.
This is the path from being a "Creator" to being a "Great Creator."