SmolLM3 Fine-tuning
This repository provides a complete setup for fine-tuning SmolLM3 models using the FlexAI console, following the nanoGPT structure but adapted for modern transformer models.
Overview
SmolLM3 is a 3B-parameter transformer decoder model optimized for efficiency, long-context reasoning, and multilingual support. This setup allows you to fine-tune SmolLM3 for various tasks including:
- Supervised Fine-tuning (SFT): Adapt the model for instruction following
- Direct Preference Optimization (DPO): Improve model alignment
- Long-context fine-tuning: Support for up to 128k tokens
- Tool calling: Fine-tune for function calling capabilities
- Model Quantization: Create int8 (GPU) and int4 (CPU) quantized versions
Quick Start
1. Repository Setup
The repository follows the FlexAI console structure with the following key files:
- `train.py`: Main entry point script
- `config/train_smollm3.py`: Default configuration
- `model.py`: Model wrapper and loading
- `data.py`: Dataset handling and preprocessing
- `trainer.py`: Training loop and trainer setup
- `requirements.txt`: Dependencies
2. FlexAI Console Configuration
When setting up a Fine Tuning Job in the FlexAI console, use these settings:
Basic Configuration
- Name: `smollm3-finetune`
- Cluster: Your organization's designated cluster
- Checkpoint: (Optional) Previous training job checkpoint
- Node Count: 1
- Accelerator Count: 1-8 (depending on your needs)
Repository Settings
- Repository URL: https://github.com/your-username/flexai-finetune
- Repository Revision: main
Dataset Configuration
- Datasets: Your dataset (mounted under `/input`)
- Mount Directory: `my_dataset`
Entry Point
```
train.py config/train_smollm3.py --dataset_dir=my_dataset --init_from=resume --out_dir=/input-checkpoint --max_iters=1500
```
3. Dataset Format
The script supports multiple dataset formats:
Chat Format (Recommended)
```json
[
  {
    "messages": [
      {"role": "user", "content": "What is machine learning?"},
      {"role": "assistant", "content": "Machine learning is a subset of AI..."}
    ]
  }
]
```
Instruction Format
```json
[
  {
    "instruction": "What is machine learning?",
    "output": "Machine learning is a subset of AI..."
  }
]
```
User-Assistant Format
```json
[
  {
    "user": "What is machine learning?",
    "assistant": "Machine learning is a subset of AI..."
  }
]
```
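Whichever format you use, records are normalized into chat messages before tokenization. The snippet below is a minimal sketch of that normalization, assuming the field names shown above; the repository's actual preprocessing lives in `data.py`.
```python
# Minimal sketch: normalize the three supported record types into chat
# messages. Field names follow the examples above; see data.py for the
# repository's actual preprocessing logic.
import json

def to_messages(record: dict) -> list[dict]:
    if "messages" in record:                      # chat format
        return record["messages"]
    if "instruction" in record:                   # instruction format
        return [
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["output"]},
        ]
    return [                                      # user-assistant format
        {"role": "user", "content": record["user"]},
        {"role": "assistant", "content": record["assistant"]},
    ]

with open("my_dataset/train.json") as f:
    dataset = [to_messages(r) for r in json.load(f)]
```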
4. Configuration Options
The default configuration in `config/train_smollm3.py` includes:
```python
@dataclass
class SmolLM3Config:
    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096
    use_flash_attention: bool = True

    # Training configuration
    batch_size: int = 4
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000

    # Mixed precision
    fp16: bool = True
    bf16: bool = False
```
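With these defaults, the effective batch size is the per-device batch size times the gradient accumulation steps (and times the accelerator count on multi-GPU jobs). A quick back-of-the-envelope check:
```python
# Rough throughput math for the defaults above (single accelerator assumed).
batch_size = 4
gradient_accumulation_steps = 4
max_seq_length = 4096

effective_batch = batch_size * gradient_accumulation_steps   # 16 sequences per optimizer step
tokens_per_step = effective_batch * max_seq_length           # 65,536 tokens per optimizer step
print(effective_batch, tokens_per_step)
```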
5. Command Line Arguments
The `train.py` script accepts various arguments:
```bash
# Basic usage
python train.py config/train_smollm3.py

# With custom parameters
python train.py config/train_smollm3.py \
    --dataset_dir=my_dataset \
    --out_dir=/output-checkpoint \
    --init_from=resume \
    --max_iters=1500 \
    --batch_size=8 \
    --learning_rate=1e-5 \
    --max_seq_length=8192
```
Advanced Usage
1. Custom Configuration
Create a custom configuration file:
```python
# config/my_config.py
from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    model_name="HuggingFaceTB/SmolLM3-3B-Instruct",
    max_seq_length=8192,
    batch_size=2,
    learning_rate=1e-5,
    max_iters=2000,
    use_flash_attention=True,
    fp16=True
)
```
2. Long-Context Fine-tuning
For long-context tasks (up to 128k tokens):
```python
config = SmolLM3Config(
    max_seq_length=131072,  # 128k tokens
    model_name="HuggingFaceTB/SmolLM3-3B",
    use_flash_attention=True,
    gradient_checkpointing=True
)
```
3. DPO Training
For preference optimization, use the DPO trainer:
```python
from trainer import SmolLM3DPOTrainer

dpo_trainer = SmolLM3DPOTrainer(
    model=model,
    dataset=dataset,
    config=config,
    output_dir="./dpo-output"
)
dpo_trainer.train()
```
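DPO trains on preference pairs rather than plain conversations. The exact schema is defined by `trainer.py`; the snippet below is a hedged sketch of a common prompt/chosen/rejected layout you can adapt.
```python
# A sketch of preference-pair records for DPO. The prompt/chosen/rejected
# field names are an assumption here; check trainer.py (SmolLM3DPOTrainer)
# for the schema it actually expects.
import json

preference_data = [
    {
        "prompt": "What is machine learning?",
        "chosen": "Machine learning is a subset of AI that learns patterns from data...",
        "rejected": "Machine learning is when computers magically know things.",
    }
]

with open("my_dataset/dpo_train.json", "w") as f:
    json.dump(preference_data, f, indent=2)
```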
4. Tool Calling Fine-tuning
Include tool calling examples in your dataset:
```json
[
  {
    "messages": [
      {"role": "user", "content": "What's the weather in New York?"},
      {"role": "assistant", "content": "<tool_call>\n<invoke name=\"get_weather\">\n<parameter name=\"location\">New York</parameter>\n</invoke>\n</tool_call>"},
      {"role": "tool", "content": "The weather in New York is 72°F and sunny."},
      {"role": "assistant", "content": "The weather in New York is currently 72°F and sunny."}
    ]
  }
]
```
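At inference time, tool definitions are usually passed through the chat template rather than written into the text by hand. As a sketch, assuming the SmolLM3 chat template supports the standard `tools` argument of `apply_chat_template` in recent transformers versions:
```python
# Sketch: render a tool-calling prompt with the tokenizer's chat template.
# Assumes the SmolLM3 chat template accepts a `tools` list of schema-annotated
# Python functions, as recent transformers releases support.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

def get_weather(location: str) -> str:
    """
    Get the current weather for a location.

    Args:
        location: The city to get the weather for.
    """
    return "72°F and sunny"  # placeholder implementation

messages = [{"role": "user", "content": "What's the weather in New York?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```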
Model Variants
SmolLM3 comes in several variants:
- SmolLM3-3B-Base: Base model for general fine-tuning
- SmolLM3-3B: Instruction-tuned model
- SmolLM3-3B-Instruct: Enhanced instruction model
- Quantized versions: Available for deployment
Hardware Requirements
Minimum Requirements
- GPU: 16GB+ VRAM (for 3B model)
- RAM: 32GB+ system memory
- Storage: 50GB+ free space
Recommended
- GPU: A100/H100 or similar
- RAM: 64GB+ system memory
- Storage: 100GB+ SSD
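These numbers follow from the model size. A rough, weights-only estimate (ignoring activations, gradients, and optimizer state, which add substantially more during training) looks like this:
```python
# Back-of-the-envelope weight memory for a 3B-parameter model.
params = 3e9
bytes_per_param = {"fp32": 4.0, "bf16/fp16": 2.0, "int8": 1.0, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{params * nbytes / 1e9:.1f} GB")
# fp32: ~12.0 GB, bf16/fp16: ~6.0 GB, int8: ~3.0 GB, int4: ~1.5 GB
```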
Troubleshooting
Common Issues
Out of Memory (OOM)
- Reduce `batch_size`
- Increase `gradient_accumulation_steps`
- Enable `gradient_checkpointing`
- Use `fp16` or `bf16`
- Reduce `max_seq_length` (see the configuration sketch after this list)
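A minimal sketch applying these mitigations through the config fields shown earlier (`gradient_checkpointing` is assumed to be a supported field, as in the long-context example):
```python
# Sketch: an OOM-friendly configuration using the fields shown above.
from config.train_smollm3 import SmolLM3Config

config = SmolLM3Config(
    batch_size=1,                       # smaller per-device batch
    gradient_accumulation_steps=16,     # keeps the effective batch size at 16
    max_seq_length=2048,                # shorter sequences use less activation memory
    gradient_checkpointing=True,        # trade compute for memory
    fp16=True,                          # or bf16=True on Ampere+ GPUs
)
```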
Slow Training
- Enable `flash_attention`
- Use mixed precision (`fp16`/`bf16`)
- Increase `dataloader_num_workers`
Dataset Loading Issues
- Check dataset format
- Ensure proper JSON structure
- Verify file permissions
Debug Mode
Enable debug logging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
Evaluation
After training, evaluate your model:
```python
from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="./output-checkpoint",
    device=0,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7
)

# Test the model
messages = [{"role": "user", "content": "Explain gravity in simple terms."}]
outputs = pipe(messages)
print(outputs[0]["generated_text"][-1]["content"])
```
Model Quantization
The pipeline includes built-in quantization support based on torchao, producing optimized model variants that are stored alongside the main model in a unified repository structure:
Repository Structure
All models (main and quantized) are stored in a single repository:
```
your-username/model-name/
├── README.md            (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── int8/                (quantized model for GPU)
└── int4/                (quantized model for CPU)
```
Quantization Types
- int8_weight_only: GPU optimized, ~50% memory reduction
- int4_weight_only: CPU optimized, ~75% memory reduction
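Under the hood, weight-only quantization with torchao takes only a few lines. The following is a sketch of the general approach; the repository's scripts under `scripts/model_tonic/` remain the authoritative implementation.
```python
# Sketch: int8 weight-only quantization of a fine-tuned checkpoint with torchao.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only

model = AutoModelForCausalLM.from_pretrained(
    "./output-checkpoint",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
quantize_(model, int8_weight_only())  # replaces Linear weights with int8 tensors in place
model.save_pretrained("./output-checkpoint-int8", safe_serialization=False)
```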
Automatic Quantization
When using the interactive pipeline (`launch.sh`), you'll be prompted to create quantized versions after training:
```bash
./launch.sh
# ... training completes ...
# Choose quantization options when prompted
```
Standalone Quantization
Quantize existing models independently:
```bash
# Quantize and push to HF Hub (same repository)
python scripts/model_tonic/quantize_standalone.py /path/to/model your-username/model-name \
    --quant-type int8_weight_only \
    --token YOUR_HF_TOKEN

# Quantize and save locally
python scripts/model_tonic/quantize_standalone.py /path/to/model your-username/model-name \
    --quant-type int4_weight_only \
    --device cpu \
    --save-only
```
Loading Quantized Models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load main model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

# Load int8 quantized model (GPU) from the int8/ subfolder
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")

# Load int4 quantized model (CPU) from the int4/ subfolder
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int4")
```
For detailed quantization documentation, see QUANTIZATION_GUIDE.md.
Unified Model Cards
The system generates comprehensive model cards that include information about all model variants:
- Single README: One comprehensive model card for the entire repository
- Conditional Sections: Quantized model information appears when available
- Usage Examples: Complete examples for all model variants
- Performance Information: Memory and speed benefits for each quantization type
For detailed information about the unified model card system, see UNIFIED_MODEL_CARD_GUIDE.md.
Deployment
Using vLLM
```bash
vllm serve ./output-checkpoint --enable-auto-tool-choice
```
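`vllm serve` exposes an OpenAI-compatible API (on http://localhost:8000 by default), so the fine-tuned model can be queried with the standard `openai` client. A minimal sketch, assuming those defaults:
```python
# Sketch: query the vLLM server started above via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="./output-checkpoint",   # must match the model path passed to `vllm serve`
    messages=[{"role": "user", "content": "Explain gravity in simple terms."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```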
Using llama.cpp
```bash
# Convert to GGUF format using llama.cpp's conversion script
python llama.cpp/convert_hf_to_gguf.py ./output-checkpoint --outfile model.gguf
```
Resources
License
This project follows the same license as the SmolLM3 model. Please refer to the Hugging Face model page for licensing information.
{ "id": "exp_20250718_195852", "name": "petit-elle-l-aime-3", "description": "SmolLM3 fine-tuning experiment", "created_at": "2025-07-18T19:58:52.689087", "status": "running", "metrics": [], "parameters": {}, "artifacts": [], "logs": [] }