SmolLM3 End-to-End Fine-tuning Pipeline

This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.

🚀 Quick Start

1. Set Up Configuration

# Run the setup script to configure with your information
python setup_launch.py

This will prompt you for:

  • Your Hugging Face username
  • Your Hugging Face token
  • Optional model and dataset customizations (a sketch of this flow follows below)
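
As a rough illustration, the prompts amount to something like the following sketch (the actual setup_launch.py may collect and store these values differently):

# Illustrative sketch of the setup flow, not the actual setup_launch.py.
import getpass

hf_username = input("Hugging Face username: ").strip()
hf_token = getpass.getpass("Hugging Face token (hidden): ")  # kept out of the printout
model_name = input("Model [HuggingFaceTB/SmolLM3-3B]: ").strip() or "HuggingFaceTB/SmolLM3-3B"
dataset_name = input("Dataset [HuggingFaceTB/smoltalk]: ").strip() or "HuggingFaceTB/smoltalk"

# The real script writes these values into launch.sh; here we only echo them.
print(f"HF_USERNAME={hf_username} MODEL_NAME={model_name} DATASET_NAME={dataset_name}")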

2. Check Requirements

# Verify all dependencies are installed
python check_requirements.py
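
In essence, this check verifies that the core packages import cleanly. A minimal equivalent (the exact package list is an assumption):

# Quick import check for the core dependencies (illustrative package list).
import importlib

for pkg in ("torch", "transformers", "datasets", "huggingface_hub", "accelerate"):
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg}")
    except ImportError as exc:
        print(f"MISSING: {pkg} ({exc})")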

3. Run the Pipeline

# Make the script executable and run it
chmod +x launch.sh
./launch.sh

📋 What the Pipeline Does

The end-to-end pipeline performs the following steps:

1. Environment Setup

  • Installs system dependencies
  • Creates Python virtual environment
  • Installs PyTorch with CUDA support
  • Installs all required Python packages

2. Trackio Space Deployment

  • Creates a new Hugging Face Space for experiment tracking
  • Configures the Trackio monitoring interface
  • Sets up environment variables (see the sketch below)
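
Programmatically, this boils down to the public huggingface_hub API. A minimal sketch (the Space name and variable names are illustrative, not necessarily what deploy_trackio_space.py uses):

# Sketch: create a Gradio Space and set its environment via the Hub API.
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")
space_id = "your-username/trackio-monitoring-20250101"  # illustrative name
api.create_repo(repo_id=space_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# Environment variables for the Space live as secrets/variables:
api.add_space_secret(repo_id=space_id, key="HF_TOKEN", value="your_hf_token_here")
api.add_space_variable(repo_id=space_id, key="TRACKIO_DATASET_REPO",
                       value="your-username/trackio-experiments")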

3. HF Dataset Setup

  • Creates a Hugging Face Dataset repository for experiment storage
  • Configures dataset access and permissions
  • Sets up the initial experiment data structure (see the sketch below)
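
A minimal sketch of this step using huggingface_hub (the repo id and initial file layout are illustrative):

# Sketch: create the experiment-storage dataset and seed an empty structure.
import json
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")
repo_id = "your-username/trackio-experiments"
api.create_repo(repo_id=repo_id, repo_type="dataset", private=True, exist_ok=True)
api.upload_file(
    path_or_fileobj=json.dumps({"experiments": []}).encode(),
    path_in_repo="experiments.json",  # illustrative initial structure
    repo_id=repo_id,
    repo_type="dataset",
)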

4. Dataset Preparation

  • Downloads the specified dataset from Hugging Face Hub
  • Converts to training format (prompt/completion pairs)
  • Handles multiple dataset formats automatically
  • Creates train/validation splits (see the sketch below)
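
Conceptually, this step looks like the sketch below (the dataset id and split ratio are placeholders, and only the chat-format conversion is shown; the pipeline also handles the other formats listed under Advanced Usage):

# Sketch: normalize a chat-style dataset to prompt/completion pairs
# and write train/validation splits.
import os
from datasets import load_dataset

raw = load_dataset("your-dataset", split="train")  # placeholder id

def to_pair(example):
    msgs = example["messages"]  # chat format; other formats map similarly
    return {"prompt": msgs[0]["content"], "completion": msgs[-1]["content"]}

pairs = raw.map(to_pair, remove_columns=raw.column_names)
splits = pairs.train_test_split(test_size=0.05, seed=42)  # 5% validation
os.makedirs("training_dataset", exist_ok=True)
splits["train"].to_json("training_dataset/train.json")
splits["test"].to_json("training_dataset/validation.json")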

5. Training Configuration

  • Creates optimized training configuration
  • Sets up monitoring integration
  • Configures model parameters and hyperparameters

6. Model Training

  • Runs the SmolLM3 fine-tuning process
  • Logs metrics to Trackio Space in real time
  • Saves experiment data to HF Dataset
  • Creates checkpoints during training (see the sketch below)
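
The real-time logging can be pictured as a transformers TrainerCallback that forwards each log entry to the tracker. A minimal sketch (send_to_trackio is a hypothetical stand-in for the pipeline's actual client):

# Sketch: forward trainer logs to an external tracker in real time.
from transformers import TrainerCallback

def send_to_trackio(step, metrics):
    # Hypothetical stand-in: the real pipeline sends this to the Trackio
    # Space and appends it to the HF Dataset.
    print(f"[trackio] step={step} {metrics}")

class TrackioCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:  # e.g. {"loss": 1.23, "learning_rate": 5e-06, "epoch": 0.1}
            send_to_trackio(step=state.global_step, metrics=logs)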

7. Model Deployment

  • Pushes trained model to Hugging Face Hub
  • Creates comprehensive model card
  • Uploads training results and logs
  • Tests the uploaded model (see the sketch below)
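
A minimal sketch of the upload-and-test step using huggingface_hub and transformers (the repo id and checkpoint path are illustrative):

# Sketch: push the trained checkpoint to the Hub and smoke-test it.
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/smollm3-finetuned-20250101"  # illustrative name
api = HfApi(token="your_hf_token_here")
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_folder(folder_path="output-checkpoint", repo_id=repo_id)

# Smoke test: reload from the Hub and generate a few tokens.
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
out = model.generate(**tok("Hello", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))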

8. Summary Report

  • Generates detailed training summary
  • Provides links to all resources
  • Documents configuration and results

🎯 Features

Integrated Monitoring

  • Real-time experiment tracking via Trackio Space
  • Persistent storage in Hugging Face Datasets
  • Comprehensive metrics logging
  • System resource monitoring

Flexible Dataset Support

  • Automatic format detection and conversion
  • Support for multiple dataset types
  • Built-in data preprocessing
  • Train/validation split handling

Optimized Training

  • Flash Attention support for efficiency
  • Gradient checkpointing for memory optimization
  • Mixed precision training
  • Automatic hyperparameter optimization (see below for a sketch of the first three options)
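
The first three of these map directly onto standard transformers options. A minimal sketch of such a configuration (the generated config's exact settings may differ, and Flash Attention additionally requires a compatible GPU plus the flash-attn package):

# Sketch: Flash Attention, gradient checkpointing, and mixed precision
# enabled via standard transformers options (settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Flash Attention
)
model.gradient_checkpointing_enable()  # trade compute for memory

args = TrainingArguments(
    output_dir="output-checkpoint",
    bf16=True,                    # mixed precision training
    gradient_checkpointing=True,  # keep Trainer and model in sync
)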

Complete Deployment

  • Automated model upload to Hugging Face Hub
  • Comprehensive model cards
  • Training results documentation
  • Model testing and validation

📊 Monitoring & Tracking

Trackio Space Interface

  • Real-time training metrics visualization
  • Experiment management and comparison
  • System resource monitoring
  • Training progress tracking

HF Dataset Storage

  • Persistent experiment data storage
  • Version-controlled experiment history
  • Collaborative experiment sharing
  • Automated data backup

🔧 Configuration

Required Configuration

Update these variables in launch.sh:

# Your Hugging Face credentials
HF_TOKEN="your_hf_token_here"
HF_USERNAME="your-username"

# Model and dataset
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
DATASET_NAME="HuggingFaceTB/smoltalk"

# Output repositories
REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
TRACKIO_DATASET_REPO="your-username/trackio-experiments"

Training Parameters

Customize training parameters:

# Training configuration
BATCH_SIZE=2
GRADIENT_ACCUMULATION_STEPS=8
LEARNING_RATE=5e-6
MAX_EPOCHS=3
MAX_SEQ_LENGTH=4096
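
These variables correspond to the usual transformers training arguments; note that the effective batch size is BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS = 2 × 8 = 16 sequences per optimizer step. A sketch of the mapping (argument names in the generated config may differ):

# Sketch: the shell parameters above expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output-checkpoint",
    per_device_train_batch_size=2,  # BATCH_SIZE
    gradient_accumulation_steps=8,  # GRADIENT_ACCUMULATION_STEPS
    learning_rate=5e-6,             # LEARNING_RATE
    num_train_epochs=3,             # MAX_EPOCHS
)
# MAX_SEQ_LENGTH is applied at tokenization time rather than here.
# Effective batch size per device: 2 * 8 = 16 sequences per step.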

πŸ“ Output Structure

After running the pipeline, you'll have:

├── training_dataset/           # Prepared dataset
│   ├── train.json
│   └── validation.json
├── output-checkpoint/          # Model checkpoints
│   ├── config.json
│   ├── pytorch_model.bin
│   └── training_results/
├── training.log                # Training logs
├── training_summary.md         # Summary report
└── config/train_smollm3_end_to_end.py  # Training config

🌐 Online Resources

The pipeline creates these online resources:

  • Model Repository: https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD
  • Trackio Space: https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD
  • Experiment Dataset: https://huggingface.co/datasets/your-username/trackio-experiments

🛠️ Troubleshooting

Common Issues

  1. HF Token Issues

    # Verify your token is correct
    huggingface-cli whoami
    
  2. CUDA Issues

    # Check CUDA availability
    python -c "import torch; print(torch.cuda.is_available())"
    
  3. Memory Issues

    # Reduce batch size or gradient accumulation
    BATCH_SIZE=1
    GRADIENT_ACCUMULATION_STEPS=16
    
  4. Dataset Issues

    # Test dataset access
    python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
    

Debug Mode

Run individual components for debugging:

# Test Trackio deployment
cd scripts/trackio_tonic
python deploy_trackio_space.py

# Test dataset setup
cd scripts/dataset_tonic
python setup_hf_dataset.py

# Test training
python src/train.py config/train_smollm3_end_to_end.py --help

📚 Advanced Usage

Custom Datasets

For custom datasets, ensure they have one of these formats:

// Format 1: Prompt/Completion
{
  "prompt": "What is machine learning?",
  "completion": "Machine learning is..."
}

// Format 2: Instruction/Output
{
  "instruction": "Explain machine learning",
  "output": "Machine learning is..."
}

// Format 3: Chat format
{
  "messages": [
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "ML is..."}
  ]
}
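
A quick way to check which of these formats a dataset uses (a minimal sketch; the dataset id is a placeholder):

# Sketch: detect which supported format a custom dataset follows.
from datasets import load_dataset

ds = load_dataset("your-dataset", split="train")  # placeholder id
cols = set(ds.column_names)

if {"prompt", "completion"} <= cols:
    print("Format 1: prompt/completion")
elif {"instruction", "output"} <= cols:
    print("Format 2: instruction/output")
elif "messages" in cols:
    print("Format 3: chat messages")
else:
    raise ValueError(f"Unsupported columns: {sorted(cols)}")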

Custom Models

To use different models, update the configuration:

MODEL_NAME="microsoft/DialoGPT-medium"
MAX_SEQ_LENGTH=1024

Custom Training

Modify training parameters in the generated config:

# In config/train_smollm3_end_to_end.py
config = SmolLM3Config(
    learning_rate=1e-5,  # Custom learning rate
    max_iters=5000,      # Custom training steps
    # ... other parameters
)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test the pipeline
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Hugging Face for the excellent transformers library
  • The SmolLM3 team for the base model
  • The Trackio team for experiment tracking
  • The open-source community for contributions

📞 Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs in training.log
  3. Check the Trackio Space for monitoring data
  4. Open an issue on GitHub

Happy Fine-tuning! 🚀