SmolLM3 End-to-End Fine-tuning Pipeline

This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.

🚀 Quick Start

1. Set Up Configuration

# Run the setup script to configure with your information
python setup_launch.py

This will prompt you for:

  • Your Hugging Face username
  • Your Hugging Face token
  • Optional model and dataset customizations (a sketch of this flow follows below)
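
As a rough illustration, the prompts amount to something like the following sketch (the actual setup_launch.py may collect and store these values differently):

# Illustrative sketch of the setup flow, not the actual setup_launch.py.
import getpass

hf_username = input("Hugging Face username: ").strip()
hf_token = getpass.getpass("Hugging Face token (hidden): ")  # kept out of the printout
model_name = input("Model [HuggingFaceTB/SmolLM3-3B]: ").strip() or "HuggingFaceTB/SmolLM3-3B"
dataset_name = input("Dataset [HuggingFaceTB/smoltalk]: ").strip() or "HuggingFaceTB/smoltalk"

# The real script writes these values into launch.sh; here we only echo them.
print(f"HF_USERNAME={hf_username} MODEL_NAME={model_name} DATASET_NAME={dataset_name}")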

2. Check Requirements

# Verify all dependencies are installed
python check_requirements.py
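
In essence, this check verifies that the core packages import cleanly. A minimal equivalent (the exact package list is an assumption):

# Quick import check for the core dependencies (illustrative package list).
import importlib

for pkg in ("torch", "transformers", "datasets", "huggingface_hub", "accelerate"):
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg}")
    except ImportError as exc:
        print(f"MISSING: {pkg} ({exc})")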

3. Run the Pipeline

# Make the script executable and run it
chmod +x launch.sh
./launch.sh

📋 What the Pipeline Does

The end-to-end pipeline performs the following steps:

1. Environment Setup

  • Installs system dependencies
  • Creates Python virtual environment
  • Installs PyTorch with CUDA support
  • Installs all required Python packages

2. Trackio Space Deployment

  • Creates a new Hugging Face Space for experiment tracking
  • Configures the Trackio monitoring interface
  • Sets up environment variables (see the sketch below)
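
Programmatically, this boils down to the public huggingface_hub API. A minimal sketch (the Space name and variable names are illustrative, not necessarily what deploy_trackio_space.py uses):

# Sketch: create a Gradio Space and set its environment via the Hub API.
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")
space_id = "your-username/trackio-monitoring-20250101"  # illustrative name
api.create_repo(repo_id=space_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# Environment variables for the Space live as secrets/variables:
api.add_space_secret(repo_id=space_id, key="HF_TOKEN", value="your_hf_token_here")
api.add_space_variable(repo_id=space_id, key="TRACKIO_DATASET_REPO",
                       value="your-username/trackio-experiments")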

3. HF Dataset Setup

  • Creates a Hugging Face Dataset repository for experiment storage
  • Configures dataset access and permissions
  • Sets up the initial experiment data structure (see the sketch below)
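
A minimal sketch of this step using huggingface_hub (the repo id and initial file layout are illustrative):

# Sketch: create the experiment-storage dataset and seed an empty structure.
import json
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")
repo_id = "your-username/trackio-experiments"
api.create_repo(repo_id=repo_id, repo_type="dataset", private=True, exist_ok=True)
api.upload_file(
    path_or_fileobj=json.dumps({"experiments": []}).encode(),
    path_in_repo="experiments.json",  # illustrative initial structure
    repo_id=repo_id,
    repo_type="dataset",
)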

4. Dataset Preparation

  • Downloads the specified dataset from Hugging Face Hub
  • Converts to training format (prompt/completion pairs)
  • Handles multiple dataset formats automatically
  • Creates train/validation splits (see the sketch below)
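
Conceptually, this step looks like the sketch below (the dataset id and split ratio are placeholders, and only the chat-format conversion is shown; the pipeline also handles the other formats listed under Advanced Usage):

# Sketch: normalize a chat-style dataset to prompt/completion pairs
# and write train/validation splits.
import os
from datasets import load_dataset

raw = load_dataset("your-dataset", split="train")  # placeholder id

def to_pair(example):
    msgs = example["messages"]  # chat format; other formats map similarly
    return {"prompt": msgs[0]["content"], "completion": msgs[-1]["content"]}

pairs = raw.map(to_pair, remove_columns=raw.column_names)
splits = pairs.train_test_split(test_size=0.05, seed=42)  # 5% validation
os.makedirs("training_dataset", exist_ok=True)
splits["train"].to_json("training_dataset/train.json")
splits["test"].to_json("training_dataset/validation.json")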

5. Training Configuration

  • Creates optimized training configuration
  • Sets up monitoring integration
  • Configures model parameters and hyperparameters

6. Model Training

  • Runs the SmolLM3 fine-tuning process
  • Logs metrics to Trackio Space in real time
  • Saves experiment data to HF Dataset
  • Creates checkpoints during training (see the sketch below)
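
The real-time logging can be pictured as a transformers TrainerCallback that forwards each log entry to the tracker. A minimal sketch (send_to_trackio is a hypothetical stand-in for the pipeline's actual client):

# Sketch: forward trainer logs to an external tracker in real time.
from transformers import TrainerCallback

def send_to_trackio(step, metrics):
    # Hypothetical stand-in: the real pipeline sends this to the Trackio
    # Space and appends it to the HF Dataset.
    print(f"[trackio] step={step} {metrics}")

class TrackioCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:  # e.g. {"loss": 1.23, "learning_rate": 5e-06, "epoch": 0.1}
            send_to_trackio(step=state.global_step, metrics=logs)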

7. Model Deployment

  • Pushes trained model to Hugging Face Hub
  • Creates comprehensive model card
  • Uploads training results and logs
  • Tests the uploaded model (see the sketch below)
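
A minimal sketch of the upload-and-test step using huggingface_hub and transformers (the repo id and checkpoint path are illustrative):

# Sketch: push the trained checkpoint to the Hub and smoke-test it.
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/smollm3-finetuned-20250101"  # illustrative name
api = HfApi(token="your_hf_token_here")
api.create_repo(repo_id=repo_id, exist_ok=True)
api.upload_folder(folder_path="output-checkpoint", repo_id=repo_id)

# Smoke test: reload from the Hub and generate a few tokens.
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
out = model.generate(**tok("Hello", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))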

8. Summary Report

  • Generates detailed training summary
  • Provides links to all resources
  • Documents configuration and results

🎯 Features

Integrated Monitoring

  • Real-time experiment tracking via Trackio Space
  • Persistent storage in Hugging Face Datasets
  • Comprehensive metrics logging
  • System resource monitoring

Flexible Dataset Support

  • Automatic format detection and conversion
  • Support for multiple dataset types
  • Built-in data preprocessing
  • Train/validation split handling

Optimized Training

  • Flash Attention support for efficiency
  • Gradient checkpointing for memory optimization
  • Mixed precision training
  • Automatic hyperparameter optimization (see below for a sketch of the first three options)
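
The first three of these map directly onto standard transformers options. A minimal sketch of such a configuration (the generated config's exact settings may differ, and Flash Attention additionally requires a compatible GPU plus the flash-attn package):

# Sketch: Flash Attention, gradient checkpointing, and mixed precision
# enabled via standard transformers options (settings are illustrative).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Flash Attention
)
model.gradient_checkpointing_enable()  # trade compute for memory

args = TrainingArguments(
    output_dir="output-checkpoint",
    bf16=True,                    # mixed precision training
    gradient_checkpointing=True,  # keep Trainer and model in sync
)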

Complete Deployment

  • Automated model upload to Hugging Face Hub
  • Comprehensive model cards
  • Training results documentation
  • Model testing and validation

📊 Monitoring & Tracking

Trackio Space Interface

  • Real-time training metrics visualization
  • Experiment management and comparison
  • System resource monitoring
  • Training progress tracking

HF Dataset Storage

  • Persistent experiment data storage
  • Version-controlled experiment history
  • Collaborative experiment sharing
  • Automated data backup

🔧 Configuration

Required Configuration

Update these variables in launch.sh:

# Your Hugging Face credentials
HF_TOKEN="your_hf_token_here"
HF_USERNAME="your-username"

# Model and dataset
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
DATASET_NAME="HuggingFaceTB/smoltalk"

# Output repositories
REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
TRACKIO_DATASET_REPO="your-username/trackio-experiments"

Training Parameters

Customize training parameters:

# Training configuration
BATCH_SIZE=2
GRADIENT_ACCUMULATION_STEPS=8
LEARNING_RATE=5e-6
MAX_EPOCHS=3
MAX_SEQ_LENGTH=4096
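
These variables correspond to the usual transformers training arguments; note that the effective batch size is BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS = 2 × 8 = 16 sequences per optimizer step. A sketch of the mapping (argument names in the generated config may differ):

# Sketch: the shell parameters above expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output-checkpoint",
    per_device_train_batch_size=2,  # BATCH_SIZE
    gradient_accumulation_steps=8,  # GRADIENT_ACCUMULATION_STEPS
    learning_rate=5e-6,             # LEARNING_RATE
    num_train_epochs=3,             # MAX_EPOCHS
)
# MAX_SEQ_LENGTH is applied at tokenization time rather than here.
# Effective batch size per device: 2 * 8 = 16 sequences per step.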

πŸ“ Output Structure

After running the pipeline, you'll have:

├── training_dataset/           # Prepared dataset
│   ├── train.json
│   └── validation.json
├── output-checkpoint/          # Model checkpoints
│   ├── config.json
│   ├── pytorch_model.bin
│   └── training_results/
├── training.log                # Training logs
├── training_summary.md         # Summary report
└── config/train_smollm3_end_to_end.py  # Training config

🌐 Online Resources

The pipeline creates these online resources:

  • Model Repository: https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD
  • Trackio Space: https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD
  • Experiment Dataset: https://huggingface.co/datasets/your-username/trackio-experiments

🛠️ Troubleshooting

Common Issues

  1. HF Token Issues

    # Verify your token is correct
    huggingface-cli whoami
    
  2. CUDA Issues

    # Check CUDA availability
    python -c "import torch; print(torch.cuda.is_available())"
    
  3. Memory Issues

    # Reduce batch size or gradient accumulation
    BATCH_SIZE=1
    GRADIENT_ACCUMULATION_STEPS=16
    
  4. Dataset Issues

    # Test dataset access
    python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
    

Debug Mode

Run individual components for debugging:

# Test Trackio deployment
cd scripts/trackio_tonic
python deploy_trackio_space.py

# Test dataset setup
cd scripts/dataset_tonic
python setup_hf_dataset.py

# Test training
python src/train.py config/train_smollm3_end_to_end.py --help

📚 Advanced Usage

Custom Datasets

For custom datasets, ensure they have one of these formats:

// Format 1: Prompt/Completion
{
  "prompt": "What is machine learning?",
  "completion": "Machine learning is..."
}

// Format 2: Instruction/Output
{
  "instruction": "Explain machine learning",
  "output": "Machine learning is..."
}

// Format 3: Chat format
{
  "messages": [
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "ML is..."}
  ]
}
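
A quick way to check which of these formats a dataset uses (a minimal sketch; the dataset id is a placeholder):

# Sketch: detect which supported format a custom dataset follows.
from datasets import load_dataset

ds = load_dataset("your-dataset", split="train")  # placeholder id
cols = set(ds.column_names)

if {"prompt", "completion"} <= cols:
    print("Format 1: prompt/completion")
elif {"instruction", "output"} <= cols:
    print("Format 2: instruction/output")
elif "messages" in cols:
    print("Format 3: chat messages")
else:
    raise ValueError(f"Unsupported columns: {sorted(cols)}")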

Custom Models

To use different models, update the configuration:

MODEL_NAME="microsoft/DialoGPT-medium"
MAX_SEQ_LENGTH=1024

Custom Training

Modify training parameters in the generated config:

# In config/train_smollm3_end_to_end.py
config = SmolLM3Config(
    learning_rate=1e-5,  # Custom learning rate
    max_iters=5000,      # Custom training steps
    # ... other parameters
)

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test the pipeline
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Hugging Face for the excellent transformers library
  • The SmolLM3 team for the base model
  • The Trackio team for experiment tracking
  • The open-source community for contributions

📞 Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs in training.log
  3. Check the Trackio Space for monitoring data
  4. Open an issue on GitHub

Happy Fine-tuning! 🚀