# SmolLM3 End-to-End Fine-tuning Pipeline
This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.
## 🚀 Quick Start
### 1. Setup Configuration
```bash
# Run the setup script to configure with your information
python setup_launch.py
```
### 2. Check Requirements
```bash
# Verify all dependencies are installed
python check_requirements.py
```
### 3. Run the Pipeline
```bash
# Make the script executable and run
chmod +x launch.sh
./launch.sh
```
This will prompt you for:
- Your Hugging Face token
- Optional model and dataset customizations
## 📋 What the Pipeline Does
The end-to-end pipeline performs the following steps:
### 1. **Environment Setup**
- Installs system dependencies
- Creates Python virtual environment
- Installs PyTorch with CUDA support
- Installs all required Python packages
### 2. **Trackio Space Deployment**
- Creates a new Hugging Face Space for experiment tracking (a minimal sketch follows this list)
- Configures the Trackio monitoring interface
- Sets up environment variables
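For illustration, this step amounts to `huggingface_hub` calls like the sketch below; the repo names and token are placeholders, and the pipeline's actual logic lives in `scripts/trackio_tonic/deploy_trackio_space.py`:

```python
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")  # placeholder token
space_id = "your-username/trackio-monitoring-20240101"  # placeholder name

# Create a Gradio Space to host the Trackio monitoring interface
api.create_repo(repo_id=space_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# Point the Space at the experiment dataset via a Space variable
api.add_space_variable(
    repo_id=space_id,
    key="TRACKIO_DATASET_REPO",
    value="your-username/trackio-experiments",
)
```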
### 3. **HF Dataset Setup**
- Creates a Hugging Face Dataset repository for experiment storage (sketched after this list)
- Configures dataset access and permissions
- Sets up the initial experiment data structure
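A rough sketch of the equivalent `huggingface_hub` calls; the repo name matches `TRACKIO_DATASET_REPO`, while the file name and JSON layout are assumptions (the real logic is in `scripts/dataset_tonic/setup_hf_dataset.py`):

```python
import json

from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")  # placeholder token
dataset_id = "your-username/trackio-experiments"  # matches TRACKIO_DATASET_REPO

# Create a private dataset repo to hold experiment records
api.create_repo(repo_id=dataset_id, repo_type="dataset", private=True, exist_ok=True)

# Seed it with an empty experiment list (file name and layout are assumptions)
api.upload_file(
    path_or_fileobj=json.dumps({"experiments": []}).encode(),
    path_in_repo="experiments.json",
    repo_id=dataset_id,
    repo_type="dataset",
)
```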
### 4. **Dataset Preparation**
- Downloads the specified dataset from Hugging Face Hub
- Converts to training format (prompt/completion pairs)
- Handles multiple dataset formats automatically
- Creates train/validation splits (see the example below)
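As an example, converting a chat-style dataset to prompt/completion pairs and splitting it might look like this sketch with the `datasets` library; the config name, column names, and split ratio are assumptions, since the pipeline detects formats automatically:

```python
import os

from datasets import load_dataset

# Load the source dataset (the "all" config name is an assumption for smoltalk)
raw = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

def to_pairs(example):
    # Assumes a chat-style "messages" column; the pipeline handles more formats
    messages = example["messages"]
    return {
        "prompt": next(m["content"] for m in messages if m["role"] == "user"),
        "completion": next(m["content"] for m in messages if m["role"] == "assistant"),
    }

pairs = raw.map(to_pairs, remove_columns=raw.column_names)

# 95/5 train/validation split (the ratio is an assumption)
splits = pairs.train_test_split(test_size=0.05, seed=42)

os.makedirs("training_dataset", exist_ok=True)
splits["train"].to_json("training_dataset/train.json")
splits["test"].to_json("training_dataset/validation.json")
```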
### 5. **Training Configuration**
- Creates optimized training configuration
- Sets up monitoring integration
- Configures model parameters and hyperparameters
### 6. **Model Training**
- Runs the SmolLM3 fine-tuning process
- Logs metrics to Trackio Space in real time (see the callback sketch below)
- Saves experiment data to HF Dataset
- Creates checkpoints during training
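Conceptually, the real-time logging behaves like a `transformers` `TrainerCallback` that forwards each log event to the Space. The sketch below is illustrative only; the endpoint and experiment identifier are hypothetical, not the project's actual API:

```python
import requests
from transformers import TrainerCallback

class TrackioCallback(TrainerCallback):
    """Illustrative callback that forwards training logs to a monitoring endpoint."""

    def __init__(self, space_url: str, experiment_id: str):
        self.space_url = space_url          # Trackio Space URL (placeholder)
        self.experiment_id = experiment_id  # hypothetical experiment identifier

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` holds loss, learning_rate, epoch, etc. at each logging step
        if logs:
            requests.post(
                f"{self.space_url}/api/log",  # hypothetical endpoint
                json={
                    "experiment": self.experiment_id,
                    "step": state.global_step,
                    "metrics": logs,
                },
                timeout=5,
            )
```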
### 7. **Model Deployment**
- Pushes the trained model to the Hugging Face Hub (see the upload sketch below)
- Creates comprehensive model card
- Uploads training results and logs
- Tests the uploaded model
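The upload step boils down to `push_to_hub` calls like these (the repo name is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the final checkpoint produced by training
model = AutoModelForCausalLM.from_pretrained("output-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("output-checkpoint")

# Push both to the Hub (placeholder repo name)
model.push_to_hub("your-username/smollm3-finetuned-20240101")
tokenizer.push_to_hub("your-username/smollm3-finetuned-20240101")

# Smoke-test the upload by reloading the model from the Hub
reloaded = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned-20240101")
```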
### 8. **Summary Report**
- Generates detailed training summary
- Provides links to all resources
- Documents configuration and results
## 🎯 Features
### **Integrated Monitoring**
- Real-time experiment tracking via Trackio Space
- Persistent storage in Hugging Face Datasets
- Comprehensive metrics logging
- System resource monitoring
### **Flexible Dataset Support**
- Automatic format detection and conversion
- Support for multiple dataset types
- Built-in data preprocessing
- Train/validation split handling
### **Optimized Training**
- Flash Attention support for efficiency
- Gradient checkpointing for memory optimization
- Mixed precision training (illustrated below)
- Automatic hyperparameter optimization
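In `transformers` terms, these optimizations roughly map to settings like the following sketch; the values mirror the defaults in `launch.sh`, not the generated config:

```python
from transformers import TrainingArguments

# Flash Attention is enabled at model load time, e.g.
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
args = TrainingArguments(
    output_dir="output-checkpoint",
    per_device_train_batch_size=2,  # BATCH_SIZE
    gradient_accumulation_steps=8,  # GRADIENT_ACCUMULATION_STEPS
    gradient_checkpointing=True,    # trade compute for memory
    bf16=True,                      # mixed precision (Ampere+ GPUs)
    learning_rate=5e-6,
    num_train_epochs=3,
)
```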
### **Complete Deployment**
- Automated model upload to Hugging Face Hub
- Comprehensive model cards
- Training results documentation
- Model testing and validation
## 📊 Monitoring & Tracking
### **Trackio Space Interface**
- Real-time training metrics visualization
- Experiment management and comparison
- System resource monitoring
- Training progress tracking
### **HF Dataset Storage**
- Persistent experiment data storage
- Version-controlled experiment history
- Collaborative experiment sharing
- Automated data backup
## 🔧 Configuration
### **Required Configuration**
Update these variables in `launch.sh`:
```bash
# Your Hugging Face credentials
HF_TOKEN="your_hf_token_here"
HF_USERNAME="your-username"
# Model and dataset
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
DATASET_NAME="HuggingFaceTB/smoltalk"
# Output repositories
REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
TRACKIO_DATASET_REPO="your-username/trackio-experiments"
```
### **Training Parameters**
Customize training parameters:
```bash
# Training configuration
BATCH_SIZE=2
GRADIENT_ACCUMULATION_STEPS=8
LEARNING_RATE=5e-6
MAX_EPOCHS=3
MAX_SEQ_LENGTH=4096
```
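With these defaults, each optimizer step sees an effective batch of `BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS = 2 × 8 = 16` sequences per GPU, which is the number to keep in mind when adjusting the learning rate or memory usage.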
## πŸ“ Output Structure
After running the pipeline, you'll have:
```
├── training_dataset/                    # Prepared dataset
│   ├── train.json
│   └── validation.json
├── output-checkpoint/                   # Model checkpoints
│   ├── config.json
│   ├── pytorch_model.bin
│   └── training_results/
├── training.log                         # Training logs
├── training_summary.md                  # Summary report
└── config/train_smollm3_end_to_end.py   # Training config
```
## 🌐 Online Resources
The pipeline creates these online resources:
- **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
- **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
- **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
## 🛠️ Troubleshooting
### **Common Issues**
1. **HF Token Issues**
```bash
# Verify your token is correct
huggingface-cli whoami
```
2. **CUDA Issues**
```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
```
3. **Memory Issues**
```bash
# Reduce batch size or gradient accumulation
BATCH_SIZE=1
GRADIENT_ACCUMULATION_STEPS=16
```
4. **Dataset Issues**
```bash
# Test dataset access
python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
```
### **Debug Mode**
Run individual components for debugging:
```bash
# Test Trackio deployment
cd scripts/trackio_tonic
python deploy_trackio_space.py
# Test dataset setup
cd scripts/dataset_tonic
python setup_hf_dataset.py
# Test training
python src/train.py config/train_smollm3_end_to_end.py --help
```
## 📚 Advanced Usage
### **Custom Datasets**
For custom datasets, ensure they have one of these formats:
```json
// Format 1: Prompt/Completion
{
  "prompt": "What is machine learning?",
  "completion": "Machine learning is..."
}

// Format 2: Instruction/Output
{
  "instruction": "Explain machine learning",
  "output": "Machine learning is..."
}

// Format 3: Chat format
{
  "messages": [
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "ML is..."}
  ]
}
```
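A minimal sketch of how these three formats can be normalized to prompt/completion pairs; the pipeline's own detection logic may differ:

```python
def normalize(example):
    """Map any of the three supported formats to a prompt/completion pair."""
    if "prompt" in example and "completion" in example:
        return {"prompt": example["prompt"], "completion": example["completion"]}
    if "instruction" in example:
        return {"prompt": example["instruction"], "completion": example["output"]}
    if "messages" in example:
        messages = example["messages"]
        return {
            "prompt": next(m["content"] for m in messages if m["role"] == "user"),
            "completion": next(m["content"] for m in messages if m["role"] == "assistant"),
        }
    raise ValueError("Unrecognized dataset format")
```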
### **Custom Models**
To use different models, update the configuration:
```bash
MODEL_NAME="microsoft/DialoGPT-medium"
MAX_SEQ_LENGTH=1024
```
### **Custom Training**
Modify training parameters in the generated config:
```python
# In config/train_smollm3_end_to_end.py
config = SmolLM3Config(
    learning_rate=1e-5,  # Custom learning rate
    max_iters=5000,      # Custom training steps
    # ... other parameters
)
```
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test the pipeline
5. Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Hugging Face for the excellent transformers library
- The SmolLM3 team for the base model
- The Trackio team for experiment tracking
- The open-source community for contributions
## 📞 Support
For issues and questions:
1. Check the troubleshooting section
2. Review the logs in `training.log`
3. Check the Trackio Space for monitoring data
4. Open an issue on GitHub
---
**Happy Fine-tuning! 🚀**