# SmolLM3 End-to-End Fine-tuning Pipeline
This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment.
## 🚀 Quick Start
### 1. Setup Configuration
```bash
# Run the setup script to configure with your information
python setup_launch.py
```
### 2. Check Requirements
```bash
# Verify all dependencies are installed
python check_requirements.py
```
### 3. Run the Pipeline
```bash
# Make the script executable and run
chmod +x launch.sh
./launch.sh
```
This will prompt you for:
- Your Hugging Face token
- Optional model and dataset customizations
## 📋 What the Pipeline Does
The end-to-end pipeline performs the following steps:
### 1. **Environment Setup**
- Installs system dependencies
- Creates Python virtual environment
- Installs PyTorch with CUDA support
- Installs all required Python packages
### 2. **Trackio Space Deployment**
- Creates a new Hugging Face Space for experiment tracking (a minimal sketch follows this list)
- Configures the Trackio monitoring interface
- Sets up environment variables
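For illustration, this step amounts to `huggingface_hub` calls like the sketch below; the repo names and token are placeholders, and the pipeline's actual logic lives in `scripts/trackio_tonic/deploy_trackio_space.py`:

```python
from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")  # placeholder token
space_id = "your-username/trackio-monitoring-20240101"  # placeholder name

# Create a Gradio Space to host the Trackio monitoring interface
api.create_repo(repo_id=space_id, repo_type="space", space_sdk="gradio", exist_ok=True)

# Point the Space at the experiment dataset via a Space variable
api.add_space_variable(
    repo_id=space_id,
    key="TRACKIO_DATASET_REPO",
    value="your-username/trackio-experiments",
)
```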
### 3. **HF Dataset Setup**
- Creates a Hugging Face Dataset repository for experiment storage (sketched after this list)
- Configures dataset access and permissions
- Sets up the initial experiment data structure
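A rough sketch of the equivalent `huggingface_hub` calls; the repo name matches `TRACKIO_DATASET_REPO`, while the file name and JSON layout are assumptions (the real logic is in `scripts/dataset_tonic/setup_hf_dataset.py`):

```python
import json

from huggingface_hub import HfApi

api = HfApi(token="your_hf_token_here")  # placeholder token
dataset_id = "your-username/trackio-experiments"  # matches TRACKIO_DATASET_REPO

# Create a private dataset repo to hold experiment records
api.create_repo(repo_id=dataset_id, repo_type="dataset", private=True, exist_ok=True)

# Seed it with an empty experiment list (file name and layout are assumptions)
api.upload_file(
    path_or_fileobj=json.dumps({"experiments": []}).encode(),
    path_in_repo="experiments.json",
    repo_id=dataset_id,
    repo_type="dataset",
)
```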
### 4. **Dataset Preparation**
- Downloads the specified dataset from Hugging Face Hub
- Converts to training format (prompt/completion pairs)
- Handles multiple dataset formats automatically
- Creates train/validation splits (see the example below)
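As an example, converting a chat-style dataset to prompt/completion pairs and splitting it might look like this sketch with the `datasets` library; the config name, column names, and split ratio are assumptions, since the pipeline detects formats automatically:

```python
import os

from datasets import load_dataset

# Load the source dataset (the "all" config name is an assumption for smoltalk)
raw = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

def to_pairs(example):
    # Assumes a chat-style "messages" column; the pipeline handles more formats
    messages = example["messages"]
    return {
        "prompt": next(m["content"] for m in messages if m["role"] == "user"),
        "completion": next(m["content"] for m in messages if m["role"] == "assistant"),
    }

pairs = raw.map(to_pairs, remove_columns=raw.column_names)

# 95/5 train/validation split (the ratio is an assumption)
splits = pairs.train_test_split(test_size=0.05, seed=42)

os.makedirs("training_dataset", exist_ok=True)
splits["train"].to_json("training_dataset/train.json")
splits["test"].to_json("training_dataset/validation.json")
```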
### 5. **Training Configuration**
- Creates optimized training configuration
- Sets up monitoring integration
- Configures model parameters and hyperparameters
### 6. **Model Training**
- Runs the SmolLM3 fine-tuning process
- Logs metrics to Trackio Space in real time (see the callback sketch below)
- Saves experiment data to HF Dataset
- Creates checkpoints during training
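Conceptually, the real-time logging behaves like a `transformers` `TrainerCallback` that forwards each log event to the Space. The sketch below is illustrative only; the endpoint and experiment identifier are hypothetical, not the project's actual API:

```python
import requests
from transformers import TrainerCallback

class TrackioCallback(TrainerCallback):
    """Illustrative callback that forwards training logs to a monitoring endpoint."""

    def __init__(self, space_url: str, experiment_id: str):
        self.space_url = space_url          # Trackio Space URL (placeholder)
        self.experiment_id = experiment_id  # hypothetical experiment identifier

    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` holds loss, learning_rate, epoch, etc. at each logging step
        if logs:
            requests.post(
                f"{self.space_url}/api/log",  # hypothetical endpoint
                json={
                    "experiment": self.experiment_id,
                    "step": state.global_step,
                    "metrics": logs,
                },
                timeout=5,
            )
```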
### 7. **Model Deployment**
- Pushes the trained model to the Hugging Face Hub (see the upload sketch below)
- Creates comprehensive model card
- Uploads training results and logs
- Tests the uploaded model
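The upload step boils down to `push_to_hub` calls like these (the repo name is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the final checkpoint produced by training
model = AutoModelForCausalLM.from_pretrained("output-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("output-checkpoint")

# Push both to the Hub (placeholder repo name)
model.push_to_hub("your-username/smollm3-finetuned-20240101")
tokenizer.push_to_hub("your-username/smollm3-finetuned-20240101")

# Smoke-test the upload by reloading the model from the Hub
reloaded = AutoModelForCausalLM.from_pretrained("your-username/smollm3-finetuned-20240101")
```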
### 8. **Summary Report**
- Generates detailed training summary
- Provides links to all resources
- Documents configuration and results
## 🎯 Features
### **Integrated Monitoring**
- Real-time experiment tracking via Trackio Space
- Persistent storage in Hugging Face Datasets
- Comprehensive metrics logging
- System resource monitoring
### **Flexible Dataset Support**
- Automatic format detection and conversion
- Support for multiple dataset types
- Built-in data preprocessing
- Train/validation split handling
### **Optimized Training**
- Flash Attention support for efficiency
- Gradient checkpointing for memory optimization
- Mixed precision training (illustrated below)
- Automatic hyperparameter optimization
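In `transformers` terms, these optimizations roughly map to settings like the following sketch; the values mirror the defaults in `launch.sh`, not the generated config:

```python
from transformers import TrainingArguments

# Flash Attention is enabled at model load time, e.g.
# AutoModelForCausalLM.from_pretrained(..., attn_implementation="flash_attention_2")
args = TrainingArguments(
    output_dir="output-checkpoint",
    per_device_train_batch_size=2,  # BATCH_SIZE
    gradient_accumulation_steps=8,  # GRADIENT_ACCUMULATION_STEPS
    gradient_checkpointing=True,    # trade compute for memory
    bf16=True,                      # mixed precision (Ampere+ GPUs)
    learning_rate=5e-6,
    num_train_epochs=3,
)
```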
### **Complete Deployment**
- Automated model upload to Hugging Face Hub
- Comprehensive model cards
- Training results documentation
- Model testing and validation
## 📊 Monitoring & Tracking
### **Trackio Space Interface**
- Real-time training metrics visualization
- Experiment management and comparison
- System resource monitoring
- Training progress tracking
### **HF Dataset Storage**
- Persistent experiment data storage
- Version-controlled experiment history
- Collaborative experiment sharing
- Automated data backup
## 🔧 Configuration
### **Required Configuration**
Update these variables in `launch.sh`:
```bash
# Your Hugging Face credentials
HF_TOKEN="your_hf_token_here"
HF_USERNAME="your-username"
# Model and dataset
MODEL_NAME="HuggingFaceTB/SmolLM3-3B"
DATASET_NAME="HuggingFaceTB/smoltalk"
# Output repositories
REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)"
TRACKIO_DATASET_REPO="your-username/trackio-experiments"
```
### **Training Parameters**
Customize training parameters:
```bash
# Training configuration
BATCH_SIZE=2
GRADIENT_ACCUMULATION_STEPS=8
LEARNING_RATE=5e-6
MAX_EPOCHS=3
MAX_SEQ_LENGTH=4096
```
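With these defaults, each optimizer step sees an effective batch of `BATCH_SIZE × GRADIENT_ACCUMULATION_STEPS = 2 × 8 = 16` sequences per GPU, which is the number to keep in mind when adjusting the learning rate or memory usage.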
## πŸ“ Output Structure
After running the pipeline, you'll have:
```
├── training_dataset/                    # Prepared dataset
│   ├── train.json
│   └── validation.json
├── output-checkpoint/                   # Model checkpoints
│   ├── config.json
│   ├── pytorch_model.bin
│   └── training_results/
├── training.log                         # Training logs
├── training_summary.md                  # Summary report
└── config/train_smollm3_end_to_end.py   # Training config
```
## 🌐 Online Resources
The pipeline creates these online resources:
- **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD`
- **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD`
- **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments`
## 🛠️ Troubleshooting
### **Common Issues**
1. **HF Token Issues**
```bash
# Verify your token is correct
huggingface-cli whoami
```
2. **CUDA Issues**
```bash
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
```
3. **Memory Issues**
```bash
# Reduce batch size or gradient accumulation
BATCH_SIZE=1
GRADIENT_ACCUMULATION_STEPS=16
```
4. **Dataset Issues**
```bash
# Test dataset access
python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))"
```
### **Debug Mode**
Run individual components for debugging:
```bash
# Test Trackio deployment
cd scripts/trackio_tonic
python deploy_trackio_space.py
# Test dataset setup
cd scripts/dataset_tonic
python setup_hf_dataset.py
# Test training
python src/train.py config/train_smollm3_end_to_end.py --help
```
## 📚 Advanced Usage
### **Custom Datasets**
For custom datasets, ensure they have one of these formats:
```json
// Format 1: Prompt/Completion
{
  "prompt": "What is machine learning?",
  "completion": "Machine learning is..."
}

// Format 2: Instruction/Output
{
  "instruction": "Explain machine learning",
  "output": "Machine learning is..."
}

// Format 3: Chat format
{
  "messages": [
    {"role": "user", "content": "What is ML?"},
    {"role": "assistant", "content": "ML is..."}
  ]
}
```
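A minimal sketch of how these three formats can be normalized to prompt/completion pairs; the pipeline's own detection logic may differ:

```python
def normalize(example):
    """Map any of the three supported formats to a prompt/completion pair."""
    if "prompt" in example and "completion" in example:
        return {"prompt": example["prompt"], "completion": example["completion"]}
    if "instruction" in example:
        return {"prompt": example["instruction"], "completion": example["output"]}
    if "messages" in example:
        messages = example["messages"]
        return {
            "prompt": next(m["content"] for m in messages if m["role"] == "user"),
            "completion": next(m["content"] for m in messages if m["role"] == "assistant"),
        }
    raise ValueError("Unrecognized dataset format")
```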
### **Custom Models**
To use different models, update the configuration:
```bash
MODEL_NAME="microsoft/DialoGPT-medium"
MAX_SEQ_LENGTH=1024
```
### **Custom Training**
Modify training parameters in the generated config:
```python
# In config/train_smollm3_end_to_end.py
config = SmolLM3Config(
    learning_rate=1e-5,  # Custom learning rate
    max_iters=5000,      # Custom training steps
    # ... other parameters
)
```
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test the pipeline
5. Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Hugging Face for the excellent transformers library
- The SmolLM3 team for the base model
- The Trackio team for experiment tracking
- The open-source community for contributions
## 📞 Support
For issues and questions:
1. Check the troubleshooting section
2. Review the logs in `training.log`
3. Check the Trackio Space for monitoring data
4. Open an issue on GitHub
---
**Happy Fine-tuning! 🚀**