Spaces:
				
			
			
	
			
			
					
		Running
		
	
	
	
			
			
	
	
	
	
		
		
					
		Running
		
	| # SmolLM3 End-to-End Fine-tuning Pipeline | |
| This repository provides a complete end-to-end pipeline for fine-tuning SmolLM3 models with integrated experiment tracking, monitoring, and model deployment. | |
| ## π Quick Start | |
| ### 1. Setup Configuration | |
| ```bash | |
| # Run the setup script to configure with your information | |
| python setup_launch.py | |
| ``` | |
| ### 2. Check Requirements | |
| ```bash | |
| # Verify all dependencies are installed | |
| python check_requirements.py | |
| ``` | |
| ### 3. Run the Pipeline | |
| ```bash | |
| # Make the script executable and run | |
| chmod +x launch.sh | |
| ./launch.sh | |
| ``` | |
| This will prompt you for: | |
| - Your Hugging Face token | |
| - Optional model and dataset customizations | |
| ## π What the Pipeline Does | |
| The end-to-end pipeline performs the following steps: | |
| ### 1. **Environment Setup** | |
| - Installs system dependencies | |
| - Creates Python virtual environment | |
| - Installs PyTorch with CUDA support | |
| - Installs all required Python packages | |
| ### 2. **Trackio Space Deployment** | |
| - Creates a new Hugging Face Space for experiment tracking | |
| - Configures the Trackio monitoring interface | |
| - Sets up environment variables | |
| ### 3. **HF Dataset Setup** | |
| - Creates a Hugging Face Dataset repository for experiment storage | |
| - Configures dataset access and permissions | |
| - Sets up initial experiment data structure | |
| ### 4. **Dataset Preparation** | |
| - Downloads the specified dataset from Hugging Face Hub | |
| - Converts to training format (prompt/completion pairs) | |
| - Handles multiple dataset formats automatically | |
| - Creates train/validation splits | |
| ### 5. **Training Configuration** | |
| - Creates optimized training configuration | |
| - Sets up monitoring integration | |
| - Configures model parameters and hyperparameters | |
| ### 6. **Model Training** | |
| - Runs the SmolLM3 fine-tuning process | |
| - Logs metrics to Trackio Space in real-time | |
| - Saves experiment data to HF Dataset | |
| - Creates checkpoints during training | |
| ### 7. **Model Deployment** | |
| - Pushes trained model to Hugging Face Hub | |
| - Creates comprehensive model card | |
| - Uploads training results and logs | |
| - Tests the uploaded model | |
| ### 8. **Summary Report** | |
| - Generates detailed training summary | |
| - Provides links to all resources | |
| - Documents configuration and results | |
| ## π― Features | |
| ### **Integrated Monitoring** | |
| - Real-time experiment tracking via Trackio Space | |
| - Persistent storage in Hugging Face Datasets | |
| - Comprehensive metrics logging | |
| - System resource monitoring | |
| ### **Flexible Dataset Support** | |
| - Automatic format detection and conversion | |
| - Support for multiple dataset types | |
| - Built-in data preprocessing | |
| - Train/validation split handling | |
| ### **Optimized Training** | |
| - Flash Attention support for efficiency | |
| - Gradient checkpointing for memory optimization | |
| - Mixed precision training | |
| - Automatic hyperparameter optimization | |
| ### **Complete Deployment** | |
| - Automated model upload to Hugging Face Hub | |
| - Comprehensive model cards | |
| - Training results documentation | |
| - Model testing and validation | |
| ## π Monitoring & Tracking | |
| ### **Trackio Space Interface** | |
| - Real-time training metrics visualization | |
| - Experiment management and comparison | |
| - System resource monitoring | |
| - Training progress tracking | |
| ### **HF Dataset Storage** | |
| - Persistent experiment data storage | |
| - Version-controlled experiment history | |
| - Collaborative experiment sharing | |
| - Automated data backup | |
| ## π§ Configuration | |
| ### **Required Configuration** | |
| Update these variables in `launch.sh`: | |
| ```bash | |
| # Your Hugging Face credentials | |
| HF_TOKEN="your_hf_token_here" | |
| HF_USERNAME="your-username" | |
| # Model and dataset | |
| MODEL_NAME="HuggingFaceTB/SmolLM3-3B" | |
| DATASET_NAME="HuggingFaceTB/smoltalk" | |
| # Output repositories | |
| REPO_NAME="your-username/smollm3-finetuned-$(date +%Y%m%d)" | |
| TRACKIO_DATASET_REPO="your-username/trackio-experiments" | |
| ``` | |
| ### **Training Parameters** | |
| Customize training parameters: | |
| ```bash | |
| # Training configuration | |
| BATCH_SIZE=2 | |
| GRADIENT_ACCUMULATION_STEPS=8 | |
| LEARNING_RATE=5e-6 | |
| MAX_EPOCHS=3 | |
| MAX_SEQ_LENGTH=4096 | |
| ``` | |
| ## π Output Structure | |
| After running the pipeline, you'll have: | |
| ``` | |
| βββ training_dataset/ # Prepared dataset | |
| β βββ train.json | |
| β βββ validation.json | |
| βββ /output-checkpoint/ # Model checkpoints | |
| β βββ config.json | |
| β βββ pytorch_model.bin | |
| β βββ training_results/ | |
| βββ training.log # Training logs | |
| βββ training_summary.md # Summary report | |
| βββ config/train_smollm3_end_to_end.py # Training config | |
| ``` | |
| ## π Online Resources | |
| The pipeline creates these online resources: | |
| - **Model Repository**: `https://huggingface.co/your-username/smollm3-finetuned-YYYYMMDD` | |
| - **Trackio Space**: `https://huggingface.co/spaces/your-username/trackio-monitoring-YYYYMMDD` | |
| - **Experiment Dataset**: `https://huggingface.co/datasets/your-username/trackio-experiments` | |
| ## π οΈ Troubleshooting | |
| ### **Common Issues** | |
| 1. **HF Token Issues** | |
| ```bash | |
| # Verify your token is correct | |
| hf whoami | |
| ``` | |
| 2. **CUDA Issues** | |
| ```bash | |
| # Check CUDA availability | |
| python -c "import torch; print(torch.cuda.is_available())" | |
| ``` | |
| 3. **Memory Issues** | |
| ```bash | |
| # Reduce batch size or gradient accumulation | |
| BATCH_SIZE=1 | |
| GRADIENT_ACCUMULATION_STEPS=16 | |
| ``` | |
| 4. **Dataset Issues** | |
| ```bash | |
| # Test dataset access | |
| python -c "from datasets import load_dataset; print(load_dataset('your-dataset'))" | |
| ``` | |
| ### **Debug Mode** | |
| Run individual components for debugging: | |
| ```bash | |
| # Test Trackio deployment | |
| cd scripts/trackio_tonic | |
| python deploy_trackio_space.py | |
| # Test dataset setup | |
| cd scripts/dataset_tonic | |
| python setup_hf_dataset.py | |
| # Test training | |
| python src/train.py config/train_smollm3_end_to_end.py --help | |
| ``` | |
| ## π Advanced Usage | |
| ### **Custom Datasets** | |
| For custom datasets, ensure they have one of these formats: | |
| ```json | |
| // Format 1: Prompt/Completion | |
| { | |
| "prompt": "What is machine learning?", | |
| "completion": "Machine learning is..." | |
| } | |
| // Format 2: Instruction/Output | |
| { | |
| "instruction": "Explain machine learning", | |
| "output": "Machine learning is..." | |
| } | |
| // Format 3: Chat format | |
| { | |
| "messages": [ | |
| {"role": "user", "content": "What is ML?"}, | |
| {"role": "assistant", "content": "ML is..."} | |
| ] | |
| } | |
| ``` | |
| ### **Custom Models** | |
| To use different models, update the configuration: | |
| ```bash | |
| MODEL_NAME="microsoft/DialoGPT-medium" | |
| MAX_SEQ_LENGTH=1024 | |
| ``` | |
| ### **Custom Training** | |
| Modify training parameters in the generated config: | |
| ```python | |
| # In config/train_smollm3_end_to_end.py | |
| config = SmolLM3Config( | |
| learning_rate=1e-5, # Custom learning rate | |
| max_iters=5000, # Custom training steps | |
| # ... other parameters | |
| ) | |
| ``` | |
| ## π€ Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch | |
| 3. Make your changes | |
| 4. Test the pipeline | |
| 5. Submit a pull request | |
| ## π License | |
| This project is licensed under the MIT License - see the LICENSE file for details. | |
| ## π Acknowledgments | |
| - Hugging Face for the excellent transformers library | |
| - The SmolLM3 team for the base model | |
| - The Trackio team for experiment tracking | |
| - The open-source community for contributions | |
| ## π Support | |
| For issues and questions: | |
| 1. Check the troubleshooting section | |
| 2. Review the logs in `training.log` | |
| 3. Check the Trackio Space for monitoring data | |
| 4. Open an issue on GitHub | |
| --- | |
| **Happy Fine-tuning! π** | 
