# Trackio Integration for SmolLM3 Fine-tuning

This document describes the Trackio experiment tracking and monitoring integration for the SmolLM3 fine-tuning pipeline.

## Features

- **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
- **Trackio Integration**: Complete experiment tracking and monitoring
- **Hugging Face Spaces Deployment**: Easy deployment of the Trackio monitoring interface
- **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
- **Flexible Configuration**: Support for various training configurations
## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Basic Training with Trackio

```bash
python train.py config/train_smollm3.py \
    --dataset_dir my_dataset \
    --enable_tracking \
    --trackio_url "https://your-trackio-instance.com" \
    --experiment_name "smollm3_finetune_v1"
```
### 3. Training with Custom Parameters

```bash
python train.py config/train_smollm3.py \
    --dataset_dir my_dataset \
    --batch_size 8 \
    --learning_rate 1e-5 \
    --max_iters 2000 \
    --enable_tracking \
    --trackio_url "https://your-trackio-instance.com" \
    --experiment_name "smollm3_low_lr_experiment"
```
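
The internals of `train.py` are not shown in this document. As a rough sketch of how flags like these could override configuration defaults, the example below parses a few of them with `argparse` and applies only the ones that were passed. `TrainSettings` and `apply_cli_overrides` are hypothetical names introduced for illustration; they are assumptions, not the project's code.

```python
import argparse
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class TrainSettings:
    # Hypothetical subset of settings; defaults mirror values used in this document
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000


def apply_cli_overrides(
    settings: TrainSettings, argv: Optional[List[str]] = None
) -> TrainSettings:
    """Return a copy of `settings` with any flags present in `argv` applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    parser.add_argument("--max_iters", type=int)
    args = parser.parse_args(argv)
    # Keep only the flags the user actually passed (unset flags parse as None)
    overrides = {k: v for k, v in vars(args).items() if v is not None}
    return replace(settings, **overrides)


settings = apply_cli_overrides(
    TrainSettings(),
    ["--batch_size", "8", "--learning_rate", "1e-5", "--max_iters", "2000"],
)
print(settings)
```

Returning a new object with `dataclasses.replace` leaves the defaults untouched, which makes it easy to log both the base configuration and the overrides.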
## Trackio Integration

### Configuration

Add Trackio settings to your configuration:

```python
# In your config file
config = SmolLM3Config(
    # ... other settings ...
    # Trackio monitoring configuration
    enable_tracking=True,
    trackio_url="https://your-trackio-instance.com",
    trackio_token="your_token_here",  # Optional
    log_artifacts=True,
    log_metrics=True,
    log_config=True,
    experiment_name="my_experiment",
)
```

### Environment Variables

You can also set the Trackio configuration via environment variables:

```bash
export TRACKIO_URL="https://your-trackio-instance.com"
export TRACKIO_TOKEN="your_token_here"
```
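
How the pipeline combines these variables with explicit settings is not specified here. One plausible approach is to let explicit arguments win and use the environment as a fallback. The helper below is a minimal sketch under that assumption; `resolve_trackio_settings` is a hypothetical name, and only the variable names `TRACKIO_URL` and `TRACKIO_TOKEN` come from this document.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    url: Optional[str] = None,
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: explicit arguments win, environment variables are the fallback."""
    env = os.environ if env is None else env
    resolved_url = url or env.get("TRACKIO_URL")
    resolved_token = token or env.get("TRACKIO_TOKEN")
    return resolved_url, resolved_token


if __name__ == "__main__":
    url, token = resolve_trackio_settings()
    print("Trackio URL:", url or "<not set>")
    print("Token provided:", token is not None)
```

Passing `env` explicitly keeps the function easy to test without modifying the real process environment.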
### What Gets Tracked

- **Configuration**: All training parameters and model settings
- **Metrics**: Loss, accuracy, learning rate, and custom metrics
- **System Metrics**: GPU memory, CPU usage, training time
- **Artifacts**: Model checkpoints, evaluation results
- **Training Summary**: Final results and experiment duration
## Hugging Face Spaces Deployment

### Deploy Trackio Monitoring Interface

1. **Create a new Space** on Hugging Face:
   - Go to https://huggingface.co/spaces
   - Click "Create new Space"
   - Choose "Gradio" as the SDK
   - Set visibility (Public or Private)
2. **Upload the deployment files**:
   - `app.py` - The Gradio interface
   - `requirements_space.txt` - Dependencies
   - `README.md` - Documentation
3. **Configure the Space**:
   - The Space will automatically install dependencies
   - The Gradio interface will be available at your Space URL

### Using the Trackio Space

1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
3. **View Results**: Use the "View Experiments" tab to see experiment details
4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
### Integration with Your Training

To connect your training script to the Trackio Space:

```python
# In your training script
from monitoring import SmolLM3Monitor

# Initialize monitor
monitor = SmolLM3Monitor(
    experiment_name="my_experiment",
    trackio_url="https://your-space.hf.space",  # Your Space URL
    enable_tracking=True,
)

# Log configuration
monitor.log_config(config_dict)

# Log metrics during training
monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)

# Log final results
monitor.log_training_summary(final_results)
```
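
To exercise this call sequence without a running Space, a stand-in object with the same method names can record calls in memory. The class below is a hypothetical test double, not the project's `SmolLM3Monitor`; only the method names `log_config`, `log_metrics`, and `log_training_summary` come from the example above, and the loss values are made up.

```python
from typing import Any, Dict, List, Tuple


class RecordingMonitor:
    """Hypothetical stand-in that stores calls in memory instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def log_config(self, config: Dict[str, Any]) -> None:
        self.calls.append(("config", dict(config)))

    def log_metrics(self, metrics: Dict[str, float], step: int) -> None:
        self.calls.append(("metrics", (step, dict(metrics))))

    def log_training_summary(self, summary: Dict[str, Any]) -> None:
        self.calls.append(("summary", dict(summary)))


def run_dummy_training(monitor: RecordingMonitor, steps: int = 3) -> None:
    """Placeholder loop that mimics the order of monitoring calls."""
    monitor.log_config({"learning_rate": 2e-5, "batch_size": 4})
    loss = 1.0
    for step in range(1, steps + 1):
        loss *= 0.5  # fake a decreasing loss
        monitor.log_metrics({"loss": loss}, step=step)
    monitor.log_training_summary({"final_loss": loss, "steps": steps})


monitor = RecordingMonitor()
run_dummy_training(monitor)
for kind, payload in monitor.calls:
    print(kind, payload)
```

A double like this is useful in unit tests, where the training loop should be checked for what it logs and when, independent of network access.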
## Configuration Files

### Main Configuration (`config/train_smollm3.py`)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmolLM3Config:
    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096
    # Training configuration
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000
    # Trackio monitoring
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
    trackio_token: Optional[str] = None
    experiment_name: Optional[str] = None
```
### DPO Configuration (`config/train_smollm3_dpo.py`)

```python
from dataclasses import dataclass
from typing import Optional

# Assumes config/ is importable as a package
from config.train_smollm3 import SmolLM3Config


@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    # DPO-specific settings
    beta: float = 0.1
    max_prompt_length: int = 2048
    # Trackio monitoring (inherited from SmolLM3Config, redeclared to make the defaults explicit)
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
```
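
Because both configurations are dataclasses, they convert to a plain dictionary with `dataclasses.asdict`, which is a convenient shape to pass to `monitor.log_config`. The sketch below redeclares trimmed-down versions of the two classes so that it runs on its own; the reduced field set is an assumption made for brevity.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SmolLM3Config:
    # Trimmed copy of the fields shown above, for illustration only
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 4
    learning_rate: float = 2e-5
    enable_tracking: bool = True
    trackio_url: Optional[str] = None


@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    beta: float = 0.1
    max_prompt_length: int = 2048


dpo_config = SmolLM3DPOConfig(beta=0.2, trackio_url="https://your-space.hf.space")
config_dict = asdict(dpo_config)  # contains inherited and DPO-specific fields
print(config_dict)
```

Inherited fields come first in the resulting dictionary, followed by the fields declared on the subclass.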
## Monitoring Features

### Real-time Metrics

- Training loss and evaluation metrics
- Learning rate scheduling
- GPU memory and utilization
- Training time and progress

### Artifact Tracking

- Model checkpoints at regular intervals
- Evaluation results and plots
- Configuration snapshots
- Training logs and summaries

### Experiment Management

- Experiment naming and organization
- Status tracking (running, completed, failed)
- Parameter comparison across experiments
- Result visualization
## Advanced Usage

### Custom Metrics

```python
# Log custom metrics
monitor.log_metrics({
    "custom_metric": value,
    "perplexity": perplexity_score,
    "bleu_score": bleu_score,
}, step=current_step)
```
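
Perplexity is commonly derived from the mean cross-entropy loss (in nats) as `exp(loss)`. The helper below is a minimal sketch under that assumption; `perplexity_from_loss` and its overflow cap are illustrative choices, not part of the monitoring module.

```python
import math


def perplexity_from_loss(mean_loss: float, max_loss: float = 50.0) -> float:
    """Return exp(mean_loss), clamping very large losses to avoid overflow."""
    if mean_loss < 0:
        raise ValueError("cross-entropy loss should be non-negative")
    return math.exp(min(mean_loss, max_loss))


# Build a metrics dictionary in the shape used by monitor.log_metrics
metrics = {"loss": 2.0, "perplexity": perplexity_from_loss(2.0)}
print(metrics)
```

The clamp matters mostly early in training or after a divergence, when a very large loss would otherwise raise `OverflowError`.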
### System Monitoring

```python
# Log system metrics
monitor.log_system_metrics(step=current_step)
```

### Artifact Logging

```python
# Log model checkpoint
monitor.log_model_checkpoint("checkpoint-1000", step=1000)

# Log evaluation results
monitor.log_evaluation_results(eval_results, step=1000)
```
## Troubleshooting

### Common Issues

1. **Trackio not available**: Install with `pip install trackio`
2. **Connection errors**: Check your Trackio URL and token
3. **Missing metrics**: Ensure monitoring is enabled in configuration
4. **Space deployment issues**: Check Gradio version compatibility
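
Some connection errors come from a malformed URL rather than a network problem. As a quick local check, the hypothetical helper below validates the shape of the URL with the standard library before any request is made. It does not contact the server, and `check_trackio_url` is a name introduced here for illustration.

```python
from typing import List
from urllib.parse import urlparse


def check_trackio_url(url: str) -> List[str]:
    """Return a list of problems found in the URL (empty if it looks well-formed)."""
    problems = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if not parsed.netloc:
        problems.append("URL is missing a host name")
    return problems


print(check_trackio_url("https://your-space.hf.space"))
print(check_trackio_url("your-space.hf.space"))
```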
### Debug Mode

Enable debug logging:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

This project is licensed under the MIT License - see the LICENSE file for details.