# Trackio Integration for SmolLM3 Fine-tuning
This document describes how Trackio experiment tracking and monitoring is integrated into the SmolLM3 fine-tuning pipeline, and how to configure, deploy, and use it.
## Features
- **SmolLM3 Fine-tuning**: Support for supervised fine-tuning and DPO training
- **Trackio Integration**: Complete experiment tracking and monitoring
- **Hugging Face Spaces Deployment**: Easy deployment of Trackio monitoring interface
- **Comprehensive Logging**: Metrics, parameters, artifacts, and system monitoring
- **Flexible Configuration**: Support for various training configurations
## Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Basic Training with Trackio
```bash
python train.py config/train_smollm3.py \
    --dataset_dir my_dataset \
    --enable_tracking \
    --trackio_url "https://your-trackio-instance.com" \
    --experiment_name "smollm3_finetune_v1"
```
### 3. Training with Custom Parameters
```bash
python train.py config/train_smollm3.py \
    --dataset_dir my_dataset \
    --batch_size 8 \
    --learning_rate 1e-5 \
    --max_iters 2000 \
    --enable_tracking \
    --trackio_url "https://your-trackio-instance.com" \
    --experiment_name "smollm3_low_lr_experiment"
```
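The source of `train.py` is not shown here, so the following is only a sketch of how flags like the ones above could be parsed and turned into config overrides. The function names `build_parser` and `cli_overrides` are hypothetical, and the rule "options left unset fall back to the config file" is an assumption about the intended behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(description="SmolLM3 fine-tuning (sketch)")
    parser.add_argument("config", help="path to a config file")
    parser.add_argument("--dataset_dir", default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--learning_rate", type=float, default=None)
    parser.add_argument("--max_iters", type=int, default=None)
    parser.add_argument("--enable_tracking", action="store_true")
    parser.add_argument("--trackio_url", default=None)
    parser.add_argument("--experiment_name", default=None)
    return parser


def cli_overrides(args: argparse.Namespace) -> dict:
    """Keep only options the user actually set, so config-file defaults survive."""
    skipped = {"config"}
    return {
        key: value
        for key, value in vars(args).items()
        if key not in skipped and value is not None and value is not False
    }


args = build_parser().parse_args(
    ["config/train_smollm3.py", "--batch_size", "8", "--learning_rate", "1e-5"]
)
overrides = cli_overrides(args)  # only batch_size and learning_rate are present
```

Defaulting every optional flag to `None` makes it possible to tell "not passed" apart from "passed with the default value", which matters when the config file defines its own defaults.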
## Trackio Integration
### Configuration
Add Trackio settings to your configuration:
```python
# In your config file
config = SmolLM3Config(
    # ... other settings ...

    # Trackio monitoring configuration
    enable_tracking=True,
    trackio_url="https://your-trackio-instance.com",
    trackio_token="your_token_here",  # Optional
    log_artifacts=True,
    log_metrics=True,
    log_config=True,
    experiment_name="my_experiment",
)
```
### Environment Variables
You can also set Trackio configuration via environment variables:
```bash
export TRACKIO_URL="https://your-trackio-instance.com"
export TRACKIO_TOKEN="your_token_here"
```
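How the pipeline combines these variables with values from the config file is not stated here. As a minimal sketch, assuming explicit values take precedence and the environment is only a fallback, the lookup could look like this (`resolve_trackio_settings` is a hypothetical helper, not part of the project):

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    url: Optional[str] = None,
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (url, token), preferring explicit values over environment variables."""
    # Assumption: explicit arguments win; TRACKIO_URL / TRACKIO_TOKEN are fallbacks.
    env = os.environ if env is None else env
    resolved_url = url or env.get("TRACKIO_URL")
    resolved_token = token or env.get("TRACKIO_TOKEN")
    return resolved_url, resolved_token
```

Passing `env` explicitly keeps the helper easy to test without modifying the real process environment.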
### What Gets Tracked
- **Configuration**: All training parameters and model settings
- **Metrics**: Loss, accuracy, learning rate, and custom metrics
- **System Metrics**: GPU memory, CPU usage, training time
- **Artifacts**: Model checkpoints, evaluation results
- **Training Summary**: Final results and experiment duration
## Hugging Face Spaces Deployment
### Deploy Trackio Monitoring Interface
1. **Create a new Space** on Hugging Face:
- Go to https://huggingface.co/spaces
- Click "Create new Space"
- Choose "Gradio" as the SDK
- Set visibility (Public or Private)
2. **Upload the deployment files**:
- `app.py` - The Gradio interface
- `requirements_space.txt` - Dependencies (Spaces installs from a file named `requirements.txt`, so upload it under that name)
- `README.md` - Documentation
3. **Configure the Space**:
- The Space will automatically install dependencies
- The Gradio interface will be available at your Space URL
### Using the Trackio Space
1. **Create Experiments**: Use the "Create Experiment" tab to start new experiments
2. **Log Metrics**: Use the "Log Metrics" tab to track training progress
3. **View Results**: Use the "View Experiments" tab to see experiment details
4. **Update Status**: Use the "Update Status" tab to mark experiments as completed
### Integration with Your Training
To connect your training script to the Trackio Space:
```python
# In your training script
from monitoring import SmolLM3Monitor

# Initialize monitor
monitor = SmolLM3Monitor(
    experiment_name="my_experiment",
    trackio_url="https://your-space.hf.space",  # Your Space URL
    enable_tracking=True,
)

# Log configuration
monitor.log_config(config_dict)

# Log metrics during training
monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)

# Log final results
monitor.log_training_summary(final_results)
```
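To show where these calls would sit in a training loop, here is a toy sketch. `RecordingMonitor` is a hypothetical stand-in that only mirrors the three method names used above and stores events in memory. It is not the real `SmolLM3Monitor`, whose internals are not shown in this document. The "training step" just reads a precomputed loss.

```python
from typing import Any, Dict, List, Optional


class RecordingMonitor:
    """Hypothetical stand-in exposing the same method names as the snippet above."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def log_config(self, config: Dict[str, Any]) -> None:
        self.events.append({"type": "config", "data": dict(config)})

    def log_metrics(self, metrics: Dict[str, float], step: Optional[int] = None) -> None:
        self.events.append({"type": "metrics", "step": step, "data": dict(metrics)})

    def log_training_summary(self, summary: Dict[str, Any]) -> None:
        self.events.append({"type": "summary", "data": dict(summary)})


def run_training(monitor, losses: List[float], log_every: int = 2) -> Dict[str, Any]:
    """Toy loop: log the config once, metrics every `log_every` steps, then a summary."""
    monitor.log_config({"max_iters": len(losses), "log_every": log_every})
    final_loss = None
    for step, loss in enumerate(losses, start=1):
        final_loss = loss  # a real loop would run the forward/backward pass here
        if step % log_every == 0:
            monitor.log_metrics({"loss": loss}, step=step)
    summary = {"final_loss": final_loss, "steps": len(losses)}
    monitor.log_training_summary(summary)
    return summary
```

Logging every `log_every` steps rather than every step keeps the number of requests to the tracking server small. The right interval depends on step time and is a choice left to the user.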
## Configuration Files
### Main Configuration (`config/train_smollm3.py`)
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmolLM3Config:
    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096

    # Training configuration
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000

    # Trackio monitoring
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
    trackio_token: Optional[str] = None
    experiment_name: Optional[str] = None
```
### DPO Configuration (`config/train_smollm3_dpo.py`)
```python
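from dataclasses import dataclass
from typing import Optional

# SmolLM3Config is the dataclass defined in config/train_smollm3.py


@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    # DPO-specific settings
    beta: float = 0.1
    max_prompt_length: int = 2048

    # Trackio monitoring (inherited from SmolLM3Config; repeated here for visibility)
    enable_tracking: bool = True
    trackio_url: Optional[str] = None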
```
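Because both configs are dataclasses, variants of an experiment can be derived with the standard library and flattened into the kind of dictionary that `monitor.log_config` receives. The sketch below uses a trimmed copy of the classes above. `loggable_config` and the token masking are illustrative assumptions, not part of the project.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass
class SmolLM3Config:
    # Trimmed copy of the fields shown above
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
    trackio_token: Optional[str] = None
    experiment_name: Optional[str] = None


@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    beta: float = 0.1
    max_prompt_length: int = 2048


def loggable_config(config: SmolLM3Config) -> dict:
    """Flatten a config to a plain dict, masking the token so it is never logged."""
    data = asdict(config)
    if data.get("trackio_token"):
        data["trackio_token"] = "***"
    return data


base = SmolLM3DPOConfig(trackio_token="secret")
# replace() returns a modified copy; `base` itself is left unchanged
variant = replace(base, learning_rate=1e-5, experiment_name="dpo_lr_1e-5")
config_dict = loggable_config(variant)
```

`asdict` includes inherited fields, so `config_dict` contains both the base training parameters and the DPO-specific `beta` and `max_prompt_length`.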
## Monitoring Features
### Real-time Metrics
- Training loss and evaluation metrics
- Learning rate scheduling
- GPU memory and utilization
- Training time and progress
### Artifact Tracking
- Model checkpoints at regular intervals
- Evaluation results and plots
- Configuration snapshots
- Training logs and summaries
### Experiment Management
- Experiment naming and organization
- Status tracking (running, completed, failed)
- Parameter comparison across experiments
- Result visualization
## Advanced Usage
### Custom Metrics
```python
# Log custom metrics
monitor.log_metrics({
    "custom_metric": value,
    "perplexity": perplexity_score,
    "bleu_score": bleu_score,
}, step=current_step)
```
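The snippet above assumes `perplexity_score` is already available. One common way to obtain it is to exponentiate the mean per-token cross-entropy loss (in nats). A minimal sketch, with `perplexity_from_losses` as a hypothetical helper name:

```python
import math
from typing import Sequence


def perplexity_from_losses(token_losses: Sequence[float]) -> float:
    """Perplexity = exp(mean per-token cross-entropy), with losses in nats."""
    if not token_losses:
        raise ValueError("need at least one loss value")
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)
```

If the training loop already reports a mean loss per batch, `math.exp(mean_loss)` on that value gives the batch perplexity directly.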
### System Monitoring
```python
# Log system metrics
monitor.log_system_metrics(step=current_step)
```
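What `log_system_metrics` collects internally is not shown in this document. As a rough sketch of the kind of values involved (elapsed time, CPU load, GPU memory), the hypothetical helper below uses only the standard library and reports GPU figures only when PyTorch is installed and a CUDA device is visible:

```python
import os
import time
from typing import Dict, Optional


def collect_system_metrics(start_time: Optional[float] = None) -> Dict[str, float]:
    """Gather a few host metrics; `start_time` should come from time.monotonic()."""
    metrics: Dict[str, float] = {}
    if start_time is not None:
        metrics["elapsed_seconds"] = time.monotonic() - start_time
    try:
        # 1-minute load average; os.getloadavg is missing on Windows
        metrics["cpu_load_1min"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    try:
        import torch  # optional dependency

        if torch.cuda.is_available():
            gib = 1024 ** 3
            metrics["gpu_memory_allocated_gb"] = torch.cuda.memory_allocated() / gib
            metrics["gpu_memory_reserved_gb"] = torch.cuda.memory_reserved() / gib
    except ImportError:
        pass
    return metrics
```

The returned dictionary has the same shape as the argument to `monitor.log_metrics`, so it could be logged alongside training metrics at the same step.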
### Artifact Logging
```python
# Log model checkpoint
monitor.log_model_checkpoint("checkpoint-1000", step=1000)
# Log evaluation results
monitor.log_evaluation_results(eval_results, step=1000)
```
## Troubleshooting
### Common Issues
1. **Trackio not available**: Install with `pip install trackio`
2. **Connection errors**: Check your Trackio URL and token
3. **Missing metrics**: Ensure monitoring is enabled in configuration
4. **Space deployment issues**: Check Gradio version compatibility
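
For connection errors, a quick offline check of the URL's shape can rule out simple mistakes before looking at the network or the token. This is an illustrative sketch, and `check_trackio_url` is a hypothetical helper. It does not contact the server, so a well-formed URL can still be unreachable.

```python
from typing import List
from urllib.parse import urlparse


def check_trackio_url(url: str) -> List[str]:
    """Return the problems found in the URL; an empty list means it looks well formed."""
    problems: List[str] = []
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if not parsed.netloc:
        problems.append("URL has no host name")
    if url != url.strip():
        problems.append("URL has leading or trailing whitespace")
    return problems
```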
### Debug Mode
Enable debug logging:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License
This project is licensed under the MIT License; see the LICENSE file for details.