# Enhanced Trackio Interface Guide
## Overview
Your Trackio application has been enhanced to monitor and visualize SmolLM3 training experiments, with interactive plots, side-by-side run comparison, and demo data generation. This guide explains how to make the most of these features.
## Key Enhancements
### 1. **Real-time Visualization**
- **Interactive Plots**: Loss curves, accuracy, learning rate, GPU metrics
- **Experiment Comparison**: Compare multiple training runs side-by-side
- **Live Updates**: Watch training progress in real-time
### 2. **Comprehensive Data Display**
- **Formatted Output**: Clean, emoji-rich experiment details
- **Statistics Overview**: Metrics count, parameters count, artifacts count
- **Status Tracking**: Visual status indicators (🟢 running, ✅ completed, ❌ failed)
### 3. **Demo Data Generation**
- **Realistic Simulation**: Generate realistic training metrics for testing
- **Multiple Metrics**: Loss, accuracy, learning rate, GPU memory, training time
- **Configurable Parameters**: Customize demo data to match your setup
## How to Use with Your SmolLM3 Training
### Step 1: Start Your Training
```bash
python run_a100_large_experiment.py \
--config config/train_smollm3_openhermes_fr_a100_balanced.py \
--trackio_url "https://tonic-test-trackio-test.hf.space" \
--experiment-name "petit-elle-l-aime-3-balanced" \
--output-dir ./outputs/balanced
```
### Step 2: Monitor in Real-time
1. **Visit your Trackio Space**: `https://tonic-test-trackio-test.hf.space`
2. **Go to "View Experiments" tab**
3. **Enter your experiment ID** (e.g., `exp_20231201_143022`)
4. **Click "View Experiment"** to see detailed information
### Step 3: Visualize Training Progress
1. **Go to the "Visualizations" tab**
2. **Enter your experiment ID**
3. **Select a metric** (loss, accuracy, learning_rate, gpu_memory, training_time)
4. **Click "Create Plot"** to see interactive charts
### Step 4: Compare Experiments
1. **Open the "Visualizations" tab**
2. **Enter multiple experiment IDs** (comma-separated)
3. **Click "Compare Experiments"** to see side-by-side comparison
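Comparing runs means splitting the comma-separated ID string and lining up each run's values by step. The sketch below shows that preparation, assuming each experiment's history is a list of dicts with a `step` key plus metric keys; `parse_experiment_ids` and `align_by_step` are hypothetical helpers, not part of Trackio.

```python
from typing import Dict, List, Optional


def parse_experiment_ids(raw: str) -> List[str]:
    """Split a comma-separated ID string, dropping blanks and duplicates."""
    ids: List[str] = []
    for part in raw.split(","):
        exp_id = part.strip()
        if exp_id and exp_id not in ids:
            ids.append(exp_id)
    return ids


def align_by_step(
    histories: Dict[str, List[dict]], metric: str
) -> Dict[int, Dict[str, Optional[float]]]:
    """Build {step: {experiment_id: value}}; None marks a run with no value at that step."""
    table: Dict[int, Dict[str, Optional[float]]] = {}
    for exp_id, entries in histories.items():
        for entry in entries:
            if metric in entry:
                row = table.setdefault(entry["step"], {e: None for e in histories})
                row[exp_id] = entry[metric]
    return dict(sorted(table.items()))  # order rows by step


ids = parse_experiment_ids("exp_a, exp_b,,exp_a")  # blanks and duplicates removed
histories = {
    "exp_a": [{"step": 0, "loss": 2.1}, {"step": 25, "loss": 1.7}],
    "exp_b": [{"step": 25, "loss": 1.9}],
}
comparison = align_by_step(histories, "loss")
```

Keeping `None` for missing steps makes it obvious when one run logged at a different cadence than another.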
## Interface Features
### Create Experiment Tab
- **Experiment Name**: Descriptive name for your training run
- **Description**: Detailed description of what you're training
- **Automatic ID Generation**: Unique experiment identifier
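The example ID `exp_20231201_143022` suggests an `exp_<YYYYMMDD>_<HHMMSS>` timestamp format. A sketch of generating IDs in that shape, assuming that format (the Space's actual generator may differ); the numeric suffix for experiments created within the same second is a hypothetical addition.

```python
from datetime import datetime
from typing import Optional, Set


def make_experiment_id(existing: Set[str], now: Optional[datetime] = None) -> str:
    """Return an ID like exp_20231201_143022, adding a suffix if it is already taken."""
    now = now or datetime.now()
    base = now.strftime("exp_%Y%m%d_%H%M%S")
    candidate, counter = base, 1
    while candidate in existing:  # another experiment was created in the same second
        candidate = f"{base}_{counter}"
        counter += 1
    return candidate


ts = datetime(2023, 12, 1, 14, 30, 22)
first = make_experiment_id(set(), now=ts)
second = make_experiment_id({first}, now=ts)
```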
### Log Metrics Tab
- **Experiment ID**: The experiment to log metrics for
- **Metrics JSON**: Training metrics in JSON format
- **Step**: Current training step (optional)
Example metrics JSON:
```json
{
  "loss": 0.5234,
  "accuracy": 0.8567,
  "learning_rate": 3.5e-6,
  "gpu_memory_gb": 22.5,
  "gpu_utilization_percent": 87.3,
  "training_time_per_step": 0.456
}
```
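Because the tab takes raw JSON, a quick local check can catch typos before you submit. A minimal sketch using only the standard library; it assumes every metric value should be a number, which matches the example above but is not a documented rule of the interface.

```python
import json
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse a metrics JSON object and reject non-numeric values."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("metrics must be a JSON object")
    for key, value in data.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {key!r} is not a number: {value!r}")
    return {key: float(value) for key, value in data.items()}


metrics = parse_metrics('{"loss": 0.5234, "accuracy": 0.8567, "learning_rate": 3.5e-6}')
```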
### Log Parameters Tab
- **Experiment ID**: The experiment to log parameters for
- **Parameters JSON**: Training configuration in JSON format
Example parameters JSON:
```json
{
  "model_name": "HuggingFaceTB/SmolLM3-3B",
  "batch_size": 8,
  "learning_rate": 3.5e-6,
  "max_iters": 18000,
  "mixed_precision": "bf16",
  "no_think_system_message": true
}
```
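Rather than typing parameters by hand, you can serialize your training configuration. A sketch assuming the configuration is a Python dataclass; `TrainingParams` is a hypothetical stand-in for your real config object, filled with the example values above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingParams:
    # Hypothetical config mirroring the example parameters JSON above
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 8
    learning_rate: float = 3.5e-6
    max_iters: int = 18000
    mixed_precision: str = "bf16"
    no_think_system_message: bool = True


# asdict() produces a plain dict; json.dumps writes Python True as JSON true
params_json = json.dumps(asdict(TrainingParams()), indent=2)
print(params_json)
```

Paste the printed output into the Parameters JSON field, or send it programmatically from your script.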
### View Experiments Tab
- **Experiment ID**: Enter an ID to view a specific experiment
- **List All Experiments**: Shows an overview of all experiments
- **Detailed Information**: Formatted display with statistics
### Visualizations Tab
- **Training Metrics**: Interactive plots for individual metrics
- **Experiment Comparison**: Side-by-side comparison of multiple runs
- **Real-time Updates**: Plots update as new data is logged
### Demo Data Tab
- **Generate Demo Data**: Create realistic training data for testing
- **Configurable**: Adjust parameters to match your setup
- **Multiple Metrics**: Simulates loss, accuracy, GPU metrics, etc.
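To illustrate what "realistic" demo data can look like, here is a minimal sketch that simulates an exponentially decaying loss, a saturating accuracy, and a linear-warmup/cosine-decay learning rate, each with small noise. The curve shapes and constants are assumptions for illustration, not the Space's actual generator.

```python
import math
import random
from typing import Dict, List


def generate_demo_metrics(
    num_steps: int = 1000, log_every: int = 25, peak_lr: float = 3.5e-6, seed: int = 0
) -> List[Dict[str, float]]:
    """Return one metrics dict per logged step with plausible training curves."""
    rng = random.Random(seed)  # seeded so the demo data is reproducible
    warmup = max(1, num_steps // 10)
    history = []
    for step in range(0, num_steps + 1, log_every):
        progress = step / num_steps
        if step < warmup:
            lr = peak_lr * step / warmup  # linear warmup from 0
        else:
            decay = (step - warmup) / max(1, num_steps - warmup)
            lr = 0.5 * peak_lr * (1 + math.cos(math.pi * decay))  # cosine decay to 0
        accuracy = 0.9 * (1 - math.exp(-4 * progress)) + rng.gauss(0, 0.01)
        history.append({
            "step": step,
            "loss": 2.5 * math.exp(-3 * progress) + 0.3 + rng.gauss(0, 0.02),
            "accuracy": max(0.0, min(1.0, accuracy)),  # clamp noise into [0, 1]
            "learning_rate": lr,
            "gpu_memory_gb": 22.0 + rng.uniform(-0.5, 0.5),
            "training_time_per_step": 0.45 + rng.uniform(-0.03, 0.03),
        })
    return history


demo = generate_demo_metrics()
```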
### Update Status Tab
- **Experiment ID**: The experiment to update
- **Status**: running, completed, failed, paused
- **Visual Indicators**: Status shown with emojis
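A sketch of validating the status field before sending it, using the four values listed above. The `ExperimentStatus` enum and the badge mapping are illustrative assumptions; the paused icon in particular is a placeholder.

```python
from enum import Enum


class ExperimentStatus(str, Enum):
    # The four statuses accepted by the Update Status tab
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    PAUSED = "paused"


# Illustrative display badges; the paused icon is a placeholder assumption
STATUS_BADGES = {
    ExperimentStatus.RUNNING: "🟢",
    ExperimentStatus.COMPLETED: "✅",
    ExperimentStatus.FAILED: "❌",
    ExperimentStatus.PAUSED: "⏸️",
}


def format_status(raw: str) -> str:
    """Normalize input such as ' Completed ' and return a badge plus status.

    Raises ValueError for anything that is not one of the four statuses.
    """
    status = ExperimentStatus(raw.strip().lower())
    return f"{STATUS_BADGES[status]} {status.value}"
```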
## What Gets Displayed
### Training Metrics
- **Loss**: Training loss over time
- **Accuracy**: Model accuracy progression
- **Learning Rate**: Learning rate scheduling
- **GPU Memory**: Memory usage in GB
- **GPU Utilization**: GPU usage percentage
- **Training Time**: Time per training step
### Experiment Details
- **Basic Info**: ID, name, description, status, creation time
- **Statistics**: Metrics count, parameters count, artifacts count
- **Parameters**: All training configuration
- **Latest Metrics**: Most recent training metrics
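The statistics and latest-metrics fields can be derived directly from a stored experiment. A sketch under an assumed record layout (a dict with `metrics`, `parameters`, and `artifacts` keys); the Space's real storage format is not described in this guide, so treat these field names and `summarize_experiment` as hypothetical.

```python
from typing import Any, Dict


def summarize_experiment(exp: Dict[str, Any]) -> Dict[str, Any]:
    """Compute the counts and latest metrics shown in the experiment details view."""
    metrics = exp.get("metrics", [])
    # "Latest" is taken to mean the entry with the highest step number
    latest = max(metrics, key=lambda m: m.get("step", 0)) if metrics else {}
    return {
        "metrics_count": len(metrics),
        "parameters_count": len(exp.get("parameters", {})),
        "artifacts_count": len(exp.get("artifacts", [])),
        "latest_metrics": latest,
    }


example = {
    "metrics": [{"step": 25, "loss": 1.7}, {"step": 50, "loss": 1.5}],
    "parameters": {"batch_size": 8, "learning_rate": 3.5e-6},
    "artifacts": [],
}
summary = summarize_experiment(example)
```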
### Visualizations
- **Line Charts**: Smooth curves showing metric progression
- **Interactive Hover**: Detailed information on hover
- **Multiple Metrics**: Switch between different metrics
- **Comparison Charts**: Side-by-side experiment comparison
## Integration with Your Training
### Automatic Integration
Your training script automatically:
1. **Creates experiments** with your specified name
2. **Logs parameters** from your configuration
3. **Logs metrics** every 25 steps (configurable)
4. **Logs system metrics** (GPU memory, utilization)
5. **Logs checkpoints** every 2000 steps
6. **Updates status** when training completes
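The intervals above (metrics every 25 steps, checkpoints every 2000 steps) come down to simple modulo checks inside the training loop. A sketch of that scheduling logic; `LoggingSchedule` and its field names are hypothetical, and the defaults mirror the numbers in this guide rather than any documented flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoggingSchedule:
    log_every: int = 25           # metrics logging interval, in steps
    checkpoint_every: int = 2000  # checkpoint logging interval, in steps

    def actions(self, step: int) -> List[str]:
        """Return the logging actions due at this (1-based) training step."""
        fired = []
        if step % self.log_every == 0:
            fired.append("metrics")
        if step % self.checkpoint_every == 0:
            fired.append("checkpoint")
        return fired


schedule = LoggingSchedule()
# Steps in a 4000-step run where at least one action fires
due_steps = [step for step in range(1, 4001) if schedule.actions(step)]
```

Changing `log_every` trades dashboard resolution for fewer requests to the Space.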
### Manual Integration
You can also manually:
1. **Create experiments** through the interface
2. **Log custom metrics** for specific analysis
3. **Compare different runs** with different parameters
4. **Generate demo data** for testing the interface
## Customization
### Adding Custom Metrics
```python
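# In your training script. `monitor` is assumed to be the Trackio monitor object
# your script already creates; the current_* values come from your training loop.
custom_metrics = {
    "loss": current_loss,
    "accuracy": current_accuracy,
    "custom_metric": your_custom_value,  # any extra scalar you want to plot
    "gpu_memory": gpu_memory_usage,
}
# Log all values for this step in a single call
monitor.log_metrics(custom_metrics, step=current_step)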
```
### Custom Visualizations
The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
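Since any logged metric becomes plottable, the dropdown's choices amount to the union of metric names across all logged entries. A sketch of collecting them, assuming each entry is a dict that also carries bookkeeping keys such as `step` (an assumption about the stored format); `available_metrics` is a hypothetical helper.

```python
from typing import Iterable, List, Tuple


def available_metrics(
    entries: Iterable[dict], exclude: Tuple[str, ...] = ("step", "timestamp")
) -> List[str]:
    """Collect every numeric metric name seen in any entry, in first-seen order."""
    names: List[str] = []
    for entry in entries:
        for key, value in entry.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key not in exclude and is_number and key not in names:
                names.append(key)
    return names


entries = [
    {"step": 25, "loss": 1.7, "accuracy": 0.61},
    {"step": 50, "loss": 1.5, "custom_metric": 0.42},
]
```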
## Troubleshooting
### No Data Displayed
1. **Check experiment ID**: Make sure you're using the correct ID
2. **Verify metrics were logged**: Check if training is actually logging metrics
3. **Use demo data**: Generate demo data to test the interface
### Plots Not Updating
1. **Refresh the page**: Sometimes plots need a refresh
2. **Check data format**: Ensure metrics are in the correct JSON format
3. **Verify step numbers**: Make sure step numbers are increasing
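For the last check, a quick sketch that scans the step numbers of a metrics history (in logging order, e.g. `[e["step"] for e in history]`) and reports any step that fails to increase over its predecessor. `find_step_regressions` is a hypothetical helper.

```python
from typing import List, Tuple


def find_step_regressions(steps: List[int]) -> List[Tuple[int, int]]:
    """Return (index, step) pairs where the step did not increase over the previous one."""
    return [
        (i, steps[i])
        for i in range(1, len(steps))
        if steps[i] <= steps[i - 1]  # repeated or decreasing step
    ]
```

A non-empty result usually points to a restarted run reusing the same experiment ID or to a step counter that resets.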
### Interface Not Loading
1. **Check dependencies**: Ensure plotly and pandas are installed
2. **Check Gradio version**: Use Gradio 4.0.0 or higher
3. **Check browser console**: Look for JavaScript errors
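The first two checks can be scripted. A sketch using `importlib.metadata` from the standard library (Python 3.8+); the package names come from this section, and the comparison only looks at the major version, which is enough for the "4.0.0 or higher" rule. `check_dependencies` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional


def check_dependencies(requirements: Dict[str, Optional[int]]) -> Dict[str, str]:
    """Map each problem package to a message; an empty dict means all checks passed.

    `requirements` maps a distribution name to a minimum major version, or None.
    """
    problems: Dict[str, str] = {}
    for name, min_major in requirements.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        major = int(installed.split(".")[0])  # e.g. "4.44.1" -> 4
        if min_major is not None and major < min_major:
            problems[name] = f"version {installed} is older than {min_major}.0.0"
    return problems


REQUIREMENTS = {"plotly": None, "pandas": None, "gradio": 4}

if __name__ == "__main__":
    print(check_dependencies(REQUIREMENTS) or "All dependencies look fine")
```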
## Example Workflow
1. **Start Training**:
   ```bash
   python run_a100_large_experiment.py --experiment-name "my_experiment"
   ```
2. **Monitor Progress**:
   - Visit your Trackio Space
   - Go to "View Experiments"
   - Enter your experiment ID
   - Watch real-time updates
3. **Visualize Results**:
   - Go to "Visualizations"
   - Select the "loss" metric
   - Create a plot to see training progress
4. **Compare Runs**:
   - Run multiple experiments with different parameters
   - Use "Compare Experiments" to see the differences
5. **Generate Demo Data**:
   - Use the "Demo Data" tab to test the interface
   - Generate realistic training data for demonstration
## Success Indicators
Your interface is working correctly when you see:
- ✅ **Formatted experiment details** with emojis and structure
- ✅ **Interactive plots** that respond to your inputs
- ✅ **Real-time metric updates** during training
- ✅ **Clean experiment overview** with statistics
- ✅ **Smooth visualization** with hover information
The enhanced interface now displays much more meaningful information and provides comprehensive monitoring for your SmolLM3 training experiments.