# Enhanced Trackio Interface Guide

## Overview

Your Trackio application provides monitoring and visualization for SmolLM3 training experiments: real-time plots, formatted experiment details, experiment comparison, and demo data generation. This guide explains how to use each feature.

## 🚀 Key Enhancements

### 1. **Real-time Visualization**
- **Interactive Plots**: Loss curves, accuracy, learning rate, GPU metrics
- **Experiment Comparison**: Compare multiple training runs side-by-side
- **Live Updates**: Watch training progress in real-time

### 2. **Comprehensive Data Display**
- **Formatted Output**: Clean, emoji-rich experiment details
- **Statistics Overview**: Metrics count, parameters count, artifacts count
- **Status Tracking**: Visual status indicators (🟢 running, ✅ completed, ❌ failed)

### 3. **Demo Data Generation**
- **Realistic Simulation**: Generate realistic training metrics for testing
- **Multiple Metrics**: Loss, accuracy, learning rate, GPU memory, training time
- **Configurable Parameters**: Customize demo data to match your setup

## 📊 How to Use with Your SmolLM3 Training

### Step 1: Start Your Training
```bash
python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_balanced.py \
    --trackio_url "https://tonic-test-trackio-test.hf.space" \
    --experiment-name "petit-elle-l-aime-3-balanced" \
    --output-dir ./outputs/balanced
```

### Step 2: Monitor in Real-time
1. **Visit your Trackio Space**: `https://tonic-test-trackio-test.hf.space`
2. **Go to "View Experiments" tab**
3. **Enter your experiment ID** (e.g., `exp_20231201_143022`)
4. **Click "View Experiment"** to see detailed information

### Step 3: Visualize Training Progress
1. **Go to "📊 Visualizations" tab**
2. **Enter your experiment ID**
3. **Select a metric** (loss, accuracy, learning_rate, gpu_memory, training_time)
4. **Click "Create Plot"** to see interactive charts

### Step 4: Compare Experiments
1. **Open the "📊 Visualizations" tab**
2. **Enter multiple experiment IDs** (comma-separated)
3. **Click "Compare Experiments"** to see side-by-side comparison
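
The comparison field takes several experiment IDs separated by commas. How the app parses that field is not shown in this guide; the sketch below is one plausible way to clean such input before pasting it in, and `parse_experiment_ids` is a hypothetical helper, not part of Trackio.

```python
from typing import List


def parse_experiment_ids(raw: str) -> List[str]:
    """Split a comma-separated string into unique, non-empty IDs, keeping input order."""
    seen = set()
    ids: List[str] = []
    for part in raw.split(","):
        exp_id = part.strip()  # tolerate spaces around commas
        if exp_id and exp_id not in seen:
            seen.add(exp_id)
            ids.append(exp_id)
    return ids


print(parse_experiment_ids("exp_20231201_143022, exp_20231202_091500"))
```

Dropping duplicates and empty entries keeps a stray trailing comma from producing a blank series in the comparison.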

## 🎯 Interface Features

### Create Experiment Tab
- **Experiment Name**: Descriptive name for your training run
- **Description**: Detailed description of what you're training
- **Automatic ID Generation**: Unique experiment identifier
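
The example ID used in this guide (`exp_20231201_143022`) looks like a timestamp. As a rough sketch, assuming the format is `exp_` followed by `YYYYMMDD_HHMMSS` (the app's actual generation logic may differ, and `make_experiment_id` is a hypothetical helper), an ID of that shape could be produced like this:

```python
from datetime import datetime
from typing import Optional


def make_experiment_id(now: Optional[datetime] = None) -> str:
    """Return an ID such as 'exp_20231201_143022' (assumed format, not confirmed by the app)."""
    now = now or datetime.now()
    return now.strftime("exp_%Y%m%d_%H%M%S")


print(make_experiment_id(datetime(2023, 12, 1, 14, 30, 22)))
```

With one-second resolution, two experiments created in the same second would get the same ID under this scheme, so a real implementation might add a suffix.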

### Log Metrics Tab
- **Experiment ID**: The experiment to log metrics for
- **Metrics JSON**: Training metrics in JSON format
- **Step**: Current training step (optional)

Example metrics JSON:
```json
{
  "loss": 0.5234,
  "accuracy": 0.8567,
  "learning_rate": 3.5e-6,
  "gpu_memory_gb": 22.5,
  "gpu_utilization_percent": 87.3,
  "training_time_per_step": 0.456
}
```
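
If you build this JSON from Python rather than typing it by hand, a small helper can catch values that would not serialize cleanly. The sketch below assumes metrics are a flat mapping of names to finite numbers; `build_metrics_json` is a hypothetical helper, not part of Trackio.

```python
import json
import math


def build_metrics_json(metrics: dict) -> str:
    """Serialize a flat metrics dict to JSON, rejecting non-numeric or non-finite values."""
    for name, value in metrics.items():
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"metric {name!r} must be a number, got {type(value).__name__}")
        if not math.isfinite(value):
            raise ValueError(f"metric {name!r} is not finite: {value}")
    # allow_nan=False makes json.dumps raise instead of emitting NaN/Infinity,
    # which are not valid JSON
    return json.dumps(metrics, indent=2, allow_nan=False)


payload = build_metrics_json({"loss": 0.5234, "accuracy": 0.8567, "learning_rate": 3.5e-6})
print(payload)
```

Failing early on `NaN` is useful because a diverged loss would otherwise produce a string that strict JSON parsers reject.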

### Log Parameters Tab
- **Experiment ID**: The experiment to log parameters for
- **Parameters JSON**: Training configuration in JSON format

Example parameters JSON:
```json
{
  "model_name": "HuggingFaceTB/SmolLM3-3B",
  "batch_size": 8,
  "learning_rate": 3.5e-6,
  "max_iters": 18000,
  "mixed_precision": "bf16",
  "no_think_system_message": true
}
```
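
Parameters can also be generated from the configuration object you train with. The following is a minimal sketch, assuming your configuration can be expressed as a dataclass; `TrainingConfig` is a hypothetical class whose fields simply mirror the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingConfig:
    # Hypothetical config; field names and values copy the example parameters JSON
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 8
    learning_rate: float = 3.5e-6
    max_iters: int = 18000
    mixed_precision: str = "bf16"
    no_think_system_message: bool = True


# asdict() turns the dataclass into a plain dict that json.dumps can serialize
params_json = json.dumps(asdict(TrainingConfig()), indent=2)
print(params_json)
```

Python's `True` is written as JSON `true`, so the output can be pasted into the Parameters JSON field as-is.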

### View Experiments Tab
- **Experiment ID**: Enter to view specific experiment
- **List All Experiments**: Shows overview of all experiments
- **Detailed Information**: Formatted display with statistics

### 📊 Visualizations Tab
- **Training Metrics**: Interactive plots for individual metrics
- **Experiment Comparison**: Side-by-side comparison of multiple runs
- **Real-time Updates**: Plots update as new data is logged

### 🎯 Demo Data Tab
- **Generate Demo Data**: Create realistic training data for testing
- **Configurable**: Adjust parameters to match your setup
- **Multiple Metrics**: Simulates loss, accuracy, GPU metrics, etc.
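
To get a feel for what simulated training data might look like, here is a minimal sketch that produces a decaying loss, a rising accuracy, and a linearly decaying learning rate. It is not the generator the app uses; the function name, curve shapes, and noise ranges are all assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List


def generate_demo_metrics(num_steps: int = 100, log_every: int = 25, seed: int = 0) -> List[Dict[str, float]]:
    """Return one simulated metrics record every `log_every` steps (illustrative shapes only)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    records = []
    for step in range(0, num_steps + 1, log_every):
        progress = step / max(num_steps, 1)
        records.append({
            "step": step,
            "loss": 2.0 * math.exp(-3.0 * progress) + rng.uniform(0.0, 0.05),
            "accuracy": min(1.0, 0.3 + 0.6 * progress + rng.uniform(0.0, 0.02)),
            "learning_rate": 3.5e-6 * (1.0 - progress),
            "gpu_memory_gb": 22.0 + rng.uniform(0.0, 1.0),
        })
    return records


for record in generate_demo_metrics():
    print(record)
```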

### Update Status Tab
- **Experiment ID**: The experiment to update
- **Status**: running, completed, failed, paused
- **Visual Indicators**: Status shown with emojis
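
A status-to-emoji lookup is one simple way to render these indicators. The sketch below uses only the three indicators this guide names (running, completed, failed); since no indicator is given for `paused`, it falls back to plain text. `STATUS_ICONS` and `format_status` are hypothetical names, not taken from the app.

```python
STATUS_ICONS = {
    "running": "🟢",
    "completed": "✅",
    "failed": "❌",
}


def format_status(status: str) -> str:
    """Prefix a status with its emoji when one is known, otherwise return it unchanged."""
    icon = STATUS_ICONS.get(status)
    return f"{icon} {status}" if icon else status


for status in ("running", "completed", "failed", "paused"):
    print(format_status(status))
```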

## 📈 What Gets Displayed

### Training Metrics
- **Loss**: Training loss over time
- **Accuracy**: Model accuracy progression
- **Learning Rate**: Learning rate scheduling
- **GPU Memory**: Memory usage in GB
- **GPU Utilization**: GPU usage percentage
- **Training Time**: Time per training step

### Experiment Details
- **Basic Info**: ID, name, description, status, creation time
- **Statistics**: Metrics count, parameters count, artifacts count
- **Parameters**: All training configuration
- **Latest Metrics**: Most recent training metrics

### Visualizations
- **Line Charts**: Smooth curves showing metric progression
- **Interactive Hover**: Detailed information on hover
- **Multiple Metrics**: Switch between different metrics
- **Comparison Charts**: Side-by-side experiment comparison

## 🔧 Integration with Your Training

### Automatic Integration
Your training script automatically:
1. **Creates experiments** with your specified name
2. **Logs parameters** from your configuration
3. **Logs metrics** every 25 steps (configurable)
4. **Logs system metrics** (GPU memory, utilization)
5. **Logs checkpoints** every 2000 steps
6. **Updates status** when training completes
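
The logging cadence in the list above can be sketched as a loop. This is an illustration, not the training script's code: `RecordingMonitor` is a hypothetical in-memory stand-in, `log_checkpoint` is an assumed method name, and only the `log_metrics(metrics, step=...)` call shape is taken from this guide.

```python
from typing import Dict, List, Tuple

LOG_EVERY = 25           # metric logging interval from the list above
CHECKPOINT_EVERY = 2000  # checkpoint logging interval from the list above


class RecordingMonitor:
    """Hypothetical stand-in for the real monitor; it only records calls in memory."""

    def __init__(self) -> None:
        self.metrics: List[Tuple[int, Dict[str, float]]] = []
        self.checkpoints: List[int] = []
        self.status = "running"

    def log_metrics(self, metrics: Dict[str, float], step: int) -> None:
        self.metrics.append((step, metrics))

    def log_checkpoint(self, step: int) -> None:  # assumed method name
        self.checkpoints.append(step)


def run_training(monitor: RecordingMonitor, max_iters: int) -> None:
    for step in range(1, max_iters + 1):
        loss = 1.0 / step  # placeholder for a real training step
        if step % LOG_EVERY == 0:
            monitor.log_metrics({"loss": loss}, step=step)
        if step % CHECKPOINT_EVERY == 0:
            monitor.log_checkpoint(step)
    monitor.status = "completed"  # status update once training finishes


monitor = RecordingMonitor()
run_training(monitor, max_iters=4000)
print(len(monitor.metrics), monitor.checkpoints, monitor.status)
```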

### Manual Integration
You can also manually:
1. **Create experiments** through the interface
2. **Log custom metrics** for specific analysis
3. **Compare different runs** with different parameters
4. **Generate demo data** for testing the interface

## 🎨 Customization

### Adding Custom Metrics
```python
# In your training script, inside the training loop.
# `monitor` and the values on the right-hand side come from your own code.
custom_metrics = {
    "loss": current_loss,
    "accuracy": current_accuracy,
    "custom_metric": your_custom_value,
    "gpu_memory": gpu_memory_usage,
}

monitor.log_metrics(custom_metrics, step=current_step)
```

### Custom Visualizations
The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
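
How the app fills the dropdown is not shown in this guide. One plausible approach, sketched here with a hypothetical `available_metric_names` helper, is to take the union of all keys found in the logged records, so a newly logged metric shows up without any extra configuration.

```python
from typing import Dict, Iterable, List


def available_metric_names(records: Iterable[Dict[str, float]]) -> List[str]:
    """Collect every metric name that appears in at least one logged record."""
    names = set()
    for record in records:
        names.update(record.keys())
    return sorted(names)  # sorted for a stable dropdown order


logged = [
    {"loss": 0.61, "accuracy": 0.82},
    {"loss": 0.52, "accuracy": 0.86, "custom_metric": 3.0},
]
print(available_metric_names(logged))
```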

## 🚨 Troubleshooting

### No Data Displayed
1. **Check experiment ID**: Make sure you're using the correct ID
2. **Verify metrics were logged**: Check if training is actually logging metrics
3. **Use demo data**: Generate demo data to test the interface

### Plots Not Updating
1. **Refresh the page**: Sometimes plots need a refresh
2. **Check data format**: Ensure metrics are in the correct JSON format
3. **Verify step numbers**: Make sure step numbers are increasing
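
The last two checks can be run locally before logging. The sketch below assumes that a valid metrics payload is a JSON object and that steps must be strictly increasing; both helper names are hypothetical.

```python
import json
from typing import List


def is_valid_metrics_json(text: str) -> bool:
    """Return True if `text` parses as a JSON object (not a list, string, or number)."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def steps_strictly_increasing(steps: List[int]) -> bool:
    """Return True if every step is larger than the one before it."""
    return all(a < b for a, b in zip(steps, steps[1:]))


print(is_valid_metrics_json('{"loss": 0.5234}'))
print(is_valid_metrics_json("{'loss': 0.5234}"))  # single quotes are not valid JSON
print(steps_strictly_increasing([25, 50, 75]))
```

A common mistake is pasting a Python dict literal with single quotes, which fails JSON parsing even though it looks similar.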

### Interface Not Loading
1. **Check dependencies**: Ensure plotly and pandas are installed
2. **Check Gradio version**: Use Gradio 4.0.0 or higher
3. **Check browser console**: Look for JavaScript errors
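
The first two checks can be scripted with the standard library. This is a minimal sketch using `importlib.metadata`; `installed_versions` is a hypothetical helper, and the Gradio comparison only looks at the major version number.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is not installed."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["gradio", "plotly", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")

# Compare only the major version against the 4.0.0 minimum stated above
gradio_version = report["gradio"]
if gradio_version is not None and int(gradio_version.split(".")[0]) < 4:
    print("Gradio is older than 4.0.0; consider upgrading.")
```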

## 📊 Example Workflow

1. **Start Training**:
   ```bash
   python run_a100_large_experiment.py --experiment-name "my_experiment"
   ```

2. **Monitor Progress**:
   - Visit your Trackio Space
   - Go to "View Experiments"
   - Enter your experiment ID
   - Watch real-time updates

3. **Visualize Results**:
   - Go to "📊 Visualizations"
   - Select "loss" metric
   - Create plot to see training progress

4. **Compare Runs**:
   - Run multiple experiments with different parameters
   - Use "Compare Experiments" to see differences

5. **Generate Demo Data**:
   - Use "🎯 Demo Data" tab to test the interface
   - Generate realistic training data for demonstration

## 🎉 Success Indicators

Your interface is working correctly when you see:
- ✅ **Formatted experiment details** with emojis and structure
- ✅ **Interactive plots** that respond to your inputs
- ✅ **Real-time metric updates** during training
- ✅ **Clean experiment overview** with statistics
- ✅ **Smooth visualization** with hover information

If all of these are present, the interface is ready to monitor your SmolLM3 training experiments.