# 🔧 Improved Monitoring Integration Guide

## Overview

The monitoring system now supports **Hugging Face Datasets** for persistent experiment storage, which makes it well suited to Hugging Face Spaces and other cloud environments where local storage is ephemeral.

## 🚀 Key Improvements

### 1. **HF Datasets Integration**
- ✅ **Persistent Storage**: Experiments are saved to HF Datasets repositories
- ✅ **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
- ✅ **Fallback Support**: Graceful degradation if HF Datasets is unavailable
- ✅ **Automatic Backup**: Local files are kept as a backup
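
The fallback and backup behaviour listed above can be pictured with a small sketch. This is not the code in `monitoring.py`; it is a minimal illustration that assumes a hypothetical `upload` callable standing in for whatever pushes an experiment to the dataset repository, and a hypothetical `save_with_fallback` helper name.

```python
import json
from pathlib import Path
from typing import Callable


def save_with_fallback(
    experiment: dict,
    upload: Callable[[dict], None],
    backup_dir: Path,
) -> str:
    """Write a local backup, then try the remote upload.

    Returns "remote" if the upload succeeded, "local" if only the
    backup file was written.
    """
    # Always write the local copy first so a failed upload loses nothing.
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup_path = backup_dir / f"{experiment['id']}.json"
    backup_path.write_text(json.dumps(experiment, indent=2))

    try:
        upload(experiment)  # hypothetical: pushes to the HF dataset repo
    except Exception as exc:  # broad on purpose: monitoring should not stop training
        print(f"Remote save failed ({exc}); kept local backup at {backup_path}")
        return "local"
    return "remote"
```

Writing the local file before attempting the upload means the backup exists even when the remote call raises partway through.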

### 2. **Enhanced Monitoring Features**
- 📊 **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets
- 🔧 **System Metrics**: GPU memory, CPU usage, and system performance
- 📈 **Training Summaries**: Comprehensive experiment summaries
- 🛡️ **Error Handling**: Robust error logging and recovery

### 3. **Easy Integration**
- 🔌 **Automatic Setup**: Environment variables are detected automatically
- 📝 **Configuration**: Simple setup with environment variables
- 🔄 **Backward Compatible**: Works with an existing Trackio setup

## 📋 Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | ✅ Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | ❌ No | `tonic/trackio-experiments` | Dataset repository |
| `TRACKIO_URL` | ❌ No | None | Trackio server URL |
| `TRACKIO_TOKEN` | ❌ No | None | Trackio authentication token |
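
As an illustration of how the table above translates into code, here is a minimal sketch that reads the four variables and applies the documented default. The `MonitoringConfig` class and `load_monitoring_config` function are hypothetical names used for this example only; the actual loading logic in `monitoring.py` may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Default taken from the table above.
DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


@dataclass
class MonitoringConfig:
    hf_token: str
    dataset_repo: str
    trackio_url: Optional[str]
    trackio_token: Optional[str]


def load_monitoring_config(env: Mapping[str, str] = os.environ) -> MonitoringConfig:
    """Build a config from environment variables (hypothetical helper)."""
    token = env.get("HF_TOKEN")
    if not token:
        # HF_TOKEN is the only required variable.
        raise RuntimeError("HF_TOKEN not found: set it before starting training")
    return MonitoringConfig(
        hf_token=token,
        # Empty strings are treated the same as unset variables.
        dataset_repo=env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO,
        trackio_url=env.get("TRACKIO_URL") or None,
        trackio_token=env.get("TRACKIO_TOKEN") or None,
    )
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of the real process environment.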

## 🛠️ Setup Instructions

### 1. **Get Your HF Token**
```bash
# Go to https://huggingface.co/settings/tokens
# Create a new token with "Write" permissions
# Copy the token
```

### 2. **Set Environment Variables**
```bash
# For HF Spaces, add these to your Space settings:
HF_TOKEN=your_hf_token_here
TRACKIO_DATASET_REPO=your-username/your-dataset-name

# For local development:
export HF_TOKEN=your_hf_token_here
export TRACKIO_DATASET_REPO=your-username/your-dataset-name
```

### 3. **Create Dataset Repository**
```bash
# Run the setup script
python setup_hf_dataset.py

# Or manually create a dataset on HF Hub
# Go to https://huggingface.co/datasets
# Create a new dataset repository
```
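
If you prefer to create the repository from Python rather than through the web interface, the sketch below shows one way to do it with `huggingface_hub.create_repo`. It is not a copy of `setup_hf_dataset.py`, whose contents are not shown here; `ensure_dataset_repo` and `is_valid_repo_id` are hypothetical helper names, the ID check is deliberately loose, and creating the repository as private is an assumption you may want to change.

```python
import re

# Loose "owner/name" check; the Hub applies stricter rules of its own.
_REPO_ID = re.compile(r"^[\w.-]+/[\w.-]+$")


def is_valid_repo_id(repo_id: str) -> bool:
    """Return True if repo_id looks like 'owner/name'."""
    return bool(_REPO_ID.match(repo_id))


def ensure_dataset_repo(repo_id: str, token: str, private: bool = True) -> str:
    """Create the dataset repository if it does not exist yet.

    Requires the third-party `huggingface_hub` package and a token
    with write permissions.
    """
    if not is_valid_repo_id(repo_id):
        raise ValueError(f"Expected 'owner/name', got {repo_id!r}")

    # Imported here so the validation helper works without the package.
    from huggingface_hub import create_repo

    url = create_repo(
        repo_id,
        token=token,
        repo_type="dataset",
        private=private,
        exist_ok=True,  # do not fail if the repository is already there
    )
    return str(url)
```

A call such as `ensure_dataset_repo("your-username/your-dataset-name", token)` is safe to repeat because of `exist_ok=True`.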

### 4. **Test Configuration**
```bash
# Test your setup
python configure_trackio.py

# Test dataset access
python test_hf_datasets.py
```

## 🚀 Usage Examples

### **Basic Training with Monitoring**
```bash
# Train with default monitoring
python train.py config/train_smollm3_openhermes_fr.py

# Train with custom dataset repository
TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
```

### **Advanced Training Configuration**
```bash
# Train with custom experiment name
python train.py config/train_smollm3_openhermes_fr.py \
  --experiment_name "smollm3_french_tuning_v2" \
  --hf_token your_token_here \
  --dataset_repo your-username/french-experiments
```

### **Training Scripts with Monitoring**
```bash
# All training scripts now support monitoring:
python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
python train.py config/train_smollm3_openhermes_fr_a100_large.py
python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
```

## 📊 What Gets Monitored

### **Training Metrics**
- Loss values (training and validation)
- Learning rate
- Gradient norms
- Training steps and epochs

### **System Metrics**
- GPU memory usage
- GPU utilization
- CPU usage
- Memory usage
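
To give a rough idea of how such system metrics can be gathered, here is a minimal sketch. It assumes the optional third-party packages `psutil` and `torch` and skips whatever is not installed; the function name and metric keys are hypothetical and need not match what `monitoring.py` logs. GPU utilization is left out because reading it usually needs an extra NVML binding.

```python
import time


def collect_system_metrics() -> dict:
    """Collect a snapshot of system metrics, skipping unavailable sources."""
    metrics = {"timestamp": time.time()}

    try:
        import psutil  # optional third-party dependency

        # interval=None returns usage since the previous call without blocking.
        metrics["cpu_percent"] = psutil.cpu_percent(interval=None)
        metrics["memory_percent"] = psutil.virtual_memory().percent
    except ImportError:
        pass

    try:
        import torch  # optional third-party dependency

        if torch.cuda.is_available():
            mib = 1024 ** 2
            metrics["gpu_memory_allocated_mb"] = torch.cuda.memory_allocated() / mib
            metrics["gpu_memory_reserved_mb"] = torch.cuda.memory_reserved() / mib
    except ImportError:
        pass

    return metrics
```

Calling this once per logging step and merging the result into the training metrics keeps both kinds of data on the same timeline.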

### **Experiment Data**
- Configuration parameters
- Model checkpoints
- Evaluation results
- Training summaries

### **Artifacts**
- Configuration files
- Training logs
- Evaluation results
- Model checkpoints

## 🔍 Viewing Results

### **1. Trackio Interface**
- Visit your Trackio Space
- Navigate to "Experiments" tab
- View real-time metrics and plots

### **2. HF Dataset Repository**
- Go to your dataset repository on HF Hub
- Browse experiment data
- Download experiment files

### **3. Local Files**
- Check local backup files
- Review training logs
- Examine configuration files

## 🛠️ Configuration Examples

### **Default Setup**
```python
# Uses default dataset: tonic/trackio-experiments
# Requires only HF_TOKEN
```

### **Personal Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/trackio-experiments
```

### **Team Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-org/team-experiments
```

### **Project-Specific Dataset**
```bash
export HF_TOKEN=your_token_here
export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
```

## 🔧 Troubleshooting

### **Issue: "HF_TOKEN not found"**
```bash
# Solution: Set your HF token
export HF_TOKEN=your_token_here
# Or add to HF Space environment variables
```

### **Issue: "Failed to load dataset"**
```bash
# Solutions:
# 1. Check token has read access
# 2. Verify dataset repository exists
# 3. Run setup script: python setup_hf_dataset.py
```
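
The first two checks can also be run from Python. The sketch below asks the Hub for the repository's metadata via `HfApi.repo_info`, which fails if the repository is missing or the token cannot read it. `check_dataset_access` is a hypothetical helper, not part of `test_hf_datasets.py`, and the `api` parameter exists only so the function can be exercised without network access.

```python
from typing import Any, Optional, Tuple


def check_dataset_access(
    repo_id: str,
    token: Optional[str] = None,
    api: Any = None,
) -> Tuple[bool, str]:
    """Return (True, "ok") if the dataset repo is readable, else (False, reason)."""
    if api is None:
        # Requires the third-party `huggingface_hub` package.
        from huggingface_hub import HfApi

        api = HfApi()

    try:
        api.repo_info(repo_id, repo_type="dataset", token=token)
    except Exception as exc:  # diagnostic helper: report any failure as text
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"
```

The exception class name in the returned message usually indicates whether the problem is a missing repository, an authentication failure, or a network error.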

### **Issue: "Failed to save experiments"**
```bash
# Solutions:
# 1. Check token has write permissions
# 2. Verify dataset repository exists
# 3. Check network connectivity
```

### **Issue: "Monitoring not working"**
```bash
# Solutions:
# 1. Check environment variables
# 2. Run configuration test: python configure_trackio.py
# 3. Check logs for specific errors
```

## 📈 Benefits

### **For HF Spaces Deployment**
- ✅ **Persistent Storage**: Data survives Space restarts
- ✅ **No Local Storage**: No dependency on ephemeral storage
- ✅ **Scalable**: Works with any dataset size
- ✅ **Secure**: Private dataset storage

### **For Experiment Management**
- ✅ **Centralized**: All experiments in one place
- ✅ **Searchable**: Easy to find specific experiments
- ✅ **Versioned**: Dataset versioning for experiments
- ✅ **Collaborative**: Share experiments with team

### **For Development**
- ✅ **Flexible**: Easy to switch between datasets
- ✅ **Configurable**: Environment-based configuration
- ✅ **Robust**: Fallback mechanisms
- ✅ **Debuggable**: Comprehensive logging

## 🎯 Next Steps

1. **Set up your HF token and dataset repository**
2. **Test the configuration with `python configure_trackio.py`**
3. **Run a training experiment to verify monitoring**
4. **Check your HF Dataset repository for experiment data**
5. **View results in your Trackio interface**

## 📚 Related Files

- `monitoring.py` - Enhanced monitoring with HF Datasets support
- `train.py` - Updated training script with monitoring integration
- `configure_trackio.py` - Configuration and testing script
- `setup_hf_dataset.py` - Dataset repository setup
- `test_hf_datasets.py` - Dataset access testing
- `ENVIRONMENT_VARIABLES.md` - Environment variable reference
- `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide

---

**🎉 Your experiments are now persistently stored and easily accessible!**