# 🚀 Trackio on Hugging Face Spaces - Complete Guide

## Overview

This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.

## πŸ—οΈ Hugging Face Spaces Architecture

### Key Challenges

1. **Ephemeral Storage**: The file system is reset whenever the Space restarts or is redeployed
2. **No Persistent Storage**: Files written at runtime do not survive those resets
3. **Multiple Instances**: Training and monitoring might run in different environments
4. **Limited File System**: Restricted write permissions in certain directories

### How Trackio Handles HF Spaces

The updated Trackio app now includes:

- **Automatic HF Spaces Detection**: Detects when running on HF Spaces
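- **Persistent Path Selection**: Writes to `/tmp/`, which is writable on HF Spaces; files there last only as long as the running container, so backup recovery covers restarts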
- **Backup Recovery**: Automatically recovers experiments from backup data
- **Fallback Storage**: Multiple storage locations for redundancy

## 📊 Your Current Experiments

Based on your logs, you have these experiments available:

### Experiment 1: `exp_20250720_130853`
- **Name**: petite-elle-l-aime-3
- **Status**: Running
- **Metrics**: 4 entries (steps 25, 50, 75, 100)
- **Key Metrics**: Loss decreasing from 1.1659 to 1.1528

### Experiment 2: `exp_20250720_134319`
- **Name**: petite-elle-l-aime-3-1
- **Status**: Running
- **Metrics**: 2 entries (step 25)
- **Key Metrics**: Loss 1.166, GPU memory usage

## 🎯 How to Use Your Experiments

### 1. View Experiments
- Go to the "View Experiments" tab
- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
- Click "View Experiment" to see details

### 2. Create Plots
- Go to the "Visualizations" tab
- Enter experiment ID
- Select metric to plot:
  - `loss` - Training loss curve
  - `learning_rate` - Learning rate schedule
  - `mean_token_accuracy` - Token accuracy
  - `grad_norm` - Gradient norm
  - `gpu_0_memory_allocated` - GPU memory usage
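
As a rough illustration of what plotting a metric involves, the sketch below pulls one metric out of a list of logged entries as `(steps, values)` pairs. The record layout (`step` plus a `metrics` dict) and the function name `metric_series` are assumptions made for this example, not the app's confirmed internal format; the two loss values are the first and last ones quoted in this guide.

```python
from typing import Dict, List, Tuple


def metric_series(entries: List[Dict], name: str) -> Tuple[List[int], List[float]]:
    """Return (steps, values) for one metric, skipping entries that lack it."""
    steps, values = [], []
    for entry in entries:
        metrics = entry.get("metrics", {})
        if name in metrics:
            steps.append(entry["step"])
            values.append(metrics[name])
    return steps, values


# Hypothetical entries in the assumed layout (only the endpoints are shown).
example_entries = [
    {"step": 25, "metrics": {"loss": 1.1659}},
    {"step": 100, "metrics": {"loss": 1.1528}},
]

steps, losses = metric_series(example_entries, "loss")
```

The resulting lists can be passed to any plotting library as x and y values.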

### 3. Compare Experiments
- Use the "Experiment Comparison" feature
- Enter: `exp_20250720_130853,exp_20250720_134319`
- Compare loss curves between experiments
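
The comparison field takes a comma-separated list of IDs. A minimal sketch of how such input could be normalized is shown here; `parse_experiment_ids` is a hypothetical helper, not a function from the Trackio app.

```python
from typing import List


def parse_experiment_ids(raw: str) -> List[str]:
    """Split a comma-separated ID string, dropping blanks and duplicates."""
    ids: List[str] = []
    for part in raw.split(","):
        exp_id = part.strip()
        if exp_id and exp_id not in ids:
            ids.append(exp_id)
    return ids


ids = parse_experiment_ids("exp_20250720_130853, exp_20250720_134319,")
```

Stripping whitespace and empty parts makes the field tolerant of a trailing comma or a space after the separator.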

## 🔧 Technical Details

### Data Persistence Strategy

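```python
import os

# HF Spaces detection: SPACE_ID is present in the environment on Spaces
if os.environ.get('SPACE_ID'):
    # /tmp/ is writable on Spaces
    data_file = "/tmp/trackio_experiments.json"
else:
    # Local development: keep the file in the working directory
    data_file = "trackio_experiments.json"
```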

### Backup Recovery

The app automatically recovers your experiments from backup data when:
- Running on HF Spaces
- No existing experiments found
- Data file is missing or empty
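
The three conditions above can be sketched as a single check. This is an illustration under assumptions: the function name `needs_recovery` and the top-level `experiments` key in the JSON file are hypothetical, and the real app may decide differently.

```python
import json
import os


def needs_recovery(data_file: str) -> bool:
    """True on HF Spaces when the data file is missing, empty, or has no experiments."""
    if not os.environ.get("SPACE_ID"):
        return False  # not on HF Spaces: no automatic recovery
    if not os.path.exists(data_file) or os.path.getsize(data_file) == 0:
        return True  # data file is missing or empty
    try:
        with open(data_file, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        return True  # unreadable or corrupt file is treated like a missing one
    # Assumed layout: experiments stored under an "experiments" key.
    return not (isinstance(data, dict) and data.get("experiments"))
```

Treating a corrupt file the same as a missing one is a design choice of this sketch; it favors restoring known-good backup data over failing at startup.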

### Storage Locations

1. **Primary**: `/tmp/trackio_experiments.json`
2. **Backup**: `/tmp/trackio_backup.json`
3. **Fallback**: Local directory (for development)
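
One way to read these locations in order is sketched below: try each path and return the first one that holds valid JSON. The helper `load_first_available` and the local fallback file name are assumptions for illustration; only the two `/tmp/` paths come from this guide.

```python
import json

STORAGE_CANDIDATES = [
    "/tmp/trackio_experiments.json",  # primary
    "/tmp/trackio_backup.json",       # backup
    "trackio_experiments.json",       # local fallback (assumed file name)
]


def load_first_available(paths=STORAGE_CANDIDATES):
    """Return (path, data) for the first readable JSON file, or (None, {})."""
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as f:
                return path, json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # missing or corrupt: try the next location
    return None, {}
```

Returning the path along with the data lets the caller log which location was actually used.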

## 🚀 Deployment Best Practices

### 1. Environment Variables
```bash
# SPACE_ID is provided automatically by HF Spaces at runtime
SPACE_ID=your-space-id
# Set TRACKIO_URL yourself in the Space settings
TRACKIO_URL=https://your-space.hf.space
```

### 2. File Structure
```
your-space/
├── app.py                 # Main Trackio app
├── requirements.txt       # Dependencies
├── README.md              # Space description
└── .gitignore             # Ignore temporary files
```

### 3. Requirements
```txt
gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
```

## 📈 Monitoring Your Training

### Real-time Metrics
Your experiments show:
- **Loss**: Decreasing from 1.1659 to 1.1528 over steps 25 to 100 (a small but steady decline)
- **Learning Rate**: Increasing from 7e-08 to 2.8875e-07, consistent with a warmup phase
- **Token Accuracy**: Around 75-76% (reasonable for early training)
- **GPU Memory**: ~17 GB allocated, 75 GB reserved

### Expected Behavior
- Loss should continue decreasing
- Learning rate will follow a cosine schedule
- Token accuracy should improve over time
- GPU memory usage should remain stable
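
To make the expected learning-rate shape concrete, here is a generic sketch of linear warmup followed by cosine decay. It is a textbook form, not the trainer's actual scheduler, and every parameter (`peak_lr`, `warmup_steps`, `total_steps`) is a placeholder rather than a value from these runs.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp steps past the end of training
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Under a schedule of this shape, a learning rate that is still rising in the first logged steps is expected, and the decay only becomes visible after warmup ends.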

## πŸ” Troubleshooting

### Issue: "No metrics data available"
**Solution**: The app now automatically recovers experiments from backup

### Issue: Plots not showing
**Solution**: 
1. Check experiment ID is correct
2. Try different metrics (loss, learning_rate, etc.)
3. Refresh the page

### Issue: Data not persisting
**Solution**: 
1. The app now writes to `/tmp/`, which is writable on HF Spaces
2. Backup recovery ensures data availability
3. Multiple storage locations provide redundancy

## 🎯 Next Steps

1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
2. **Test Plots**: Try plotting your experiments
3. **Monitor Training**: Continue monitoring your training runs
4. **Add New Experiments**: Create new experiments as needed

## 📞 Support

If you encounter issues:
1. Check the logs in your HF Space
2. Verify experiment IDs are correct
3. Try the backup recovery feature
4. Contact the Space maintainer for additional support

---

**Your experiments are now properly configured and should display correctly in the Trackio interface!** 🎉