Spaces:

Tonic
/

SmolFactory

Running

File size: 4,737 Bytes

ebe598e

# 🚀 Trackio on Hugging Face Spaces - Complete Guide

## Overview

This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.

## 🏗️ Hugging Face Spaces Architecture

### Key Challenges

1. **Ephemeral Storage**: File system gets reset between deployments
2. **No Persistent Storage**: Files written during runtime don't persist
3. **Multiple Instances**: Training and monitoring might run in different environments
4. **Limited File System**: Restricted write permissions in certain directories

### How Trackio Handles HF Spaces

The updated Trackio app now includes:

- **Automatic HF Spaces Detection**: Detects when running on HF Spaces
- **Persistent Path Selection**: Uses `/tmp/` for better persistence
- **Backup Recovery**: Automatically recovers experiments from backup data
- **Fallback Storage**: Multiple storage locations for redundancy

## 📊 Your Current Experiments

Based on your logs, you have these experiments available:

### Experiment 1: `exp_20250720_130853`
- **Name**: petite-elle-l-aime-3
- **Status**: Running
- **Metrics**: 4 entries (steps 25, 50, 75, 100)
- **Key Metrics**: Loss decreasing from 1.1659 to 1.1528

### Experiment 2: `exp_20250720_134319`
- **Name**: petite-elle-l-aime-3-1
- **Status**: Running
- **Metrics**: 2 entries (step 25)
- **Key Metrics**: Loss 1.166, GPU memory usage

## 🎯 How to Use Your Experiments

### 1. View Experiments
- Go to the "View Experiments" tab
- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
- Click "View Experiment" to see details

### 2. Create Plots
- Go to the "Visualizations" tab
- Enter experiment ID
- Select metric to plot:
  - `loss` - Training loss curve
  - `learning_rate` - Learning rate schedule
  - `mean_token_accuracy` - Token accuracy
  - `grad_norm` - Gradient norm
  - `gpu_0_memory_allocated` - GPU memory usage

### 3. Compare Experiments
- Use the "Experiment Comparison" feature
- Enter: `exp_20250720_130853,exp_20250720_134319`
- Compare loss curves between experiments

## 🔧 Technical Details

### Data Persistence Strategy

```python
# HF Spaces detection
if os.environ.get('SPACE_ID'):
    data_file = "/tmp/trackio_experiments.json"
else:
    data_file = "trackio_experiments.json"
```

### Backup Recovery

The app automatically recovers your experiments from backup data when:
- Running on HF Spaces
- No existing experiments found
- Data file is missing or empty

### Storage Locations

1. **Primary**: `/tmp/trackio_experiments.json`
2. **Backup**: `/tmp/trackio_backup.json`
3. **Fallback**: Local directory (for development)

## 🚀 Deployment Best Practices

### 1. Environment Variables
```bash
# Set in HF Spaces environment
SPACE_ID=your-space-id
TRACKIO_URL=https://your-space.hf.space
```

### 2. File Structure
```
your-space/
├── app.py                 # Main Trackio app
├── requirements.txt       # Dependencies
├── README.md             # Space description
└── .gitignore           # Ignore temporary files
```

### 3. Requirements
```txt
gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
```

## 📈 Monitoring Your Training

### Real-time Metrics
Your experiments show:
- **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence)
- **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07
- **Token Accuracy**: Around 75-76% (reasonable for early training)
- **GPU Memory**: ~17GB allocated, 75GB reserved

### Expected Behavior
- Loss should continue decreasing
- Learning rate will follow cosine schedule
- Token accuracy should improve over time
- GPU memory usage should remain stable

## 🔍 Troubleshooting

### Issue: "No metrics data available"
**Solution**: The app now automatically recovers experiments from backup

### Issue: Plots not showing
**Solution**: 
1. Check experiment ID is correct
2. Try different metrics (loss, learning_rate, etc.)
3. Refresh the page

### Issue: Data not persisting
**Solution**: 
1. App now uses `/tmp/` for better persistence
2. Backup recovery ensures data availability
3. Multiple storage locations provide redundancy

## 🎯 Next Steps

1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
2. **Test Plots**: Try plotting your experiments
3. **Monitor Training**: Continue monitoring your training runs
4. **Add New Experiments**: Create new experiments as needed

## 📞 Support

If you encounter issues:
1. Check the logs in your HF Space
2. Verify experiment IDs are correct
3. Try the backup recovery feature
4. Contact for additional support

---

**Your experiments are now properly configured and should display correctly in the Trackio interface!** 🎉