# Trackio on Hugging Face Spaces - Complete Guide
## Overview
This guide explains how to deploy and use Trackio on Hugging Face Spaces, with a focus on the challenges of ephemeral storage and data persistence.
## Hugging Face Spaces Architecture
### Key Challenges
1. **Ephemeral Storage**: The file system is reset between deployments
2. **No Persistent Storage**: Files written at runtime do not survive a restart
3. **Multiple Instances**: Training and monitoring may run in different environments
4. **Limited File System**: Write permissions are restricted in some directories
### How Trackio Handles HF Spaces
The updated Trackio app now includes:
- **Automatic HF Spaces Detection**: Detects when it is running on HF Spaces
- **Persistent Path Selection**: Uses `/tmp/`, which is writable on Spaces and lasts for the life of the running container (not across restarts or rebuilds)
- **Backup Recovery**: Automatically recovers experiments from backup data
- **Fallback Storage**: Multiple storage locations for redundancy
## Your Current Experiments
Based on your logs, you have these experiments available:
### Experiment 1: `exp_20250720_130853`
- **Name**: petite-elle-l-aime-3
- **Status**: Running
- **Metrics**: 4 entries (steps 25, 50, 75, 100)
- **Key Metrics**: Loss decreasing from 1.1659 to 1.1528
### Experiment 2: `exp_20250720_134319`
- **Name**: petite-elle-l-aime-3-1
- **Status**: Running
- **Metrics**: 2 entries (step 25)
- **Key Metrics**: Loss 1.166, GPU memory usage
## How to Use Your Experiments
### 1. View Experiments
- Go to the "View Experiments" tab
- Enter an experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
- Click "View Experiment" to see the details
### 2. Create Plots
- Go to the "Visualizations" tab
- Enter the experiment ID
- Select the metric to plot:
  - `loss` - Training loss curve
  - `learning_rate` - Learning rate schedule
  - `mean_token_accuracy` - Token accuracy
  - `grad_norm` - Gradient norm
  - `gpu_0_memory_allocated` - GPU memory usage
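
To make the plotting step more concrete, here is a minimal sketch of how one metric could be pulled out of an experiment's logged entries before it is plotted. The entry layout (a list of dicts with a `step` key plus metric names) and the helper `metric_series` are assumptions for illustration, not Trackio's actual storage format. The two sample values are the first and last loss readings quoted in this guide.

```python
from __future__ import annotations

from typing import Any


def metric_series(entries: list[dict[str, Any]], name: str) -> tuple[list[int], list[float]]:
    """Return (steps, values) for one metric, ordered by step.

    Entries that do not contain the metric are skipped, so a metric that was
    only logged occasionally still yields a clean series.
    """
    steps: list[int] = []
    values: list[float] = []
    for entry in sorted(entries, key=lambda e: e["step"]):
        if name in entry:
            steps.append(entry["step"])
            values.append(float(entry[name]))
    return steps, values


# Hypothetical entries: only the first and last loss values from this guide.
entries = [
    {"step": 100, "loss": 1.1528},
    {"step": 25, "loss": 1.1659},
]
steps, losses = metric_series(entries, "loss")
```

If a metric name returns two empty lists, the experiment has no data for it, which is one reason a plot may come up blank.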
### 3. Compare Experiments
- Use the "Experiment Comparison" feature
- Enter: `exp_20250720_130853,exp_20250720_134319`
- Compare loss curves between the experiments
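
The comparison field takes a comma-separated list of IDs. As a rough sketch of how such input could be normalized (the function name `parse_experiment_ids` is hypothetical and not taken from the Trackio source), stray spaces, empty items, and duplicates can be dropped before any lookup:

```python
def parse_experiment_ids(raw: str) -> list[str]:
    """Split a comma-separated ID string into a clean, de-duplicated list.

    Order is preserved so curves appear in the order the user typed them.
    """
    ids: list[str] = []
    for part in raw.split(","):
        exp_id = part.strip()
        if exp_id and exp_id not in ids:
            ids.append(exp_id)
    return ids


ids = parse_experiment_ids("exp_20250720_130853, exp_20250720_134319,")
```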
## Technical Details
### Data Persistence Strategy
```python
import os

# HF Spaces sets SPACE_ID, so its presence signals that we are running there
if os.environ.get("SPACE_ID"):
    data_file = "/tmp/trackio_experiments.json"
else:
    data_file = "trackio_experiments.json"
```
### Backup Recovery
The app automatically recovers your experiments from backup data when:
- Running on HF Spaces
- No existing experiments are found
- The data file is missing or empty
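
The conditions above could be checked along the following lines. This is a sketch under assumptions: it reads the three conditions as "on Spaces, and nothing usable in the data file", it assumes experiments are stored as a JSON object keyed by experiment ID, and the helper names `load_experiments` and `should_recover` are invented for illustration.

```python
import json
from pathlib import Path


def load_experiments(path: Path) -> dict:
    """Return stored experiments, or {} if the file is missing, empty, or invalid."""
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        return {}
    if not text.strip():
        return {}
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return {}
    # Assumed layout: a JSON object mapping experiment IDs to their records.
    return data if isinstance(data, dict) else {}


def should_recover(path: Path, on_spaces: bool) -> bool:
    """Recover from backup only on Spaces, and only if no experiments were loaded."""
    return on_spaces and not load_experiments(path)
```

Treating an unreadable or corrupt file the same as a missing one keeps the recovery path simple, at the cost of hiding the reason the file was unusable.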
### Storage Locations
1. **Primary**: `/tmp/trackio_experiments.json`
2. **Backup**: `/tmp/trackio_backup.json`
3. **Fallback**: Local directory (for development)
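
One way to use several storage locations is to write the same payload to each of them and skip any that reject the write. The sketch below is an illustration rather than the app's actual code: `save_experiments` is a hypothetical helper, and the write-then-rename step is a design choice added here so that a crash mid-write does not leave a half-written file.

```python
import json
import os
from pathlib import Path


def save_experiments(experiments: dict, candidates: list[Path]) -> list[Path]:
    """Write experiments to every candidate path that accepts writes.

    Returns the paths that were written successfully, in the order given.
    """
    payload = json.dumps(experiments, indent=2)
    written: list[Path] = []
    for path in candidates:
        try:
            path.parent.mkdir(parents=True, exist_ok=True)
            tmp = path.parent / (path.name + ".tmp")
            tmp.write_text(payload, encoding="utf-8")
            os.replace(tmp, path)  # swap the finished file into place
            written.append(path)
        except OSError:
            # Location is not writable here; try the next candidate.
            continue
    return written
```

A call such as `save_experiments(data, [Path("/tmp/trackio_experiments.json"), Path("/tmp/trackio_backup.json")])` would cover the primary and backup paths listed above; an empty return value means nothing could be saved.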
## Deployment Best Practices
### 1. Environment Variables
```bash
# Set in the HF Spaces environment
# (Spaces normally provides SPACE_ID itself at runtime)
SPACE_ID=your-space-id
TRACKIO_URL=https://your-space.hf.space
```
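
On the Python side, these variables could be read as in the sketch below. The helper `read_space_config` and the shape of the dict it returns are assumptions for illustration. Only the variable names `SPACE_ID` and `TRACKIO_URL` come from this guide.

```python
import os
from typing import Mapping, Optional


def read_space_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the Space-related settings from the environment.

    Passing a plain dict as `env` makes the function easy to test;
    by default it reads from os.environ.
    """
    env = os.environ if env is None else env
    space_id = env.get("SPACE_ID")
    url = (env.get("TRACKIO_URL") or "").rstrip("/")
    return {
        "on_spaces": bool(space_id),
        "space_id": space_id,
        "trackio_url": url or None,  # None when unset or empty
    }


config = read_space_config({"SPACE_ID": "your-space-id",
                            "TRACKIO_URL": "https://your-space.hf.space/"})
```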
### 2. File Structure
```
your-space/
├── app.py              # Main Trackio app
├── requirements.txt    # Dependencies
├── README.md           # Space description
└── .gitignore          # Ignore temporary files
```
### 3. Requirements
```txt
gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
```
## Monitoring Your Training
### Real-time Metrics
Your experiments show:
- **Loss**: Decreasing from 1.1659 to 1.1528 (a small but steady improvement)
- **Learning Rate**: Rising from 7e-08 to 2.8875e-07, consistent with a warmup phase
- **Token Accuracy**: Around 75-76% (reasonable for early training)
- **GPU Memory**: ~17 GB allocated, 75 GB reserved
### Expected Behavior
- Loss should continue decreasing
- The learning rate should follow a cosine schedule
- Token accuracy should improve over time
- GPU memory usage should remain stable
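
For reference, a cosine schedule with linear warmup can be written as a short function. This is a generic sketch, not the configuration of the runs above: `peak_lr`, `warmup_steps`, `total_steps`, and `min_lr` are hypothetical parameters, and your trainer's exact formula may differ.

```python
import math


def lr_at(step: int, peak_lr: float, warmup_steps: int, total_steps: int,
          min_lr: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup steps.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Plotting `learning_rate` in the Visualizations tab and comparing its shape against a curve like this is a quick check that the schedule behaves as intended.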
## Troubleshooting
### Issue: "No metrics data available"
**Solution**: The app now recovers experiments from backup automatically
### Issue: Plots not showing
**Solution**:
1. Check that the experiment ID is correct
2. Try a different metric (`loss`, `learning_rate`, etc.)
3. Refresh the page
### Issue: Data not persisting
**Solution**:
1. The app now writes to `/tmp/`, which is writable on Spaces and lasts while the container is running
2. Backup recovery restores experiments when the data file is missing or empty
3. Multiple storage locations provide redundancy
## Next Steps
1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
2. **Test Plots**: Try plotting your experiments
3. **Monitor Training**: Continue monitoring your training runs
4. **Add New Experiments**: Create new experiments as needed
## Support
If you encounter issues:
1. Check the logs in your HF Space
2. Verify that the experiment IDs are correct
3. Try the backup recovery feature
4. Reach out for additional support if the problem persists
---
**Your experiments are now configured and should display correctly in the Trackio interface!**