SmolFactory / docs /HF_SPACES_GUIDE.md
Tonic's picture
adds formatting fix
ebe598e verified
|
raw
history blame
4.74 kB
# πŸš€ Trackio on Hugging Face Spaces - Complete Guide
## Overview
This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.
## πŸ—οΈ Hugging Face Spaces Architecture
### Key Challenges
1. **Ephemeral Storage**: File system gets reset between deployments
2. **No Persistent Storage**: Files written during runtime don't persist
3. **Multiple Instances**: Training and monitoring might run in different environments
4. **Limited File System**: Restricted write permissions in certain directories
### How Trackio Handles HF Spaces
The updated Trackio app now includes:
- **Automatic HF Spaces Detection**: Detects when running on HF Spaces
- **Persistent Path Selection**: Uses `/tmp/` for better persistence
- **Backup Recovery**: Automatically recovers experiments from backup data
- **Fallback Storage**: Multiple storage locations for redundancy
## πŸ“Š Your Current Experiments
Based on your logs, you have these experiments available:
### Experiment 1: `exp_20250720_130853`
- **Name**: petite-elle-l-aime-3
- **Status**: Running
- **Metrics**: 4 entries (steps 25, 50, 75, 100)
- **Key Metrics**: Loss decreasing from 1.1659 to 1.1528
### Experiment 2: `exp_20250720_134319`
- **Name**: petite-elle-l-aime-3-1
- **Status**: Running
- **Metrics**: 2 entries (step 25)
- **Key Metrics**: Loss 1.166, GPU memory usage
## 🎯 How to Use Your Experiments
### 1. View Experiments
- Go to the "View Experiments" tab
- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
- Click "View Experiment" to see details
### 2. Create Plots
- Go to the "Visualizations" tab
- Enter experiment ID
- Select metric to plot:
- `loss` - Training loss curve
- `learning_rate` - Learning rate schedule
- `mean_token_accuracy` - Token accuracy
- `grad_norm` - Gradient norm
- `gpu_0_memory_allocated` - GPU memory usage
### 3. Compare Experiments
- Use the "Experiment Comparison" feature
- Enter: `exp_20250720_130853,exp_20250720_134319`
- Compare loss curves between experiments
## πŸ”§ Technical Details
### Data Persistence Strategy
```python
# HF Spaces detection
if os.environ.get('SPACE_ID'):
data_file = "/tmp/trackio_experiments.json"
else:
data_file = "trackio_experiments.json"
```
### Backup Recovery
The app automatically recovers your experiments from backup data when:
- Running on HF Spaces
- No existing experiments found
- Data file is missing or empty
### Storage Locations
1. **Primary**: `/tmp/trackio_experiments.json`
2. **Backup**: `/tmp/trackio_backup.json`
3. **Fallback**: Local directory (for development)
## πŸš€ Deployment Best Practices
### 1. Environment Variables
```bash
# Set in HF Spaces environment
SPACE_ID=your-space-id
TRACKIO_URL=https://your-space.hf.space
```
### 2. File Structure
```
your-space/
β”œβ”€β”€ app.py # Main Trackio app
β”œβ”€β”€ requirements.txt # Dependencies
β”œβ”€β”€ README.md # Space description
└── .gitignore # Ignore temporary files
```
### 3. Requirements
```txt
gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
```
## πŸ“ˆ Monitoring Your Training
### Real-time Metrics
Your experiments show:
- **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence)
- **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07
- **Token Accuracy**: Around 75-76% (reasonable for early training)
- **GPU Memory**: ~17GB allocated, 75GB reserved
### Expected Behavior
- Loss should continue decreasing
- Learning rate will follow cosine schedule
- Token accuracy should improve over time
- GPU memory usage should remain stable
## πŸ” Troubleshooting
### Issue: "No metrics data available"
**Solution**: The app now automatically recovers experiments from backup
### Issue: Plots not showing
**Solution**:
1. Check experiment ID is correct
2. Try different metrics (loss, learning_rate, etc.)
3. Refresh the page
### Issue: Data not persisting
**Solution**:
1. App now uses `/tmp/` for better persistence
2. Backup recovery ensures data availability
3. Multiple storage locations provide redundancy
## 🎯 Next Steps
1. **Deploy Updated App**: Push the updated `app.py` to your HF Space
2. **Test Plots**: Try plotting your experiments
3. **Monitor Training**: Continue monitoring your training runs
4. **Add New Experiments**: Create new experiments as needed
## πŸ“ž Support
If you encounter issues:
1. Check the logs in your HF Space
2. Verify experiment IDs are correct
3. Try the backup recovery feature
4. Contact for additional support
---
**Your experiments are now properly configured and should display correctly in the Trackio interface!** πŸŽ‰