Trackio on Hugging Face Spaces - Complete Guide
Overview
This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.
Hugging Face Spaces Architecture
Key Challenges
- Ephemeral Storage: File system gets reset between deployments
- No Persistent Storage: Files written during runtime don't persist
- Multiple Instances: Training and monitoring might run in different environments
- Limited File System: Restricted write permissions in certain directories
How Trackio Handles HF Spaces
The updated Trackio app now includes:
- Automatic HF Spaces Detection: Detects when running on HF Spaces
- Persistent Path Selection: Uses /tmp/ for better persistence (files there last for the lifetime of the running container, not across restarts or rebuilds)
- Backup Recovery: Automatically recovers experiments from backup data
- Fallback Storage: Multiple storage locations for redundancy
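The fallback-storage idea above can be sketched as a function that walks an ordered list of candidate directories and returns the first one it can actually write to. This is a minimal illustration, not Trackio's own code: `pick_storage_dir`, `default_candidates`, and the ordering of the candidates are assumptions made for the example.

```python
import os
import tempfile


def pick_storage_dir(candidates):
    """Return the first candidate directory that can be created and written to."""
    for directory in candidates:
        try:
            os.makedirs(directory, exist_ok=True)
            # Probe writability by creating and discarding a throwaway file.
            with tempfile.NamedTemporaryFile(dir=directory):
                pass
            return directory
        except OSError:
            # Not creatable or not writable: try the next candidate.
            continue
    raise RuntimeError("no writable storage directory found")


def default_candidates():
    """Hypothetical ordering: /tmp first on Spaces, the working directory first elsewhere."""
    if os.environ.get("SPACE_ID"):
        return ["/tmp", os.getcwd()]
    return [os.getcwd(), tempfile.gettempdir()]
```

Probing with a real temporary file is slower than checking permissions bits, but it reflects what the process can actually do in a sandboxed container.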
Your Current Experiments
Based on your logs, you have these experiments available:
Experiment 1: exp_20250720_130853
- Name: petite-elle-l-aime-3
- Status: Running
- Metrics: 4 entries (steps 25, 50, 75, 100)
- Key Metrics: Loss decreasing from 1.1659 to 1.1528
Experiment 2: exp_20250720_134319
- Name: petite-elle-l-aime-3-1
- Status: Running
- Metrics: 2 entries (step 25)
- Key Metrics: Loss 1.166, GPU memory usage
How to Use Your Experiments
1. View Experiments
- Go to the "View Experiments" tab
- Enter experiment ID: exp_20250720_130853 or exp_20250720_134319
- Click "View Experiment" to see details
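A mistyped ID is an easy way to end up with an empty view, so it can help to check the input before looking it up. The sketch below assumes the ID format is `exp_` followed by an 8-digit date and a 6-digit time, which is inferred only from the two IDs listed in this guide; `normalize_experiment_id` is a hypothetical helper, not part of the app.

```python
import re

# Assumed format, inferred from the two IDs in this guide: exp_<8 digits>_<6 digits>.
EXPERIMENT_ID_PATTERN = re.compile(r"exp_\d{8}_\d{6}")


def normalize_experiment_id(raw):
    """Strip surrounding whitespace; return the ID, or None if it does not fit the assumed format."""
    candidate = raw.strip()
    if EXPERIMENT_ID_PATTERN.fullmatch(candidate):
        return candidate
    return None
```

For example, `normalize_experiment_id(" exp_20250720_130853 ")` returns the cleaned ID, while a pasted string with stray leading characters returns `None`.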
2. Create Plots
- Go to the "Visualizations" tab
- Enter experiment ID
- Select metric to plot:
  - loss - Training loss curve
  - learning_rate - Learning rate schedule
  - mean_token_accuracy - Token accuracy
  - grad_norm - Gradient norm
  - gpu_0_memory_allocated - GPU memory usage
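Plotting a metric comes down to pulling one named value out of each logged entry and ordering the results by step. The sketch below assumes each entry is a dict of the form `{"step": ..., "metrics": {...}}`; that layout and the name `metric_series` are illustrative assumptions, since this guide does not show the stored format.

```python
def metric_series(entries, name):
    """Return (step, value) pairs for one metric, sorted by step.

    Entries that do not contain the metric are skipped, which is why a
    plot for a rarely logged metric can have fewer points than the loss plot.
    """
    points = [
        (entry["step"], entry["metrics"][name])
        for entry in entries
        if name in entry.get("metrics", {})
    ]
    return sorted(points)
```

The resulting list can be unpacked into x and y values for whichever plotting library the app uses.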
3. Compare Experiments
- Use the "Experiment Comparison" feature
- Enter: exp_20250720_130853,exp_20250720_134319
- Compare loss curves between experiments
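The comparison field takes a comma-separated list, so stray spaces or a trailing comma are worth tolerating. A minimal sketch of that parsing step follows; `parse_experiment_ids` is a hypothetical name and the app's real handling may differ.

```python
def parse_experiment_ids(text):
    """Split a comma-separated string into unique, non-empty IDs, keeping input order."""
    ids = []
    for part in text.split(","):
        experiment_id = part.strip()
        # Drop blanks (e.g. from a trailing comma) and repeated IDs.
        if experiment_id and experiment_id not in ids:
            ids.append(experiment_id)
    return ids
```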
Technical Details
Data Persistence Strategy
import os

# HF Spaces detection: SPACE_ID is present in the environment on Spaces
if os.environ.get('SPACE_ID'):
    data_file = "/tmp/trackio_experiments.json"
else:
    data_file = "trackio_experiments.json"
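Once `data_file` is chosen, the experiments have to be written and read back safely. The sketch below is one way to do that under the assumption that experiments are stored as a single JSON object; `save_experiments` and `load_experiments` are illustrative names, not functions confirmed to exist in the app. The write goes to a temporary file first and is then moved into place, so a crash mid-write does not leave a half-written data file.

```python
import json
import os
import tempfile


def save_experiments(data_file, experiments):
    """Write the experiments dict as JSON, replacing any old file atomically."""
    directory = os.path.dirname(os.path.abspath(data_file))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(experiments, handle, indent=2)
        # os.replace swaps the finished file into place in one step.
        os.replace(tmp_path, data_file)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_experiments(data_file):
    """Return stored experiments, or {} if the file is missing, empty, or not valid JSON."""
    try:
        with open(data_file, "r", encoding="utf-8") as handle:
            data = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}
```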
Backup Recovery
The app automatically recovers your experiments from backup data when:
- Running on HF Spaces
- No existing experiments found
- Data file is missing or empty
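The three conditions above can be read in more than one way. The sketch below assumes recovery runs only on Spaces, and then whenever there are no experiments in memory or the data file is missing or empty. Both that reading and the helper names `needs_recovery` and `recover` are assumptions made for illustration.

```python
import os


def needs_recovery(experiments, data_file, environ=os.environ):
    """Decide whether backup recovery should run (assumed combination of the conditions)."""
    on_spaces = bool(environ.get("SPACE_ID"))
    file_unusable = not os.path.exists(data_file) or os.path.getsize(data_file) == 0
    return on_spaces and (not experiments or file_unusable)


def recover(experiments, backup):
    """Merge backup entries into the current experiments without overwriting existing ones."""
    merged = dict(backup)
    merged.update(experiments)
    return merged
```

Merging rather than replacing means an experiment that is already loaded keeps its current metrics even if the backup holds an older copy.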
Storage Locations
- Primary: /tmp/trackio_experiments.json
- Backup: /tmp/trackio_backup.json
- Fallback: Local directory (for development)
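Reading from these locations in priority order might look like the sketch below: try each path in turn and stop at the first one that holds a non-empty JSON object. The function name `load_first_available` and the choice to treat an empty object as "nothing here" are assumptions for the example.

```python
import json


def load_first_available(paths):
    """Return (path, data) for the first path holding a non-empty JSON object, else (None, {})."""
    for path in paths:
        try:
            with open(path, "r", encoding="utf-8") as handle:
                data = json.load(handle)
        except (OSError, ValueError):
            # Missing, unreadable, or malformed file: move on to the next location.
            continue
        if isinstance(data, dict) and data:
            return path, data
    return None, {}
```

A call such as `load_first_available(["/tmp/trackio_experiments.json", "/tmp/trackio_backup.json", "trackio_experiments.json"])` mirrors the primary, backup, fallback order listed above. Returning the path alongside the data makes it easy to log which location was used.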
Deployment Best Practices
1. Environment Variables
# Set in HF Spaces environment
SPACE_ID=your-space-id
TRACKIO_URL=https://your-space.hf.space
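On the Python side, these two variables can be read once at startup and passed around as a small config object. This is a sketch only: `SpaceConfig` and `read_space_config` are hypothetical names, and treating an empty string the same as an unset variable is a design choice of the example, not documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpaceConfig:
    space_id: Optional[str]
    trackio_url: Optional[str]

    @property
    def on_spaces(self):
        """True when SPACE_ID was present and non-empty."""
        return self.space_id is not None


def read_space_config(environ=os.environ):
    """Read SPACE_ID and TRACKIO_URL, mapping unset or empty values to None."""
    space_id = environ.get("SPACE_ID") or None
    url = environ.get("TRACKIO_URL") or None
    if url is not None:
        # Drop a trailing slash so paths can be appended consistently.
        url = url.rstrip("/")
    return SpaceConfig(space_id=space_id, trackio_url=url)
```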
2. File Structure
your-space/
├── app.py            # Main Trackio app
├── requirements.txt  # Dependencies
├── README.md         # Space description
└── .gitignore        # Ignore temporary files
3. Requirements
gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
Monitoring Your Training
Real-time Metrics
Your experiments show:
- Loss: Decreasing from 1.1659 to 1.1528 (a small but steady drop of about 0.013 over the logged steps)
- Learning Rate: Rising from 7e-08 to 2.8875e-07, consistent with a warmup phase
- Token Accuracy: Around 75-76% (reasonable for early training)
- GPU Memory: ~17 GB allocated, 75 GB reserved
Expected Behavior
- Loss should continue decreasing
- Learning rate will follow cosine schedule
- Token accuracy should improve over time
- GPU memory usage should remain stable
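The first expectation can be checked programmatically with a simple comparison of the earliest and latest logged loss. The sketch below is deliberately crude (it ignores noise between the endpoints), and `loss_is_decreasing` with its `tolerance` parameter is a hypothetical helper. The usage line assumes the first and last loss values reported in this guide belong to steps 25 and 100.

```python
def loss_is_decreasing(series, tolerance=0.0):
    """Check whether loss fell from the first to the last logged point.

    series: list of (step, loss) pairs sorted by step.
    tolerance: minimum drop required before the trend counts as decreasing.
    """
    if len(series) < 2:
        return False
    first_loss = series[0][1]
    last_loss = series[-1][1]
    return (first_loss - last_loss) > tolerance


print(loss_is_decreasing([(25, 1.1659), (100, 1.1528)]))  # True
```

Raising `tolerance` is one way to avoid treating tiny fluctuations as progress.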
Troubleshooting
Issue: "No metrics data available"
Solution: The app now automatically recovers experiments from backup
Issue: Plots not showing
Solution:
- Check experiment ID is correct
- Try different metrics (loss, learning_rate, etc.)
- Refresh the page
Issue: Data not persisting
Solution:
- App now uses /tmp/ for better persistence
- Backup recovery ensures data availability
- Multiple storage locations provide redundancy
Next Steps
- Deploy Updated App: Push the updated app.py to your HF Space
- Test Plots: Try plotting your experiments
- Monitor Training: Continue monitoring your training runs
- Add New Experiments: Create new experiments as needed
Support
If you encounter issues:
- Check the logs in your HF Space
- Verify experiment IDs are correct
- Try the backup recovery feature
- Contact the Space maintainer for additional support
Your experiments are now properly configured and should display correctly in the Trackio interface!