Spaces:
Running
Running
| # π Trackio on Hugging Face Spaces - Complete Guide | |
| ## Overview | |
| This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence. | |
| ## ποΈ Hugging Face Spaces Architecture | |
| ### Key Challenges | |
| 1. **Ephemeral Storage**: File system gets reset between deployments | |
| 2. **No Persistent Storage**: Files written during runtime don't persist | |
| 3. **Multiple Instances**: Training and monitoring might run in different environments | |
| 4. **Limited File System**: Restricted write permissions in certain directories | |
| ### How Trackio Handles HF Spaces | |
| The updated Trackio app now includes: | |
| - **Automatic HF Spaces Detection**: Detects when running on HF Spaces | |
| - **Persistent Path Selection**: Uses `/tmp/` for better persistence | |
| - **Backup Recovery**: Automatically recovers experiments from backup data | |
| - **Fallback Storage**: Multiple storage locations for redundancy | |
| ## π Your Current Experiments | |
| Based on your logs, you have these experiments available: | |
| ### Experiment 1: `exp_20250720_130853` | |
| - **Name**: petite-elle-l-aime-3 | |
| - **Status**: Running | |
| - **Metrics**: 4 entries (steps 25, 50, 75, 100) | |
| - **Key Metrics**: Loss decreasing from 1.1659 to 1.1528 | |
| ### Experiment 2: `exp_20250720_134319` | |
| - **Name**: petite-elle-l-aime-3-1 | |
| - **Status**: Running | |
| - **Metrics**: 2 entries (step 25) | |
| - **Key Metrics**: Loss 1.166, GPU memory usage | |
| ## π― How to Use Your Experiments | |
| ### 1. View Experiments | |
| - Go to the "View Experiments" tab | |
| - Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319` | |
| - Click "View Experiment" to see details | |
| ### 2. Create Plots | |
| - Go to the "Visualizations" tab | |
| - Enter experiment ID | |
| - Select metric to plot: | |
| - `loss` - Training loss curve | |
| - `learning_rate` - Learning rate schedule | |
| - `mean_token_accuracy` - Token accuracy | |
| - `grad_norm` - Gradient norm | |
| - `gpu_0_memory_allocated` - GPU memory usage | |
| ### 3. Compare Experiments | |
| - Use the "Experiment Comparison" feature | |
| - Enter: `exp_20250720_130853,exp_20250720_134319` | |
| - Compare loss curves between experiments | |
| ## π§ Technical Details | |
| ### Data Persistence Strategy | |
| ```python | |
| # HF Spaces detection | |
| if os.environ.get('SPACE_ID'): | |
| data_file = "/tmp/trackio_experiments.json" | |
| else: | |
| data_file = "trackio_experiments.json" | |
| ``` | |
| ### Backup Recovery | |
| The app automatically recovers your experiments from backup data when: | |
| - Running on HF Spaces | |
| - No existing experiments found | |
| - Data file is missing or empty | |
| ### Storage Locations | |
| 1. **Primary**: `/tmp/trackio_experiments.json` | |
| 2. **Backup**: `/tmp/trackio_backup.json` | |
| 3. **Fallback**: Local directory (for development) | |
| ## π Deployment Best Practices | |
| ### 1. Environment Variables | |
| ```bash | |
| # Set in HF Spaces environment | |
| SPACE_ID=your-space-id | |
| TRACKIO_URL=https://your-space.hf.space | |
| ``` | |
| ### 2. File Structure | |
| ``` | |
| your-space/ | |
| βββ app.py # Main Trackio app | |
| βββ requirements.txt # Dependencies | |
| βββ README.md # Space description | |
| βββ .gitignore # Ignore temporary files | |
| ``` | |
| ### 3. Requirements | |
| ```txt | |
| gradio>=4.0.0 | |
| plotly>=5.0.0 | |
| pandas>=1.5.0 | |
| numpy>=1.24.0 | |
| ``` | |
| ## π Monitoring Your Training | |
| ### Real-time Metrics | |
| Your experiments show: | |
| - **Loss**: Decreasing from 1.1659 to 1.1528 (good convergence) | |
| - **Learning Rate**: Properly scheduled from 7e-08 to 2.8875e-07 | |
| - **Token Accuracy**: Around 75-76% (reasonable for early training) | |
| - **GPU Memory**: ~17GB allocated, 75GB reserved | |
| ### Expected Behavior | |
| - Loss should continue decreasing | |
| - Learning rate will follow cosine schedule | |
| - Token accuracy should improve over time | |
| - GPU memory usage should remain stable | |
| ## π Troubleshooting | |
| ### Issue: "No metrics data available" | |
| **Solution**: The app now automatically recovers experiments from backup | |
| ### Issue: Plots not showing | |
| **Solution**: | |
| 1. Check experiment ID is correct | |
| 2. Try different metrics (loss, learning_rate, etc.) | |
| 3. Refresh the page | |
| ### Issue: Data not persisting | |
| **Solution**: | |
| 1. App now uses `/tmp/` for better persistence | |
| 2. Backup recovery ensures data availability | |
| 3. Multiple storage locations provide redundancy | |
| ## π― Next Steps | |
| 1. **Deploy Updated App**: Push the updated `app.py` to your HF Space | |
| 2. **Test Plots**: Try plotting your experiments | |
| 3. **Monitor Training**: Continue monitoring your training runs | |
| 4. **Add New Experiments**: Create new experiments as needed | |
| ## π Support | |
| If you encounter issues: | |
| 1. Check the logs in your HF Space | |
| 2. Verify experiment IDs are correct | |
| 3. Try the backup recovery feature | |
| 4. Contact for additional support | |
| --- | |
| **Your experiments are now properly configured and should display correctly in the Trackio interface!** π |