Spaces:

Tonic
/

SmolFactory

Running

App Files Files Community

SmolFactory / docs /HF_SPACES_GUIDE.md

Tonic

adds formatting fix

ebe598e verified 3 months ago

preview code

raw

history blame

4.74 kB

	# 🚀 Trackio on Hugging Face Spaces - Complete Guide

	## Overview

	This guide explains how to properly deploy and use Trackio on Hugging Face Spaces, addressing the unique challenges of ephemeral storage and data persistence.

	## 🏗️ Hugging Face Spaces Architecture

	### Key Challenges

	1. Ephemeral Storage: File system gets reset between deployments
	2. No Persistent Storage: Files written during runtime don't persist
	3. Multiple Instances: Training and monitoring might run in different environments
	4. Limited File System: Restricted write permissions in certain directories

	### How Trackio Handles HF Spaces

	The updated Trackio app now includes:

	- Automatic HF Spaces Detection: Detects when running on HF Spaces
	- Persistent Path Selection: Uses `/tmp/` for better persistence
	- Backup Recovery: Automatically recovers experiments from backup data
	- Fallback Storage: Multiple storage locations for redundancy

	## 📊 Your Current Experiments

	Based on your logs, you have these experiments available:

	### Experiment 1: `exp_20250720_130853`
	- Name: petite-elle-l-aime-3
	- Status: Running
	- Metrics: 4 entries (steps 25, 50, 75, 100)
	- Key Metrics: Loss decreasing from 1.1659 to 1.1528

	### Experiment 2: `exp_20250720_134319`
	- Name: petite-elle-l-aime-3-1
	- Status: Running
	- Metrics: 2 entries (step 25)
	- Key Metrics: Loss 1.166, GPU memory usage

	## 🎯 How to Use Your Experiments

	### 1. View Experiments
	- Go to the "View Experiments" tab
	- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
	- Click "View Experiment" to see details

	### 2. Create Plots
	- Go to the "Visualizations" tab
	- Enter experiment ID
	- Select metric to plot:
	- `loss` - Training loss curve
	- `learning_rate` - Learning rate schedule
	- `mean_token_accuracy` - Token accuracy
	- `grad_norm` - Gradient norm
	- `gpu_0_memory_allocated` - GPU memory usage

	### 3. Compare Experiments
	- Use the "Experiment Comparison" feature
	- Enter: `exp_20250720_130853,exp_20250720_134319`
	- Compare loss curves between experiments

	## 🔧 Technical Details

	### Data Persistence Strategy

	```python
	# HF Spaces detection
	if os.environ.get('SPACE_ID'):
	data_file = "/tmp/trackio_experiments.json"
	else:
	data_file = "trackio_experiments.json"
	```

	### Backup Recovery

	The app automatically recovers your experiments from backup data when:
	- Running on HF Spaces
	- No existing experiments found
	- Data file is missing or empty

	### Storage Locations

	1. Primary: `/tmp/trackio_experiments.json`
	2. Backup: `/tmp/trackio_backup.json`
	3. Fallback: Local directory (for development)

	## 🚀 Deployment Best Practices

	### 1. Environment Variables
	```bash
	# Set in HF Spaces environment
	SPACE_ID=your-space-id
	TRACKIO_URL=https://your-space.hf.space
	```

	### 2. File Structure
	```
	your-space/
	├── app.py # Main Trackio app
	├── requirements.txt # Dependencies
	├── README.md # Space description
	└── .gitignore # Ignore temporary files
	```

	### 3. Requirements
	```txt
	gradio>=4.0.0
	plotly>=5.0.0
	pandas>=1.5.0
	numpy>=1.24.0
	```

	## 📈 Monitoring Your Training

	### Real-time Metrics
	Your experiments show:
	- Loss: Decreasing from 1.1659 to 1.1528 (good convergence)
	- Learning Rate: Properly scheduled from 7e-08 to 2.8875e-07
	- Token Accuracy: Around 75-76% (reasonable for early training)
	- GPU Memory: ~17GB allocated, 75GB reserved

	### Expected Behavior
	- Loss should continue decreasing
	- Learning rate will follow cosine schedule
	- Token accuracy should improve over time
	- GPU memory usage should remain stable

	## 🔍 Troubleshooting

	### Issue: "No metrics data available"
	Solution: The app now automatically recovers experiments from backup

	### Issue: Plots not showing
	Solution:
	1. Check experiment ID is correct
	2. Try different metrics (loss, learning_rate, etc.)
	3. Refresh the page

	### Issue: Data not persisting
	Solution:
	1. App now uses `/tmp/` for better persistence
	2. Backup recovery ensures data availability
	3. Multiple storage locations provide redundancy

	## 🎯 Next Steps

	1. Deploy Updated App: Push the updated `app.py` to your HF Space
	2. Test Plots: Try plotting your experiments
	3. Monitor Training: Continue monitoring your training runs
	4. Add New Experiments: Create new experiments as needed

	## 📞 Support

	If you encounter issues:
	1. Check the logs in your HF Space
	2. Verify experiment IDs are correct
	3. Try the backup recovery feature
	4. Contact for additional support

	---

	Your experiments are now properly configured and should display correctly in the Trackio interface! 🎉