SmolFactory / docs /HF_DATASETS_GUIDE.md
Tonic's picture
adds formatting fix
ebe598e verified
|
raw
history blame
7.36 kB
# πŸš€ Trackio with Hugging Face Datasets - Complete Guide
## Overview
This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.
## πŸ—οΈ Architecture
### Why HF Datasets?
1. **Persistent Storage**: Data survives Space restarts and redeployments
2. **Version Control**: Automatic versioning of experiment data
3. **Access Control**: Private datasets for security
4. **Reliability**: HF's infrastructure ensures data availability
5. **Scalability**: Handles large amounts of experiment data
### Data Flow
```
Training Script β†’ Trackio App β†’ HF Dataset β†’ Trackio App β†’ Plots
```
## πŸš€ Setup Instructions
### 1. Create HF Token
1. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens)
2. Create a new token with `write` permissions
3. Copy the token for use in your Space
### 2. Set Up Dataset Repository
```bash
# Run the setup script
python setup_hf_dataset.py
```
This will:
- Create a private dataset: `tonic/trackio-experiments`
- Add your existing experiments
- Configure the dataset for Trackio
### 3. Configure Hugging Face Space
#### Environment Variables
Set these in your HF Space settings:
```bash
HF_TOKEN=your_hf_token_here
TRACKIO_DATASET_REPO=your-username/your-dataset-name
```
**Environment Variables Explained:**
- `HF_TOKEN`: Your Hugging Face token (required for dataset access)
- `TRACKIO_DATASET_REPO`: Dataset repository to use (optional, defaults to `tonic/trackio-experiments`)
**Example Configurations:**
```bash
# Use default dataset
HF_TOKEN=your_token_here
# Use personal dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-username/trackio-experiments
# Use team dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-org/team-experiments
# Use project-specific dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-username/smollm3-experiments
```
#### Requirements
Update your `requirements.txt`:
```txt
gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
datasets>=2.14.0
huggingface-hub>=0.16.0
requests>=2.31.0
```
### 4. Deploy Updated App
The updated `app.py` now:
- Loads experiments from HF Dataset
- Saves new experiments to the dataset
- Falls back to backup data if dataset unavailable
- Provides better error handling
### 5. Configure Environment Variables
Use the configuration script to check your setup:
```bash
python configure_trackio.py
```
This script will:
- Show current environment variables
- Test dataset access
- Generate configuration file
- Provide usage examples
**Available Environment Variables:**
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository to use |
| `SPACE_ID` | Auto | None | HF Space ID (auto-detected) |
## πŸ“Š Dataset Schema
The HF Dataset contains these columns:
| Column | Type | Description |
|--------|------|-------------|
| `experiment_id` | string | Unique experiment identifier |
| `name` | string | Experiment name |
| `description` | string | Experiment description |
| `created_at` | string | ISO timestamp |
| `status` | string | running/completed/failed |
| `metrics` | string | JSON array of metric entries |
| `parameters` | string | JSON object of experiment parameters |
| `artifacts` | string | JSON array of artifacts |
| `logs` | string | JSON array of log entries |
| `last_updated` | string | ISO timestamp of last update |
## πŸ”§ Technical Details
### Loading Experiments
```python
from datasets import load_dataset
# Load from HF Dataset
dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)
# Convert to experiments dict
for row in dataset['train']:
experiment = {
'id': row['experiment_id'],
'metrics': json.loads(row['metrics']),
'parameters': json.loads(row['parameters']),
# ... other fields
}
```
### Saving Experiments
```python
from datasets import Dataset
from huggingface_hub import HfApi
# Convert experiments to dataset format
dataset_data = []
for exp_id, exp_data in experiments.items():
dataset_data.append({
'experiment_id': exp_id,
'metrics': json.dumps(exp_data['metrics']),
'parameters': json.dumps(exp_data['parameters']),
# ... other fields
})
# Push to HF Hub
dataset = Dataset.from_list(dataset_data)
dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)
```
## πŸ“ˆ Your Current Experiments
### Available Experiments
1. **`exp_20250720_130853`** (petite-elle-l-aime-3)
- 4 metric entries (steps 25, 50, 75, 100)
- Loss decreasing: 1.1659 β†’ 1.1528
- Good convergence pattern
2. **`exp_20250720_134319`** (petite-elle-l-aime-3-1)
- 2 metric entries (step 25)
- Loss: 1.166
- GPU memory tracking
### Metrics Available for Plotting
- `loss` - Training loss curve
- `learning_rate` - Learning rate schedule
- `mean_token_accuracy` - Token-level accuracy
- `grad_norm` - Gradient norm
- `num_tokens` - Tokens processed
- `epoch` - Training epoch
- `gpu_0_memory_allocated` - GPU memory usage
- `cpu_percent` - CPU usage
- `memory_percent` - System memory
## 🎯 Usage Instructions
### 1. View Experiments
- Go to "View Experiments" tab
- Enter experiment ID: `exp_20250720_130853` or `exp_20250720_134319`
- Click "View Experiment"
### 2. Create Plots
- Go to "Visualizations" tab
- Enter experiment ID
- Select metric to plot
- Click "Create Plot"
### 3. Compare Experiments
- Use "Experiment Comparison" feature
- Enter: `exp_20250720_130853,exp_20250720_134319`
- Compare loss curves
## πŸ” Troubleshooting
### Issue: "No metrics data available"
**Solutions**:
1. Check HF_TOKEN is set correctly
2. Verify dataset repository exists
3. Check network connectivity to HF Hub
### Issue: "Failed to load from dataset"
**Solutions**:
1. App falls back to backup data automatically
2. Check dataset permissions
3. Verify token has read access
### Issue: "Failed to save experiments"
**Solutions**:
1. Check token has write permissions
2. Verify dataset repository exists
3. Check network connectivity
## πŸš€ Benefits of This Approach
### βœ… Advantages
- **Persistent**: Data survives Space restarts
- **Reliable**: HF's infrastructure ensures availability
- **Secure**: Private datasets protect your data
- **Scalable**: Handles large amounts of experiment data
- **Versioned**: Automatic versioning of experiment data
### πŸ”„ Fallback Strategy
1. **Primary**: Load from HF Dataset
2. **Secondary**: Use backup data (your existing experiments)
3. **Tertiary**: Create new experiments locally
## πŸ“‹ Next Steps
1. **Set HF_TOKEN**: Add your token to Space environment
2. **Run Setup**: Execute `setup_hf_dataset.py`
3. **Deploy App**: Push updated `app.py` to your Space
4. **Test Plots**: Verify experiments load and plots work
5. **Monitor Training**: New experiments will be saved to dataset
## πŸ” Security Notes
- Dataset is **private** by default
- Only accessible with your HF_TOKEN
- Experiment data is stored securely on HF infrastructure
- No sensitive data is exposed publicly
---
**Your experiments are now configured for reliable persistence using Hugging Face Datasets!** πŸŽ‰