🚀 Trackio with Hugging Face Datasets - Complete Guide

Overview

This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.

🏗️ Architecture

Why HF Datasets?

Persistent Storage: Data survives Space restarts and redeployments
Version Control: Automatic versioning of experiment data
Access Control: Private datasets for security
Reliability: HF's infrastructure ensures data availability
Scalability: Handles large amounts of experiment data

Data Flow

Training Script → Trackio App → HF Dataset → Trackio App → Plots

🚀 Setup Instructions

1. Create HF Token

Go to Hugging Face Settings → Access Tokens
Create a new token with write permissions
Copy the token for use in your Space

2. Set Up Dataset Repository

# Run the setup script
python setup_hf_dataset.py

This will:

Create a private dataset: tonic/trackio-experiments
Add your existing experiments
Configure the dataset for Trackio
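setup_hf_dataset.py itself is not reproduced in this guide. As a rough sketch of the first step listed above, a private dataset repository could be created with `HfApi.create_repo` from `huggingface_hub`. The helper name `ensure_dataset_repo` and its `api` parameter are assumptions made for illustration and are not taken from the actual script.

```python
def ensure_dataset_repo(repo_id, token=None, api=None):
    """Create a private dataset repo if it is missing and return its id.

    Hypothetical helper: `api` lets a caller pass a stand-in client;
    by default a real huggingface_hub.HfApi is used.
    """
    if api is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import HfApi
        api = HfApi()
    api.create_repo(
        repo_id=repo_id,
        repo_type="dataset",  # a dataset repo, not a model or a Space
        private=True,         # matches the private dataset described above
        exist_ok=True,        # do not fail if the repo already exists
        token=token,
    )
    return repo_id


# Example call (needs network access and a token with write permissions):
# ensure_dataset_repo("tonic/trackio-experiments", token="hf_...")
```

Passing `exist_ok=True` makes the call safe to repeat, which is convenient if the setup step is run more than once.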

3. Configure Hugging Face Space

Environment Variables

Set these in your HF Space settings (store HF_TOKEN as a secret, not as a public variable):

HF_TOKEN=your_hf_token_here
TRACKIO_DATASET_REPO=your-username/your-dataset-name

Environment Variables Explained:

HF_TOKEN: Your Hugging Face token (required for dataset access)
TRACKIO_DATASET_REPO: Dataset repository to use (optional, defaults to tonic/trackio-experiments)

Example Configurations:

# Use default dataset
HF_TOKEN=your_token_here

# Use personal dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-username/trackio-experiments

# Use team dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-org/team-experiments

# Use project-specific dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-username/smollm3-experiments
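On the application side, these variables might be read as in the following sketch. It only illustrates the rule stated above (HF_TOKEN required, TRACKIO_DATASET_REPO optional with a default); `TrackioConfig` and `load_trackio_config` are hypothetical names, not identifiers from app.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


@dataclass(frozen=True)
class TrackioConfig:
    hf_token: str
    dataset_repo: str


def load_trackio_config(env: Optional[Mapping[str, str]] = None) -> TrackioConfig:
    """Read the Trackio settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise ValueError("HF_TOKEN is required for dataset access")
    # An unset or empty TRACKIO_DATASET_REPO falls back to the default repo.
    repo = (env.get("TRACKIO_DATASET_REPO") or "").strip() or DEFAULT_DATASET_REPO
    return TrackioConfig(hf_token=token, dataset_repo=repo)


# Example with an explicit mapping instead of the real environment:
config = load_trackio_config({"HF_TOKEN": "your_token_here"})
print(config.dataset_repo)
```

Accepting a mapping rather than reading `os.environ` directly keeps the function easy to check without touching the real Space settings.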

Requirements

Update your requirements.txt:

gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
datasets>=2.14.0
huggingface-hub>=0.16.0
requests>=2.31.0

4. Deploy Updated App

The updated app.py now:

Loads experiments from HF Dataset
Saves new experiments to the dataset
Falls back to backup data if dataset unavailable
Provides better error handling

5. Verify Your Configuration

Use the configuration script to check your setup:

python configure_trackio.py

This script will:

Show current environment variables
Test dataset access
Generate configuration file
Provide usage examples

Available Environment Variables:

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `HF_TOKEN` | Yes | None | Your Hugging Face token |
| `TRACKIO_DATASET_REPO` | No | `tonic/trackio-experiments` | Dataset repository to use |
| `SPACE_ID` | Auto | None | HF Space ID (auto-detected) |

📊 Dataset Schema

The HF Dataset contains these columns:

| Column | Type | Description |
|--------|------|-------------|
| `experiment_id` | string | Unique experiment identifier |
| `name` | string | Experiment name |
| `description` | string | Experiment description |
| `created_at` | string | ISO timestamp |
| `status` | string | running/completed/failed |
| `metrics` | string | JSON array of metric entries |
| `parameters` | string | JSON object of experiment parameters |
| `artifacts` | string | JSON array of artifacts |
| `logs` | string | JSON array of log entries |
| `last_updated` | string | ISO timestamp of last update |
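Because every column is a string, nested values have to be JSON-encoded on the way in and decoded on the way out. The sketch below shows one way to do that round trip for the columns listed above. The function names and the defaults for missing fields (empty strings, status "running", empty lists) are assumptions for illustration, not behaviour confirmed by app.py.

```python
import json
from datetime import datetime, timezone

# Columns that hold JSON-encoded values, with assumed defaults when absent.
JSON_DEFAULTS = {"metrics": [], "parameters": {}, "artifacts": [], "logs": []}


def experiment_to_row(experiment_id, exp):
    """Flatten an experiment dict into an all-string row matching the schema."""
    now = datetime.now(timezone.utc).isoformat()
    row = {
        "experiment_id": experiment_id,
        "name": exp.get("name", ""),
        "description": exp.get("description", ""),
        "created_at": exp.get("created_at", now),
        "status": exp.get("status", "running"),
        "last_updated": now,
    }
    for column, default in JSON_DEFAULTS.items():
        row[column] = json.dumps(exp.get(column, default))
    return row


def row_to_experiment(row):
    """Inverse of experiment_to_row: decode the JSON columns again."""
    exp = {k: v for k, v in row.items() if k not in JSON_DEFAULTS}
    exp["id"] = exp.pop("experiment_id")
    for column in JSON_DEFAULTS:
        exp[column] = json.loads(row[column])
    return exp
```

Keeping the list of JSON columns in one place means the encode and decode sides cannot drift apart when a column is added.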

🔧 Technical Details

Loading Experiments

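import json
import os

from datasets import load_dataset

# Token is read from the environment (set in the Space settings)
HF_TOKEN = os.environ["HF_TOKEN"]

# Load from HF Dataset
dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)

# Convert to experiments dict, keyed by experiment ID
experiments = {}
for row in dataset['train']:
    experiments[row['experiment_id']] = {
        'id': row['experiment_id'],
        'metrics': json.loads(row['metrics']),        # stored as a JSON string
        'parameters': json.loads(row['parameters']),  # stored as a JSON string
        # ... other fields
    }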

Saving Experiments

import json
import os

from datasets import Dataset

# Token is read from the environment (set in the Space settings)
HF_TOKEN = os.environ["HF_TOKEN"]

# Convert experiments (a dict keyed by experiment ID) to dataset format
dataset_data = []
for exp_id, exp_data in experiments.items():
    dataset_data.append({
        'experiment_id': exp_id,
        'metrics': json.dumps(exp_data['metrics']),        # JSON string column
        'parameters': json.dumps(exp_data['parameters']),  # JSON string column
        # ... other fields
    })

# Push to HF Hub
dataset = Dataset.from_list(dataset_data)
dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)
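One point to keep in mind: `push_to_hub` uploads the dataset object it is called on, so rows that are not in that object will not be in the pushed split. If several training runs write to the same repository, a read-merge-write step is one way to avoid dropping earlier experiments. The helper below is a hypothetical sketch of the merge step only. It assumes rows are dicts carrying an `experiment_id` key as in the schema, and it does not address concurrent writers.

```python
def merge_rows(existing_rows, new_rows):
    """Combine rows by experiment_id; rows from new_rows replace older ones."""
    merged = {row["experiment_id"]: row for row in existing_rows}
    for row in new_rows:
        merged[row["experiment_id"]] = row
    return list(merged.values())


# Possible use with the calls shown above (requires network access):
#   existing = list(load_dataset(repo, token=HF_TOKEN)["train"])
#   rows = merge_rows(existing, dataset_data)
#   Dataset.from_list(rows).push_to_hub(repo, token=HF_TOKEN, private=True)
```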

📈 Your Current Experiments

Available Experiments

exp_20250720_130853 (petite-elle-l-aime-3)
- 4 metric entries (steps 25, 50, 75, 100)
- Loss decreasing: 1.1659 → 1.1528
- Good convergence pattern
exp_20250720_134319 (petite-elle-l-aime-3-1)
- 2 metric entries (step 25)
- Loss: 1.166
- GPU memory tracking

Metrics Available for Plotting

loss - Training loss curve
learning_rate - Learning rate schedule
mean_token_accuracy - Token-level accuracy
grad_norm - Gradient norm
num_tokens - Tokens processed
epoch - Training epoch
gpu_0_memory_allocated - GPU memory usage
cpu_percent - CPU usage
memory_percent - System memory
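To plot one of these metrics, the decoded `metrics` list has to be turned into x and y series. The guide does not spell out the layout of a metric entry, so the sketch below assumes each entry looks like `{"step": ..., "metrics": {name: value}}`. Treat that layout and the name `metric_series` as assumptions and adjust them to the real data.

```python
def metric_series(metric_entries, metric_name):
    """Return (steps, values) for one metric, sorted by step.

    Entries that lack the metric or a step are skipped, since not every
    entry necessarily records every metric.
    """
    points = []
    for entry in metric_entries:
        values = entry.get("metrics", {})
        if metric_name in values and entry.get("step") is not None:
            points.append((entry["step"], values[metric_name]))
    points.sort(key=lambda point: point[0])
    return [step for step, _ in points], [value for _, value in points]


# Illustrative entries only: the two loss values are the endpoints quoted
# above, and the cpu_percent value is made up for the example.
entries = [
    {"step": 100, "metrics": {"loss": 1.1528}},
    {"step": 25, "metrics": {"loss": 1.1659}},
    {"step": 25, "metrics": {"cpu_percent": 12.5}},
]
steps, losses = metric_series(entries, "loss")
print(steps, losses)  # [25, 100] [1.1659, 1.1528]
```

The two returned lists can be passed directly as the x and y arguments of a line plot.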

🎯 Usage Instructions

1. View Experiments

Go to "View Experiments" tab
Enter experiment ID: exp_20250720_130853 or exp_20250720_134319
Click "View Experiment"

2. Create Plots

Go to "Visualizations" tab
Enter experiment ID
Select metric to plot
Click "Create Plot"

3. Compare Experiments

Use "Experiment Comparison" feature
Enter: exp_20250720_130853,exp_20250720_134319
Compare loss curves

🔍 Troubleshooting

Issue: "No metrics data available"

Solutions:

Check HF_TOKEN is set correctly
Verify dataset repository exists
Check network connectivity to HF Hub

Issue: "Failed to load from dataset"

Solutions:

No immediate action is needed to keep the app usable: it falls back to backup data automatically
Check dataset permissions
Verify token has read access

Issue: "Failed to save experiments"

Solutions:

Check token has write permissions
Verify dataset repository exists
Check network connectivity
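Several of the checks above can be done locally before any request reaches the Hub. The following sketch looks only at the environment values: whether HF_TOKEN is present and whether TRACKIO_DATASET_REPO has the `namespace/name` shape used throughout this guide. `find_config_problems` is a hypothetical helper, and the pattern is a deliberately loose approximation rather than the Hub's exact naming rules.

```python
import re

# Loose shape check: "namespace/name" made of letters, digits, "_", "." or "-".
_REPO_ID = re.compile(r"^[\w.\-]+/[\w.\-]+$")


def find_config_problems(env):
    """Return a list of human-readable problems found in the given mapping."""
    problems = []
    if not (env.get("HF_TOKEN") or "").strip():
        problems.append("HF_TOKEN is not set")
    repo = (env.get("TRACKIO_DATASET_REPO") or "").strip()
    # An unset repo is fine: the app then uses its default dataset.
    if repo and not _REPO_ID.match(repo):
        problems.append(
            f"TRACKIO_DATASET_REPO should look like 'namespace/name', got {repo!r}"
        )
    return problems


# Typical use: import os, then call find_config_problems(os.environ)
```

An empty list does not prove that the token is valid or that the repository exists; those still need a real request to the Hub.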

🚀 Benefits of This Approach

✅ Advantages

Persistent: Data survives Space restarts
Reliable: HF's infrastructure ensures availability
Secure: Private datasets protect your data
Scalable: Handles large amounts of experiment data
Versioned: Automatic versioning of experiment data

🔄 Fallback Strategy

Primary: Load from HF Dataset
Secondary: Use backup data (your existing experiments)
Tertiary: Create new experiments locally
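The three tiers above can be sketched as an ordered list of loaders that are tried in turn. This is an illustration of the idea, not the code in app.py. `load_with_fallback` and the labels are hypothetical, and each loader is assumed to be a zero-argument callable that returns a dict of experiments.

```python
def load_with_fallback(loaders):
    """Try (label, loader) pairs in order.

    Returns (label, experiments, errors). If every loader fails or returns
    nothing, the result is ("local", {}, errors): start with no experiments
    and create new ones locally.
    """
    errors = []
    for label, loader in loaders:
        try:
            experiments = loader()
        except Exception as exc:  # broad on purpose: any failure moves on
            errors.append(f"{label}: {exc}")
            continue
        if experiments:
            return label, experiments, errors
    return "local", {}, errors


# Example wiring (both loaders are placeholders for real functions):
# source, experiments, errors = load_with_fallback([
#     ("hf_dataset", load_from_hf_dataset),
#     ("backup", load_backup_experiments),
# ])
```

Returning the collected errors alongside the data lets the app show why the primary source was skipped instead of failing silently.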

📋 Next Steps

Set HF_TOKEN: Add your token to Space environment
Run Setup: Execute setup_hf_dataset.py
Deploy App: Push updated app.py to your Space
Test Plots: Verify experiments load and plots work
Monitor Training: New experiments will be saved to dataset

🔐 Security Notes

Dataset is private by default
Only accessible with a token that has access to the dataset, such as your HF_TOKEN
Experiment data is stored securely on HF infrastructure
No sensitive data is exposed publicly

Your experiments are now configured for reliable persistence using Hugging Face Datasets! 🎉