
🚀 Trackio with Hugging Face Datasets - Complete Guide

Overview

This guide explains how to use Hugging Face Datasets for persistent storage of Trackio experiments, providing reliable data persistence across Hugging Face Spaces deployments.

🏗️ Architecture

Why HF Datasets?

  1. Persistent Storage: Data survives Space restarts and redeployments
  2. Version Control: Automatic versioning of experiment data
  3. Access Control: Private datasets for security
  4. Reliability: HF's infrastructure ensures data availability
  5. Scalability: Handles large amounts of experiment data

Data Flow

Training Script → Trackio App → HF Dataset → Trackio App → Plots

🚀 Setup Instructions

1. Create HF Token

  1. Go to Hugging Face Settings
  2. Create a new token with write permissions
  3. Copy the token for use in your Space

2. Set Up Dataset Repository

# Run the setup script
python setup_hf_dataset.py

This will:

  • Create a private dataset: tonic/trackio-experiments
  • Add your existing experiments
  • Configure the dataset for Trackio
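The internals of setup_hf_dataset.py are not shown in this guide. As a rough sketch of the first step only, a private dataset repository can be created with `huggingface_hub`. The function name `ensure_private_dataset` and its `api` parameter are hypothetical, introduced here so the call can be exercised without network access; the actual script may work differently.

```python
def ensure_private_dataset(repo_id, token, api=None):
    """Create a private dataset repo if it does not exist yet; return the repo id."""
    if api is None:
        # Imported lazily so the helper can also be driven by a stand-in client.
        from huggingface_hub import HfApi
        api = HfApi()
    # exist_ok=True makes the call a no-op when the repository already exists.
    api.create_repo(
        repo_id=repo_id,
        token=token,
        repo_type="dataset",
        private=True,
        exist_ok=True,
    )
    return repo_id
```

Calling `ensure_private_dataset("your-username/trackio-experiments", token)` with a write token should leave you with an empty private dataset that experiments can then be pushed to.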

3. Configure Hugging Face Space

Environment Variables

Set these in your HF Space settings:

HF_TOKEN=your_hf_token_here
TRACKIO_DATASET_REPO=your-username/your-dataset-name

Environment Variables Explained:

  • HF_TOKEN: Your Hugging Face token (required for dataset access)
  • TRACKIO_DATASET_REPO: Dataset repository to use (optional, defaults to tonic/trackio-experiments)
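One way an app could resolve these two variables is sketched below. The helper name `resolve_trackio_config` and the `env` parameter are assumptions made for illustration; only the variable names and the default repository come from this guide.

```python
import os

DEFAULT_DATASET_REPO = "tonic/trackio-experiments"


def resolve_trackio_config(env=None):
    """Return (token, dataset_repo) from the environment, applying the default repo."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; it is required for dataset access")
    # An unset or empty TRACKIO_DATASET_REPO falls back to the documented default.
    repo = env.get("TRACKIO_DATASET_REPO") or DEFAULT_DATASET_REPO
    return token, repo
```

Failing early on a missing token keeps the error close to its cause instead of surfacing later as a failed dataset request.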

Example Configurations:

# Use default dataset
HF_TOKEN=your_token_here

# Use personal dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-username/trackio-experiments

# Use team dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-org/team-experiments

# Use project-specific dataset
HF_TOKEN=your_token_here
TRACKIO_DATASET_REPO=your-username/smollm3-experiments

Requirements

Update your requirements.txt:

gradio>=4.0.0
plotly>=5.0.0
pandas>=1.5.0
numpy>=1.24.0
datasets>=2.14.0
huggingface-hub>=0.16.0
requests>=2.31.0

4. Deploy Updated App

The updated app.py now:

  • Loads experiments from HF Dataset
  • Saves new experiments to the dataset
  • Falls back to backup data if dataset unavailable
  • Provides better error handling

5. Configure Environment Variables

Use the configuration script to check your setup:

python configure_trackio.py

This script will:

  • Show current environment variables
  • Test dataset access
  • Generate configuration file
  • Provide usage examples
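The dataset access test could look roughly like the sketch below, which asks the Hub for the dataset's metadata and reports any failure. This is not the code of configure_trackio.py; the name `check_dataset_access` and the injectable `info_fn` parameter are hypothetical.

```python
def check_dataset_access(repo_id, token, info_fn=None):
    """Return (ok, message) describing whether the dataset repo is reachable."""
    if info_fn is None:
        # Default to the real Hub client; imported lazily so tests can inject a fake.
        from huggingface_hub import HfApi
        info_fn = HfApi().dataset_info
    try:
        info_fn(repo_id, token=token)
    except Exception as exc:  # missing repo, rejected token, or network failure
        return False, f"Cannot access {repo_id}: {type(exc).__name__}: {exc}"
    return True, f"Dataset {repo_id} is accessible"
```

Returning a message rather than raising lets a configuration script print a full report even when one check fails.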

Available Environment Variables:

| Variable | Required | Default | Description |
|---|---|---|---|
| HF_TOKEN | Yes | None | Your Hugging Face token |
| TRACKIO_DATASET_REPO | No | tonic/trackio-experiments | Dataset repository to use |
| SPACE_ID | Auto | None | HF Space ID (auto-detected) |

📊 Dataset Schema

The HF Dataset contains these columns:

| Column | Type | Description |
|---|---|---|
| experiment_id | string | Unique experiment identifier |
| name | string | Experiment name |
| description | string | Experiment description |
| created_at | string | ISO timestamp |
| status | string | running/completed/failed |
| metrics | string | JSON array of metric entries |
| parameters | string | JSON object of experiment parameters |
| artifacts | string | JSON array of artifacts |
| logs | string | JSON array of log entries |
| last_updated | string | ISO timestamp of last update |
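Because every column is a string, nested fields have to be JSON-encoded on the way in and decoded on the way out. The sketch below shows one possible round trip for the columns listed above. The helper names and the default values used for missing fields are assumptions, not the app's actual behavior.

```python
import json
from datetime import datetime, timezone

# Columns that hold JSON-encoded values, with an assumed default for each.
JSON_DEFAULTS = {"metrics": [], "parameters": {}, "artifacts": [], "logs": []}


def experiment_to_row(exp_id, exp):
    """Flatten one experiment dict into an all-string row matching the schema."""
    now = datetime.now(timezone.utc).isoformat()
    row = {
        "experiment_id": exp_id,
        "name": exp.get("name", ""),
        "description": exp.get("description", ""),
        "created_at": exp.get("created_at", now),
        "status": exp.get("status", "running"),
        "last_updated": now,
    }
    for column, default in JSON_DEFAULTS.items():
        row[column] = json.dumps(exp.get(column, default))
    return row


def row_to_experiment(row):
    """Decode the JSON columns of a row back into Python lists and dicts."""
    exp = {key: value for key, value in row.items() if key not in JSON_DEFAULTS}
    for column in JSON_DEFAULTS:
        exp[column] = json.loads(row[column])
    return exp
```

Keeping the nested data as JSON strings means the dataset schema stays fixed even when different experiments log different metric or parameter keys.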

🔧 Technical Details

Loading Experiments

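import json
import os

from datasets import load_dataset

# Read the token from the environment (set in the Space settings)
HF_TOKEN = os.environ["HF_TOKEN"]

# Load from HF Dataset
dataset = load_dataset("tonic/trackio-experiments", token=HF_TOKEN)

# Convert to experiments dict, keyed by experiment ID
experiments = {}
for row in dataset['train']:
    experiments[row['experiment_id']] = {
        'id': row['experiment_id'],
        'metrics': json.loads(row['metrics']),
        'parameters': json.loads(row['parameters']),
        # ... other fields
    }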

Saving Experiments

import json
import os

from datasets import Dataset

# Read the token from the environment (set in the Space settings)
HF_TOKEN = os.environ["HF_TOKEN"]

# Convert experiments (a dict of experiment ID -> experiment data) to dataset format
dataset_data = []
for exp_id, exp_data in experiments.items():
    dataset_data.append({
        'experiment_id': exp_id,
        'metrics': json.dumps(exp_data['metrics']),
        'parameters': json.dumps(exp_data['parameters']),
        # ... other fields
    })

# Push to HF Hub as a private dataset
dataset = Dataset.from_list(dataset_data)
dataset.push_to_hub("tonic/trackio-experiments", token=HF_TOKEN, private=True)

📈 Your Current Experiments

Available Experiments

  1. exp_20250720_130853 (petite-elle-l-aime-3)

    • 4 metric entries (steps 25, 50, 75, 100)
    • Loss decreasing: 1.1659 → 1.1528
    • Good convergence pattern
  2. exp_20250720_134319 (petite-elle-l-aime-3-1)

    • 2 metric entries (step 25)
    • Loss: 1.166
    • GPU memory tracking

Metrics Available for Plotting

  • loss - Training loss curve
  • learning_rate - Learning rate schedule
  • mean_token_accuracy - Token-level accuracy
  • grad_norm - Gradient norm
  • num_tokens - Tokens processed
  • epoch - Training epoch
  • gpu_0_memory_allocated - GPU memory usage
  • cpu_percent - CPU usage
  • memory_percent - System memory
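To plot one of these metrics, its values first have to be pulled out of the JSON-encoded metrics column. The sketch below assumes each metric entry is an object with a `step` key and a nested `metrics` object holding the named values. That layout is an assumption, since this guide does not specify the entry format, and the function name `metric_series` is hypothetical.

```python
import json


def metric_series(metrics_json, name):
    """Return (steps, values) for one metric, skipping entries that lack it."""
    steps, values = [], []
    for entry in json.loads(metrics_json):
        # Assumed entry layout: {"step": <int>, "metrics": {<name>: <value>, ...}}
        recorded = entry.get("metrics", {})
        if name in recorded:
            steps.append(entry.get("step"))
            values.append(recorded[name])
    return steps, values
```

The two returned lists can be passed directly as the x and y values of a line plot.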

🎯 Usage Instructions

1. View Experiments

  • Go to "View Experiments" tab
  • Enter experiment ID: exp_20250720_130853 or exp_20250720_134319
  • Click "View Experiment"

2. Create Plots

  • Go to "Visualizations" tab
  • Enter experiment ID
  • Select metric to plot
  • Click "Create Plot"

3. Compare Experiments

  • Use "Experiment Comparison" feature
  • Enter: exp_20250720_130853,exp_20250720_134319
  • Compare loss curves

🔍 Troubleshooting

Issue: "No metrics data available"

Solutions:

  1. Check HF_TOKEN is set correctly
  2. Verify dataset repository exists
  3. Check network connectivity to HF Hub

Issue: "Failed to load from dataset"

Solutions:

  1. No immediate action is needed to keep the app running: it falls back to backup data automatically
  2. Check dataset permissions
  3. Verify token has read access

Issue: "Failed to save experiments"

Solutions:

  1. Check token has write permissions
  2. Verify dataset repository exists
  3. Check network connectivity

🚀 Benefits of This Approach

✅ Advantages

  • Persistent: Data survives Space restarts
  • Reliable: HF's infrastructure ensures availability
  • Secure: Private datasets protect your data
  • Scalable: Handles large amounts of experiment data
  • Versioned: Automatic versioning of experiment data

🔄 Fallback Strategy

  1. Primary: Load from HF Dataset
  2. Secondary: Use backup data (your existing experiments)
  3. Tertiary: Create new experiments locally
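The three tiers above can be sketched as a loader that tries each source in order. This is an illustration only: the function name and the two loader callables are hypothetical stand-ins for whatever app.py uses to read the dataset and the backup data.

```python
def load_experiments_with_fallback(load_from_dataset, load_backup):
    """Return (experiments, source) from the first source that yields data."""
    sources = (("dataset", load_from_dataset), ("backup", load_backup))
    for source, loader in sources:
        try:
            experiments = loader()
        except Exception as exc:
            # A failing source is logged and skipped rather than crashing the app.
            print(f"{source} load failed: {exc}")
            continue
        if experiments:
            return experiments, source
    # Tertiary: start empty so new experiments can be created locally.
    return {}, "local"
```

Returning the source name alongside the data makes it easy to show in the UI whether the app is running on live or backup data.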

📋 Next Steps

  1. Set HF_TOKEN: Add your token to Space environment
  2. Run Setup: Execute setup_hf_dataset.py
  3. Deploy App: Push updated app.py to your Space
  4. Test Plots: Verify experiments load and plots work
  5. Monitor Training: New experiments will be saved to dataset

🔐 Security Notes

  • Dataset is private by default
  • Accessible only with a token that has access to the dataset, such as your HF_TOKEN
  • Experiment data is stored securely on HF infrastructure
  • No sensitive data is exposed publicly

Your experiments are now configured for reliable persistence using Hugging Face Datasets! 🎉