Enhanced Trackio Interface Guide

Overview

Your Trackio application now provides monitoring and visualization for SmolLM3 training experiments: interactive plots, formatted experiment details, and demo data for testing. This guide explains how to use each feature.

🚀 Key Enhancements

1. Real-time Visualization

Interactive Plots: Loss curves, accuracy, learning rate, GPU metrics
Experiment Comparison: Compare multiple training runs side-by-side
Live Updates: Watch training progress in real-time

2. Comprehensive Data Display

Formatted Output: Clean, emoji-rich experiment details
Statistics Overview: Metrics count, parameters count, artifacts count
Status Tracking: Visual status indicators (🟢 running, ✅ completed, ❌ failed)

3. Demo Data Generation

Realistic Simulation: Generate realistic training metrics for testing
Multiple Metrics: Loss, accuracy, learning rate, GPU memory, training time
Configurable Parameters: Customize demo data to match your setup

📊 How to Use with Your SmolLM3 Training

Step 1: Start Your Training

python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_balanced.py \
    --trackio_url "https://tonic-test-trackio-test.hf.space" \
    --experiment-name "petit-elle-l-aime-3-balanced" \
    --output-dir ./outputs/balanced

Step 2: Monitor in Real-time

Visit your Trackio Space: https://tonic-test-trackio-test.hf.space
Go to the "View Experiments" tab
Enter your experiment ID (e.g., exp_20231201_143022)
Click "View Experiment" to see detailed information

Step 3: Visualize Training Progress

Go to the "📊 Visualizations" tab
Enter your experiment ID
Select a metric (loss, accuracy, learning_rate, gpu_memory, training_time)
Click "Create Plot" to see interactive charts

Step 4: Compare Experiments

Open the "📊 Visualizations" tab
Enter multiple experiment IDs (comma-separated)
Click "Compare Experiments" to see side-by-side comparison
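
If you assemble the comma-separated list of IDs in a script before pasting it into the comparison field, a small helper can normalize it first. This is only a sketch: `parse_experiment_ids` is a hypothetical name, the second ID in the example is made up, and the interface may parse the field differently.

```python
def parse_experiment_ids(text: str) -> list[str]:
    """Split a comma-separated string into unique, non-empty experiment IDs, keeping order."""
    ids: list[str] = []
    for part in text.split(","):
        exp_id = part.strip()
        # Skip empty pieces (trailing commas) and repeated IDs.
        if exp_id and exp_id not in ids:
            ids.append(exp_id)
    return ids


print(parse_experiment_ids("exp_20231201_143022, exp_20231202_091500,"))
# ['exp_20231201_143022', 'exp_20231202_091500']
```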

🎯 Interface Features

Create Experiment Tab

Experiment Name: Descriptive name for your training run
Description: Detailed description of what you're training
Automatic ID Generation: Unique experiment identifier
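
The example ID used in this guide (exp_20231201_143022) looks like a timestamp. Assuming that is the scheme, ID generation could be sketched as follows. `make_experiment_id` is a hypothetical helper for illustration, not the function Trackio itself uses.

```python
from datetime import datetime
from typing import Optional


def make_experiment_id(now: Optional[datetime] = None) -> str:
    """Build an ID like exp_20231201_143022 (format inferred from the guide's example)."""
    now = now or datetime.now()
    return now.strftime("exp_%Y%m%d_%H%M%S")


print(make_experiment_id(datetime(2023, 12, 1, 14, 30, 22)))  # exp_20231201_143022
```

An ID with one-second resolution can collide if two experiments are created in the same second, so a real implementation may add a suffix.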

Log Metrics Tab

Experiment ID: The experiment to log metrics for
Metrics JSON: Training metrics in JSON format
Step: Current training step (optional)

Example metrics JSON:

{
  "loss": 0.5234,
  "accuracy": 0.8567,
  "learning_rate": 3.5e-6,
  "gpu_memory_gb": 22.5,
  "gpu_utilization_percent": 87.3,
  "training_time_per_step": 0.456
}
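
Before pasting metrics into the Metrics JSON field, it can help to check the text locally. The sketch below assumes the interface expects a flat JSON object with numeric values, which is what the example above shows but is not confirmed beyond that; `parse_metrics` is a hypothetical helper.

```python
import json
import math


def parse_metrics(text: str) -> dict:
    """Parse a metrics JSON string and check that it is a flat object of finite numbers."""
    metrics = json.loads(text)
    if not isinstance(metrics, dict):
        raise ValueError("metrics must be a JSON object")
    for name, value in metrics.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} is not a number: {value!r}")
        if not math.isfinite(value):
            raise ValueError(f"metric {name!r} is not finite: {value!r}")
    return metrics


payload = json.dumps({"loss": 0.5234, "accuracy": 0.8567, "learning_rate": 3.5e-6})
print(parse_metrics(payload))
```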

Log Parameters Tab

Experiment ID: The experiment to log parameters for
Parameters JSON: Training configuration in JSON format

Example parameters JSON:

{
  "model_name": "HuggingFaceTB/SmolLM3-3B",
  "batch_size": 8,
  "learning_rate": 3.5e-6,
  "max_iters": 18000,
  "mixed_precision": "bf16",
  "no_think_system_message": true
}
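
If your configuration lives in Python, you can generate the Parameters JSON instead of typing it by hand. In this sketch, `TrainingConfig` is a hypothetical stand-in whose fields mirror the example above; your actual config class will differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TrainingConfig:
    """Hypothetical stand-in for your training configuration."""
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 8
    learning_rate: float = 3.5e-6
    max_iters: int = 18000
    mixed_precision: str = "bf16"
    no_think_system_message: bool = True


def config_to_parameters_json(config: TrainingConfig) -> str:
    """Serialize the config to JSON text for the Parameters JSON field."""
    return json.dumps(asdict(config), indent=2)


print(config_to_parameters_json(TrainingConfig()))
```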

View Experiments Tab

Experiment ID: Enter to view specific experiment
List All Experiments: Shows overview of all experiments
Detailed Information: Formatted display with statistics

📊 Visualizations Tab

Training Metrics: Interactive plots for individual metrics
Experiment Comparison: Side-by-side comparison of multiple runs
Real-time Updates: Plots update as new data is logged

🎯 Demo Data Tab

Generate Demo Data: Create realistic training data for testing
Configurable: Adjust parameters to match your setup
Multiple Metrics: Simulates loss, accuracy, GPU metrics, etc.
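
To give a feel for what simulated training data can look like, here is a minimal sketch that produces a decaying loss and a rising accuracy with a little noise. The curve shapes, constants, and the `generate_demo_metrics` name are assumptions for illustration; the Demo Data tab may generate its data differently.

```python
import math
import random


def generate_demo_metrics(num_steps: int = 100, log_every: int = 25, seed: int = 0) -> list[dict]:
    """Simulate a training run: decaying loss, rising accuracy, small random noise."""
    rng = random.Random(seed)  # seeded so the demo data is reproducible
    entries = []
    for step in range(0, num_steps + 1, log_every):
        progress = step / max(num_steps, 1)
        loss = 2.0 * math.exp(-3.0 * progress) + 0.3 + rng.uniform(-0.02, 0.02)
        accuracy = 0.9 * (1.0 - math.exp(-3.0 * progress)) + rng.uniform(-0.01, 0.01)
        entries.append({
            "step": step,
            "loss": round(loss, 4),
            "accuracy": round(min(1.0, max(0.0, accuracy)), 4),  # clamp to [0, 1]
            "gpu_memory_gb": round(22.0 + rng.uniform(0.0, 1.0), 2),
        })
    return entries


for entry in generate_demo_metrics():
    print(entry)
```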

Update Status Tab

Experiment ID: The experiment to update
Status: running, completed, failed, paused
Visual Indicators: Status shown with emojis
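
The status-to-emoji mapping can be pictured as a simple lookup. The icons for running, completed, and failed come from this guide; the icon for paused is an assumption, since the guide does not say which one is used, and `format_status` is a hypothetical helper.

```python
STATUS_ICONS = {
    "running": "🟢",
    "completed": "✅",
    "failed": "❌",
    "paused": "⏸️",  # assumption: the guide does not say which icon "paused" uses
}


def format_status(status: str) -> str:
    """Return the status prefixed with its icon, or raise on an unknown status."""
    key = status.strip().lower()
    if key not in STATUS_ICONS:
        raise ValueError(f"unknown status {status!r}; expected one of {sorted(STATUS_ICONS)}")
    return f"{STATUS_ICONS[key]} {key}"


print(format_status("completed"))  # ✅ completed
```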

📈 What Gets Displayed

Training Metrics

Loss: Training loss over time
Accuracy: Model accuracy progression
Learning Rate: Learning rate scheduling
GPU Memory: Memory usage in GB
GPU Utilization: GPU usage percentage
Training Time: Time per training step

Experiment Details

Basic Info: ID, name, description, status, creation time
Statistics: Metrics count, parameters count, artifacts count
Parameters: All training configuration
Latest Metrics: Most recent training metrics

Visualizations

Line Charts: Smooth curves showing metric progression
Interactive Hover: Detailed information on hover
Multiple Metrics: Switch between different metrics
Comparison Charts: Side-by-side experiment comparison

🔧 Integration with Your Training

Automatic Integration

Your training script automatically:

Creates experiments with your specified name
Logs parameters from your configuration
Logs metrics every 25 steps (configurable)
Logs system metrics (GPU memory, utilization)
Logs checkpoints every 2000 steps
Updates status when training completes
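
The logging cadence described above (metrics every 25 steps, checkpoints every 2000) boils down to two modulo checks in the training loop. The sketch below is not the actual training script: `RecordingMonitor` and `log_checkpoint` are hypothetical stand-ins that only record which steps would be logged, and the loss is a placeholder.

```python
class RecordingMonitor:
    """Hypothetical stand-in for the script's monitor; it only records what would be sent."""

    def __init__(self) -> None:
        self.metric_steps: list[int] = []
        self.checkpoint_steps: list[int] = []

    def log_metrics(self, metrics: dict, step: int) -> None:
        self.metric_steps.append(step)

    def log_checkpoint(self, step: int) -> None:
        self.checkpoint_steps.append(step)


def run_training(monitor, max_steps: int, log_every: int = 25, checkpoint_every: int = 2000) -> None:
    for step in range(1, max_steps + 1):
        loss = 1.0 / step  # placeholder for the real training step
        if step % log_every == 0:
            monitor.log_metrics({"loss": loss}, step=step)
        if step % checkpoint_every == 0:
            monitor.log_checkpoint(step)


monitor = RecordingMonitor()
run_training(monitor, max_steps=4000)
print(len(monitor.metric_steps), monitor.checkpoint_steps)  # 160 [2000, 4000]
```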

Manual Integration

You can also manually:

Create experiments through the interface
Log custom metrics for specific analysis
Compare different runs with different parameters
Generate demo data for testing the interface

🎨 Customization

Adding Custom Metrics

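# In your training script, wherever the monitor object is available.
# current_loss, current_accuracy, your_custom_value, gpu_memory_usage and
# current_step are placeholders for values computed in your training loop.
custom_metrics = {
    "loss": current_loss,
    "accuracy": current_accuracy,
    "custom_metric": your_custom_value,
    "gpu_memory": gpu_memory_usage,
}

# Send the metrics for this step to Trackio.
monitor.log_metrics(custom_metrics, step=current_step)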

Custom Visualizations

The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
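
One way to picture how a new metric ends up in the dropdown: the list of choices can be built from the union of metric names across everything logged so far. This is a sketch under assumptions; the entry layout (`step` plus a `metrics` dict) and the `available_metrics` helper are hypothetical, not the interface's actual storage format.

```python
def available_metrics(entries: list[dict]) -> list[str]:
    """Collect every metric name seen in the logged entries, sorted for a stable dropdown."""
    names: set[str] = set()
    for entry in entries:
        names.update(entry.get("metrics", {}).keys())
    return sorted(names)


logged = [
    {"step": 25, "metrics": {"loss": 0.91, "accuracy": 0.62}},
    {"step": 50, "metrics": {"loss": 0.74, "accuracy": 0.71, "custom_metric": 1.3}},
]
print(available_metrics(logged))  # ['accuracy', 'custom_metric', 'loss']
```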

🚨 Troubleshooting

No Data Displayed

Check experiment ID: Make sure you're using the correct ID
Verify metrics were logged: Check if training is actually logging metrics
Use demo data: Generate demo data to test the interface

Plots Not Updating

Refresh the page: Sometimes plots need a refresh
Check data format: Ensure metrics are in the correct JSON format
Verify step numbers: Make sure step numbers are increasing
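
To verify that step numbers are increasing, you can run a quick check over the steps you logged. `find_step_problems` is a hypothetical helper, and the step values in the example are made up.

```python
def find_step_problems(steps: list[int]) -> list[str]:
    """Report every place where a step number fails to increase over the previous one."""
    problems = []
    for previous, current in zip(steps, steps[1:]):
        if current <= previous:
            problems.append(f"step {current} follows step {previous}")
    return problems


print(find_step_problems([25, 50, 50, 75, 25]))
# ['step 50 follows step 50', 'step 25 follows step 75']
```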

Interface Not Loading

Check dependencies: Ensure plotly and pandas are installed
Check Gradio version: Use Gradio 4.0.0 or higher
Check browser console: Look for JavaScript errors
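
The dependency checks above can be scripted with the standard library's `importlib.metadata`. This is a rough sketch: the version parsing is deliberately simple (it ignores pre-release suffixes), and `check_dependencies` and `version_tuple` are hypothetical helpers. Only the Gradio minimum of 4.0.0 comes from this guide; no minimum is assumed for plotly or pandas.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '4.44.1' into (4, 44, 1); stop at the first non-numeric part, pad to three parts."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_dependencies(get_version=metadata.version) -> dict[str, str]:
    """Return a status message per package; Gradio must be at least 4.0.0."""
    minimums = {"gradio": (4, 0, 0), "plotly": None, "pandas": None}
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        if minimum is not None and version_tuple(installed) < minimum:
            report[package] = f"too old ({installed})"
        else:
            report[package] = f"ok ({installed})"
    return report


print(check_dependencies())
```

Run it in the same environment as the Space or your local app, since the result depends on what is installed there.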

📊 Example Workflow

Start Training:

python run_a100_large_experiment.py --experiment-name "my_experiment"

Monitor Progress:
- Visit your Trackio Space
- Go to "View Experiments"
- Enter your experiment ID
- Watch real-time updates

Visualize Results:
- Go to "📊 Visualizations"
- Select "loss" metric
- Create plot to see training progress

Compare Runs:
- Run multiple experiments with different parameters
- Use "Compare Experiments" to see differences

Generate Demo Data:
- Use "🎯 Demo Data" tab to test the interface
- Generate realistic training data for demonstration

🎉 Success Indicators

Your interface is working correctly when you see:

✅ Formatted experiment details with emojis and structure
✅ Interactive plots that respond to your inputs
✅ Real-time metric updates during training
✅ Clean experiment overview with statistics
✅ Smooth visualization with hover information

With these features in place, the interface shows formatted experiment details, interactive plots, and live metrics for your SmolLM3 training experiments.