# Enhanced Trackio Interface Guide
## Overview
Your Trackio application has been significantly enhanced to provide comprehensive monitoring and visualization for SmolLM3 training experiments. Here's how to make the most of it.
## 🚀 Key Enhancements
### 1. **Real-time Visualization**
- **Interactive Plots**: Loss curves, accuracy, learning rate, GPU metrics
- **Experiment Comparison**: Compare multiple training runs side-by-side
- **Live Updates**: Watch training progress in real time
### 2. **Comprehensive Data Display**
- **Formatted Output**: Clean, emoji-rich experiment details
- **Statistics Overview**: Metrics count, parameters count, artifacts count
- **Status Tracking**: Visual status indicators (🟢 running, ✅ completed, ❌ failed)
### 3. **Demo Data Generation**
- **Realistic Simulation**: Generate realistic training metrics for testing
- **Multiple Metrics**: Loss, accuracy, learning rate, GPU memory, training time
- **Configurable Parameters**: Customize demo data to match your setup
## 📊 How to Use with Your SmolLM3 Training
### Step 1: Start Your Training
```bash
python run_a100_large_experiment.py \
  --config config/train_smollm3_openhermes_fr_a100_balanced.py \
  --trackio_url "https://tonic-test-trackio-test.hf.space" \
  --experiment-name "petit-elle-l-aime-3-balanced" \
  --output-dir ./outputs/balanced
```
### Step 2: Monitor in Real Time
1. **Visit your Trackio Space**: `https://tonic-test-trackio-test.hf.space`
2. **Go to "View Experiments" tab**
3. **Enter your experiment ID** (e.g., `exp_20231201_143022`)
4. **Click "View Experiment"** to see detailed information
### Step 3: Visualize Training Progress
2. **Go to the "📊 Visualizations" tab**
2. **Enter your experiment ID**
3. **Select a metric** (loss, accuracy, learning_rate, gpu_memory, training_time)
4. **Click "Create Plot"** to see interactive charts
### Step 4: Compare Experiments
1. **In the "📊 Visualizations" tab**
2. **Enter multiple experiment IDs** (comma-separated)
3. **Click "Compare Experiments"** to see a side-by-side comparison (an illustrative sketch of this kind of chart follows)
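The Space builds its charts with plotly, so the comparison view looks roughly like the figure the sketch below produces. This is not the Space's own code; the experiment IDs and loss values are invented purely to show the shape of the output.

```python
# Rough approximation of the interactive comparison chart the Visualizations
# tab renders; experiment IDs and loss values here are invented for illustration.
import pandas as pd
import plotly.express as px

records = []
for exp_id, start_loss in [("exp_20231201_143022", 2.4), ("exp_20231202_091500", 2.1)]:
    for i, step in enumerate(range(0, 1000, 25)):
        records.append({"experiment": exp_id, "step": step, "loss": start_loss * 0.97 ** i})

df = pd.DataFrame(records)
fig = px.line(df, x="step", y="loss", color="experiment", title="Training Loss Comparison")
fig.show()  # interactive chart with hover details, one line per experiment
```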
## 🎯 Interface Features
### Create Experiment Tab
- **Experiment Name**: Descriptive name for your training run
- **Description**: Detailed description of what you're training
- **Automatic ID Generation**: Unique experiment identifier
### Log Metrics Tab
- **Experiment ID**: The experiment to log metrics for
- **Metrics JSON**: Training metrics in JSON format
- **Step**: Current training step (optional)
Example metrics JSON:
```json
{
"loss": 0.5234,
"accuracy": 0.8567,
"learning_rate": 3.5e-6,
"gpu_memory_gb": 22.5,
"gpu_utilization_percent": 87.3,
"training_time_per_step": 0.456
}
```
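If your training loop runs on PyTorch, a small helper along these lines can assemble that JSON from live values. The helper name and rounding are illustrative rather than part of Trackio, and the GPU utilization reading needs `pynvml`, so it is wrapped defensively.

```python
# Illustrative helper for assembling the metrics JSON above from a PyTorch
# training loop; not part of Trackio itself.
import json
import time
import torch

def build_metrics_json(loss: float, accuracy: float, lr: float, step_start: float) -> str:
    metrics = {
        "loss": round(loss, 4),
        "accuracy": round(accuracy, 4),
        "learning_rate": lr,
        "training_time_per_step": round(time.time() - step_start, 3),
    }
    if torch.cuda.is_available():
        # Peak tensor memory on the current device, in GB
        metrics["gpu_memory_gb"] = round(torch.cuda.max_memory_allocated() / 1e9, 2)
        try:
            # Instantaneous GPU utilization in percent (requires pynvml)
            metrics["gpu_utilization_percent"] = torch.cuda.utilization()
        except Exception:
            pass  # skip utilization if pynvml is unavailable
    return json.dumps(metrics, indent=2)
```

Paste the returned string into the Metrics JSON box, or pass the underlying dict to `monitor.log_metrics(...)` as shown in the Customization section below.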
### Log Parameters Tab
- **Experiment ID**: The experiment to log parameters for
- **Parameters JSON**: Training configuration in JSON format
Example parameters JSON:
```json
{
"model_name": "HuggingFaceTB/SmolLM3-3B",
"batch_size": 8,
"learning_rate": 3.5e-6,
"max_iters": 18000,
"mixed_precision": "bf16",
"no_think_system_message": true
}
```
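If your training configuration already lives in a Python object, serializing it is usually all this tab needs. The dataclass below is a hypothetical stand-in for your real config; only the field values mirror the example above.

```python
# Hypothetical config object serialized into the "Parameters JSON" format.
import json
from dataclasses import dataclass, asdict

@dataclass
class TrainingConfig:
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    batch_size: int = 8
    learning_rate: float = 3.5e-6
    max_iters: int = 18000
    mixed_precision: str = "bf16"
    no_think_system_message: bool = True

print(json.dumps(asdict(TrainingConfig()), indent=2))  # paste the output into the tab
```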
### View Experiments Tab
- **Experiment ID**: Enter to view specific experiment
- **List All Experiments**: Shows overview of all experiments
- **Detailed Information**: Formatted display with statistics
### 📊 Visualizations Tab
- **Training Metrics**: Interactive plots for individual metrics
- **Experiment Comparison**: Side-by-side comparison of multiple runs
- **Real-time Updates**: Plots update as new data is logged
### 🎯 Demo Data Tab
- **Generate Demo Data**: Create realistic training data for testing (a local approximation is sketched after this list)
- **Configurable**: Adjust parameters to match your setup
- **Multiple Metrics**: Simulates loss, accuracy, GPU metrics, etc.
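The Space has its own generator behind this tab, but if you want to see what realistic demo data looks like, or feed the Log Metrics tab by hand, a rough local approximation could look like this (the curve shapes and noise levels are assumptions, not the Space's actual generator):

```python
# Rough local approximation of demo training data: decaying loss, rising
# accuracy, cosine learning-rate decay, and noisy GPU/time readings.
import math
import random

def generate_demo_metrics(num_steps: int = 1000, log_every: int = 25, base_lr: float = 3.5e-6):
    rows = []
    for step in range(0, num_steps + 1, log_every):
        progress = step / max(num_steps, 1)
        rows.append({
            "step": step,
            "loss": 2.5 * math.exp(-3 * progress) + random.uniform(0.0, 0.05),
            "accuracy": 0.9 * (1 - math.exp(-4 * progress)) + random.uniform(0.0, 0.01),
            "learning_rate": base_lr * 0.5 * (1 + math.cos(math.pi * progress)),
            "gpu_memory_gb": 22.0 + random.uniform(-0.5, 0.5),
            "training_time_per_step": 0.45 + random.uniform(-0.02, 0.02),
        })
    return rows

print(generate_demo_metrics(num_steps=100)[:2])  # first two logged points
```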
### Update Status Tab
- **Experiment ID**: The experiment to update
- **Status**: running, completed, failed, paused
- **Visual Indicators**: Status shown with emojis
## 📈 What Gets Displayed
### Training Metrics
- **Loss**: Training loss over time
- **Accuracy**: Model accuracy progression
- **Learning Rate**: Learning rate scheduling
- **GPU Memory**: Memory usage in GB
- **GPU Utilization**: GPU usage percentage
- **Training Time**: Time per training step
### Experiment Details
- **Basic Info**: ID, name, description, status, creation time
- **Statistics**: Metrics count, parameters count, artifacts count
- **Parameters**: All training configuration
- **Latest Metrics**: Most recent training metrics
### Visualizations
- **Line Charts**: Smooth curves showing metric progression
- **Interactive Hover**: Detailed information on hover
- **Multiple Metrics**: Switch between different metrics
- **Comparison Charts**: Side-by-side experiment comparison
## 🔧 Integration with Your Training
### Automatic Integration
Your training script automatically:
1. **Creates experiments** with your specified name
2. **Logs parameters** from your configuration
3. **Logs metrics** every 25 steps (configurable; see the sketch after this list)
4. **Logs system metrics** (GPU memory, utilization)
5. **Logs checkpoints** every 2000 steps
6. **Updates status** when training completes
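A simplified, self-contained sketch of that cadence is shown below. The stub monitor only prints so the snippet runs on its own; in the real script, `monitor` is the Trackio monitoring object, and the training step, checkpoint saving, and status update are whatever your script already does.

```python
# Simplified, self-contained sketch of the logging cadence described above.
import random

class StubMonitor:
    """Stand-in for the real Trackio monitor; just prints what would be logged."""
    def log_metrics(self, metrics, step=None):
        print(f"step {step}: {metrics}")

monitor = StubMonitor()
LOG_EVERY = 25           # metrics interval (configurable, as noted above)
CHECKPOINT_EVERY = 2000  # checkpoint interval

for step in range(1, 4001):
    loss = random.random()  # stand-in for the real training step's loss
    if step % LOG_EVERY == 0:
        monitor.log_metrics({"loss": loss}, step=step)
    if step % CHECKPOINT_EVERY == 0:
        print(f"checkpoint saved and logged at step {step}")  # real script saves here

print("training finished; real script would update the experiment status to 'completed'")
```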
### Manual Integration
You can also manually:
1. **Create experiments** through the interface
2. **Log custom metrics** for specific analysis
3. **Compare different runs** with different parameters
4. **Generate demo data** for testing the interface
## 🎨 Customization
### Adding Custom Metrics
```python
# In your training script, after computing your metrics for the current step
custom_metrics = {
    "loss": current_loss,
    "accuracy": current_accuracy,
    "custom_metric": your_custom_value,  # any additional metric you want to track
    "gpu_memory": gpu_memory_usage,
}
# Every key logged here becomes selectable in the visualization dropdown
monitor.log_metrics(custom_metrics, step=current_step)
```
### Custom Visualizations
The interface supports any metric you log. Just add it to your metrics JSON and it will appear in the visualization dropdown.
## 🚨 Troubleshooting
### No Data Displayed
1. **Check experiment ID**: Make sure you're using the correct ID
2. **Verify metrics were logged**: Check if training is actually logging metrics
3. **Use demo data**: Generate demo data to test the interface
### Plots Not Updating
1. **Refresh the page**: Sometimes plots need a refresh
2. **Check data format**: Ensure metrics are valid JSON in the expected format (a quick local check follows this list)
3. **Verify step numbers**: Make sure step numbers are increasing
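For the data-format check in particular, it can help to validate the payload locally before pasting it in. This is only a sanity check, not part of the interface:

```python
# Quick local sanity check for a metrics payload: must be valid JSON and the
# values should be numeric so they can be plotted.
import json

payload = '{"loss": 0.5234, "accuracy": 0.8567, "learning_rate": 3.5e-6}'

metrics = json.loads(payload)  # raises json.JSONDecodeError if malformed
non_numeric = {k: v for k, v in metrics.items() if not isinstance(v, (int, float))}
print("non-numeric values:", non_numeric or "none")
```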
### Interface Not Loading
1. **Check dependencies**: Ensure plotly and pandas are installed
2. **Check Gradio version**: Use Gradio 4.0.0 or higher
3. **Check browser console**: Look for JavaScript errors
## 📊 Example Workflow
1. **Start Training**:
   ```bash
   python run_a100_large_experiment.py --experiment-name "my_experiment"
   ```
2. **Monitor Progress**:
   - Visit your Trackio Space
   - Go to "View Experiments"
   - Enter your experiment ID
   - Watch real-time updates
3. **Visualize Results**:
   - Go to "📊 Visualizations"
   - Select the "loss" metric
   - Create a plot to see training progress
4. **Compare Runs**:
   - Run multiple experiments with different parameters
   - Use "Compare Experiments" to see differences
5. **Generate Demo Data**:
   - Use the "🎯 Demo Data" tab to test the interface
   - Generate realistic training data for demonstration
## 🎉 Success Indicators
Your interface is working correctly when you see:
- ✅ **Formatted experiment details** with emojis and structure
- ✅ **Interactive plots** that respond to your inputs
- ✅ **Real-time metric updates** during training
- ✅ **Clean experiment overview** with statistics
- ✅ **Smooth visualization** with hover information
The enhanced interface will now display much more meaningful information and provide a comprehensive monitoring experience for your SmolLM3 training experiments!