
Trackio Integration for SmolLM3 Fine-tuning

This document describes how Trackio experiment tracking and monitoring is integrated into the SmolLM3 fine-tuning pipeline, and how to configure, deploy, and use it.

Features

  • SmolLM3 Fine-tuning: Support for supervised fine-tuning and DPO training
  • Trackio Integration: Complete experiment tracking and monitoring
  • Hugging Face Spaces Deployment: Easy deployment of Trackio monitoring interface
  • Comprehensive Logging: Metrics, parameters, artifacts, and system monitoring
  • Flexible Configuration: Support for various training configurations

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Basic Training with Trackio

python train.py config/train_smollm3.py \
    --dataset_dir my_dataset \
    --enable_tracking \
    --trackio_url "https://your-trackio-instance.com" \
    --experiment_name "smollm3_finetune_v1"

3. Training with Custom Parameters

python train.py config/train_smollm3.py \
    --dataset_dir my_dataset \
    --batch_size 8 \
    --learning_rate 1e-5 \
    --max_iters 2000 \
    --enable_tracking \
    --trackio_url "https://your-trackio-instance.com" \
    --experiment_name "smollm3_low_lr_experiment"
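
The flags used in the two commands above could be parsed along the following lines. This is a minimal sketch, not the actual train.py: only the flag names come from the commands above, while the parser layout, the defaults, and the names build_parser and apply_overrides are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="SmolLM3 fine-tuning (sketch)")
    parser.add_argument("config", help="Path to a config file, e.g. config/train_smollm3.py")
    parser.add_argument("--dataset_dir", default=None)
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--learning_rate", type=float, default=None)
    parser.add_argument("--max_iters", type=int, default=None)
    parser.add_argument("--enable_tracking", action="store_true")
    parser.add_argument("--trackio_url", default=None)
    parser.add_argument("--experiment_name", default=None)
    return parser


def apply_overrides(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of config in which flags given on the command line win."""
    merged = dict(config)
    for key, value in vars(args).items():
        if key == "config":
            continue
        # None means "flag not given"; store_true flags only override when set.
        if value is None or value is False:
            continue
        merged[key] = value
    return merged
```

With this arrangement, values from the config file act as defaults and any flag passed on the command line replaces the corresponding entry.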

Trackio Integration

Configuration

Add Trackio settings to your configuration:

# In your config file
config = SmolLM3Config(
    # ... other settings ...
    
    # Trackio monitoring configuration
    enable_tracking=True,
    trackio_url="https://your-trackio-instance.com",
    trackio_token="your_token_here",  # Optional
    log_artifacts=True,
    log_metrics=True,
    log_config=True,
    experiment_name="my_experiment"
)

Environment Variables

You can also set Trackio configuration via environment variables:

export TRACKIO_URL="https://your-trackio-instance.com"
export TRACKIO_TOKEN="your_token_here"
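
One way these variables might be combined with explicit settings is sketched below. The precedence (explicit argument first, then environment) and the function name resolve_trackio_settings are assumptions, not documented behavior of the pipeline.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_trackio_settings(
    url: Optional[str] = None,
    token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical helper: explicit values win, else fall back to TRACKIO_* variables."""
    env = os.environ if env is None else env
    resolved_url = url or env.get("TRACKIO_URL")
    resolved_token = token or env.get("TRACKIO_TOKEN")
    return resolved_url, resolved_token
```

Passing env explicitly makes the helper easy to exercise without touching the real process environment.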

What Gets Tracked

  • Configuration: All training parameters and model settings
  • Metrics: Loss, accuracy, learning rate, and custom metrics
  • System Metrics: GPU memory, CPU usage, training time
  • Artifacts: Model checkpoints, evaluation results
  • Training Summary: Final results and experiment duration

Hugging Face Spaces Deployment

Deploy Trackio Monitoring Interface

  1. Create a new Space on Hugging Face:

  2. Upload the deployment files:

    • app.py - The Gradio interface
    • requirements_space.txt - Dependencies
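    • README.md - Documentation

  3. Configure the Space:

    • The Space installs dependencies automatically. Spaces read them from a file named requirements.txt, so upload requirements_space.txt under that name.
    • Once the build finishes, the Gradio interface is available at your Space URL.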
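
The upload steps can also be scripted. The sketch below uses the huggingface_hub client. The repo id is a placeholder, the local file names are taken from the list above, and storing requirements_space.txt as requirements.txt inside the Space is an assumption based on how Spaces discover dependencies.

```python
from pathlib import Path

# Local file name -> name inside the Space repository.
# Assumption: the dependency file is stored as requirements.txt in the Space.
SPACE_FILES = {
    "app.py": "app.py",
    "requirements_space.txt": "requirements.txt",
    "README.md": "README.md",
}


def deploy_space(repo_id: str, source_dir: str = ".") -> None:
    """Create a Gradio Space (if needed) and upload the deployment files."""
    # Imported here so the module loads even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the locally stored Hugging Face token, if any
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
    for local_name, repo_name in SPACE_FILES.items():
        api.upload_file(
            path_or_fileobj=str(Path(source_dir) / local_name),
            path_in_repo=repo_name,
            repo_id=repo_id,
            repo_type="space",
        )
```

A call such as deploy_space("your-username/your-space-name") would then create the Space and push the three files; it requires network access and a valid Hugging Face login.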

Using the Trackio Space

  1. Create Experiments: Use the "Create Experiment" tab to start new experiments
  2. Log Metrics: Use the "Log Metrics" tab to track training progress
  3. View Results: Use the "View Experiments" tab to see experiment details
  4. Update Status: Use the "Update Status" tab to mark experiments as completed

Integration with Your Training

To connect your training script to the Trackio Space:

# In your training script
from monitoring import SmolLM3Monitor

# Initialize monitor
monitor = SmolLM3Monitor(
    experiment_name="my_experiment",
    trackio_url="https://your-space.hf.space",  # Your Space URL
    enable_tracking=True
)

# Log configuration
monitor.log_config(config_dict)

# Log metrics during training
monitor.log_metrics({"loss": 0.5, "accuracy": 0.85}, step=100)

# Log final results
monitor.log_training_summary(final_results)
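
In a training loop, these calls would typically be made at fixed intervals rather than on every step. The sketch below shows one such arrangement. It uses a stand-in RecordingMonitor with the same method names as above so that it runs without the real SmolLM3Monitor. The logging interval, the fabricated loss, and the summary keys are illustrative assumptions.

```python
import time
from typing import Any, Dict, List, Tuple


class RecordingMonitor:
    """Stand-in for SmolLM3Monitor that records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def log_config(self, config: Dict[str, Any]) -> None:
        self.calls.append(("config", config))

    def log_metrics(self, metrics: Dict[str, float], step: int) -> None:
        self.calls.append(("metrics", (step, metrics)))

    def log_training_summary(self, summary: Dict[str, Any]) -> None:
        self.calls.append(("summary", summary))


def train(monitor: Any, max_iters: int = 100, log_every: int = 25) -> Dict[str, Any]:
    """Toy loop: the loss is fabricated so that the control flow is the focus."""
    monitor.log_config({"max_iters": max_iters, "log_every": log_every})
    start = time.time()
    loss = 1.0
    for step in range(1, max_iters + 1):
        loss *= 0.99  # placeholder for a real optimizer step
        if step % log_every == 0:
            monitor.log_metrics({"loss": loss}, step=step)
    summary = {"final_loss": loss, "steps": max_iters, "duration_s": time.time() - start}
    monitor.log_training_summary(summary)
    return summary
```

Swapping RecordingMonitor for the real monitor object should leave the loop unchanged, provided the method names match those shown above.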

Configuration Files

Main Configuration (config/train_smollm3.py)

from dataclasses import dataclass
from typing import Optional

@dataclass
class SmolLM3Config:
    # Model configuration
    model_name: str = "HuggingFaceTB/SmolLM3-3B"
    max_seq_length: int = 4096
    
    # Training configuration
    batch_size: int = 4
    learning_rate: float = 2e-5
    max_iters: int = 1000
    
    # Trackio monitoring
    enable_tracking: bool = True
    trackio_url: Optional[str] = None
    trackio_token: Optional[str] = None
    experiment_name: Optional[str] = None

DPO Configuration (config/train_smollm3_dpo.py)

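from dataclasses import dataclass
from typing import Optional

# SmolLM3Config is the base class defined in config/train_smollm3.py

@dataclass
class SmolLM3DPOConfig(SmolLM3Config):
    # DPO-specific settings
    beta: float = 0.1
    max_prompt_length: int = 2048

    # Trackio monitoring (already inherited from SmolLM3Config; repeated here for visibility)
    enable_tracking: bool = True
    trackio_url: Optional[str] = None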
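
Because the DPO config subclasses the main config, the tracking fields are available on it even when they are not redeclared, and dataclasses.asdict produces a plain dictionary that could be passed to a config logger. The classes below are trimmed stand-ins for the two configs above, deliberately named BaseConfig and DPOConfig so they are not mistaken for the real ones.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class BaseConfig:
    """Trimmed stand-in for SmolLM3Config."""
    batch_size: int = 4
    learning_rate: float = 2e-5
    enable_tracking: bool = True
    trackio_url: Optional[str] = None


@dataclass
class DPOConfig(BaseConfig):
    """Trimmed stand-in for SmolLM3DPOConfig; tracking fields are inherited."""
    beta: float = 0.1
    max_prompt_length: int = 2048


config = DPOConfig(trackio_url="https://your-trackio-instance.com")
field_names = [f.name for f in fields(config)]  # base fields first, then DPO fields
config_dict = asdict(config)  # plain dict, e.g. for monitor.log_config(config_dict)
```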

Monitoring Features

Real-time Metrics

  • Training loss and evaluation metrics
  • Learning rate scheduling
  • GPU memory and utilization
  • Training time and progress

Artifact Tracking

  • Model checkpoints at regular intervals
  • Evaluation results and plots
  • Configuration snapshots
  • Training logs and summaries

Experiment Management

  • Experiment naming and organization
  • Status tracking (running, completed, failed)
  • Parameter comparison across experiments
  • Result visualization

Advanced Usage

Custom Metrics

# Log custom metrics
monitor.log_metrics({
    "custom_metric": value,
    "perplexity": perplexity_score,
    "bleu_score": bleu_score
}, step=current_step)
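
Perplexity, for instance, is conventionally the exponential of the mean per-token cross-entropy loss (measured in nats). A minimal sketch of computing it before logging follows; the function name and the overflow cap are illustrative choices.

```python
import math


def perplexity_from_loss(mean_loss: float, cap: float = 20.0) -> float:
    """exp(mean cross-entropy in nats), with the exponent capped to avoid overflow."""
    return math.exp(min(mean_loss, cap))


# Metrics dictionary in the shape expected by the log_metrics call above
metrics = {"loss": 0.5, "perplexity": perplexity_from_loss(0.5)}
```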

System Monitoring

# Log system metrics
monitor.log_system_metrics(step=current_step)
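
What log_system_metrics collects internally is not shown here. As a rough sketch of the kind of values involved, the helper below gathers CPU and RAM usage through psutil and GPU memory through PyTorch when those packages are installed, and skips them otherwise. The function name and the metric keys are assumptions.

```python
import time
from typing import Dict


def collect_system_metrics() -> Dict[str, float]:
    """Gather a few host metrics; optional dependencies are skipped if missing."""
    metrics: Dict[str, float] = {"timestamp": time.time()}

    try:
        import psutil  # optional dependency
        metrics["cpu_percent"] = psutil.cpu_percent(interval=None)
        metrics["ram_percent"] = psutil.virtual_memory().percent
    except ImportError:
        pass

    try:
        import torch  # optional dependency
        if torch.cuda.is_available():
            gib = 1024 ** 3
            metrics["gpu_mem_allocated_gib"] = torch.cuda.memory_allocated() / gib
            metrics["gpu_mem_reserved_gib"] = torch.cuda.memory_reserved() / gib
    except ImportError:
        pass

    return metrics
```

The resulting dictionary has the same flat name-to-number shape as the metrics passed to log_metrics above.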

Artifact Logging

# Log model checkpoint
monitor.log_model_checkpoint("checkpoint-1000", step=1000)

# Log evaluation results
monitor.log_evaluation_results(eval_results, step=1000)

Troubleshooting

Common Issues

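  1. Trackio not available: Install it with pip install trackio
  2. Connection errors: Check that the Trackio URL (trackio_url or TRACKIO_URL) is reachable and that the token (trackio_token or TRACKIO_TOKEN) is valid
  3. Missing metrics: Ensure enable_tracking and log_metrics are set to True in your configuration
  4. Space deployment issues: Check that the Gradio version in the Space's dependency file is compatible with app.py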
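
For the first issue, a training script can check whether the package is importable and fall back to running without tracking instead of crashing. A minimal sketch using only the standard library follows; the helper names and the fallback policy are assumptions.

```python
import importlib.util
import logging

logger = logging.getLogger(__name__)


def package_available(name: str) -> bool:
    """True if the top-level package `name` can be imported in this environment."""
    return importlib.util.find_spec(name) is not None


def tracking_enabled(requested: bool) -> bool:
    """Disable tracking (with a warning) when trackio is not installed."""
    if requested and not package_available("trackio"):
        logger.warning("trackio is not installed; run `pip install trackio`. Tracking disabled.")
        return False
    return requested
```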

Debug Mode

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.