SmolFactory / docs /TRACKIO_TRL_FIX.md
Tonic's picture
adds monkey patch for trackio monitoring in torch and readme creator improvements
39db0ca verified
|
raw
history blame
4.63 kB

Trackio TRL Compatibility Fix

Problem Description

The training was failing with the error:

ERROR:trainer:Training failed: module 'trackio' has no attribute 'init'

This error occurred because the TRL library (specifically SFTTrainer) expects a trackio module with specific functions:

  • init() - Initialize experiment
  • log() - Log metrics
  • finish() - Finish experiment

However, our custom monitoring implementation didn't provide this interface.

Solution Implementation

1. Created Trackio Module Interface (src/trackio.py)

Created a trackio module that provides the exact interface expected by TRL:

def init(project_name: str, experiment_name: Optional[str] = None, **kwargs) -> str:
    """Initialize trackio experiment (TRL interface)"""
    
def log(metrics: Dict[str, Any], step: Optional[int] = None, **kwargs):
    """Log metrics to trackio (TRL interface)"""
    
def finish():
    """Finish trackio experiment (TRL interface)"""

2. Global Trackio Module (trackio.py)

Created a root-level trackio.py file that imports from our custom implementation:

from src.trackio import (
    init, log, finish, log_config, log_checkpoint, 
    log_evaluation_results, get_experiment_url, is_available, get_monitor
)

This makes the trackio module available globally for TRL to import.

3. Updated Trainer Integration (src/trainer.py)

Modified the trainer to properly initialize trackio before creating SFTTrainer:

# Initialize trackio for TRL compatibility
try:
    import trackio
    experiment_id = trackio.init(
        project_name=self.config.experiment_name,
        experiment_name=self.config.experiment_name,
        trackio_url=getattr(self.config, 'trackio_url', None),
        trackio_token=getattr(self.config, 'trackio_token', None),
        hf_token=getattr(self.config, 'hf_token', None),
        dataset_repo=getattr(self.config, 'dataset_repo', None)
    )
    logger.info(f"Trackio initialized with experiment ID: {experiment_id}")
except Exception as e:
    logger.warning(f"Failed to initialize trackio: {e}")
    logger.info("Continuing without trackio integration")

4. Proper Cleanup

Added trackio.finish() calls in both success and error scenarios:

# Finish trackio experiment
try:
    import trackio
    trackio.finish()
    logger.info("Trackio experiment finished")
except Exception as e:
    logger.warning(f"Failed to finish trackio experiment: {e}")

Integration with Custom Monitoring

The trackio module integrates seamlessly with our existing monitoring system:

  • Uses SmolLM3Monitor for actual monitoring functionality
  • Provides TRL-compatible interface on top
  • Maintains all existing features (HF Datasets, Trackio Space, etc.)
  • Graceful fallback when Trackio Space is not accessible

Testing

Created comprehensive test suite (tests/test_trackio_trl_fix.py) that verifies:

  1. Interface Compatibility: All required functions exist
  2. TRL Compatibility: Function signatures match expectations
  3. Monitoring Integration: Works with our custom monitoring system

Test results:

βœ… Successfully imported trackio module
βœ… Found required function: init
βœ… Found required function: log  
βœ… Found required function: finish
βœ… Trackio initialization successful
βœ… Trackio logging successful
βœ… Trackio finish successful
βœ… TRL compatibility test passed
βœ… Monitor integration working

Benefits

  1. Resolves Training Error: Fixes the "module trackio has no attribute init" error
  2. Maintains Functionality: All existing monitoring features continue to work
  3. TRL Compatibility: SFTTrainer can now use trackio for logging
  4. Graceful Fallback: Continues training even if trackio initialization fails
  5. Future-Proof: Easy to extend with additional TRL-compatible functions

Usage

The fix is transparent to users. Training will now work with SFTTrainer and automatically:

  1. Initialize trackio when SFTTrainer is created
  2. Log metrics during training
  3. Finish the experiment when training completes
  4. Fall back gracefully if trackio is not available

Files Modified

  • src/trackio.py - New trackio module interface
  • trackio.py - Global trackio module for TRL
  • src/trainer.py - Updated trainer integration
  • src/__init__.py - Package exports
  • tests/test_trackio_trl_fix.py - Test suite

Verification

To verify the fix works:

python tests/test_trackio_trl_fix.py

This should show all tests passing and confirm that the trackio module provides the interface expected by TRL library.