
SmolLM3 Training Pipeline Fixes Summary

Issues Identified and Fixed

1. Format String Error

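Issue: Unknown format code 'f' for object of type 'str'

Root Cause: The console callback applied numeric format specifiers (such as .4f) to values that are not always numeric. When a value such as loss arrived as a string, formatting raised this error.

Fix: Updated src/trainer.py to check each value's type before formatting, applying numeric format specifiers only to numbers and falling back to str() otherwise.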

# Before (causing error):
print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))

# After (fixed):
if isinstance(loss, (int, float)):
    loss_str = f"{loss:.4f}"
else:
    loss_str = str(loss)
if isinstance(lr, (int, float)):
    lr_str = f"{lr:.2e}"
else:
    lr_str = str(lr)
print(f"Step {step}: loss={loss_str}, lr={lr_str}")
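The same check can be factored into a small helper so that every metric goes through one code path. The following is a minimal sketch, not the code in src/trainer.py: the name `format_metric` and its `spec` parameter are hypothetical. As a design choice, it also excludes `bool`, which is a subclass of `int` in Python and would otherwise be formatted as a number.

```python
from numbers import Real


def format_metric(value, spec=".4f"):
    """Format numeric values with `spec`; fall back to str() for anything else.

    Hypothetical helper for illustration; not part of src/trainer.py.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, Real) and not isinstance(value, bool):
        return format(value, spec)
    return str(value)


# Usage: numeric and non-numeric values both format without raising.
line = "Step {}: loss={}, lr={}".format(
    10, format_metric(0.123456), format_metric(8e-6, ".2e")
)
print(line)
print(format_metric("N/A"))
```

With a helper like this, the callback only has to decide which format specifier each metric should use.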

2. Callback Addition Error

Issue: 'SmolLM3Trainer' object has no attribute 'add_callback'

Root Cause: src/train.py tried to add callbacks after the trainer had been created, but SmolLM3Trainer does not expose an add_callback method. Callbacks must be passed when the trainer is created.

Fix: Removed the incorrect add_callback call from src/train.py, since callbacks are already handled in SmolLM3Trainer._setup_trainer().
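The pattern behind this fix is that the wrapper assembles its callbacks internally and hands them to the underlying trainer at construction, so callers never need an `add_callback` method. A minimal sketch with stand-in classes follows. `InnerTrainer`, `WrapperTrainer`, and `ConsoleCallback` are hypothetical names used only for illustration and do not reproduce the code in src/trainer.py.

```python
class ConsoleCallback:
    """Stand-in callback that records the steps it is notified about."""

    def __init__(self):
        self.seen_steps = []

    def on_log(self, step, logs):
        self.seen_steps.append(step)


class InnerTrainer:
    """Stand-in for the underlying trainer; takes callbacks at construction."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def train(self, num_steps):
        # Notify every registered callback once per step.
        for step in range(1, num_steps + 1):
            for cb in self.callbacks:
                cb.on_log(step, {"loss": 1.0 / step})


class WrapperTrainer:
    """Stand-in wrapper: builds callbacks in _setup_trainer, has no add_callback."""

    def __init__(self):
        self.trainer = self._setup_trainer()

    def _setup_trainer(self):
        # Callbacks are decided here and passed in at creation time.
        callbacks = [ConsoleCallback()]
        return InnerTrainer(callbacks=callbacks)

    def train(self, num_steps):
        self.trainer.train(num_steps)


wrapper = WrapperTrainer()
wrapper.train(num_steps=3)
print(wrapper.trainer.callbacks[0].seen_steps)
```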

3. Trackio Space Deployment Issues

Issue: 404 errors when trying to create experiments via the Trackio API

Root Cause: The Trackio Space deployment was failing or its API endpoints were not accessible

Fix: Updated src/monitoring.py to handle Trackio Space failures gracefully and continue with the HF Datasets integration

# Added graceful fallback:
try:
    result = self.trackio_client.log_metrics(...)
    if "success" in result:
        logger.debug("Metrics logged to Trackio")
    else:
        logger.warning("Failed to log metrics to Trackio: %s", result)
except Exception as e:
    logger.warning("Trackio logging failed: %s", e)

4. Monitoring Integration Improvements

Enhancement: Made monitoring more robust by:

  • Testing Trackio Space connectivity before attempting operations
  • Continuing with HF Datasets even if Trackio fails
  • Adding better error handling and logging
  • Ensuring experiments are saved to HF Datasets regardless of Trackio status
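The four points above can be sketched as a single logging function. This is an illustration under assumptions, not the code in src/monitoring.py: `log_with_fallback`, `check_trackio`, `log_to_trackio`, and `save_to_hf_dataset` are hypothetical names, and the two backends are passed in as plain callables so that the sketch stays self-contained.

```python
import logging

logger = logging.getLogger("monitoring_sketch")


def log_with_fallback(metrics, check_trackio, log_to_trackio, save_to_hf_dataset):
    """Log to Trackio if reachable, and always attempt the HF Datasets save.

    Returns a dict reporting which backends succeeded.
    """
    status = {"trackio": False, "hf_dataset": False}

    # 1. Test connectivity before attempting any Trackio operation.
    try:
        trackio_ok = bool(check_trackio())
    except Exception as exc:
        logger.warning("Trackio connectivity check failed: %s", exc)
        trackio_ok = False

    # 2. Only log to Trackio when the check passed; never let it raise.
    if trackio_ok:
        try:
            log_to_trackio(metrics)
            status["trackio"] = True
        except Exception as exc:
            logger.warning("Trackio logging failed: %s", exc)

    # 3. Save to the dataset backend regardless of the Trackio outcome.
    try:
        save_to_hf_dataset(metrics)
        status["hf_dataset"] = True
    except Exception as exc:
        logger.warning("HF Datasets save failed: %s", exc)

    return status


# Usage with stand-in backends: Trackio is unreachable, the dataset save works.
saved = []


def unreachable():
    raise ConnectionError("404 from Space")


result = log_with_fallback(
    {"loss": 0.5},
    check_trackio=unreachable,
    log_to_trackio=lambda m: None,
    save_to_hf_dataset=saved.append,
)
print(result)
```

Returning a status dict rather than raising keeps the training loop independent of either monitoring backend, while still letting the caller see what was recorded.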

Files Modified

Core Training Files

  1. src/trainer.py

    • Fixed format string error in SimpleConsoleCallback
    • Improved callback handling and error reporting
  2. src/train.py

    • Removed incorrect add_callback call
    • Simplified trainer initialization
  3. src/monitoring.py

    • Added graceful Trackio Space failure handling
    • Improved error logging and fallback mechanisms
    • Enhanced HF Datasets integration

Test Files

  1. tests/test_training_fix.py
    • Created comprehensive test suite
    • Tests imports, config loading, monitoring setup, trainer creation
    • Validates format string fixes

Testing the Fixes

Run the test suite to verify all fixes work:

python tests/test_training_fix.py

Expected output:

```
🚀 Testing SmolLM3 Training Pipeline Fixes

🔍 Testing imports...
✅ config.py imported successfully
✅ model.py imported successfully
✅ data.py imported successfully
✅ trainer.py imported successfully
✅ monitoring.py imported successfully

🔍 Testing configuration loading...
✅ Configuration loaded successfully
   Model: HuggingFaceTB/SmolLM3-3B
   Dataset: legmlai/openhermes-fr
   Batch size: 16
   Learning rate: 8e-06

🔍 Testing monitoring setup...
✅ Monitoring setup successful
   Experiment: test_experiment
   Tracking enabled: False
   HF Dataset: tonic/trackio-experiments

🔍 Testing trainer creation...
✅ Model created successfully
✅ Dataset created successfully
✅ Trainer created successfully

🔍 Testing format string fix...
✅ Format string fix works correctly

📊 Test Results: 5/5 tests passed
✅ All tests passed! The training pipeline should work correctly.
```


Running the Training Pipeline

The training pipeline should now work correctly with the H100 lightweight configuration:

```bash
# Run the interactive pipeline
./launch.sh

# Or run training directly
python src/train.py config/train_smollm3_h100_lightweight.py \
    --experiment-name "smollm3_test" \
    --trackio-url "https://your-space.hf.space" \
    --output-dir /output-checkpoint
```

Key Improvements

  1. Robust Error Handling: Training continues even if monitoring components fail
  2. Better Logging: More informative error messages and status updates
  3. Graceful Degradation: HF Datasets integration works even without Trackio Space
  4. Type Safety: Proper type checking prevents format string errors
  5. Comprehensive Testing: Test suite validates all components work correctly

Next Steps

  1. Deploy Trackio Space: If you want full monitoring, deploy the Trackio Space manually
  2. Test Training: Run a short training session to verify everything works
  3. Monitor Progress: Check HF Datasets for experiment data even if Trackio Space is unavailable

The training pipeline should now work reliably for your end-to-end fine-tuning experiments!