SmolLM3 Training Pipeline Fixes Summary

Issues Identified and Fixed

1. Format String Error

Issue: Unknown format code 'f' for object of type 'str'

Root Cause: The console callback applied numeric format specifiers (such as :.4f) to values that were sometimes strings rather than numbers.

Fix: Updated src/trainer.py to check each value's type before formatting and to fall back to str() for non-numeric values.

# Before (causing error):
print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))

# After (fixed):
if isinstance(loss, (int, float)):
    loss_str = f"{loss:.4f}"
else:
    loss_str = str(loss)
if isinstance(lr, (int, float)):
    lr_str = f"{lr:.2e}"
else:
    lr_str = str(lr)
print(f"Step {step}: loss={loss_str}, lr={lr_str}")
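
The same type check can be factored into a small helper so that every metric goes through one code path. The following is a minimal sketch, not the code in src/trainer.py; the name `format_metric` and its `spec` parameter are hypothetical.

```python
from numbers import Real


def format_metric(value, spec=".4f"):
    """Format numeric values with `spec`; fall back to str() for anything else.

    Hypothetical helper for illustration, not part of src/trainer.py.
    """
    # bool is a subclass of int, so exclude it to avoid printing True as 1.0000.
    if isinstance(value, Real) and not isinstance(value, bool):
        return format(value, spec)
    return str(value)


step, loss, lr = 10, 2.5, "N/A"  # lr is a string here, as in the failing case
print(f"Step {step}: loss={format_metric(loss)}, lr={format_metric(lr, '.2e')}")
# prints: Step 10: loss=2.5000, lr=N/A
```

Keeping the fallback inside one function means a new metric with an unexpected type degrades to its plain string form instead of raising during training.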

2. Callback Addition Error

Issue: 'SmolLM3Trainer' object has no attribute 'add_callback'

Root Cause: src/train.py tried to add callbacks after the trainer had been created, but SmolLM3Trainer has no add_callback method; callbacks must be supplied when the trainer is created.

Fix: Removed the incorrect add_callback call from src/train.py, since callbacks are already handled in SmolLM3Trainer._setup_trainer().
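
The pattern behind this fix is that the wrapper receives its callbacks up front and forwards them when it builds the inner trainer. A minimal sketch with stand-in classes follows. `InnerTrainer`, `WrapperTrainer`, and `RecordLossCallback` are hypothetical names for illustration and do not reproduce the actual SmolLM3Trainer code; only the method name `_setup_trainer` comes from the project.

```python
class RecordLossCallback:
    """Hypothetical callback that records each logged loss."""

    def __init__(self):
        self.seen = []

    def on_log(self, step, logs):
        self.seen.append((step, logs.get("loss")))


class InnerTrainer:
    """Stand-in for the underlying trainer; takes callbacks at construction."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def log(self, step, logs):
        for callback in self.callbacks:
            callback.on_log(step, logs)


class WrapperTrainer:
    """Stand-in for a wrapper class that exposes no add_callback method."""

    def __init__(self, callbacks=None):
        self._callbacks = list(callbacks or [])
        self.trainer = self._setup_trainer()

    def _setup_trainer(self):
        # Callbacks are forwarded here, once, instead of being added later.
        return InnerTrainer(callbacks=self._callbacks)


callback = RecordLossCallback()
wrapper = WrapperTrainer(callbacks=[callback])
wrapper.trainer.log(1, {"loss": 2.5})
print(callback.seen)  # [(1, 2.5)]
```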

3. Trackio Space Deployment Issues

Issue: 404 errors when trying to create experiments via the Trackio API

Root Cause: The Trackio Space deployment was failing or its API endpoints were not accessible

Fix: Updated src/monitoring.py to handle Trackio Space failures gracefully and continue with the HF Datasets integration

# Added graceful fallback:
try:
    result = self.trackio_client.log_metrics(...)
    if "success" in result:
        logger.debug("Metrics logged to Trackio")
    else:
        logger.warning("Failed to log metrics to Trackio: %s", result)
except Exception as e:
    logger.warning("Trackio logging failed: %s", e)

4. Monitoring Integration Improvements

Enhancement: Made monitoring more robust by:

- Testing Trackio Space connectivity before attempting operations
- Continuing with HF Datasets even if Trackio fails
- Adding better error handling and logging
- Ensuring experiments are saved to HF Datasets regardless of Trackio status
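
The four points above can be sketched as a small logger that probes the primary backend once and always writes to the fallback. This is an illustration under assumptions, not the code in src/monitoring.py: `ResilientMonitor`, `probe`, `primary_log`, and `fallback_log` are hypothetical names, and the backends are plain callables standing in for the Trackio client and the HF Datasets writer.

```python
import logging

logger = logging.getLogger("monitoring_sketch")


class ResilientMonitor:
    """Hypothetical monitor: optional primary backend, mandatory fallback."""

    def __init__(self, probe, primary_log, fallback_log):
        self.primary_log = primary_log
        self.fallback_log = fallback_log
        # Test connectivity once, before any logging is attempted.
        try:
            self.primary_available = bool(probe())
        except Exception as exc:
            logger.warning("Primary backend unreachable: %s", exc)
            self.primary_available = False

    def log_metrics(self, metrics):
        """Return True if the metrics reached the fallback store."""
        if self.primary_available:
            try:
                self.primary_log(metrics)
            except Exception as exc:
                logger.warning("Primary logging failed: %s", exc)
        # The fallback is written regardless of the primary backend's status.
        try:
            self.fallback_log(metrics)
            return True
        except Exception as exc:
            logger.warning("Fallback logging failed: %s", exc)
            return False


def failing_probe():
    # Simulates the 404 seen when the Space is not deployed.
    raise ConnectionError("404 from Space")


def unreachable_primary(metrics):
    raise ConnectionError("primary backend is down")


saved = []  # stands in for the dataset-backed store
monitor = ResilientMonitor(failing_probe, unreachable_primary, saved.append)
ok = monitor.log_metrics({"loss": 2.5})
print(ok, saved)  # True [{'loss': 2.5}]
```

Passing the probe and the two writers in as callables keeps the sketch free of network calls; a real implementation would wrap its HTTP client and dataset upload behind the same three entry points.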

Files Modified

Core Training Files

src/trainer.py
- Fixed format string error in SimpleConsoleCallback
- Improved callback handling and error reporting
src/train.py
- Removed incorrect add_callback call
- Simplified trainer initialization
src/monitoring.py
- Added graceful Trackio Space failure handling
- Improved error logging and fallback mechanisms
- Enhanced HF Datasets integration

Test Files

tests/test_training_fix.py
- Created comprehensive test suite
- Tests imports, config loading, monitoring setup, trainer creation
- Validates format string fixes

Testing the Fixes

Run the test suite to verify all fixes work:

python tests/test_training_fix.py

Expected output:

```
🚀 Testing SmolLM3 Training Pipeline Fixes

🔍 Testing imports...
✅ config.py imported successfully
✅ model.py imported successfully
✅ data.py imported successfully
✅ trainer.py imported successfully
✅ monitoring.py imported successfully

🔍 Testing configuration loading...
✅ Configuration loaded successfully
   Model: HuggingFaceTB/SmolLM3-3B
   Dataset: legmlai/openhermes-fr
   Batch size: 16
   Learning rate: 8e-06

🔍 Testing monitoring setup...
✅ Monitoring setup successful
   Experiment: test_experiment
   Tracking enabled: False
   HF Dataset: tonic/trackio-experiments

🔍 Testing trainer creation...
✅ Model created successfully
✅ Dataset created successfully
✅ Trainer created successfully

🔍 Testing format string fix...
✅ Format string fix works correctly

📊 Test Results: 5/5 tests passed
✅ All tests passed! The training pipeline should work correctly.
```

Running the Training Pipeline

The training pipeline should now work correctly with the H100 lightweight configuration:

```bash
# Run the interactive pipeline
./launch.sh

# Or run training directly
python src/train.py config/train_smollm3_h100_lightweight.py \
    --experiment-name "smollm3_test" \
    --trackio-url "https://your-space.hf.space" \
    --output-dir /output-checkpoint
```

Key Improvements

- Robust Error Handling: Training continues even if monitoring components fail
- Better Logging: More informative error messages and status updates
- Graceful Degradation: HF Datasets integration works even without Trackio Space
- Type Safety: Explicit type checks prevent format string errors
- Comprehensive Testing: Test suite validates imports, config loading, monitoring setup, trainer creation, and the format string fix

Next Steps

1. Deploy Trackio Space: If you want full monitoring, deploy the Trackio Space manually
2. Test Training: Run a short training session to verify everything works
3. Monitor Progress: Check HF Datasets for experiment data even if Trackio Space is unavailable

The training pipeline should now work reliably for your end-to-end fine-tuning experiments!