# SmolLM3 Training Pipeline Fixes Summary
## Issues Identified and Fixed
### 1. Format String Error
Issue: Unknown format code 'f' for object of type 'str'
Root Cause: The console callback applied numeric format specifiers (such as `:.4f`) to values that were sometimes strings rather than numbers
Fix: Updated `src/trainer.py` to check each value's type before formatting it

    # Before (causing error):
    print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))

    # After (fixed):
    if isinstance(loss, (int, float)):
        loss_str = f"{loss:.4f}"
    else:
        loss_str = str(loss)
    if isinstance(lr, (int, float)):
        lr_str = f"{lr:.2e}"
    else:
        lr_str = str(lr)
    print(f"Step {step}: loss={loss_str}, lr={lr_str}")

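
The same type check can be factored into one helper so that every metric is formatted in a single place. This is a minimal sketch rather than the code in `src/trainer.py`; the name `format_metric` and its `spec` parameter are hypothetical.

```python
def format_metric(value, spec=".4f"):
    """Format numbers with `spec`; fall back to str() for anything else."""
    # bool is a subclass of int, so exclude it to avoid printing True as 1.0000.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return format(value, spec)
    return str(value)


# Example values: a numeric loss and a learning rate that arrived as a string.
step, loss, lr = 10, 0.123456, "N/A"
print(f"Step {step}: loss={format_metric(loss)}, lr={format_metric(lr, '.2e')}")
# prints: Step 10: loss=0.1235, lr=N/A
```

Keeping the check in one function means a new metric cannot reintroduce the error by being formatted directly.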
### 2. Callback Addition Error
Issue: 'SmolLM3Trainer' object has no attribute 'add_callback'
Root Cause: `src/train.py` tried to add callbacks after the trainer had been created, but `SmolLM3Trainer` has no `add_callback` method; callbacks must be passed when the trainer is created
Fix: Removed the incorrect `add_callback` call from `src/train.py`, since callbacks are already handled in `SmolLM3Trainer._setup_trainer()`
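
The pattern of fixing callbacks at construction time can be illustrated with stand-in classes. This sketch only shows the idea: `ConsoleCallback`, `WrappedTrainer`, and their methods are hypothetical and are not the project's `SmolLM3Trainer` or any library class.

```python
class ConsoleCallback:
    """Hypothetical callback that records the lines it would print."""

    def __init__(self):
        self.lines = []

    def on_log(self, step, logs):
        self.lines.append(f"Step {step}: {logs}")


class WrappedTrainer:
    """Hypothetical stand-in for a trainer wrapper."""

    def __init__(self, callbacks=None):
        # Callbacks are fixed here; the wrapper exposes no add_callback().
        self._callbacks = list(callbacks or [])

    def log(self, step, logs):
        # Forward each log event to every registered callback.
        for callback in self._callbacks:
            callback.on_log(step, logs)


console = ConsoleCallback()
trainer = WrappedTrainer(callbacks=[console])
trainer.log(1, {"loss": 0.5})
print(console.lines)
```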
### 3. Trackio Space Deployment Issues
Issue: 404 errors when trying to create experiments via Trackio API
Root Cause: The Trackio Space deployment was failing or the API endpoints weren't accessible
Fix: Updated `src/monitoring.py` to handle Trackio Space failures gracefully and continue with the HF Datasets integration

    # Added graceful fallback:
    try:
        result = self.trackio_client.log_metrics(...)
        if "success" in result:
            logger.debug("Metrics logged to Trackio")
        else:
            logger.warning("Failed to log metrics to Trackio: %s", result)
    except Exception as e:
        logger.warning("Trackio logging failed: %s", e)

### 4. Monitoring Integration Improvements
Enhancement: Made monitoring more robust by:
- Testing Trackio Space connectivity before attempting operations
- Continuing with HF Datasets even if Trackio fails
- Adding better error handling and logging
- Ensuring experiments are saved to HF Datasets regardless of Trackio status
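
The points above can be sketched as a connectivity probe followed by logging that always reaches the fallback sink. This is an illustration under stated assumptions, not the code in `src/monitoring.py`: the function names, the `opener` parameter, and the sink callables are all hypothetical, and the URL is the placeholder used elsewhere in this document.

```python
import logging
import urllib.error
import urllib.request

logger = logging.getLogger("monitoring_sketch")


def space_is_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if an HTTP GET to `url` succeeds with a 2xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses.
        logger.warning("Trackio Space not reachable: %s", exc)
        return False


def log_metrics(metrics, remote_sink, dataset_sink, remote_ok):
    """Send to the remote sink only if reachable; always save to the dataset sink."""
    if remote_ok:
        try:
            remote_sink(metrics)
        except Exception as exc:  # monitoring must never stop training
            logger.warning("Remote logging failed: %s", exc)
    dataset_sink(metrics)


def failing_opener(url, timeout):
    # Hypothetical opener that simulates an unreachable Space without network access.
    raise urllib.error.URLError("404 Not Found")


saved = []
reachable = space_is_reachable("https://your-space.hf.space", opener=failing_opener)
log_metrics({"loss": 0.5}, remote_sink=print, dataset_sink=saved.append, remote_ok=reachable)
print(reachable, saved)
```

Passing the opener as a parameter keeps the probe testable offline; in real use the default `urllib.request.urlopen` would be left in place.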
## Files Modified
### Core Training Files
src/trainer.py
- Fixed format string error in SimpleConsoleCallback
- Improved callback handling and error reporting
src/train.py
- Removed incorrect `add_callback` call
- Simplified trainer initialization
src/monitoring.py
- Added graceful Trackio Space failure handling
- Improved error logging and fallback mechanisms
- Enhanced HF Datasets integration
### Test Files
tests/test_training_fix.py
- Created comprehensive test suite
- Tests imports, config loading, monitoring setup, trainer creation
- Validates format string fixes
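
A format string check of the kind listed above could look like the following. This is a hedged sketch and not the contents of `tests/test_training_fix.py`; `format_loss` is a hypothetical copy of the type-checked formatting, and the test class name is made up for illustration.

```python
import unittest


def format_loss(loss):
    # Hypothetical copy of the type-checked formatting used in the fix.
    return f"{loss:.4f}" if isinstance(loss, (int, float)) else str(loss)


class FormatStringFixTest(unittest.TestCase):
    def test_old_format_fails_on_str(self):
        # The original pattern raises ValueError when the loss is a string.
        with self.assertRaises(ValueError):
            "loss={:.4f}".format("N/A")

    def test_new_format_accepts_str_and_float(self):
        self.assertEqual(format_loss("N/A"), "N/A")
        self.assertEqual(format_loss(1.5), "1.5000")
```

Saved as a module, these cases can be run with `python -m unittest`.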
## Testing the Fixes
Run the test suite to verify all fixes work:
python tests/test_training_fix.py
Expected output:

```
Testing SmolLM3 Training Pipeline Fixes

Testing imports...
config.py imported successfully
model.py imported successfully
data.py imported successfully
trainer.py imported successfully
monitoring.py imported successfully

Testing configuration loading...
Configuration loaded successfully
Model: HuggingFaceTB/SmolLM3-3B
Dataset: legmlai/openhermes-fr
Batch size: 16
Learning rate: 8e-06

Testing monitoring setup...
Monitoring setup successful
Experiment: test_experiment
Tracking enabled: False
HF Dataset: tonic/trackio-experiments

Testing trainer creation...
Model created successfully
Dataset created successfully
Trainer created successfully

Testing format string fix...
Format string fix works correctly

Test Results: 5/5 tests passed
All tests passed! The training pipeline should work correctly.
```
## Running the Training Pipeline
The training pipeline should now work correctly with the H100 lightweight configuration:
```bash
# Run the interactive pipeline
./launch.sh

# Or run training directly
python src/train.py config/train_smollm3_h100_lightweight.py \
    --experiment-name "smollm3_test" \
    --trackio-url "https://your-space.hf.space" \
    --output-dir /output-checkpoint
```
## Key Improvements
- Robust Error Handling: Training continues even if monitoring components fail
- Better Logging: More informative error messages and status updates
- Graceful Degradation: HF Datasets integration works even without Trackio Space
- Type Safety: Proper type checking prevents format string errors
- Comprehensive Testing: Test suite validates all components work correctly
## Next Steps
- Deploy Trackio Space: If you want full monitoring, deploy the Trackio Space manually
- Test Training: Run a short training session to verify everything works
- Monitor Progress: Check HF Datasets for experiment data even if Trackio Space is unavailable
The training pipeline should now work reliably for your end-to-end fine-tuning experiments!