Spaces:
Running
Running
File size: 5,020 Bytes
21d66ae |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 |
# SmolLM3 Training Pipeline Fixes Summary
## Issues Identified and Fixed
### 1. Format String Error
**Issue**: `Unknown format code 'f' for object of type 'str'`
**Root Cause**: The console callback was trying to format non-numeric values with f-string format specifiers
**Fix**: Updated `src/trainer.py` to properly handle type conversion before formatting
```python
# Before (causing error):
print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr))
# After (fixed):
if isinstance(loss, (int, float)):
loss_str = f"{loss:.4f}"
else:
loss_str = str(loss)
if isinstance(lr, (int, float)):
lr_str = f"{lr:.2e}"
else:
lr_str = str(lr)
print(f"Step {step}: loss={loss_str}, lr={lr_str}")
```
### 2. Callback Addition Error
**Issue**: `'SmolLM3Trainer' object has no attribute 'add_callback'`
**Root Cause**: The trainer was trying to add callbacks after creation, but callbacks should be passed during trainer creation
**Fix**: Removed the incorrect `add_callback` call from `src/train.py` since callbacks are already handled in `SmolLM3Trainer._setup_trainer()`
### 3. Trackio Space Deployment Issues
**Issue**: 404 errors when trying to create experiments via Trackio API
**Root Cause**: The Trackio Space deployment was failing or the API endpoints weren't accessible
**Fix**: Updated `src/monitoring.py` to gracefully handle Trackio Space failures and continue with HF Datasets integration
```python
# Added graceful fallback:
try:
result = self.trackio_client.log_metrics(...)
if "success" in result:
logger.debug("Metrics logged to Trackio")
else:
logger.warning("Failed to log metrics to Trackio: %s", result)
except Exception as e:
logger.warning("Trackio logging failed: %s", e)
```
### 4. Monitoring Integration Improvements
**Enhancement**: Made monitoring more robust by:
- Testing Trackio Space connectivity before attempting operations
- Continuing with HF Datasets even if Trackio fails
- Adding better error handling and logging
- Ensuring experiments are saved to HF Datasets regardless of Trackio status
## Files Modified
### Core Training Files
1. **`src/trainer.py`**
- Fixed format string error in SimpleConsoleCallback
- Improved callback handling and error reporting
2. **`src/train.py`**
- Removed incorrect `add_callback` call
- Simplified trainer initialization
3. **`src/monitoring.py`**
- Added graceful Trackio Space failure handling
- Improved error logging and fallback mechanisms
- Enhanced HF Datasets integration
### Test Files
4. **`tests/test_training_fix.py`**
- Created comprehensive test suite
- Tests imports, config loading, monitoring setup, trainer creation
- Validates format string fixes
## Testing the Fixes
Run the test suite to verify all fixes work:
```bash
python tests/test_training_fix.py
```
Expected output:
```
π Testing SmolLM3 Training Pipeline Fixes
==================================================
π Testing imports...
β
config.py imported successfully
β
model.py imported successfully
β
data.py imported successfully
β
trainer.py imported successfully
β
monitoring.py imported successfully
π Testing configuration loading...
β
Configuration loaded successfully
Model: HuggingFaceTB/SmolLM3-3B
Dataset: legmlai/openhermes-fr
Batch size: 16
Learning rate: 8e-06
π Testing monitoring setup...
β
Monitoring setup successful
Experiment: test_experiment
Tracking enabled: False
HF Dataset: tonic/trackio-experiments
π Testing trainer creation...
β
Model created successfully
β
Dataset created successfully
β
Trainer created successfully
π Testing format string fix...
β
Format string fix works correctly
π Test Results: 5/5 tests passed
β
All tests passed! The training pipeline should work correctly.
```
## Running the Training Pipeline
The training pipeline should now work correctly with the H100 lightweight configuration:
```bash
# Run the interactive pipeline
./launch.sh
# Or run training directly
python src/train.py config/train_smollm3_h100_lightweight.py \
--experiment-name "smollm3_test" \
--trackio-url "https://your-space.hf.space" \
--output-dir /output-checkpoint
```
## Key Improvements
1. **Robust Error Handling**: Training continues even if monitoring components fail
2. **Better Logging**: More informative error messages and status updates
3. **Graceful Degradation**: HF Datasets integration works even without Trackio Space
4. **Type Safety**: Proper type checking prevents format string errors
5. **Comprehensive Testing**: Test suite validates all components work correctly
## Next Steps
1. **Deploy Trackio Space**: If you want full monitoring, deploy the Trackio Space manually
2. **Test Training**: Run a short training session to verify everything works
3. **Monitor Progress**: Check HF Datasets for experiment data even if Trackio Space is unavailable
The training pipeline should now work reliably for your end-to-end fine-tuning experiments! |