Quantization Fix Summary
Issues Identified
The quantization script was failing due to several compatibility issues:
Int8 Quantization Error:
- Error: The model is quantized with QuantizationMethod.TORCHAO and is not serializable
- Cause: Offloaded modules in the model cannot be quantized with torchao
- Solution: Added an alternative save method and a fallback to bitsandbytes

Int4 Quantization Error:
- Error: Could not run 'aten::_convert_weight_to_int4pack_for_cpu' with arguments from the 'CUDA' backend
- Cause: Int4 quantization requires the CPU backend but was being attempted on CUDA
- Solution: Added proper device selection logic

Monitoring Error:
- Error: 'SmolLM3Monitor' object has no attribute 'log_event'
- Cause: Incorrect monitoring API usage
- Solution: Added flexible monitoring method detection
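For context, the failing path loads the model with transformers' TorchAoConfig, roughly as sketched below. This is a sketch under assumptions, not the script's exact code; the model path and save directory are illustrative.

    from transformers import AutoModelForCausalLM, TorchAoConfig

    # Sketch of the torchao path that triggers the first two errors above
    quant_config = TorchAoConfig("int8_weight_only")
    model = AutoModelForCausalLM.from_pretrained(
        "/output-checkpoint",              # illustrative model path
        quantization_config=quant_config,
        device_map="auto",                 # "auto" can offload modules, which torchao cannot quantize
    )
    model.save_pretrained("./quantized")   # fails with the "not serializable" error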
Fixes Implemented
1. Enhanced Device Management (scripts/model_tonic/quantize_model.py)
    def get_optimal_device(self, quant_type: str) -> str:
        """Get optimal device for quantization type"""
        if quant_type == "int4_weight_only":
            # Int4 quantization works better on CPU
            return "cpu"
        elif quant_type == "int8_weight_only":
            # Int8 quantization works on GPU
            if torch.cuda.is_available():
                return "cuda"
            else:
                logger.warning("⚠️ CUDA not available, falling back to CPU for int8")
                return "cpu"
        else:
            return "auto"
2. Alternative Quantization Method
Added a quantize_model_alternative() method that uses bitsandbytes for better compatibility:

    def quantize_model_alternative(self, quant_type: str, device: str = "auto",
                                   group_size: int = 128, save_dir: Optional[str] = None) -> Optional[str]:
        """Alternative quantization using bitsandbytes for better compatibility"""
        # Uses BitsAndBytesConfig instead of TorchAoConfig
        # Handles serialization issues better
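A minimal sketch of what the bitsandbytes path can look like; the helper name and exact loading arguments below are assumptions, not the script's actual implementation:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    def load_quantized_with_bnb(model_path: str, quant_type: str):
        # Map the script's quant_type strings onto bitsandbytes load flags
        if quant_type == "int8_weight_only":
            bnb_config = BitsAndBytesConfig(load_in_8bit=True)
        else:  # int4_weight_only
            bnb_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
            )
        return AutoModelForCausalLM.from_pretrained(
            model_path,
            quantization_config=bnb_config,
            device_map="auto",
        )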
3. Improved Error Handling
- Added fallback from torchao to bitsandbytes (see the sketch after this list)
- Enhanced save method with alternative approaches
- Better device mapping for different quantization types
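The fallback in the first bullet might be orchestrated roughly as follows; this wrapper is hypothetical, and the primary method name quantize_model is assumed (only quantize_model_alternative is named in this document):

    def quantize_with_fallback(self, quant_type: str, **kwargs) -> Optional[str]:
        # Hypothetical wrapper: try the torchao path first, then bitsandbytes
        try:
            return self.quantize_model(quant_type, **kwargs)
        except Exception as e:
            logger.warning(f"⚠️ torchao quantization failed ({e}), falling back to bitsandbytes")
            return self.quantize_model_alternative(quant_type, **kwargs)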
4. Fixed Monitoring Integration

    def log_to_trackio(self, action: str, details: Dict[str, Any]):
        """Log quantization events to Trackio"""
        if self.monitor:
            try:
                # Use the correct monitoring method
                if hasattr(self.monitor, 'log_event'):
                    self.monitor.log_event(action, details)
                elif hasattr(self.monitor, 'log_metric'):
                    self.monitor.log_metric(action, details.get('value', 1.0))
                elif hasattr(self.monitor, 'log'):
                    self.monitor.log(action, details)
                else:
                    logger.info(f"{action}: {details}")
            except Exception as e:
                logger.warning(f"⚠️ Failed to log to Trackio: {e}")
Usage Instructions
1. Install Dependencies

    pip install -r requirements_quantization.txt

2. Run Quantization

    python3 quantize_and_push.py

3. Test Fixes

    python3 test_quantization_fix.py
Expected Behavior
Successful Quantization
The script will now:
- Try torchao first for each quantization type
- Fall back to bitsandbytes if torchao fails
- Use appropriate devices (CPU for int4, GPU for int8)
- Handle serialization issues with alternative save methods (see the save sketch after this list)
- Log progress without monitoring errors
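A sketch of what an alternative save path can look like; safe_serialization is a real save_pretrained argument, but this helper is an assumption about the script's internals:

    def save_with_fallback(model, save_dir: str):
        # Hypothetical helper: torchao-quantized models often cannot be saved
        # as safetensors, so retry with safe_serialization disabled
        try:
            model.save_pretrained(save_dir, safe_serialization=True)
        except Exception as e:
            logger.warning(f"⚠️ Safetensors save failed ({e}), retrying without safe serialization")
            model.save_pretrained(save_dir, safe_serialization=False)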
Output

    ✅ Model files validated
    Processing quantization type: int8_weight_only
    Using device: cuda
    ✅ int8_weight_only quantization and push completed
    Processing quantization type: int4_weight_only
    Using device: cpu
    ✅ int4_weight_only quantization and push completed
    Quantization summary: 2/2 successful
    ✅ Quantization completed successfully!
Troubleshooting
If All Quantization Fails
1. Install bitsandbytes:

    pip install bitsandbytes

2. Check the model path:

    ls -la /output-checkpoint

3. Verify dependencies:

    python3 test_quantization_fix.py
Common Issues
- Memory Issues: Use CPU for int4 quantization (see the example after this list)
- Serialization Errors: The script now handles these automatically
- Device Conflicts: Automatic device selection based on quantization type
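If you need to pin int4 to CPU explicitly, the alternative method's device parameter accepts this directly; the quantizer instantiation and save directory below are illustrative:

    # Assumes a quantizer object exposing the methods shown above
    output_dir = quantizer.quantize_model_alternative(
        "int4_weight_only",
        device="cpu",                  # avoids the CUDA backend error for int4
        group_size=128,
        save_dir="./quantized-int4",   # illustrative save location
    )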
Files Modified
- scripts/model_tonic/quantize_model.py - Main quantization logic
- quantize_and_push.py - Main script with better error handling
- test_quantization_fix.py - Test script for verification
- requirements_quantization.txt - Dependencies file
Next Steps
- Run the test script to verify fixes
- Install bitsandbytes if not already installed
- Run the quantization script
- Check the Hugging Face repository for quantized models
The fixes ensure robust quantization with multiple fallback options and proper error handling.