# Quantization Implementation Summary
This document summarizes the torchao quantization features that have been added to the SmolLM3 fine-tuning pipeline.
## 🚀 New Features Added
### 1. Core Quantization Scripts
#### `scripts/model_tonic/quantize_model.py`
- **Main quantization script** with full HF Hub integration
- Supports int8 (GPU) and int4 (CPU) quantization
- Automatic model card and README generation
- Trackio monitoring integration
- Comprehensive error handling and validation
#### `scripts/model_tonic/quantize_standalone.py`
- **Standalone quantization script** for independent use
- Simple command-line interface
- Option to save locally without pushing to HF Hub
- Quick quantization workflow
### 2. Pipeline Integration
#### Updated `launch.sh`
- **Interactive quantization prompts** after model training
- Support for single or dual quantization (int8 + int4)
- Automatic repository naming with quantization suffixes
- Enhanced summary reporting with quantization results
### 3. Documentation
#### `docs/QUANTIZATION_GUIDE.md`
- **Comprehensive quantization guide**
- Usage examples and best practices
- Performance comparisons
- Troubleshooting section
- Advanced configuration options
#### Updated `README.md`
- **Quantization section** with quick start examples
- Integration with main pipeline documentation
- Loading quantized models examples
### 4. Testing
#### `tests/test_quantization.py`
- **Comprehensive test suite** for quantization functionality
- Tests for imports, initialization, configuration creation
- Model validation and documentation generation tests
- Automated testing workflow
### 5. Dependencies
#### Updated `requirements/requirements.txt`
- **Added torchao>=0.10.0** for quantization support
- Maintains compatibility with existing dependencies
## 🔧 Quantization Types Supported
### int8_weight_only (GPU Optimized)
- **Memory Reduction**: ~50%
- **Accuracy**: Minimal degradation
- **Speed**: Faster inference
- **Hardware**: GPU optimized
- **Use Case**: High-performance inference on GPU
### int4_weight_only (CPU Optimized)
- **Memory Reduction**: ~75%
- **Accuracy**: Some degradation (typically acceptable)
- **Speed**: Significantly faster inference
- **Hardware**: CPU optimized
- **Use Case**: Deployment on CPU or memory-constrained environments
### int8_dynamic (Dynamic Quantization)
- **Memory Reduction**: ~50%
- **Accuracy**: Minimal degradation
- **Speed**: Faster inference
- **Hardware**: GPU optimized
- **Use Case**: Inference with activations quantized dynamically at runtime alongside int8 weights
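In the Hugging Face stack, these types are selected through `TorchAoConfig`. The sketch below shows loading a model with int4 weight-only quantization applied on the fly; the model id is a placeholder, and the exact config strings and `group_size` defaults depend on the installed transformers/torchao versions:
```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

# int4 weight-only quantization; group_size trades accuracy against compression
quant_config = TorchAoConfig("int4_weight_only", group_size=128)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",  # placeholder model id
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
```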
## 📋 Usage Examples
### Interactive Pipeline (launch.sh)
```bash
./launch.sh
# Complete training and model push
# Choose quantization options when prompted:
# - y/n for quantization
# - int8_weight_only / int4_weight_only / both
```
### Standalone Quantization
```bash
# Quantize and push to HF Hub
python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
    --quant-type int8_weight_only \
    --token YOUR_HF_TOKEN

# Quantize and save locally
python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
    --quant-type int4_weight_only \
    --device cpu \
    --save-only
```
### Loading Quantized Models
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load int8 quantized model (GPU)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-int8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load int4 quantized model (CPU)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)
```
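Once loaded, a quantized model behaves like any other transformers model. Continuing the snippet above (repository names are placeholders):
```python
tokenizer = AutoTokenizer.from_pretrained("your-username/model-int8")

# Tokenize a prompt and generate with the quantized model
inputs = tokenizer("What does int8 quantization change?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```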
## 🧪 Testing
Run the quantization tests:
```bash
python tests/test_quantization.py
```
Tests cover:
- Import validation
- Quantizer initialization
- Configuration creation
- Model validation
- Documentation generation
## 📊 Performance Comparison
| Model Type | Memory Usage | Speed | Accuracy | Hardware |
|------------|--------------|-------|----------|----------|
| Original | 100% | Baseline | Best | GPU/CPU |
| int8 | ~50% | Faster | Minimal loss | GPU |
| int4 | ~25% | Fastest | Some loss | CPU |
## 🔍 Key Features
### 1. Automatic Integration
- Seamlessly integrated into the main training pipeline
- Interactive prompts for quantization options
- Automatic repository creation and naming
### 2. Comprehensive Documentation
- Automatic model card generation
- Detailed README creation
- Usage examples and best practices
### 3. Monitoring Integration
- Trackio logging for quantization events
- Performance metrics tracking
- Artifact storage and versioning (see the sketch below)
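To illustrate the kind of events involved, here is a minimal, hypothetical sketch using Trackio's wandb-style API; the project name, event names, and fields are illustrative, not the exact ones emitted by `quantize_model.py`:
```python
import trackio

# Start a run for this quantization job (project name is illustrative)
trackio.init(
    project="smollm3-quantization",
    config={"quant_type": "int8_weight_only", "source_model": "/path/to/model"},
)

# Log milestones and metrics as the job progresses
trackio.log({"event": "quantization_started"})
trackio.log({"event": "quantization_completed", "memory_reduction_pct": 50})

trackio.finish()
```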
### 4. Error Handling
- Robust validation of model paths
- Graceful handling of quantization failures
- Detailed error messages and logging
### 5. Flexibility
- Support for multiple quantization types
- Standalone usage option
- Custom configuration options
## 🛠️ Technical Implementation
### Core Components
1. **ModelQuantizer Class**
- Main quantization orchestration
- HF Hub integration
- Trackio monitoring
- Error handling and validation
2. **Quantization Configuration**
- torchao configuration management
- Device-specific optimizations
- Group size and parameter tuning
3. **Documentation Generation**
- Automatic model card creation
- README generation with usage examples
- Performance and limitation documentation
4. **Pipeline Integration**
- Interactive prompts in launch.sh
- Automatic repository naming
- Enhanced summary reporting
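For orientation, here is a minimal sketch of the core quantize-and-save step, assuming torchao's `quantize_` API; the real `ModelQuantizer` layers Hub upload, Trackio logging, and validation on top of this, and the function name here is hypothetical:
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only

def quantize_and_save(model_path: str, output_dir: str) -> None:
    # Load the fine-tuned model in bf16 before quantizing
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Apply int8 weight-only quantization to the linear layers in place
    quantize_(model, int8_weight_only())
    # torchao tensor subclasses are not safetensors-compatible in all
    # versions, so serialize with safe_serialization=False
    model.save_pretrained(output_dir, safe_serialization=False)
```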
## 📈 Benefits
### For Users
- **Easy Integration**: Seamless addition to existing pipeline
- **Multiple Options**: Choose quantization type based on needs
- **Performance**: Significant memory and speed improvements
- **Documentation**: Model cards and READMEs generated automatically
### For Deployment
- **GPU Optimization**: int8 for high-performance inference
- **CPU Optimization**: int4 for resource-constrained environments
- **Memory Efficiency**: 50-75% memory reduction
- **Speed Improvement**: Faster inference times
## 🔮 Future Enhancements
### Planned Features
1. **Additional Quantization Types**: Support for more torchao configurations
2. **Automated Benchmarking**: Performance comparison tools
3. **Batch Quantization**: Process multiple models simultaneously
4. **Custom Configurations**: Advanced quantization parameter tuning
5. **Integration Testing**: End-to-end quantization workflow tests
### Potential Improvements
1. **Quantization-Aware Training**: Support for QAT workflows
2. **Mixed Precision**: Advanced precision optimization
3. **Hardware-Specific**: Optimizations for specific GPU/CPU types
4. **Automated Selection**: Smart quantization type selection
## 📚 References
- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
- [Hugging Face Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
## 🎯 Summary
The quantization implementation provides a complete, production-ready solution for creating optimized versions of fine-tuned SmolLM3 models. The integration is seamless, the documentation is comprehensive, and the functionality is robust and well-tested.
Key achievements:
- ✅ Full pipeline integration
- ✅ Multiple quantization types
- ✅ Comprehensive documentation
- ✅ Robust error handling
- ✅ Testing suite
- ✅ Monitoring integration
- ✅ Standalone usage option
The implementation follows the repository's architecture patterns and maintains consistency with existing code structure and documentation standards.