Spaces:
Running
Running
| # Quantization Implementation Summary | |
| This document summarizes the torchao quantization features that have been added to the SmolLM3 fine-tuning pipeline. | |
| ## ๐ New Features Added | |
| ### 1. Core Quantization Scripts | |
| #### `scripts/model_tonic/quantize_model.py` | |
| - **Main quantization script** with full HF Hub integration | |
| - Supports int8 (GPU) and int4 (CPU) quantization | |
| - Automatic model card and README generation | |
| - Trackio monitoring integration | |
| - Comprehensive error handling and validation | |
| #### `scripts/model_tonic/quantize_standalone.py` | |
| - **Standalone quantization script** for independent use | |
| - Simple command-line interface | |
| - Option to save locally without pushing to HF Hub | |
| - Quick quantization workflow | |
| ### 2. Pipeline Integration | |
| #### Updated `launch.sh` | |
| - **Interactive quantization prompts** after model training | |
| - Support for single or dual quantization (int8 + int4) | |
| - Automatic repository naming with quantization suffixes | |
| - Enhanced summary reporting with quantization results | |
| ### 3. Documentation | |
| #### `docs/QUANTIZATION_GUIDE.md` | |
| - **Comprehensive quantization guide** | |
| - Usage examples and best practices | |
| - Performance comparisons | |
| - Troubleshooting section | |
| - Advanced configuration options | |
| #### Updated `README.md` | |
| - **Quantization section** with quick start examples | |
| - Integration with main pipeline documentation | |
| - Loading quantized models examples | |
| ### 4. Testing | |
| #### `tests/test_quantization.py` | |
| - **Comprehensive test suite** for quantization functionality | |
| - Tests for imports, initialization, configuration creation | |
| - Model validation and documentation generation tests | |
| - Automated testing workflow | |
| ### 5. Dependencies | |
| #### Updated `requirements/requirements.txt` | |
| - **Added torchao>=0.10.0** for quantization support | |
| - Maintains compatibility with existing dependencies | |
| ## ๐ง Quantization Types Supported | |
| ### int8_weight_only (GPU Optimized) | |
| - **Memory Reduction**: ~50% | |
| - **Accuracy**: Minimal degradation | |
| - **Speed**: Faster inference | |
| - **Hardware**: GPU optimized | |
| - **Use Case**: High-performance inference on GPU | |
| ### int4_weight_only (CPU Optimized) | |
| - **Memory Reduction**: ~75% | |
| - **Accuracy**: Some degradation acceptable | |
| - **Speed**: Significantly faster inference | |
| - **Hardware**: CPU optimized | |
| - **Use Case**: Deployment on CPU or memory-constrained environments | |
| ### int8_dynamic (Dynamic Quantization) | |
| - **Memory Reduction**: ~50% | |
| - **Accuracy**: Minimal degradation | |
| - **Speed**: Faster inference | |
| - **Hardware**: GPU optimized | |
| - **Use Case**: Dynamic quantization during inference | |
| ## ๐ Usage Examples | |
| ### Interactive Pipeline (launch.sh) | |
| ```bash | |
| ./launch.sh | |
| # Complete training and model push | |
| # Choose quantization options when prompted: | |
| # - y/n for quantization | |
| # - int8_weight_only / int4_weight_only / both | |
| ``` | |
| ### Standalone Quantization | |
| ```bash | |
| # Quantize and push to HF Hub | |
| python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \ | |
| --quant-type int8_weight_only \ | |
| --token YOUR_HF_TOKEN | |
| # Quantize and save locally | |
| python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \ | |
| --quant-type int4_weight_only \ | |
| --device cpu \ | |
| --save-only | |
| ``` | |
| ### Loading Quantized Models | |
| ```python | |
| import torch | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| # Load int8 quantized model (GPU) | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "your-username/model-int8", | |
| device_map="auto", | |
| torch_dtype=torch.bfloat16 | |
| ) | |
| # Load int4 quantized model (CPU) | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "your-username/model-int4", | |
| device_map="cpu", | |
| torch_dtype=torch.bfloat16 | |
| ) | |
| ``` | |
| ## ๐งช Testing | |
| Run the quantization tests: | |
| ```bash | |
| python tests/test_quantization.py | |
| ``` | |
| Tests cover: | |
| - Import validation | |
| - Quantizer initialization | |
| - Configuration creation | |
| - Model validation | |
| - Documentation generation | |
| ## ๐ Performance Comparison | |
| | Model Type | Memory Usage | Speed | Accuracy | Hardware | | |
| |------------|--------------|-------|----------|----------| | |
| | Original | 100% | Baseline | Best | GPU/CPU | | |
| | int8 | ~50% | Faster | Minimal loss | GPU | | |
| | int4 | ~25% | Fastest | Some loss | CPU | | |
| ## ๐ Key Features | |
| ### 1. Automatic Integration | |
| - Seamlessly integrated into the main training pipeline | |
| - Interactive prompts for quantization options | |
| - Automatic repository creation and naming | |
| ### 2. Comprehensive Documentation | |
| - Automatic model card generation | |
| - Detailed README creation | |
| - Usage examples and best practices | |
| ### 3. Monitoring Integration | |
| - Trackio logging for quantization events | |
| - Performance metrics tracking | |
| - Artifact storage and versioning | |
| ### 4. Error Handling | |
| - Robust validation of model paths | |
| - Graceful handling of quantization failures | |
| - Detailed error messages and logging | |
| ### 5. Flexibility | |
| - Support for multiple quantization types | |
| - Standalone usage option | |
| - Custom configuration options | |
| ## ๐ ๏ธ Technical Implementation | |
| ### Core Components | |
| 1. **ModelQuantizer Class** | |
| - Main quantization orchestration | |
| - HF Hub integration | |
| - Trackio monitoring | |
| - Error handling and validation | |
| 2. **Quantization Configuration** | |
| - torchao configuration management | |
| - Device-specific optimizations | |
| - Group size and parameter tuning | |
| 3. **Documentation Generation** | |
| - Automatic model card creation | |
| - README generation with usage examples | |
| - Performance and limitation documentation | |
| 4. **Pipeline Integration** | |
| - Interactive prompts in launch.sh | |
| - Automatic repository naming | |
| - Enhanced summary reporting | |
| ## ๐ Benefits | |
| ### For Users | |
| - **Easy Integration**: Seamless addition to existing pipeline | |
| - **Multiple Options**: Choose quantization type based on needs | |
| - **Performance**: Significant memory and speed improvements | |
| - **Documentation**: Automatic comprehensive documentation | |
| ### For Deployment | |
| - **GPU Optimization**: int8 for high-performance inference | |
| - **CPU Optimization**: int4 for resource-constrained environments | |
| - **Memory Efficiency**: 50-75% memory reduction | |
| - **Speed Improvement**: Faster inference times | |
| ## ๐ฎ Future Enhancements | |
| ### Planned Features | |
| 1. **Additional Quantization Types**: Support for more torchao configurations | |
| 2. **Automated Benchmarking**: Performance comparison tools | |
| 3. **Batch Quantization**: Process multiple models simultaneously | |
| 4. **Custom Configurations**: Advanced quantization parameter tuning | |
| 5. **Integration Testing**: End-to-end quantization workflow tests | |
| ### Potential Improvements | |
| 1. **Quantization-Aware Training**: Support for QAT workflows | |
| 2. **Mixed Precision**: Advanced precision optimization | |
| 3. **Hardware-Specific**: Optimizations for specific GPU/CPU types | |
| 4. **Automated Selection**: Smart quantization type selection | |
| ## ๐ References | |
| - [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao) | |
| - [Hugging Face Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization) | |
| - [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html) | |
| ## ๐ฏ Summary | |
| The quantization implementation provides a complete, production-ready solution for creating optimized versions of fine-tuned SmolLM3 models. The integration is seamless, the documentation is comprehensive, and the functionality is robust and well-tested. | |
| Key achievements: | |
| - โ Full pipeline integration | |
| - โ Multiple quantization types | |
| - โ Comprehensive documentation | |
| - โ Robust error handling | |
| - โ Testing suite | |
| - โ Monitoring integration | |
| - โ Standalone usage option | |
| The implementation follows the repository's architecture patterns and maintains consistency with existing code structure and documentation standards. |