Spaces:
Running
Running
Quantization Implementation Summary
This document summarizes the torchao quantization features that have been added to the SmolLM3 fine-tuning pipeline.
๐ New Features Added
1. Core Quantization Scripts
scripts/model_tonic/quantize_model.py
- Main quantization script with full HF Hub integration
- Supports int8 (GPU) and int4 (CPU) quantization
- Automatic model card and README generation
- Trackio monitoring integration
- Comprehensive error handling and validation
scripts/model_tonic/quantize_standalone.py
- Standalone quantization script for independent use
- Simple command-line interface
- Option to save locally without pushing to HF Hub
- Quick quantization workflow
2. Pipeline Integration
Updated launch.sh
- Interactive quantization prompts after model training
- Support for single or dual quantization (int8 + int4)
- Automatic repository naming with quantization suffixes
- Enhanced summary reporting with quantization results
3. Documentation
docs/QUANTIZATION_GUIDE.md
- Comprehensive quantization guide
- Usage examples and best practices
- Performance comparisons
- Troubleshooting section
- Advanced configuration options
Updated README.md
- Quantization section with quick start examples
- Integration with main pipeline documentation
- Loading quantized models examples
4. Testing
tests/test_quantization.py
- Comprehensive test suite for quantization functionality
- Tests for imports, initialization, configuration creation
- Model validation and documentation generation tests
- Automated testing workflow
5. Dependencies
Updated requirements/requirements.txt
- Added torchao>=0.10.0 for quantization support
- Maintains compatibility with existing dependencies
๐ง Quantization Types Supported
int8_weight_only (GPU Optimized)
- Memory Reduction: ~50%
- Accuracy: Minimal degradation
- Speed: Faster inference
- Hardware: GPU optimized
- Use Case: High-performance inference on GPU
int4_weight_only (CPU Optimized)
- Memory Reduction: ~75%
- Accuracy: Some degradation acceptable
- Speed: Significantly faster inference
- Hardware: CPU optimized
- Use Case: Deployment on CPU or memory-constrained environments
int8_dynamic (Dynamic Quantization)
- Memory Reduction: ~50%
- Accuracy: Minimal degradation
- Speed: Faster inference
- Hardware: GPU optimized
- Use Case: Dynamic quantization during inference
๐ Usage Examples
Interactive Pipeline (launch.sh)
./launch.sh
# Complete training and model push
# Choose quantization options when prompted:
# - y/n for quantization
# - int8_weight_only / int4_weight_only / both
Standalone Quantization
# Quantize and push to HF Hub
python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
--quant-type int8_weight_only \
--token YOUR_HF_TOKEN
# Quantize and save locally
python scripts/model_tonic/quantize_standalone.py /path/to/model my-username/quantized-model \
--quant-type int4_weight_only \
--device cpu \
--save-only
Loading Quantized Models
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load int8 quantized model (GPU)
model = AutoModelForCausalLM.from_pretrained(
"your-username/model-int8",
device_map="auto",
torch_dtype=torch.bfloat16
)
# Load int4 quantized model (CPU)
model = AutoModelForCausalLM.from_pretrained(
"your-username/model-int4",
device_map="cpu",
torch_dtype=torch.bfloat16
)
๐งช Testing
Run the quantization tests:
python tests/test_quantization.py
Tests cover:
- Import validation
- Quantizer initialization
- Configuration creation
- Model validation
- Documentation generation
๐ Performance Comparison
Model Type | Memory Usage | Speed | Accuracy | Hardware |
---|---|---|---|---|
Original | 100% | Baseline | Best | GPU/CPU |
int8 | ~50% | Faster | Minimal loss | GPU |
int4 | ~25% | Fastest | Some loss | CPU |
๐ Key Features
1. Automatic Integration
- Seamlessly integrated into the main training pipeline
- Interactive prompts for quantization options
- Automatic repository creation and naming
2. Comprehensive Documentation
- Automatic model card generation
- Detailed README creation
- Usage examples and best practices
3. Monitoring Integration
- Trackio logging for quantization events
- Performance metrics tracking
- Artifact storage and versioning
4. Error Handling
- Robust validation of model paths
- Graceful handling of quantization failures
- Detailed error messages and logging
5. Flexibility
- Support for multiple quantization types
- Standalone usage option
- Custom configuration options
๐ ๏ธ Technical Implementation
Core Components
ModelQuantizer Class
- Main quantization orchestration
- HF Hub integration
- Trackio monitoring
- Error handling and validation
Quantization Configuration
- torchao configuration management
- Device-specific optimizations
- Group size and parameter tuning
Documentation Generation
- Automatic model card creation
- README generation with usage examples
- Performance and limitation documentation
Pipeline Integration
- Interactive prompts in launch.sh
- Automatic repository naming
- Enhanced summary reporting
๐ Benefits
For Users
- Easy Integration: Seamless addition to existing pipeline
- Multiple Options: Choose quantization type based on needs
- Performance: Significant memory and speed improvements
- Documentation: Automatic comprehensive documentation
For Deployment
- GPU Optimization: int8 for high-performance inference
- CPU Optimization: int4 for resource-constrained environments
- Memory Efficiency: 50-75% memory reduction
- Speed Improvement: Faster inference times
๐ฎ Future Enhancements
Planned Features
- Additional Quantization Types: Support for more torchao configurations
- Automated Benchmarking: Performance comparison tools
- Batch Quantization: Process multiple models simultaneously
- Custom Configurations: Advanced quantization parameter tuning
- Integration Testing: End-to-end quantization workflow tests
Potential Improvements
- Quantization-Aware Training: Support for QAT workflows
- Mixed Precision: Advanced precision optimization
- Hardware-Specific: Optimizations for specific GPU/CPU types
- Automated Selection: Smart quantization type selection
๐ References
๐ฏ Summary
The quantization implementation provides a complete, production-ready solution for creating optimized versions of fine-tuned SmolLM3 models. The integration is seamless, the documentation is comprehensive, and the functionality is robust and well-tested.
Key achievements:
- โ Full pipeline integration
- โ Multiple quantization types
- โ Comprehensive documentation
- โ Robust error handling
- โ Testing suite
- โ Monitoring integration
- โ Standalone usage option
The implementation follows the repository's architecture patterns and maintains consistency with existing code structure and documentation standards.