# Model Quantization Guide

## Overview

This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using `torchao` and automatically uploading them to Hugging Face Hub in a unified repository structure.

## Repository Structure

With the updated pipeline, all models (main and quantized) are stored in a single repository:

```
your-username/model-name/
├── README.md               (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/                   (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/                   (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin
```
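Because the quantized variants live in subdirectories of a single repository, you can fetch one variant without pulling the full-precision weights. A minimal sketch using `huggingface_hub` (the repository name and file patterns are illustrative):

```python
from huggingface_hub import snapshot_download

# Fetch only the int4 variant plus the shared tokenizer/config files,
# skipping the full-precision weights at the repository root
local_dir = snapshot_download(
    repo_id="your-username/model-name",
    allow_patterns=["int4/*", "tokenizer*", "*.json"],
)
print(f"Downloaded to {local_dir}")
```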
## Quantization Types

### int8 Weight-Only Quantization (GPU Optimized)

- **Memory Reduction**: ~50% compared to the original model
- **Speed**: Faster inference with minimal accuracy loss
- **Hardware**: Optimized for high-performance GPU inference
- **Use Case**: Production deployments with GPU resources

### int4 Weight-Only Quantization (CPU Optimized)

- **Memory Reduction**: ~75% compared to the original model
- **Speed**: Significantly faster inference with some accuracy trade-off
- **Hardware**: Optimized for CPU deployment
- **Use Case**: Edge deployment, CPU-only environments
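Both types correspond to `torchao` weight-only schemes exposed through the `transformers` integration. As a minimal sketch (repository name and `group_size` are illustrative), the same quantization can also be applied at load time to a full-precision checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

# Quantize weights to int4 on the fly while loading the full-precision model
quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",  # placeholder repository
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)
```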
## Integration with Pipeline

### Automatic Quantization

The quantization process is integrated into the main training pipeline:

1. **Training**: Model is trained using the standard pipeline
2. **Model Push**: Main model is pushed to Hugging Face Hub
3. **Quantization Options**: User is prompted to create quantized versions
4. **Quantized Models**: Quantized models are created and pushed to subdirectories
5. **Unified Documentation**: Single model card covers all versions

### Pipeline Integration

The quantization step is added to `launch.sh` after the main model push:
```bash
# Step 16.5: Quantization Options
print_step "Step 16.5: Model Quantization Options"
echo "=========================================="

print_info "Would you like to create quantized versions of your model?"
print_info "Quantization reduces model size and improves inference speed."

# Ask about quantization
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"

if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
    print_info "Quantization options:"
    print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
    print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
    print_info "3. Both int8 and int4 versions"

    select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"

    # Create quantized models in the same repository
    python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
        --quant-type "$QUANT_TYPE" \
        --device "$DEVICE" \
        --token "$HF_TOKEN" \
        --trackio-url "$TRACKIO_URL" \
        --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
        --dataset-repo "$TRACKIO_DATASET_REPO"
fi
```
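Internally, pushing into a subdirectory of the unified repository can be done with `huggingface_hub`'s `upload_folder`. The following is a hypothetical sketch of the quantize-and-push step, not the actual `quantize_model.py` implementation; function name and defaults are illustrative:

```python
import torch
from huggingface_hub import HfApi
from transformers import AutoModelForCausalLM, TorchAoConfig

def quantize_and_push(model_path: str, repo_id: str,
                      quant_type: str = "int8_weight_only", token: str = None):
    # Apply torchao weight-only quantization while loading the trained checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        quantization_config=TorchAoConfig(quant_type, group_size=128),
    )
    # torchao tensor subclasses are not safetensors-serializable,
    # so save with safe_serialization=False
    subdir = "int8" if quant_type.startswith("int8") else "int4"
    local_dir = f"./quantized-{subdir}"
    model.save_pretrained(local_dir, safe_serialization=False)
    # Upload into the matching subdirectory of the unified repository
    HfApi(token=token).upload_folder(
        repo_id=repo_id, folder_path=local_dir, path_in_repo=subdir
    )
```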
## Standalone Quantization

### Using the Standalone Script

For models already uploaded to Hugging Face Hub:

```bash
python scripts/model_tonic/quantize_standalone.py \
  "your-username/model-name" \
  "your-username/model-name" \
  --quant-type "int8_weight_only" \
  --device "auto" \
  --token "your-hf-token"
```

### Command Line Options

```bash
python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]

Options:
  --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
                        Quantization type (default: int8_weight_only)
  --device DEVICE       Device for quantization (auto, cpu, cuda)
  --group-size GROUP_SIZE
                        Group size for quantization (default: 128)
  --token TOKEN         Hugging Face token
  --private             Create private repository
  --trackio-url TRACKIO_URL
                        Trackio URL for monitoring
  --experiment-name EXPERIMENT_NAME
                        Experiment name for tracking
  --dataset-repo DATASET_REPO
                        HF Dataset repository
  --save-only           Save quantized model locally without pushing to HF
```
## Loading Quantized Models

### Loading Main Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the main (full-precision) model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int8 Quantized Model (GPU)

The quantized weights live in the `int8/` subdirectory, so pass `subfolder` rather than appending the directory to the repository ID. Tokenizer files are stored at the repository root:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model (GPU optimized) from the int8/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# Tokenizer files live at the repository root
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int4 Quantized Model (CPU)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 quantized model (CPU optimized) from the int4/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
# Tokenizer files live at the repository root
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
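To confirm the expected memory savings, `transformers` exposes `get_memory_footprint()` on loaded models. A quick sanity check, reusing a `model` loaded as in the examples above:

```python
# Report the loaded model's memory footprint (the absolute value depends on
# model size; compare main vs. int8 vs. int4 variants)
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```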
## Usage Examples

### Text Generation with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the quantized model (int8 variant)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name", subfolder="int8"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Conversation with Quantized Model

```python
# Reuses the `model` and `tokenizer` loaded in the previous example
def chat_with_quantized_model(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)
```
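If the fine-tuned tokenizer ships a chat template (as instruction-tuned SmolLM3 checkpoints typically do), prefer `apply_chat_template` over raw prompt strings. A minimal sketch, again reusing `model` and `tokenizer` from above:

```python
# Format the conversation with the tokenizer's chat template
messages = [{"role": "user", "content": "Hello, how are you today?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```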
## Configuration Options

### Quantization Parameters

- **group_size**: Group size for quantization (default: 128)
- **device**: Target device for quantization (auto, cpu, cuda)
- **quant_type**: Type of quantization to apply
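These parameters map onto `torchao`'s quantization configs. A minimal sketch of applying them directly to an already-loaded full-precision model (API names as in recent torchao releases; check your installed version):

```python
from torchao.quantization import quantize_, int4_weight_only

# In-place int4 weight-only quantization; group_size mirrors the script default
quantize_(model, int4_weight_only(group_size=128))
```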
### Hardware Requirements

- **Main Model**: GPU with 8GB+ VRAM recommended
- **int8 Model**: GPU with 4GB+ VRAM
- **int4 Model**: CPU deployment possible

## Performance Comparison

| Model Type | Memory Usage | Speed    | Accuracy     | Use Case              |
|------------|--------------|----------|--------------|-----------------------|
| Original   | 100%         | Baseline | Best         | Development, research |
| int8       | ~50%         | Faster   | Minimal loss | Production GPU        |
| int4       | ~25%         | Fastest  | Some loss    | Edge, CPU deployment  |
## Best Practices

### When to Use Quantization

1. **int8 (GPU)**: When you need faster inference with minimal accuracy loss
2. **int4 (CPU)**: When deploying to CPU-only environments or edge devices
3. **Both**: When you need flexibility for different deployment scenarios

### Memory Optimization

- Use int8 for GPU deployments with memory constraints
- Use int4 for CPU deployments or very memory-constrained environments
- Consider the trade-off between speed and accuracy

### Deployment Considerations

- Test quantized models on your specific use case
- Monitor performance and accuracy in production
- Consider using the main model for development and quantized versions for deployment
## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch size or use int8 quantization
2. **Import Errors**: Install torchao: `pip install "torchao>=0.10.0"`
3. **Model Loading Errors**: Ensure the model path is correct and accessible

### Debugging

```bash
# Test quantization functionality
python tests/test_quantization.py

# Check torchao installation
python -c "import torchao; print('torchao available')"

# Verify model files
ls -la /path/to/model/
```
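To verify that quantization was actually applied to a loaded model, you can inspect the weight tensors: `torchao` replaces plain tensors with quantized tensor subclasses (class names vary by torchao version), so a quick, illustrative check is:

```python
import torch

# Print the tensor class of the first linear weight; a quantized model shows a
# torchao subclass (e.g. an affine-quantized tensor) instead of a plain Tensor
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name, type(module.weight.data).__name__)
        break
```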
## Monitoring and Tracking

### Trackio Integration

Quantization events are logged to Trackio:

- `quantization_started`: When quantization begins
- `quantization_completed`: When quantization finishes
- `quantized_model_pushed`: When the model is uploaded to HF Hub
- `quantization_failed`: If quantization fails

### Metrics Tracked

- Quantization type and parameters
- Model size reduction
- Upload URLs for quantized models
- Processing time and success status
## Dependencies

### Required Packages

```bash
pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"
```

### Optional Dependencies

```bash
pip install "accelerate>=0.20.0"    # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization backends
```
## References

- [torchao Quantization (Transformers docs)](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
- [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards)
- [Transformers Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)

## Support

For issues and questions:

1. Check the troubleshooting section above
2. Review the test files in `tests/test_quantization.py`
3. Open an issue on the project repository
4. Check the Trackio monitoring for detailed logs