# Model Quantization Guide
## Overview
This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using torchao
and automatically uploading them to Hugging Face Hub in a unified repository structure.
## Repository Structure

With the updated pipeline, all models (main and quantized) are stored in a single repository:

```
your-username/model-name/
├── README.md (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/ (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/ (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin
```
## Quantization Types

### int8 Weight-Only Quantization (GPU Optimized)
- Memory Reduction: ~50% compared to original model
- Speed: Faster inference with minimal accuracy loss
- Hardware: GPU optimized for high-performance inference
- Use Case: Production deployments with GPU resources
### int4 Weight-Only Quantization (CPU Optimized)
- Memory Reduction: ~75% compared to original model
- Speed: Significantly faster inference with some accuracy trade-off
- Hardware: CPU optimized for deployment
- Use Case: Edge deployment, CPU-only environments
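Both variants are produced with torchao's weight-only quantization APIs. A minimal sketch of what the pipeline does under the hood (the checkpoint path is illustrative, and exact torchao entry points may vary by version):

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only, int4_weight_only

# Load the trained model in full precision first (path is illustrative)
model = AutoModelForCausalLM.from_pretrained(
    "/output-checkpoint",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# int8 weight-only: ~50% memory reduction, GPU-friendly
quantize_(model, int8_weight_only())

# ...or int4 weight-only: ~75% memory reduction; group_size controls
# how many weights share one scale (see Configuration Options below)
# quantize_(model, int4_weight_only(group_size=128))
```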
## Integration with Pipeline

### Automatic Quantization

The quantization process is integrated into the main training pipeline:
- Training: Model is trained using the standard pipeline
- Model Push: Main model is pushed to Hugging Face Hub
- Quantization Options: User is prompted to create quantized versions
- Quantized Models: Quantized models are created and pushed to subdirectories
- Unified Documentation: Single model card covers all versions
### Pipeline Integration

The quantization step is added to `launch.sh` after the main model push:
```bash
# Step 16.5: Quantization Options
print_step "Step 16.5: Model Quantization Options"
echo "=========================================="

print_info "Would you like to create quantized versions of your model?"
print_info "Quantization reduces model size and improves inference speed."

# Ask about quantization
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"

if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
    print_info "Quantization options:"
    print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
    print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
    print_info "3. Both int8 and int4 versions"

    select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"

    # Create quantized models in the same repository
    python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
        --quant-type "$QUANT_TYPE" \
        --device "$DEVICE" \
        --token "$HF_TOKEN" \
        --trackio-url "$TRACKIO_URL" \
        --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
        --dataset-repo "$TRACKIO_DATASET_REPO"
fi
```
## Standalone Quantization

### Using the Standalone Script

For models already uploaded to Hugging Face Hub:
```bash
python scripts/model_tonic/quantize_standalone.py \
    "your-username/model-name" \
    "your-username/model-name" \
    --quant-type "int8_weight_only" \
    --device "auto" \
    --token "your-hf-token"
```
### Command Line Options

```
python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]

Options:
  --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
                        Quantization type (default: int8_weight_only)
  --device DEVICE       Device for quantization (auto, cpu, cuda)
  --group-size GROUP_SIZE
                        Group size for quantization (default: 128)
  --token TOKEN         Hugging Face token
  --private             Create private repository
  --trackio-url TRACKIO_URL
                        Trackio URL for monitoring
  --experiment-name EXPERIMENT_NAME
                        Experiment name for tracking
  --dataset-repo DATASET_REPO
                        HF Dataset repository
  --save-only           Save quantized model locally without pushing to HF
```
## Loading Quantized Models

### Loading Main Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the main model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int8 Quantized Model (GPU)

The quantized weights live in a subdirectory of the main repository, so pass `subfolder` rather than appending the directory to the repo id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model (GPU optimized) from the int8/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# The tokenizer lives at the repository root (see Repository Structure above)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int4 Quantized Model (CPU)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 quantized model (CPU optimized) from the int4/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)
# The tokenizer lives at the repository root
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
## Usage Examples

### Text Generation with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model from its subdirectory
model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Conversation with Quantized Model

```python
def chat_with_quantized_model(prompt, max_length=100):
    # Reuses the model and tokenizer loaded in the previous example
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)
```
## Configuration Options

### Quantization Parameters

- group_size: Group size for quantization (default: 128); the sketch below illustrates the trade-off it controls
- device: Target device for quantization (auto, cpu, cuda)
- quant_type: Type of quantization to apply (int8_weight_only, int4_weight_only, or int8_dynamic)
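The group size sets how many weight elements share a single quantization scale: smaller groups track the weight distribution more closely (better accuracy) at the cost of storing more scale metadata (slightly more memory). A hedged sketch of the trade-off, assuming `model` is a freshly loaded full-precision model as in the earlier torchao example:

```python
from torchao.quantization import quantize_, int4_weight_only

# Smaller groups: more scale metadata, usually better accuracy
quantize_(model, int4_weight_only(group_size=64))

# Larger groups: less metadata, smaller model, potentially more error
# quantize_(model, int4_weight_only(group_size=256))
```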
### Hardware Requirements
- Main Model: GPU with 8GB+ VRAM recommended
- int8 Model: GPU with 4GB+ VRAM
- int4 Model: CPU deployment possible
## Performance Comparison

| Model Type | Memory Usage | Speed    | Accuracy     | Use Case               |
|------------|--------------|----------|--------------|------------------------|
| Original   | 100%         | Baseline | Best         | Development, research  |
| int8       | ~50%         | Faster   | Minimal loss | Production GPU         |
| int4       | ~25%         | Fastest  | Some loss    | Edge, CPU deployment   |
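These figures are approximations; to sanity-check them on your own model, you can sum the sizes of parameters and buffers before and after quantization. A minimal sketch (packed low-bit tensors may make this an estimate rather than an exact count):

```python
def model_size_gb(model) -> float:
    """Approximate in-memory size of a model's parameters and buffers."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1024**3

print(f"Model size: {model_size_gb(model):.2f} GB")
```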
## Best Practices

### When to Use Quantization
- int8 (GPU): When you need faster inference with minimal accuracy loss
- int4 (CPU): When deploying to CPU-only environments or edge devices
- Both: When you need flexibility for different deployment scenarios
### Memory Optimization
- Use int8 for GPU deployments with memory constraints
- Use int4 for CPU deployments or very memory-constrained environments
- Consider the trade-off between speed and accuracy
### Deployment Considerations

- Test quantized models on your specific use case; a quick side-by-side check is sketched below
- Monitor performance and accuracy in production
- Consider using the main model for development and quantized versions for deployment
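One simple form of such a test is comparing generations from the original and quantized versions on your own prompts; a rough sketch (repository names are placeholders, and both models are assumed to fit on one device):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-username/model-name"
tokenizer = AutoTokenizer.from_pretrained(repo)
original = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
quantized = AutoModelForCausalLM.from_pretrained(repo, subfolder="int8", device_map="auto")

prompt = "Summarize the benefits of model quantization."

# Greedy decoding makes the two outputs directly comparable
for name, model in [("original", original), ("int8", quantized)]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(f"--- {name} ---")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```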
## Troubleshooting

### Common Issues

- CUDA Out of Memory: Reduce the batch size or use int8 quantization
- Import Errors: Install torchao with `pip install "torchao>=0.10.0"`
- Model Loading Errors: Ensure the model path is correct and accessible
### Debugging

```bash
# Test quantization functionality
python tests/test_quantization.py

# Check torchao installation
python -c "import torchao; print('torchao available')"

# Verify model files
ls -la /path/to/model/
```
## Monitoring and Tracking

### Trackio Integration

Quantization events are logged to Trackio:

- `quantization_started`: when quantization begins
- `quantization_completed`: when quantization finishes
- `quantized_model_pushed`: when the model is uploaded to HF Hub
- `quantization_failed`: if quantization fails
### Metrics Tracked
- Quantization type and parameters
- Model size reduction
- Upload URLs for quantized models
- Processing time and success status
## Dependencies

### Required Packages

```bash
pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"
```

### Optional Dependencies

```bash
pip install "accelerate>=0.20.0"    # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization options
```
## Support

For issues and questions:

- Check the troubleshooting section above
- Review the tests in `tests/test_quantization.py`
- Open an issue on the project repository
- Check the Trackio monitoring for detailed logs