Model Quantization Guide

Overview

This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using torchao and automatically uploading them to Hugging Face Hub in a unified repository structure.

Repository Structure

With the updated pipeline, all models (main and quantized) are stored in a single repository:

your-username/model-name/
├── README.md (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/ (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/ (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin

Quantization Types

int8 Weight-Only Quantization (GPU Optimized)

  • Memory Reduction: ~50% compared to original model
  • Speed: Faster inference with minimal accuracy loss
  • Hardware: Optimized for GPU inference
  • Use Case: Production deployments with GPU resources

int4 Weight-Only Quantization (CPU Optimized)

  • Memory Reduction: ~75% compared to original model
  • Speed: Significantly faster inference with some accuracy trade-off
  • Hardware: Optimized for CPU deployment
  • Use Case: Edge deployment, CPU-only environments
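
Both variants are produced with torchao's weight-only quantization. As a rough sketch of the underlying calls (assuming torchao's quantize_ API; the pipeline's scripts/model_tonic/quantize_model.py wraps this together with upload and tracking logic):

import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only, int4_weight_only

# Load the trained checkpoint in full precision first
model = AutoModelForCausalLM.from_pretrained(
    "/output-checkpoint",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# int8 weight-only: ~50% memory reduction, GPU friendly
quantize_(model, int8_weight_only())

# int4 weight-only alternative (~75% reduction; see Configuration Options below)
# quantize_(model, int4_weight_only(group_size=128))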

Integration with Pipeline

Automatic Quantization

The quantization process is integrated into the main training pipeline:

  1. Training: Model is trained using the standard pipeline
  2. Model Push: Main model is pushed to Hugging Face Hub
  3. Quantization Options: User is prompted to create quantized versions
  4. Quantized Models: Quantized models are created and pushed to subdirectories
  5. Unified Documentation: Single model card covers all versions

Pipeline Integration

The quantization step is added to launch.sh after the main model push:

# Step 16.5: Quantization Options
print_step "Step 16.5: Model Quantization Options"
echo "=========================================="

print_info "Would you like to create quantized versions of your model?"
print_info "Quantization reduces model size and improves inference speed."

# Ask about quantization
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"

if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
    print_info "Quantization options:"
    print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
    print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
    print_info "3. Both int8 and int4 versions"
    
    select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"
    
    # Create quantized models in the same repository
    python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
        --quant-type "$QUANT_TYPE" \
        --device "$DEVICE" \
        --token "$HF_TOKEN" \
        --trackio-url "$TRACKIO_URL" \
        --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
        --dataset-repo "$TRACKIO_DATASET_REPO"
fi

Standalone Quantization

Using the Standalone Script

For models already uploaded to Hugging Face Hub:

python scripts/model_tonic/quantize_standalone.py \
    "your-username/model-name" \
    "your-username/model-name" \
    --quant-type "int8_weight_only" \
    --device "auto" \
    --token "your-hf-token"

Command Line Options

python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]

Options:
  --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
                        Quantization type (default: int8_weight_only)
  --device DEVICE       Device for quantization (auto, cpu, cuda)
  --group-size GROUP_SIZE
                        Group size for quantization (default: 128)
  --token TOKEN         Hugging Face token
  --private             Create private repository
  --trackio-url TRACKIO_URL
                        Trackio URL for monitoring
  --experiment-name EXPERIMENT_NAME
                        Experiment name for tracking
  --dataset-repo DATASET_REPO
                        HF Dataset repository
  --save-only           Save quantized model locally without pushing to HF
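
Under the hood the script roughly performs the following steps. This is a hedged sketch (assuming torchao's quantize_ API and huggingface_hub's upload_folder); the actual script adds error handling and Trackio logging:

import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only
from huggingface_hub import HfApi

# 1. Load the source model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# 2. Quantize in place
quantize_(model, int8_weight_only())

# 3. Save locally (safe_serialization=False: torchao tensor subclasses
#    are not safetensors-compatible)
model.save_pretrained("./quantized-int8", safe_serialization=False)

# 4. Upload into the int8/ subdirectory of the same repository
api = HfApi(token="your-hf-token")
api.upload_folder(
    folder_path="./quantized-int8",
    repo_id="your-username/model-name",
    path_in_repo="int8",
)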

Loading Quantized Models

Loading Main Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the main model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

Loading int8 Quantized Model (GPU)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model from the int8/ subdirectory of the repo
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
# Tokenizer files are stored at the repository root (see Repository Structure)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

Loading int4 Quantized Model (CPU)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 quantized model from the int4/ subdirectory of the repo
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
# Tokenizer files are stored at the repository root
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

Usage Examples

Text Generation with Quantized Model

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized variant from its subdirectory
model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Conversation with Quantized Model

def chat_with_quantized_model(prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)

Configuration Options

Quantization Parameters

  • group_size: Group size for quantization (default: 128)
  • device: Target device for quantization (auto, cpu, cuda)
  • quant_type: Type of quantization to apply
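
For int4 quantization, group_size determines how many weights share a single quantization scale: smaller groups preserve more accuracy but store more quantization parameters. A sketch of how the default maps onto torchao (the exact wiring inside quantize_model.py may differ):

from torchao.quantization import quantize_, int4_weight_only

# group_size=128 is the default; smaller values (e.g. 32) trade extra
# storage for better accuracy, larger values do the opposite
quantize_(model, int4_weight_only(group_size=128))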

Hardware Requirements

  • Main Model: GPU with 8GB+ VRAM recommended
  • int8 Model: GPU with 4GB+ VRAM
  • int4 Model: CPU deployment possible
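
To check a loaded model against these budgets, transformers provides a footprint helper (figures for torchao tensor subclasses are approximate):

# Rough memory check for any of the loaded variants
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GB")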

Performance Comparison

Model Type   Memory Usage   Speed      Accuracy       Use Case
Original     100%           Baseline   Best           Development, Research
int8         ~50%           Faster     Minimal loss   Production GPU
int4         ~25%           Fastest    Some loss      Edge, CPU deployment
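
Actual figures depend on hardware, batch size, and sequence length, so it is worth measuring on your own workload. A minimal throughput check, reusing a model and tokenizer loaded as shown above:

import time

inputs = tokenizer("The future of artificial intelligence is", return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")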

Best Practices

When to Use Quantization

  1. int8 (GPU): When you need faster inference with minimal accuracy loss
  2. int4 (CPU): When deploying to CPU-only environments or edge devices
  3. Both: When you need flexibility for different deployment scenarios

Memory Optimization

  • Use int8 for GPU deployments with memory constraints
  • Use int4 for CPU deployments or very memory-constrained environments
  • Consider the trade-off between speed and accuracy

Deployment Considerations

  • Test quantized models on your specific use case
  • Monitor performance and accuracy in production
  • Consider using the main model for development and quantized versions for deployment

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Reduce batch size or use int8 quantization
  2. Import Errors: Install torchao with pip install "torchao>=0.10.0" (quote the specifier so the shell does not treat >= as a redirect)
  3. Model Loading Errors: Ensure the model path is correct and accessible

Debugging

# Test quantization functionality
python tests/test_quantization.py

# Check torchao installation
python -c "import torchao; print('torchao available')"

# Verify model files
ls -la /path/to/model/

Monitoring and Tracking

Trackio Integration

Quantization events are logged to Trackio:

  • quantization_started: When quantization begins
  • quantization_completed: When quantization finishes
  • quantized_model_pushed: When model is uploaded to HF Hub
  • quantization_failed: If quantization fails

Metrics Tracked

  • Quantization type and parameters
  • Model size reduction
  • Upload URLs for quantized models
  • Processing time and success status
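
If you want to emit similar events from your own scripts, a hedged illustration using trackio's wandb-style API (an assumption about the client interface; the pipeline routes these events through its own monitoring wrapper, including the Space URL and dataset repository):

import trackio  # assumption: trackio exposes a wandb-style init/log/finish API

trackio.init(project="model-quantization", name="my-experiment-int8")
trackio.log({"event": "quantization_started", "quant_type": "int8_weight_only"})
# ... run quantization ...
trackio.log({"event": "quantization_completed", "memory_reduction_pct": 50})
trackio.finish()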

Dependencies

Required Packages

pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"

Optional Dependencies

pip install "accelerate>=0.20.0"  # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization

Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review the tests in tests/test_quantization.py
  3. Open an issue on the project repository
  4. Check the Trackio monitoring for detailed logs