# Model Quantization Guide
## Overview
This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using torchao
and automatically uploading them to Hugging Face Hub in a unified repository structure.
## Repository Structure

With the updated pipeline, all models (main and quantized) are stored in a single repository:

```
your-username/model-name/
├── README.md (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/ (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/ (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin
```
## Quantization Types

### int8 Weight-Only Quantization (GPU Optimized)
- Memory Reduction: ~50% compared to original model
- Speed: Faster inference with minimal accuracy loss
- Hardware: GPU optimized for high-performance inference
- Use Case: Production deployments with GPU resources
### int4 Weight-Only Quantization (CPU Optimized)
- Memory Reduction: ~75% compared to original model
- Speed: Significantly faster inference with some accuracy trade-off
- Hardware: CPU optimized for deployment
- Use Case: Edge deployment, CPU-only environments
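Both variants are produced with torchao's weight-only quantization APIs. A minimal sketch of what the pipeline does under the hood (the checkpoint path is illustrative, and exact torchao entry points may vary by version):

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_weight_only, int4_weight_only

# Load the trained model in full precision first (path is illustrative)
model = AutoModelForCausalLM.from_pretrained(
    "/output-checkpoint",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# int8 weight-only: ~50% memory reduction, GPU-friendly
quantize_(model, int8_weight_only())

# ...or int4 weight-only: ~75% memory reduction; group_size controls
# how many weights share one scale (see Configuration Options below)
# quantize_(model, int4_weight_only(group_size=128))
```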
## Integration with Pipeline

### Automatic Quantization

The quantization process is integrated into the main training pipeline:
- Training: Model is trained using the standard pipeline
- Model Push: Main model is pushed to Hugging Face Hub
- Quantization Options: User is prompted to create quantized versions
- Quantized Models: Quantized models are created and pushed to subdirectories
- Unified Documentation: Single model card covers all versions
### Pipeline Integration

The quantization step is added to `launch.sh` after the main model push:
```bash
# Step 16.5: Quantization Options
print_step "Step 16.5: Model Quantization Options"
echo "=========================================="

print_info "Would you like to create quantized versions of your model?"
print_info "Quantization reduces model size and improves inference speed."

# Ask about quantization
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"

if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
    print_info "Quantization options:"
    print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
    print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
    print_info "3. Both int8 and int4 versions"

    select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"

    # Create quantized models in the same repository
    python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
        --quant-type "$QUANT_TYPE" \
        --device "$DEVICE" \
        --token "$HF_TOKEN" \
        --trackio-url "$TRACKIO_URL" \
        --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
        --dataset-repo "$TRACKIO_DATASET_REPO"
fi
```
## Standalone Quantization

### Using the Standalone Script

For models already uploaded to Hugging Face Hub:
```bash
python scripts/model_tonic/quantize_standalone.py \
    "your-username/model-name" \
    "your-username/model-name" \
    --quant-type "int8_weight_only" \
    --device "auto" \
    --token "your-hf-token"
```
### Command Line Options

```
python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]

Options:
  --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
                        Quantization type (default: int8_weight_only)
  --device DEVICE       Device for quantization (auto, cpu, cuda)
  --group-size GROUP_SIZE
                        Group size for quantization (default: 128)
  --token TOKEN         Hugging Face token
  --private             Create private repository
  --trackio-url TRACKIO_URL
                        Trackio URL for monitoring
  --experiment-name EXPERIMENT_NAME
                        Experiment name for tracking
  --dataset-repo DATASET_REPO
                        HF Dataset repository
  --save-only           Save quantized model locally without pushing to HF
```
## Loading Quantized Models

### Loading Main Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the main model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int8 Quantized Model (GPU)

The quantized weights live in a subdirectory of the main repository, so pass `subfolder` rather than appending the directory to the repo id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model (GPU optimized) from the int8/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# The tokenizer lives at the repository root (see Repository Structure above)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int4 Quantized Model (CPU)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 quantized model (CPU optimized) from the int4/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)
# The tokenizer lives at the repository root
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
## Usage Examples

### Text Generation with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model from its subdirectory
model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Conversation with Quantized Model

```python
def chat_with_quantized_model(prompt, max_length=100):
    # Reuses the model and tokenizer loaded in the previous example
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)
```
## Configuration Options

### Quantization Parameters

- group_size: Group size for quantization (default: 128); the sketch below illustrates the trade-off it controls
- device: Target device for quantization (auto, cpu, cuda)
- quant_type: Type of quantization to apply (int8_weight_only, int4_weight_only, or int8_dynamic)
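The group size sets how many weight elements share a single quantization scale: smaller groups track the weight distribution more closely (better accuracy) at the cost of storing more scale metadata (slightly more memory). A hedged sketch of the trade-off, assuming `model` is a freshly loaded full-precision model as in the earlier torchao example:

```python
from torchao.quantization import quantize_, int4_weight_only

# Smaller groups: more scale metadata, usually better accuracy
quantize_(model, int4_weight_only(group_size=64))

# Larger groups: less metadata, smaller model, potentially more error
# quantize_(model, int4_weight_only(group_size=256))
```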
### Hardware Requirements
- Main Model: GPU with 8GB+ VRAM recommended
- int8 Model: GPU with 4GB+ VRAM
- int4 Model: CPU deployment possible
## Performance Comparison

| Model Type | Memory Usage | Speed    | Accuracy     | Use Case               |
|------------|--------------|----------|--------------|------------------------|
| Original   | 100%         | Baseline | Best         | Development, research  |
| int8       | ~50%         | Faster   | Minimal loss | Production GPU         |
| int4       | ~25%         | Fastest  | Some loss    | Edge, CPU deployment   |
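These figures are approximations; to sanity-check them on your own model, you can sum the sizes of parameters and buffers before and after quantization. A minimal sketch (packed low-bit tensors may make this an estimate rather than an exact count):

```python
def model_size_gb(model) -> float:
    """Approximate in-memory size of a model's parameters and buffers."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1024**3

print(f"Model size: {model_size_gb(model):.2f} GB")
```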
## Best Practices

### When to Use Quantization
- int8 (GPU): When you need faster inference with minimal accuracy loss
- int4 (CPU): When deploying to CPU-only environments or edge devices
- Both: When you need flexibility for different deployment scenarios
### Memory Optimization
- Use int8 for GPU deployments with memory constraints
- Use int4 for CPU deployments or very memory-constrained environments
- Consider the trade-off between speed and accuracy
### Deployment Considerations

- Test quantized models on your specific use case; a quick side-by-side check is sketched below
- Monitor performance and accuracy in production
- Consider using the main model for development and quantized versions for deployment
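One simple form of such a test is comparing generations from the original and quantized versions on your own prompts; a rough sketch (repository names are placeholders, and both models are assumed to fit on one device):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-username/model-name"
tokenizer = AutoTokenizer.from_pretrained(repo)
original = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")
quantized = AutoModelForCausalLM.from_pretrained(repo, subfolder="int8", device_map="auto")

prompt = "Summarize the benefits of model quantization."

# Greedy decoding makes the two outputs directly comparable
for name, model in [("original", original), ("int8", quantized)]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(f"--- {name} ---")
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```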
## Troubleshooting

### Common Issues

- CUDA Out of Memory: Reduce the batch size or use int8 quantization
- Import Errors: Install torchao with `pip install "torchao>=0.10.0"`
- Model Loading Errors: Ensure the model path is correct and accessible
### Debugging

```bash
# Test quantization functionality
python tests/test_quantization.py

# Check torchao installation
python -c "import torchao; print('torchao available')"

# Verify model files
ls -la /path/to/model/
```
## Monitoring and Tracking

### Trackio Integration

Quantization events are logged to Trackio:

- `quantization_started`: when quantization begins
- `quantization_completed`: when quantization finishes
- `quantized_model_pushed`: when the model is uploaded to HF Hub
- `quantization_failed`: if quantization fails
### Metrics Tracked
- Quantization type and parameters
- Model size reduction
- Upload URLs for quantized models
- Processing time and success status
## Dependencies

### Required Packages

```bash
pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"
```

### Optional Dependencies

```bash
pip install "accelerate>=0.20.0"    # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization options
```
## Support

For issues and questions:

- Check the troubleshooting section above
- Review the tests in `tests/test_quantization.py`
- Open an issue on the project repository
- Check the Trackio monitoring for detailed logs