# Model Quantization Guide
## Overview
This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using `torchao` and automatically uploading them to Hugging Face Hub in a unified repository structure.
## Repository Structure
With the updated pipeline, all models (main and quantized) are stored in a single repository:
```
your-username/model-name/
├── README.md               (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/                   (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/                   (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin
```
## Quantization Types
### int8 Weight-Only Quantization (GPU Optimized)
- **Memory Reduction**: ~50% compared to original model
- **Speed**: Faster inference with minimal accuracy loss
- **Hardware**: GPU optimized for high-performance inference
- **Use Case**: Production deployments with GPU resources
### int4 Weight-Only Quantization (CPU Optimized)
- **Memory Reduction**: ~75% compared to original model
- **Speed**: Significantly faster inference with some accuracy trade-off
- **Hardware**: CPU optimized for deployment
- **Use Case**: Edge deployment, CPU-only environments
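
Both variants are weight-only schemes from `torchao`. As a rough illustration (not necessarily the exact code in the pipeline's `quantize_model.py`), this is how such a scheme is typically applied when loading a model through `transformers`, using the placeholder repository name from the rest of this guide:

```python
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

# Sketch: apply int4 weight-only quantization while loading the trained model.
# For the int8 variant, use TorchAoConfig("int8_weight_only") instead.
quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",   # placeholder repository name
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)
```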
## Integration with Pipeline
### Automatic Quantization
The quantization process is integrated into the main training pipeline:
1. **Training**: Model is trained using the standard pipeline
2. **Model Push**: Main model is pushed to Hugging Face Hub
3. **Quantization Options**: User is prompted to create quantized versions
4. **Quantized Models**: Quantized models are created and pushed to subdirectories
5. **Unified Documentation**: Single model card covers all versions
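
Step 4 above places each quantized variant in a subfolder of the existing repository rather than in a new repo. A minimal sketch of that upload step, assuming the quantized model has already been saved to a local directory (paths, repo id, and token are placeholders, not the exact code in `quantize_model.py`):

```python
from huggingface_hub import HfApi

# Sketch: push an already-saved int8 quantized model into the int8/ subfolder
# of the main model repository. Paths, repo id, and token are placeholders.
api = HfApi(token="your-hf-token")
api.upload_folder(
    folder_path="/output-checkpoint/quantized-int8",  # hypothetical local directory
    repo_id="your-username/model-name",
    repo_type="model",
    path_in_repo="int8",  # lands under int8/ in the unified repository
)
```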
### Pipeline Integration
The quantization step is added to `launch.sh` after the main model push:
```bash
# Step 16.5: Quantization Options
print_step "Step 16.5: Model Quantization Options"
echo "=========================================="

print_info "Would you like to create quantized versions of your model?"
print_info "Quantization reduces model size and improves inference speed."

# Ask about quantization
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"

if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
    print_info "Quantization options:"
    print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
    print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
    print_info "3. Both int8 and int4 versions"

    select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"

    # Create quantized models in the same repository
    python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
        --quant-type "$QUANT_TYPE" \
        --device "$DEVICE" \
        --token "$HF_TOKEN" \
        --trackio-url "$TRACKIO_URL" \
        --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
        --dataset-repo "$TRACKIO_DATASET_REPO"
fi
```
## Standalone Quantization
### Using the Standalone Script
For models already uploaded to Hugging Face Hub:
```bash
python scripts/model_tonic/quantize_standalone.py \
"your-username/model-name" \
"your-username/model-name" \
--quant-type "int8_weight_only" \
--device "auto" \
--token "your-hf-token"
```
### Command Line Options
```bash
python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]

Options:
  --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
                        Quantization type (default: int8_weight_only)
  --device DEVICE       Device for quantization (auto, cpu, cuda)
  --group-size GROUP_SIZE
                        Group size for quantization (default: 128)
  --token TOKEN         Hugging Face token
  --private             Create private repository
  --trackio-url TRACKIO_URL
                        Trackio URL for monitoring
  --experiment-name EXPERIMENT_NAME
                        Experiment name for tracking
  --dataset-repo DATASET_REPO
                        HF Dataset repository
  --save-only           Save quantized model locally without pushing to HF
```
## Loading Quantized Models
### Loading Main Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the main model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```
### Loading int8 Quantized Model (GPU)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model (GPU optimized).
# The quantized weights live in the int8/ subfolder of the main repository.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")
```
### Loading int4 Quantized Model (CPU)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 quantized model (CPU optimized) from the int4/ subfolder
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int4")
```
## Usage Examples
### Text Generation with Quantized Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model from the int8/ subfolder of the repository
model = AutoModelForCausalLM.from_pretrained("your-username/model-name", subfolder="int8")
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name", subfolder="int8")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Conversation with Quantized Model
```python
def chat_with_quantized_model(prompt, max_length=100):
    """Generate a reply from the already-loaded quantized model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)
```
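If the fine-tuned model ships a chat template (as instruction-tuned SmolLM3 checkpoints typically do), the same conversation can be run through `apply_chat_template`. A sketch, assuming `model` and `tokenizer` are already loaded as above:

```python
# Sketch: chat-style generation via the tokenizer's chat template, if it has one
messages = [{"role": "user", "content": "Hello, how are you today?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```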
## Configuration Options
### Quantization Parameters
- **group_size**: Group size for quantization (default: 128)
- **device**: Target device for quantization (auto, cpu, cuda)
- **quant_type**: Type of quantization to apply
### Hardware Requirements
- **Main Model**: GPU with 8GB+ VRAM recommended
- **int8 Model**: GPU with 4GB+ VRAM
- **int4 Model**: CPU deployment possible
## Performance Comparison
| Model Type | Memory Usage | Speed | Accuracy | Use Case |
|------------|--------------|-------|----------|----------|
| Original | 100% | Baseline | Best | Development, Research |
| int8 | ~50% | Faster | Minimal loss | Production GPU |
| int4 | ~25% | Fastest | Some loss | Edge, CPU deployment |
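
The figures above are approximate; a quick way to check the memory numbers on your own checkpoints is `get_memory_footprint()`, assuming the original and quantized models are loaded as shown in the earlier sections (`quantized_model` is a placeholder name):

```python
# Sketch: compare memory footprints of the original and a quantized model
# (model and quantized_model are assumed to be loaded as in the earlier sections)
original_gb = model.get_memory_footprint() / 1e9
quantized_gb = quantized_model.get_memory_footprint() / 1e9
print(f"original: {original_gb:.2f} GB, quantized: {quantized_gb:.2f} GB "
      f"({100 * quantized_gb / original_gb:.0f}% of original)")
```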
## Best Practices
### When to Use Quantization
1. **int8 (GPU)**: When you need faster inference with minimal accuracy loss
2. **int4 (CPU)**: When deploying to CPU-only environments or edge devices
3. **Both**: When you need flexibility for different deployment scenarios
### Memory Optimization
- Use int8 for GPU deployments with memory constraints
- Use int4 for CPU deployments or very memory-constrained environments
- Consider the trade-off between speed and accuracy
### Deployment Considerations
- Test quantized models on your specific use case
- Monitor performance and accuracy in production
- Consider using the main model for development and quantized versions for deployment
## Troubleshooting
### Common Issues
1. **CUDA Out of Memory**: Reduce batch size or use int8 quantization
2. **Import Errors**: Install torchao: `pip install "torchao>=0.10.0"`
3. **Model Loading Errors**: Ensure the model path is correct and accessible
### Debugging
```bash
# Test quantization functionality
python tests/test_quantization.py
# Check torchao installation
python -c "import torchao; print('torchao available')"
# Verify model files
ls -la /path/to/model/
```
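
To confirm which quantization settings ended up in a pushed model, one option is to inspect the configuration stored on the Hub (a sketch; the repository name and subfolder are placeholders):

```python
from transformers import AutoConfig

# Sketch: read the quantization settings recorded in the pushed model's config.json
config = AutoConfig.from_pretrained("your-username/model-name", subfolder="int8")
print(getattr(config, "quantization_config", "no quantization_config found"))
```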
## Monitoring and Tracking
### Trackio Integration
Quantization events are logged to Trackio:
- `quantization_started`: When quantization begins
- `quantization_completed`: When quantization finishes
- `quantized_model_pushed`: When model is uploaded to HF Hub
- `quantization_failed`: If quantization fails
### Metrics Tracked
- Quantization type and parameters
- Model size reduction
- Upload URLs for quantized models
- Processing time and success status
## Dependencies
### Required Packages
```bash
pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"
```
### Optional Dependencies
```bash
pip install "accelerate>=0.20.0"    # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization
```
## References
- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
- [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards)
- [Transformers Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)
## Support
For issues and questions:
1. Check the troubleshooting section above
2. Review the test files in `tests/test_quantization.py`
3. Open an issue on the project repository
4. Check the Trackio monitoring for detailed logs