# πŸš€ SuperNova Medius Compressed Model (W4A16)
[![Model Size](https://img.shields.io/badge/Size-Compressed-green)]()
[![Quantization](https://img.shields.io/badge/Quantization-W4A16-blue)]()
[![Max Sequence Length](https://img.shields.io/badge/Max%20Length-4096-orange)]()
> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16`
## πŸ“‹ Table of Contents
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Model Details](#model-details)
- [Usage Guide](#usage-guide)
- [Quantization Process](#quantization-process)
- [Technical Details](#technical-details)
- [Limitations & Biases](#limitations--biases)
- [Citations & Acknowledgements](#citations--acknowledgements)
## πŸ” Overview
SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. Using GPTQ post-training quantization, we've achieved a significant reduction in model size while maintaining near-original performance.
### ✨ Key Features
- 4-bit weight quantization
- 16-bit activations (kept at full precision, not quantized)
- 4096 token context window
- Optimized for deployment on consumer hardware
## πŸš€ Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
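For serving, the checkpoint can also be run with vLLM (listed under Dependencies below), which supports compressed-tensors W4A16 models. A minimal sketch; the engine arguments here are illustrative assumptions, not settings taken from this card:

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM reads the compressed-tensors config from the repo.
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```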
## πŸ“Š Model Details
### Specifications
- **Base Model**: arcee-ai/SuperNova-Medius
- **Quantization Method**: GPTQ
- **Maximum Sequence Length**: 4096
- **Calibration Samples**: 1024
### Quantization Parameters
| Parameter | Value |
|-----------|--------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |
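These parameters should also be recorded in the checkpoint's `config.json` by the compressed export, so they can be double-checked without downloading the full weights. A quick sketch, assuming the standard compressed-tensors layout (exact key names may differ):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
# Expect the bits, scheme, and ignored modules to appear here if present.
print(getattr(cfg, "quantization_config", None))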
## πŸ’» Usage Guide
### Basic Usage
See Quick Start section above.
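Since the calibration data was formatted with the tokenizer's chat template (see the Quantization Process section below), instruction-style prompts are best passed through the same template. A minimal sketch reusing `model` and `tokenizer` from the Quick Start:

```python
messages = [{"role": "user", "content": "Explain W4A16 quantization in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```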
### Advanced Usage
```python
# Advanced generation with explicit decoding parameters
inputs = tokenizer("Give three tips for efficient model deployment.", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=100,
    num_beams=4,
    do_sample=True,
    temperature=0.7,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
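For interactive use, token streaming avoids waiting for the full completion. A small sketch using the standard `transformers` streamer, again reusing the Quick Start objects:

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Write a haiku about quantization.", return_tensors="pt")
# Tokens are printed to stdout as they are generated.
model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```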
### Memory Optimization
```python
import torch

# Spread the model across available GPUs (and CPU, if needed) automatically
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
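If `device_map="auto"` still overflows a small GPU, per-device memory caps can push the remainder to CPU RAM at the cost of speed. The limits below are illustrative values, not tuned recommendations:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    max_memory={0: "14GiB", "cpu": "32GiB"},  # example caps; adjust to your hardware
    torch_dtype=torch.bfloat16,
)
```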
## βš™οΈ Quantization Process
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42
# Calculate device map
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)
# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```
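After saving, it is worth reloading the compressed checkpoint once and running a short generation to confirm the export is loadable end to end. A minimal sanity check, assuming the output directory from the script above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

qmodel = AutoModelForCausalLM.from_pretrained(
    "./arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype="auto",
)
qtokenizer = AutoTokenizer.from_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")

inputs = qtokenizer("Quantization sanity check:", return_tensors="pt").to(qmodel.device)
print(qtokenizer.decode(qmodel.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```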
## πŸ› οΈ Technical Details
### Dependencies
| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |
### Hardware Requirements
- **Minimum**: 8GB VRAM
- **Recommended**: 16GB VRAM
- **Optimal**: 24GB VRAM or multiple GPUs
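These figures follow mostly from the weight footprint. A rough back-of-envelope sketch, assuming the ~14B-parameter size reported for the SuperNova-Medius base model (the parameter count is an assumption, not taken from this card):

```python
# Rough weight-memory estimate; parameter count is assumed, not measured.
params = 14e9
w4_gib = params * 4 / 8 / 1024**3     # 4-bit weights  -> roughly 6.5 GiB
bf16_gib = params * 16 / 8 / 1024**3  # 16-bit weights -> roughly 26 GiB
print(f"4-bit weights: ~{w4_gib:.1f} GiB, bf16 weights: ~{bf16_gib:.1f} GiB")
# KV cache, activations, and runtime overhead add several GiB on top, which is
# why 8GB VRAM is a practical floor rather than a comfortable target.
```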
## ⚠️ Limitations & Biases
### Known Limitations
- Slight performance degradation compared to full-precision model
- Limited to 4096 token context window
- May require careful memory management on consumer GPUs
### Inherited Biases
- Carries over biases from base model
- Users should implement appropriate content filtering
- Regular evaluation recommended for production deployments
## πŸ“š Citations & Acknowledgements
### Citation
```bibtex
@misc{SuperNovaMediusCMW4A16,
author = {Edward Kim and Jaro Uljanovs},
title = {SuperNova Medius Compressed Model W4A16},
year = {2024},
howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}
```
### πŸ‘ Acknowledgements
- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs
---
## πŸ“ Version History
- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates