|
# SuperNova Medius Compressed Model (W4A16)
|
|
|
|
|
|
> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16` |
|
|
|
## Table of Contents
|
- [Overview](#overview) |
|
- [Quick Start](#quick-start) |
|
- [Model Details](#model-details) |
|
- [Usage Guide](#usage-guide) |
|
- [Quantization Process](#quantization-process) |
|
- [Technical Details](#technical-details)

- [Limitations & Biases](#limitations--biases)

- [Citations & Acknowledgements](#citations--acknowledgements)

- [Version History](#version-history)
|
|
|
## Overview
|
|
|
SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. Using GPTQ post-training quantization, weight storage is reduced by roughly 4x relative to the 16-bit original while maintaining near-original performance.
|
|
|
### Key Features
|
- 4-bit weight quantization |
|
- 16-bit activations (left unquantized under the W4A16 scheme)
|
- 4096 token context window |
|
- Optimized for deployment on consumer hardware |
|
|
|
## Quick Start
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
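Because the quantization recipe below formats its calibration data with the tokenizer's chat template, chat-formatted prompts are presumably the intended input format. A minimal sketch, reusing `model` and `tokenizer` from the snippet above:

```python
# Chat-formatted prompting via the tokenizer's chat template
messages = [{"role": "user", "content": "Summarize W4A16 quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```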
|
|
|
## Model Details
|
|
|
### Specifications |
|
- **Base Model**: arcee-ai/SuperNova-Medius |
|
- **Quantization Method**: GPTQ |
|
- **Maximum Sequence Length**: 4096 |
|
- **Calibration Samples**: 1024 |
|
|
|
### Quantization Parameters |
|
| Parameter | Value |
|-----------|-------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |
|
|
|
## Usage Guide
|
|
|
### Basic Usage |
|
See Quick Start section above. |
|
|
|
### Advanced Usage |
|
|
|
```python
# Advanced generation with sampling parameters
# (reuses `model` and `tokenizer` from the Quick Start example)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
|
|
### Memory Optimization |
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM

# Load the model with an automatic device map for multi-GPU setups
# (requires the accelerate package)
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
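
### Serving with vLLM

vLLM appears in the dependency table below, and it can load compressed-tensors W4A16 checkpoints directly. A minimal offline-inference sketch (exact behavior may vary by vLLM version):

```python
from vllm import LLM, SamplingParams

# max_model_len matches this model's 4096-token context window
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```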
|
|
|
## Quantization Process
|
|
|
```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate device map, reserving GPU memory for the GPTQ Hessians
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16,
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    # Render each chat sample with the model's chat template
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize)

# Configure quantization: 4-bit weights, 16-bit activations, lm_head excluded
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1,
)

# Execute one-shot quantization over the calibration set
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        "split_batches": True,
        "dispatch_batches": None,
        "even_batches": True,
        "use_seedable_sampler": True,
        "non_blocking": False,
        "gradient_accumulation_kwargs": None,
        "use_configured_state": False,
    },
)

# Save the quantized model in compressed format
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```
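
After saving, it is worth confirming that the compressed checkpoint reloads and generates. A minimal smoke test, assuming the output path used above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the compressed checkpoint and run one short generation
path = "./arcee-ai/SuperNova-Medius-CM-w4a16"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```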
|
|
|
## Technical Details
|
|
|
### Dependencies |
|
| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |
|
|
|
### Hardware Requirements |
|
- **Minimum**: 8GB VRAM |
|
- **Recommended**: 16GB VRAM |
|
- **Optimal**: 24GB VRAM or multiple GPUs |
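
As a back-of-envelope check (assuming roughly 14B parameters for the base model, a figure not stated in this card), 4-bit weights cost about half a byte per parameter:

```python
# Rough VRAM estimate for W4A16 weights.
# ASSUMPTION: ~14B parameters for SuperNova-Medius (not stated in this card).
num_params = 14e9
bytes_per_weight = 4 / 8  # 4-bit weights
weights_gb = num_params * bytes_per_weight / 1e9
print(f"~{weights_gb:.0f} GB for weights alone")  # ~7 GB, before KV cache and activations
```

This lines up with the 8GB minimum above: the weights fit, with the remaining headroom needed for the KV cache and activations.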
|
|
|
## Limitations & Biases
|
|
|
### Known Limitations |
|
- Slight quality degradation relative to the full-precision model

- Limited to a 4096-token context window

- May require careful memory management on consumer GPUs
|
|
|
### Inherited Biases |
|
- Carries over biases from the base model
|
- Users should implement appropriate content filtering |
|
- Regular evaluation recommended for production deployments |
|
|
|
## Citations & Acknowledgements
|
|
|
### Citation |
|
|
|
```bibtex
@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/ConfidentialMind/arcee-ai-SuperNova-Medius-CM-w4a16}},
}
```
|
|
|
### Acknowledgements
|
- Original Model: arcee-ai/SuperNova-Medius |
|
- Quantization Tools: LLM Compressor |
|
- Contributors: Edward Kim and Jaro Uljanovs |
|
|
|
--- |
|
|
|
## Version History
|
|
|
- v1.0.0 (2024-03): Initial release |
|
- v1.0.1 (2024-03): Documentation updates |