---
language: en
tags:
- exllamav3
- quantized
- 5-bit
- reasoning
- coding
- qwen2
library_name: exllamav3
base_model: andrewzh/Absolute_Zero_Reasoner-Coder-14b
base_model_relation: quantized
---

# Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3

This is a 5-bit quantized version of [andrewzh/Absolute_Zero_Reasoner-Coder-14b](https://huggingface.co/andrewzh/Absolute_Zero_Reasoner-Coder-14b), produced with [ExLlamaV3](https://github.com/turboderp-org/exllamav3) v0.0.2.

## Model Description

This model is a quantized version of Absolute_Zero_Reasoner-Coder-14b, which is built on Qwen2.5-Coder-14B (the Qwen2 architecture family) and is designed for reasoning and coding tasks. For more details about the original model, please refer to the paper: [https://huggingface.co/papers/2505.03335](https://huggingface.co/papers/2505.03335).

The quantization reduces the model's size and memory requirements while attempting to preserve as much of the original performance as possible; a rough size estimate is sketched in the Estimated Memory Footprint section below.

## Quantization Methodology

The model was quantized using ExLlamaV3 v0.0.2 with the following parameters:

- **Quantization Method**: exl3 (ExLlamaV3)
- **Bits**: 5.0 (5-bit quantization of the quantizable weights)
- **Head Bits**: 6 (6-bit precision for the output head, `lm_head`)
- **Calibration**:
  - Rows: 100
  - Columns: 2048
- **Out Scales**: auto

The exl3 format uses calibrated, non-uniform quantization rather than simple linear rounding, which preserves more model quality at low bit depths. An illustrative conversion command is sketched under Reproducing the Quantization below.

## Model Architecture

The model is based on the Qwen2 architecture with the following specifications:

- **Hidden Size**: 5120
- **Intermediate Size**: 13824
- **Number of Attention Heads**: 40
- **Number of Key-Value Heads**: 8
- **Number of Hidden Layers**: 48
- **Maximum Sequence Length**: 32768
- **Vocabulary Size**: 152064

## How to Use

To use this quantized model you will need the ExLlamaV3 library. If a prebuilt package is not available for your platform, installing from source is the most reliable route:

```bash
git clone https://github.com/turboderp-org/exllamav3
cd exllamav3
pip install .
```

Here's a basic example of how to use the model. The API shown follows the ExLlamaV3 examples and may change between early releases:

```python
from exllamav3 import Config, Model, Cache, Tokenizer, Generator

# Set up model path
model_path = "path/to/Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3"

# Load config, model, and KV cache
config = Config.from_directory(model_path)
model = Model.from_config(config)
cache = Cache(model, max_num_tokens = 32768)
model.load()

# Load tokenizer and create a generator
tokenizer = Tokenizer.from_config(config)
generator = Generator(model = model, cache = cache, tokenizer = tokenizer)

# Generate text. Sampling parameters (e.g. temperature 0.6, top_p 0.9)
# are configured through the Generator's sampler settings; the exact API
# may vary between early ExLlamaV3 releases.
prompt = "Write a function to calculate the Fibonacci sequence in Python:"
output = generator.generate(prompt = prompt, max_new_tokens = 200)
print(output)
```

## Limitations

This quantized model has the following limitations:

1. **Reduced Precision**: The 5-bit quantization may lead to some degradation in performance compared to the original model, particularly on complex reasoning tasks.
2. **ExLlamaV3 Dependency**: This model can only be used with the ExLlamaV3 library and is not compatible with standard Hugging Face Transformers without conversion.
3. **Inherited Limitations**: All limitations of the original model apply to this quantized version as well.
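## Estimated Memory Footprint

As a back-of-the-envelope check on the size reduction claimed above, you can estimate the weight storage from the bits-per-weight figure. The parameter count below is an assumed approximation for a 14B-class model, and real file sizes also depend on the 6-bit output head and any unquantized tensors (embeddings, norms), so treat this as a rough sketch:

```python
# Rough weight-storage estimate (illustrative only; 14.8e9 is an assumed
# approximate parameter count, and actual checkpoint sizes will differ).
params = 14.8e9

fp16_gb = params * 16 / 8 / 1e9   # original FP16 weights
exl3_gb = params * 5.0 / 8 / 1e9  # 5.0 bits per weight

print(f"FP16: ~{fp16_gb:.1f} GB, 5.0 bpw exl3: ~{exl3_gb:.1f} GB")
# FP16: ~29.6 GB, 5.0 bpw exl3: ~9.2 GB
```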
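## Reproducing the Quantization

For reference, quants like this one are produced with ExLlamaV3's `convert.py` script. The invocation below is a sketch only: the flag names are assumed from the ExLlamaV2-style conversion interface and may differ in v0.0.2, so consult the ExLlamaV3 repository documentation for the exact usage:

```bash
# Sketch only: flag names are assumed from the ExLlamaV2-style converter
# and may not match your installed ExLlamaV3 version exactly.
python convert.py \
    -i /path/to/Absolute_Zero_Reasoner-Coder-14b \
    -o /path/to/Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3 \
    -b 5.0 \
    -hb 6
```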
## Citation

If you use this model in your research, please cite the original paper:

```
@misc{zhao2025absolutezero,
  title         = {Absolute Zero: Reinforced Self-play Reasoning with Zero Data},
  author        = {Andrew Zhao and others},
  year          = {2025},
  eprint        = {2505.03335},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/papers/2505.03335}
}
```

## Acknowledgements

- Original model: [andrewzh/Absolute_Zero_Reasoner-Coder-14b](https://huggingface.co/andrewzh/Absolute_Zero_Reasoner-Coder-14b)
- Quantization library: [ExLlamaV3](https://github.com/turboderp-org/exllamav3)