---
language: en
tags:
- exllamav3
- quantized
- 5-bit
- reasoning
- coding
- qwen2
library_name: exllamav3
base_model: andrewzh/Absolute_Zero_Reasoner-Coder-14b
base_model_relation: quantized
---

# Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3

This is a 5-bit quantized version of [andrewzh/Absolute_Zero_Reasoner-Coder-14b](https://huggingface.co/andrewzh/Absolute_Zero_Reasoner-Coder-14b), produced with [ExLlamaV3](https://github.com/turboderp-org/exllamav3) v0.0.2.

## Model Description

This model is a quantized version of Absolute_Zero_Reasoner-Coder-14b, which is built on Qwen2.5-Coder-14B (the Qwen2 architecture family) and is designed for reasoning and coding tasks. For more details about the original model, please refer to the paper: [https://huggingface.co/papers/2505.03335](https://huggingface.co/papers/2505.03335).

The quantization reduces the model's size and memory requirements while attempting to preserve as much of the original performance as possible; a rough size estimate is sketched in the Estimated Memory Footprint section below.

## Quantization Methodology

The model was quantized using ExLlamaV3 v0.0.2 with the following parameters:

- **Quantization Method**: exl3 (ExLlamaV3)
- **Bits**: 5.0 (5-bit quantization of the quantizable weights)
- **Head Bits**: 6 (6-bit precision for the output head, `lm_head`)
- **Calibration**:
  - Rows: 100
  - Columns: 2048
- **Out Scales**: auto

The exl3 format uses calibrated, non-uniform quantization rather than simple linear rounding, which preserves more model quality at low bit depths. An illustrative conversion command is sketched under Reproducing the Quantization below.

## Model Architecture

The model is based on the Qwen2 architecture with the following specifications:

- **Hidden Size**: 5120
- **Intermediate Size**: 13824
- **Number of Attention Heads**: 40
- **Number of Key-Value Heads**: 8
- **Number of Hidden Layers**: 48
- **Maximum Sequence Length**: 32768
- **Vocabulary Size**: 152064

## How to Use

To use this quantized model you will need the ExLlamaV3 library. If a prebuilt package is not available for your platform, installing from source is the most reliable route:

```bash
git clone https://github.com/turboderp-org/exllamav3
cd exllamav3
pip install .
```

Here's a basic example of how to use the model. The API shown follows the ExLlamaV3 examples and may change between early releases:

```python
from exllamav3 import Config, Model, Cache, Tokenizer, Generator

# Set up model path
model_path = "path/to/Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3"

# Load config, model, and KV cache
config = Config.from_directory(model_path)
model = Model.from_config(config)
cache = Cache(model, max_num_tokens = 32768)
model.load()

# Load tokenizer and create a generator
tokenizer = Tokenizer.from_config(config)
generator = Generator(model = model, cache = cache, tokenizer = tokenizer)

# Generate text. Sampling parameters (e.g. temperature 0.6, top_p 0.9)
# are configured through the Generator's sampler settings; the exact API
# may vary between early ExLlamaV3 releases.
prompt = "Write a function to calculate the Fibonacci sequence in Python:"
output = generator.generate(prompt = prompt, max_new_tokens = 200)
print(output)
```

## Limitations

This quantized model has the following limitations:

1. **Reduced Precision**: The 5-bit quantization may lead to some degradation in performance compared to the original model, particularly on complex reasoning tasks.
2. **ExLlamaV3 Dependency**: This model can only be used with the ExLlamaV3 library and is not compatible with standard Hugging Face Transformers without conversion.
3. **Inherited Limitations**: All limitations of the original model apply to this quantized version as well.
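## Estimated Memory Footprint

As a back-of-the-envelope check on the size reduction claimed above, you can estimate the weight storage from the bits-per-weight figure. The parameter count below is an assumed approximation for a 14B-class model, and real file sizes also depend on the 6-bit output head and any unquantized tensors (embeddings, norms), so treat this as a rough sketch:

```python
# Rough weight-storage estimate (illustrative only; 14.8e9 is an assumed
# approximate parameter count, and actual checkpoint sizes will differ).
params = 14.8e9

fp16_gb = params * 16 / 8 / 1e9   # original FP16 weights
exl3_gb = params * 5.0 / 8 / 1e9  # 5.0 bits per weight

print(f"FP16: ~{fp16_gb:.1f} GB, 5.0 bpw exl3: ~{exl3_gb:.1f} GB")
# FP16: ~29.6 GB, 5.0 bpw exl3: ~9.2 GB
```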
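## Reproducing the Quantization

For reference, quants like this one are produced with ExLlamaV3's `convert.py` script. The invocation below is a sketch only: the flag names are assumed from the ExLlamaV2-style conversion interface and may differ in v0.0.2, so consult the ExLlamaV3 repository documentation for the exact usage:

```bash
# Sketch only: flag names are assumed from the ExLlamaV2-style converter
# and may not match your installed ExLlamaV3 version exactly.
python convert.py \
    -i /path/to/Absolute_Zero_Reasoner-Coder-14b \
    -o /path/to/Absolute_Zero_Reasoner-Coder-14b-5.0bpw-exl3 \
    -b 5.0 \
    -hb 6
```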
## Citation

If you use this model in your research, please cite the original paper:

```
@misc{zhao2025absolutezero,
  title         = {Absolute Zero: Reinforced Self-play Reasoning with Zero Data},
  author        = {Andrew Zhao and others},
  year          = {2025},
  eprint        = {2505.03335},
  archivePrefix = {arXiv},
  url           = {https://huggingface.co/papers/2505.03335}
}
```

## Acknowledgements

- Original model: [andrewzh/Absolute_Zero_Reasoner-Coder-14b](https://huggingface.co/andrewzh/Absolute_Zero_Reasoner-Coder-14b)
- Quantization library: [ExLlamaV3](https://github.com/turboderp-org/exllamav3)