---
license: other
base_model:
  - nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
tags:
  - mlx
  - mlx-community
  - DeciLMForCausalLM
  - NAS
  - reasoning
---

# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5

This is a Q5 (5-bit) quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1 model, converted for use with MLX on Apple Silicon.

## Model Details

- **Original Model**: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
- **Quantization**: Q5 (5-bit, group size 64)
- **Size**: 163.2 GB (Q5 quantized weights)
- **Peak Memory Usage**: ~175 GB when loaded
- **Architecture**: DeciLM (NAS-optimized Llama variant)
- **Framework**: MLX 0.26.2+

## Key Features

- **Neural Architecture Search (NAS)**-optimized model
- **Variable Grouped Query Attention (VGQA)**
- **FFN Fusion** for improved efficiency
- **Dummy layers** for a reduced memory footprint
- Optimized for Apple Silicon M-series chips

## Performance

Tested on a Mac Studio M3 Ultra (512 GB RAM):

- **Generation Speed**: ~3.86 tokens/sec
- **Prompt Processing**: ~14.3 tokens/sec
- **Memory**: ~175 GB peak usage
- Works with the `mlx_lm` CLI tools (not yet compatible with LM Studio)

## Usage

### With MLX-LM:

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```

Sketches of streaming generation and of toggling the model's reasoning mode are included at the end of this card.

### Command Line:

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```

## Conversion Details

- Converted using the MLX-LM quantization tools (a sketch of the conversion call is included at the end of this card)
- Q5 quantization with group size 64
- DeciLM architecture specifics preserved

## License

Same as the original model; check NVIDIA's license terms for nvidia/Llama-3_1-Nemotron-Ultra-253B-v1.

## Acknowledgments

Thanks to NVIDIA for the original Nemotron model and to the MLX team for the framework.
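
## Streaming Output (sketch)

At roughly 4 tokens/sec, a long completion takes minutes, so streaming tokens as they are produced makes interactive use far more pleasant. This is a minimal sketch using `mlx_lm.stream_generate`, assuming a recent mlx-lm release where it yields response objects with a `.text` field; older releases returned plain strings instead.

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# Print each chunk as soon as it is generated instead of
# waiting for the full completion to finish.
for chunk in stream_generate(
    model, tokenizer, prompt="Your prompt here", max_tokens=1000
):
    print(chunk.text, end="", flush=True)
print()
```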
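
## Reasoning Toggle (sketch)

NVIDIA documents a reasoning mode for the original model that is switched through the system prompt ("detailed thinking on" / "detailed thinking off"). The sketch below drives it through the tokenizer's chat template; it assumes this quantized repo ships the original chat template and that the toggle carries over unchanged, so verify both against NVIDIA's model card.

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# The system prompt toggles reasoning in the original model;
# assumed to carry over to this quantized conversion.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Your prompt here"},
]

# Render the conversation with the model's own chat template.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=True)
```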
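
## Reproducing the Conversion (sketch)

The exact command used to produce these weights is not recorded in this card, but the MLX-LM Python API exposes the quantization step directly. The call below is a hypothetical reconstruction matching the settings listed under Conversion Details (5-bit, group size 64); argument names follow recent mlx-lm releases and may differ in yours.

```python
from mlx_lm import convert

# Hypothetical reconstruction: quantize the original weights to
# 5 bits with group size 64 and write them to a local MLX folder.
convert(
    hf_path="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
    mlx_path="Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5",
    quantize=True,
    q_bits=5,
    q_group_size=64,
)
```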