# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5
This is a Q5 quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1 model, converted for use with MLX on Apple Silicon.
## Model Details

- Original Model: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
- Quantization: Q5 (5-bit)
- Size: 163.2 GB on disk (Q5-quantized weights; see the back-of-envelope check after this list)
- Peak Memory Usage: ~175 GB when loaded
- Architecture: DeciLM (NAS-optimized Llama variant)
- Framework: MLX 0.26.2+
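As a rough sanity check on the size figures, here is a back-of-envelope estimate. It assumes every weight is quantized with MLX's affine scheme, which stores an fp16 scale and bias per quantization group; in practice some tensors may be kept at higher precision, so treat it as an approximation:

```python
# Back-of-envelope estimate of the quantized model size.
# Assumption: all 253B parameters are quantized; MLX affine quantization
# adds an fp16 scale and fp16 bias per group (group size 64 here).
params = 253e9
bits_per_weight = 5 + (16 + 16) / 64   # 5-bit weights + per-group scale/bias overhead
size_bytes = params * bits_per_weight / 8
print(f"~{size_bytes / 1e9:.0f} GB, ~{size_bytes / 2**30:.0f} GiB")  # ~174 GB, ~162 GiB
```

This lands in the neighborhood of the 163.2 GB listed above; the remaining gap depends on which tensors are left unquantized and on GB-vs-GiB reporting.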
## Key Features

- Neural Architecture Search (NAS)-optimized architecture
- Variable Grouped Query Attention (VGQA)
- FFN fusion for improved efficiency
- Dummy (no-op) layers for a reduced memory footprint (see the config sketch after this list)
- Optimized for Apple Silicon M-series chips
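To see where the NAS-optimized architecture deviates from a vanilla Llama stack, you can inspect the per-layer block configuration shipped in config.json. A minimal sketch; the `block_configs` field and its `no_op` keys follow the DeciLM config schema used by the upstream NVIDIA model, so verify the exact field names against the downloaded file:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only config.json and count the no-op (dummy) attention and FFN blocks.
# Field names below (block_configs, attention.no_op, ffn.no_op) are taken from
# the DeciLM config schema and may differ; check them against the actual file.
path = hf_hub_download(
    "LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5", "config.json"
)
with open(path) as f:
    config = json.load(f)

blocks = config["block_configs"]
dummy_attention = sum(1 for b in blocks if b["attention"].get("no_op"))
dummy_ffn = sum(1 for b in blocks if b["ffn"].get("no_op"))
print(f"{len(blocks)} layers: {dummy_attention} no-op attention, {dummy_ffn} no-op FFN")
```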
## Performance

Tested on a Mac Studio M3 Ultra (512 GB RAM):

- Generation speed: ~3.86 tokens/sec
- Prompt processing: ~14.3 tokens/sec
- Memory: ~175 GB peak usage
- Works with the `mlx_lm` CLI tools (not yet compatible with LM Studio)
## Usage

With MLX-LM:

```python
from mlx_lm import load, generate

# Downloads (if needed) and loads the quantized weights and tokenizer.
model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# verbose=True streams tokens and prints generation statistics.
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```
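For chat-style prompts, going through the tokenizer's chat template is usually more reliable than passing a raw string. A sketch, assuming the repo ships the upstream chat template (the `apply_chat_template` pattern below is the standard MLX-LM idiom):

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# Format the conversation with the model's chat template and append the
# assistant header so generation starts at the right position.
messages = [{"role": "user", "content": "Explain VGQA in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```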
Command line:

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```
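MLX-LM also bundles an OpenAI-compatible HTTP server, which can stand in until LM Studio support lands. A sketch, assuming the default `mlx_lm.server` flags:

```bash
uv run mlx_lm.server \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --port 8080
```

Any OpenAI-compatible client can then point at `http://localhost:8080/v1/chat/completions`.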
## Conversion Details

- Converted using the MLX-LM quantization tools (a sketch of the likely invocation follows)
- Q5 quantization with group size 64
- DeciLM architecture specifics preserved
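The exact invocation was not recorded, but with the MLX-LM converter it would look roughly like this (flag names per `mlx_lm.convert`; the 5-bit and group-size-64 settings match the details above):

```bash
uv run mlx_lm.convert \
  --hf-path nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --mlx-path Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  -q --q-bits 5 --q-group-size 64
```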
## License

Same as the original model; check NVIDIA's license terms for nvidia/Llama-3_1-Nemotron-Ultra-253B-v1.
## Acknowledgments
Thanks to NVIDIA for the original Nemotron model and the MLX team for the framework.