Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5

This is a Q5 quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1 model, converted for use with MLX on Apple Silicon.

Model Details

  • Original Model: nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
  • Quantization: Q5 (5-bit)
  • Size: 163.2 GB (Q5 quantized weights)
  • Peak Memory Usage: ~175 GB when loaded
  • Architecture: DeciLM (NAS-optimized Llama variant)
  • Framework: MLX 0.26.2+

Key Features

  • Neural Architecture Search (NAS) optimized model
  • Variable Grouped Query Attention (VGQA)
  • FFN Fusion for improved efficiency
  • Dummy ("no-op") layers that skip attention or FFN in some blocks, reducing the memory footprint (see the sketch after this list)
  • Optimized for Apple Silicon M-series chips
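
The per-block structure produced by NAS is recorded in the original model's config as a list of block descriptors, with attention and FFN settings chosen independently for each layer. A simplified, hypothetical sketch of that pattern (field names follow DeciLM-style block_configs; the values here are made up, not read from the actual config):

# Hypothetical excerpt of a DeciLM-style per-block configuration.
# The real values live in the model's config.json; these are illustrative only.
block_configs = [
    # A regular block: GQA with 8 query heads sharing each KV head
    {"attention": {"n_heads_in_group": 8, "no_op": False},
     "ffn": {"ffn_mult": 2.625, "no_op": False}},
    # A "dummy" block: attention is a no-op, so this layer stores no
    # attention weights and no KV cache, shrinking the memory footprint
    {"attention": {"n_heads_in_group": None, "no_op": True},
     "ffn": {"ffn_mult": 2.625, "no_op": False}},
]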

Performance

Tested on Mac Studio M3 Ultra (512GB RAM):

  • Generation: ~3.86 tokens/sec
  • Prompt processing: ~14.3 tokens/sec
  • Memory: ~175 GB peak usage
  • Runs with the mlx_lm CLI tools (not yet compatible with LM Studio)

Usage

With MLX-LM:

from mlx_lm import load, generate

# Downloads the quantized weights from the Hub on first use, then loads them
model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
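
For chat-style prompting, the tokenizer's chat template can be applied first. A minimal sketch; the message content is a placeholder:

from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# Format the conversation with the tokenizer's built-in chat template
messages = [{"role": "user", "content": "Explain FFN fusion in one paragraph."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=True)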

Command Line:

uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
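
Sampling parameters can be passed on the command line as well. The flags below exist in recent mlx_lm releases, but names and defaults vary between versions, so check mlx_lm.generate --help first:

uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000 \
  --temp 0.6 \
  --top-p 0.95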

Conversion Details

  • Converted using MLX-LM quantization tools
  • Q5 quantization with group size 64 (a rough size check follows below)
  • Preserved DeciLM architecture specifics
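
The on-disk size lines up with these settings. A rough back-of-the-envelope check, assuming MLX's group-wise affine quantization stores one fp16 scale and one fp16 bias per group of 64 weights:

# 5-bit weights plus fp16 scale and bias per group of 64 = 5.5 effective bits/weight
params = 253e9
bits_per_weight = 5 + (16 + 16) / 64
size_gib = params * bits_per_weight / 8 / 2**30
print(f"{size_gib:.1f} GiB")  # ~162 GiB, close to the reported 163.2 GB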

License

Same license as the original model; see NVIDIA's license terms for nvidia/Llama-3_1-Nemotron-Ultra-253B-v1.

Acknowledgments

Thanks to NVIDIA for the original Nemotron model and the MLX team for the framework.
