---
license: other
base_model:
  - nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
tags:
  - mlx
  - mlx-community
  - DeciLMForCausalLM
  - NAS
  - reasoning
---

# Llama-3.1-Nemotron-Ultra-253B-v1-MLX-Q5

This is a Q5 quantized version of NVIDIA's Llama-3.1-Nemotron-Ultra-253B-v1 model, converted for use with MLX on Apple Silicon.

## Model Details

- **Original Model:** [nvidia/Llama-3_1-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)
- **Quantization:** Q5 (5-bit)
- **Size:** 163.2 GB (Q5 quantized weights)
- **Peak Memory Usage:** ~175 GB when loaded
- **Architecture:** DeciLM (NAS-optimized Llama variant)
- **Framework:** MLX 0.26.2+

## Key Features

- Neural Architecture Search (NAS)-optimized model
- Variable Grouped Query Attention (VGQA)
- FFN Fusion for improved efficiency
- Dummy layers for a reduced memory footprint
- Optimized for Apple Silicon M-series chips

## Performance

Tested on a Mac Studio M3 Ultra (512 GB RAM):

- **Generation speed:** ~3.86 tokens/sec
- **Prompt processing:** ~14.3 tokens/sec
- **Memory:** ~175 GB peak usage
- Works with the mlx_lm CLI tools (not yet compatible with LM Studio)
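
Peak memory can also be checked programmatically after a run. A minimal sketch, assuming an MLX version (0.26+, per the requirement above) where `mx.get_peak_memory()` is available:

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
generate(model, tokenizer, prompt="Hello", max_tokens=32, verbose=True)

# Peak memory tracked by MLX for this process, in bytes
print(f"Peak memory: {mx.get_peak_memory() / 1024**3:.1f} GB")
```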

## Usage

With MLX-LM:

```python
from mlx_lm import load, generate

# Load the quantized weights (downloads from the Hub on first use)
model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")
response = generate(model, tokenizer, prompt="Your prompt here", verbose=True)
```
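
For chat-style prompting, the tokenizer ships with the model's chat template. A minimal sketch; the `"detailed thinking on"` system prompt follows NVIDIA's documented reasoning toggle for the original model, so verify it against the upstream model card:

```python
from mlx_lm import load, generate

model, tokenizer = load("LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5")

# "detailed thinking on"/"off" is NVIDIA's reasoning switch for the
# original model; confirm on the upstream model card.
messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "Explain grouped query attention in two sentences."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```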

Command line:

```bash
uv run mlx_lm.generate \
  --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  --prompt "Your prompt here" \
  --max-tokens 1000
```
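
Since LM Studio can't load the model yet, an OpenAI-compatible endpoint can be served with MLX-LM's built-in server instead (a sketch; `mlx_lm.server` listens on 127.0.0.1:8080 by default):

```bash
uv run mlx_lm.server --model LibraxisAI/Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5

# Query it from another shell
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
```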

## Conversion Details

- Converted using the MLX-LM quantization tools
- Q5 quantization with group size 64
- DeciLM architecture specifics preserved
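
A command along these lines should reproduce a similar artifact. This is a sketch assuming a recent mlx-lm release with 5-bit quantization support (MLX 0.26.2+, per the requirement above); flag names may vary between versions:

```bash
uv run mlx_lm.convert \
  --hf-path nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 \
  --mlx-path Llama-3_1-Nemotron-Ultra-253B-v1-mlx-q5 \
  -q --q-bits 5 --q-group-size 64
```

After conversion, the quantization settings are recorded in the output `config.json` (a `"quantization"` block with `"bits"` and `"group_size"`), which is a quick way to verify the result.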

## License

Same as the original model; see NVIDIA's license terms for [nvidia/Llama-3_1-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1).

## Acknowledgments

Thanks to NVIDIA for the original Nemotron model and to the MLX team for the framework.