Model Card for nvidia/Llama-3_3-Nemotron-Super-49B-v1-LMUL

This model is a derivative of nvidia/Llama-3_3-Nemotron-Super-49B-v1, modified to use a custom attention mechanism defined by the l_mul_attention function from the lmul library.

Model Details

  • Original Model: nvidia/Llama-3_3-Nemotron-Super-49B-v1
  • Architecture: DeciLM (decilm)
  • Modification: The forward method of the DeciAttention module has been replaced (monkey-patched) with a custom implementation that delegates the attention computation to l_mul_attention. Note that some blocks of the original model skip the attention layer entirely; those blocks are unaffected by this modification. A minimal sketch of the patch is shown below.
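
The snippet below is a minimal, illustrative sketch of how such a monkey-patch can be applied. It assumes the remote-code module exposes a DeciAttention class with q_proj/k_proj/v_proj/o_proj projections and that lmul provides an l_mul_attention(q, k, v, attention_mask=...) function; the actual class layout and function signature in this repository may differ, and head reshaping, rotary embeddings, and KV caching are omitted for brevity.

import types
from lmul import l_mul_attention  # assumed import path; adjust to the actual lmul API

def patched_forward(self, hidden_states, attention_mask=None, **kwargs):
    # Project the hidden states exactly as the original attention module does.
    query = self.q_proj(hidden_states)
    key = self.k_proj(hidden_states)
    value = self.v_proj(hidden_states)
    # Delegate the attention computation to the L-Mul kernel (assumed signature).
    attn_output = l_mul_attention(query, key, value, attention_mask=attention_mask)
    # Project back to the model dimension; attention weights are not returned in this sketch.
    return self.o_proj(attn_output), None

def apply_lmul_patch(model):
    # Patch only blocks that actually contain an attention layer; blocks whose
    # attention is skipped in the original architecture are left untouched.
    for module in model.modules():
        if module.__class__.__name__ == "DeciAttention":
            module.forward = types.MethodType(patched_forward, module)

In practice, the released checkpoint is expected to carry this modification through its remote code, so loading it as shown in the next section requires no manual patching; the sketch only illustrates the nature of the change.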

Scientific Rationale

This model was modified as part of a research project investigating alternative attention mechanisms in large language models. The l_mul_attention function implements a novel approach to calculating attention scores, and this model serves as a test case for evaluating its performance, efficiency, and impact on reasoning and generation tasks compared to the standard attention implementation.

By releasing this model, we hope to encourage further research into non-standard attention mechanisms and provide a practical example for the community to build upon.

How to Get Started

You can use this model with the standard transformers library. Because the base model uses a custom architecture, you must pass trust_remote_code=True when loading it.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Make sure to log in with your Hugging Face token if the model is private
# from huggingface_hub import login
# login("your-hf-token")

model_id = "YOUR_HF_USERNAME/Llama-3_3-Nemotron-Super-49B-v1-LMUL" # Replace with your HF username
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True  # Important! Required by the base model
)

# The base model uses a system prompt to control reasoning
thinking = "on" # or "off"
messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"}
]

# Format the conversation with the tokenizer's chat template,
# following the prompting format described in the original model card.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.95
)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

Intended Uses & Limitations

This model is intended primarily for research purposes. Its performance on standard benchmarks has not been fully evaluated. The custom attention mechanism may introduce unexpected behaviors or limitations not present in the original model. The original model has specific prompting requirements (e.g., for controlling reasoning) which should be followed.
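
For instance, the base model's reasoning behavior is controlled through the system message, as in the example above; a hedged sketch of the two settings (following the original card's "detailed thinking on/off" convention) is:

messages_reasoning_on = [
    {"role": "system", "content": "detailed thinking on"},   # enables step-by-step reasoning
    {"role": "user", "content": "Explain why the sky is blue."}
]
messages_reasoning_off = [
    {"role": "system", "content": "detailed thinking off"},  # requests a direct answer
    {"role": "user", "content": "Explain why the sky is blue."}
]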

Licensing Information

This model is released under the nvidia-open-model-license, which is the same license as the base model, nvidia/Llama-3_3-Nemotron-Super-49B-v1. By using this model, you agree to the terms of the original license. It is your responsibility to ensure compliance with all applicable licenses and regulations. The model is also built upon Meta Llama 3, and its use is subject to the Llama 3.3 Community License Agreement.
