# Model Card for nvidia/Llama-3_3-Nemotron-Super-49B-v1-LMUL
This model is a derivative of `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, modified to use a custom attention mechanism defined by the `l_mul_attention` function from the `lmul` library.
## Model Details
- Original Model: `nvidia/Llama-3_3-Nemotron-Super-49B-v1`
- Architecture: DeciLM (`decilm`)
- Modification: The `forward` method of the `DeciAttention` module has been replaced (monkey-patched) with a custom implementation that uses the `l_mul_attention` logic (see the sketch below). Note that some blocks of the original model skip the attention layer entirely; those blocks are unaffected by this modification.
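For illustration, the patch might look roughly like the following. The import path, the `forward` signature, and the helper function are assumptions for the sketch, not the repository's actual patching code.

```python
# Illustrative sketch only: the lmul import path, the forward signature, and
# this helper are assumptions, not the actual code shipped with the model.
from lmul import l_mul_attention  # assumed import path


def patch_deci_attention(model):
    """Replace the forward method of every DeciAttention module in-place."""

    def patched_forward(self, hidden_states, attention_mask=None, **kwargs):
        # Delegate the attention computation to l_mul_attention instead of the
        # stock scaled dot-product attention; exact arguments are illustrative.
        return l_mul_attention(self, hidden_states, attention_mask=attention_mask, **kwargs)

    for module in model.modules():
        # Blocks whose attention layer is skipped expose no DeciAttention
        # module, so they are left untouched automatically.
        if module.__class__.__name__ == "DeciAttention":
            module.forward = patched_forward.__get__(module, module.__class__)
    return model
```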
## Scientific Rationale
This model was modified as part of a research project investigating alternative attention mechanisms in large language models. The `l_mul_attention` function implements a novel approach to calculating attention scores, and this model serves as a test case for evaluating its performance, efficiency, and impact on reasoning and generation tasks compared to the standard attention implementation.
By releasing this model, we hope to encourage further research into non-standard attention mechanisms and provide a practical example for the community to build upon.
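One simple way to start such a comparison is to generate from both checkpoints with the same prompt and inspect the outputs side by side. The sketch below is illustrative only: the L-Mul model ID is a placeholder and the sampling settings are not a full evaluation protocol.

```python
# Minimal side-by-side comparison sketch; model IDs and settings are illustrative.
import torch
from transformers import pipeline

messages = [
    {"role": "system", "content": "detailed thinking off"},
    {"role": "user", "content": "Explain the Pythagorean theorem in one paragraph."},
]

for model_id in (
    "nvidia/Llama-3_3-Nemotron-Super-49B-v1",                 # standard attention
    "YOUR_HF_USERNAME/Llama-3_3-Nemotron-Super-49B-v1-LMUL",  # L-Mul attention
):
    pipe = pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )
    out = pipe(messages, max_new_tokens=256)
    print(f"=== {model_id} ===")
    print(out[0]["generated_text"])
```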
## How to Get Started
You can use this model with the standard `transformers` library. Because the base model uses a custom architecture, you must pass `trust_remote_code=True` when loading it.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Make sure to log in with your Hugging Face token if the model is private
# from huggingface_hub import login
# login("your-hf-token")

model_id = "YOUR_HF_USERNAME/Llama-3_3-Nemotron-Super-49B-v1-LMUL"  # Replace with your HF username
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # Important! Required by the base model
)

# The base model uses a system prompt to control reasoning
thinking = "on"  # or "off"
messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
]

# Format the prompt with the tokenizer's chat template; the original model card
# also shows equivalent usage via the transformers pipeline.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.95,
)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
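Note that `batch_decode` on `generated_ids` returns the prompt together with the completion. To print only the newly generated text, you can slice off the prompt tokens first; this small addition continues the example above.

```python
# Continues the example above: keep only the tokens generated after the prompt.
new_tokens = generated_ids[0][model_inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```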
## Intended Uses & Limitations
This model is intended primarily for research purposes. Its performance on standard benchmarks has not been fully evaluated. The custom attention mechanism may introduce unexpected behaviors or limitations not present in the original model. The original model has specific prompting requirements (e.g., for controlling reasoning) which should be followed.
## Licensing Information
This model is released under the `nvidia-open-model-license`, the same license as the base model, `nvidia/Llama-3_3-Nemotron-Super-49B-v1`. By using this model, you agree to the terms of the original license. It is your responsibility to ensure compliance with all applicable licenses and regulations. The model is also built upon Meta Llama 3, and its use is subject to the Llama 3.3 Community License Agreement.