---
base_model:
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
tags:
- L-Mul
- optimization
- quantization
- text-generation
- research
- experimental
license: other
---

# Model Card for nvidia/Llama-3_3-Nemotron-Super-49B-v1-LMUL

This model is a derivative of `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, modified to use a custom attention mechanism defined by the `l_mul_attention` function from the `lmul` library.

## Model Details

- **Original Model:** [nvidia/Llama-3_3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1)
- **Architecture:** `DeciLM` (`decilm`)
- **Modification:** The `forward` method of the `DeciAttention` module has been replaced (monkey-patched) with a custom implementation that routes attention through the `l_mul_attention` logic; a sketch of this patching approach is shown below. Note that some blocks of the original model skip the attention layer entirely; those blocks are unaffected by this modification.
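
The actual patch ships with the `lmul` library rather than this repository, so the code below is only a minimal sketch of what such a monkey-patch can look like. The `l_mul_attention` import path, its signature, and the `patch_decilm_attention` helper are illustrative assumptions, not the real implementation.

```python
import types

# Assumed import path; the model card only says the function comes from `lmul`.
from lmul import l_mul_attention


def patched_forward(self, hidden_states, attention_mask=None, **kwargs):
    # Hypothetical signature: delegate score computation to l_mul_attention
    # instead of the module's stock scaled-dot-product path.
    return l_mul_attention(self, hidden_states, attention_mask=attention_mask, **kwargs)


def patch_decilm_attention(model):
    """Replace the forward method of every DeciAttention module in-place."""
    patched = 0
    for module in model.modules():
        # Blocks that skip attention simply contain no DeciAttention module,
        # so they are left untouched, matching the note above.
        if type(module).__name__ == "DeciAttention":
            module.forward = types.MethodType(patched_forward, module)
            patched += 1
    return patched
```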

## Scientific Rationale

This model was modified as part of a research project investigating alternative attention mechanisms in large language models. The `l_mul_attention` function implements an L-Mul (linear-complexity multiplication) approach to calculating attention scores, and this model serves as a test case for evaluating its performance, efficiency, and impact on reasoning and generation tasks compared to the standard attention implementation.

By releasing this model, we hope to encourage further research into non-standard attention mechanisms and to provide a practical example for the community to build upon.
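
The repository does not include the `l_mul_attention` source, but for intuition: L-Mul-style methods approximate floating-point multiplication with integer addition over the operands' bit patterns, trading a small, bounded error for far cheaper arithmetic. The sketch below shows the underlying trick (a Mitchell-style approximate multiply) on plain tensors; it illustrates the idea only and is not the code used in this model.

```python
import torch


def approx_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Approximate x * y by adding the int32 bit views of the inputs.

    Adding raw IEEE-754 bit patterns adds the exponents exactly and treats the
    mantissas as a piecewise-linear stand-in for log2, which is Mitchell's
    classic approximation (worst-case error around 11%; L-Mul adds a small
    offset term to tighten it). Sketch only: assumes float32 inputs and
    ignores inf/NaN handling.
    """
    xi = x.abs().view(torch.int32)
    yi = y.abs().view(torch.int32)
    # 0x3F800000 is the bit pattern of 1.0f; subtracting it once keeps the
    # exponent bias correct after the addition.
    prod = ((xi - 0x3F800000) + yi).view(torch.float32)
    return prod * torch.sign(x) * torch.sign(y)


a = torch.tensor([1.5, 2.0, 3.1415])
b = torch.tensor([0.5, 4.0, 2.7182])
print(approx_mul(a, b))  # approx [0.75, 8.0, 7.72]
print(a * b)             # exact  [0.75, 8.0, 8.54]
```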

## How to Get Started

You can use this model with the standard `transformers` library. Because the base model uses a custom architecture, you must pass `trust_remote_code=True` when loading it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Make sure to log in with your Hugging Face token if the model is private
# from huggingface_hub import login
# login("your-hf-token")

model_id = "YOUR_HF_USERNAME/Llama-3_3-Nemotron-Super-49B-v1-LMUL"  # Replace with your HF username

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # places the model on GPU(s) when available
    trust_remote_code=True,  # Important! Required by the base model
)

# The base model uses a system prompt to toggle its reasoning mode
thinking = "on"  # or "off"
messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
]

# Build the prompt with the tokenizer's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,  # required for temperature/top_p to take effect
    temperature=0.6,
    top_p=0.95,
)
# Strip the prompt tokens so only the completion is decoded
generated_ids = generated_ids[:, model_inputs.input_ids.shape[1]:]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```
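
If you prefer the `pipeline` interface shown on the original model card, something along these lines should work as well; the parameters simply mirror the snippet above and have not been separately validated for this derivative:

```python
import torch
from transformers import pipeline

model_id = "YOUR_HF_USERNAME/Llama-3_3-Nemotron-Super-49B-v1-LMUL"  # Replace with your HF username

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # still required by the base model's custom architecture
)

messages = [
    {"role": "system", "content": "detailed thinking on"},
    {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
]
print(pipe(messages, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95))
```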

## Intended Uses & Limitations

This model is intended primarily for research purposes. Its performance on standard benchmarks has not been fully evaluated, and the custom attention mechanism may introduce unexpected behaviors or limitations not present in the original model. The original model's specific prompting requirements (e.g., the system prompt that controls reasoning) should still be followed.
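
Because no benchmark numbers are reported, a quick sanity check before relying on the model is to compare its perplexity with the unpatched base model on a small text sample. The helper below is a rough sketch of that check (it reuses `model` and `tokenizer` from the snippet above; the helper name and sample text are illustrative):

```python
import torch


def quick_perplexity(model, tokenizer, text: str) -> float:
    """Rough perplexity of `model` on `text`; a sanity check, not a benchmark."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # For causal LMs, passing labels=input_ids makes the model return the
        # shifted next-token cross-entropy loss.
        loss = model(**enc, labels=enc.input_ids).loss
    return float(torch.exp(loss))


sample = "The quick brown fox jumps over the lazy dog. " * 20
print("patched model ppl:", quick_perplexity(model, tokenizer, sample))
```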

## Licensing Information

This model is released under the `nvidia-open-model-license`, the same license as the base model, `nvidia/Llama-3_3-Nemotron-Super-49B-v1`. By using this model, you agree to the terms of that license. It is your responsibility to ensure compliance with all applicable licenses and regulations. The model is also built on Meta Llama 3, and its use is subject to the Llama 3.3 Community License Agreement.