---
base_model:
- nvidia/Llama-3_3-Nemotron-Super-49B-v1
tags:
- L-Mul
- optimization
- quantization
- text-generation
- research
- experimental
license: other
---

# Model Card for nvidia/Llama-3_3-Nemotron-Super-49B-v1-LMUL

This model is a derivative of `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, modified to use a custom attention mechanism defined by the `l_mul_attention` function from the `lmul` library.

## Model Details

- **Original Model:** [nvidia/Llama-3_3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1)
- **Architecture:** `DeciLM` (`decilm`)
- **Modification:** The `forward` method of the `DeciAttention` module has been replaced (monkey-patched) with a custom implementation that uses the `l_mul_attention` logic. Note that some blocks of the original model skip the attention layer entirely; those blocks are unaffected by this modification.

## Scientific Rationale

This model was modified as part of a research project investigating alternative attention mechanisms in large language models. The `l_mul_attention` function implements a novel approach to calculating attention scores, and this model serves as a test case for evaluating its performance, efficiency, and impact on reasoning and generation tasks compared to the standard attention implementation.

By releasing this model, we hope to encourage further research into non-standard attention mechanisms and to provide a practical example for the community to build upon.
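
For intuition: L-Mul (linear-complexity multiplication) approximates a floating-point multiply by adding mantissas and exponents instead of multiplying mantissas. The sketch below is a simplified scalar illustration of that idea, assuming this is the technique the `lmul` library applies inside attention; it is not the library's implementation.

```python
import math

def l_mul(x: float, y: float, offset_bits: int = 4) -> float:
    """Approximate x * y by adding mantissas instead of multiplying them
    (a simplified, scalar illustration of the L-Mul idea)."""
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    # Decompose |v| = (1 + f) * 2**e with fractional mantissa f in [0, 1).
    mx, ex = math.frexp(abs(x))   # mx in [0.5, 1), so |x| = (2*mx) * 2**(ex-1)
    my, ey = math.frexp(abs(y))
    fx, fy = 2 * mx - 1, 2 * my - 1
    # Exact mantissa product: (1+fx)(1+fy) = 1 + fx + fy + fx*fy.
    # L-Mul drops the fx*fy multiplication and substitutes a small constant,
    # so the whole operation reduces to additions.
    mantissa = 1 + fx + fy + 2 ** -offset_bits
    return sign * mantissa * 2 ** ((ex - 1) + (ey - 1))
```

For example, `l_mul(3.0, 5.0)` gives 14.5 against the exact 15.0, about 3% relative error, while avoiding the mantissa multiplication entirely.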

## How to Get Started

You can use this model with the standard `transformers` pipeline. Because the base model uses a custom architecture, you must pass `trust_remote_code=True` when loading it.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Log in with your Hugging Face token if the model is gated or private
# from huggingface_hub import login
# login("your-hf-token")

model_id = "YOUR_HF_USERNAME/Llama-3_3-Nemotron-Super-49B-v1-LMUL"  # Replace with your HF username
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # Important! Required by the base model's custom architecture
)

# The base model uses a system prompt to toggle detailed reasoning
thinking = "on"  # or "off"
messages = [
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "What is the airspeed velocity of an unladen swallow?"},
]

# Format the conversation with the tokenizer's chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.6,
    top_p=0.95,
)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```

## Intended Uses & Limitations

This model is intended primarily for research purposes. Its performance on standard benchmarks has not been fully evaluated. The custom attention mechanism may introduce unexpected behaviors or limitations not present in the original model. The original model's specific prompting requirements (e.g., the system prompt that controls reasoning) should be followed.

## Licensing Information

This model is released under the `nvidia-open-model-license`, the same license as the base model, `nvidia/Llama-3_3-Nemotron-Super-49B-v1`. By using this model, you agree to the terms of that license. It is your responsibility to ensure compliance with all applicable licenses and regulations. The model is also built upon Meta Llama 3, and its use is subject to the Llama 3.3 Community License Agreement.