jondurbin committed on
Commit
d414264
1 Parent(s): a95b461

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,65 +1,245 @@
1
  ---
2
  license: apache-2.0
3
  ---
4
 
5
- Slightly modified mpt-30b, with a few updates (e.g., enabling gradient checkpointing) to make it compatible with the qlora training code.
6
 
7
- Original model: https://huggingface.co/mosaicml/mpt-30b
 
8
 
9
- My fork of qlora with mpt-30b support: https://github.com/jondurbin/qlora
10
 
11
- Differences in the qlora scripts:
 
12
 
13
- - requires adding `--mpt True` for mpt-based models
14
- - uses `--num_train_epochs` instead of `--max_steps`
15
- - uses the airoboros prompt format (mostly 1:1 with vicuna) rather than alpaca, and expects an input file in JSONL format with "instruction" and "response" fields (a sample line is shown below)
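For illustration only, a single line of such a JSONL input file might look like this (the field values are invented, not taken from the actual training set):

```
{"instruction": "Explain what gradient checkpointing does.", "response": "Gradient checkpointing saves memory by recomputing intermediate activations during the backward pass instead of storing all of them."}
```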
16
 
17
- Full example of tuning (used for airoboros-mpt-30b-gpt4-1.4):
18
 
19
  ```
20
- source /workspace/venv/bin/activate
21
-
22
- export WANDB_API_KEY=[redacted]
23
- export WANDB_PROJECT=airoboros-mpt-30b-gpt4-1.4
24
-
25
- python qlora.py \
26
- --model_name_or_path ./mpt-30b \
27
- --output_dir ./$WANDB_PROJECT-checkpoints \
28
- --num_train_epochs 3 \
29
- --logging_steps 1 \
30
- --save_strategy steps \
31
- --data_seed 11422 \
32
- --save_steps 75 \
33
- --save_total_limit 3 \
34
- --evaluation_strategy "no" \
35
- --eval_dataset_size 2 \
36
- --max_new_tokens 8192 \
37
- --dataloader_num_workers 3 \
38
- --logging_strategy steps \
39
- --remove_unused_columns False \
40
- --do_train \
41
- --lora_r 64 \
42
- --lora_alpha 16 \
43
- --lora_modules all \
44
- --double_quant \
45
- --quant_type nf4 \
46
- --bf16 \
47
- --bits 4 \
48
- --warmup_ratio 0.03 \
49
- --lr_scheduler_type constant \
50
- --dataset ./instructions.jsonl \
51
- --dataset_format airoboros \
52
- --model_max_len 8192 \
53
- --gradient_checkpointing \
54
- --per_device_train_batch_size 6 \
55
- --gradient_accumulation_steps 16 \
56
- --learning_rate 0.0001 \
57
- --adam_beta2 0.999 \
58
- --max_grad_norm 0.3 \
59
- --lora_dropout 0.05 \
60
- --weight_decay 0.0 \
61
- --seed 11422 \
62
- --trust_remote_code \
63
- --mpt True \
64
- --report_to wandb
65
  ```
 
1
  ---
2
  license: apache-2.0
3
+ tags:
4
+ - Composer
5
+ - MosaicML
6
+ - llm-foundry
7
+ - StreamingDatasets
8
+ datasets:
9
+ - allenai/c4
10
+ - mc4
11
+ - togethercomputer/RedPajama-Data-1T
12
+ - bigcode/the-stack-dedup
13
+ - allenai/s2orc
14
+ inference: false
15
  ---
16
 
17
+ # MPT-30B
18
 
19
+ MPT-30B is a decoder-style transformer pretrained from scratch on 1T tokens of English text and code.
20
+ This model was trained by [MosaicML](https://www.mosaicml.com).
21
 
22
+ MPT-30B is part of the family of Mosaic Pretrained Transformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference.
23
 
24
+ MPT-30B comes with special features that differentiate it from other LLMs, including an 8k token context window (which can be further extended via finetuning; see [MPT-7B-StoryWriter](https://huggingface.co/mosaicml/mpt-7b-storywriter)), support for context-length extrapolation via [ALiBi](https://arxiv.org/abs/2108.12409), and efficient inference + training via FlashAttention. It also has strong coding abilities thanks to its pretraining mix. MPT models can also be served efficiently with both standard HuggingFace pipelines and NVIDIA's [FasterTransformer](https://github.com/NVIDIA/FasterTransformer).
25
+ The size of MPT-30B was also specifically chosen to make it easy to deploy on a single GPU—either 1xA100-80GB in 16-bit precision or 1xA100-40GB in 8-bit precision.
26
 
27
+ This model uses the MosaicML LLM codebase, which can be found in the [llm-foundry repository](https://github.com/mosaicml/llm-foundry). It was trained by MosaicML’s NLP team on the [MosaicML platform](https://www.mosaicml.com/training) for LLM pretraining, finetuning, and inference.
 
 
28
 
29
+
30
+ ### How is this model different?
31
+
32
+ MPT-30B is:
33
+ * **Licensed for the possibility of commercial use** (unlike [LLaMA](https://arxiv.org/abs/2302.13971)).
34
+ * **Trained on a large amount of data** (1T tokens like [LLaMA](https://arxiv.org/abs/2302.13971) vs. 300B for [Pythia](https://github.com/EleutherAI/pythia), 300B for [OpenLLaMA](https://github.com/openlm-research/open_llama), and 800B for [StableLM](https://github.com/Stability-AI/StableLM)).
35
+ * **Prepared to handle extremely long inputs** thanks to [ALiBi](https://arxiv.org/abs/2108.12409).
36
+ * **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer))
37
+ * **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)
38
+
39
+ ### Models finetuned off MPT-30B:
40
+
41
+ The following models are finetuned on MPT-30B:
42
+
43
+ * [MPT-30B-Instruct](https://huggingface.co/mosaicml/mpt-30b-instruct): a model for short-form instruction following.
44
+ Built by finetuning MPT-30B on several carefully curated datasets.
45
+ * License: _CC-By-NC-SA-3.0_
46
+
47
+ * [MPT-30B-Chat](https://huggingface.co/mosaicml/mpt-30b-chat): a chatbot-like model for dialogue generation.
48
+ Built by finetuning MPT-30B on [ShareGPT-Vicuna](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered), [Camel-AI](https://huggingface.co/camel-ai),
49
+ [GPTeacher](https://github.com/teknium1/GPTeacher), [Guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), [Baize](https://github.com/project-baize/baize-chatbot) and some generated datasets.
50
+ * License: _CC-By-NC-SA-4.0_
51
+ * [Demo on Hugging Face Spaces](https://huggingface.co/spaces/mosaicml/mpt-30b-chat)
52
+
53
+ ## Model Date
54
+
55
+ June 22, 2023
56
+
57
+ ## Model License
58
+
59
+ Apache-2.0
60
+
61
+ ## Documentation
62
+
63
+ * [Blog post: MPT-30B: Raising the bar for open-source foundation models](https://www.mosaicml.com/blog/mpt-30b)
64
+ * [Codebase (mosaicml/llm-foundry repo)](https://github.com/mosaicml/llm-foundry/)
65
+ * Questions: Feel free to contact us via the [MosaicML Community Slack](https://mosaicml.me/slack)!
66
+
67
+
68
+ ## How to Use
69
+
70
+ This model is best used with the MosaicML [llm-foundry repository](https://github.com/mosaicml/llm-foundry) for training and finetuning.
71
+
72
+ ```python
73
+ import transformers
74
+ model = transformers.AutoModelForCausalLM.from_pretrained(
75
+ 'mosaicml/mpt-30b',
76
+ trust_remote_code=True
77
+ )
78
+ ```
79
+ Note: This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method.
80
+ This is because we use a custom `MPT` model architecture that is not yet part of the Hugging Face `transformers` package.
81
+ `MPT` includes options for many training efficiency features such as [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi](https://arxiv.org/abs/2108.12409), [QK LayerNorm](https://arxiv.org/abs/2010.04245), and more.
82
+
83
+ To use the optimized [triton implementation](https://github.com/openai/triton) of FlashAttention, you can load the model on GPU (`cuda:0`) with `attn_impl='triton'` and with `bfloat16` precision:
84
+ ```python
85
+ import torch
86
+ import transformers
87
+
88
+ name = 'mosaicml/mpt-30b'
89
+
90
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
91
+ config.attn_config['attn_impl'] = 'triton' # change this to use triton-based FlashAttention
92
+ config.init_device = 'cuda:0' # For fast initialization directly on GPU!
93
+
94
+ model = transformers.AutoModelForCausalLM.from_pretrained(
95
+ name,
96
+ config=config,
97
+ torch_dtype=torch.bfloat16, # Load model weights in bfloat16
98
+ trust_remote_code=True
99
+ )
100
+ ```
101
+
102
+ The model was initially trained with a sequence length of 4096, followed by an additional pretraining stage for sequence-length adaptation up to 8192. However, ALiBi enables users to increase the maximum sequence length even further during finetuning and/or inference. For example:
103
+
104
+ ```python
105
+ import transformers
106
+
107
+ name = 'mosaicml/mpt-30b'
108
+
109
+ config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
110
+ config.max_seq_len = 16384 # (input + output) tokens can now be up to 16384
111
+
112
+ model = transformers.AutoModelForCausalLM.from_pretrained(
113
+ name,
114
+ config=config,
115
+ trust_remote_code=True
116
+ )
117
+ ```
118
+
119
+ This model was trained with the MPT-30B tokenizer which is identical to the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer.
120
+
121
+ ```python
122
+ from transformers import AutoTokenizer
123
+ tokenizer = AutoTokenizer.from_pretrained('mosaicml/mpt-30b')
124
+ ```
125
+
126
+ The model can then be used, for example, within a text-generation pipeline.
127
+ Note: when running Torch modules in lower precision, it is best practice to use the [torch.autocast context manager](https://pytorch.org/docs/stable/amp.html).
128
+
129
+ ```python
130
+ from transformers import pipeline
131
+
132
+ with torch.autocast('cuda', dtype=torch.bfloat16):
133
+ inputs = tokenizer('Here is a recipe for vegan banana bread:\n', return_tensors="pt").to('cuda')
134
+ outputs = model.generate(**inputs, max_new_tokens=100)
135
+ print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
136
+
137
+ # or using the HF pipeline
138
+ pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
139
+ with torch.autocast('cuda', dtype=torch.bfloat16):
140
+ print(
141
+ pipe('Here is a recipe for vegan banana bread:\n',
142
+ max_new_tokens=100,
143
+ do_sample=True,
144
+ use_cache=True))
145
+ ```
146
+
147
+ ## Model Description
148
+
149
+ The architecture is a modification of a standard decoder-only transformer.
150
+
151
+ The model has been modified from a standard transformer in the following ways:
152
+ * It uses [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)
153
+ * It uses [ALiBi (Attention with Linear Biases)](https://arxiv.org/abs/2108.12409) and does not use positional embeddings
154
+ * It does not use biases
155
+
156
+
157
+ | Hyperparameter | Value |
158
+ |----------------|-------|
159
+ | n_parameters | 29.95B |
160
+ | n_layers | 48 |
161
+ | n_heads | 64 |
162
+ | d_model | 7168 |
163
+ | vocab size | 50432 |
164
+ | sequence length | 8192 |
165
+
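As a quick sanity check (an illustrative snippet, not part of the original model card), these values are exposed on the config object loaded with the `trust_remote_code` pattern shown above:

```python
import transformers

config = transformers.AutoConfig.from_pretrained('mosaicml/mpt-30b', trust_remote_code=True)
# These attributes should match the table above: 7168, 64, 48, 8192, 50432.
print(config.d_model, config.n_heads, config.n_layers, config.max_seq_len, config.vocab_size)
```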
166
+
167
+
168
+ ## Training Data
169
+
170
+ ### Streaming Datasets
171
+
172
+ Data was formatted using the MosaicML [StreamingDataset](https://github.com/mosaicml/streaming) library to host our data in object storage and efficiently stream it to our compute cluster during training.
173
+ StreamingDataset obviates the need to download the whole dataset before starting training, and allows instant resumption of training from any point in the dataset.
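As a rough sketch only (the bucket paths below are placeholders and this is not the actual training pipeline), a `StreamingDataset` points at shards in object storage and plugs into a standard PyTorch `DataLoader`:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset  # from the mosaicml `streaming` package

# Placeholder locations; the real training data locations are not public.
dataset = StreamingDataset(remote='s3://my-bucket/my-c4-copy', local='/tmp/c4-cache', shuffle=True)
loader = DataLoader(dataset, batch_size=8)
```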
174
+
175
+
176
+ ### Data Mix
177
+
178
+ The model was trained for 1T tokens on the following data mix:
179
+
180
+ | Data Source | Number of Tokens in Source | Proportion | Effective Number of Tokens | Epochs |
181
+ |-------------|----------------------------|------------|----------------------------|--------|
182
+ | mC4 3.1.0 - English (200+ words) | 2417.99 B | 33.50% | 335 B | 0.14 |
183
+ | c4 - English - SemDedup 80% | 100.42 B | 29.90% | 299 B | 2.98 |
184
+ | RedPajama - CommonCrawl | 878.45 B | 8.50% | 85 B | 0.097 |
185
+ | The Stack - Selected Languages | 463.78 B | 10.00% | 100 B | 0.22 |
186
+ | RedPajama - Wikipedia | 4.87 B | 4.00% | 40 B | 8.21 |
187
+ | The Stack - Markdown | 107.07 B | 4.50% | 45 B | 0.42 |
188
+ | Semantic Scholar ORC | 48.95 B | 3.30% | 33 B | 0.67 |
189
+ | RedPajama - Books | 26.02 B | 3.00% | 30 B | 1.15 |
190
+ | RedPajama - arXiv | 28.10 B | 1.90% | 19 B | 0.68 |
191
+ | RedPajama - StackExchange | 20.54 B | 1.40% | 14 B | 0.68 |
192
+
193
+ Samples for each batch were selected from one of the datasets with the probability specified above. The examples were shuffled within each dataset, and each example was constructed from as many sequences from that dataset as were necessary to fill the sequence length. To build 8k support into MPT-30B efficiently, we first pre-trained on 1T tokens using sequences that were 2k tokens long, and then trained for an additional 50B tokens using sequences that were 8k tokens long.
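To make the mixing procedure concrete, here is a simplified, illustrative sketch of probability-weighted source sampling plus sequence packing; it is not the actual training code (that lives in llm-foundry), and the weights shown are only a truncated subset of the table above:

```python
import random

# Truncated subset of the mix above; values are the sampling probabilities.
mix = {'mc4_en': 0.335, 'c4_semdedup': 0.299, 'stack_selected': 0.100, 'redpajama_cc': 0.085}

def pack_example(doc_iters, seq_len=8192):
    """Pick a source by probability, then concatenate shuffled docs until seq_len tokens are filled."""
    source = random.choices(list(mix), weights=list(mix.values()))[0]
    tokens = []
    while len(tokens) < seq_len:
        tokens.extend(next(doc_iters[source]))  # each iterator yields one tokenized document
    return source, tokens[:seq_len]
```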
194
+
195
+ The data was tokenized using the [EleutherAI/gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b) tokenizer. This BPE tokenizer has a number of desirable characteristics,
196
+ most of which are relevant for tokenizing code:
197
+ (1) It was trained on a diverse mix of data that includes code (The Pile);
198
+ (2) It applies consistent space delimitation, unlike the GPT2 tokenizer which tokenizes inconsistently depending on the presence of prefix spaces; and
199
+ (3) It contains tokens for repeated space characters, which allows superior compression of text with large amounts of repeated space characters.
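For instance (a quick illustration, not from the original card), the repeated-space tokens can be observed directly:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
# The eight leading spaces map to dedicated whitespace tokens rather than eight single-space tokens.
ids = tok('        return x')['input_ids']
print(len(ids), tok.convert_ids_to_tokens(ids))
```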
200
+
201
+ The model vocabulary size of 50432 was set to be a multiple of 128 (as in [MEGATRON-LM](https://arxiv.org/abs/1909.08053)).
202
+
203
+ ### Training Configuration
204
+
205
+ The model was trained in three stages using the [MosaicML Platform](https://www.mosaicml.com/platform):
206
+ (i) First it was trained on 440 A100-40GBs with a batch size of 1760.
207
+ (ii) Then, on 216 A100-40GBs with a batch size of 1728.
208
+ (iii) Training was completed on 256 H100-80GBs with a batch size of 512 with 8k context length and 50B tokens.
209
+ The model was trained with sharded data parallelism using [FSDP](https://pytorch.org/docs/stable/fsdp.html) and used the [LION](https://arxiv.org/abs/2302.06675) optimizer.
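For orientation only, sharded data parallelism with FSDP looks roughly like the sketch below; the actual run used Composer/llm-foundry on the MosaicML platform rather than a raw PyTorch loop, and AdamW is just a stand-in here since a LION implementation is not bundled with PyTorch:

```python
import torch
import transformers
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed has already been initialized across the GPUs in the job.
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-30b', trust_remote_code=True)
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # stand-in; the real run used LION
```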
210
+
211
+ ## Limitations and Biases
212
+
213
+ _The following language is modified from [EleutherAI's GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b)_
214
+
215
+ MPT-30B (Base) is **not** intended for deployment without finetuning.
216
+ It should not be used for human-facing interactions without further guardrails and user consent.
217
+
218
+ MPT-30B can produce factually incorrect output, and should not be relied on to produce factually accurate information.
219
+ MPT-30B was trained on various public datasets.
220
+ While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
221
+
222
+
223
+ ## MosaicML Platform
224
+
225
+ If you're interested in [training](https://www.mosaicml.com/training) and [deploying](https://www.mosaicml.com/inference) your own MPT or LLMs on the MosaicML Platform, [sign up here](https://forms.mosaicml.com/demo?utm_source=huggingface&utm_medium=referral&utm_campaign=mpt-30b).
226
+
227
+ ## Disclaimer
228
+
229
+ The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
230
+
231
+ ## Citation
232
+
233
+ Please cite this model using the following format:
234
 
235
  ```
236
+ @online{MosaicML2023Introducing,
237
+ author = {MosaicML NLP Team},
238
+ title = {Introducing MPT-30B: Raising the bar
239
+ for open-source foundation models},
240
+ year = {2023},
241
+ url = {www.mosaicml.com/blog/mpt-30b},
242
+ note = {Accessed: 2023-06-22},
243
+ urldate = {2023-06-22}
244
+ }
245
  ```
__pycache__/attention.cpython-310.pyc ADDED
Binary file (11.4 kB).
 
__pycache__/flash_attn_triton.cpython-310.pyc ADDED
Binary file (20.2 kB).
 
__pycache__/norm.cpython-310.pyc ADDED
Binary file (2.84 kB).
 
adapt_tokenizer.py ADDED
@@ -0,0 +1,41 @@
1
+ from typing import Union
2
+ from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast
3
+ Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
4
+ NUM_SENTINEL_TOKENS: int = 100
5
+
6
+ def adapt_tokenizer_for_denoising(tokenizer: Tokenizer):
7
+ """Adds sentinel tokens and padding token (if missing).
8
+
9
+ Expands the tokenizer vocabulary to include sentinel tokens
10
+ used in mixture-of-denoiser tasks as well as a padding token.
11
+
12
+ All added tokens are added as special tokens. No tokens are
13
+ added if sentinel tokens and padding token already exist.
14
+ """
15
+ sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
16
+ tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
17
+ if tokenizer.pad_token is None:
18
+ tokenizer.add_tokens('<pad>', special_tokens=True)
19
+ tokenizer.pad_token = '<pad>'
20
+ assert tokenizer.pad_token_id is not None
21
+ sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
22
+ _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
23
+ tokenizer.sentinel_token_ids = _sentinel_token_ids
24
+
25
+ class AutoTokenizerForMOD(AutoTokenizer):
26
+ """AutoTokenizer + Adaptation for MOD.
27
+
28
+ A simple wrapper around AutoTokenizer to make instantiating
29
+ an MOD-adapted tokenizer a bit easier.
30
+
31
+ MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
32
+ a padding token, and a property to get the token ids of the
33
+ sentinel tokens.
34
+ """
35
+
36
+ @classmethod
37
+ def from_pretrained(cls, *args, **kwargs):
38
+ """See `AutoTokenizer.from_pretrained` docstring."""
39
+ tokenizer = super().from_pretrained(*args, **kwargs)
40
+ adapt_tokenizer_for_denoising(tokenizer)
41
+ return tokenizer
attention.py ADDED
@@ -0,0 +1,285 @@
1
+ """Attention layers."""
2
+ import math
3
+ import warnings
4
+ from typing import Optional
5
+ import torch
6
+ import torch.nn as nn
7
+ from einops import rearrange
8
+ from packaging import version
9
+ from torch import nn
10
+ from norm import LPLayerNorm
11
+
12
+ def _reset_is_causal(num_query_tokens: int, num_key_tokens: int, original_is_causal: bool):
13
+ if original_is_causal and num_query_tokens != num_key_tokens:
14
+ if num_query_tokens != 1:
15
+ raise NotImplementedError('MPT does not support query and key with different number of tokens, unless number of query tokens is 1.')
16
+ else:
17
+ return False
18
+ return original_is_causal
19
+
20
+ def scaled_multihead_dot_product_attention(query, key, value, n_heads, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
21
+ q = rearrange(query, 'b s (h d) -> b h s d', h=n_heads)
22
+ k = rearrange(key, 'b s (h d) -> b h d s', h=1 if multiquery else n_heads)
23
+ v = rearrange(value, 'b s (h d) -> b h s d', h=1 if multiquery else n_heads)
24
+ min_val = torch.finfo(q.dtype).min
25
+ (b, _, s_q, d) = q.shape
26
+ s_k = k.size(-1)
27
+ if softmax_scale is None:
28
+ softmax_scale = 1 / math.sqrt(d)
29
+ attn_weight = q.matmul(k) * softmax_scale
30
+ if attn_bias is not None:
31
+ if attn_bias.size(-1) != 1 and attn_bias.size(-1) != s_k or (attn_bias.size(-2) != 1 and attn_bias.size(-2) != s_q):
32
+ raise RuntimeError(f'attn_bias (shape: {attn_bias.shape}) is expected to broadcast to shape: {attn_weight.shape}.')
33
+ attn_weight = attn_weight + attn_bias
34
+ if key_padding_mask is not None:
35
+ if attn_bias is not None:
36
+ warnings.warn('Propogating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unneccessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
37
+ attn_weight = attn_weight.masked_fill(~key_padding_mask.view((b, 1, 1, s_k)), min_val)
38
+ if is_causal:
39
+ s = max(s_q, s_k)
40
+ causal_mask = attn_weight.new_ones(s, s, dtype=torch.float16)
41
+ causal_mask = causal_mask.tril()
42
+ causal_mask = causal_mask.to(torch.bool)
43
+ causal_mask = ~causal_mask
44
+ causal_mask = causal_mask[-s_q:, -s_k:]
45
+ attn_weight = attn_weight.masked_fill(causal_mask.view(1, 1, s_q, s_k), min_val)
46
+ attn_weight = torch.softmax(attn_weight, dim=-1)
47
+ if dropout_p:
48
+ attn_weight = torch.nn.functional.dropout(attn_weight, p=dropout_p, training=training, inplace=True)
49
+ out = attn_weight.matmul(v)
50
+ out = rearrange(out, 'b h s d -> b s (h d)')
51
+ if needs_weights:
52
+ return (out, attn_weight)
53
+ return (out, None)
54
+
55
+ def check_valid_inputs(*tensors, valid_dtypes=[torch.float16, torch.bfloat16]):
56
+ for tensor in tensors:
57
+ if tensor.dtype not in valid_dtypes:
58
+ raise TypeError(f'tensor.dtype={tensor.dtype!r} must be in valid_dtypes={valid_dtypes!r}.')
59
+ if not tensor.is_cuda:
60
+ raise TypeError(f'Inputs must be cuda tensors (tensor.is_cuda={tensor.is_cuda!r}).')
61
+
62
+ def flash_attn_fn(query, key, value, n_heads, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
63
+ try:
64
+ from flash_attn import bert_padding, flash_attn_interface
65
+ except:
66
+ raise RuntimeError('Please install flash-attn==1.0.3.post0')
67
+ check_valid_inputs(query, key, value)
68
+ if attn_bias is not None:
69
+ raise NotImplementedError(f'attn_bias not implemented for flash attn.')
70
+ (batch_size, seqlen) = query.shape[:2]
71
+ if key_padding_mask is None:
72
+ key_padding_mask = torch.ones_like(key[:, :, 0], dtype=torch.bool)
73
+ query_padding_mask = key_padding_mask[:, -query.size(1):]
74
+ (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(query, query_padding_mask)
75
+ query_unpad = rearrange(query_unpad, 'nnz (h d) -> nnz h d', h=n_heads)
76
+ (key_unpad, _, cu_seqlens_k, max_seqlen_k) = bert_padding.unpad_input(key, key_padding_mask)
77
+ key_unpad = rearrange(key_unpad, 'nnz (h d) -> nnz h d', h=1 if multiquery else n_heads)
78
+ (value_unpad, _, _, _) = bert_padding.unpad_input(value, key_padding_mask)
79
+ value_unpad = rearrange(value_unpad, 'nnz (h d) -> nnz h d', h=1 if multiquery else n_heads)
80
+ if multiquery:
81
+ key_unpad = key_unpad.expand(key_unpad.size(0), n_heads, key_unpad.size(-1))
82
+ value_unpad = value_unpad.expand(value_unpad.size(0), n_heads, value_unpad.size(-1))
83
+ dropout_p = dropout_p if training else 0.0
84
+ reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
85
+ output_unpad = flash_attn_interface.flash_attn_unpadded_func(query_unpad, key_unpad, value_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p, softmax_scale=softmax_scale, causal=reset_is_causal, return_attn_probs=needs_weights)
86
+ output = bert_padding.pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'), indices_q, batch_size, seqlen)
87
+ return (output, None)
88
+
89
+ def triton_flash_attn_fn(query, key, value, n_heads, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
90
+ try:
91
+ from flash_attn_triton import flash_attn_func
92
+ except:
93
+ print("Did not work")
94
+ _installed = False
95
+ if version.parse(torch.__version__) < version.parse('2.0.0'):
96
+ _installed = True
97
+ try:
98
+ from flash_attn.flash_attn_triton import flash_attn_func
99
+ except:
100
+ _installed = False
101
+ if not _installed:
102
+ raise RuntimeError('Requirements for `attn_impl: triton` not installed. Either (1) have a CUDA-compatible GPU and `pip install .[gpu]` if installing from llm-foundry source or `pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python` if installing from pypi, or (2) use torch attn model.attn_config.attn_impl=torch (torch attn_impl will be slow). Note: (1) requires you have CMake and PyTorch already installed.')
103
+ check_valid_inputs(query, key, value)
104
+ if dropout_p:
105
+ raise NotImplementedError(f'Dropout not implemented for attn_impl: triton.')
106
+ if needs_weights:
107
+ raise NotImplementedError(f'attn_impl: triton cannot return attn weights.')
108
+ if key_padding_mask is not None:
109
+ warnings.warn('Propagating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unnecessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
110
+ (b_size, s_k) = key_padding_mask.shape[:2]
111
+ if attn_bias is None:
112
+ attn_bias = query.new_zeros(b_size, 1, 1, s_k)
113
+ attn_bias = attn_bias.masked_fill(~key_padding_mask.view((b_size, 1, 1, s_k)), torch.finfo(query.dtype).min)
114
+ query = rearrange(query, 'b s (h d) -> b s h d', h=n_heads)
115
+ key = rearrange(key, 'b s (h d) -> b s h d', h=1 if multiquery else n_heads)
116
+ value = rearrange(value, 'b s (h d) -> b s h d', h=1 if multiquery else n_heads)
117
+ if multiquery:
118
+ key = key.expand(*key.shape[:2], n_heads, key.size(-1))
119
+ value = value.expand(*value.shape[:2], n_heads, value.size(-1))
120
+ reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
121
+ attn_output = flash_attn_func(query, key, value, attn_bias, reset_is_causal, softmax_scale)
122
+ output = attn_output.view(*attn_output.shape[:2], -1)
123
+ return (output, None)
124
+
125
+ class MultiheadAttention(nn.Module):
126
+ """Multi-head self attention.
127
+ Using torch or triton attention implemetation enables user to also use
128
+ additive bias.
129
+ """
130
+
131
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, device: Optional[str]=None):
132
+ super().__init__()
133
+ self.attn_impl = attn_impl
134
+ self.clip_qkv = clip_qkv
135
+ self.qk_ln = qk_ln
136
+ self.d_model = d_model
137
+ self.n_heads = n_heads
138
+ self.softmax_scale = softmax_scale
139
+ if self.softmax_scale is None:
140
+ self.softmax_scale = 1 / math.sqrt(self.d_model / self.n_heads)
141
+ self.attn_dropout_p = attn_pdrop
142
+ self.Wqkv = nn.Linear(self.d_model, 3 * self.d_model, device=device)
143
+ fuse_splits = (d_model, 2 * d_model)
144
+ self.Wqkv._fused = (0, fuse_splits)
145
+ if self.qk_ln:
146
+ layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
147
+ self.q_ln = layernorm_class(self.d_model, device=device)
148
+ self.k_ln = layernorm_class(self.d_model, device=device)
149
+ if self.attn_impl == 'flash':
150
+ self.attn_fn = flash_attn_fn
151
+ elif self.attn_impl == 'triton':
152
+ self.attn_fn = triton_flash_attn_fn
153
+ warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
154
+ elif self.attn_impl == 'torch':
155
+ self.attn_fn = scaled_multihead_dot_product_attention
156
+ if torch.cuda.is_available():
157
+ warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
158
+ else:
159
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
160
+ self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
161
+ self.out_proj._is_residual = True
162
+
163
+ def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
164
+ qkv = self.Wqkv(x)
165
+ if self.clip_qkv:
166
+ qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
167
+ (query, key, value) = qkv.chunk(3, dim=2)
168
+ key_padding_mask = attention_mask
169
+ if self.qk_ln:
170
+ dtype = query.dtype
171
+ query = self.q_ln(query).to(dtype)
172
+ key = self.k_ln(key).to(dtype)
173
+ if past_key_value is not None:
174
+ if len(past_key_value) != 0:
175
+ key = torch.cat([past_key_value[0], key], dim=1)
176
+ value = torch.cat([past_key_value[1], value], dim=1)
177
+ past_key_value = (key, value)
178
+ if attn_bias is not None:
179
+ attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
180
+ (context, attn_weights) = self.attn_fn(query, key, value, self.n_heads, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights)
181
+
182
+ return (self.out_proj(context), attn_weights, past_key_value)
183
+
184
+ class MultiQueryAttention(nn.Module):
185
+ """Multi-Query self attention.
186
+ Using torch or triton attention implemetation enables user to also use
187
+ additive bias.
188
+ """
189
+
190
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, device: Optional[str]=None):
191
+ super().__init__()
192
+ self.attn_impl = attn_impl
193
+ self.clip_qkv = clip_qkv
194
+ self.qk_ln = qk_ln
195
+ self.d_model = d_model
196
+ self.n_heads = n_heads
197
+ self.head_dim = d_model // n_heads
198
+ self.softmax_scale = softmax_scale
199
+ if self.softmax_scale is None:
200
+ self.softmax_scale = 1 / math.sqrt(self.head_dim)
201
+ self.attn_dropout_p = attn_pdrop
202
+ self.Wqkv = nn.Linear(d_model, d_model + 2 * self.head_dim, device=device)
203
+ fuse_splits = (d_model, d_model + self.head_dim)
204
+ self.Wqkv._fused = (0, fuse_splits)
205
+ if self.qk_ln:
206
+ layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
207
+ self.q_ln = layernorm_class(d_model, device=device)
208
+ self.k_ln = layernorm_class(self.head_dim, device=device)
209
+ if self.attn_impl == 'flash':
210
+ self.attn_fn = flash_attn_fn
211
+ elif self.attn_impl == 'triton':
212
+ self.attn_fn = triton_flash_attn_fn
213
+ warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
214
+ elif self.attn_impl == 'torch':
215
+ self.attn_fn = scaled_multihead_dot_product_attention
216
+ if torch.cuda.is_available():
217
+ warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
218
+ else:
219
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
220
+ self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
221
+ self.out_proj._is_residual = True
222
+
223
+ def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
224
+ qkv = self.Wqkv(x)
225
+ if self.clip_qkv:
226
+ qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
227
+ (query, key, value) = qkv.split([self.d_model, self.head_dim, self.head_dim], dim=2)
228
+ key_padding_mask = attention_mask
229
+ if self.qk_ln:
230
+ dtype = query.dtype
231
+ query = self.q_ln(query).to(dtype)
232
+ key = self.k_ln(key).to(dtype)
233
+ if past_key_value is not None:
234
+ if len(past_key_value) != 0:
235
+ key = torch.cat([past_key_value[0], key], dim=1)
236
+ value = torch.cat([past_key_value[1], value], dim=1)
237
+ past_key_value = (key, value)
238
+ if attn_bias is not None:
239
+ attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
240
+ (context, attn_weights) = self.attn_fn(query, key, value, self.n_heads, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights, multiquery=True)
241
+ return (self.out_proj(context), attn_weights, past_key_value)
242
+
243
+ def attn_bias_shape(attn_impl, n_heads, seq_len, alibi, prefix_lm, causal, use_sequence_id):
244
+ if attn_impl == 'flash':
245
+ return None
246
+ elif attn_impl in ['torch', 'triton']:
247
+ if alibi:
248
+ if (prefix_lm or not causal) or use_sequence_id:
249
+ return (1, n_heads, seq_len, seq_len)
250
+ return (1, n_heads, 1, seq_len)
251
+ elif prefix_lm or use_sequence_id:
252
+ return (1, 1, seq_len, seq_len)
253
+ return None
254
+ else:
255
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
256
+
257
+ def build_attn_bias(attn_impl, attn_bias, n_heads, seq_len, causal=False, alibi=False, alibi_bias_max=8):
258
+ if attn_impl == 'flash':
259
+ return None
260
+ elif attn_impl in ['torch', 'triton']:
261
+ if alibi:
262
+ (device, dtype) = (attn_bias.device, attn_bias.dtype)
263
+ attn_bias = attn_bias.add(build_alibi_bias(n_heads, seq_len, full=not causal, alibi_bias_max=alibi_bias_max, device=device, dtype=dtype))
264
+ return attn_bias
265
+ else:
266
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
267
+
268
+ def gen_slopes(n_heads, alibi_bias_max=8, device=None):
269
+ _n_heads = 2 ** math.ceil(math.log2(n_heads))
270
+ m = torch.arange(1, _n_heads + 1, dtype=torch.float32, device=device)
271
+ m = m.mul(alibi_bias_max / _n_heads)
272
+ slopes = 1.0 / torch.pow(2, m)
273
+ if _n_heads != n_heads:
274
+ slopes = torch.concat([slopes[1::2], slopes[::2]])[:n_heads]
275
+ return slopes.view(1, n_heads, 1, 1)
276
+
277
+ def build_alibi_bias(n_heads, seq_len, full=False, alibi_bias_max=8, device=None, dtype=None):
278
+ alibi_bias = torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, 1, seq_len)
279
+ if full:
280
+ alibi_bias = alibi_bias - torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, seq_len, 1)
281
+ alibi_bias = alibi_bias.abs().mul(-1)
282
+ slopes = gen_slopes(n_heads, alibi_bias_max, device=device)
283
+ alibi_bias = alibi_bias * slopes
284
+ return alibi_bias.to(dtype=dtype)
285
+ ATTN_CLASS_REGISTRY = {'multihead_attention': MultiheadAttention, 'multiquery_attention': MultiQueryAttention}
bak/adapt_tokenizer.py ADDED
@@ -0,0 +1,41 @@
1
+ from typing import Union
2
+ from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast
3
+ Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
4
+ NUM_SENTINEL_TOKENS: int = 100
5
+
6
+ def adapt_tokenizer_for_denoising(tokenizer: Tokenizer):
7
+ """Adds sentinel tokens and padding token (if missing).
8
+
9
+ Expands the tokenizer vocabulary to include sentinel tokens
10
+ used in mixture-of-denoiser tasks as well as a padding token.
11
+
12
+ All added tokens are added as special tokens. No tokens are
13
+ added if sentinel tokens and padding token already exist.
14
+ """
15
+ sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
16
+ tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
17
+ if tokenizer.pad_token is None:
18
+ tokenizer.add_tokens('<pad>', special_tokens=True)
19
+ tokenizer.pad_token = '<pad>'
20
+ assert tokenizer.pad_token_id is not None
21
+ sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
22
+ _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
23
+ tokenizer.sentinel_token_ids = _sentinel_token_ids
24
+
25
+ class AutoTokenizerForMOD(AutoTokenizer):
26
+ """AutoTokenizer + Adaptation for MOD.
27
+
28
+ A simple wrapper around AutoTokenizer to make instantiating
29
+ an MOD-adapted tokenizer a bit easier.
30
+
31
+ MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
32
+ a padding token, and a property to get the token ids of the
33
+ sentinel tokens.
34
+ """
35
+
36
+ @classmethod
37
+ def from_pretrained(cls, *args, **kwargs):
38
+ """See `AutoTokenizer.from_pretrained` docstring."""
39
+ tokenizer = super().from_pretrained(*args, **kwargs)
40
+ adapt_tokenizer_for_denoising(tokenizer)
41
+ return tokenizer
bak/attention.py ADDED
@@ -0,0 +1,314 @@
1
+ """Attention layers."""
2
+ import math
3
+ import warnings
4
+ from typing import Optional
5
+ import torch
6
+ import torch.nn as nn
7
+ from einops import rearrange
8
+ from packaging import version
9
+ from torch import nn
10
+ from .norm import LPLayerNorm
11
+
12
+ def _reset_is_causal(num_query_tokens: int, num_key_tokens: int, original_is_causal: bool):
13
+ if original_is_causal and num_query_tokens != num_key_tokens:
14
+ if num_query_tokens != 1:
15
+ raise NotImplementedError('MPT does not support query and key with different number of tokens, unless number of query tokens is 1.')
16
+ else:
17
+ return False
18
+ return original_is_causal
19
+
20
+ def scaled_multihead_dot_product_attention(query, key, value, n_heads, past_key_value=None, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
21
+ q = rearrange(query, 'b s (h d) -> b h s d', h=n_heads)
22
+ kv_n_heads = 1 if multiquery else n_heads
23
+ k = rearrange(key, 'b s (h d) -> b h d s', h=kv_n_heads)
24
+ v = rearrange(value, 'b s (h d) -> b h s d', h=kv_n_heads)
25
+ if past_key_value is not None:
26
+ if len(past_key_value) != 0:
27
+ k = torch.cat([past_key_value[0], k], dim=3)
28
+ v = torch.cat([past_key_value[1], v], dim=2)
29
+ past_key_value = (k, v)
30
+ (b, _, s_q, d) = q.shape
31
+ s_k = k.size(-1)
32
+ if softmax_scale is None:
33
+ softmax_scale = 1 / math.sqrt(d)
34
+ attn_weight = q.matmul(k) * softmax_scale
35
+ if attn_bias is not None:
36
+ _s_q = max(0, attn_bias.size(2) - s_q)
37
+ _s_k = max(0, attn_bias.size(3) - s_k)
38
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
39
+ if attn_bias.size(-1) != 1 and attn_bias.size(-1) != s_k or (attn_bias.size(-2) != 1 and attn_bias.size(-2) != s_q):
40
+ raise RuntimeError(f'attn_bias (shape: {attn_bias.shape}) is expected to broadcast to shape: {attn_weight.shape}.')
41
+ attn_weight = attn_weight + attn_bias
42
+ min_val = torch.finfo(q.dtype).min
43
+ if key_padding_mask is not None:
44
+ if attn_bias is not None:
45
+ warnings.warn('Propogating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unneccessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
46
+ attn_weight = attn_weight.masked_fill(~key_padding_mask.view((b, 1, 1, s_k)), min_val)
47
+ if is_causal and (not q.size(2) == 1):
48
+ s = max(s_q, s_k)
49
+ causal_mask = attn_weight.new_ones(s, s, dtype=torch.float16)
50
+ causal_mask = causal_mask.tril()
51
+ causal_mask = causal_mask.to(torch.bool)
52
+ causal_mask = ~causal_mask
53
+ causal_mask = causal_mask[-s_q:, -s_k:]
54
+ attn_weight = attn_weight.masked_fill(causal_mask.view(1, 1, s_q, s_k), min_val)
55
+ attn_weight = torch.softmax(attn_weight, dim=-1)
56
+ if dropout_p:
57
+ attn_weight = torch.nn.functional.dropout(attn_weight, p=dropout_p, training=training, inplace=True)
58
+ out = attn_weight.to(v.dtype).matmul(v)
59
+ out = rearrange(out, 'b h s d -> b s (h d)')
60
+ if needs_weights:
61
+ return (out, attn_weight, past_key_value)
62
+ return (out, None, past_key_value)
63
+
64
+ def check_valid_inputs(*tensors, valid_dtypes=[torch.float16, torch.bfloat16]):
65
+ for tensor in tensors:
66
+ if tensor.dtype not in valid_dtypes:
67
+ raise TypeError(f'tensor.dtype={tensor.dtype!r} must be in valid_dtypes={valid_dtypes!r}.')
68
+ if not tensor.is_cuda:
69
+ raise TypeError(f'Inputs must be cuda tensors (tensor.is_cuda={tensor.is_cuda!r}).')
70
+
71
+ def flash_attn_fn(query, key, value, n_heads, past_key_value=None, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
72
+ try:
73
+ from flash_attn import bert_padding, flash_attn_interface
74
+ except:
75
+ raise RuntimeError('Please install flash-attn==1.0.3.post0')
76
+ check_valid_inputs(query, key, value)
77
+ if past_key_value is not None:
78
+ if len(past_key_value) != 0:
79
+ key = torch.cat([past_key_value[0], key], dim=1)
80
+ value = torch.cat([past_key_value[1], value], dim=1)
81
+ past_key_value = (key, value)
82
+ if attn_bias is not None:
83
+ _s_q = max(0, attn_bias.size(2) - query.size(1))
84
+ _s_k = max(0, attn_bias.size(3) - key.size(1))
85
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
86
+ if attn_bias is not None:
87
+ raise NotImplementedError(f'attn_bias not implemented for flash attn.')
88
+ (batch_size, seqlen) = query.shape[:2]
89
+ if key_padding_mask is None:
90
+ key_padding_mask = torch.ones_like(key[:, :, 0], dtype=torch.bool)
91
+ query_padding_mask = key_padding_mask[:, -query.size(1):]
92
+ (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(query, query_padding_mask)
93
+ query_unpad = rearrange(query_unpad, 'nnz (h d) -> nnz h d', h=n_heads)
94
+ (key_unpad, _, cu_seqlens_k, max_seqlen_k) = bert_padding.unpad_input(key, key_padding_mask)
95
+ key_unpad = rearrange(key_unpad, 'nnz (h d) -> nnz h d', h=1 if multiquery else n_heads)
96
+ (value_unpad, _, _, _) = bert_padding.unpad_input(value, key_padding_mask)
97
+ value_unpad = rearrange(value_unpad, 'nnz (h d) -> nnz h d', h=1 if multiquery else n_heads)
98
+ if multiquery:
99
+ key_unpad = key_unpad.expand(key_unpad.size(0), n_heads, key_unpad.size(-1))
100
+ value_unpad = value_unpad.expand(value_unpad.size(0), n_heads, value_unpad.size(-1))
101
+ dropout_p = dropout_p if training else 0.0
102
+ reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
103
+ output_unpad = flash_attn_interface.flash_attn_unpadded_func(query_unpad, key_unpad, value_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p, softmax_scale=softmax_scale, causal=reset_is_causal, return_attn_probs=needs_weights)
104
+ output = bert_padding.pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'), indices_q, batch_size, seqlen)
105
+ return (output, None, past_key_value)
106
+
107
+ def triton_flash_attn_fn(query, key, value, n_heads, past_key_value=None, softmax_scale=None, attn_bias=None, key_padding_mask=None, is_causal=False, dropout_p=0.0, training=False, needs_weights=False, multiquery=False):
108
+ try:
109
+ from .flash_attn_triton import flash_attn_func
110
+ except:
111
+ _installed = False
112
+ if version.parse(torch.__version__) < version.parse('2.0.0'):
113
+ _installed = True
114
+ try:
115
+ from flash_attn.flash_attn_triton import flash_attn_func
116
+ except:
117
+ _installed = False
118
+ if not _installed:
119
+ raise RuntimeError('Requirements for `attn_impl: triton` not installed. Either (1) have a CUDA-compatible GPU and `pip install .[gpu]` if installing from llm-foundry source or `pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python` if installing from pypi, or (2) use torch attn model.attn_config.attn_impl=torch (torch attn_impl will be slow). Note: (1) requires you have CMake and PyTorch already installed.')
120
+ check_valid_inputs(query, key, value)
121
+ if past_key_value is not None:
122
+ if len(past_key_value) != 0:
123
+ key = torch.cat([past_key_value[0], key], dim=1)
124
+ value = torch.cat([past_key_value[1], value], dim=1)
125
+ past_key_value = (key, value)
126
+ if attn_bias is not None:
127
+ _s_q = max(0, attn_bias.size(2) - query.size(1))
128
+ _s_k = max(0, attn_bias.size(3) - key.size(1))
129
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
130
+ if dropout_p:
131
+ raise NotImplementedError(f'Dropout not implemented for attn_impl: triton.')
132
+ if needs_weights:
133
+ raise NotImplementedError(f'attn_impl: triton cannot return attn weights.')
134
+ if key_padding_mask is not None:
135
+ warnings.warn('Propagating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unnecessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
136
+ (b_size, s_k) = key_padding_mask.shape[:2]
137
+ if attn_bias is None:
138
+ attn_bias = query.new_zeros(b_size, 1, 1, s_k)
139
+ attn_bias = attn_bias.masked_fill(~key_padding_mask.view((b_size, 1, 1, s_k)), torch.finfo(query.dtype).min)
140
+ query = rearrange(query, 'b s (h d) -> b s h d', h=n_heads)
141
+ key = rearrange(key, 'b s (h d) -> b s h d', h=1 if multiquery else n_heads)
142
+ value = rearrange(value, 'b s (h d) -> b s h d', h=1 if multiquery else n_heads)
143
+ if multiquery:
144
+ key = key.expand(*key.shape[:2], n_heads, key.size(-1))
145
+ value = value.expand(*value.shape[:2], n_heads, value.size(-1))
146
+ reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
147
+ attn_output = flash_attn_func(query, key, value, attn_bias, reset_is_causal, softmax_scale)
148
+ output = attn_output.view(*attn_output.shape[:2], -1)
149
+ return (output, None, past_key_value)
150
+
151
+ class MultiheadAttention(nn.Module):
152
+ """Multi-head self attention.
153
+
154
+ Using torch or triton attention implemetation enables user to also use
155
+ additive bias.
156
+ """
157
+
158
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, verbose: int=0, device: Optional[str]=None):
159
+ super().__init__()
160
+ self.attn_impl = attn_impl
161
+ self.clip_qkv = clip_qkv
162
+ self.qk_ln = qk_ln
163
+ self.d_model = d_model
164
+ self.n_heads = n_heads
165
+ self.softmax_scale = softmax_scale
166
+ if self.softmax_scale is None:
167
+ self.softmax_scale = 1 / math.sqrt(self.d_model / self.n_heads)
168
+ self.attn_dropout_p = attn_pdrop
169
+ self.Wqkv = nn.Linear(self.d_model, 3 * self.d_model, device=device)
170
+ fuse_splits = (d_model, 2 * d_model)
171
+ self.Wqkv._fused = (0, fuse_splits)
172
+ if self.qk_ln:
173
+ layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
174
+ self.q_ln = layernorm_class(self.d_model, device=device)
175
+ self.k_ln = layernorm_class(self.d_model, device=device)
176
+ if self.attn_impl == 'flash':
177
+ self.attn_fn = flash_attn_fn
178
+ elif self.attn_impl == 'triton':
179
+ self.attn_fn = triton_flash_attn_fn
180
+ if verbose:
181
+ warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
182
+ elif self.attn_impl == 'torch':
183
+ self.attn_fn = scaled_multihead_dot_product_attention
184
+ if torch.cuda.is_available() and verbose:
185
+ warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
186
+ else:
187
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
188
+ self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
189
+ self.out_proj._is_residual = True
190
+
191
+ def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
192
+ qkv = self.Wqkv(x)
193
+ if self.clip_qkv:
194
+ qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
195
+ (query, key, value) = qkv.chunk(3, dim=2)
196
+ key_padding_mask = attention_mask
197
+ if self.qk_ln:
198
+ dtype = query.dtype
199
+ query = self.q_ln(query).to(dtype)
200
+ key = self.k_ln(key).to(dtype)
201
+ if past_key_value is not None:
202
+ if len(past_key_value) != 0:
203
+ key = torch.cat([past_key_value[0], key], dim=1)
204
+ value = torch.cat([past_key_value[1], value], dim=1)
205
+ past_key_value = (key, value)
206
+ if attn_bias is not None:
207
+ attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
208
+ (context, attn_weights, past_key_value) = self.attn_fn(query, key, value, self.n_heads, past_key_value=past_key_value, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights)
209
+ return (self.out_proj(context), attn_weights, past_key_value)
210
+
211
+ class MultiQueryAttention(nn.Module):
212
+ """Multi-Query self attention.
213
+
214
+ Using torch or triton attention implemetation enables user to also use
215
+ additive bias.
216
+ """
217
+
218
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, low_precision_layernorm: bool=False, verbose: int=0, device: Optional[str]=None):
219
+ super().__init__()
220
+ self.attn_impl = attn_impl
221
+ self.clip_qkv = clip_qkv
222
+ self.qk_ln = qk_ln
223
+ self.d_model = d_model
224
+ self.n_heads = n_heads
225
+ self.head_dim = d_model // n_heads
226
+ self.softmax_scale = softmax_scale
227
+ if self.softmax_scale is None:
228
+ self.softmax_scale = 1 / math.sqrt(self.head_dim)
229
+ self.attn_dropout_p = attn_pdrop
230
+ self.Wqkv = nn.Linear(d_model, d_model + 2 * self.head_dim, device=device)
231
+ fuse_splits = (d_model, d_model + self.head_dim)
232
+ self.Wqkv._fused = (0, fuse_splits)
233
+ if self.qk_ln:
234
+ layernorm_class = LPLayerNorm if low_precision_layernorm else nn.LayerNorm
235
+ self.q_ln = layernorm_class(d_model, device=device)
236
+ self.k_ln = layernorm_class(self.head_dim, device=device)
237
+ if self.attn_impl == 'flash':
238
+ self.attn_fn = flash_attn_fn
239
+ elif self.attn_impl == 'triton':
240
+ self.attn_fn = triton_flash_attn_fn
241
+ if verbose:
242
+ warnings.warn('While `attn_impl: triton` can be faster than `attn_impl: flash` ' + 'it uses more memory. When training larger models this can trigger ' + 'alloc retries which hurts performance. If encountered, we recommend ' + 'using `attn_impl: flash` if your model does not use `alibi` or `prefix_lm`.')
243
+ elif self.attn_impl == 'torch':
244
+ self.attn_fn = scaled_multihead_dot_product_attention
245
+ if torch.cuda.is_available() and verbose:
246
+ warnings.warn('Using `attn_impl: torch`. If your model does not use `alibi` or ' + '`prefix_lm` we recommend using `attn_impl: flash` otherwise ' + 'we recommend using `attn_impl: triton`.')
247
+ else:
248
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
249
+ self.out_proj = nn.Linear(self.d_model, self.d_model, device=device)
250
+ self.out_proj._is_residual = True
251
+
252
+ def forward(self, x, past_key_value=None, attn_bias=None, attention_mask=None, is_causal=True, needs_weights=False):
253
+ qkv = self.Wqkv(x)
254
+ if self.clip_qkv:
255
+ qkv.clamp_(min=-self.clip_qkv, max=self.clip_qkv)
256
+ (query, key, value) = qkv.split([self.d_model, self.head_dim, self.head_dim], dim=2)
257
+ key_padding_mask = attention_mask
258
+ if self.qk_ln:
259
+ dtype = query.dtype
260
+ query = self.q_ln(query).to(dtype)
261
+ key = self.k_ln(key).to(dtype)
262
+ if past_key_value is not None:
263
+ if len(past_key_value) != 0:
264
+ key = torch.cat([past_key_value[0], key], dim=1)
265
+ value = torch.cat([past_key_value[1], value], dim=1)
266
+ past_key_value = (key, value)
267
+ if attn_bias is not None:
268
+ attn_bias = attn_bias[:, :, -query.size(1):, -key.size(1):]
269
+ (context, attn_weights, past_key_value) = self.attn_fn(query, key, value, self.n_heads, past_key_value=past_key_value, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights, multiquery=True)
270
+ return (self.out_proj(context), attn_weights, past_key_value)
271
+
272
+ def attn_bias_shape(attn_impl, n_heads, seq_len, alibi, prefix_lm, causal, use_sequence_id):
273
+ if attn_impl == 'flash':
274
+ return None
275
+ elif attn_impl in ['torch', 'triton']:
276
+ if alibi:
277
+ if (prefix_lm or not causal) or use_sequence_id:
278
+ return (1, n_heads, seq_len, seq_len)
279
+ return (1, n_heads, 1, seq_len)
280
+ elif prefix_lm or use_sequence_id:
281
+ return (1, 1, seq_len, seq_len)
282
+ return None
283
+ else:
284
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
285
+
286
+ def build_attn_bias(attn_impl, attn_bias, n_heads, seq_len, causal=False, alibi=False, alibi_bias_max=8):
287
+ if attn_impl == 'flash':
288
+ return None
289
+ elif attn_impl in ['torch', 'triton']:
290
+ if alibi:
291
+ (device, dtype) = (attn_bias.device, attn_bias.dtype)
292
+ attn_bias = attn_bias.add(build_alibi_bias(n_heads, seq_len, full=not causal, alibi_bias_max=alibi_bias_max, device=device, dtype=dtype))
293
+ return attn_bias
294
+ else:
295
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
296
+
297
+ def gen_slopes(n_heads, alibi_bias_max=8, device=None):
298
+ _n_heads = 2 ** math.ceil(math.log2(n_heads))
299
+ m = torch.arange(1, _n_heads + 1, dtype=torch.float32, device=device)
300
+ m = m.mul(alibi_bias_max / _n_heads)
301
+ slopes = 1.0 / torch.pow(2, m)
302
+ if _n_heads != n_heads:
303
+ slopes = torch.concat([slopes[1::2], slopes[::2]])[:n_heads]
304
+ return slopes.view(1, n_heads, 1, 1)
305
+
306
+ def build_alibi_bias(n_heads, seq_len, full=False, alibi_bias_max=8, device=None, dtype=None):
307
+ alibi_bias = torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, 1, seq_len)
308
+ if full:
309
+ alibi_bias = alibi_bias - torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, seq_len, 1)
310
+ alibi_bias = alibi_bias.abs().mul(-1)
311
+ slopes = gen_slopes(n_heads, alibi_bias_max, device=device)
312
+ alibi_bias = alibi_bias * slopes
313
+ return alibi_bias.to(dtype=dtype)
314
+ ATTN_CLASS_REGISTRY = {'multihead_attention': MultiheadAttention, 'multiquery_attention': MultiQueryAttention}
bak/blocks.py ADDED
@@ -0,0 +1,41 @@
+ """GPT Blocks used for the GPT Model."""
+ from typing import Dict, Optional, Tuple
+ import torch
+ import torch.nn as nn
+ from .attention import ATTN_CLASS_REGISTRY
+ from .norm import NORM_CLASS_REGISTRY
+
+ class MPTMLP(nn.Module):
+
+     def __init__(self, d_model: int, expansion_ratio: int, device: Optional[str]=None):
+         super().__init__()
+         self.up_proj = nn.Linear(d_model, expansion_ratio * d_model, device=device)
+         self.act = nn.GELU(approximate='none')
+         self.down_proj = nn.Linear(expansion_ratio * d_model, d_model, device=device)
+         self.down_proj._is_residual = True
+
+     def forward(self, x):
+         return self.down_proj(self.act(self.up_proj(x)))
+
+ class MPTBlock(nn.Module):
+
+     def __init__(self, d_model: int, n_heads: int, expansion_ratio: int, attn_config: Dict={'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}, resid_pdrop: float=0.0, norm_type: str='low_precision_layernorm', verbose: int=0, device: Optional[str]=None, **kwargs):
+         del kwargs
+         super().__init__()
+         norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
+         attn_class = ATTN_CLASS_REGISTRY[attn_config['attn_type']]
+         self.norm_1 = norm_class(d_model, device=device)
+         self.attn = attn_class(attn_impl=attn_config['attn_impl'], clip_qkv=attn_config['clip_qkv'], qk_ln=attn_config['qk_ln'], softmax_scale=attn_config['softmax_scale'], attn_pdrop=attn_config['attn_pdrop'], d_model=d_model, n_heads=n_heads, verbose=verbose, device=device)
+         self.norm_2 = norm_class(d_model, device=device)
+         self.ffn = MPTMLP(d_model=d_model, expansion_ratio=expansion_ratio, device=device)
+         self.resid_attn_dropout = nn.Dropout(resid_pdrop)
+         self.resid_ffn_dropout = nn.Dropout(resid_pdrop)
+
+     def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.ByteTensor]=None, is_causal: bool=True) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor]]]:
+         a = self.norm_1(x)
+         (b, attn_weights, past_key_value) = self.attn(a, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=is_causal)
+         x = x + self.resid_attn_dropout(b)
+         m = self.norm_2(x)
+         n = self.ffn(m)
+         x = x + self.resid_ffn_dropout(n)
+         return (x, attn_weights, past_key_value)
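
The block above is a standard pre-norm transformer layer: normalize, attend, add the residual, then normalize, run the MLP, and add the residual again. A minimal sketch of that dataflow using stand-in PyTorch modules (not the actual MPT attention/norm classes):

```
import torch
import torch.nn as nn

d_model = 8
norm_1, norm_2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)  # stand-in for MPT attention
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

x = torch.randn(1, 5, d_model)
a = norm_1(x)
b, _ = attn(a, a, a, need_weights=False)
x = x + b                # residual around attention
x = x + ffn(norm_2(x))   # residual around the MLP
print(x.shape)           # torch.Size([1, 5, 8])
```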
bak/configuration_mpt.py ADDED
@@ -0,0 +1,118 @@
+ """A HuggingFace-style model configuration."""
+ from typing import Dict, Optional, Union
+ from transformers import PretrainedConfig
+ attn_config_defaults: Dict = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
+ init_config_defaults: Dict = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu', 'init_div_is_residual': True, 'emb_init_std': None, 'emb_init_uniform_lim': None, 'init_std': None, 'init_gain': 0.0}
+
+ class MPTConfig(PretrainedConfig):
+     model_type = 'mpt'
+
+     def __init__(self, d_model: int=2048, n_heads: int=16, n_layers: int=24, expansion_ratio: int=4, max_seq_len: int=2048, vocab_size: int=50368, resid_pdrop: float=0.0, emb_pdrop: float=0.0, learned_pos_emb: bool=True, attn_config: Dict=attn_config_defaults, init_device: str='cpu', logit_scale: Optional[Union[float, str]]=None, no_bias: bool=False, verbose: int=0, embedding_fraction: float=1.0, norm_type: str='low_precision_layernorm', use_cache: bool=False, init_config: Dict=init_config_defaults, **kwargs):
+         """The MPT configuration class.
+
+         Args:
+             d_model (int): The size of the embedding dimension of the model.
+             n_heads (int): The number of attention heads.
+             n_layers (int): The number of layers in the model.
+             expansion_ratio (int): The ratio of the up/down scale in the MLP.
+             max_seq_len (int): The maximum sequence length of the model.
+             vocab_size (int): The size of the vocabulary.
+             resid_pdrop (float): The dropout probability applied to the attention output before combining with residual.
+             emb_pdrop (float): The dropout probability for the embedding layer.
+             learned_pos_emb (bool): Whether to use learned positional embeddings
+             attn_config (Dict): A dictionary used to configure the model's attention module:
+                 attn_type (str): type of attention to use. Options: multihead_attention, multiquery_attention
+                 attn_pdrop (float): The dropout probability for the attention layers.
+                 attn_impl (str): The attention implementation to use. One of 'torch', 'flash', or 'triton'.
+                 qk_ln (bool): Whether to apply layer normalization to the queries and keys in the attention layer.
+                 clip_qkv (Optional[float]): If not None, clip the queries, keys, and values in the attention layer to
+                     this value.
+                 softmax_scale (Optional[float]): If not None, scale the softmax in the attention layer by this value. If None,
+                     use the default scale of ``1/sqrt(d_keys)``.
+                 prefix_lm (Optional[bool]): Whether the model should operate as a Prefix LM. This requires passing an
+                     extra `prefix_mask` argument which indicates which tokens belong to the prefix. Tokens in the prefix
+                     can attend to one another bi-directionally. Tokens outside the prefix use causal attention.
+                 attn_uses_sequence_id (Optional[bool]): Whether to restrict attention to tokens that have the same sequence_id.
+                     When the model is in `train` mode, this requires passing an extra `sequence_id` argument which indicates
+                     which sub-sequence each token belongs to.
+                     Defaults to ``False`` meaning any provided `sequence_id` will be ignored.
+                 alibi (bool): Whether to use the alibi bias instead of position embeddings.
+                 alibi_bias_max (int): The maximum value of the alibi bias.
+             init_device (str): The device to use for parameter initialization.
+             logit_scale (Optional[Union[float, str]]): If not None, scale the logits by this value.
+             no_bias (bool): Whether to use bias in all layers.
+             verbose (int): The verbosity level. 0 is silent.
+             embedding_fraction (float): The fraction to scale the gradients of the embedding layer by.
+             norm_type (str): choose type of norm to use
+             multiquery_attention (bool): Whether to use multiquery attention implementation.
+             use_cache (bool): Whether or not the model should return the last key/values attentions
+             init_config (Dict): A dictionary used to configure the model initialization:
+                 init_config.name: The parameter initialization scheme to use. Options: 'default_', 'baseline_',
+                     'kaiming_uniform_', 'kaiming_normal_', 'neox_init_', 'small_init_', 'xavier_uniform_', or
+                     'xavier_normal_'. These mimic the parameter initialization methods in PyTorch.
+                 init_div_is_residual (Union[int, float, str, bool]): Value to divide initial weights by if ``module._is_residual`` is True.
+                 emb_init_std (Optional[float]): The standard deviation of the normal distribution used to initialize the embedding layer.
+                 emb_init_uniform_lim (Optional[Union[Tuple[float, float], float]]): The lower and upper limits of the uniform distribution
+                     used to initialize the embedding layer. Mutually exclusive with ``emb_init_std``.
+                 init_std (float): The standard deviation of the normal distribution used to initialize the model,
+                     if using the baseline_ parameter initialization scheme.
+                 init_gain (float): The gain to use for parameter initialization with kaiming or xavier initialization schemes.
+                 fan_mode (str): The fan mode to use for parameter initialization with kaiming initialization schemes.
+                 init_nonlinearity (str): The nonlinearity to use for parameter initialization with kaiming initialization schemes.
+             ---
+             See llmfoundry.models.utils.param_init_fns.py for info on other param init config options
+         """
+         self.d_model = d_model
+         self.n_heads = n_heads
+         self.n_layers = n_layers
+         self.expansion_ratio = expansion_ratio
+         self.max_seq_len = max_seq_len
+         self.vocab_size = vocab_size
+         self.resid_pdrop = resid_pdrop
+         self.emb_pdrop = emb_pdrop
+         self.learned_pos_emb = learned_pos_emb
+         self.attn_config = attn_config
+         self.init_device = init_device
+         self.logit_scale = logit_scale
+         self.no_bias = no_bias
+         self.verbose = verbose
+         self.embedding_fraction = embedding_fraction
+         self.norm_type = norm_type
+         self.use_cache = use_cache
+         self.init_config = init_config
+         if 'name' in kwargs:
+             del kwargs['name']
+         if 'loss_fn' in kwargs:
+             del kwargs['loss_fn']
+         super().__init__(**kwargs)
+         self._validate_config()
+
+     def _set_config_defaults(self, config, config_defaults):
+         for (k, v) in config_defaults.items():
+             if k not in config:
+                 config[k] = v
+         return config
+
+     def _validate_config(self):
+         self.attn_config = self._set_config_defaults(self.attn_config, attn_config_defaults)
+         self.init_config = self._set_config_defaults(self.init_config, init_config_defaults)
+         if self.d_model % self.n_heads != 0:
+             raise ValueError('d_model must be divisible by n_heads')
+         if any((prob < 0 or prob > 1 for prob in [self.attn_config['attn_pdrop'], self.resid_pdrop, self.emb_pdrop])):
+             raise ValueError("self.attn_config['attn_pdrop'], resid_pdrop, emb_pdrop are probabilities and must be between 0 and 1")
+         if self.attn_config['attn_impl'] not in ['torch', 'flash', 'triton']:
+             raise ValueError(f"Unknown attn_impl={self.attn_config['attn_impl']}")
+         if self.attn_config['prefix_lm'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
+             raise NotImplementedError('prefix_lm only implemented with torch and triton attention.')
+         if self.attn_config['alibi'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
+             raise NotImplementedError('alibi only implemented with torch and triton attention.')
+         if self.attn_config['attn_uses_sequence_id'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
+             raise NotImplementedError('attn_uses_sequence_id only implemented with torch and triton attention.')
+         if self.embedding_fraction > 1 or self.embedding_fraction <= 0:
+             raise ValueError('model.embedding_fraction must be between 0 (exclusive) and 1 (inclusive)!')
+         if isinstance(self.logit_scale, str) and self.logit_scale != 'inv_sqrt_d_model':
+             raise ValueError(f"self.logit_scale={self.logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
+         if self.init_config.get('name', None) is None:
+             raise ValueError(f"self.init_config={self.init_config!r} 'name' needs to be set.")
+         if not self.learned_pos_emb and (not self.attn_config['alibi']):
+             raise ValueError(f'Positional information must be provided to the model using either learned_pos_emb or alibi.')
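
Example of constructing the config and exercising `_validate_config` (a sketch; it assumes `bak/configuration_mpt.py` is importable as `configuration_mpt` and that transformers is installed):

```
from configuration_mpt import MPTConfig

cfg = MPTConfig(d_model=256, n_heads=8, n_layers=4, max_seq_len=512)
print(cfg.attn_config['attn_impl'])  # 'triton' (the default)

try:
    MPTConfig(d_model=250, n_heads=8)  # 250 is not divisible by 8
except ValueError as e:
    print(e)  # d_model must be divisible by n_heads
```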
bak/custom_embedding.py ADDED
@@ -0,0 +1,11 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch import Tensor
+
+ class SharedEmbedding(nn.Embedding):
+
+     def forward(self, input: Tensor, unembed: bool=False) -> Tensor:
+         if unembed:
+             return F.linear(input, self.weight)
+         return super().forward(input)
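
`SharedEmbedding` ties the input embedding and the output projection to the same weight matrix: the default path is a normal embedding lookup, while `unembed=True` projects hidden states back to vocabulary logits. A small usage sketch (assumes the file is importable as `custom_embedding`):

```
import torch
from custom_embedding import SharedEmbedding

emb = SharedEmbedding(num_embeddings=100, embedding_dim=16)
tokens = torch.tensor([[1, 2, 3]])
hidden = emb(tokens)                # (1, 3, 16) embedding lookup
logits = emb(hidden, unembed=True)  # (1, 3, 100) tied output projection
print(hidden.shape, logits.shape)
```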
bak/flash_attn_triton.py ADDED
@@ -0,0 +1,484 @@
1
+ """
2
+ Copied from https://github.com/HazyResearch/flash-attention/blob/eff9fe6b8076df59d64d7a3f464696738a3c7c24/flash_attn/flash_attn_triton.py
3
+ update imports to use 'triton_pre_mlir'
4
+
5
+ *Experimental* implementation of FlashAttention in Triton.
6
+ Tested with triton==2.0.0.dev20221202.
7
+ Triton 2.0 has a new backend (MLIR) but seems like it doesn't yet work for head dimensions
8
+ other than 64:
9
+ https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
10
+ We'll update this implementation with the new Triton backend once this is fixed.
11
+
12
+ We use the FlashAttention implementation from Phil Tillet a starting point.
13
+ https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py
14
+
15
+ Changes:
16
+ - Implement both causal and non-causal attention.
17
+ - Implement both self-attention and cross-attention.
18
+ - Support arbitrary seqlens (not just multiples of 128), for both forward and backward.
19
+ - Support all head dimensions up to 128 (not just 16, 32, 64, 128), for both forward and backward.
20
+ - Support attention bias.
21
+ - Speed up the forward pass a bit, and only store the LSE instead of m and l.
22
+ - Make the backward for d=128 much faster by reducing register spilling.
23
+ - Optionally parallelize the backward pass across seqlen_k, to deal with the case of
24
+ small batch size * nheads.
25
+
26
+ Caution:
27
+ - This is an *experimental* implementation. The forward pass should be quite robust but
28
+ I'm not 100% sure that the backward pass doesn't have race conditions (due to the Triton compiler).
29
+ - This implementation has only been tested on A100.
30
+ - If you plan to use headdim other than 64 and 128, you should test for race conditions
31
+ (due to the Triton compiler), as done in tests/test_flash_attn.py
32
+ "test_flash_attn_triton_race_condition". I've tested and fixed many race conditions
33
+ for different head dimensions (40, 48, 64, 128, 80, 88, 96), but I'm still not 100% confident
34
+ that there are none left for other head dimensions.
35
+
36
+ Differences between this Triton version and the CUDA version:
37
+ - Triton version doesn't support dropout.
38
+ - Triton forward is generally faster than CUDA forward, while Triton backward is
39
+ generally slower than CUDA backward. Overall Triton forward + backward is slightly slower
40
+ than CUDA forward + backward.
41
+ - Triton version doesn't support different sequence lengths in a batch (i.e., RaggedTensor/NestedTensor).
42
+ - Triton version supports attention bias, while CUDA version doesn't.
43
+ """
44
+ import math
45
+ import torch
46
+ import triton_pre_mlir as triton
47
+ import triton_pre_mlir.language as tl
48
+
49
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
50
+ @triton.jit
51
+ def _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
52
+ start_m = tl.program_id(0)
53
+ off_hb = tl.program_id(1)
54
+ off_b = off_hb // nheads
55
+ off_h = off_hb % nheads
56
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
57
+ offs_n = tl.arange(0, BLOCK_N)
58
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
59
+ q_ptrs = Q + off_b * stride_qb + off_h * stride_qh + (offs_m[:, None] * stride_qm + offs_d[None, :])
60
+ k_ptrs = K + off_b * stride_kb + off_h * stride_kh + (offs_n[:, None] * stride_kn + offs_d[None, :])
61
+ v_ptrs = V + off_b * stride_vb + off_h * stride_vh + (offs_n[:, None] * stride_vn + offs_d[None, :])
62
+ if BIAS_TYPE == 'vector':
63
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + offs_n
64
+ elif BIAS_TYPE == 'matrix':
65
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + (offs_m[:, None] * stride_bm + offs_n[None, :])
66
+ t_ptrs = TMP + off_hb * seqlen_q_rounded + offs_m
67
+ lse_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
68
+ m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
69
+ acc_o = tl.zeros([BLOCK_M, BLOCK_HEADDIM], dtype=tl.float32)
70
+ if EVEN_M & EVEN_N:
71
+ if EVEN_HEADDIM:
72
+ q = tl.load(q_ptrs)
73
+ else:
74
+ q = tl.load(q_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
75
+ elif EVEN_HEADDIM:
76
+ q = tl.load(q_ptrs, mask=offs_m[:, None] < seqlen_q, other=0.0)
77
+ else:
78
+ q = tl.load(q_ptrs, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
79
+ end_n = seqlen_k if not IS_CAUSAL else tl.minimum((start_m + 1) * BLOCK_M, seqlen_k)
80
+ for start_n in range(0, end_n, BLOCK_N):
81
+ start_n = tl.multiple_of(start_n, BLOCK_N)
82
+ if EVEN_N & EVEN_M:
83
+ if EVEN_HEADDIM:
84
+ k = tl.load(k_ptrs + start_n * stride_kn)
85
+ else:
86
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=offs_d[None, :] < headdim, other=0.0)
87
+ elif EVEN_HEADDIM:
88
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
89
+ else:
90
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
91
+ qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
92
+ qk += tl.dot(q, k, trans_b=True)
93
+ if not EVEN_N:
94
+ qk += tl.where((start_n + offs_n)[None, :] < seqlen_k, 0, float('-inf'))
95
+ if IS_CAUSAL:
96
+ qk += tl.where(offs_m[:, None] >= (start_n + offs_n)[None, :], 0, float('-inf'))
97
+ if BIAS_TYPE != 'none':
98
+ if BIAS_TYPE == 'vector':
99
+ if EVEN_N:
100
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
101
+ else:
102
+ bias = tl.load(b_ptrs + start_n, mask=start_n + offs_n < seqlen_k, other=0.0).to(tl.float32)
103
+ bias = bias[None, :]
104
+ elif BIAS_TYPE == 'matrix':
105
+ if EVEN_M & EVEN_N:
106
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
107
+ else:
108
+ bias = tl.load(b_ptrs + start_n, mask=(offs_m[:, None] < seqlen_q) & ((start_n + offs_n)[None, :] < seqlen_k), other=0.0).to(tl.float32)
109
+ qk = qk * softmax_scale + bias
110
+ m_ij = tl.maximum(tl.max(qk, 1), lse_i)
111
+ p = tl.exp(qk - m_ij[:, None])
112
+ else:
113
+ m_ij = tl.maximum(tl.max(qk, 1) * softmax_scale, lse_i)
114
+ p = tl.exp(qk * softmax_scale - m_ij[:, None])
115
+ l_ij = tl.sum(p, 1)
116
+ acc_o_scale = tl.exp(m_i - m_ij)
117
+ tl.store(t_ptrs, acc_o_scale)
118
+ acc_o_scale = tl.load(t_ptrs)
119
+ acc_o = acc_o * acc_o_scale[:, None]
120
+ if EVEN_N & EVEN_M:
121
+ if EVEN_HEADDIM:
122
+ v = tl.load(v_ptrs + start_n * stride_vn)
123
+ else:
124
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=offs_d[None, :] < headdim, other=0.0)
125
+ elif EVEN_HEADDIM:
126
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
127
+ else:
128
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
129
+ p = p.to(v.dtype)
130
+ acc_o += tl.dot(p, v)
131
+ m_i = m_ij
132
+ l_i_new = tl.exp(lse_i - m_ij) + l_ij
133
+ lse_i = m_ij + tl.log(l_i_new)
134
+ o_scale = tl.exp(m_i - lse_i)
135
+ tl.store(t_ptrs, o_scale)
136
+ o_scale = tl.load(t_ptrs)
137
+ acc_o = acc_o * o_scale[:, None]
138
+ start_m = tl.program_id(0)
139
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
140
+ lse_ptrs = Lse + off_hb * seqlen_q_rounded + offs_m
141
+ tl.store(lse_ptrs, lse_i)
142
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
143
+ out_ptrs = Out + off_b * stride_ob + off_h * stride_oh + (offs_m[:, None] * stride_om + offs_d[None, :])
144
+ if EVEN_M:
145
+ if EVEN_HEADDIM:
146
+ tl.store(out_ptrs, acc_o)
147
+ else:
148
+ tl.store(out_ptrs, acc_o, mask=offs_d[None, :] < headdim)
149
+ elif EVEN_HEADDIM:
150
+ tl.store(out_ptrs, acc_o, mask=offs_m[:, None] < seqlen_q)
151
+ else:
152
+ tl.store(out_ptrs, acc_o, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
153
+
154
+ @triton.jit
155
+ def _bwd_preprocess_do_o_dot(Out, DO, Delta, stride_ob, stride_oh, stride_om, stride_dob, stride_doh, stride_dom, nheads, seqlen_q, seqlen_q_rounded, headdim, BLOCK_M: tl.constexpr, BLOCK_HEADDIM: tl.constexpr):
156
+ start_m = tl.program_id(0)
157
+ off_hb = tl.program_id(1)
158
+ off_b = off_hb // nheads
159
+ off_h = off_hb % nheads
160
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
161
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
162
+ o = tl.load(Out + off_b * stride_ob + off_h * stride_oh + offs_m[:, None] * stride_om + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
163
+ do = tl.load(DO + off_b * stride_dob + off_h * stride_doh + offs_m[:, None] * stride_dom + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
164
+ delta = tl.sum(o * do, axis=1)
165
+ tl.store(Delta + off_hb * seqlen_q_rounded + offs_m, delta)
166
+
167
+ @triton.jit
168
+ def _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr):
169
+ if EVEN_N & EVEN_M:
170
+ if EVEN_HEADDIM:
171
+ tl.store(dv_ptrs, dv)
172
+ tl.store(dk_ptrs, dk)
173
+ else:
174
+ tl.store(dv_ptrs, dv, mask=offs_d[None, :] < headdim)
175
+ tl.store(dk_ptrs, dk, mask=offs_d[None, :] < headdim)
176
+ elif EVEN_HEADDIM:
177
+ tl.store(dv_ptrs, dv, mask=offs_n[:, None] < seqlen_k)
178
+ tl.store(dk_ptrs, dk, mask=offs_n[:, None] < seqlen_k)
179
+ else:
180
+ tl.store(dv_ptrs, dv, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
181
+ tl.store(dk_ptrs, dk, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
182
+
183
+ @triton.jit
184
+ def _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD: tl.constexpr, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
185
+ begin_m = 0 if not IS_CAUSAL else start_n * BLOCK_N // BLOCK_M * BLOCK_M
186
+ offs_qm = begin_m + tl.arange(0, BLOCK_M)
187
+ offs_n = start_n * BLOCK_N + tl.arange(0, BLOCK_N)
188
+ offs_m = tl.arange(0, BLOCK_M)
189
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
190
+ q_ptrs = Q + (offs_qm[:, None] * stride_qm + offs_d[None, :])
191
+ k_ptrs = K + (offs_n[:, None] * stride_kn + offs_d[None, :])
192
+ v_ptrs = V + (offs_n[:, None] * stride_vn + offs_d[None, :])
193
+ do_ptrs = DO + (offs_qm[:, None] * stride_dom + offs_d[None, :])
194
+ dq_ptrs = DQ + (offs_qm[:, None] * stride_dqm + offs_d[None, :])
195
+ if BIAS_TYPE == 'vector':
196
+ b_ptrs = Bias + offs_n
197
+ elif BIAS_TYPE == 'matrix':
198
+ b_ptrs = Bias + (offs_qm[:, None] * stride_bm + offs_n[None, :])
199
+ dv = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
200
+ dk = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
201
+ if begin_m >= seqlen_q:
202
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
203
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
204
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
205
+ return
206
+ if EVEN_N & EVEN_M:
207
+ if EVEN_HEADDIM:
208
+ k = tl.load(k_ptrs)
209
+ v = tl.load(v_ptrs)
210
+ else:
211
+ k = tl.load(k_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
212
+ v = tl.load(v_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
213
+ elif EVEN_HEADDIM:
214
+ k = tl.load(k_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
215
+ v = tl.load(v_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
216
+ else:
217
+ k = tl.load(k_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
218
+ v = tl.load(v_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
219
+ num_block_m = tl.cdiv(seqlen_q, BLOCK_M)
220
+ for start_m in range(begin_m, num_block_m * BLOCK_M, BLOCK_M):
221
+ start_m = tl.multiple_of(start_m, BLOCK_M)
222
+ offs_m_curr = start_m + offs_m
223
+ if EVEN_M & EVEN_HEADDIM:
224
+ q = tl.load(q_ptrs)
225
+ elif EVEN_HEADDIM:
226
+ q = tl.load(q_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0)
227
+ else:
228
+ q = tl.load(q_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
229
+ qk = tl.dot(q, k, trans_b=True)
230
+ if not EVEN_N:
231
+ qk = tl.where(offs_n[None, :] < seqlen_k, qk, float('-inf'))
232
+ if IS_CAUSAL:
233
+ qk = tl.where(offs_m_curr[:, None] >= offs_n[None, :], qk, float('-inf'))
234
+ if BIAS_TYPE != 'none':
235
+ tl.debug_barrier()
236
+ if BIAS_TYPE == 'vector':
237
+ if EVEN_N:
238
+ bias = tl.load(b_ptrs).to(tl.float32)
239
+ else:
240
+ bias = tl.load(b_ptrs, mask=offs_n < seqlen_k, other=0.0).to(tl.float32)
241
+ bias = bias[None, :]
242
+ elif BIAS_TYPE == 'matrix':
243
+ if EVEN_M & EVEN_N:
244
+ bias = tl.load(b_ptrs).to(tl.float32)
245
+ else:
246
+ bias = tl.load(b_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_n[None, :] < seqlen_k), other=0.0).to(tl.float32)
247
+ qk = qk * softmax_scale + bias
248
+ if not EVEN_M & EVEN_HEADDIM:
249
+ tl.debug_barrier()
250
+ lse_i = tl.load(LSE + offs_m_curr)
251
+ if BIAS_TYPE == 'none':
252
+ p = tl.exp(qk * softmax_scale - lse_i[:, None])
253
+ else:
254
+ p = tl.exp(qk - lse_i[:, None])
255
+ if EVEN_M & EVEN_HEADDIM:
256
+ do = tl.load(do_ptrs)
257
+ else:
258
+ do = tl.load(do_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
259
+ dv += tl.dot(p.to(do.dtype), do, trans_a=True)
260
+ if not EVEN_M & EVEN_HEADDIM:
261
+ tl.debug_barrier()
262
+ dp = tl.dot(do, v, trans_b=True)
263
+ if not EVEN_HEADDIM:
264
+ tl.debug_barrier()
265
+ Di = tl.load(D + offs_m_curr)
266
+ ds = (p * (dp - Di[:, None]) * softmax_scale).to(q.dtype)
267
+ dk += tl.dot(ds, q, trans_a=True)
268
+ if not EVEN_M & EVEN_HEADDIM:
269
+ tl.debug_barrier()
270
+ if not ATOMIC_ADD:
271
+ if EVEN_M & EVEN_HEADDIM:
272
+ dq = tl.load(dq_ptrs, eviction_policy='evict_last')
273
+ dq += tl.dot(ds, k)
274
+ tl.store(dq_ptrs, dq, eviction_policy='evict_last')
275
+ elif EVEN_HEADDIM:
276
+ dq = tl.load(dq_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0, eviction_policy='evict_last')
277
+ dq += tl.dot(ds, k)
278
+ tl.store(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q, eviction_policy='evict_last')
279
+ else:
280
+ dq = tl.load(dq_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0, eviction_policy='evict_last')
281
+ dq += tl.dot(ds, k)
282
+ tl.store(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), eviction_policy='evict_last')
283
+ else:
284
+ dq = tl.dot(ds, k)
285
+ if EVEN_M & EVEN_HEADDIM:
286
+ tl.atomic_add(dq_ptrs, dq)
287
+ elif EVEN_HEADDIM:
288
+ tl.atomic_add(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q)
289
+ else:
290
+ tl.atomic_add(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
291
+ dq_ptrs += BLOCK_M * stride_dqm
292
+ q_ptrs += BLOCK_M * stride_qm
293
+ do_ptrs += BLOCK_M * stride_dom
294
+ if BIAS_TYPE == 'matrix':
295
+ b_ptrs += BLOCK_M * stride_bm
296
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
297
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
298
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
299
+
300
+ def init_to_zero(name):
301
+ return lambda nargs: nargs[name].zero_()
302
+
303
+ @triton.autotune(configs=[triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': False}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ')), triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': True}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ'))], key=['CACHE_KEY_SEQLEN_Q', 'CACHE_KEY_SEQLEN_K', 'BIAS_TYPE', 'IS_CAUSAL', 'BLOCK_HEADDIM'])
304
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
305
+ @triton.jit
306
+ def _bwd_kernel(Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_dob, stride_doh, stride_dom, stride_dqb, stride_dqh, stride_dqm, stride_dkb, stride_dkh, stride_dkn, stride_dvb, stride_dvh, stride_dvn, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, SEQUENCE_PARALLEL: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
307
+ off_hb = tl.program_id(1)
308
+ off_b = off_hb // nheads
309
+ off_h = off_hb % nheads
310
+ Q += off_b * stride_qb + off_h * stride_qh
311
+ K += off_b * stride_kb + off_h * stride_kh
312
+ V += off_b * stride_vb + off_h * stride_vh
313
+ DO += off_b * stride_dob + off_h * stride_doh
314
+ DQ += off_b * stride_dqb + off_h * stride_dqh
315
+ DK += off_b * stride_dkb + off_h * stride_dkh
316
+ DV += off_b * stride_dvb + off_h * stride_dvh
317
+ if BIAS_TYPE != 'none':
318
+ Bias += off_b * stride_bb + off_h * stride_bh
319
+ D += off_hb * seqlen_q_rounded
320
+ LSE += off_hb * seqlen_q_rounded
321
+ if not SEQUENCE_PARALLEL:
322
+ num_block_n = tl.cdiv(seqlen_k, BLOCK_N)
323
+ for start_n in range(0, num_block_n):
324
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=False, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
325
+ else:
326
+ start_n = tl.program_id(0)
327
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=True, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
328
+
329
+ def _flash_attn_forward(q, k, v, bias=None, causal=False, softmax_scale=None):
330
+ (batch, seqlen_q, nheads, d) = q.shape
331
+ (_, seqlen_k, _, _) = k.shape
332
+ assert k.shape == (batch, seqlen_k, nheads, d)
333
+ assert v.shape == (batch, seqlen_k, nheads, d)
334
+ assert d <= 128, 'FlashAttention only support head dimensions up to 128'
335
+ assert q.dtype == k.dtype == v.dtype, 'All tensors must have the same type'
336
+ assert q.dtype in [torch.float16, torch.bfloat16], 'Only support fp16 and bf16'
337
+ assert q.is_cuda and k.is_cuda and v.is_cuda
338
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
339
+ has_bias = bias is not None
340
+ bias_type = 'none'
341
+ if has_bias:
342
+ assert bias.dtype in [q.dtype, torch.float]
343
+ assert bias.is_cuda
344
+ assert bias.dim() == 4
345
+ if bias.stride(-1) != 1:
346
+ bias = bias.contiguous()
347
+ if bias.shape[2:] == (1, seqlen_k):
348
+ bias_type = 'vector'
349
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
350
+ bias_type = 'matrix'
351
+ else:
352
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
353
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
354
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
355
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
356
+ lse = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
357
+ tmp = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
358
+ o = torch.empty_like(q)
359
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
360
+ BLOCK = 128
361
+ num_warps = 4 if d <= 64 else 8
362
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
363
+ _fwd_kernel[grid](q, k, v, bias, o, lse, tmp, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, o.stride(0), o.stride(2), o.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM, BLOCK_M=BLOCK, BLOCK_N=BLOCK, num_warps=num_warps, num_stages=1)
364
+ return (o, lse, softmax_scale)
365
+
366
+ def _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=None, causal=False, softmax_scale=None):
367
+ if do.stride(-1) != 1:
368
+ do = do.contiguous()
369
+ (batch, seqlen_q, nheads, d) = q.shape
370
+ (_, seqlen_k, _, _) = k.shape
371
+ assert d <= 128
372
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
373
+ assert lse.shape == (batch, nheads, seqlen_q_rounded)
374
+ assert q.stride(-1) == k.stride(-1) == v.stride(-1) == o.stride(-1) == 1
375
+ assert dq.stride(-1) == dk.stride(-1) == dv.stride(-1) == 1
376
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
377
+ dq_accum = torch.empty_like(q, dtype=torch.float32)
378
+ delta = torch.empty_like(lse)
379
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
380
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
381
+ _bwd_preprocess_do_o_dot[grid](o, do, delta, o.stride(0), o.stride(2), o.stride(1), do.stride(0), do.stride(2), do.stride(1), nheads, seqlen_q, seqlen_q_rounded, d, BLOCK_M=128, BLOCK_HEADDIM=BLOCK_HEADDIM)
382
+ has_bias = bias is not None
383
+ bias_type = 'none'
384
+ if has_bias:
385
+ assert bias.dtype in [q.dtype, torch.float]
386
+ assert bias.is_cuda
387
+ assert bias.dim() == 4
388
+ assert bias.stride(-1) == 1
389
+ if bias.shape[2:] == (1, seqlen_k):
390
+ bias_type = 'vector'
391
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
392
+ bias_type = 'matrix'
393
+ else:
394
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
395
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
396
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
397
+ grid = lambda META: (triton.cdiv(seqlen_k, META['BLOCK_N']) if META['SEQUENCE_PARALLEL'] else 1, batch * nheads)
398
+ _bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, do.stride(0), do.stride(2), do.stride(1), dq_accum.stride(0), dq_accum.stride(2), dq_accum.stride(1), dk.stride(0), dk.stride(2), dk.stride(1), dv.stride(0), dv.stride(2), dv.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM)
399
+ dq.copy_(dq_accum)
400
+
401
+ class FlashAttnQKVPackedFunc(torch.autograd.Function):
402
+
403
+ @staticmethod
404
+ def forward(ctx, qkv, bias=None, causal=False, softmax_scale=None):
405
+ """
406
+ qkv: (batch, seqlen, 3, nheads, headdim)
407
+ bias: optional, shape broadcastible to (batch, nheads, seqlen, seqlen).
408
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen).
409
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen, seqlen)
410
+ """
411
+ if qkv.stride(-1) != 1:
412
+ qkv = qkv.contiguous()
413
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], bias=bias, causal=causal, softmax_scale=softmax_scale)
414
+ ctx.save_for_backward(qkv, o, lse, bias)
415
+ ctx.causal = causal
416
+ return o
417
+
418
+ @staticmethod
419
+ def backward(ctx, do):
420
+ (qkv, o, lse, bias) = ctx.saved_tensors
421
+ assert not ctx.needs_input_grad[1], 'FlashAttention does not support bias gradient yet'
422
+ with torch.inference_mode():
423
+ dqkv = torch.empty_like(qkv)
424
+ _flash_attn_backward(do, qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], o, lse, dqkv[:, :, 0], dqkv[:, :, 1], dqkv[:, :, 2], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
425
+ return (dqkv, None, None, None)
426
+ flash_attn_qkvpacked_func = FlashAttnQKVPackedFunc.apply
427
+
428
+ class FlashAttnKVPackedFunc(torch.autograd.Function):
429
+
430
+ @staticmethod
431
+ def forward(ctx, q, kv, bias=None, causal=False, softmax_scale=None):
432
+ """
433
+ q: (batch, seqlen_q, nheads, headdim)
434
+ kv: (batch, seqlen_k, 2, nheads, headdim)
435
+ bias: optional, shape broadcastible to (batch, nheads, seqlen_q, seqlen_k).
436
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
437
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
438
+ """
439
+ (q, kv) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, kv]]
440
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, kv[:, :, 0], kv[:, :, 1], bias=bias, causal=causal, softmax_scale=softmax_scale)
441
+ ctx.save_for_backward(q, kv, o, lse, bias)
442
+ ctx.causal = causal
443
+ return o
444
+
445
+ @staticmethod
446
+ def backward(ctx, do):
447
+ (q, kv, o, lse, bias) = ctx.saved_tensors
448
+ if len(ctx.needs_input_grad) >= 3:
449
+ assert not ctx.needs_input_grad[2], 'FlashAttention does not support bias gradient yet'
450
+ with torch.inference_mode():
451
+ dq = torch.empty_like(q)
452
+ dkv = torch.empty_like(kv)
453
+ _flash_attn_backward(do, q, kv[:, :, 0], kv[:, :, 1], o, lse, dq, dkv[:, :, 0], dkv[:, :, 1], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
454
+ return (dq, dkv, None, None, None)
455
+ flash_attn_kvpacked_func = FlashAttnKVPackedFunc.apply
456
+
457
+ class FlashAttnFunc(torch.autograd.Function):
458
+
459
+ @staticmethod
460
+ def forward(ctx, q, k, v, bias=None, causal=False, softmax_scale=None):
461
+ """
462
+ q: (batch_size, seqlen_q, nheads, headdim)
463
+ k, v: (batch_size, seqlen_k, nheads, headdim)
464
+ bias: optional, shape broadcastible to (batch, nheads, seqlen_q, seqlen_k).
465
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
466
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
467
+ """
468
+ (q, k, v) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, k, v]]
469
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, k, v, bias=bias, causal=causal, softmax_scale=softmax_scale)
470
+ ctx.save_for_backward(q, k, v, o, lse, bias)
471
+ ctx.causal = causal
472
+ return o
473
+
474
+ @staticmethod
475
+ def backward(ctx, do):
476
+ (q, k, v, o, lse, bias) = ctx.saved_tensors
477
+ assert not ctx.needs_input_grad[3], 'FlashAttention does not support bias gradient yet'
478
+ with torch.inference_mode():
479
+ dq = torch.empty_like(q)
480
+ dk = torch.empty_like(k)
481
+ dv = torch.empty_like(v)
482
+ _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
483
+ return (dq, dk, dv, None, None, None)
484
+ flash_attn_func = FlashAttnFunc.apply
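
Usage sketch for `flash_attn_func` as defined above (requires a CUDA GPU, fp16/bf16 inputs with head dimension up to 128, and the `triton_pre_mlir` package; shapes follow the docstrings: `(batch, seqlen, nheads, headdim)`):

```
import torch
from flash_attn_triton import flash_attn_func  # assumes this file is importable

q = torch.randn(2, 128, 16, 64, device='cuda', dtype=torch.float16, requires_grad=True)
k = torch.randn(2, 128, 16, 64, device='cuda', dtype=torch.float16, requires_grad=True)
v = torch.randn(2, 128, 16, 64, device='cuda', dtype=torch.float16, requires_grad=True)

out = flash_attn_func(q, k, v, None, True)  # bias=None, causal=True
out.sum().backward()                        # exercises the Triton backward kernel
print(out.shape)                            # torch.Size([2, 128, 16, 64])
```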
bak/hf_prefixlm_converter.py ADDED
@@ -0,0 +1,415 @@
1
+ """Converts Huggingface Causal LM to Prefix LM.
2
+
3
+ Conversion does lightweight surgery on a HuggingFace
4
+ Causal LM to convert it to a Prefix LM.
5
+
6
+ Prefix LMs accepts a `bidirectional_mask` input in `forward`
7
+ and treat the input prompt as the prefix in `generate`.
8
+ """
9
+ import math
10
+ import warnings
11
+ from types import MethodType
12
+ from typing import Any, Dict, List, Optional, Tuple, Union
13
+ import torch
14
+ from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
15
+ from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
16
+ from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
17
+ from transformers.models.bloom.modeling_bloom import logging
18
+ from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel
19
+ from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoForCausalLM
20
+ from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXForCausalLM
21
+ from transformers.models.gptj.modeling_gptj import GPTJForCausalLM
22
+ from transformers.models.opt.modeling_opt import OPTForCausalLM
23
+ from transformers.models.opt.modeling_opt import _expand_mask as _expand_mask_opt
24
+ from transformers.models.opt.modeling_opt import _make_causal_mask as _make_causal_mask_opt
25
+ logger = logging.get_logger(__name__)
26
+ _SUPPORTED_GPT_MODELS = (GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM)
27
+ CAUSAL_GPT_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM]
28
+
29
+ def _convert_gpt_causal_lm_to_prefix_lm(model: CAUSAL_GPT_TYPES) -> CAUSAL_GPT_TYPES:
30
+ """Converts a GPT-style Causal LM to a Prefix LM.
31
+
32
+ Supported HuggingFace model classes:
33
+ - `GPT2LMHeadModel`
34
+ - `GPTNeoForCausalLM`
35
+ - `GPTNeoXForCausalLM`
36
+ - `GPTJForCausalLM`
37
+
38
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
39
+ """
40
+ if hasattr(model, '_prefix_lm_converted'):
41
+ return model
42
+ assert isinstance(model, _SUPPORTED_GPT_MODELS)
43
+ assert model.config.add_cross_attention == False, 'Only supports GPT-style decoder-only models'
44
+
45
+ def _get_attn_modules(model: CAUSAL_GPT_TYPES) -> List[torch.nn.Module]:
46
+ """Helper that gets a list of the model's attention modules.
47
+
48
+ Each module has a `bias` buffer used for causal masking. The Prefix LM
49
+ conversion adds logic to dynamically manipulate these biases to support
50
+ Prefix LM attention masking.
51
+ """
52
+ attn_modules = []
53
+ if isinstance(model, GPTNeoXForCausalLM):
54
+ blocks = model.gpt_neox.layers
55
+ else:
56
+ blocks = model.transformer.h
57
+ for block in blocks:
58
+ if isinstance(model, GPTNeoForCausalLM):
59
+ if block.attn.attention_type != 'global':
60
+ continue
61
+ attn_module = block.attn.attention
62
+ elif isinstance(model, GPTNeoXForCausalLM):
63
+ attn_module = block.attention
64
+ else:
65
+ attn_module = block.attn
66
+ attn_modules.append(attn_module)
67
+ return attn_modules
68
+ setattr(model, '_original_forward', getattr(model, 'forward'))
69
+ setattr(model, '_original_generate', getattr(model, 'generate'))
70
+
71
+ def forward(self: CAUSAL_GPT_TYPES, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[Tuple[torch.Tensor]]]=None, attention_mask: Optional[torch.FloatTensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, token_type_ids: Optional[torch.LongTensor]=None, position_ids: Optional[torch.LongTensor]=None, head_mask: Optional[torch.FloatTensor]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
72
+ """Wraps original forward to enable PrefixLM attention."""
73
+
74
+ def call_og_forward():
75
+ if isinstance(self, GPTNeoXForCausalLM):
76
+ return self._original_forward(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
77
+ else:
78
+ return self._original_forward(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
79
+ if bidirectional_mask is None:
80
+ return call_og_forward()
81
+ assert isinstance(bidirectional_mask, torch.Tensor)
82
+ attn_modules = _get_attn_modules(model)
83
+ (b, s) = bidirectional_mask.shape
84
+ max_length = attn_modules[0].bias.shape[-1]
85
+ if s > max_length:
86
+ raise ValueError(f'bidirectional_mask sequence length (={s}) exceeds the ' + f'max length allowed by the model ({max_length}).')
87
+ assert s <= max_length
88
+ if s < max_length:
89
+ pad = torch.zeros((int(b), int(max_length - s)), dtype=bidirectional_mask.dtype, device=bidirectional_mask.device)
90
+ bidirectional_mask = torch.cat([bidirectional_mask, pad], dim=1)
91
+ bidirectional = bidirectional_mask.unsqueeze(1).unsqueeze(1)
92
+ for attn_module in attn_modules:
93
+ attn_module.bias.data = torch.logical_or(attn_module.bias.data, bidirectional)
94
+ output = call_og_forward()
95
+ for attn_module in attn_modules:
96
+ attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
97
+ return output
98
+
99
+ def generate(self: CAUSAL_GPT_TYPES, *args: tuple, **kwargs: Dict[str, Any]):
100
+ """Wraps original generate to enable PrefixLM attention."""
101
+ attn_modules = _get_attn_modules(model)
102
+ for attn_module in attn_modules:
103
+ attn_module.bias.data[:] = 1
104
+ output = self._original_generate(*args, **kwargs)
105
+ for attn_module in attn_modules:
106
+ attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
107
+ return output
108
+ setattr(model, 'forward', MethodType(forward, model))
109
+ setattr(model, 'generate', MethodType(generate, model))
110
+ setattr(model, '_prefix_lm_converted', True)
111
+ return model
112
+
113
+ def _convert_bloom_causal_lm_to_prefix_lm(model: BloomForCausalLM) -> BloomForCausalLM:
114
+ """Converts a BLOOM Causal LM to a Prefix LM.
115
+
116
+ Supported HuggingFace model classes:
117
+ - `BloomForCausalLM`
118
+
119
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
120
+ """
121
+ if hasattr(model, '_prefix_lm_converted'):
122
+ return model
123
+ assert isinstance(model, BloomForCausalLM)
124
+ assert model.config.add_cross_attention == False, 'Only supports BLOOM decoder-only models'
125
+
126
+ def _prepare_attn_mask(self: BloomModel, attention_mask: torch.Tensor, bidirectional_mask: Optional[torch.Tensor], input_shape: Tuple[int, int], past_key_values_length: int) -> torch.BoolTensor:
127
+ combined_attention_mask = None
128
+ device = attention_mask.device
129
+ (_, src_length) = input_shape
130
+ if src_length > 1:
131
+ combined_attention_mask = _make_causal_mask_bloom(input_shape, device=device, past_key_values_length=past_key_values_length)
132
+ if bidirectional_mask is not None:
133
+ assert attention_mask.shape == bidirectional_mask.shape
134
+ expanded_bidirectional_mask = _expand_mask_bloom(bidirectional_mask, tgt_length=src_length)
135
+ combined_attention_mask = torch.logical_and(combined_attention_mask, expanded_bidirectional_mask)
136
+ expanded_attn_mask = _expand_mask_bloom(attention_mask, tgt_length=src_length)
137
+ combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
138
+ return combined_attention_mask
139
+
140
+ def _build_alibi_tensor(self: BloomModel, batch_size: int, query_length: int, key_length: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
141
+ num_heads = self.config.n_head
142
+ closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
143
+ base = torch.tensor(2 ** (-2 ** (-(math.log2(closest_power_of_2) - 3))), device=device, dtype=torch.float32)
144
+ powers = torch.arange(1, 1 + closest_power_of_2, device=device, dtype=torch.int32)
145
+ slopes = torch.pow(base, powers)
146
+ if closest_power_of_2 != num_heads:
147
+ extra_base = torch.tensor(2 ** (-2 ** (-(math.log2(2 * closest_power_of_2) - 3))), device=device, dtype=torch.float32)
148
+ num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
149
+ extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=device, dtype=torch.int32)
150
+ slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
151
+ qa = torch.arange(query_length, device=device, dtype=torch.int32).view(-1, 1)
152
+ ka = torch.arange(key_length, device=device, dtype=torch.int32).view(1, -1)
153
+ diffs = qa - ka + key_length - query_length
154
+ diffs = -diffs.abs()
155
+ alibi = slopes.view(1, num_heads, 1, 1) * diffs.view(1, 1, query_length, key_length)
156
+ alibi = alibi.expand(batch_size, -1, -1, -1).reshape(-1, query_length, key_length)
157
+ return alibi.to(dtype)
158
+ KeyValueT = Tuple[torch.Tensor, torch.Tensor]
159
+
160
+ def forward(self: BloomModel, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.LongTensor]=None, inputs_embeds: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]:
161
+ if deprecated_arguments.pop('position_ids', False) is not False:
162
+ warnings.warn('`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. ' + 'You can safely ignore passing `position_ids`.', FutureWarning)
163
+ if len(deprecated_arguments) > 0:
164
+ raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
165
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
166
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
167
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
168
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
169
+ if input_ids is not None and inputs_embeds is not None:
170
+ raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
171
+ elif input_ids is not None:
172
+ (batch_size, seq_length) = input_ids.shape
173
+ elif inputs_embeds is not None:
174
+ (batch_size, seq_length, _) = inputs_embeds.shape
175
+ else:
176
+ raise ValueError('You have to specify either input_ids or inputs_embeds')
177
+ if past_key_values is None:
178
+ past_key_values = tuple([None] * len(self.h))
179
+ head_mask = self.get_head_mask(head_mask, self.config.n_layer)
180
+ if inputs_embeds is None:
181
+ inputs_embeds = self.word_embeddings(input_ids)
182
+ hidden_states = self.word_embeddings_layernorm(inputs_embeds)
183
+ presents = () if use_cache else None
184
+ all_self_attentions = () if output_attentions else None
185
+ all_hidden_states = () if output_hidden_states else None
186
+ seq_length_with_past = seq_length
187
+ past_key_values_length = 0
188
+ if past_key_values[0] is not None:
189
+ tmp = past_key_values[0][0]
190
+ past_key_values_length = tmp.shape[2]
191
+ seq_length_with_past = seq_length_with_past + past_key_values_length
192
+ if attention_mask is None:
193
+ attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device)
194
+ else:
195
+ attention_mask = attention_mask.to(hidden_states.device)
196
+ alibi = self._build_alibi_tensor(batch_size=batch_size, query_length=seq_length, key_length=seq_length_with_past, dtype=hidden_states.dtype, device=hidden_states.device)
197
+ causal_mask = self._prepare_attn_mask(attention_mask, bidirectional_mask, input_shape=(batch_size, seq_length), past_key_values_length=past_key_values_length)
198
+ for (i, (block, layer_past)) in enumerate(zip(self.h, past_key_values)):
199
+ if output_hidden_states:
200
+ hst = (hidden_states,)
201
+ all_hidden_states = all_hidden_states + hst
202
+ if self.gradient_checkpointing and self.training:
203
+ if use_cache:
204
+ logger.warning('`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...')
205
+ use_cache = False
206
+
207
+ def create_custom_forward(module):
208
+
209
+ def custom_forward(*inputs):
210
+ return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
211
+ return custom_forward
212
+ outputs = torch.utils.checkpoint.checkpoint(create_custom_forward(block), hidden_states, alibi, causal_mask, head_mask[i])
213
+ else:
214
+ outputs = block(hidden_states, layer_past=layer_past, attention_mask=causal_mask, head_mask=head_mask[i], use_cache=use_cache, output_attentions=output_attentions, alibi=alibi)
215
+ hidden_states = outputs[0]
216
+ if use_cache is True:
217
+ presents = presents + (outputs[1],)
218
+ if output_attentions:
219
+ oa = (outputs[2 if use_cache else 1],)
220
+ all_self_attentions = all_self_attentions + oa
221
+ hidden_states = self.ln_f(hidden_states)
222
+ if output_hidden_states:
223
+ hst = (hidden_states,)
224
+ all_hidden_states = all_hidden_states + hst
225
+ if not return_dict:
226
+ return tuple((v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None))
227
+ return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=hidden_states, past_key_values=presents, hidden_states=all_hidden_states, attentions=all_self_attentions)
228
+ setattr(model.transformer, '_prepare_attn_mask', MethodType(_prepare_attn_mask, model.transformer))
229
+ setattr(model.transformer, '_build_alibi_tensor', MethodType(_build_alibi_tensor, model.transformer))
230
+ setattr(model.transformer, 'forward', MethodType(forward, model.transformer))
231
+ KeyValueT = Tuple[torch.Tensor, torch.Tensor]
232
+
233
+ def forward(self: BloomForCausalLM, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.Tensor]=None, inputs_embeds: Optional[torch.Tensor]=None, labels: Optional[torch.Tensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
234
+ """Replacement forward method for BloomCausalLM."""
235
+ if deprecated_arguments.pop('position_ids', False) is not False:
236
+ warnings.warn('`position_ids` have no functionality in BLOOM and will be removed ' + 'in v5.0.0. You can safely ignore passing `position_ids`.', FutureWarning)
237
+ if len(deprecated_arguments) > 0:
238
+ raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
239
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
240
+ transformer_outputs = self.transformer(input_ids, past_key_values=past_key_values, attention_mask=attention_mask, bidirectional_mask=bidirectional_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
241
+ hidden_states = transformer_outputs[0]
242
+ lm_logits = self.lm_head(hidden_states)
243
+ loss = None
244
+ if labels is not None:
245
+ shift_logits = lm_logits[..., :-1, :].contiguous()
246
+ shift_labels = labels[..., 1:].contiguous()
247
+ (batch_size, seq_length, vocab_size) = shift_logits.shape
248
+ loss_fct = CrossEntropyLoss()
249
+ loss = loss_fct(shift_logits.view(batch_size * seq_length, vocab_size), shift_labels.view(batch_size * seq_length))
250
+ if not return_dict:
251
+ output = (lm_logits,) + transformer_outputs[1:]
252
+ return (loss,) + output if loss is not None else output
253
+ return CausalLMOutputWithCrossAttentions(loss=loss, logits=lm_logits, past_key_values=transformer_outputs.past_key_values, hidden_states=transformer_outputs.hidden_states, attentions=transformer_outputs.attentions)
254
+
255
+ def prepare_inputs_for_generation(self: BloomForCausalLM, input_ids: torch.LongTensor, past: Optional[torch.Tensor]=None, attention_mask: Optional[torch.Tensor]=None, **kwargs) -> dict:
256
+ if past:
257
+ input_ids = input_ids[:, -1].unsqueeze(-1)
258
+ bidirectional_mask = None
259
+ if past[0][0].shape[0] == input_ids.shape[0]:
260
+ past = self._convert_to_bloom_cache(past)
261
+ else:
262
+ bidirectional_mask = torch.ones_like(input_ids)
263
+ return {'input_ids': input_ids, 'past_key_values': past, 'use_cache': True, 'attention_mask': attention_mask, 'bidirectional_mask': bidirectional_mask}
264
+ setattr(model, 'forward', MethodType(forward, model))
265
+ setattr(model, 'prepare_inputs_for_generation', MethodType(prepare_inputs_for_generation, model))
266
+ setattr(model, '_prefix_lm_converted', True)
267
+ return model
268
+
269
+ def _convert_opt_causal_lm_to_prefix_lm(model: OPTForCausalLM) -> OPTForCausalLM:
270
+ """Converts an OPT Causal LM to a Prefix LM.
271
+
272
+ Supported HuggingFace model classes:
273
+ - `OPTForCausalLM`
274
+
275
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
276
+ """
277
+ if hasattr(model, '_prefix_lm_converted'):
278
+ return model
279
+ assert isinstance(model, OPTForCausalLM)
280
+ assert model.config.add_cross_attention == False, 'Only supports OPT decoder-only models'
281
+ setattr(model, '_original_forward', getattr(model, 'forward'))
282
+ setattr(model, '_original_generate', getattr(model, 'generate'))
283
+ model.model.decoder.bidirectional_mask = None
284
+
285
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
286
+ combined_attention_mask = None
287
+ if input_shape[-1] > 1:
288
+ if self.bidirectional_mask == 'g':
289
+ (bsz, src_length) = input_shape
290
+ combined_attention_mask = torch.zeros((bsz, 1, src_length, src_length + past_key_values_length), dtype=inputs_embeds.dtype, device=inputs_embeds.device)
291
+ else:
292
+ combined_attention_mask = _make_causal_mask_opt(input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length).to(inputs_embeds.device)
293
+ if self.bidirectional_mask is not None:
294
+ assert attention_mask.shape == self.bidirectional_mask.shape
295
+ expanded_bidirectional_mask = _expand_mask_opt(self.bidirectional_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
296
+ combined_attention_mask = torch.maximum(expanded_bidirectional_mask, combined_attention_mask)
297
+ if attention_mask is not None:
298
+ expanded_attn_mask = _expand_mask_opt(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
299
+ combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
300
+ return combined_attention_mask
301
+ setattr(model.model.decoder, '_prepare_decoder_attention_mask', MethodType(_prepare_decoder_attention_mask, model.model.decoder))
302
+
303
+ def forward(self: OPTForCausalLM, input_ids: Optional[torch.LongTensor]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.ByteTensor]=None, head_mask: Optional[torch.Tensor]=None, past_key_values: Optional[List[torch.FloatTensor]]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
304
+
305
+ def call_og_forward():
306
+ return self._original_forward(input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, past_key_values=past_key_values, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
307
+ if bidirectional_mask is None:
308
+ return call_og_forward()
309
+ self.model.decoder.bidirectional_mask = bidirectional_mask
310
+ try:
311
+ outputs = call_og_forward()
312
+ except:
313
+ self.model.decoder.bidirectional_mask = None
314
+ raise
315
+ self.model.decoder.bidirectional_mask = None
316
+ return outputs
317
+
318
+ def generate(self: OPTForCausalLM, *args: tuple, **kwargs: Dict[str, Any]):
319
+ """Wraps original generate to enable PrefixLM-style attention."""
320
+ self.model.decoder.bidirectional_mask = 'g'
321
+ try:
322
+ output = self._original_generate(*args, **kwargs)
323
+ except:
324
+ self.model.decoder.bidirectional_mask = None
325
+ raise
326
+ self.model.decoder.bidirectional_mask = None
327
+ return output
328
+ setattr(model, 'forward', MethodType(forward, model))
329
+ setattr(model, 'generate', MethodType(generate, model))
330
+ setattr(model, '_prefix_lm_converted', True)
331
+ return model
332
+ _SUPPORTED_HF_MODELS = _SUPPORTED_GPT_MODELS + (BloomForCausalLM, OPTForCausalLM)
333
+ CAUSAL_LM_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM, BloomForCausalLM, OPTForCausalLM]
334
+
335
+ def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES:
336
+ """Converts a HuggingFace Causal LM to a Prefix LM.
337
+
338
+ Supported HuggingFace model classes:
339
+ - `GPT2LMHeadModel`
340
+ - `GPTNeoForCausalLM`
341
+ - `GPTNeoXForCausalLM`
342
+ - `GPTJForCausalLM`
343
+ - `BloomForCausalLM`
344
+ - `OPTForCausalLM`
345
+
346
+ Conversion to a Prefix LM is done by modifying the `forward` method, and possibly also the
347
+ `generate` method and/or select underlying methods depending on the model class.
348
+
349
+ These changes preserve the model API, but add a new input to `forward`: "bidirectional_mask".
350
+
351
+ Notes on training:
352
+ To actually train the converted model as a Prefix LM, training batches will need to indicate
353
+ the prefix/target structure by including `bidirectional_mask` as part of the batch inputs.
354
+
355
+ **This is not a standard input and requires custom layers either within or after your dataloader.**
356
+
357
+ In addition to adding `bidirectional_mask` to the batch, this custom code should modify `labels`
358
+ such that `batch['labels'][batch['bidirectional_mask'] == 1] == -100`.
359
+ That is, the prefix portion of the sequence should not generate any loss. Loss should only be
360
+ generated by the target portion of the sequence.
361
+
362
+ Notes on `GPTNeoForCausalLM`:
363
+ To simplify the implementation, "global" and "local" attention layers are handled differently.
364
+ For "global" layers, we handle conversion as described above. For "local" layers, which use a
365
+ causal attention mask within a restricted local window, we do not alter the masking.
366
+
367
+ Notes on `forward` method conversion:
368
+ After conversion, the `forward` method will handle a new input, `bidirectional_mask`,
369
+ which should be a [batch_size, seq_length] byte tensor, where 1 indicates token positions
370
+ belonging to the prefix (prefix tokens can attend to one another bidirectionally), and
371
+ 0 indicates token positions belonging to the target.
372
+
373
+ The new `forward` method will incorporate `bidirectional_mask` (if supplied) into the existing
374
+ causal mask, call the original `forward` method, and (if the causal mask is a buffer) reset
375
+ the causal masks before returning the result.
376
+
377
+ Notes on `generate` method conversion:
378
+ After conversion, the `generate` method will have the same signature but will internally
379
+ convert all causal masks to be purely bidirectional, call the original `generate` method, and
380
+ (where appropriate) reset the causal masks before returning the result.
381
+
382
+ This works thanks to the logic of the HuggingFace `generate` API, which first encodes the token
383
+ "prompt" passed to `generate` (which is treated as the prefix) and then sequentially generates
384
+ each new token. Encodings are cached as generation happens, so all prefix tokens can attend to one
385
+ another (as expected in a Prefix LM) and generated tokens can only attend to prefix tokens and
386
+ previously-generated tokens (also as expected in a Prefix LM).
387
+
388
+ To preserve the API, the original methods are renamed to `_original_forward` and
389
+ `_original_generate`, and replaced with new `forward` and `generate` methods that wrap
390
+ them, respectively, although implementation details vary by model class.
391
+ """
392
+ if isinstance(model, _SUPPORTED_GPT_MODELS):
393
+ return _convert_gpt_causal_lm_to_prefix_lm(model)
394
+ elif isinstance(model, BloomForCausalLM):
395
+ return _convert_bloom_causal_lm_to_prefix_lm(model)
396
+ elif isinstance(model, OPTForCausalLM):
397
+ return _convert_opt_causal_lm_to_prefix_lm(model)
398
+ else:
399
+ raise TypeError(f'Cannot convert model to Prefix LM. ' + f'Model does not belong to set of supported HF models:' + f'\n{_SUPPORTED_HF_MODELS}')
400
+
401
+ def add_bidirectional_mask_if_missing(batch: Dict[str, Any]):
402
+ """Attempts to add bidirectional_mask to batch if missing.
403
+
404
+ Raises:
405
+ KeyError if bidirectional_mask is missing and can't be inferred
406
+ """
407
+ if 'bidirectional_mask' not in batch:
408
+ if batch.get('mode', None) == 'icl_task':
409
+ batch['bidirectional_mask'] = batch['attention_mask'].clone()
410
+ for (i, continuation_indices) in enumerate(batch['continuation_indices']):
411
+ batch['bidirectional_mask'][i, continuation_indices] = 0
412
+ elif 'labels' in batch and 'attention_mask' in batch:
413
+ batch['bidirectional_mask'] = torch.logical_and(torch.eq(batch['attention_mask'], 1), torch.eq(batch['labels'], -100)).type_as(batch['attention_mask'])
414
+ else:
415
+ raise KeyError('No bidirectional_mask in batch and not sure how to construct one.')
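
The docstring above describes the `bidirectional_mask` / `labels` convention, but the file ships no usage example. Below is a minimal, hypothetical sketch of converting a small GPT-2 checkpoint and building one prefix-LM training example by hand; the import path `hf_prefixlm_converter`, the `gpt2` checkpoint, and the toy strings are assumptions for illustration only.

```
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Assumed import path; in this repo the module lives at bak/hf_prefixlm_converter.py.
from hf_prefixlm_converter import convert_hf_causal_lm_to_prefix_lm

model = convert_hf_causal_lm_to_prefix_lm(GPT2LMHeadModel.from_pretrained("gpt2"))
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

prefix_ids = tokenizer("Translate to French: cheese", return_tensors="pt").input_ids
target_ids = tokenizer(" => fromage", return_tensors="pt").input_ids

input_ids = torch.cat([prefix_ids, target_ids], dim=1)
attention_mask = torch.ones_like(input_ids)
# 1 marks prefix tokens (bidirectional attention), 0 marks target tokens.
bidirectional_mask = torch.cat(
    [torch.ones_like(prefix_ids), torch.zeros_like(target_ids)], dim=1)
# Loss only on the target portion: prefix positions are masked with -100.
labels = input_ids.clone()
labels[bidirectional_mask == 1] = -100

outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                bidirectional_mask=bidirectional_mask, labels=labels)
print(outputs.loss)
```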
bak/meta_init_context.py ADDED
@@ -0,0 +1,94 @@
1
+ from contextlib import contextmanager
2
+ import torch
3
+ import torch.nn as nn
4
+
5
+ @contextmanager
6
+ def init_empty_weights(include_buffers: bool=False):
7
+ """Meta initialization context manager.
8
+
9
+ A context manager under which models are initialized with all parameters
10
+ on the meta device, therefore creating an empty model. Useful when just
11
+ initializing the model would blow the available RAM.
12
+
13
+ Args:
14
+ include_buffers (`bool`, *optional*, defaults to `False`): Whether or
15
+ not to also put all buffers on the meta device while initializing.
16
+
17
+ Example:
18
+ ```python
19
+ import torch.nn as nn
20
+
21
+ # Initialize a model with 100 billion parameters in no time and without using any RAM.
22
+ with init_empty_weights():
23
+ tst = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
24
+ ```
25
+
26
+ <Tip warning={true}>
27
+
28
+ Any model created under this context manager has no weights. As such you can't do something like
29
+ `model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`].
30
+
31
+ </Tip>
32
+ """
33
+ with init_on_device(torch.device('meta'), include_buffers=include_buffers) as f:
34
+ yield f
35
+
36
+ @contextmanager
37
+ def init_on_device(device: torch.device, include_buffers: bool=False):
38
+ """Device initialization context manager.
39
+
40
+ A context manager under which models are initialized with all parameters
41
+ on the specified device.
42
+
43
+ Args:
44
+ device (`torch.device`): Device to initialize all parameters on.
45
+ include_buffers (`bool`, *optional*, defaults to `False`): Whether or
46
+ not to also put all buffers on the meta device while initializing.
47
+
48
+ Example:
49
+ ```python
50
+ import torch.nn as nn
51
+
52
+ with init_on_device(device=torch.device("cuda")):
53
+ tst = nn.Linear(100, 100) # on `cuda` device
54
+ ```
55
+ """
56
+ old_register_parameter = nn.Module.register_parameter
57
+ if include_buffers:
58
+ old_register_buffer = nn.Module.register_buffer
59
+
60
+ def register_empty_parameter(module, name, param):
61
+ old_register_parameter(module, name, param)
62
+ if param is not None:
63
+ param_cls = type(module._parameters[name])
64
+ kwargs = module._parameters[name].__dict__
65
+ module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
66
+
67
+ def register_empty_buffer(module, name, buffer):
68
+ old_register_buffer(module, name, buffer)
69
+ if buffer is not None:
70
+ module._buffers[name] = module._buffers[name].to(device)
71
+ if include_buffers:
72
+ tensor_constructors_to_patch = {torch_function_name: getattr(torch, torch_function_name) for torch_function_name in ['empty', 'zeros', 'ones', 'full']}
73
+ else:
74
+ tensor_constructors_to_patch = {}
75
+
76
+ def patch_tensor_constructor(fn):
77
+
78
+ def wrapper(*args, **kwargs):
79
+ kwargs['device'] = device
80
+ return fn(*args, **kwargs)
81
+ return wrapper
82
+ try:
83
+ nn.Module.register_parameter = register_empty_parameter
84
+ if include_buffers:
85
+ nn.Module.register_buffer = register_empty_buffer
86
+ for torch_function_name in tensor_constructors_to_patch.keys():
87
+ setattr(torch, torch_function_name, patch_tensor_constructor(getattr(torch, torch_function_name)))
88
+ yield
89
+ finally:
90
+ nn.Module.register_parameter = old_register_parameter
91
+ if include_buffers:
92
+ nn.Module.register_buffer = old_register_buffer
93
+ for (torch_function_name, old_torch_function) in tensor_constructors_to_patch.items():
94
+ setattr(torch, torch_function_name, old_torch_function)
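
A short sketch of what these context managers do, assuming the file is importable as `meta_init_context`: modules built under `init_empty_weights` carry meta-device parameters, so no real memory is allocated until weights are loaded.

```
import torch.nn as nn

from meta_init_context import init_empty_weights  # assumed import path

with init_empty_weights():
    mlp = nn.Sequential(nn.Linear(4096, 16384), nn.GELU(), nn.Linear(16384, 4096))

print(next(mlp.parameters()).device)             # meta
print(sum(p.numel() for p in mlp.parameters()))  # shapes and counts are still known
```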
bak/modeling_mpt.py ADDED
@@ -0,0 +1,374 @@
1
+ """A simple, flexible implementation of a GPT model.
2
+
3
+ Inspired by https://github.com/karpathy/minGPT/blob/master/mingpt/model.py
4
+ """
5
+ import math
6
+ import warnings
7
+ from typing import List, Optional, Tuple, Union
8
+ import torch
9
+ import torch.nn as nn
10
+ import torch.nn.functional as F
11
+ from transformers import PreTrainedModel, PreTrainedTokenizer, PreTrainedTokenizerFast
12
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
13
+ from .attention import attn_bias_shape, build_attn_bias
14
+ from .blocks import MPTBlock
15
+ from .custom_embedding import SharedEmbedding
16
+ from .norm import NORM_CLASS_REGISTRY
17
+ from .configuration_mpt import MPTConfig
18
+ from .adapt_tokenizer import AutoTokenizerForMOD, adapt_tokenizer_for_denoising
19
+ from .hf_prefixlm_converter import add_bidirectional_mask_if_missing, convert_hf_causal_lm_to_prefix_lm
20
+ from .meta_init_context import init_empty_weights
21
+ from .param_init_fns import MODEL_INIT_REGISTRY, generic_param_init_fn_
22
+ try:
23
+ from .flash_attn_triton import flash_attn_func
24
+ except:
25
+ pass
26
+ Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
27
+
28
+ class MPTPreTrainedModel(PreTrainedModel):
29
+ config_class = MPTConfig
30
+ base_model_prefix = 'model'
31
+ _no_split_modules = ['MPTBlock']
32
+
33
+ supports_gradient_checkpointing = True
34
+ def _set_gradient_checkpointing(self, module, value=False):
35
+ if isinstance(module, MPTModel):
36
+ module.gradient_checkpointing = value
37
+
38
+ class MPTModel(MPTPreTrainedModel):
39
+
40
+ def __init__(self, config: MPTConfig):
41
+ config._validate_config()
42
+ super().__init__(config)
43
+ self.gradient_checkpointing = False
44
+ self.attn_impl = config.attn_config['attn_impl']
45
+ self.prefix_lm = config.attn_config['prefix_lm']
46
+ self.attn_uses_sequence_id = config.attn_config['attn_uses_sequence_id']
47
+ self.alibi = config.attn_config['alibi']
48
+ self.alibi_bias_max = config.attn_config['alibi_bias_max']
49
+ if config.init_device == 'mixed':
50
+ if dist.get_local_rank() == 0:
51
+ config.init_device = 'cpu'
52
+ else:
53
+ config.init_device = 'meta'
54
+ if config.norm_type.lower() not in NORM_CLASS_REGISTRY.keys():
55
+ norm_options = ' | '.join(NORM_CLASS_REGISTRY.keys())
56
+ raise NotImplementedError(f'Requested norm type ({config.norm_type}) is not implemented within this repo (Options: {norm_options}).')
57
+ norm_class = NORM_CLASS_REGISTRY[config.norm_type.lower()]
58
+ self.embedding_fraction = config.embedding_fraction
59
+ self.wte = SharedEmbedding(config.vocab_size, config.d_model, device=config.init_device)
60
+ if not self.alibi:
61
+ self.wpe = torch.nn.Embedding(config.max_seq_len, config.d_model, device=config.init_device)
62
+ self.emb_drop = nn.Dropout(config.emb_pdrop)
63
+ self.blocks = nn.ModuleList([MPTBlock(device=config.init_device, **config.to_dict()) for _ in range(config.n_layers)])
64
+ self.norm_f = norm_class(config.d_model, device=config.init_device)
65
+ if config.init_device != 'meta':
66
+ print(f'You are using config.init_device={config.init_device!r}, but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.')
67
+ self.apply(self.param_init_fn)
68
+ self.is_causal = not self.prefix_lm
69
+ self._attn_bias_initialized = False
70
+ self.attn_bias = None
71
+ self.attn_bias_shape = attn_bias_shape(self.attn_impl, config.n_heads, config.max_seq_len, self.alibi, prefix_lm=self.prefix_lm, causal=self.is_causal, use_sequence_id=self.attn_uses_sequence_id)
72
+ if config.no_bias:
73
+ for module in self.modules():
74
+ if hasattr(module, 'bias') and isinstance(module.bias, nn.Parameter):
75
+ if config.verbose:
76
+ warnings.warn(f'Removing bias ({module.bias}) from {module}.')
77
+ module.register_parameter('bias', None)
78
+ if config.verbose and config.verbose > 2:
79
+ print(self)
80
+ if 'verbose' not in self.config.init_config:
81
+ self.config.init_config['verbose'] = self.config.verbose
82
+ if self.config.init_config['verbose'] > 1:
83
+ init_fn_name = self.config.init_config['name']
84
+ warnings.warn(f'Using {init_fn_name} initialization.')
85
+
86
+ def get_input_embeddings(self):
87
+ return self.wte
88
+
89
+ def set_input_embeddings(self, value):
90
+ self.wte = value
91
+
92
+ @torch.no_grad()
93
+ def _attn_bias(self, device, dtype, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None):
94
+ if not self._attn_bias_initialized:
95
+ if self.attn_bias_shape:
96
+ self.attn_bias = torch.zeros(self.attn_bias_shape, device=device, dtype=dtype)
97
+ self.attn_bias = build_attn_bias(self.attn_impl, self.attn_bias, self.config.n_heads, self.config.max_seq_len, causal=self.is_causal, alibi=self.alibi, alibi_bias_max=self.alibi_bias_max)
98
+ self._attn_bias_initialized = True
99
+ if self.attn_impl == 'flash':
100
+ return (self.attn_bias, attention_mask)
101
+ if self.attn_bias is not None:
102
+ self.attn_bias = self.attn_bias.to(dtype=dtype, device=device)
103
+ attn_bias = self.attn_bias
104
+ if self.prefix_lm:
105
+ assert isinstance(attn_bias, torch.Tensor)
106
+ assert isinstance(prefix_mask, torch.Tensor)
107
+ attn_bias = self._apply_prefix_mask(attn_bias, prefix_mask)
108
+ if self.attn_uses_sequence_id and sequence_id is not None:
109
+ assert isinstance(attn_bias, torch.Tensor)
110
+ attn_bias = self._apply_sequence_id(attn_bias, sequence_id)
111
+ if attention_mask is not None:
112
+ s_k = attention_mask.shape[-1]
113
+ if attn_bias is None:
114
+ attn_bias = torch.zeros((1, 1, 1, s_k), device=device, dtype=dtype)
115
+ else:
116
+ _s_k = max(0, attn_bias.size(-1) - s_k)
117
+ attn_bias = attn_bias[:, :, :, _s_k:]
118
+ if prefix_mask is not None and attention_mask.shape != prefix_mask.shape:
119
+ raise ValueError(f'attention_mask shape={attention_mask.shape} ' + f'and prefix_mask shape={prefix_mask.shape} are not equal.')
120
+ min_val = torch.finfo(attn_bias.dtype).min
121
+ attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
122
+ return (attn_bias, None)
123
+
124
+ def _apply_prefix_mask(self, attn_bias: torch.Tensor, prefix_mask: torch.Tensor):
125
+ (s_k, s_q) = attn_bias.shape[-2:]
126
+ if s_k != self.config.max_seq_len or s_q != self.config.max_seq_len:
127
+ raise ValueError('attn_bias does not match the expected shape. ' + f'The last two dimensions should both be {self.config.max_seq_len} ' + f'but are {s_k} and {s_q}.')
128
+ seq_len = prefix_mask.shape[-1]
129
+ if seq_len > self.config.max_seq_len:
130
+ raise ValueError(f'prefix_mask sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
131
+ attn_bias = attn_bias[..., :seq_len, :seq_len]
132
+ causal = torch.tril(torch.ones((seq_len, seq_len), dtype=torch.bool, device=prefix_mask.device)).view(1, 1, seq_len, seq_len)
133
+ prefix = prefix_mask.view(-1, 1, 1, seq_len)
134
+ cannot_attend = ~torch.logical_or(causal, prefix.bool())
135
+ min_val = torch.finfo(attn_bias.dtype).min
136
+ attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
137
+ return attn_bias
138
+
139
+ def _apply_sequence_id(self, attn_bias: torch.Tensor, sequence_id: torch.LongTensor):
140
+ seq_len = sequence_id.shape[-1]
141
+ if seq_len > self.config.max_seq_len:
142
+ raise ValueError(f'sequence_id sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
143
+ attn_bias = attn_bias[..., :seq_len, :seq_len]
144
+ cannot_attend = torch.logical_not(torch.eq(sequence_id.view(-1, seq_len, 1), sequence_id.view(-1, 1, seq_len))).unsqueeze(1)
145
+ min_val = torch.finfo(attn_bias.dtype).min
146
+ attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
147
+ return attn_bias
148
+
149
+ def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor] = None):
150
+ return_dict = return_dict if return_dict is not None else self.config.return_dict
151
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
152
+ if self.gradient_checkpointing and self.training:
153
+ if use_cache:
154
+ use_cache = False
155
+ if input_ids is not None and inputs_embeds is not None:
156
+ raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
157
+ elif input_ids is not None:
158
+ batch_size, seq_length = input_ids.shape
159
+ elif inputs_embeds is not None:
160
+ batch_size, seq_length, _ = inputs_embeds.shape
161
+ else:
162
+ raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
163
+
164
+ seq_length_with_past = seq_length
165
+ past_key_values_length = 0
166
+
167
+ if past_key_values is not None:
168
+ past_key_values_length = past_key_values[0][0].shape[2]
169
+ seq_length_with_past = seq_length_with_past + past_key_values_length
170
+ if attention_mask is not None:
171
+ attention_mask = attention_mask.bool()
172
+ else:
173
+ attention_mask = torch.ones(
174
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=input_ids.device if input_ids is not None else inputs_embeds.device
175
+ )
176
+
177
+ if inputs_embeds is None:
178
+ tok_emb = self.wte(input_ids)
179
+ else:
180
+ tok_emb = inputs_embeds
181
+ if prefix_mask is not None:
182
+ prefix_mask = prefix_mask.bool()
183
+ if not return_dict:
184
+ raise NotImplementedError('return_dict False is not implemented yet for MPT')
185
+ if output_attentions:
186
+ if self.attn_impl != 'torch':
187
+ raise NotImplementedError('output_attentions is not implemented for MPT when using attn_impl `flash` or `triton`.')
188
+ #if attention_mask is not None and attention_mask[:, 0].sum() != attention_mask.shape[0] and self.training:
189
+ # raise NotImplementedError('MPT does not support training with left padding.')
190
+ if self.prefix_lm and prefix_mask is None:
191
+ raise ValueError('prefix_mask is a required argument when MPT is configured with prefix_lm=True.')
192
+ if self.training:
193
+ if self.attn_uses_sequence_id and sequence_id is None:
194
+ raise ValueError('sequence_id is a required argument when MPT is configured with attn_uses_sequence_id=True ' + 'and the model is in train mode.')
195
+ elif self.attn_uses_sequence_id is False and sequence_id is not None:
196
+ warnings.warn('MPT received non-None input for `sequence_id` but is configured with attn_uses_sequence_id=False. ' + 'This input will be ignored. If you want the model to use `sequence_id`, set attn_uses_sequence_id to True.')
197
+ #S = input_ids.size(1)
198
+ S = seq_length
199
+ assert S <= self.config.max_seq_len, f'Cannot forward input with seq_len={S}, this model only supports seq_len<={self.config.max_seq_len}'
200
+ #tok_emb = self.wte(input_ids)  # tok_emb is already computed above (from input_ids or inputs_embeds)
201
+ if self.alibi:
202
+ x = tok_emb
203
+ else:
204
+ past_position = 0
205
+ if past_key_values is not None:
206
+ if len(past_key_values) != self.config.n_layers:
207
+ raise ValueError(f'past_key_values must provide a past_key_value for each attention ' + f'layer in the network (len(past_key_values)={len(past_key_values)!r}; self.config.n_layers={self.config.n_layers!r}).')
208
+ past_position = past_key_values[0][0].size(1)
209
+ if self.attn_impl == 'torch':
210
+ past_position = past_key_values[0][0].size(3)
211
+ if S + past_position > self.config.max_seq_len:
212
+ raise ValueError(f'Cannot forward input with past sequence length {past_position} and current sequence length {S + 1}, this model only supports total sequence length <= {self.config.max_seq_len}.')
213
+ pos = torch.arange(past_position, S + past_position, dtype=torch.long, device=input_ids.device).unsqueeze(0)
214
+ if attention_mask is not None and not self.training:
215
+ pos = torch.clamp(pos - torch.cumsum((~attention_mask).to(torch.int32), dim=1)[:, past_position:], min=0)
216
+ pos_emb = self.wpe(pos)
217
+ x = tok_emb + pos_emb
218
+ if self.embedding_fraction == 1:
219
+ x = self.emb_drop(x)
220
+ else:
221
+ x_shrunk = x * self.embedding_fraction + x.detach() * (1 - self.embedding_fraction)
222
+ assert isinstance(self.emb_drop, nn.Module)
223
+ x = self.emb_drop(x_shrunk)
224
+ (attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=x.dtype, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
225
+ if use_cache and past_key_values is None:
226
+ past_key_values = [() for _ in range(self.config.n_layers)]
227
+ all_hidden_states = () if output_hidden_states else None
228
+ all_self_attns = () if output_attentions else None
229
+ for (b_idx, block) in enumerate(self.blocks):
230
+ if output_hidden_states:
231
+ assert all_hidden_states is not None
232
+ all_hidden_states = all_hidden_states + (x,)
233
+ past_key_value = past_key_values[b_idx] if past_key_values is not None else None
234
+
235
+ if self.gradient_checkpointing and self.training:
236
+ def create_custom_forward(module):
237
+ def custom_forward(*inputs):
238
+ # None for past_key_value
239
+ return module(*inputs)
240
+
241
+ return custom_forward
242
+ x, attn_weights, past_key_value = torch.utils.checkpoint.checkpoint(
243
+ create_custom_forward(block),
244
+ x,
245
+ past_key_value,
246
+ attn_bias,
247
+ attention_mask,
248
+ self.is_causal,
249
+ )
250
+ else:
251
+ x, attn_weights, past_key_value = block(x, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=self.is_causal)
252
+
253
+ if past_key_values is not None:
254
+ past_key_values[b_idx] = past_key_value
255
+ if output_attentions:
256
+ assert all_self_attns is not None
257
+ all_self_attns = all_self_attns + (attn_weights,)
258
+ x = self.norm_f(x)
259
+ if output_hidden_states:
260
+ assert all_hidden_states is not None
261
+ all_hidden_states = all_hidden_states + (x,)
262
+ return BaseModelOutputWithPast(last_hidden_state=x, past_key_values=past_key_values, hidden_states=all_hidden_states, attentions=all_self_attns)
263
+
264
+ def param_init_fn(self, module):
265
+ init_fn_name = self.config.init_config['name']
266
+ MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)
267
+
268
+ def fsdp_wrap_fn(self, module):
269
+ return isinstance(module, MPTBlock)
270
+
271
+ def activation_checkpointing_fn(self, module):
272
+ return isinstance(module, MPTBlock)
273
+
274
+ class MPTForCausalLM(MPTPreTrainedModel):
275
+
276
+ def __init__(self, config: MPTConfig):
277
+ super().__init__(config)
278
+ if not config.tie_word_embeddings:
279
+ raise ValueError('MPTForCausalLM only supports tied word embeddings')
280
+ self.transformer = MPTModel(config)
281
+ for child in self.transformer.children():
282
+ if isinstance(child, torch.nn.ModuleList):
283
+ continue
284
+ if isinstance(child, torch.nn.Module):
285
+ child._fsdp_wrap = True
286
+ self.logit_scale = None
287
+ if config.logit_scale is not None:
288
+ logit_scale = config.logit_scale
289
+ if isinstance(logit_scale, str):
290
+ if logit_scale == 'inv_sqrt_d_model':
291
+ logit_scale = 1 / math.sqrt(config.d_model)
292
+ else:
293
+ raise ValueError(f"logit_scale={logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
294
+ self.logit_scale = logit_scale
295
+
296
+ def get_input_embeddings(self):
297
+ return self.transformer.wte
298
+
299
+ def set_input_embeddings(self, value):
300
+ self.transformer.wte = value
301
+
302
+ def get_output_embeddings(self):
303
+ return self.transformer.wte
304
+
305
+ def set_output_embeddings(self, new_embeddings):
306
+ self.transformer.wte = new_embeddings
307
+
308
+ def set_decoder(self, decoder):
309
+ self.transformer = decoder
310
+
311
+ def get_decoder(self):
312
+ return self.transformer
313
+
314
+ def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, labels: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor] = None):
315
+ return_dict = return_dict if return_dict is not None else self.config.return_dict
316
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
317
+ outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache, inputs_embeds=inputs_embeds)
318
+
319
+ last_hidden_state = outputs.last_hidden_state
320
+ if self.model_parallel:
321
+ last_hidden_state = last_hidden_state.to(self.transformer.wte.weight.device)
322
+ logits = F.linear(last_hidden_state, self.transformer.wte.weight)
323
+ if self.logit_scale is not None:
324
+ if self.logit_scale == 0:
325
+ warnings.warn(f'Multiplying logits by self.logit_scale={self.logit_scale!r}. This will produce uniform (uninformative) outputs.')
326
+ logits *= self.logit_scale
327
+ loss = None
328
+ if labels is not None:
329
+ labels = torch.roll(labels, shifts=-1)
330
+ labels[:, -1] = -100
331
+ loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.to(logits.device).view(-1))
332
+ return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=outputs.past_key_values, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
333
+
334
+ def param_init_fn(self, module):
335
+ init_fn_name = self.config.init_config['name']
336
+ MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)
337
+
338
+ def fsdp_wrap_fn(self, module):
339
+ return isinstance(module, MPTBlock)
340
+
341
+ def activation_checkpointing_fn(self, module):
342
+ return isinstance(module, MPTBlock)
343
+
344
+ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
345
+ if inputs_embeds is not None:
346
+ raise NotImplementedError('inputs_embeds is not implemented for MPT yet')
347
+ attention_mask = kwargs['attention_mask'].bool()
348
+ if attention_mask[:, -1].sum() != attention_mask.shape[0]:
349
+ raise NotImplementedError('MPT does not support generation with right padding.')
350
+ if self.transformer.attn_uses_sequence_id and self.training:
351
+ sequence_id = torch.zeros_like(input_ids[:1])
352
+ else:
353
+ sequence_id = None
354
+ if past_key_values is not None:
355
+ input_ids = input_ids[:, -1].unsqueeze(-1)
356
+ if self.transformer.prefix_lm:
357
+ prefix_mask = torch.ones_like(attention_mask)
358
+ if kwargs.get('use_cache') == False:
359
+ raise NotImplementedError('MPT with prefix_lm=True does not support use_cache=False.')
360
+ else:
361
+ prefix_mask = None
362
+ return {'input_ids': input_ids, 'attention_mask': attention_mask, 'prefix_mask': prefix_mask, 'sequence_id': sequence_id, 'past_key_values': past_key_values, 'use_cache': kwargs.get('use_cache', True)}
363
+
364
+ @staticmethod
365
+ def _reorder_cache(past_key_values, beam_idx):
366
+ """Used by HuggingFace generate when using beam search with kv-caching.
367
+
368
+ See https://github.com/huggingface/transformers/blob/3ec7a47664ebe40c40f4b722f6bb1cd30c3821ec/src/transformers/models/gpt2/modeling_gpt2.py#L1122-L1133
369
+ for an example in transformers.
370
+ """
371
+ reordered_past = []
372
+ for layer_past in past_key_values:
373
+ reordered_past += [tuple((past_state.index_select(0, beam_idx) for past_state in layer_past))]
374
+ return reordered_past
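
The `supports_gradient_checkpointing` flag and `_set_gradient_checkpointing` hook above are what make the standard HuggingFace `gradient_checkpointing_enable()` API work with this model. A hedged sketch (the local path and dtype are assumptions):

```
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./mpt-30b",               # assumed local path to this modified checkpoint
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.gradient_checkpointing_enable()  # sets MPTModel.gradient_checkpointing = True
model.train()
# Each MPTBlock is now re-computed during the backward pass instead of caching
# its activations, trading extra compute for lower memory use.
```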
bak/norm.py ADDED
@@ -0,0 +1,56 @@
1
+ import torch
2
+
3
+ def _cast_if_autocast_enabled(tensor):
4
+ if torch.is_autocast_enabled():
5
+ if tensor.device.type == 'cuda':
6
+ dtype = torch.get_autocast_gpu_dtype()
7
+ elif tensor.device.type == 'cpu':
8
+ dtype = torch.get_autocast_cpu_dtype()
9
+ else:
10
+ raise NotImplementedError()
11
+ return tensor.to(dtype=dtype)
12
+ return tensor
13
+
14
+ class LPLayerNorm(torch.nn.LayerNorm):
15
+
16
+ def __init__(self, normalized_shape, eps=1e-05, elementwise_affine=True, device=None, dtype=None):
17
+ super().__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
18
+
19
+ def forward(self, x):
20
+ module_device = x.device
21
+ downcast_x = _cast_if_autocast_enabled(x)
22
+ downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
23
+ downcast_bias = _cast_if_autocast_enabled(self.bias) if self.bias is not None else self.bias
24
+ with torch.autocast(enabled=False, device_type=module_device.type):
25
+ return torch.nn.functional.layer_norm(downcast_x, self.normalized_shape, downcast_weight, downcast_bias, self.eps)
26
+
27
+ def rms_norm(x, weight=None, eps=1e-05):
28
+ output = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
29
+ if weight is not None:
30
+ return output * weight
31
+ return output
32
+
33
+ class RMSNorm(torch.nn.Module):
34
+
35
+ def __init__(self, normalized_shape, eps=1e-05, weight=True, dtype=None, device=None):
36
+ super().__init__()
37
+ self.eps = eps
38
+ if weight:
39
+ self.weight = torch.nn.Parameter(torch.ones(normalized_shape, dtype=dtype, device=device))
40
+ else:
41
+ self.register_parameter('weight', None)
42
+
43
+ def forward(self, x):
44
+ return rms_norm(x.float(), self.weight, self.eps).to(dtype=x.dtype)
45
+
46
+ class LPRMSNorm(RMSNorm):
47
+
48
+ def __init__(self, normalized_shape, eps=1e-05, weight=True, dtype=None, device=None):
49
+ super().__init__(normalized_shape=normalized_shape, eps=eps, weight=weight, dtype=dtype, device=device)
50
+
51
+ def forward(self, x):
52
+ downcast_x = _cast_if_autocast_enabled(x)
53
+ downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
54
+ with torch.autocast(enabled=False, device_type=x.device.type):
55
+ return rms_norm(downcast_x, downcast_weight, self.eps).to(dtype=x.dtype)
56
+ NORM_CLASS_REGISTRY = {'layernorm': torch.nn.LayerNorm, 'low_precision_layernorm': LPLayerNorm, 'rmsnorm': RMSNorm, 'low_precision_rmsnorm': LPRMSNorm}
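
A quick sketch of how `NORM_CLASS_REGISTRY` is consumed: the `norm_type` config string selects the class. The import path and batch dimensions below are illustrative assumptions; 7168 is the `d_model` from this model's config.

```
import torch

from norm import NORM_CLASS_REGISTRY  # assumed import path

norm_class = NORM_CLASS_REGISTRY['low_precision_rmsnorm']
norm = norm_class(normalized_shape=7168)
x = torch.randn(2, 16, 7168)
print(norm(x).shape)  # torch.Size([2, 16, 7168])
```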
bak/param_init_fns.py ADDED
@@ -0,0 +1,181 @@
1
+ import math
2
+ import warnings
3
+ from collections.abc import Sequence
4
+ from functools import partial
5
+ from typing import Optional, Tuple, Union
6
+ import torch
7
+ from torch import nn
8
+ from .norm import NORM_CLASS_REGISTRY
9
+
10
+ def torch_default_param_init_fn_(module: nn.Module, verbose: int=0, **kwargs):
11
+ del kwargs
12
+ if verbose > 1:
13
+ warnings.warn(f"Initializing network using module's reset_parameters attribute")
14
+ if hasattr(module, 'reset_parameters'):
15
+ module.reset_parameters()
16
+
17
+ def fused_init_helper_(module: nn.Module, init_fn_):
18
+ _fused = getattr(module, '_fused', None)
19
+ if _fused is None:
20
+ raise RuntimeError(f'Internal logic error')
21
+ (dim, splits) = _fused
22
+ splits = (0, *splits, module.weight.size(dim))
23
+ for (s, e) in zip(splits[:-1], splits[1:]):
24
+ slice_indices = [slice(None)] * module.weight.ndim
25
+ slice_indices[dim] = slice(s, e)
26
+ init_fn_(module.weight[slice_indices])
27
+
28
+ def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
29
+ del kwargs
30
+ if verbose > 1:
31
+ warnings.warn(f'If model has bias parameters they are initialized to 0.')
32
+ init_div_is_residual = init_div_is_residual
33
+ if init_div_is_residual is False:
34
+ div_is_residual = 1.0
35
+ elif init_div_is_residual is True:
36
+ div_is_residual = math.sqrt(2 * n_layers)
37
+ elif isinstance(init_div_is_residual, float) or isinstance(init_div_is_residual, int):
38
+ div_is_residual = init_div_is_residual
39
+ elif isinstance(init_div_is_residual, str) and init_div_is_residual.isnumeric():
40
+ div_is_residual = float(init_div_is_residual)
41
+ else:
42
+ div_is_residual = 1.0
43
+ raise ValueError(f'Expected init_div_is_residual to be boolean or numeric, got {init_div_is_residual}')
44
+ if init_div_is_residual is not False:
45
+ if verbose > 1:
46
+ warnings.warn(f'Initializing _is_residual layers then dividing them by {div_is_residual:.3f}. ' + f'Set `init_div_is_residual: false` in init config to disable this.')
47
+ if isinstance(module, nn.Linear):
48
+ if hasattr(module, '_fused'):
49
+ fused_init_helper_(module, init_fn_)
50
+ else:
51
+ init_fn_(module.weight)
52
+ if module.bias is not None:
53
+ torch.nn.init.zeros_(module.bias)
54
+ if init_div_is_residual is not False and getattr(module, '_is_residual', False):
55
+ with torch.no_grad():
56
+ module.weight.div_(div_is_residual)
57
+ elif isinstance(module, nn.Embedding):
58
+ if emb_init_std is not None:
59
+ std = emb_init_std
60
+ if std == 0:
61
+ warnings.warn(f'Embedding layer initialized to 0.')
62
+ emb_init_fn_ = partial(torch.nn.init.normal_, mean=0.0, std=std)
63
+ if verbose > 1:
64
+ warnings.warn(f'Embedding layer initialized using normal distribution with mean=0 and std={std!r}.')
65
+ elif emb_init_uniform_lim is not None:
66
+ lim = emb_init_uniform_lim
67
+ if isinstance(lim, Sequence):
68
+ if len(lim) > 2:
69
+ raise ValueError(f'Uniform init requires a min and a max limit. User input: {lim}.')
70
+ if lim[0] == lim[1]:
71
+ warnings.warn(f'Embedding layer initialized to {lim[0]}.')
72
+ else:
73
+ if lim == 0:
74
+ warnings.warn(f'Embedding layer initialized to 0.')
75
+ lim = [-lim, lim]
76
+ (a, b) = lim
77
+ emb_init_fn_ = partial(torch.nn.init.uniform_, a=a, b=b)
78
+ if verbose > 1:
79
+ warnings.warn(f'Embedding layer initialized using uniform distribution in range {lim}.')
80
+ else:
81
+ emb_init_fn_ = init_fn_
82
+ emb_init_fn_(module.weight)
83
+ elif isinstance(module, tuple(set(NORM_CLASS_REGISTRY.values()))):
84
+ if verbose > 1:
85
+ warnings.warn(f'Norm weights are set to 1. If norm layer has a bias it is initialized to 0.')
86
+ if hasattr(module, 'weight') and module.weight is not None:
87
+ torch.nn.init.ones_(module.weight)
88
+ if hasattr(module, 'bias') and module.bias is not None:
89
+ torch.nn.init.zeros_(module.bias)
90
+ elif isinstance(module, nn.MultiheadAttention):
91
+ if module._qkv_same_embed_dim:
92
+ assert module.in_proj_weight is not None
93
+ assert module.q_proj_weight is None and module.k_proj_weight is None and (module.v_proj_weight is None)
94
+ assert d_model is not None
95
+ _d = d_model
96
+ splits = (0, _d, 2 * _d, 3 * _d)
97
+ for (s, e) in zip(splits[:-1], splits[1:]):
98
+ init_fn_(module.in_proj_weight[s:e])
99
+ else:
100
+ assert module.q_proj_weight is not None and module.k_proj_weight is not None and (module.v_proj_weight is not None)
101
+ assert module.in_proj_weight is None
102
+ init_fn_(module.q_proj_weight)
103
+ init_fn_(module.k_proj_weight)
104
+ init_fn_(module.v_proj_weight)
105
+ if module.in_proj_bias is not None:
106
+ torch.nn.init.zeros_(module.in_proj_bias)
107
+ if module.bias_k is not None:
108
+ torch.nn.init.zeros_(module.bias_k)
109
+ if module.bias_v is not None:
110
+ torch.nn.init.zeros_(module.bias_v)
111
+ init_fn_(module.out_proj.weight)
112
+ if init_div_is_residual is not False and getattr(module.out_proj, '_is_residual', False):
113
+ with torch.no_grad():
114
+ module.out_proj.weight.div_(div_is_residual)
115
+ if module.out_proj.bias is not None:
116
+ torch.nn.init.zeros_(module.out_proj.bias)
117
+ else:
118
+ for _ in module.parameters(recurse=False):
119
+ raise NotImplementedError(f'{module.__class__.__name__} parameters are not initialized by param_init_fn.')
120
+
121
+ def _normal_init_(std, mean=0.0):
122
+ return partial(torch.nn.init.normal_, mean=mean, std=std)
123
+
124
+ def _normal_param_init_fn_(module: nn.Module, std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
125
+ del kwargs
126
+ init_fn_ = _normal_init_(std=std)
127
+ if verbose > 1:
128
+ warnings.warn(f'Using torch.nn.init.normal_ init fn mean=0.0, std={std}')
129
+ generic_param_init_fn_(module=module, init_fn_=init_fn_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
130
+
131
+ def baseline_param_init_fn_(module: nn.Module, init_std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
132
+ del kwargs
133
+ if init_std is None:
134
+ raise ValueError("You must set model.init_config['init_std'] to a float value to use the default initialization scheme.")
135
+ _normal_param_init_fn_(module=module, std=init_std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
136
+
137
+ def small_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
138
+ del kwargs
139
+ std = math.sqrt(2 / (5 * d_model))
140
+ _normal_param_init_fn_(module=module, std=std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
141
+
142
+ def neox_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
143
+ """From section 2.3.1 of GPT-NeoX-20B:
144
+
145
+ An Open-Source Autoregressive Language Model — Black et al. (2022)
146
+ see https://github.com/EleutherAI/gpt-neox/blob/9610391ab319403cef079b438edd016a2443af54/megatron/model/init_functions.py#L151
147
+ and https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py
148
+ """
149
+ del kwargs
150
+ residual_div = n_layers / math.sqrt(10)
151
+ if verbose > 1:
152
+ warnings.warn(f'setting init_div_is_residual to {residual_div}')
153
+ small_param_init_fn_(module=module, d_model=d_model, n_layers=n_layers, init_div_is_residual=residual_div, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
154
+
155
+ def kaiming_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', verbose: int=0, **kwargs):
156
+ del kwargs
157
+ if verbose > 1:
158
+ warnings.warn(f'Using nn.init.kaiming_uniform_ init fn with parameters: ' + f'a={init_gain}, mode={fan_mode}, nonlinearity={init_nonlinearity}')
159
+ kaiming_uniform_ = partial(nn.init.kaiming_uniform_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
160
+ generic_param_init_fn_(module=module, init_fn_=kaiming_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
161
+
162
+ def kaiming_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', verbose: int=0, **kwargs):
163
+ del kwargs
164
+ if verbose > 1:
165
+ warnings.warn(f'Using nn.init.kaiming_normal_ init fn with parameters: ' + f'a={init_gain}, mode={fan_mode}, nonlinearity={init_nonlinearity}')
166
+ kaiming_normal_ = partial(torch.nn.init.kaiming_normal_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
167
+ generic_param_init_fn_(module=module, init_fn_=kaiming_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
168
+
169
+ def xavier_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, verbose: int=0, **kwargs):
170
+ del kwargs
171
+ xavier_uniform_ = partial(torch.nn.init.xavier_uniform_, gain=init_gain)
172
+ if verbose > 1:
173
+ warnings.warn(f'Using torch.nn.init.xavier_uniform_ init fn with parameters: ' + f'gain={init_gain}')
174
+ generic_param_init_fn_(module=module, init_fn_=xavier_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
175
+
176
+ def xavier_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, verbose: int=0, **kwargs):
177
+ xavier_normal_ = partial(torch.nn.init.xavier_normal_, gain=init_gain)
178
+ if verbose > 1:
179
+ warnings.warn(f'Using torch.nn.init.xavier_normal_ init fn with parameters: ' + f'gain={init_gain}')
180
+ generic_param_init_fn_(module=module, init_fn_=xavier_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
181
+ MODEL_INIT_REGISTRY = {'default_': torch_default_param_init_fn_, 'baseline_': baseline_param_init_fn_, 'kaiming_uniform_': kaiming_uniform_param_init_fn_, 'kaiming_normal_': kaiming_normal_param_init_fn_, 'neox_init_': neox_param_init_fn_, 'small_init_': small_param_init_fn_, 'xavier_uniform_': xavier_uniform_param_init_fn_, 'xavier_normal_': xavier_normal_param_init_fn_}
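
The registry at the bottom is looked up by name in `MPTModel.param_init_fn` and `MPTForCausalLM.param_init_fn` above. A minimal sketch of the same pattern on a toy module, assuming the file is importable as `param_init_fns`; the toy dimensions are arbitrary:

```
import torch.nn as nn

from param_init_fns import MODEL_INIT_REGISTRY  # assumed import path

init_config = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu',
               'init_gain': 0.0, 'init_div_is_residual': True, 'emb_init_std': None,
               'emb_init_uniform_lim': None, 'init_std': None, 'verbose': 0}

toy = nn.Sequential(nn.Embedding(50432, 512), nn.Linear(512, 512))
init_fn = MODEL_INIT_REGISTRY[init_config['name']]
# Re-initialize every submodule with the named scheme, as param_init_fn does.
toy.apply(lambda module: init_fn(module=module, n_layers=48, d_model=512, **init_config))
```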
blocks.py ADDED
@@ -0,0 +1,41 @@
1
+ """GPT Blocks used for the GPT Model."""
2
+ from typing import Dict, Optional, Tuple
3
+ import torch
4
+ import torch.nn as nn
5
+ from attention import ATTN_CLASS_REGISTRY
6
+ from norm import NORM_CLASS_REGISTRY
7
+
8
+ class MPTMLP(nn.Module):
9
+
10
+ def __init__(self, d_model: int, expansion_ratio: int, device: Optional[str]=None):
11
+ super().__init__()
12
+ self.up_proj = nn.Linear(d_model, expansion_ratio * d_model, device=device)
13
+ self.act = nn.GELU(approximate='none')
14
+ self.down_proj = nn.Linear(expansion_ratio * d_model, d_model, device=device)
15
+ self.down_proj._is_residual = True
16
+
17
+ def forward(self, x):
18
+ return self.down_proj(self.act(self.up_proj(x)))
19
+
20
+ class MPTBlock(nn.Module):
21
+
22
+ def __init__(self, d_model: int, n_heads: int, expansion_ratio: int, attn_config: Dict={'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}, resid_pdrop: float=0.0, norm_type: str='low_precision_layernorm', device: Optional[str]=None, **kwargs):
23
+ del kwargs
24
+ super().__init__()
25
+ norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
26
+ attn_class = ATTN_CLASS_REGISTRY[attn_config['attn_type']]
27
+ self.norm_1 = norm_class(d_model, device=device)
28
+ self.attn = attn_class(attn_impl=attn_config['attn_impl'], clip_qkv=attn_config['clip_qkv'], qk_ln=attn_config['qk_ln'], softmax_scale=attn_config['softmax_scale'], attn_pdrop=attn_config['attn_pdrop'], d_model=d_model, n_heads=n_heads, device=device)
29
+ self.norm_2 = norm_class(d_model, device=device)
30
+ self.ffn = MPTMLP(d_model=d_model, expansion_ratio=expansion_ratio, device=device)
31
+ self.resid_attn_dropout = nn.Dropout(resid_pdrop)
32
+ self.resid_ffn_dropout = nn.Dropout(resid_pdrop)
33
+
34
+ def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.ByteTensor]=None, is_causal: bool=True) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor]]]:
35
+ a = self.norm_1(x)
36
+ (b, _, past_key_value) = self.attn(a, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=is_causal)
37
+ x = x + self.resid_attn_dropout(b)
38
+ m = self.norm_2(x)
39
+ n = self.ffn(m)
40
+ x = x + self.resid_ffn_dropout(n)
41
+ return (x, past_key_value)
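
A small, hypothetical smoke test of the block above, assuming `attention.py` and `norm.py` sit next to `blocks.py` and the script is run from that directory. The dimensions are toy values, and `attn_impl` is set to `torch` so the sketch runs without Triton or FlashAttention:

```
import torch

from blocks import MPTBlock  # assumed import path

attn_config = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'torch',
               'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False,
               'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}

block = MPTBlock(d_model=256, n_heads=8, expansion_ratio=4, attn_config=attn_config)
x = torch.randn(2, 32, 256)
out, past_key_value = block(x, is_causal=True)
print(out.shape)  # torch.Size([2, 32, 256])
```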
config.json ADDED
@@ -0,0 +1,52 @@
1
+ {
2
+ "architectures": [
3
+ "MPTForCausalLM"
4
+ ],
5
+ "attn_config": {
6
+ "alibi": true,
7
+ "alibi_bias_max": 8,
8
+ "attn_impl": "torch",
9
+ "attn_pdrop": 0,
10
+ "attn_type": "multihead_attention",
11
+ "attn_uses_sequence_id": false,
12
+ "clip_qkv": null,
13
+ "prefix_lm": false,
14
+ "qk_ln": false,
15
+ "softmax_scale": null
16
+ },
17
+ "auto_map": {
18
+ "AutoConfig": "configuration_mpt.MPTConfig",
19
+ "AutoModelForCausalLM": "modeling_mpt.MPTForCausalLM"
20
+ },
21
+ "d_model": 7168,
22
+ "emb_pdrop": 0,
23
+ "embedding_fraction": 1.0,
24
+ "expansion_ratio": 4,
25
+ "init_config": {
26
+ "emb_init_std": null,
27
+ "emb_init_uniform_lim": null,
28
+ "fan_mode": "fan_in",
29
+ "init_div_is_residual": true,
30
+ "init_gain": 0.0,
31
+ "init_nonlinearity": "relu",
32
+ "init_std": null,
33
+ "name": "kaiming_normal_",
34
+ "verbose": 0
35
+ },
36
+ "init_device": "cpu",
37
+ "learned_pos_emb": true,
38
+ "logit_scale": null,
39
+ "max_seq_len": 8192,
40
+ "model_type": "mpt",
41
+ "n_heads": 64,
42
+ "n_layers": 48,
43
+ "no_bias": true,
44
+ "norm_type": "low_precision_layernorm",
45
+ "resid_pdrop": 0,
46
+ "tokenizer_name": "EleutherAI/gpt-neox-20b",
47
+ "torch_dtype": "bfloat16",
48
+ "transformers_version": "4.28.1",
49
+ "use_cache": false,
50
+ "verbose": 0,
51
+ "vocab_size": 50432
52
+ }
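
The `auto_map` entries above are what let HuggingFace resolve the custom MPT classes when this folder is loaded with `trust_remote_code=True`. A short sketch, assuming a local copy of this repo at `./mpt-30b`:

```
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./mpt-30b", trust_remote_code=True)  # assumed local path
print(config.d_model, config.n_heads, config.n_layers, config.max_seq_len)
# 7168 64 48 8192
```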
configuration_mpt.py ADDED
@@ -0,0 +1,118 @@
1
+ """A HuggingFace-style model configuration."""
2
+ from typing import Dict, Optional, Union
3
+ from transformers import PretrainedConfig
4
+ attn_config_defaults: Dict = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
5
+ init_config_defaults: Dict = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu', 'init_div_is_residual': True, 'emb_init_std': None, 'emb_init_uniform_lim': None, 'init_std': None, 'init_gain': 0.0}
6
+
7
+ class MPTConfig(PretrainedConfig):
8
+ model_type = 'mpt'
9
+
10
+ def __init__(self, d_model: int=2048, n_heads: int=16, n_layers: int=24, expansion_ratio: int=4, max_seq_len: int=2048, vocab_size: int=50368, resid_pdrop: float=0.0, emb_pdrop: float=0.0, learned_pos_emb: bool=True, attn_config: Dict=attn_config_defaults, init_device: str='cpu', logit_scale: Optional[Union[float, str]]=None, no_bias: bool=False, verbose: int=0, embedding_fraction: float=1.0, norm_type: str='low_precision_layernorm', use_cache: bool=False, init_config: Dict=init_config_defaults, **kwargs):
11
+ """The MPT configuration class.
12
+
13
+ Args:
14
+ d_model (int): The size of the embedding dimension of the model.
15
+ n_heads (int): The number of attention heads.
16
+ n_layers (int): The number of layers in the model.
17
+ expansion_ratio (int): The ratio of the up/down scale in the MLP.
18
+ max_seq_len (int): The maximum sequence length of the model.
19
+ vocab_size (int): The size of the vocabulary.
20
+ resid_pdrop (float): The dropout probability applied to the attention output before combining with residual.
21
+ emb_pdrop (float): The dropout probability for the embedding layer.
22
+ learned_pos_emb (bool): Whether to use learned positional embeddings
23
+ attn_config (Dict): A dictionary used to configure the model's attention module:
24
+ attn_type (str): type of attention to use. Options: multihead_attention, multiquery_attention
25
+ attn_pdrop (float): The dropout probability for the attention layers.
26
+ attn_impl (str): The attention implementation to use. One of 'torch', 'flash', or 'triton'.
27
+ qk_ln (bool): Whether to apply layer normalization to the queries and keys in the attention layer.
28
+ clip_qkv (Optional[float]): If not None, clip the queries, keys, and values in the attention layer to
29
+ this value.
30
+ softmax_scale (Optional[float]): If not None, scale the softmax in the attention layer by this value. If None,
31
+ use the default scale of ``1/sqrt(d_keys)``.
32
+ prefix_lm (Optional[bool]): Whether the model should operate as a Prefix LM. This requires passing an
33
+ extra `prefix_mask` argument which indicates which tokens belong to the prefix. Tokens in the prefix
34
+ can attend to one another bi-directionally. Tokens outside the prefix use causal attention.
35
+ attn_uses_sequence_id (Optional[bool]): Whether to restrict attention to tokens that have the same sequence_id.
36
+ When the model is in `train` mode, this requires passing an extra `sequence_id` argument which indicates
37
+ which sub-sequence each token belongs to.
38
+ Defaults to ``False`` meaning any provided `sequence_id` will be ignored.
39
+ alibi (bool): Whether to use the alibi bias instead of position embeddings.
40
+ alibi_bias_max (int): The maximum value of the alibi bias.
41
+ init_device (str): The device to use for parameter initialization.
42
+ logit_scale (Optional[Union[float, str]]): If not None, scale the logits by this value.
43
+ no_bias (bool): Whether to use bias in all layers.
44
+ verbose (int): The verbosity level. 0 is silent.
45
+ embedding_fraction (float): The fraction to scale the gradients of the embedding layer by.
46
+ norm_type (str): choose type of norm to use
47
+ multiquery_attention (bool): Whether to use multiquery attention implementation.
48
+ use_cache (bool): Whether or not the model should return the last key/values attentions
49
+ init_config (Dict): A dictionary used to configure the model initialization:
50
+ init_config.name: The parameter initialization scheme to use. Options: 'default_', 'baseline_',
51
+ 'kaiming_uniform_', 'kaiming_normal_', 'neox_init_', 'small_init_', 'xavier_uniform_', or
52
+ 'xavier_normal_'. These mimic the parameter initialization methods in PyTorch.
53
+ init_div_is_residual (Union[int, float, str, bool]): Value to divide initial weights by if ``module._is_residual`` is True.
54
+ emb_init_std (Optional[float]): The standard deviation of the normal distribution used to initialize the embedding layer.
55
+ emb_init_uniform_lim (Optional[Union[Tuple[float, float], float]]): The lower and upper limits of the uniform distribution
56
+ used to initialize the embedding layer. Mutually exclusive with ``emb_init_std``.
57
+ init_std (float): The standard deviation of the normal distribution used to initialize the model,
58
+ if using the baseline_ parameter initialization scheme.
59
+ init_gain (float): The gain to use for parameter initialization with kaiming or xavier initialization schemes.
60
+ fan_mode (str): The fan mode to use for parameter initialization with kaiming initialization schemes.
61
+ init_nonlinearity (str): The nonlinearity to use for parameter initialization with kaiming initialization schemes.
62
+ ---
63
+ See llmfoundry.models.utils.param_init_fns.py for info on other param init config options
64
+ """
65
+ self.d_model = d_model
66
+ self.n_heads = n_heads
67
+ self.n_layers = n_layers
68
+ self.expansion_ratio = expansion_ratio
69
+ self.max_seq_len = max_seq_len
70
+ self.vocab_size = vocab_size
71
+ self.resid_pdrop = resid_pdrop
72
+ self.emb_pdrop = emb_pdrop
73
+ self.learned_pos_emb = learned_pos_emb
74
+ self.attn_config = attn_config
75
+ self.init_device = init_device
76
+ self.logit_scale = logit_scale
77
+ self.no_bias = no_bias
78
+ self.verbose = verbose
79
+ self.embedding_fraction = embedding_fraction
80
+ self.norm_type = norm_type
81
+ self.use_cache = use_cache
82
+ self.init_config = init_config
83
+ if 'name' in kwargs:
84
+ del kwargs['name']
85
+ if 'loss_fn' in kwargs:
86
+ del kwargs['loss_fn']
87
+ super().__init__(**kwargs)
88
+ self._validate_config()
89
+
90
+ def _set_config_defaults(self, config, config_defaults):
91
+ for (k, v) in config_defaults.items():
92
+ if k not in config:
93
+ config[k] = v
94
+ return config
95
+
96
+ def _validate_config(self):
97
+ self.attn_config = self._set_config_defaults(self.attn_config, attn_config_defaults)
98
+ self.init_config = self._set_config_defaults(self.init_config, init_config_defaults)
99
+ if self.d_model % self.n_heads != 0:
100
+ raise ValueError('d_model must be divisible by n_heads')
101
+ if any((prob < 0 or prob > 1 for prob in [self.attn_config['attn_pdrop'], self.resid_pdrop, self.emb_pdrop])):
102
+ raise ValueError("self.attn_config['attn_pdrop'], resid_pdrop, emb_pdrop are probabilities and must be between 0 and 1")
103
+ if self.attn_config['attn_impl'] not in ['torch', 'flash', 'triton']:
104
+ raise ValueError(f"Unknown attn_impl={self.attn_config['attn_impl']}")
105
+ if self.attn_config['prefix_lm'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
106
+ raise NotImplementedError('prefix_lm only implemented with torch and triton attention.')
107
+ if self.attn_config['alibi'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
108
+ raise NotImplementedError('alibi only implemented with torch and triton attention.')
109
+ if self.attn_config['attn_uses_sequence_id'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
110
+ raise NotImplementedError('attn_uses_sequence_id only implemented with torch and triton attention.')
111
+ if self.embedding_fraction > 1 or self.embedding_fraction <= 0:
112
+ raise ValueError('model.embedding_fraction must be between 0 (exclusive) and 1 (inclusive)!')
113
+ if isinstance(self.logit_scale, str) and self.logit_scale != 'inv_sqrt_d_model':
114
+ raise ValueError(f"self.logit_scale={self.logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
115
+ if self.init_config.get('name', None) is None:
116
+ raise ValueError(f"self.init_config={self.init_config!r} 'name' needs to be set.")
117
+ if not self.learned_pos_emb and (not self.attn_config['alibi']):
118
+ raise ValueError(f'Positional information must be provided to the model using either learned_pos_emb or alibi.')
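An illustrative sketch (not part of the upstream file) of how `MPTConfig` fills in defaults and enforces its `_validate_config` checks:

```
from configuration_mpt import MPTConfig

cfg = MPTConfig()                       # defaults: d_model=2048, n_heads=16, max_seq_len=2048
print(cfg.attn_config["attn_impl"])     # 'triton', filled in from attn_config_defaults

try:
    MPTConfig(d_model=2048, n_heads=7)  # 2048 is not divisible by 7
except ValueError as err:
    print(err)                          # d_model must be divisible by n_heads
```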
custom_embedding.py ADDED
@@ -0,0 +1,11 @@
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch import Tensor
+
+ class SharedEmbedding(nn.Embedding):
+
+     def forward(self, input: Tensor, unembed: bool=False) -> Tensor:
+         if unembed:
+             return F.linear(input, self.weight)
+         return super().forward(input)
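A short usage sketch: the same weight matrix embeds token ids and, with `unembed=True`, projects hidden states back to vocabulary logits (weight tying). The sizes below are illustrative only:

```
import torch
from custom_embedding import SharedEmbedding

emb = SharedEmbedding(num_embeddings=50432, embedding_dim=8)
tokens = torch.tensor([[1, 2, 3]])
hidden = emb(tokens)                 # (1, 3, 8) token embeddings
logits = emb(hidden, unembed=True)   # (1, 3, 50432) via F.linear against the same weight
```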
flash_attn_triton.py ADDED
@@ -0,0 +1,485 @@
1
+ """
2
+ Copied from https://github.com/HazyResearch/flash-attention/blob/eff9fe6b8076df59d64d7a3f464696738a3c7c24/flash_attn/flash_attn_triton.py
3
+ update imports to use 'triton_pre_mlir'
4
+ *Experimental* implementation of FlashAttention in Triton.
5
+ Tested with triton==2.0.0.dev20221202.
6
+ Triton 2.0 has a new backend (MLIR) but seems like it doesn't yet work for head dimensions
7
+ other than 64:
8
+ https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
9
+ We'll update this implementation with the new Triton backend once this is fixed.
10
+ We use the FlashAttention implementation from Phil Tillet as a starting point.
11
+ https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py
12
+ Changes:
13
+ - Implement both causal and non-causal attention.
14
+ - Implement both self-attention and cross-attention.
15
+ - Support arbitrary seqlens (not just multiples of 128), for both forward and backward.
16
+ - Support all head dimensions up to 128 (not just 16, 32, 64, 128), for both forward and backward.
17
+ - Support attention bias.
18
+ - Speed up the forward pass a bit, and only store the LSE instead of m and l.
19
+ - Make the backward for d=128 much faster by reducing register spilling.
20
+ - Optionally parallelize the backward pass across seqlen_k, to deal with the case of
21
+ small batch size * nheads.
22
+ Caution:
23
+ - This is an *experimental* implementation. The forward pass should be quite robust but
24
+ I'm not 100% sure that the backward pass doesn't have race conditions (due to the Triton compiler).
25
+ - This implementation has only been tested on A100.
26
+ - If you plan to use headdim other than 64 and 128, you should test for race conditions
27
+ (due to the Triton compiler), as done in tests/test_flash_attn.py
28
+ "test_flash_attn_triton_race_condition". I've tested and fixed many race conditions
29
+ for different head dimensions (40, 48, 64, 128, 80, 88, 96), but I'm still not 100% confident
30
+ that there are none left for other head dimensions.
31
+ Differences between this Triton version and the CUDA version:
32
+ - Triton version doesn't support dropout.
33
+ - Triton forward is generally faster than CUDA forward, while Triton backward is
34
+ generally slower than CUDA backward. Overall Triton forward + backward is slightly slower
35
+ than CUDA forward + backward.
36
+ - Triton version doesn't support different sequence lengths in a batch (i.e., RaggedTensor/NestedTensor).
37
+ - Triton version supports attention bias, while CUDA version doesn't.
38
+ """
39
+ import math
40
+ import torch
41
+ import triton_pre_mlir.language as tl
42
+ import triton_pre_mlir as triton
43
+
44
+
45
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
46
+ @triton.jit
47
+
48
+ def _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
49
+ start_m = tl.program_id(0)
50
+ off_hb = tl.program_id(1)
51
+ off_b = off_hb // nheads
52
+ off_h = off_hb % nheads
53
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
54
+ offs_n = tl.arange(0, BLOCK_N)
55
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
56
+ q_ptrs = Q + off_b * stride_qb + off_h * stride_qh + (offs_m[:, None] * stride_qm + offs_d[None, :])
57
+ k_ptrs = K + off_b * stride_kb + off_h * stride_kh + (offs_n[:, None] * stride_kn + offs_d[None, :])
58
+ v_ptrs = V + off_b * stride_vb + off_h * stride_vh + (offs_n[:, None] * stride_vn + offs_d[None, :])
59
+ if BIAS_TYPE == 'vector':
60
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + offs_n
61
+ elif BIAS_TYPE == 'matrix':
62
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + (offs_m[:, None] * stride_bm + offs_n[None, :])
63
+ t_ptrs = TMP + off_hb * seqlen_q_rounded + offs_m
64
+ lse_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
65
+ m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
66
+ acc_o = tl.zeros([BLOCK_M, BLOCK_HEADDIM], dtype=tl.float32)
67
+ if EVEN_M & EVEN_N:
68
+ if EVEN_HEADDIM:
69
+ q = tl.load(q_ptrs)
70
+ else:
71
+ q = tl.load(q_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
72
+ elif EVEN_HEADDIM:
73
+ q = tl.load(q_ptrs, mask=offs_m[:, None] < seqlen_q, other=0.0)
74
+ else:
75
+ q = tl.load(q_ptrs, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
76
+ end_n = seqlen_k if not IS_CAUSAL else tl.minimum((start_m + 1) * BLOCK_M, seqlen_k)
77
+ for start_n in range(0, end_n, BLOCK_N):
78
+ start_n = tl.multiple_of(start_n, BLOCK_N)
79
+ if EVEN_N & EVEN_M:
80
+ if EVEN_HEADDIM:
81
+ k = tl.load(k_ptrs + start_n * stride_kn)
82
+ else:
83
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=offs_d[None, :] < headdim, other=0.0)
84
+ elif EVEN_HEADDIM:
85
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
86
+ else:
87
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
88
+ qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
89
+ qk += tl.dot(q, k, trans_b=True)
90
+ if not EVEN_N:
91
+ qk += tl.where((start_n + offs_n)[None, :] < seqlen_k, 0, float('-inf'))
92
+ if IS_CAUSAL:
93
+ qk += tl.where(offs_m[:, None] >= (start_n + offs_n)[None, :], 0, float('-inf'))
94
+ if BIAS_TYPE != 'none':
95
+ if BIAS_TYPE == 'vector':
96
+ if EVEN_N:
97
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
98
+ else:
99
+ bias = tl.load(b_ptrs + start_n, mask=start_n + offs_n < seqlen_k, other=0.0).to(tl.float32)
100
+ bias = bias[None, :]
101
+ elif BIAS_TYPE == 'matrix':
102
+ if EVEN_M & EVEN_N:
103
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
104
+ else:
105
+ bias = tl.load(b_ptrs + start_n, mask=(offs_m[:, None] < seqlen_q) & ((start_n + offs_n)[None, :] < seqlen_k), other=0.0).to(tl.float32)
106
+ qk = qk * softmax_scale + bias
107
+ m_ij = tl.maximum(tl.max(qk, 1), lse_i)
108
+ p = tl.exp(qk - m_ij[:, None])
109
+ else:
110
+ m_ij = tl.maximum(tl.max(qk, 1) * softmax_scale, lse_i)
111
+ p = tl.exp(qk * softmax_scale - m_ij[:, None])
112
+ l_ij = tl.sum(p, 1)
113
+ acc_o_scale = tl.exp(m_i - m_ij)
114
+ tl.store(t_ptrs, acc_o_scale)
115
+ acc_o_scale = tl.load(t_ptrs)
116
+ acc_o = acc_o * acc_o_scale[:, None]
117
+ if EVEN_N & EVEN_M:
118
+ if EVEN_HEADDIM:
119
+ v = tl.load(v_ptrs + start_n * stride_vn)
120
+ else:
121
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=offs_d[None, :] < headdim, other=0.0)
122
+ elif EVEN_HEADDIM:
123
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
124
+ else:
125
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
126
+ p = p.to(v.dtype)
127
+ acc_o += tl.dot(p, v)
128
+ m_i = m_ij
129
+ l_i_new = tl.exp(lse_i - m_ij) + l_ij
130
+ lse_i = m_ij + tl.log(l_i_new)
131
+ o_scale = tl.exp(m_i - lse_i)
132
+ tl.store(t_ptrs, o_scale)
133
+ o_scale = tl.load(t_ptrs)
134
+ acc_o = acc_o * o_scale[:, None]
135
+ start_m = tl.program_id(0)
136
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
137
+ lse_ptrs = Lse + off_hb * seqlen_q_rounded + offs_m
138
+ tl.store(lse_ptrs, lse_i)
139
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
140
+ out_ptrs = Out + off_b * stride_ob + off_h * stride_oh + (offs_m[:, None] * stride_om + offs_d[None, :])
141
+ if EVEN_M:
142
+ if EVEN_HEADDIM:
143
+ tl.store(out_ptrs, acc_o)
144
+ else:
145
+ tl.store(out_ptrs, acc_o, mask=offs_d[None, :] < headdim)
146
+ elif EVEN_HEADDIM:
147
+ tl.store(out_ptrs, acc_o, mask=offs_m[:, None] < seqlen_q)
148
+ else:
149
+ tl.store(out_ptrs, acc_o, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
150
+
151
+ @triton.jit
152
+ def _bwd_preprocess_do_o_dot(Out, DO, Delta, stride_ob, stride_oh, stride_om, stride_dob, stride_doh, stride_dom, nheads, seqlen_q, seqlen_q_rounded, headdim, BLOCK_M: tl.constexpr, BLOCK_HEADDIM: tl.constexpr):
153
+ start_m = tl.program_id(0)
154
+ off_hb = tl.program_id(1)
155
+ off_b = off_hb // nheads
156
+ off_h = off_hb % nheads
157
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
158
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
159
+ o = tl.load(Out + off_b * stride_ob + off_h * stride_oh + offs_m[:, None] * stride_om + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
160
+ do = tl.load(DO + off_b * stride_dob + off_h * stride_doh + offs_m[:, None] * stride_dom + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
161
+ delta = tl.sum(o * do, axis=1)
162
+ tl.store(Delta + off_hb * seqlen_q_rounded + offs_m, delta)
163
+
164
+ @triton.jit
165
+ def _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr):
166
+ if EVEN_N & EVEN_M:
167
+ if EVEN_HEADDIM:
168
+ tl.store(dv_ptrs, dv)
169
+ tl.store(dk_ptrs, dk)
170
+ else:
171
+ tl.store(dv_ptrs, dv, mask=offs_d[None, :] < headdim)
172
+ tl.store(dk_ptrs, dk, mask=offs_d[None, :] < headdim)
173
+ elif EVEN_HEADDIM:
174
+ tl.store(dv_ptrs, dv, mask=offs_n[:, None] < seqlen_k)
175
+ tl.store(dk_ptrs, dk, mask=offs_n[:, None] < seqlen_k)
176
+ else:
177
+ tl.store(dv_ptrs, dv, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
178
+ tl.store(dk_ptrs, dk, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
179
+
180
+ @triton.jit
181
+ def _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD: tl.constexpr, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
182
+ begin_m = 0 if not IS_CAUSAL else start_n * BLOCK_N // BLOCK_M * BLOCK_M
183
+ offs_qm = begin_m + tl.arange(0, BLOCK_M)
184
+ offs_n = start_n * BLOCK_N + tl.arange(0, BLOCK_N)
185
+ offs_m = tl.arange(0, BLOCK_M)
186
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
187
+ q_ptrs = Q + (offs_qm[:, None] * stride_qm + offs_d[None, :])
188
+ k_ptrs = K + (offs_n[:, None] * stride_kn + offs_d[None, :])
189
+ v_ptrs = V + (offs_n[:, None] * stride_vn + offs_d[None, :])
190
+ do_ptrs = DO + (offs_qm[:, None] * stride_dom + offs_d[None, :])
191
+ dq_ptrs = DQ + (offs_qm[:, None] * stride_dqm + offs_d[None, :])
192
+ if BIAS_TYPE == 'vector':
193
+ b_ptrs = Bias + offs_n
194
+ elif BIAS_TYPE == 'matrix':
195
+ b_ptrs = Bias + (offs_qm[:, None] * stride_bm + offs_n[None, :])
196
+ dv = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
197
+ dk = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
198
+ if begin_m >= seqlen_q:
199
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
200
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
201
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
202
+ return
203
+ if EVEN_N & EVEN_M:
204
+ if EVEN_HEADDIM:
205
+ k = tl.load(k_ptrs)
206
+ v = tl.load(v_ptrs)
207
+ else:
208
+ k = tl.load(k_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
209
+ v = tl.load(v_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
210
+ elif EVEN_HEADDIM:
211
+ k = tl.load(k_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
212
+ v = tl.load(v_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
213
+ else:
214
+ k = tl.load(k_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
215
+ v = tl.load(v_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
216
+ num_block_m = tl.cdiv(seqlen_q, BLOCK_M)
217
+ for start_m in range(begin_m, num_block_m * BLOCK_M, BLOCK_M):
218
+ start_m = tl.multiple_of(start_m, BLOCK_M)
219
+ offs_m_curr = start_m + offs_m
220
+ if EVEN_M & EVEN_HEADDIM:
221
+ q = tl.load(q_ptrs)
222
+ elif EVEN_HEADDIM:
223
+ q = tl.load(q_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0)
224
+ else:
225
+ q = tl.load(q_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
226
+ qk = tl.dot(q, k, trans_b=True)
227
+ if not EVEN_N:
228
+ qk = tl.where(offs_n[None, :] < seqlen_k, qk, float('-inf'))
229
+ if IS_CAUSAL:
230
+ qk = tl.where(offs_m_curr[:, None] >= offs_n[None, :], qk, float('-inf'))
231
+ if BIAS_TYPE != 'none':
232
+ tl.debug_barrier()
233
+ if BIAS_TYPE == 'vector':
234
+ if EVEN_N:
235
+ bias = tl.load(b_ptrs).to(tl.float32)
236
+ else:
237
+ bias = tl.load(b_ptrs, mask=offs_n < seqlen_k, other=0.0).to(tl.float32)
238
+ bias = bias[None, :]
239
+ elif BIAS_TYPE == 'matrix':
240
+ if EVEN_M & EVEN_N:
241
+ bias = tl.load(b_ptrs).to(tl.float32)
242
+ else:
243
+ bias = tl.load(b_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_n[None, :] < seqlen_k), other=0.0).to(tl.float32)
244
+ qk = qk * softmax_scale + bias
245
+ if not EVEN_M & EVEN_HEADDIM:
246
+ tl.debug_barrier()
247
+ lse_i = tl.load(LSE + offs_m_curr)
248
+ if BIAS_TYPE == 'none':
249
+ p = tl.exp(qk * softmax_scale - lse_i[:, None])
250
+ else:
251
+ p = tl.exp(qk - lse_i[:, None])
252
+ if EVEN_M & EVEN_HEADDIM:
253
+ do = tl.load(do_ptrs)
254
+ else:
255
+ do = tl.load(do_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
256
+ dv += tl.dot(p.to(do.dtype), do, trans_a=True)
257
+ if not EVEN_M & EVEN_HEADDIM:
258
+ tl.debug_barrier()
259
+ dp = tl.dot(do, v, trans_b=True)
260
+ if not EVEN_HEADDIM:
261
+ tl.debug_barrier()
262
+ Di = tl.load(D + offs_m_curr)
263
+ ds = (p * (dp - Di[:, None]) * softmax_scale).to(q.dtype)
264
+ dk += tl.dot(ds, q, trans_a=True)
265
+ if not EVEN_M & EVEN_HEADDIM:
266
+ tl.debug_barrier()
267
+ if not ATOMIC_ADD:
268
+ if EVEN_M & EVEN_HEADDIM:
269
+ dq = tl.load(dq_ptrs, eviction_policy='evict_last')
270
+ dq += tl.dot(ds, k)
271
+ tl.store(dq_ptrs, dq, eviction_policy='evict_last')
272
+ elif EVEN_HEADDIM:
273
+ dq = tl.load(dq_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0, eviction_policy='evict_last')
274
+ dq += tl.dot(ds, k)
275
+ tl.store(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q, eviction_policy='evict_last')
276
+ else:
277
+ dq = tl.load(dq_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0, eviction_policy='evict_last')
278
+ dq += tl.dot(ds, k)
279
+ tl.store(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), eviction_policy='evict_last')
280
+ else:
281
+ dq = tl.dot(ds, k)
282
+ if EVEN_M & EVEN_HEADDIM:
283
+ tl.atomic_add(dq_ptrs, dq)
284
+ elif EVEN_HEADDIM:
285
+ tl.atomic_add(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q)
286
+ else:
287
+ tl.atomic_add(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
288
+ dq_ptrs += BLOCK_M * stride_dqm
289
+ q_ptrs += BLOCK_M * stride_qm
290
+ do_ptrs += BLOCK_M * stride_dom
291
+ if BIAS_TYPE == 'matrix':
292
+ b_ptrs += BLOCK_M * stride_bm
293
+
294
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
295
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
296
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
297
+
298
+ def init_to_zero(name):
299
+ return lambda nargs: nargs[name].zero_()
300
+
301
+ @triton.autotune(configs=[triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': False}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ')), triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': True}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ'))], key=['CACHE_KEY_SEQLEN_Q', 'CACHE_KEY_SEQLEN_K', 'BIAS_TYPE', 'IS_CAUSAL', 'BLOCK_HEADDIM'])
302
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
303
+ @triton.jit
304
+ def _bwd_kernel(Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_dob, stride_doh, stride_dom, stride_dqb, stride_dqh, stride_dqm, stride_dkb, stride_dkh, stride_dkn, stride_dvb, stride_dvh, stride_dvn, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, SEQUENCE_PARALLEL: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
305
+ off_hb = tl.program_id(1)
306
+ off_b = off_hb // nheads
307
+ off_h = off_hb % nheads
308
+ Q += off_b * stride_qb + off_h * stride_qh
309
+ K += off_b * stride_kb + off_h * stride_kh
310
+ V += off_b * stride_vb + off_h * stride_vh
311
+ DO += off_b * stride_dob + off_h * stride_doh
312
+ DQ += off_b * stride_dqb + off_h * stride_dqh
313
+ DK += off_b * stride_dkb + off_h * stride_dkh
314
+ DV += off_b * stride_dvb + off_h * stride_dvh
315
+ if BIAS_TYPE != 'none':
316
+ Bias += off_b * stride_bb + off_h * stride_bh
317
+ D += off_hb * seqlen_q_rounded
318
+ LSE += off_hb * seqlen_q_rounded
319
+ if not SEQUENCE_PARALLEL:
320
+ num_block_n = tl.cdiv(seqlen_k, BLOCK_N)
321
+ for start_n in range(0, num_block_n):
322
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=False, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
323
+ else:
324
+ start_n = tl.program_id(0)
325
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=True, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
326
+
327
+
328
+ def _flash_attn_forward(q, k, v, bias=None, causal=False, softmax_scale=None):
329
+ (batch, seqlen_q, nheads, d) = q.shape
330
+ (_, seqlen_k, _, _) = k.shape
331
+ assert k.shape == (batch, seqlen_k, nheads, d)
332
+ assert v.shape == (batch, seqlen_k, nheads, d)
333
+ assert d <= 128, 'FlashAttention only support head dimensions up to 128'
334
+ assert q.dtype == k.dtype == v.dtype, 'All tensors must have the same type'
335
+ assert q.dtype in [torch.float16, torch.bfloat16], 'Only support fp16 and bf16'
336
+ assert q.is_cuda and k.is_cuda and v.is_cuda
337
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
338
+ has_bias = bias is not None
339
+ bias_type = 'none'
340
+ if has_bias:
341
+ assert bias.dtype in [q.dtype, torch.float]
342
+ assert bias.is_cuda
343
+ assert bias.dim() == 4
344
+ if bias.stride(-1) != 1:
345
+ bias = bias.contiguous()
346
+ if bias.shape[2:] == (1, seqlen_k):
347
+ bias_type = 'vector'
348
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
349
+ bias_type = 'matrix'
350
+ else:
351
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
352
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
353
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
354
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
355
+ lse = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
356
+ tmp = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
357
+ o = torch.empty_like(q)
358
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
359
+ BLOCK = 128
360
+ num_warps = 4 if d <= 64 else 8
361
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
362
+ _fwd_kernel[grid](q, k, v, bias, o, lse, tmp, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, o.stride(0), o.stride(2), o.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM, BLOCK_M=BLOCK, BLOCK_N=BLOCK, num_warps=num_warps, num_stages=1)
363
+ return (o, lse, softmax_scale)
364
+
365
+ def _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=None, causal=False, softmax_scale=None):
366
+ if do.stride(-1) != 1:
367
+ do = do.contiguous()
368
+ (batch, seqlen_q, nheads, d) = q.shape
369
+ (_, seqlen_k, _, _) = k.shape
370
+ assert d <= 128
371
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
372
+ assert lse.shape == (batch, nheads, seqlen_q_rounded)
373
+ assert q.stride(-1) == k.stride(-1) == v.stride(-1) == o.stride(-1) == 1
374
+ assert dq.stride(-1) == dk.stride(-1) == dv.stride(-1) == 1
375
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
376
+ dq_accum = torch.empty_like(q, dtype=torch.float32)
377
+ delta = torch.empty_like(lse)
378
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
379
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
380
+ _bwd_preprocess_do_o_dot[grid](o, do, delta, o.stride(0), o.stride(2), o.stride(1), do.stride(0), do.stride(2), do.stride(1), nheads, seqlen_q, seqlen_q_rounded, d, BLOCK_M=128, BLOCK_HEADDIM=BLOCK_HEADDIM)
381
+ has_bias = bias is not None
382
+ bias_type = 'none'
383
+ if has_bias:
384
+ assert bias.dtype in [q.dtype, torch.float]
385
+ assert bias.is_cuda
386
+ assert bias.dim() == 4
387
+ assert bias.stride(-1) == 1
388
+ if bias.shape[2:] == (1, seqlen_k):
389
+ bias_type = 'vector'
390
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
391
+ bias_type = 'matrix'
392
+ else:
393
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
394
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
395
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
396
+ grid = lambda META: (triton.cdiv(seqlen_k, META['BLOCK_N']) if META['SEQUENCE_PARALLEL'] else 1, batch * nheads)
397
+ _bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, do.stride(0), do.stride(2), do.stride(1), dq_accum.stride(0), dq_accum.stride(2), dq_accum.stride(1), dk.stride(0), dk.stride(2), dk.stride(1), dv.stride(0), dv.stride(2), dv.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM)
398
+
399
+ dq.copy_(dq_accum)
400
+
401
+ class FlashAttnQKVPackedFunc(torch.autograd.Function):
402
+
403
+ @staticmethod
404
+ def forward(ctx, qkv, bias=None, causal=False, softmax_scale=None):
405
+ """
406
+ qkv: (batch, seqlen, 3, nheads, headdim)
407
+ bias: optional, shape broadcastable to (batch, nheads, seqlen, seqlen).
408
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen).
409
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen, seqlen)
410
+ """
411
+ if qkv.stride(-1) != 1:
412
+ qkv = qkv.contiguous()
413
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], bias=bias, causal=causal, softmax_scale=softmax_scale)
414
+ ctx.save_for_backward(qkv, o, lse, bias)
415
+ ctx.causal = causal
416
+
417
+ return o
418
+
419
+ @staticmethod
420
+ def backward(ctx, do):
421
+ (qkv, o, lse, bias) = ctx.saved_tensors
422
+ assert not ctx.needs_input_grad[1], 'FlashAttention does not support bias gradient yet'
423
+ with torch.inference_mode():
424
+ dqkv = torch.empty_like(qkv)
425
+ _flash_attn_backward(do, qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], o, lse, dqkv[:, :, 0], dqkv[:, :, 1], dqkv[:, :, 2], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
426
+ return (dqkv, None, None, None)
427
+ flash_attn_qkvpacked_func = FlashAttnQKVPackedFunc.apply
428
+
429
+ class FlashAttnKVPackedFunc(torch.autograd.Function):
430
+
431
+ @staticmethod
432
+ def forward(ctx, q, kv, bias=None, causal=False, softmax_scale=None):
433
+ """
434
+ q: (batch, seqlen_q, nheads, headdim)
435
+ kv: (batch, seqlen_k, 2, nheads, headdim)
436
+ bias: optional, shape broadcastable to (batch, nheads, seqlen_q, seqlen_k).
437
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
438
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
439
+ """
440
+ (q, kv) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, kv]]
441
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, kv[:, :, 0], kv[:, :, 1], bias=bias, causal=causal, softmax_scale=softmax_scale)
442
+ ctx.save_for_backward(q, kv, o, lse, bias)
443
+ ctx.causal = causal
444
+ return o
445
+
446
+ @staticmethod
447
+ def backward(ctx, do):
448
+ (q, kv, o, lse, bias) = ctx.saved_tensors
449
+ if len(ctx.needs_input_grad) >= 3:
450
+ assert not ctx.needs_input_grad[2], 'FlashAttention does not support bias gradient yet'
451
+ with torch.inference_mode():
452
+ dq = torch.empty_like(q)
453
+ dkv = torch.empty_like(kv)
454
+ _flash_attn_backward(do, q, kv[:, :, 0], kv[:, :, 1], o, lse, dq, dkv[:, :, 0], dkv[:, :, 1], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
455
+ return (dq, dkv, None, None, None)
456
+ flash_attn_kvpacked_func = FlashAttnKVPackedFunc.apply
457
+
458
+ class FlashAttnFunc(torch.autograd.Function):
459
+
460
+ @staticmethod
461
+ def forward(ctx, q, k, v, bias=None, causal=False, softmax_scale=None):
462
+ """
463
+ q: (batch_size, seqlen_q, nheads, headdim)
464
+ k, v: (batch_size, seqlen_k, nheads, headdim)
465
+ bias: optional, shape broadcastable to (batch, nheads, seqlen_q, seqlen_k).
466
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
467
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
468
+ """
469
+ (q, k, v) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, k, v]]
470
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, k, v, bias=bias, causal=causal, softmax_scale=softmax_scale)
471
+ ctx.save_for_backward(q, k, v, o, lse, bias)
472
+ ctx.causal = causal
473
+ return o
474
+
475
+ @staticmethod
476
+ def backward(ctx, do):
477
+ (q, k, v, o, lse, bias) = ctx.saved_tensors
478
+ assert not ctx.needs_input_grad[3], 'FlashAttention does not support bias gradient yet'
479
+ with torch.inference_mode():
480
+ dq = torch.empty_like(q)
481
+ dk = torch.empty_like(k)
482
+ dv = torch.empty_like(v)
483
+ _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
484
+ return (dq, dk, dv, None, None, None)
485
+ flash_attn_func = FlashAttnFunc.apply
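A hedged usage sketch for the kernel above (assumes a CUDA device, fp16/bf16 inputs, and the `triton_pre_mlir` package; shapes are illustrative). Because `flash_attn_func` is an `autograd.Function.apply`, arguments are positional:

```
import torch
from flash_attn_triton import flash_attn_func

batch, seqlen, nheads, headdim = 2, 128, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Positional args: (q, k, v, bias, causal, softmax_scale)
out = flash_attn_func(q, k, v, None, True, None)
print(out.shape)  # torch.Size([2, 128, 8, 64])
```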
generation_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "_from_model_config": true,
+ "transformers_version": "4.28.1",
+ "use_cache": false
+ }
hf_prefixlm_converter.py ADDED
@@ -0,0 +1,415 @@
1
+ """Converts Huggingface Causal LM to Prefix LM.
2
+
3
+ Conversion does lightweight surgery on a HuggingFace
4
+ Causal LM to convert it to a Prefix LM.
5
+
6
+ Prefix LMs accept a `bidirectional_mask` input in `forward`
7
+ and treat the input prompt as the prefix in `generate`.
8
+ """
9
+ import math
10
+ import warnings
11
+ from types import MethodType
12
+ from typing import Any, Dict, List, Optional, Tuple, Union
13
+ import torch
14
+ from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
15
+ from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
16
+ from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
17
+ from transformers.models.bloom.modeling_bloom import logging
18
+ from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel
19
+ from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoForCausalLM
20
+ from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXForCausalLM
21
+ from transformers.models.gptj.modeling_gptj import GPTJForCausalLM
22
+ from transformers.models.opt.modeling_opt import OPTForCausalLM
23
+ from transformers.models.opt.modeling_opt import _expand_mask as _expand_mask_opt
24
+ from transformers.models.opt.modeling_opt import _make_causal_mask as _make_causal_mask_opt
25
+ logger = logging.get_logger(__name__)
26
+ _SUPPORTED_GPT_MODELS = (GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM)
27
+ CAUSAL_GPT_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM]
28
+
29
+ def _convert_gpt_causal_lm_to_prefix_lm(model: CAUSAL_GPT_TYPES) -> CAUSAL_GPT_TYPES:
30
+ """Converts a GPT-style Causal LM to a Prefix LM.
31
+
32
+ Supported HuggingFace model classes:
33
+ - `GPT2LMHeadModel`
34
+ - `GPTNeoForCausalLM`
35
+ - `GPTNeoXForCausalLM`
36
+ - `GPTJForCausalLM`
37
+
38
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
39
+ """
40
+ if hasattr(model, '_prefix_lm_converted'):
41
+ return model
42
+ assert isinstance(model, _SUPPORTED_GPT_MODELS)
43
+ assert model.config.add_cross_attention == False, 'Only supports GPT-style decoder-only models'
44
+
45
+ def _get_attn_modules(model: CAUSAL_GPT_TYPES) -> List[torch.nn.Module]:
46
+ """Helper that gets a list of the model's attention modules.
47
+
48
+ Each module has a `bias` buffer used for causal masking. The Prefix LM
49
+ conversion adds logic to dynamically manipulate these biases to support
50
+ Prefix LM attention masking.
51
+ """
52
+ attn_modules = []
53
+ if isinstance(model, GPTNeoXForCausalLM):
54
+ blocks = model.gpt_neox.layers
55
+ else:
56
+ blocks = model.transformer.h
57
+ for block in blocks:
58
+ if isinstance(model, GPTNeoForCausalLM):
59
+ if block.attn.attention_type != 'global':
60
+ continue
61
+ attn_module = block.attn.attention
62
+ elif isinstance(model, GPTNeoXForCausalLM):
63
+ attn_module = block.attention
64
+ else:
65
+ attn_module = block.attn
66
+ attn_modules.append(attn_module)
67
+ return attn_modules
68
+ setattr(model, '_original_forward', getattr(model, 'forward'))
69
+ setattr(model, '_original_generate', getattr(model, 'generate'))
70
+
71
+ def forward(self: CAUSAL_GPT_TYPES, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[Tuple[torch.Tensor]]]=None, attention_mask: Optional[torch.FloatTensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, token_type_ids: Optional[torch.LongTensor]=None, position_ids: Optional[torch.LongTensor]=None, head_mask: Optional[torch.FloatTensor]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
72
+ """Wraps original forward to enable PrefixLM attention."""
73
+
74
+ def call_og_forward():
75
+ if isinstance(self, GPTNeoXForCausalLM):
76
+ return self._original_forward(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
77
+ else:
78
+ return self._original_forward(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
79
+ if bidirectional_mask is None:
80
+ return call_og_forward()
81
+ assert isinstance(bidirectional_mask, torch.Tensor)
82
+ attn_modules = _get_attn_modules(model)
83
+ (b, s) = bidirectional_mask.shape
84
+ max_length = attn_modules[0].bias.shape[-1]
85
+ if s > max_length:
86
+ raise ValueError(f'bidirectional_mask sequence length (={s}) exceeds the ' + f'max length allowed by the model ({max_length}).')
87
+ assert s <= max_length
88
+ if s < max_length:
89
+ pad = torch.zeros((int(b), int(max_length - s)), dtype=bidirectional_mask.dtype, device=bidirectional_mask.device)
90
+ bidirectional_mask = torch.cat([bidirectional_mask, pad], dim=1)
91
+ bidirectional = bidirectional_mask.unsqueeze(1).unsqueeze(1)
92
+ for attn_module in attn_modules:
93
+ attn_module.bias.data = torch.logical_or(attn_module.bias.data, bidirectional)
94
+ output = call_og_forward()
95
+ for attn_module in attn_modules:
96
+ attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
97
+ return output
98
+
99
+ def generate(self: CAUSAL_GPT_TYPES, *args: tuple, **kwargs: Dict[str, Any]):
100
+ """Wraps original generate to enable PrefixLM attention."""
101
+ attn_modules = _get_attn_modules(model)
102
+ for attn_module in attn_modules:
103
+ attn_module.bias.data[:] = 1
104
+ output = self._original_generate(*args, **kwargs)
105
+ for attn_module in attn_modules:
106
+ attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
107
+ return output
108
+ setattr(model, 'forward', MethodType(forward, model))
109
+ setattr(model, 'generate', MethodType(generate, model))
110
+ setattr(model, '_prefix_lm_converted', True)
111
+ return model
112
+
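An illustrative sketch (not from the upstream file) of the conversion applied to a small GPT-2 checkpoint, using the module-level `convert_hf_causal_lm_to_prefix_lm` wrapper defined further below; the extra mask marks which positions form the bidirectional prefix:

```
import torch
from transformers import GPT2LMHeadModel
from hf_prefixlm_converter import convert_hf_causal_lm_to_prefix_lm

model = convert_hf_causal_lm_to_prefix_lm(GPT2LMHeadModel.from_pretrained("gpt2"))
input_ids = torch.tensor([[464, 2068, 7586, 21831, 18045]])
# 1 = prefix token (bidirectional attention), 0 = target token (causal attention)
bidirectional_mask = torch.tensor([[1, 1, 1, 0, 0]])
out = model(input_ids=input_ids, bidirectional_mask=bidirectional_mask)
print(out.logits.shape)  # torch.Size([1, 5, 50257])
```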
113
+ def _convert_bloom_causal_lm_to_prefix_lm(model: BloomForCausalLM) -> BloomForCausalLM:
114
+ """Converts a BLOOM Causal LM to a Prefix LM.
115
+
116
+ Supported HuggingFace model classes:
117
+ - `BloomForCausalLM`
118
+
119
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
120
+ """
121
+ if hasattr(model, '_prefix_lm_converted'):
122
+ return model
123
+ assert isinstance(model, BloomForCausalLM)
124
+ assert model.config.add_cross_attention == False, 'Only supports BLOOM decoder-only models'
125
+
126
+ def _prepare_attn_mask(self: BloomModel, attention_mask: torch.Tensor, bidirectional_mask: Optional[torch.Tensor], input_shape: Tuple[int, int], past_key_values_length: int) -> torch.BoolTensor:
127
+ combined_attention_mask = None
128
+ device = attention_mask.device
129
+ (_, src_length) = input_shape
130
+ if src_length > 1:
131
+ combined_attention_mask = _make_causal_mask_bloom(input_shape, device=device, past_key_values_length=past_key_values_length)
132
+ if bidirectional_mask is not None:
133
+ assert attention_mask.shape == bidirectional_mask.shape
134
+ expanded_bidirectional_mask = _expand_mask_bloom(bidirectional_mask, tgt_length=src_length)
135
+ combined_attention_mask = torch.logical_and(combined_attention_mask, expanded_bidirectional_mask)
136
+ expanded_attn_mask = _expand_mask_bloom(attention_mask, tgt_length=src_length)
137
+ combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
138
+ return combined_attention_mask
139
+
140
+ def _build_alibi_tensor(self: BloomModel, batch_size: int, query_length: int, key_length: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
141
+ num_heads = self.config.n_head
142
+ closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
143
+ base = torch.tensor(2 ** (-2 ** (-(math.log2(closest_power_of_2) - 3))), device=device, dtype=torch.float32)
144
+ powers = torch.arange(1, 1 + closest_power_of_2, device=device, dtype=torch.int32)
145
+ slopes = torch.pow(base, powers)
146
+ if closest_power_of_2 != num_heads:
147
+ extra_base = torch.tensor(2 ** (-2 ** (-(math.log2(2 * closest_power_of_2) - 3))), device=device, dtype=torch.float32)
148
+ num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
149
+ extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=device, dtype=torch.int32)
150
+ slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
151
+ qa = torch.arange(query_length, device=device, dtype=torch.int32).view(-1, 1)
152
+ ka = torch.arange(key_length, device=device, dtype=torch.int32).view(1, -1)
153
+ diffs = qa - ka + key_length - query_length
154
+ diffs = -diffs.abs()
155
+ alibi = slopes.view(1, num_heads, 1, 1) * diffs.view(1, 1, query_length, key_length)
156
+ alibi = alibi.expand(batch_size, -1, -1, -1).reshape(-1, query_length, key_length)
157
+ return alibi.to(dtype)
158
+ KeyValueT = Tuple[torch.Tensor, torch.Tensor]
159
+
160
+ def forward(self: BloomModel, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.LongTensor]=None, inputs_embeds: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]:
161
+ if deprecated_arguments.pop('position_ids', False) is not False:
162
+ warnings.warn('`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. ' + 'You can safely ignore passing `position_ids`.', FutureWarning)
163
+ if len(deprecated_arguments) > 0:
164
+ raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
165
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
166
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
167
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
168
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
169
+ if input_ids is not None and inputs_embeds is not None:
170
+ raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
171
+ elif input_ids is not None:
172
+ (batch_size, seq_length) = input_ids.shape
173
+ elif inputs_embeds is not None:
174
+ (batch_size, seq_length, _) = inputs_embeds.shape
175
+ else:
176
+ raise ValueError('You have to specify either input_ids or inputs_embeds')
177
+ if past_key_values is None:
178
+ past_key_values = tuple([None] * len(self.h))
179
+ head_mask = self.get_head_mask(head_mask, self.config.n_layer)
180
+ if inputs_embeds is None:
181
+ inputs_embeds = self.word_embeddings(input_ids)
182
+ hidden_states = self.word_embeddings_layernorm(inputs_embeds)
183
+ presents = () if use_cache else None
184
+ all_self_attentions = () if output_attentions else None
185
+ all_hidden_states = () if output_hidden_states else None
186
+ seq_length_with_past = seq_length
187
+ past_key_values_length = 0
188
+ if past_key_values[0] is not None:
189
+ tmp = past_key_values[0][0]
190
+ past_key_values_length = tmp.shape[2]
191
+ seq_length_with_past = seq_length_with_past + past_key_values_length
192
+ if attention_mask is None:
193
+ attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device)
194
+ else:
195
+ attention_mask = attention_mask.to(hidden_states.device)
196
+ alibi = self._build_alibi_tensor(batch_size=batch_size, query_length=seq_length, key_length=seq_length_with_past, dtype=hidden_states.dtype, device=hidden_states.device)
197
+ causal_mask = self._prepare_attn_mask(attention_mask, bidirectional_mask, input_shape=(batch_size, seq_length), past_key_values_length=past_key_values_length)
198
+ for (i, (block, layer_past)) in enumerate(zip(self.h, past_key_values)):
199
+ if output_hidden_states:
200
+ hst = (hidden_states,)
201
+ all_hidden_states = all_hidden_states + hst
202
+ if self.gradient_checkpointing and self.training:
203
+ if use_cache:
204
+ logger.warning('`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...')
205
+ use_cache = False
206
+
207
+ def create_custom_forward(module):
208
+
209
+ def custom_forward(*inputs):
210
+ return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
211
+ return custom_forward
212
+ outputs = torch.utils.checkpoint.checkpoint(create_custom_forward(block), hidden_states, alibi, causal_mask, head_mask[i])
213
+ else:
214
+ outputs = block(hidden_states, layer_past=layer_past, attention_mask=causal_mask, head_mask=head_mask[i], use_cache=use_cache, output_attentions=output_attentions, alibi=alibi)
215
+ hidden_states = outputs[0]
216
+ if use_cache is True:
217
+ presents = presents + (outputs[1],)
218
+ if output_attentions:
219
+ oa = (outputs[2 if use_cache else 1],)
220
+ all_self_attentions = all_self_attentions + oa
221
+ hidden_states = self.ln_f(hidden_states)
222
+ if output_hidden_states:
223
+ hst = (hidden_states,)
224
+ all_hidden_states = all_hidden_states + hst
225
+ if not return_dict:
226
+ return tuple((v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None))
227
+ return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=hidden_states, past_key_values=presents, hidden_states=all_hidden_states, attentions=all_self_attentions)
228
+ setattr(model.transformer, '_prepare_attn_mask', MethodType(_prepare_attn_mask, model.transformer))
229
+ setattr(model.transformer, '_build_alibi_tensor', MethodType(_build_alibi_tensor, model.transformer))
230
+ setattr(model.transformer, 'forward', MethodType(forward, model.transformer))
231
+ KeyValueT = Tuple[torch.Tensor, torch.Tensor]
232
+
233
+ def forward(self: BloomForCausalLM, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.Tensor]=None, inputs_embeds: Optional[torch.Tensor]=None, labels: Optional[torch.Tensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
234
+ """Replacement forward method for BloomCausalLM."""
235
+ if deprecated_arguments.pop('position_ids', False) is not False:
236
+ warnings.warn('`position_ids` have no functionality in BLOOM and will be removed ' + 'in v5.0.0. You can safely ignore passing `position_ids`.', FutureWarning)
237
+ if len(deprecated_arguments) > 0:
238
+ raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
239
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
240
+ transformer_outputs = self.transformer(input_ids, past_key_values=past_key_values, attention_mask=attention_mask, bidirectional_mask=bidirectional_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
241
+ hidden_states = transformer_outputs[0]
242
+ lm_logits = self.lm_head(hidden_states)
243
+ loss = None
244
+ if labels is not None:
245
+ shift_logits = lm_logits[..., :-1, :].contiguous()
246
+ shift_labels = labels[..., 1:].contiguous()
247
+ (batch_size, seq_length, vocab_size) = shift_logits.shape
248
+ loss_fct = CrossEntropyLoss()
249
+ loss = loss_fct(shift_logits.view(batch_size * seq_length, vocab_size), shift_labels.view(batch_size * seq_length))
250
+ if not return_dict:
251
+ output = (lm_logits,) + transformer_outputs[1:]
252
+ return (loss,) + output if loss is not None else output
253
+ return CausalLMOutputWithCrossAttentions(loss=loss, logits=lm_logits, past_key_values=transformer_outputs.past_key_values, hidden_states=transformer_outputs.hidden_states, attentions=transformer_outputs.attentions)
254
+
255
+ def prepare_inputs_for_generation(self: BloomForCausalLM, input_ids: torch.LongTensor, past: Optional[torch.Tensor]=None, attention_mask: Optional[torch.Tensor]=None, **kwargs) -> dict:
256
+ if past:
257
+ input_ids = input_ids[:, -1].unsqueeze(-1)
258
+ bidirectional_mask = None
259
+ if past[0][0].shape[0] == input_ids.shape[0]:
260
+ past = self._convert_to_bloom_cache(past)
261
+ else:
262
+ bidirectional_mask = torch.ones_like(input_ids)
263
+ return {'input_ids': input_ids, 'past_key_values': past, 'use_cache': True, 'attention_mask': attention_mask, 'bidirectional_mask': bidirectional_mask}
264
+ setattr(model, 'forward', MethodType(forward, model))
265
+ setattr(model, 'prepare_inputs_for_generation', MethodType(prepare_inputs_for_generation, model))
266
+ setattr(model, '_prefix_lm_converted', True)
267
+ return model
268
+
269
+ def _convert_opt_causal_lm_to_prefix_lm(model: OPTForCausalLM) -> OPTForCausalLM:
270
+ """Converts an OPT Causal LM to a Prefix LM.
271
+
272
+ Supported HuggingFace model classes:
273
+ - `OPTForCausalLM`
274
+
275
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
276
+ """
277
+ if hasattr(model, '_prefix_lm_converted'):
278
+ return model
279
+ assert isinstance(model, OPTForCausalLM)
280
+ assert model.config.add_cross_attention == False, 'Only supports OPT decoder-only models'
281
+ setattr(model, '_original_forward', getattr(model, 'forward'))
282
+ setattr(model, '_original_generate', getattr(model, 'generate'))
283
+ model.model.decoder.bidirectional_mask = None
284
+
285
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
286
+ combined_attention_mask = None
287
+ if input_shape[-1] > 1:
288
+ if self.bidirectional_mask == 'g':
289
+ (bsz, src_length) = input_shape
290
+ combined_attention_mask = torch.zeros((bsz, 1, src_length, src_length + past_key_values_length), dtype=inputs_embeds.dtype, device=inputs_embeds.device)
291
+ else:
292
+ combined_attention_mask = _make_causal_mask_opt(input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length).to(inputs_embeds.device)
293
+ if self.bidirectional_mask is not None:
294
+ assert attention_mask.shape == self.bidirectional_mask.shape
295
+ expanded_bidirectional_mask = _expand_mask_opt(self.bidirectional_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
296
+ combined_attention_mask = torch.maximum(expanded_bidirectional_mask, combined_attention_mask)
297
+ if attention_mask is not None:
298
+ expanded_attn_mask = _expand_mask_opt(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
299
+ combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
300
+ return combined_attention_mask
301
+ setattr(model.model.decoder, '_prepare_decoder_attention_mask', MethodType(_prepare_decoder_attention_mask, model.model.decoder))
302
+
303
+ def forward(self: OPTForCausalLM, input_ids: Optional[torch.LongTensor]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.ByteTensor]=None, head_mask: Optional[torch.Tensor]=None, past_key_values: Optional[List[torch.FloatTensor]]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
304
+
305
+ def call_og_forward():
306
+ return self._original_forward(input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, past_key_values=past_key_values, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
307
+ if bidirectional_mask is None:
308
+ return call_og_forward()
309
+ self.model.decoder.bidirectional_mask = bidirectional_mask
310
+ try:
311
+ outputs = call_og_forward()
312
+ except:
313
+ self.model.decoder.bidirectional_mask = None
314
+ raise
315
+ self.model.decoder.bidirectional_mask = None
316
+ return outputs
317
+
318
+ def generate(self: OPTForCausalLM, *args: tuple, **kwargs: Dict[str, Any]):
319
+ """Wraps original generate to enable PrefixLM-style attention."""
320
+ self.model.decoder.bidirectional_mask = 'g'
321
+ try:
322
+ output = self._original_generate(*args, **kwargs)
323
+ except:
324
+ self.model.decoder.bidirectional_mask = None
325
+ raise
326
+ self.model.decoder.bidirectional_mask = None
327
+ return output
328
+ setattr(model, 'forward', MethodType(forward, model))
329
+ setattr(model, 'generate', MethodType(generate, model))
330
+ setattr(model, '_prefix_lm_converted', True)
331
+ return model
332
+ _SUPPORTED_HF_MODELS = _SUPPORTED_GPT_MODELS + (BloomForCausalLM, OPTForCausalLM)
333
+ CAUSAL_LM_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM, BloomForCausalLM, OPTForCausalLM]
334
+
335
+ def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES:
336
+ """Converts a HuggingFace Causal LM to a Prefix LM.
337
+
338
+ Supported HuggingFace model classes:
339
+ - `GPT2LMHeadModel`
340
+ - `GPTNeoForCausalLM`
341
+ - `GPTNeoXForCausalLM`
342
+ - `GPTJForCausalLM`
343
+ - `BloomForCausalLM`
344
+ - `OPTForCausalLM`
345
+
346
+ Conversion to a Prefix LM is done by modifying the `forward` method, and possibly also the
347
+ `generate` method and/or select underlying methods depending on the model class.
348
+
349
+ These changes preserve the model API, but add a new input to `forward`: "bidirectional_mask".
350
+
351
+ Notes on training:
352
+ To actually train the converted model as a Prefix LM, training batches will need to indicate
353
+ the prefix/target structure by including `bidirectional_mask` as part of the batch inputs.
354
+
355
+ **This is not a standard input and requires custom layers either within or after your dataloader.**
356
+
357
+ In addition to adding `bidirectional_mask` to the batch, this custom code should modify `labels`
358
+ such that `batch['labels'][batch['bidirectional_mask'] == 1] == -100`.
359
+ That is, the prefix portion of the sequence should not generate any loss. Loss should only be
360
+ generated by the target portion of the sequence.
361
+
362
+ Notes on `GPTNeoForCausalLM`:
363
+ To simplify the implementation, "global" and "local" attention layers are handled differently.
364
+ For "global" layers, we handle conversion as described above. For "local" layers, which use a
365
+ causal attention mask within a restricted local window, we do not alter the masking.
366
+
367
+ Notes on `forward` method conversion:
368
+ After conversion, the `forward` method will handle a new input, `bidirectional_mask`,
369
+ which should be a [batch_size, seq_length] byte tensor, where 1 indicates token positions
370
+ belonging to the prefix (prefix tokens can attend to one another bidirectionally), and
371
+ 0 indicates token positions belonging to the target.
372
+
373
+ The new `forward` method will incorporate `bidirectional_mask` (if supplied) into the existing
374
+ causal mask, call the original `forward` method, and (if the causal mask is a buffer) reset
375
+ the causal masks before returning the result.
376
+
377
+ Notes on `generate` method conversion:
378
+ After conversion, the `generate` method will have the same signature but will internally
379
+ convert all causal masks to be purely bidirectional, call the original `generate` method, and
380
+ (where appropriate) reset the causal masks before returning the result.
381
+
382
+ This works thanks to the logic of the HuggingFace `generate` API, which first encodes the token
383
+ "prompt" passed to `generate` (which is treated as the prefix) and then sequentially generates
384
+ each new token. Encodings are cached as generation happens, so all prefix tokens can attend to one
385
+ another (as expected in a Prefix LM) and generated tokens can only attend to prefix tokens and
386
+ previously-generated tokens (also as expected in a Prefix LM).
387
+
388
+ To preserve the API, the original methods are renamed to `_original_forward` and
389
+ `_original_generate`, and replaced with new `forward` and `generate` methods that wrap
390
+ them, respectively, although implementation details vary by model class.
391
+ """
392
+ if isinstance(model, _SUPPORTED_GPT_MODELS):
393
+ return _convert_gpt_causal_lm_to_prefix_lm(model)
394
+ elif isinstance(model, BloomForCausalLM):
395
+ return _convert_bloom_causal_lm_to_prefix_lm(model)
396
+ elif isinstance(model, OPTForCausalLM):
397
+ return _convert_opt_causal_lm_to_prefix_lm(model)
398
+ else:
399
+ raise TypeError(f'Cannot convert model to Prefix LM. ' + f'Model does not belong to set of supported HF models:' + f'\n{_SUPPORTED_HF_MODELS}')
400
+
401
+ def add_bidirectional_mask_if_missing(batch: Dict[str, Any]):
402
+ """Attempts to add bidirectional_mask to batch if missing.
403
+
404
+ Raises:
405
+ KeyError if bidirectional_mask is missing and can't be inferred
406
+ """
407
+ if 'bidirectional_mask' not in batch:
408
+ if batch.get('mode', None) == 'icl_task':
409
+ batch['bidirectional_mask'] = batch['attention_mask'].clone()
410
+ for (i, continuation_indices) in enumerate(batch['continuation_indices']):
411
+ batch['bidirectional_mask'][i, continuation_indices] = 0
412
+ elif 'labels' in batch and 'attention_mask' in batch:
413
+ batch['bidirectional_mask'] = torch.logical_and(torch.eq(batch['attention_mask'], 1), torch.eq(batch['labels'], -100)).type_as(batch['attention_mask'])
414
+ else:
415
+ raise KeyError('No bidirectional_mask in batch and not sure how to construct one.')
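To make the prefix/target layout described above concrete, here is a minimal standalone sketch (not part of the uploaded files) of driving a converted model. The GPT-2 checkpoint, the prompt/target strings, and the assumption that `hf_prefixlm_converter` is importable from the working directory are illustrative only:

```
# Sketch: convert a small GPT-2 model to a Prefix LM, build a batch with a
# hand-made bidirectional_mask, and compute a loss only over the target span.
# Assumes hf_prefixlm_converter.py (this file) is on the Python path.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from hf_prefixlm_converter import convert_hf_causal_lm_to_prefix_lm

model = convert_hf_causal_lm_to_prefix_lm(GPT2LMHeadModel.from_pretrained('gpt2'))
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

prompt_ids = tokenizer('Translate to French: cheese =>', return_tensors='pt').input_ids
target_ids = tokenizer(' fromage', return_tensors='pt').input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# 1 marks prefix positions (attend bidirectionally), 0 marks target positions.
bidirectional_mask = torch.cat(
    [torch.ones_like(prompt_ids), torch.zeros_like(target_ids)], dim=1)

# Per the docstring above, prefix positions should not contribute to the loss.
labels = input_ids.clone()
labels[bidirectional_mask == 1] = -100

outputs = model(input_ids=input_ids,
                attention_mask=torch.ones_like(input_ids),
                bidirectional_mask=bidirectional_mask,
                labels=labels)
print(outputs.loss)

# At inference time, the HF generate API treats the entire prompt as the prefix.
print(tokenizer.decode(model.generate(prompt_ids, max_new_tokens=8)[0]))
```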
meta_init_context.py ADDED
@@ -0,0 +1,94 @@
1
+ from contextlib import contextmanager
2
+ import torch
3
+ import torch.nn as nn
4
+
5
+ @contextmanager
6
+ def init_empty_weights(include_buffers: bool=False):
7
+ """Meta initialization context manager.
8
+
9
+ A context manager under which models are initialized with all parameters
10
+ on the meta device, therefore creating an empty model. Useful when just
11
+ initializing the model would blow the available RAM.
12
+
13
+ Args:
14
+ include_buffers (`bool`, *optional*, defaults to `False`): Whether or
15
+ not to also put all buffers on the meta device while initializing.
16
+
17
+ Example:
18
+ ```python
19
+ import torch.nn as nn
20
+
21
+ # Initialize a model with 100 billion parameters in no time and without using any RAM.
22
+ with init_empty_weights():
23
+ tst = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
24
+ ```
25
+
26
+ <Tip warning={true}>
27
+
28
+ Any model created under this context manager has no weights. As such you can't do something like
29
+ `model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`].
30
+
31
+ </Tip>
32
+ """
33
+ with init_on_device(torch.device('meta'), include_buffers=include_buffers) as f:
34
+ yield f
35
+
36
+ @contextmanager
37
+ def init_on_device(device: torch.device, include_buffers: bool=False):
38
+ """Device initialization context manager.
39
+
40
+ A context manager under which models are initialized with all parameters
41
+ on the specified device.
42
+
43
+ Args:
44
+ device (`torch.device`): Device to initialize all parameters on.
45
+ include_buffers (`bool`, *optional*, defaults to `False`): Whether or
46
+ not to also put all buffers on the meta device while initializing.
47
+
48
+ Example:
49
+ ```python
50
+ import torch.nn as nn
51
+
52
+ with init_on_device(device=torch.device("cuda")):
53
+ tst = nn.Linear(100, 100) # on `cuda` device
54
+ ```
55
+ """
56
+ old_register_parameter = nn.Module.register_parameter
57
+ if include_buffers:
58
+ old_register_buffer = nn.Module.register_buffer
59
+
60
+ def register_empty_parameter(module, name, param):
61
+ old_register_parameter(module, name, param)
62
+ if param is not None:
63
+ param_cls = type(module._parameters[name])
64
+ kwargs = module._parameters[name].__dict__
65
+ module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
66
+
67
+ def register_empty_buffer(module, name, buffer):
68
+ old_register_buffer(module, name, buffer)
69
+ if buffer is not None:
70
+ module._buffers[name] = module._buffers[name].to(device)
71
+ if include_buffers:
72
+ tensor_constructors_to_patch = {torch_function_name: getattr(torch, torch_function_name) for torch_function_name in ['empty', 'zeros', 'ones', 'full']}
73
+ else:
74
+ tensor_constructors_to_patch = {}
75
+
76
+ def patch_tensor_constructor(fn):
77
+
78
+ def wrapper(*args, **kwargs):
79
+ kwargs['device'] = device
80
+ return fn(*args, **kwargs)
81
+ return wrapper
82
+ try:
83
+ nn.Module.register_parameter = register_empty_parameter
84
+ if include_buffers:
85
+ nn.Module.register_buffer = register_empty_buffer
86
+ for torch_function_name in tensor_constructors_to_patch.keys():
87
+ setattr(torch, torch_function_name, patch_tensor_constructor(getattr(torch, torch_function_name)))
88
+ yield
89
+ finally:
90
+ nn.Module.register_parameter = old_register_parameter
91
+ if include_buffers:
92
+ nn.Module.register_buffer = old_register_buffer
93
+ for (torch_function_name, old_torch_function) in tensor_constructors_to_patch.items():
94
+ setattr(torch, torch_function_name, old_torch_function)
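As a quick standalone illustration of the two context managers above (a sketch, not part of the module), parameters registered while they are active land directly on the requested device:

```
# Sketch: modules built under init_empty_weights() carry meta tensors (no real
# storage), while init_on_device() places parameters on a concrete device.
import torch
import torch.nn as nn
from meta_init_context import init_empty_weights, init_on_device

with init_empty_weights():
    big = nn.Linear(4096, 4096)
print(big.weight.device)  # meta -> no memory was allocated for the weights

with init_on_device(torch.device('cpu')):
    small = nn.Linear(8, 8)
print(small.weight.device)  # cpu
```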
modeling_mpt.py ADDED
@@ -0,0 +1,347 @@
1
+ import math
2
+ import warnings
3
+ from typing import List, Optional, Tuple, Union
4
+ import torch
5
+ import torch.nn as nn
6
+ import torch.nn.functional as F
7
+ from transformers import PreTrainedModel, PreTrainedTokenizer, PreTrainedTokenizerFast
8
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
9
+ from .attention import attn_bias_shape, build_attn_bias
10
+ from .blocks import MPTBlock
11
+ from .norm import NORM_CLASS_REGISTRY
12
+ from .configuration_mpt import MPTConfig
13
+ from .adapt_tokenizer import AutoTokenizerForMOD, adapt_tokenizer_for_denoising
14
+ from .hf_prefixlm_converter import add_bidirectional_mask_if_missing, convert_hf_causal_lm_to_prefix_lm
15
+ from .meta_init_context import init_empty_weights
16
+ from .param_init_fns import MODEL_INIT_REGISTRY, generic_param_init_fn_
17
+ Tokenizer = Union[PreTrainedTokenizer, PreTrainedTokenizerFast]
18
+
19
+ class MPTPreTrainedModel(PreTrainedModel):
20
+ config_class = MPTConfig
21
+ base_model_prefix = 'model'
22
+ _no_split_modules = ["MPTBlock"]
23
+ supports_gradient_checkpointing = True
24
+
25
+ def _set_gradient_checkpointing(self, module, value=False):
26
+ if isinstance(module, MPTModel):
27
+ module.gradient_checkpointing = value
28
+
29
+ class MPTModel(MPTPreTrainedModel):
30
+
31
+ def __init__(self, config: MPTConfig):
32
+ config._validate_config()
33
+ super().__init__(config)
34
+ self.gradient_checkpointing = False
35
+ self.attn_impl = config.attn_config['attn_impl']
36
+ self.prefix_lm = config.attn_config['prefix_lm']
37
+ self.attn_uses_sequence_id = config.attn_config['attn_uses_sequence_id']
38
+ self.alibi = config.attn_config['alibi']
39
+ self.alibi_bias_max = config.attn_config['alibi_bias_max']
40
+ if config.norm_type.lower() not in NORM_CLASS_REGISTRY.keys():
41
+ norm_options = ' | '.join(NORM_CLASS_REGISTRY.keys())
42
+ raise NotImplementedError(f'Requested norm type ({config.norm_type}) is not implemented within this repo (Options: {norm_options}).')
43
+ norm_class = NORM_CLASS_REGISTRY[config.norm_type.lower()]
44
+ self.embedding_fraction = config.embedding_fraction
45
+ self.wte = nn.Embedding(config.vocab_size, config.d_model, device=config.init_device)
46
+ if not self.alibi:
47
+ self.wpe = nn.Embedding(config.max_seq_len, config.d_model, device=config.init_device)
48
+ self.emb_drop = nn.Dropout(config.emb_pdrop)
49
+ self.blocks = nn.ModuleList([MPTBlock(device=config.init_device, **config.to_dict()) for _ in range(config.n_layers)])
50
+ self.norm_f = norm_class(config.d_model, device=config.init_device)
51
+ if config.init_device != 'meta':
52
+ self.apply(self.param_init_fn)
53
+ self.is_causal = not self.prefix_lm
54
+ self._attn_bias_initialized = False
55
+ self.attn_bias = None
56
+ self.attn_bias_shape = attn_bias_shape(self.attn_impl, config.n_heads, config.max_seq_len, self.alibi, prefix_lm=self.prefix_lm, causal=self.is_causal, use_sequence_id=self.attn_uses_sequence_id)
57
+ if config.no_bias:
58
+ for module in self.modules():
59
+ if hasattr(module, 'bias') and isinstance(module.bias, nn.Parameter):
60
+ if config.verbose:
61
+ warnings.warn(f'Removing bias ({module.bias}) from {module}.')
62
+ module.register_parameter('bias', None)
63
+ if config.verbose and config.verbose > 2:
64
+ print(self)
65
+ if 'verbose' not in self.config.init_config:
66
+ self.config.init_config['verbose'] = self.config.verbose
67
+ if self.config.init_config['verbose'] > 1:
68
+ init_fn_name = self.config.init_config['name']
69
+ warnings.warn(f'Using {init_fn_name} initialization.')
70
+
71
+ def get_input_embeddings(self):
72
+ return self.wte
73
+
74
+ def set_input_embeddings(self, value):
75
+ self.wte = value
76
+
77
+ @torch.no_grad()
78
+ def _attn_bias(self, device, dtype, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None):
79
+ if not self._attn_bias_initialized:
80
+ if self.attn_bias_shape:
81
+ self.attn_bias = torch.zeros(self.attn_bias_shape, device=device, dtype=dtype)
82
+ self.attn_bias = build_attn_bias(self.attn_impl, self.attn_bias, self.config.n_heads, self.config.max_seq_len, causal=self.is_causal, alibi=self.alibi, alibi_bias_max=self.alibi_bias_max)
83
+ self._attn_bias_initialized = True
84
+ if self.attn_impl == 'flash':
85
+ return (self.attn_bias, attention_mask)
86
+ if self.attn_bias is not None:
87
+ self.attn_bias = self.attn_bias.to(dtype=dtype, device=device)
88
+ attn_bias = self.attn_bias
89
+ if self.prefix_lm:
90
+ assert isinstance(attn_bias, torch.Tensor)
91
+ assert isinstance(prefix_mask, torch.Tensor)
92
+ attn_bias = self._apply_prefix_mask(attn_bias, prefix_mask)
93
+ if self.attn_uses_sequence_id and sequence_id is not None:
94
+ assert isinstance(attn_bias, torch.Tensor)
95
+ attn_bias = self._apply_sequence_id(attn_bias, sequence_id)
96
+ if attention_mask is not None:
97
+ s_k = attention_mask.shape[-1]
98
+ if attn_bias is None:
99
+ attn_bias = torch.zeros((1, 1, 1, s_k), device=device, dtype=dtype)
100
+ else:
101
+ attn_bias = attn_bias[:, :, :, -s_k:]
102
+ if prefix_mask is not None and attention_mask.shape != prefix_mask.shape:
103
+ raise ValueError(f'attention_mask shape={attention_mask.shape} ' + f'and prefix_mask shape={prefix_mask.shape} are not equal.')
104
+ min_val = torch.finfo(attn_bias.dtype).min
105
+ attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
106
+ return (attn_bias, None)
107
+
108
+ def _apply_prefix_mask(self, attn_bias: torch.Tensor, prefix_mask: torch.Tensor):
109
+ (s_k, s_q) = attn_bias.shape[-2:]
110
+ if s_k != self.config.max_seq_len or s_q != self.config.max_seq_len:
111
+ raise ValueError('attn_bias does not match the expected shape. ' + f'The last two dimensions should both be {self.config.max_seq_len} ' + f'but are {s_k} and {s_q}.')
112
+ seq_len = prefix_mask.shape[-1]
113
+ if seq_len > self.config.max_seq_len:
114
+ raise ValueError(f'prefix_mask sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
115
+ attn_bias = attn_bias[..., :seq_len, :seq_len]
116
+ causal = torch.tril(torch.ones((seq_len, seq_len), dtype=torch.bool, device=prefix_mask.device)).view(1, 1, seq_len, seq_len)
117
+ prefix = prefix_mask.view(-1, 1, 1, seq_len)
118
+ cannot_attend = ~torch.logical_or(causal, prefix.bool())
119
+ min_val = torch.finfo(attn_bias.dtype).min
120
+ attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
121
+ return attn_bias
122
+
123
+ def _apply_sequence_id(self, attn_bias: torch.Tensor, sequence_id: torch.LongTensor):
124
+ seq_len = sequence_id.shape[-1]
125
+ if seq_len > self.config.max_seq_len:
126
+ raise ValueError(f'sequence_id sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
127
+ attn_bias = attn_bias[..., :seq_len, :seq_len]
128
+ cannot_attend = torch.logical_not(torch.eq(sequence_id.view(-1, seq_len, 1), sequence_id.view(-1, 1, seq_len))).unsqueeze(1)
129
+ min_val = torch.finfo(attn_bias.dtype).min
130
+ attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
131
+ return attn_bias
132
+
133
+ def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor] = None):
134
+ return_dict = return_dict if return_dict is not None else self.config.return_dict
135
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
136
+ if self.gradient_checkpointing and self.training:
137
+ if use_cache:
138
+ use_cache = False
139
+ if input_ids is not None and inputs_embeds is not None:
140
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
141
+ elif input_ids is not None:
142
+ batch_size, seq_length = input_ids.shape
143
+ elif inputs_embeds is not None:
144
+ batch_size, seq_length, _ = inputs_embeds.shape
145
+ else:
146
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
147
+
148
+ seq_length_with_past = seq_length
149
+ past_key_values_length = 0
150
+
151
+ if past_key_values is not None:
152
+ past_key_values_length = past_key_values[0][0].shape[2]
153
+ seq_length_with_past = seq_length_with_past + past_key_values_length
154
+
155
+ if inputs_embeds is None:
156
+ tok_emb = self.wte(input_ids)
157
+ else:
158
+ tok_emb = inputs_embeds
159
+
160
+ if attention_mask is not None:
161
+ attention_mask = attention_mask.bool()
162
+ else:
163
+ attention_mask = torch.ones(
164
+ # Use tok_emb.device rather than inputs_embeds.device, since inputs_embeds may be None (e.g., when run under lm-eval).
165
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=tok_emb.device
166
+ )
167
+
168
+ if prefix_mask is not None:
169
+ prefix_mask = prefix_mask.bool()
170
+ if not return_dict:
171
+ raise NotImplementedError('return_dict False is not implemented yet for MPT')
172
+ if output_attentions:
173
+ raise NotImplementedError('output_attentions is not implemented yet for MPT')
174
+ #if attention_mask is not None and attention_mask[:, 0].sum() != attention_mask.shape[0] and self.training:
175
+ # raise NotImplementedError('MPT does not support training with left padding.')
176
+ if self.prefix_lm and prefix_mask is None:
177
+ raise ValueError('prefix_mask is a required argument when MPT is configured with prefix_lm=True.')
178
+ if self.training:
179
+ if self.attn_uses_sequence_id and sequence_id is None:
180
+ raise ValueError('sequence_id is a required argument when MPT is configured with attn_uses_sequence_id=True ' + 'and the model is in train mode.')
181
+ elif self.attn_uses_sequence_id is False and sequence_id is not None:
182
+ warnings.warn('MPT received non-None input for `sequence_id` but is configured with attn_uses_sequence_id=False. ' + 'This input will be ignored. If you want the model to use `sequence_id`, set attn_uses_sequence_id to True.')
183
+ S = seq_length
184
+ assert S <= self.config.max_seq_len, f'Cannot forward input with seq_len={S}, this model only supports seq_len<={self.config.max_seq_len}'
185
+ if self.alibi:
186
+ x = tok_emb
187
+ else:
188
+ past_position = 0
189
+ if past_key_values is not None:
190
+ if len(past_key_values) != self.config.n_layers:
191
+ raise ValueError(f'past_key_values must provide a past_key_value for each attention ' + f'layer in the network (len(past_key_values)={len(past_key_values)!r}; self.config.n_layers={self.config.n_layers!r}).')
192
+ past_position = past_key_values[0][0].size(1)
193
+ if S + past_position > self.config.max_seq_len:
194
+ raise ValueError(f'Cannot forward input with past sequence length {past_position} and current sequence length {S + 1}, this model only supports total sequence length <= {self.config.max_seq_len}.')
195
+ pos = torch.arange(past_position, S + past_position, dtype=torch.long, device=input_ids.device).unsqueeze(0)
196
+ if attention_mask is not None and not self.training:
197
+ pos = torch.clamp(pos - torch.cumsum((~attention_mask).to(torch.int32), dim=1)[:, past_position:], min=0)
198
+ pos_emb = self.wpe(pos)
199
+ x = tok_emb + pos_emb
200
+
201
+ if self.embedding_fraction == 1:
202
+ x = self.emb_drop(x)
203
+ else:
204
+ x_shrunk = x * self.embedding_fraction + x.detach() * (1 - self.embedding_fraction)
205
+ assert isinstance(self.emb_drop, nn.Module)
206
+ x = self.emb_drop(x_shrunk)
207
+ (attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=x.dtype, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
208
+ if use_cache and past_key_values is None:
209
+ past_key_values = [() for _ in range(self.config.n_layers)]
210
+
211
+ all_hidden_states = () if output_hidden_states else None
212
+ for (b_idx, block) in enumerate(self.blocks):
213
+ if output_hidden_states:
214
+ assert all_hidden_states is not None
215
+ all_hidden_states = all_hidden_states + (x,)
216
+ past_key_value = past_key_values[b_idx] if past_key_values is not None else None
217
+
218
+ if self.gradient_checkpointing and self.training:
219
+
220
+ def create_custom_forward(module):
221
+ def custom_forward(*inputs):
222
+ # None for past_key_value
223
+ return module(*inputs)
224
+
225
+ return custom_forward
226
+
227
+ (x, past_key_value) = torch.utils.checkpoint.checkpoint(
228
+ create_custom_forward(block),
229
+ x,
230
+ past_key_value,
231
+ attn_bias,
232
+ attention_mask,
233
+ self.is_causal,
234
+ )
235
+ else:
236
+ (x, past_key_value) = block(x, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=self.is_causal)
237
+ if past_key_values is not None:
238
+ past_key_values[b_idx] = past_key_value
239
+ x = self.norm_f(x)
240
+
241
+ return BaseModelOutputWithPast(last_hidden_state=x, past_key_values=past_key_values, hidden_states=all_hidden_states)
242
+
243
+ def param_init_fn(self, module):
244
+ init_fn_name = self.config.init_config['name']
245
+ MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)
246
+
247
+ def fsdp_wrap_fn(self, module):
248
+ return isinstance(module, MPTBlock)
249
+
250
+ def activation_checkpointing_fn(self, module):
251
+ return isinstance(module, MPTBlock)
252
+
253
+ class MPTForCausalLM(MPTPreTrainedModel):
254
+
255
+ def __init__(self, config: MPTConfig):
256
+ super().__init__(config)
257
+ if not config.tie_word_embeddings:
258
+ raise ValueError('MPTForCausalLM only supports tied word embeddings')
259
+
260
+ self.transformer = MPTModel(config)
261
+ self.logit_scale = None
262
+ if config.logit_scale is not None:
263
+ logit_scale = config.logit_scale
264
+ if isinstance(logit_scale, str):
265
+ if logit_scale == 'inv_sqrt_d_model':
266
+ logit_scale = 1 / math.sqrt(config.d_model)
267
+ else:
268
+ raise ValueError(f"logit_scale={logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
269
+ self.logit_scale = logit_scale
270
+
271
+ def get_input_embeddings(self):
272
+ return self.transformer.wte
273
+
274
+ def set_input_embeddings(self, value):
275
+ self.transformer.wte = value
276
+
277
+ def get_output_embeddings(self):
278
+ return self.transformer.wte
279
+
280
+ def set_output_embeddings(self, new_embeddings):
281
+ self.transformer.wte = new_embeddings
282
+
283
+ def set_decoder(self, decoder):
284
+ self.transformer = decoder
285
+
286
+ def get_decoder(self):
287
+ return self.transformer
288
+
289
+ def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, labels: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor] = None):
290
+ return_dict = return_dict if return_dict is not None else self.config.return_dict
291
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
292
+ outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache, inputs_embeds=inputs_embeds)
293
+
294
+
295
+ logits = F.linear(outputs.last_hidden_state, self.transformer.wte.weight)
296
+ if self.logit_scale is not None:
297
+ if self.logit_scale == 0:
298
+ warnings.warn(f'Multiplying logits by self.logit_scale={self.logit_scale!r}. This will produce uniform (uninformative) outputs.')
299
+ logits *= self.logit_scale
300
+ loss = None
301
+ if labels is not None:
302
+ labels = torch.roll(labels, shifts=-1)
303
+ labels[:, -1] = -100
304
+ loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.to(logits.device).view(-1))
305
+ return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=outputs.past_key_values, hidden_states=outputs.hidden_states)
306
+
307
+ def param_init_fn(self, module):
308
+ init_fn_name = self.config.init_config['name']
309
+ MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)
310
+
311
+ def fsdp_wrap_fn(self, module):
312
+ return isinstance(module, MPTBlock)
313
+
314
+ def activation_checkpointing_fn(self, module):
315
+ return isinstance(module, MPTBlock)
316
+
317
+ def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
318
+ if inputs_embeds is not None:
319
+ raise NotImplementedError('inputs_embeds is not implemented for MPT yet')
320
+ attention_mask = kwargs['attention_mask'].bool()
321
+ if attention_mask[:, -1].sum() != attention_mask.shape[0]:
322
+ raise NotImplementedError('MPT does not support generation with right padding.')
323
+ if self.transformer.attn_uses_sequence_id and self.training:
324
+ sequence_id = torch.zeros_like(input_ids[:1])
325
+ else:
326
+ sequence_id = None
327
+ if past_key_values is not None:
328
+ input_ids = input_ids[:, -1].unsqueeze(-1)
329
+ if self.transformer.prefix_lm:
330
+ prefix_mask = torch.ones_like(attention_mask)
331
+ if kwargs.get('use_cache') == False:
332
+ raise NotImplementedError('MPT with prefix_lm=True does not support use_cache=False.')
333
+ else:
334
+ prefix_mask = None
335
+ return {'input_ids': input_ids, 'attention_mask': attention_mask, 'prefix_mask': prefix_mask, 'sequence_id': sequence_id, 'past_key_values': past_key_values, 'use_cache': kwargs.get('use_cache', True)}
336
+
337
+ @staticmethod
338
+ def _reorder_cache(past_key_values, beam_idx):
339
+ """Used by HuggingFace generate when using beam search with kv-caching.
340
+
341
+ See https://github.com/huggingface/transformers/blob/3ec7a47664ebe40c40f4b722f6bb1cd30c3821ec/src/transformers/models/gpt2/modeling_gpt2.py#L1122-L1133
342
+ for an example in transformers.
343
+ """
344
+ reordered_past = []
345
+ for layer_past in past_key_values:
346
+ reordered_past += [tuple((past_state.index_select(0, beam_idx) for past_state in layer_past))]
347
+ return reordered_past
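The `supports_gradient_checkpointing` flag and `_set_gradient_checkpointing` hook above are what allow the standard HuggingFace `gradient_checkpointing_enable()` call to toggle `MPTModel.gradient_checkpointing`, which is the main point of this fork. A rough sketch of exercising it follows; the repository id below is a placeholder, not necessarily this repo's actual id:

```
# Sketch: load the modified MPT model with remote code and turn on activation
# checkpointing. 'your-org/mpt-30b-qlora-compatible' is a placeholder repo id.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'your-org/mpt-30b-qlora-compatible',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model.gradient_checkpointing_enable()  # sets MPTModel.gradient_checkpointing = True
model.train()  # MPTModel.forward disables use_cache while checkpointing is active
```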
norm.py ADDED
@@ -0,0 +1,56 @@
1
+ import torch
2
+
3
+ def _cast_if_autocast_enabled(tensor):
4
+ if torch.is_autocast_enabled():
5
+ if tensor.device.type == 'cuda':
6
+ dtype = torch.get_autocast_gpu_dtype()
7
+ elif tensor.device.type == 'cpu':
8
+ dtype = torch.get_autocast_cpu_dtype()
9
+ else:
10
+ raise NotImplementedError()
11
+ return tensor.to(dtype=dtype)
12
+ return tensor
13
+
14
+ class LPLayerNorm(torch.nn.LayerNorm):
15
+
16
+ def __init__(self, normalized_shape, eps=1e-05, elementwise_affine=True, device=None, dtype=None):
17
+ super().__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
18
+
19
+ def forward(self, x):
20
+ module_device = x.device
21
+ downcast_x = _cast_if_autocast_enabled(x)
22
+ downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
23
+ downcast_bias = _cast_if_autocast_enabled(self.bias) if self.bias is not None else self.bias
24
+ with torch.autocast(enabled=False, device_type=module_device.type):
25
+ return torch.nn.functional.layer_norm(downcast_x, self.normalized_shape, downcast_weight, downcast_bias, self.eps)
26
+
27
+ def rms_norm(x, weight=None, eps=1e-05):
28
+ output = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
29
+ if weight is not None:
30
+ return output * weight
31
+ return output
32
+
33
+ class RMSNorm(torch.nn.Module):
34
+
35
+ def __init__(self, normalized_shape, eps=1e-05, weight=True, dtype=None, device=None):
36
+ super().__init__()
37
+ self.eps = eps
38
+ if weight:
39
+ self.weight = torch.nn.Parameter(torch.ones(normalized_shape, dtype=dtype, device=device))
40
+ else:
41
+ self.register_parameter('weight', None)
42
+
43
+ def forward(self, x):
44
+ return rms_norm(x.float(), self.weight, self.eps).to(dtype=x.dtype)
45
+
46
+ class LPRMSNorm(RMSNorm):
47
+
48
+ def __init__(self, normalized_shape, eps=1e-05, weight=True, dtype=None, device=None):
49
+ super().__init__(normalized_shape=normalized_shape, eps=eps, weight=weight, dtype=dtype, device=device)
50
+
51
+ def forward(self, x):
52
+ downcast_x = _cast_if_autocast_enabled(x)
53
+ downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
54
+ with torch.autocast(enabled=False, device_type=x.device.type):
55
+ return rms_norm(downcast_x, downcast_weight, self.eps).to(dtype=x.dtype)
56
+ NORM_CLASS_REGISTRY = {'layernorm': torch.nn.LayerNorm, 'low_precision_layernorm': LPLayerNorm, 'rmsnorm': RMSNorm, 'low_precision_rmsnorm': LPRMSNorm}
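The registry on the last line is how `MPTModel.__init__` resolves `config.norm_type` to a norm class. A small sketch, assuming the `low_precision_layernorm` entry (the norm type MPT configs typically use):

```
# Sketch: resolve a norm class from the registry and apply it, mirroring
# norm_class = NORM_CLASS_REGISTRY[config.norm_type.lower()] in modeling_mpt.py.
import torch
from norm import NORM_CLASS_REGISTRY

norm_class = NORM_CLASS_REGISTRY['low_precision_layernorm']
norm = norm_class(4096)
x = torch.randn(2, 8, 4096)
print(norm(x).shape)  # torch.Size([2, 8, 4096])
```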
param_init_fns.py ADDED
@@ -0,0 +1,181 @@
1
+ import math
2
+ import warnings
3
+ from collections.abc import Sequence
4
+ from functools import partial
5
+ from typing import Optional, Tuple, Union
6
+ import torch
7
+ from torch import nn
8
+ from .norm import NORM_CLASS_REGISTRY
9
+
10
+ def torch_default_param_init_fn_(module: nn.Module, verbose: int=0, **kwargs):
11
+ del kwargs
12
+ if verbose > 1:
13
+ warnings.warn(f"Initializing network using module's reset_parameters attribute")
14
+ if hasattr(module, 'reset_parameters'):
15
+ module.reset_parameters()
16
+
17
+ def fused_init_helper_(module: nn.Module, init_fn_):
18
+ _fused = getattr(module, '_fused', None)
19
+ if _fused is None:
20
+ raise RuntimeError(f'Internal logic error')
21
+ (dim, splits) = _fused
22
+ splits = (0, *splits, module.weight.size(dim))
23
+ for (s, e) in zip(splits[:-1], splits[1:]):
24
+ slice_indices = [slice(None)] * module.weight.ndim
25
+ slice_indices[dim] = slice(s, e)
26
+ init_fn_(module.weight[slice_indices])
27
+
28
+ def generic_param_init_fn_(module: nn.Module, init_fn_, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
29
+ del kwargs
30
+ if verbose > 1:
31
+ warnings.warn(f'If model has bias parameters they are initialized to 0.')
32
+ init_div_is_residual = init_div_is_residual
33
+ if init_div_is_residual is False:
34
+ div_is_residual = 1.0
35
+ elif init_div_is_residual is True:
36
+ div_is_residual = math.sqrt(2 * n_layers)
37
+ elif isinstance(init_div_is_residual, float) or isinstance(init_div_is_residual, int):
38
+ div_is_residual = init_div_is_residual
39
+ elif isinstance(init_div_is_residual, str) and init_div_is_residual.isnumeric():
40
+ div_is_residual = float(init_div_is_residual)
41
+ else:
42
+ div_is_residual = 1.0
43
+ raise ValueError(f'Expected init_div_is_residual to be boolean or numeric, got {init_div_is_residual}')
44
+ if init_div_is_residual is not False:
45
+ if verbose > 1:
46
+ warnings.warn(f'Initializing _is_residual layers then dividing them by {div_is_residual:.3f}. ' + f'Set `init_div_is_residual: false` in init config to disable this.')
47
+ if isinstance(module, nn.Linear):
48
+ if hasattr(module, '_fused'):
49
+ fused_init_helper_(module, init_fn_)
50
+ else:
51
+ init_fn_(module.weight)
52
+ if module.bias is not None:
53
+ torch.nn.init.zeros_(module.bias)
54
+ if init_div_is_residual is not False and getattr(module, '_is_residual', False):
55
+ with torch.no_grad():
56
+ module.weight.div_(div_is_residual)
57
+ elif isinstance(module, nn.Embedding):
58
+ if emb_init_std is not None:
59
+ std = emb_init_std
60
+ if std == 0:
61
+ warnings.warn(f'Embedding layer initialized to 0.')
62
+ emb_init_fn_ = partial(torch.nn.init.normal_, mean=0.0, std=std)
63
+ if verbose > 1:
64
+ warnings.warn(f'Embedding layer initialized using normal distribution with mean=0 and std={std!r}.')
65
+ elif emb_init_uniform_lim is not None:
66
+ lim = emb_init_uniform_lim
67
+ if isinstance(lim, Sequence):
68
+ if len(lim) > 2:
69
+ raise ValueError(f'Uniform init requires a min and a max limit. User input: {lim}.')
70
+ if lim[0] == lim[1]:
71
+ warnings.warn(f'Embedding layer initialized to {lim[0]}.')
72
+ else:
73
+ if lim == 0:
74
+ warnings.warn(f'Embedding layer initialized to 0.')
75
+ lim = [-lim, lim]
76
+ (a, b) = lim
77
+ emb_init_fn_ = partial(torch.nn.init.uniform_, a=a, b=b)
78
+ if verbose > 1:
79
+ warnings.warn(f'Embedding layer initialized using uniform distribution in range {lim}.')
80
+ else:
81
+ emb_init_fn_ = init_fn_
82
+ emb_init_fn_(module.weight)
83
+ elif isinstance(module, tuple(set(NORM_CLASS_REGISTRY.values()))):
84
+ if verbose > 1:
85
+ warnings.warn(f'Norm weights are set to 1. If norm layer has a bias it is initialized to 0.')
86
+ if hasattr(module, 'weight') and module.weight is not None:
87
+ torch.nn.init.ones_(module.weight)
88
+ if hasattr(module, 'bias') and module.bias is not None:
89
+ torch.nn.init.zeros_(module.bias)
90
+ elif isinstance(module, nn.MultiheadAttention):
91
+ if module._qkv_same_embed_dim:
92
+ assert module.in_proj_weight is not None
93
+ assert module.q_proj_weight is None and module.k_proj_weight is None and (module.v_proj_weight is None)
94
+ assert d_model is not None
95
+ _d = d_model
96
+ splits = (0, _d, 2 * _d, 3 * _d)
97
+ for (s, e) in zip(splits[:-1], splits[1:]):
98
+ init_fn_(module.in_proj_weight[s:e])
99
+ else:
100
+ assert module.q_proj_weight is not None and module.k_proj_weight is not None and (module.v_proj_weight is not None)
101
+ assert module.in_proj_weight is None
102
+ init_fn_(module.q_proj_weight)
103
+ init_fn_(module.k_proj_weight)
104
+ init_fn_(module.v_proj_weight)
105
+ if module.in_proj_bias is not None:
106
+ torch.nn.init.zeros_(module.in_proj_bias)
107
+ if module.bias_k is not None:
108
+ torch.nn.init.zeros_(module.bias_k)
109
+ if module.bias_v is not None:
110
+ torch.nn.init.zeros_(module.bias_v)
111
+ init_fn_(module.out_proj.weight)
112
+ if init_div_is_residual is not False and getattr(module.out_proj, '_is_residual', False):
113
+ with torch.no_grad():
114
+ module.out_proj.weight.div_(div_is_residual)
115
+ if module.out_proj.bias is not None:
116
+ torch.nn.init.zeros_(module.out_proj.bias)
117
+ else:
118
+ for _ in module.parameters(recurse=False):
119
+ raise NotImplementedError(f'{module.__class__.__name__} parameters are not initialized by param_init_fn.')
120
+
121
+ def _normal_init_(std, mean=0.0):
122
+ return partial(torch.nn.init.normal_, mean=mean, std=std)
123
+
124
+ def _normal_param_init_fn_(module: nn.Module, std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
125
+ del kwargs
126
+ init_fn_ = _normal_init_(std=std)
127
+ if verbose > 1:
128
+ warnings.warn(f'Using torch.nn.init.normal_ init fn mean=0.0, std={std}')
129
+ generic_param_init_fn_(module=module, init_fn_=init_fn_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
130
+
131
+ def baseline_param_init_fn_(module: nn.Module, init_std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
132
+ del kwargs
133
+ if init_std is None:
134
+ raise ValueError("You must set model.init_config['init_std'] to a float value to use the default initialization scheme.")
135
+ _normal_param_init_fn_(module=module, std=init_std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
136
+
137
+ def small_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
138
+ del kwargs
139
+ std = math.sqrt(2 / (5 * d_model))
140
+ _normal_param_init_fn_(module=module, std=std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
141
+
142
+ def neox_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, verbose: int=0, **kwargs):
143
+ """From section 2.3.1 of GPT-NeoX-20B:
144
+
145
+ An Open-Source Autoregressive Language Model, Black et al. (2022);
146
+ see https://github.com/EleutherAI/gpt-neox/blob/9610391ab319403cef079b438edd016a2443af54/megatron/model/init_functions.py#L151
147
+ and https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py
148
+ """
149
+ del kwargs
150
+ residual_div = n_layers / math.sqrt(10)
151
+ if verbose > 1:
152
+ warnings.warn(f'setting init_div_is_residual to {residual_div}')
153
+ small_param_init_fn_(module=module, d_model=d_model, n_layers=n_layers, init_div_is_residual=residual_div, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
154
+
155
+ def kaiming_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', verbose: int=0, **kwargs):
156
+ del kwargs
157
+ if verbose > 1:
158
+ warnings.warn(f'Using nn.init.kaiming_uniform_ init fn with parameters: ' + f'a={init_gain}, mode={fan_mode}, nonlinearity={init_nonlinearity}')
159
+ kaiming_uniform_ = partial(nn.init.kaiming_uniform_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
160
+ generic_param_init_fn_(module=module, init_fn_=kaiming_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
161
+
162
+ def kaiming_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', verbose: int=0, **kwargs):
163
+ del kwargs
164
+ if verbose > 1:
165
+ warnings.warn(f'Using nn.init.kaiming_normal_ init fn with parameters: ' + f'a={init_gain}, mode={fan_mode}, nonlinearity={init_nonlinearity}')
166
+ kaiming_normal_ = partial(torch.nn.init.kaiming_normal_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
167
+ generic_param_init_fn_(module=module, init_fn_=kaiming_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
168
+
169
+ def xavier_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, verbose: int=0, **kwargs):
170
+ del kwargs
171
+ xavier_uniform_ = partial(torch.nn.init.xavier_uniform_, gain=init_gain)
172
+ if verbose > 1:
173
+ warnings.warn(f'Using torch.nn.init.xavier_uniform_ init fn with parameters: ' + f'gain={init_gain}')
174
+ generic_param_init_fn_(module=module, init_fn_=xavier_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
175
+
176
+ def xavier_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, verbose: int=0, **kwargs):
177
+ xavier_normal_ = partial(torch.nn.init.xavier_normal_, gain=init_gain)
178
+ if verbose > 1:
179
+ warnings.warn(f'Using torch.nn.init.xavier_normal_ init fn with parameters: ' + f'gain={init_gain}')
180
+ generic_param_init_fn_(module=module, init_fn_=xavier_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim, verbose=verbose)
181
+ MODEL_INIT_REGISTRY = {'default_': torch_default_param_init_fn_, 'baseline_': baseline_param_init_fn_, 'kaiming_uniform_': kaiming_uniform_param_init_fn_, 'kaiming_normal_': kaiming_normal_param_init_fn_, 'neox_init_': neox_param_init_fn_, 'small_init_': small_param_init_fn_, 'xavier_uniform_': xavier_uniform_param_init_fn_, 'xavier_normal_': xavier_normal_param_init_fn_}
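`MODEL_INIT_REGISTRY` is what `param_init_fn` in `modeling_mpt.py` dispatches into via `config.init_config['name']`. A minimal sketch of that dispatch on a single layer; the init_config values here are illustrative rather than the shipped defaults:

```
# Sketch: initialize one nn.Linear the same way MPTModel.param_init_fn would,
# by looking the init fn up in MODEL_INIT_REGISTRY. Values are illustrative.
import torch.nn as nn
from param_init_fns import MODEL_INIT_REGISTRY

init_config = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu'}
linear = nn.Linear(512, 2048)
MODEL_INIT_REGISTRY[init_config['name']](
    module=linear, n_layers=48, d_model=512, **init_config)
print(linear.weight.std())
```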
pytorch_model-00001-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:81212c7a02da6c57e3efe256449be1bb59691f65bc24300ba5f1b97aba278c78
3
+ size 9766157965
pytorch_model-00002-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b850f5b02d5417793b1f277c11c564c17922eb932fda45515610c5b4e594c390
3
+ size 9865248775
pytorch_model-00003-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32c6585a52b8c10829582a839564f8782c8c61ee5328335e976b86c719eb8e9b
3
+ size 9865248775
pytorch_model-00004-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:51855da42a496b634766985d6a1d4ad4d6bacba35f08b599857eac32088614db
3
+ size 9865248775
pytorch_model-00005-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea043754f22bc8f1f280b6606790fa1ccc505f39c22dd315ecb20c7d46e8b471
3
+ size 9865248775
pytorch_model-00006-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:888c5d70e4385c421803c9fe6480f9773677e48b4da9bcb93204017e96e6f7a0
3
+ size 9865248775
pytorch_model-00007-of-00007.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b04ee8c36da6dfc94132d8ff933f9ad05f9942b50767e00f0b19fdccb55ccb71
3
+ size 822099468
pytorch_model.bin.index.json ADDED
@@ -0,0 +1,297 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 59914401792
4
+ },
5
+ "weight_map": {
6
+ "transformer.blocks.0.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
7
+ "transformer.blocks.0.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
8
+ "transformer.blocks.0.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
9
+ "transformer.blocks.0.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
10
+ "transformer.blocks.0.norm_1.weight": "pytorch_model-00001-of-00007.bin",
11
+ "transformer.blocks.0.norm_2.weight": "pytorch_model-00001-of-00007.bin",
12
+ "transformer.blocks.1.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
13
+ "transformer.blocks.1.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
14
+ "transformer.blocks.1.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
15
+ "transformer.blocks.1.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
16
+ "transformer.blocks.1.norm_1.weight": "pytorch_model-00001-of-00007.bin",
17
+ "transformer.blocks.1.norm_2.weight": "pytorch_model-00001-of-00007.bin",
18
+ "transformer.blocks.10.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
19
+ "transformer.blocks.10.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
20
+ "transformer.blocks.10.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
21
+ "transformer.blocks.10.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
22
+ "transformer.blocks.10.norm_1.weight": "pytorch_model-00002-of-00007.bin",
23
+ "transformer.blocks.10.norm_2.weight": "pytorch_model-00002-of-00007.bin",
24
+ "transformer.blocks.11.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
25
+ "transformer.blocks.11.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
26
+ "transformer.blocks.11.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
27
+ "transformer.blocks.11.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
28
+ "transformer.blocks.11.norm_1.weight": "pytorch_model-00002-of-00007.bin",
29
+ "transformer.blocks.11.norm_2.weight": "pytorch_model-00002-of-00007.bin",
30
+ "transformer.blocks.12.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
31
+ "transformer.blocks.12.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
32
+ "transformer.blocks.12.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
33
+ "transformer.blocks.12.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
34
+ "transformer.blocks.12.norm_1.weight": "pytorch_model-00002-of-00007.bin",
35
+ "transformer.blocks.12.norm_2.weight": "pytorch_model-00002-of-00007.bin",
36
+ "transformer.blocks.13.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
37
+ "transformer.blocks.13.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
38
+ "transformer.blocks.13.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
39
+ "transformer.blocks.13.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
40
+ "transformer.blocks.13.norm_1.weight": "pytorch_model-00002-of-00007.bin",
41
+ "transformer.blocks.13.norm_2.weight": "pytorch_model-00002-of-00007.bin",
42
+ "transformer.blocks.14.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
43
+ "transformer.blocks.14.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
44
+ "transformer.blocks.14.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
45
+ "transformer.blocks.14.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
46
+ "transformer.blocks.14.norm_1.weight": "pytorch_model-00002-of-00007.bin",
47
+ "transformer.blocks.14.norm_2.weight": "pytorch_model-00002-of-00007.bin",
48
+ "transformer.blocks.15.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
49
+ "transformer.blocks.15.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
50
+ "transformer.blocks.15.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
51
+ "transformer.blocks.15.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
52
+ "transformer.blocks.15.norm_1.weight": "pytorch_model-00002-of-00007.bin",
53
+ "transformer.blocks.15.norm_2.weight": "pytorch_model-00002-of-00007.bin",
54
+ "transformer.blocks.16.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
55
+ "transformer.blocks.16.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
56
+ "transformer.blocks.16.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
57
+ "transformer.blocks.16.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
58
+ "transformer.blocks.16.norm_1.weight": "pytorch_model-00003-of-00007.bin",
59
+ "transformer.blocks.16.norm_2.weight": "pytorch_model-00003-of-00007.bin",
60
+ "transformer.blocks.17.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
61
+ "transformer.blocks.17.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
62
+ "transformer.blocks.17.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
63
+ "transformer.blocks.17.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
64
+ "transformer.blocks.17.norm_1.weight": "pytorch_model-00003-of-00007.bin",
65
+ "transformer.blocks.17.norm_2.weight": "pytorch_model-00003-of-00007.bin",
66
+ "transformer.blocks.18.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
67
+ "transformer.blocks.18.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
68
+ "transformer.blocks.18.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
69
+ "transformer.blocks.18.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
70
+ "transformer.blocks.18.norm_1.weight": "pytorch_model-00003-of-00007.bin",
71
+ "transformer.blocks.18.norm_2.weight": "pytorch_model-00003-of-00007.bin",
72
+ "transformer.blocks.19.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
73
+ "transformer.blocks.19.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
74
+ "transformer.blocks.19.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
75
+ "transformer.blocks.19.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
76
+ "transformer.blocks.19.norm_1.weight": "pytorch_model-00003-of-00007.bin",
77
+ "transformer.blocks.19.norm_2.weight": "pytorch_model-00003-of-00007.bin",
78
+ "transformer.blocks.2.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
79
+ "transformer.blocks.2.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
80
+ "transformer.blocks.2.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
81
+ "transformer.blocks.2.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
82
+ "transformer.blocks.2.norm_1.weight": "pytorch_model-00001-of-00007.bin",
83
+ "transformer.blocks.2.norm_2.weight": "pytorch_model-00001-of-00007.bin",
84
+ "transformer.blocks.20.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
85
+ "transformer.blocks.20.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
86
+ "transformer.blocks.20.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
87
+ "transformer.blocks.20.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
88
+ "transformer.blocks.20.norm_1.weight": "pytorch_model-00003-of-00007.bin",
89
+ "transformer.blocks.20.norm_2.weight": "pytorch_model-00003-of-00007.bin",
90
+ "transformer.blocks.21.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
91
+ "transformer.blocks.21.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
92
+ "transformer.blocks.21.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
93
+ "transformer.blocks.21.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
94
+ "transformer.blocks.21.norm_1.weight": "pytorch_model-00003-of-00007.bin",
95
+ "transformer.blocks.21.norm_2.weight": "pytorch_model-00003-of-00007.bin",
96
+ "transformer.blocks.22.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
97
+ "transformer.blocks.22.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
98
+ "transformer.blocks.22.ffn.down_proj.weight": "pytorch_model-00003-of-00007.bin",
99
+ "transformer.blocks.22.ffn.up_proj.weight": "pytorch_model-00003-of-00007.bin",
100
+ "transformer.blocks.22.norm_1.weight": "pytorch_model-00003-of-00007.bin",
101
+ "transformer.blocks.22.norm_2.weight": "pytorch_model-00003-of-00007.bin",
102
+ "transformer.blocks.23.attn.Wqkv.weight": "pytorch_model-00003-of-00007.bin",
103
+ "transformer.blocks.23.attn.out_proj.weight": "pytorch_model-00003-of-00007.bin",
104
+ "transformer.blocks.23.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
105
+ "transformer.blocks.23.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
106
+ "transformer.blocks.23.norm_1.weight": "pytorch_model-00003-of-00007.bin",
107
+ "transformer.blocks.23.norm_2.weight": "pytorch_model-00003-of-00007.bin",
108
+ "transformer.blocks.24.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
109
+ "transformer.blocks.24.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
110
+ "transformer.blocks.24.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
111
+ "transformer.blocks.24.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
112
+ "transformer.blocks.24.norm_1.weight": "pytorch_model-00004-of-00007.bin",
113
+ "transformer.blocks.24.norm_2.weight": "pytorch_model-00004-of-00007.bin",
114
+ "transformer.blocks.25.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
115
+ "transformer.blocks.25.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
116
+ "transformer.blocks.25.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
117
+ "transformer.blocks.25.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
118
+ "transformer.blocks.25.norm_1.weight": "pytorch_model-00004-of-00007.bin",
119
+ "transformer.blocks.25.norm_2.weight": "pytorch_model-00004-of-00007.bin",
120
+ "transformer.blocks.26.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
121
+ "transformer.blocks.26.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
122
+ "transformer.blocks.26.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
123
+ "transformer.blocks.26.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
124
+ "transformer.blocks.26.norm_1.weight": "pytorch_model-00004-of-00007.bin",
125
+ "transformer.blocks.26.norm_2.weight": "pytorch_model-00004-of-00007.bin",
126
+ "transformer.blocks.27.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
127
+ "transformer.blocks.27.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
128
+ "transformer.blocks.27.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
129
+ "transformer.blocks.27.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
130
+ "transformer.blocks.27.norm_1.weight": "pytorch_model-00004-of-00007.bin",
131
+ "transformer.blocks.27.norm_2.weight": "pytorch_model-00004-of-00007.bin",
132
+ "transformer.blocks.28.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
133
+ "transformer.blocks.28.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
134
+ "transformer.blocks.28.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
135
+ "transformer.blocks.28.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
136
+ "transformer.blocks.28.norm_1.weight": "pytorch_model-00004-of-00007.bin",
137
+ "transformer.blocks.28.norm_2.weight": "pytorch_model-00004-of-00007.bin",
138
+ "transformer.blocks.29.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
139
+ "transformer.blocks.29.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
140
+ "transformer.blocks.29.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
141
+ "transformer.blocks.29.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
142
+ "transformer.blocks.29.norm_1.weight": "pytorch_model-00004-of-00007.bin",
143
+ "transformer.blocks.29.norm_2.weight": "pytorch_model-00004-of-00007.bin",
144
+ "transformer.blocks.3.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
145
+ "transformer.blocks.3.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
146
+ "transformer.blocks.3.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
147
+ "transformer.blocks.3.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
148
+ "transformer.blocks.3.norm_1.weight": "pytorch_model-00001-of-00007.bin",
149
+ "transformer.blocks.3.norm_2.weight": "pytorch_model-00001-of-00007.bin",
150
+ "transformer.blocks.30.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
151
+ "transformer.blocks.30.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
152
+ "transformer.blocks.30.ffn.down_proj.weight": "pytorch_model-00004-of-00007.bin",
153
+ "transformer.blocks.30.ffn.up_proj.weight": "pytorch_model-00004-of-00007.bin",
154
+ "transformer.blocks.30.norm_1.weight": "pytorch_model-00004-of-00007.bin",
155
+ "transformer.blocks.30.norm_2.weight": "pytorch_model-00004-of-00007.bin",
156
+ "transformer.blocks.31.attn.Wqkv.weight": "pytorch_model-00004-of-00007.bin",
157
+ "transformer.blocks.31.attn.out_proj.weight": "pytorch_model-00004-of-00007.bin",
158
+ "transformer.blocks.31.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
159
+ "transformer.blocks.31.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
160
+ "transformer.blocks.31.norm_1.weight": "pytorch_model-00004-of-00007.bin",
161
+ "transformer.blocks.31.norm_2.weight": "pytorch_model-00004-of-00007.bin",
162
+ "transformer.blocks.32.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
163
+ "transformer.blocks.32.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
164
+ "transformer.blocks.32.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
165
+ "transformer.blocks.32.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
166
+ "transformer.blocks.32.norm_1.weight": "pytorch_model-00005-of-00007.bin",
167
+ "transformer.blocks.32.norm_2.weight": "pytorch_model-00005-of-00007.bin",
168
+ "transformer.blocks.33.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
169
+ "transformer.blocks.33.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
170
+ "transformer.blocks.33.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
171
+ "transformer.blocks.33.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
172
+ "transformer.blocks.33.norm_1.weight": "pytorch_model-00005-of-00007.bin",
173
+ "transformer.blocks.33.norm_2.weight": "pytorch_model-00005-of-00007.bin",
174
+ "transformer.blocks.34.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
175
+ "transformer.blocks.34.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
176
+ "transformer.blocks.34.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
177
+ "transformer.blocks.34.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
178
+ "transformer.blocks.34.norm_1.weight": "pytorch_model-00005-of-00007.bin",
179
+ "transformer.blocks.34.norm_2.weight": "pytorch_model-00005-of-00007.bin",
180
+ "transformer.blocks.35.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
181
+ "transformer.blocks.35.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
182
+ "transformer.blocks.35.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
183
+ "transformer.blocks.35.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
184
+ "transformer.blocks.35.norm_1.weight": "pytorch_model-00005-of-00007.bin",
185
+ "transformer.blocks.35.norm_2.weight": "pytorch_model-00005-of-00007.bin",
186
+ "transformer.blocks.36.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
187
+ "transformer.blocks.36.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
188
+ "transformer.blocks.36.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
189
+ "transformer.blocks.36.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
190
+ "transformer.blocks.36.norm_1.weight": "pytorch_model-00005-of-00007.bin",
191
+ "transformer.blocks.36.norm_2.weight": "pytorch_model-00005-of-00007.bin",
192
+ "transformer.blocks.37.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
193
+ "transformer.blocks.37.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
194
+ "transformer.blocks.37.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
195
+ "transformer.blocks.37.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
196
+ "transformer.blocks.37.norm_1.weight": "pytorch_model-00005-of-00007.bin",
197
+ "transformer.blocks.37.norm_2.weight": "pytorch_model-00005-of-00007.bin",
198
+ "transformer.blocks.38.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
199
+ "transformer.blocks.38.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
200
+ "transformer.blocks.38.ffn.down_proj.weight": "pytorch_model-00005-of-00007.bin",
201
+ "transformer.blocks.38.ffn.up_proj.weight": "pytorch_model-00005-of-00007.bin",
202
+ "transformer.blocks.38.norm_1.weight": "pytorch_model-00005-of-00007.bin",
203
+ "transformer.blocks.38.norm_2.weight": "pytorch_model-00005-of-00007.bin",
204
+ "transformer.blocks.39.attn.Wqkv.weight": "pytorch_model-00005-of-00007.bin",
205
+ "transformer.blocks.39.attn.out_proj.weight": "pytorch_model-00005-of-00007.bin",
206
+ "transformer.blocks.39.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
207
+ "transformer.blocks.39.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
208
+ "transformer.blocks.39.norm_1.weight": "pytorch_model-00005-of-00007.bin",
209
+ "transformer.blocks.39.norm_2.weight": "pytorch_model-00005-of-00007.bin",
210
+ "transformer.blocks.4.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
211
+ "transformer.blocks.4.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
212
+ "transformer.blocks.4.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
213
+ "transformer.blocks.4.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
214
+ "transformer.blocks.4.norm_1.weight": "pytorch_model-00001-of-00007.bin",
215
+ "transformer.blocks.4.norm_2.weight": "pytorch_model-00001-of-00007.bin",
216
+ "transformer.blocks.40.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
217
+ "transformer.blocks.40.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
218
+ "transformer.blocks.40.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
219
+ "transformer.blocks.40.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
220
+ "transformer.blocks.40.norm_1.weight": "pytorch_model-00006-of-00007.bin",
221
+ "transformer.blocks.40.norm_2.weight": "pytorch_model-00006-of-00007.bin",
222
+ "transformer.blocks.41.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
223
+ "transformer.blocks.41.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
224
+ "transformer.blocks.41.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
225
+ "transformer.blocks.41.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
226
+ "transformer.blocks.41.norm_1.weight": "pytorch_model-00006-of-00007.bin",
227
+ "transformer.blocks.41.norm_2.weight": "pytorch_model-00006-of-00007.bin",
228
+ "transformer.blocks.42.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
229
+ "transformer.blocks.42.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
230
+ "transformer.blocks.42.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
231
+ "transformer.blocks.42.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
232
+ "transformer.blocks.42.norm_1.weight": "pytorch_model-00006-of-00007.bin",
233
+ "transformer.blocks.42.norm_2.weight": "pytorch_model-00006-of-00007.bin",
234
+ "transformer.blocks.43.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
235
+ "transformer.blocks.43.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
236
+ "transformer.blocks.43.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
237
+ "transformer.blocks.43.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
238
+ "transformer.blocks.43.norm_1.weight": "pytorch_model-00006-of-00007.bin",
239
+ "transformer.blocks.43.norm_2.weight": "pytorch_model-00006-of-00007.bin",
240
+ "transformer.blocks.44.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
241
+ "transformer.blocks.44.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
242
+ "transformer.blocks.44.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
243
+ "transformer.blocks.44.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
244
+ "transformer.blocks.44.norm_1.weight": "pytorch_model-00006-of-00007.bin",
245
+ "transformer.blocks.44.norm_2.weight": "pytorch_model-00006-of-00007.bin",
246
+ "transformer.blocks.45.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
247
+ "transformer.blocks.45.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
248
+ "transformer.blocks.45.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
249
+ "transformer.blocks.45.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
250
+ "transformer.blocks.45.norm_1.weight": "pytorch_model-00006-of-00007.bin",
251
+ "transformer.blocks.45.norm_2.weight": "pytorch_model-00006-of-00007.bin",
252
+ "transformer.blocks.46.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
253
+ "transformer.blocks.46.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
254
+ "transformer.blocks.46.ffn.down_proj.weight": "pytorch_model-00006-of-00007.bin",
255
+ "transformer.blocks.46.ffn.up_proj.weight": "pytorch_model-00006-of-00007.bin",
256
+ "transformer.blocks.46.norm_1.weight": "pytorch_model-00006-of-00007.bin",
257
+ "transformer.blocks.46.norm_2.weight": "pytorch_model-00006-of-00007.bin",
258
+ "transformer.blocks.47.attn.Wqkv.weight": "pytorch_model-00006-of-00007.bin",
259
+ "transformer.blocks.47.attn.out_proj.weight": "pytorch_model-00006-of-00007.bin",
260
+ "transformer.blocks.47.ffn.down_proj.weight": "pytorch_model-00007-of-00007.bin",
261
+ "transformer.blocks.47.ffn.up_proj.weight": "pytorch_model-00007-of-00007.bin",
262
+ "transformer.blocks.47.norm_1.weight": "pytorch_model-00006-of-00007.bin",
263
+ "transformer.blocks.47.norm_2.weight": "pytorch_model-00006-of-00007.bin",
264
+ "transformer.blocks.5.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
265
+ "transformer.blocks.5.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
266
+ "transformer.blocks.5.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
267
+ "transformer.blocks.5.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
268
+ "transformer.blocks.5.norm_1.weight": "pytorch_model-00001-of-00007.bin",
269
+ "transformer.blocks.5.norm_2.weight": "pytorch_model-00001-of-00007.bin",
270
+ "transformer.blocks.6.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
271
+ "transformer.blocks.6.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
272
+ "transformer.blocks.6.ffn.down_proj.weight": "pytorch_model-00001-of-00007.bin",
273
+ "transformer.blocks.6.ffn.up_proj.weight": "pytorch_model-00001-of-00007.bin",
274
+ "transformer.blocks.6.norm_1.weight": "pytorch_model-00001-of-00007.bin",
275
+ "transformer.blocks.6.norm_2.weight": "pytorch_model-00001-of-00007.bin",
276
+ "transformer.blocks.7.attn.Wqkv.weight": "pytorch_model-00001-of-00007.bin",
277
+ "transformer.blocks.7.attn.out_proj.weight": "pytorch_model-00001-of-00007.bin",
278
+ "transformer.blocks.7.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
279
+ "transformer.blocks.7.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
280
+ "transformer.blocks.7.norm_1.weight": "pytorch_model-00001-of-00007.bin",
281
+ "transformer.blocks.7.norm_2.weight": "pytorch_model-00001-of-00007.bin",
282
+ "transformer.blocks.8.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
283
+ "transformer.blocks.8.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
284
+ "transformer.blocks.8.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
285
+ "transformer.blocks.8.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
286
+ "transformer.blocks.8.norm_1.weight": "pytorch_model-00002-of-00007.bin",
287
+ "transformer.blocks.8.norm_2.weight": "pytorch_model-00002-of-00007.bin",
288
+ "transformer.blocks.9.attn.Wqkv.weight": "pytorch_model-00002-of-00007.bin",
289
+ "transformer.blocks.9.attn.out_proj.weight": "pytorch_model-00002-of-00007.bin",
290
+ "transformer.blocks.9.ffn.down_proj.weight": "pytorch_model-00002-of-00007.bin",
291
+ "transformer.blocks.9.ffn.up_proj.weight": "pytorch_model-00002-of-00007.bin",
292
+ "transformer.blocks.9.norm_1.weight": "pytorch_model-00002-of-00007.bin",
293
+ "transformer.blocks.9.norm_2.weight": "pytorch_model-00002-of-00007.bin",
294
+ "transformer.norm_f.weight": "pytorch_model-00007-of-00007.bin",
295
+ "transformer.wte.weight": "pytorch_model-00001-of-00007.bin"
296
+ }
297
+ }
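
The weight map above is the standard Hugging Face shard index (`pytorch_model.bin.index.json`): it maps each parameter name to the shard file that holds it, so `from_pretrained` can resolve and load the seven `pytorch_model-0000X-of-00007.bin` shards automatically. A minimal loading sketch, assuming a local clone of this repo at `./mpt-30b` (the path and dtype choice are assumptions, not part of this commit):

```
# Minimal sketch: load the sharded checkpoint via the index file.
# "./mpt-30b" is an assumed local path to a clone of this repo.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./mpt-30b",
    torch_dtype=torch.bfloat16,   # optional; fp32 needs considerably more memory
    trust_remote_code=True,       # required: the MPT modeling code ships with the repo
)
```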
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "bos_token": "<|endoftext|>",
+ "eos_token": "<|endoftext|>",
+ "unk_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "add_prefix_space": false,
+ "bos_token": "<|endoftext|>",
+ "clean_up_tokenization_spaces": true,
+ "eos_token": "<|endoftext|>",
+ "model_max_length": 8192,
+ "tokenizer_class": "GPTNeoXTokenizer",
+ "unk_token": "<|endoftext|>"
+ }
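
These tokenizer files register the GPT-NeoX BPE tokenizer, with `<|endoftext|>` doing triple duty as BOS, EOS, and UNK, and `model_max_length` set to 8192 to match the training context window. A minimal sketch of loading it (the local path is the same assumption as above):

```
# Minimal sketch: load the tokenizer shipped with this repo (local path assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./mpt-30b")
print(tokenizer.eos_token)         # <|endoftext|>
print(tokenizer.model_max_length)  # 8192
```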