davda54 committed
Commit 2aedc2b · verified · 1 Parent(s): 5ec302a

Update README.md

Files changed (1): README.md (+172, -3)

---
language:
- 'no'
- nb
- nn
- se
inference: false
tags:
- BERT
- GPT-BERT
- NorBERT
- Norwegian
- encoder
- decoder
license: apache-2.0
---

<img src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>

# NorBERT 4 base

The fourth generation of NorBERT models mainly improves efficiency, but it also brings better performance and more flexibility.

<img src="https://huggingface.co/ltg/norbert4-base/resolve/main/model_performance.png" width=100%>

- **Made to encode long texts**: the models were trained on 16,384-token-long texts, and their sliding-window attention can generalize to even longer sequences.
- **Fast and memory-efficient training and inference**: using FlashAttention 2 with unpadding, the new generation of NorBERT models can process long texts with ease.
- **Better performance**: higher-quality training corpora and carefully tuned training settings lead to improved performance over NorBERT 3.
- **BERT as well as GPT**: the models can function both as bidirectional encoders (BERT) and as unidirectional decoders (GPT), which makes them easy to adapt to any downstream use.
- **Trained from scratch**: the models are trained from scratch on 600B tokens of Norwegian Bokmål, Nynorsk and Northern Sámi. We used the HPLT 2.0 corpus, FineWeb2 and Mímir Core.
- **Permissive license**: the checkpoints are distributed freely under Apache 2.0, so anyone can use our models.

> [!TIP]
> We recommend installing Flash Attention 2 and `torch.compile`-ing your models to get the highest training and inference efficiency.
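
If you follow that tip, the loading code can look roughly like the sketch below. This is only a sketch: it assumes Flash Attention 2 is installed (`pip install flash-attn`), and the `attn_implementation` argument is an assumption about the custom wrapper rather than something stated in this model card.

```python
import torch
from transformers import AutoModelForMaskedLM

# Sketch only: load the encoder in half precision with Flash Attention 2 and compile it.
# `attn_implementation="flash_attention_2"` is an assumption -- the custom wrapper may
# ignore it or already use its own FlashAttention kernels.
model = AutoModelForMaskedLM.from_pretrained(
    "ltg/norbert4-base",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = torch.compile(model)  # compile the forward pass for faster training and inference
```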


## All sizes of the NorBERT 4 family

- [NorBERT 4 xsmall (17M)](https://huggingface.co/ltg/norbert4-xsmall)
- [NorBERT 4 small (40M)](https://huggingface.co/ltg/norbert4-small)
- [NorBERT 4 base (149M)](https://huggingface.co/ltg/norbert4-base)
- [NorBERT 4 large (360M)](https://huggingface.co/ltg/norbert4-large)
- [NorBERT 4 xlarge (987M)](https://huggingface.co/ltg/norbert4-xlarge)


## Example usage (bidirectional encoding)

This model currently needs a custom wrapper from `modeling_norbert.py`, so you should load it with `trust_remote_code=True`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the model (the model needs its custom remote code)
tokenizer = AutoTokenizer.from_pretrained(
    "ltg/norbert4-base"
)
model = AutoModelForMaskedLM.from_pretrained(
    "ltg/norbert4-base",
    trust_remote_code=True
)

# Tokenize text (with a mask token inside)
input_text = tokenizer(
    f"Nå ønsker de seg en{tokenizer.mask_token} bolig.",
    return_tensors="pt"
)

# Inference
with torch.inference_mode():
    output_p = model(**input_text)

# Unmask the text by replacing every mask token with its most likely prediction
output_text = torch.where(
    input_text.input_ids == tokenizer.mask_token_id,
    output_p.logits.argmax(-1),
    input_text.input_ids
)

# Decoding; should output: '<s>Nå ønsker de seg en ny bolig.'
print(tokenizer.decode(output_text[0].tolist()))
```
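
Beyond masked language modeling, the same checkpoint can be used as a plain bidirectional encoder through `AutoModel` to produce contextual embeddings. The following is only a sketch: it assumes the wrapper returns the standard `last_hidden_state` field, and the mean pooling is just one common choice, not an official recommendation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ltg/norbert4-base")
model = AutoModel.from_pretrained("ltg/norbert4-base", trust_remote_code=True)

sentences = ["Nå ønsker de seg en ny bolig.", "Oslo er hovedstaden i Norge."]
encoding = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.inference_mode():
    hidden = model(**encoding).last_hidden_state  # assumed standard output field

# Mean-pool over non-padding tokens to get one vector per sentence
mask = encoding.attention_mask.unsqueeze(-1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (2, hidden_size)
```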

## Example usage (text generation)

NorBERT now also supports unidirectional text decoding, so it can generate text like any other GPT model:

89
+ ```python
90
+ import torch
91
+ from transformers import AutoTokenizer, AutoModelForCausalLM
92
+
93
+ # Import model
94
+ tokenizer = AutoTokenizer.from_pretrained(
95
+ "ltg/norbert4-base"
96
+ )
97
+ model = AutoModelForCausalLM.from_pretrained(
98
+ "ltg/norbert4-base",
99
+ trust_remote_code=True
100
+ )
101
+
102
+ # Define zero-shot translation prompt template
103
+ prompt = """Engelsk: {0}
104
+ Bokmål:"""
105
+
106
+ # Define tokens that should end the generation (any token with a newline)
107
+ eos_token_ids = [
108
+ token_id
109
+ for token_id in range(tokenizer.vocab_size)
110
+ if '\n' in tokenizer.decode([token_id])
111
+ ]
112
+
113
+ # Generation function
114
+ @torch.inference_mode()
115
+ def generate(text):
116
+ text = prompt.format(text)
117
+ input_ids = tokenizer(text, return_tensors='pt').input_ids
118
+ prediction = model.generate(
119
+ input_ids,
120
+ max_new_tokens=64,
121
+ do_sample=False,
122
+ eos_token_id=eos_token_ids
123
+ )
124
+ return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()
125
+
126
+ # Example usage
127
+ generate("I'm a model that can generate text!")
128
+ ```
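
A note on the example above: the `eos_token_ids` list collects every vocabulary item whose decoded form contains a newline, so the greedy generation (`do_sample=False`) stops as soon as the model tries to start a new line, which keeps the zero-shot translation to a single line of Bokmål.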

The following classes are currently implemented: `AutoModel`, `AutoModelForMaskedLM`, `AutoModelForCausalLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering` and `AutoModelForMultipleChoice`.
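
As an illustration of one of the task-specific wrappers, here is a minimal sketch of loading the sequence-classification head. The `num_labels` value and the example sentence are placeholders, and the head is randomly initialized, so it has to be fine-tuned before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ltg/norbert4-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "ltg/norbert4-base",
    trust_remote_code=True,
    num_labels=3  # placeholder number of classes for this sketch
)

# Forward pass with a placeholder sentence; the classification head is untrained at this point
inputs = tokenizer("Dette er en veldig bra film!", return_tensors="pt")
with torch.inference_mode():
    logits = model(**inputs).logits

print(logits.shape)  # torch.Size([1, 3])
```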

## Contact

David Samuel: `[email protected]`

## Cite us

```bibtex
@inproceedings{charpentier-samuel-2024-bert,
    title = "{GPT} or {BERT}: why not both?",
    author = "Charpentier, Lucas Georges Gabriel and
      Samuel, David",
    booktitle = "The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning",
    month = nov,
    year = "2024",
    address = "Miami, FL, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.conll-babylm.24/",
    pages = "262--283"
}
```

```bibtex
@inproceedings{samuel-etal-2023-norbench,
    title = "{N}or{B}ench {--} A Benchmark for {N}orwegian Language Models",
    author = "Samuel, David and
      Kutuzov, Andrey and
      Touileb, Samia and
      Velldal, Erik and
      {\O}vrelid, Lilja and
      R{\o}nningstad, Egil and
      Sigdel, Elina and
      Palatkina, Anna",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.61",
    pages = "618--633"
}
```