---
language: en
tags:
- exbert
license: mit
datasets:
- bookcorpus
- wikipedia
---

# RoBERTa base model

Pretrained model on the English language using a masked language modeling (MLM) objective. It was introduced in
[this paper](https://arxiv.org/abs/1907.11692) and first released in
[this repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta). This model is case-sensitive: it
makes a difference between english and English.

Disclaimer: The team releasing RoBERTa did not write a model card for this model, so this model card has been written
by the Hugging Face team.

## Model description

RoBERTa is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means
it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of
publicly available data), with an automatic process to generate inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model
randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to
predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the
words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows
the model to learn a bidirectional representation of the sentence.

This way, the model learns an inner representation of the English language that can then be used to extract features
useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
classifier using the features produced by the RoBERTa model as inputs.
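
As an illustration (not part of the original card), here is a minimal sketch of that feature-extraction workflow: the
hidden state of the `<s>` token is used as a sentence representation and fed to a scikit-learn classifier. The toy
texts and labels are made up for the example.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

texts = ["I loved this movie.", "This was a waste of time."]  # toy labeled data
labels = [1, 0]

with torch.no_grad():
    encoded = tokenizer(texts, padding=True, return_tensors='pt')
    # Hidden state of the <s> token (position 0) as a fixed-size sentence feature.
    features = model(**encoded).last_hidden_state[:, 0, :].numpy()

classifier = LogisticRegression().fit(features, labels)
```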

## Intended uses & limitations

You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=roberta) to look for fine-tuned versions on a task that
interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
generation you should look at a model like GPT2.
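
As a hedged sketch of what such fine-tuning looks like (not an official recipe; the toy data and hyperparameters are
made up), a sequence-classification head can be trained on top of the pretrained encoder:

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

texts = ["great movie", "terrible movie"]   # toy examples
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=labels)    # forward pass returns loss and logits
outputs.loss.backward()
optimizer.step()
```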

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("Hello I'm a <mask> model.")

[{'sequence': "<s>Hello I'm a male model.</s>",
  'score': 0.3306540250778198,
  'token': 2943,
  'token_str': 'Ġmale'},
 {'sequence': "<s>Hello I'm a female model.</s>",
  'score': 0.04655390977859497,
  'token': 2182,
  'token_str': 'Ġfemale'},
 {'sequence': "<s>Hello I'm a professional model.</s>",
  'score': 0.04232972860336304,
  'token': 2038,
  'token_str': 'Ġprofessional'},
 {'sequence': "<s>Hello I'm a fashion model.</s>",
  'score': 0.037216778844594955,
  'token': 2734,
  'token_str': 'Ġfashion'},
 {'sequence': "<s>Hello I'm a Russian model.</s>",
  'score': 0.03253649175167084,
  'token': 1083,
  'token_str': 'ĠRussian'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("The man worked as a <mask>.")

[{'sequence': '<s>The man worked as a mechanic.</s>',
  'score': 0.08702439814805984,
  'token': 25682,
  'token_str': 'Ġmechanic'},
 {'sequence': '<s>The man worked as a waiter.</s>',
  'score': 0.0819653645157814,
  'token': 38233,
  'token_str': 'Ġwaiter'},
 {'sequence': '<s>The man worked as a butcher.</s>',
  'score': 0.073323555290699,
  'token': 32364,
  'token_str': 'Ġbutcher'},
 {'sequence': '<s>The man worked as a miner.</s>',
  'score': 0.046322137117385864,
  'token': 18678,
  'token_str': 'Ġminer'},
 {'sequence': '<s>The man worked as a guard.</s>',
  'score': 0.040150221437215805,
  'token': 2510,
  'token_str': 'Ġguard'}]

>>> unmasker("The Black woman worked as a <mask>.")

[{'sequence': '<s>The Black woman worked as a waitress.</s>',
  'score': 0.22177888453006744,
  'token': 35698,
  'token_str': 'Ġwaitress'},
 {'sequence': '<s>The Black woman worked as a prostitute.</s>',
  'score': 0.19288744032382965,
  'token': 36289,
  'token_str': 'Ġprostitute'},
 {'sequence': '<s>The Black woman worked as a maid.</s>',
  'score': 0.06498628109693527,
  'token': 29754,
  'token_str': 'Ġmaid'},
 {'sequence': '<s>The Black woman worked as a secretary.</s>',
  'score': 0.05375480651855469,
  'token': 2971,
  'token_str': 'Ġsecretary'},
 {'sequence': '<s>The Black woman worked as a nurse.</s>',
  'score': 0.05245552211999893,
  'token': 9008,
  'token_str': 'Ġnurse'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data

The RoBERTa model was pretrained on the union of five datasets:
- [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038 unpublished books;
- [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and headers);
- [CC-News](https://commoncrawl.org/2016/10/news-dataset-available/), a dataset containing 63 million English news
  articles crawled between September 2016 and February 2019;
- [OpenWebText](https://github.com/jcpeterson/openwebtext), an open-source recreation of the WebText dataset used to
  train GPT-2;
- [Stories](https://arxiv.org/abs/1806.02847), a dataset containing a subset of CommonCrawl data filtered to match the
  story-like style of Winograd schemas.

Together these datasets amount to 160GB of text.
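
As an aside (not from the original card), the two corpora listed in the metadata can be pulled through the `datasets`
library; the dataset and configuration names below are assumptions and their availability may have changed:

```python
from datasets import load_dataset

# Assumed dataset identifiers; check the Hugging Face Hub for the current names.
bookcorpus = load_dataset("bookcorpus", split="train")
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

print(bookcorpus[0]["text"][:200])  # peek at a raw training example
```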

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000. The
inputs of the model take pieces of 512 contiguous tokens that may span documents. The beginning of a new document is
marked with `<s>` and the end of one by `</s>`.
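
For illustration (not part of the original card), the tokenizer shipped with `roberta-base` can be inspected directly;
the printed pieces depend on the tokenizer files, so treat the comments below as indicative:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Case-sensitive byte-level BPE pieces (note the 'Ġ' marker for a leading space).
print(tokenizer.tokenize("english and English"))

encoded = tokenizer("Hello world")
print(encoded['input_ids'])                                   # ids wrapped in <s> ... </s>
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
```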

The details of the masking procedure for each sentence are the following:
- 15% of the tokens are masked.
- In 80% of the cases, the masked tokens are replaced by `<mask>`.
- In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
- In the remaining 10% of the cases, the masked tokens are left as is.

Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
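
A rough sketch of this dynamic masking scheme (not the original fairseq implementation) is shown below; in practice,
`transformers.DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)` applies the same 80/10/10
rule each time a batch is built:

```python
import random

def dynamic_mask(token_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Select 15% of tokens each time a batch is built, then apply the 80/10/10 rule."""
    masked = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions ignored by the MLM loss
    for i, token_id in enumerate(token_ids):
        if random.random() < mlm_probability:
            labels[i] = token_id
            roll = random.random()
            if roll < 0.8:
                masked[i] = mask_token_id                  # 80%: replace with <mask>
            elif roll < 0.9:
                masked[i] = random.randrange(vocab_size)   # 10%: random token
            # remaining 10%: keep the original token
    return masked, labels
```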

### Pretraining

The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. The
optimizer used is Adam with a learning rate of 6e-4, \\(\beta_{1} = 0.9\\), \\(\beta_{2} = 0.98\\) and
\\(\epsilon = 1e-6\\), a weight decay of 0.01, learning rate warmup for 24,000 steps and linear decay of the learning
rate after.
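
As a sketch only (the original pretraining was done with fairseq, not the code below), an equivalent optimizer and
schedule can be expressed in PyTorch with the hyperparameters quoted above:

```python
import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Adam with decoupled weight decay, matching the betas/epsilon/weight decay above.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01
)
# Linear warmup for 24,000 steps, then linear decay over the 500K total steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=24_000, num_training_steps=500_000
)
```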

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  |
|:----:|:----:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|
|      | 87.6 | 91.9 | 92.8 | 94.8  | 63.6 | 91.2  | 90.2 | 78.7 |

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-1907-11692,
  author    = {Yinhan Liu and
               Myle Ott and
               Naman Goyal and
               Jingfei Du and
               Mandar Joshi and
               Danqi Chen and
               Omer Levy and
               Mike Lewis and
               Luke Zettlemoyer and
               Veselin Stoyanov},
  title     = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
  journal   = {CoRR},
  volume    = {abs/1907.11692},
  year      = {2019},
  url       = {http://arxiv.org/abs/1907.11692},
  archivePrefix = {arXiv},
  eprint    = {1907.11692},
  timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

<a href="https://huggingface.co/exbert/?model=roberta-base">
	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>