dmarron dtamayo committed
Commit 0bede83 · verified · 1 Parent(s): 1c42f51

Update README (#3)


- Update README (4aae4e34b2f87e4b5c3c9b335d93c8830aec9214)
- Update README (5a52a117df438362a1a18110444a9defd4961a31)


Co-authored-by: Daniel Tamayo <[email protected]>

Files changed (1): README.md (+69, −4)
README.md CHANGED
@@ -81,9 +81,74 @@ Training Hyperparameters
 
 
 
-
-
-
+ ## How to use
+
+ You can use the `fill-mask` pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from pprint import pprint
+ >>> unmasker = pipeline('fill-mask', model='BSC-LT/mRoBERTa')
+
+ >>> pprint(unmasker("I love the<mask>of Barcelona.", top_k=3))
+ [{'score': 0.44038116931915283,
+   'sequence': 'I love the city of Barcelona.',
+   'token': 31489,
+   'token_str': 'city'},
+  {'score': 0.10049665719270706,
+   'sequence': 'I love the City of Barcelona.',
+   'token': 13613,
+   'token_str': 'City'},
+  {'score': 0.09289316833019257,
+   'sequence': 'I love the streets of Barcelona.',
+   'token': 178738,
+   'token_str': 'streets'}]
+ >>> pprint(unmasker("Me encanta la<mask>de Barcelona.", top_k=3))
+ [{'score': 0.17127428948879242,
+   'sequence': 'Me encanta la historia de Barcelona.',
+   'token': 10559,
+   'token_str': 'historia'},
+  {'score': 0.14173351228237152,
+   'sequence': 'Me encanta la ciudad de Barcelona.',
+   'token': 19587,
+   'token_str': 'ciudad'},
+  {'score': 0.06284074485301971,
+   'sequence': 'Me encanta la vida de Barcelona.',
+   'token': 5019,
+   'token_str': 'vida'}]
+ >>> pprint(unmasker("M'encanta la<mask>de Barcelona.", top_k=3))
+ [{'score': 0.35796159505844116,
+   'sequence': "M'encanta la ciutat de Barcelona.",
+   'token': 17128,
+   'token_str': 'ciutat'},
+  {'score': 0.10453521460294724,
+   'sequence': "M'encanta la història de Barcelona.",
+   'token': 35763,
+   'token_str': 'història'},
+  {'score': 0.07609806954860687,
+   'sequence': "M'encanta la gent de Barcelona.",
+   'token': 15151,
+   'token_str': 'gent'}]
+ ```
+
+ Alternatively, you can extract the logits at the masked position and compute the predictions by hand (a scoring sketch along these lines follows after the diff):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")
+ tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
+
+ outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits
+
+ # The "<mask>" token sits at index -2, since position -1 is the EOS token "</s>".
+ predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))
+
+ print(f"The decoded element is \"{predicted_token}\".")  # This will print "Madrid"
+ ```
+
+ In most of the evaluations presented below, the model is adapted to each use case and task-specific outputs are used to encode the text (a sketch of this kind of adaptation follows after the diff).
 
  ## Data
 
 
@@ -115,7 +180,7 @@ The following multilingual benchmarks have been considered:
  | Benchmark | Description | Languages | Source |
  |------------------|-------------|-----------|--------------|
  | XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
- | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://club.aina.bsc.es/datasets.html) |
+ | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
  | Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
  | Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies)|
 
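
As a companion to the manual-logits snippet added in this commit, here is a minimal sketch of reproducing the `fill-mask` pipeline's top-k scores by hand: take a softmax over the vocabulary at the `<mask>` position and decode the highest-scoring tokens. The `BSC-LT/mRoBERTa` checkpoint comes from the diff above; the example sentence and the value of `k` are illustrative assumptions, not part of the commit.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")

# Same prompt style as the pipeline example; the sentence itself is arbitrary.
inputs = tokenizer("I love the<mask>of Barcelona.", return_tensors="pt")
logits = model(**inputs).logits

# Locate the <mask> position instead of hard-coding a negative index.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

# A softmax over the vocabulary at that position yields the scores the
# fill-mask pipeline reports; keep the three most likely tokens.
probs = torch.softmax(logits[0, mask_index], dim=-1)
top = torch.topk(probs, k=3)

for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)).strip()!r}: {score.item():.4f}")
```

Looking up the mask position via `tokenizer.mask_token_id` avoids relying on a fixed index such as `-2`, which only holds when `<mask>` is the final word of the prompt.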
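
The closing sentence of the added section mentions adapting the model to each use case. Below is a hedged sketch of what such an adaptation could look like with the Transformers API; wrapping the encoder with `AutoModelForSequenceClassification`, the three-label head, and the example sentence are illustrative assumptions, not the evaluation setup documented in this commit.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
# num_labels=3 is a hypothetical label count; the new classification head is
# randomly initialised and would still need fine-tuning on task data.
model = AutoModelForSequenceClassification.from_pretrained(
    "BSC-LT/mRoBERTa", num_labels=3
)

batch = tokenizer(
    ["M'encanta la ciutat de Barcelona."],
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    task_logits = model(**batch).logits  # shape (1, 3): one score per label

print(task_logits.shape)
```

Only after fine-tuning the head on each task's training split would outputs like these be comparable to the benchmark results referenced in the table.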