Update README (#3)
- Update README (4aae4e34b2f87e4b5c3c9b335d93c8830aec9214)
- Update README (5a52a117df438362a1a18110444a9defd4961a31)
Co-authored-by: Daniel Tamayo <[email protected]>
README.md
CHANGED
@@ -81,9 +81,74 @@ Training Hyperparameters

## How to use

You can use the pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='BSC-LT/mRoBERTa')

>>> pprint(unmasker("I love the<mask>of Barcelona.", top_k=3))
[{'score': 0.44038116931915283,
  'sequence': 'I love the city of Barcelona.',
  'token': 31489,
  'token_str': 'city'},
 {'score': 0.10049665719270706,
  'sequence': 'I love the City of Barcelona.',
  'token': 13613,
  'token_str': 'City'},
 {'score': 0.09289316833019257,
  'sequence': 'I love the streets of Barcelona.',
  'token': 178738,
  'token_str': 'streets'}]
>>> pprint(unmasker("Me encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.17127428948879242,
  'sequence': 'Me encanta la historia de Barcelona.',
  'token': 10559,
  'token_str': 'historia'},
 {'score': 0.14173351228237152,
  'sequence': 'Me encanta la ciudad de Barcelona.',
  'token': 19587,
  'token_str': 'ciudad'},
 {'score': 0.06284074485301971,
  'sequence': 'Me encanta la vida de Barcelona.',
  'token': 5019,
  'token_str': 'vida'}]
>>> pprint(unmasker("M'encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.35796159505844116,
  'sequence': "M'encanta la ciutat de Barcelona.",
  'token': 17128,
  'token_str': 'ciutat'},
 {'score': 0.10453521460294724,
  'sequence': "M'encanta la història de Barcelona.",
  'token': 35763,
  'token_str': 'història'},
 {'score': 0.07609806954860687,
  'sequence': "M'encanta la gent de Barcelona.",
  'token': 15151,
  'token_str': 'gent'}]
```
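The pipeline returns its candidates sorted by score, so the output can be post-processed with plain Python. As a minimal sketch, the list below is mocked with the English scores shown above so it runs without downloading the model; in practice it would come from `unmasker(...)`:

```python
# Mocked fill-mask output, copied from the English example above;
# a real call to unmasker(...) would produce this structure.
predictions = [
    {"score": 0.44038116931915283, "token_str": "city"},
    {"score": 0.10049665719270706, "token_str": "City"},
    {"score": 0.09289316833019257, "token_str": "streets"},
]

# Keep only candidates whose probability clears a confidence threshold.
confident = [p["token_str"] for p in predictions if p["score"] > 0.2]
print(confident)  # ['city']
```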

Alternatively, you can also extract the logits associated with the sequences and perform the calculations by hand:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")

outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits

# The index of the "<mask>" token is -2, given that position -1 holds the EOS token "</s>".
predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))

print(f'The decoded element is "{predicted_token}".')  # This will give "Madrid"
```
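To make the argmax-over-logits step concrete without loading the model, here is a toy illustration over a five-word vocabulary; the logit values are invented for this sketch, not taken from mRoBERTa:

```python
import math

# Invented logits for the "<mask>" position over a tiny 5-token vocabulary.
vocab = ["the", "city", "Madrid", "streets", "dog"]
logits = [0.1, 1.5, 4.0, 0.7, -2.0]

# Softmax turns raw logits into probabilities like the 'score' values
# reported by the fill-mask pipeline.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# argmax over the logits (equivalently, over the probabilities) picks the prediction.
predicted = vocab[max(range(len(logits)), key=lambda i: logits[i])]
print(predicted)  # Madrid
```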

In most of the evaluations presented below, the model is adapted to each use case, using task-specific logits to encode the text.

## Data

@@ -115,7 +180,7 @@ The following multilingual benchmarks have been considered:

| Benchmark | Description | Languages | Source |
|------------------|-------------|-----------|--------------|
| XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
| CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
| Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
| Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) |