Update README

Add use case examples and update the CLUB link.

README.md (CHANGED)

@@ -81,9 +81,73 @@ Training Hyperparameters
81 |
82 |
83 |
84 | -
85 | -
86 | -
87 |
88 | ## Data
89 |
@@ -115,7 +179,7 @@ The following multilingual benchmarks have been considered:

115 | | Benchmark | Description | Languages | Source |
116 | |------------------|-------------|-----------|--------------|
117 | | XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
118 | - | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://
119 | | Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
120 | | Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) |
121 |
81 |
82 |
83 |
84 | + ## How to use
85 | +
86 | + You can use the pipeline for masked language modeling:
87 | +
88 | + ```python
89 | + >>> from transformers import pipeline
90 | + >>> from pprint import pprint
91 | + >>> unmasker = pipeline("fill-mask", model="BSC-LT/mRoBERTa")
92 | + >>> pprint(unmasker("I love the <mask> of Barcelona.", top_k=3))
93 | + [{'score': 0.44038116931915283,
94 | +   'sequence': 'I love the city of Barcelona.',
95 | +   'token': 31489,
96 | +   'token_str': 'city'},
97 | +  {'score': 0.10049665719270706,
98 | +   'sequence': 'I love the City of Barcelona.',
99 | +   'token': 13613,
100 | +   'token_str': 'City'},
101 | +  {'score': 0.09289316833019257,
102 | +   'sequence': 'I love the streets of Barcelona.',
103 | +   'token': 178738,
104 | +   'token_str': 'streets'}]
105 | + >>> pprint(unmasker("Me encanta la <mask> de Barcelona.", top_k=3))
106 | + [{'score': 0.17127428948879242,
107 | +   'sequence': 'Me encanta la historia de Barcelona.',
108 | +   'token': 10559,
109 | +   'token_str': 'historia'},
110 | +  {'score': 0.14173351228237152,
111 | +   'sequence': 'Me encanta la ciudad de Barcelona.',
112 | +   'token': 19587,
113 | +   'token_str': 'ciudad'},
114 | +  {'score': 0.06284074485301971,
115 | +   'sequence': 'Me encanta la vida de Barcelona.',
116 | +   'token': 5019,
117 | +   'token_str': 'vida'}]
118 | + >>> pprint(unmasker("M'encanta la <mask> de Barcelona.", top_k=3))
119 | + [{'score': 0.35796159505844116,
120 | +   'sequence': "M'encanta la ciutat de Barcelona.",
121 | +   'token': 17128,
122 | +   'token_str': 'ciutat'},
123 | +  {'score': 0.10453521460294724,
124 | +   'sequence': "M'encanta la història de Barcelona.",
125 | +   'token': 35763,
126 | +   'token_str': 'història'},
127 | +  {'score': 0.07609806954860687,
128 | +   'sequence': "M'encanta la gent de Barcelona.",
129 | +   'token': 15151,
130 | +   'token_str': 'gent'}]
131 | + ```
132 | +
133 | + Alternatively, you can also extract the logits associated with the sequences and perform the calculations by hand:
134 | +
135 | + ```python
136 | + from transformers import AutoTokenizer, AutoModelForMaskedLM
137 | + import torch
138 | +
139 | + model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")
140 | + tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
141 | +
142 | + outputs = model(**tokenizer("The capital of Spain is <mask>", return_tensors="pt")).logits
143 | +
144 | + # The index of the "<mask>" token is -2, since position -1 holds the EOS token "</s>".
145 | + predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))
146 | +
147 | + print(f"The decoded element is \"{predicted_token}\".")  # This will give "Madrid"
148 | + ```
149 | +
150 | + In most of the evaluations presented below, the model is adapted to each use case, using task-specific logits to encode the text.
151 |
152 | ## Data
153 |
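The mask-index comment in the snippet above relies on RoBERTa-style tokenizers wrapping every input as `<s> … </s>`, so a `<mask>` at the end of the text lands at position -2. A toy illustration with made-up token strings (a real tokenizer returns IDs, not strings):

```python
# RoBERTa-style tokenizers add special tokens around the input:
# <s> at the start (BOS) and </s> at the end (EOS).
tokens = ["<s>", "The", "capital", "of", "Spain", "is", "<mask>", "</s>"]

mask_index = tokens.index("<mask>")

# The mask sits just before the EOS token, i.e. at index -2.
assert tokens[-1] == "</s>"
assert mask_index == len(tokens) - 2
```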
179 | | Benchmark | Description | Languages | Source |
180 | |------------------|-------------|-----------|--------------|
181 | | XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
182 | + | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
183 | | Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
184 | | Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) |
185 |
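For reference, the `score` values returned by the fill-mask pipeline in the added README section are softmax probabilities over the logits at the mask position. A minimal stdlib sketch with toy numbers (the logits are made up; real ones would come from `outputs[0, -2, :]` in the snippet above):

```python
import math

def softmax(logits):
    # Shift by the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a single <mask> position over a 5-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0, 0.0]
probs = softmax(logits)

# top_k=3 keeps the three highest-probability token indices,
# mirroring the ranking the pipeline returns.
top3 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:3]
```

Sorting the probabilities and keeping the first `top_k` entries reproduces the ordering of the pipeline's output list.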