Update README (#3)
- Update README (4aae4e34b2f87e4b5c3c9b335d93c8830aec9214)
- Update README (5a52a117df438362a1a18110444a9defd4961a31)
Co-authored-by: Daniel Tamayo <[email protected]>
README.md
CHANGED
@@ -81,9 +81,74 @@ Training Hyperparameters

## How to use

You can use the pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> from pprint import pprint
>>> unmasker = pipeline('fill-mask', model='BSC-LT/mRoBERTa')

>>> pprint(unmasker("I love the<mask>of Barcelona.", top_k=3))
[{'score': 0.44038116931915283,
  'sequence': 'I love the city of Barcelona.',
  'token': 31489,
  'token_str': 'city'},
 {'score': 0.10049665719270706,
  'sequence': 'I love the City of Barcelona.',
  'token': 13613,
  'token_str': 'City'},
 {'score': 0.09289316833019257,
  'sequence': 'I love the streets of Barcelona.',
  'token': 178738,
  'token_str': 'streets'}]
>>> pprint(unmasker("Me encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.17127428948879242,
  'sequence': 'Me encanta la historia de Barcelona.',
  'token': 10559,
  'token_str': 'historia'},
 {'score': 0.14173351228237152,
  'sequence': 'Me encanta la ciudad de Barcelona.',
  'token': 19587,
  'token_str': 'ciudad'},
 {'score': 0.06284074485301971,
  'sequence': 'Me encanta la vida de Barcelona.',
  'token': 5019,
  'token_str': 'vida'}]
>>> pprint(unmasker("M'encanta la<mask>de Barcelona.", top_k=3))
[{'score': 0.35796159505844116,
  'sequence': "M'encanta la ciutat de Barcelona.",
  'token': 17128,
  'token_str': 'ciutat'},
 {'score': 0.10453521460294724,
  'sequence': "M'encanta la història de Barcelona.",
  'token': 35763,
  'token_str': 'història'},
 {'score': 0.07609806954860687,
  'sequence': "M'encanta la gent de Barcelona.",
  'token': 15151,
  'token_str': 'gent'}]
```
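The pipeline returns its candidates sorted by score, so the output can be post-processed with plain Python. As a minimal sketch, the list below is mocked with the English scores shown above so it runs without downloading the model; in practice it would come from `unmasker(...)`:

```python
# Mocked fill-mask output, copied from the English example above;
# a real call to unmasker(...) would produce this structure.
predictions = [
    {"score": 0.44038116931915283, "token_str": "city"},
    {"score": 0.10049665719270706, "token_str": "City"},
    {"score": 0.09289316833019257, "token_str": "streets"},
]

# Keep only candidates whose probability clears a confidence threshold.
confident = [p["token_str"] for p in predictions if p["score"] > 0.2]
print(confident)  # ['city']
```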

Alternatively, you can also extract the logits associated with the sequences and perform the calculations by hand:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")
tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")

outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits

# The index of the "<mask>" token is -2, given that position -1 holds the EOS token "</s>".
predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))

print(f'The decoded element is "{predicted_token}".')  # This will give "Madrid"
```
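To make the argmax-over-logits step concrete without loading the model, here is a toy illustration over a five-word vocabulary; the logit values are invented for this sketch, not taken from mRoBERTa:

```python
import math

# Invented logits for the "<mask>" position over a tiny 5-token vocabulary.
vocab = ["the", "city", "Madrid", "streets", "dog"]
logits = [0.1, 1.5, 4.0, 0.7, -2.0]

# Softmax turns raw logits into probabilities like the 'score' values
# reported by the fill-mask pipeline.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# argmax over the logits (equivalently, over the probabilities) picks the prediction.
predicted = vocab[max(range(len(logits)), key=lambda i: logits[i])]
print(predicted)  # Madrid
```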

In most of the evaluations presented below, the model is adapted to each use case, using task-specific logits to encode the text.

## Data

@@ -115,7 +180,7 @@ The following multilingual benchmarks have been considered:

| Benchmark | Description | Languages | Source |
|------------------|-------------|-----------|--------------|
| XTREME | Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
| CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
| Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
| Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies) |