dtamayo committed · Commit 4aae4e3 (verified) · 1 parent: fcf163f

Update README


Add use case examples and update the CLUB link.

Files changed (1): README.md (+68 -4)
README.md CHANGED
@@ -81,9 +81,73 @@ Training Hyperparameters
 
 
 
-
-
-
+ ## How to use
+
+ You can use the fill-mask pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from pprint import pprint
+
+ >>> unmasker = pipeline("fill-mask", model="BSC-LT/mRoBERTa")
+ >>> pprint(unmasker("I love the<mask>of Barcelona.", top_k=3))
+ [{'score': 0.44038116931915283,
+   'sequence': 'I love the city of Barcelona.',
+   'token': 31489,
+   'token_str': 'city'},
+  {'score': 0.10049665719270706,
+   'sequence': 'I love the City of Barcelona.',
+   'token': 13613,
+   'token_str': 'City'},
+  {'score': 0.09289316833019257,
+   'sequence': 'I love the streets of Barcelona.',
+   'token': 178738,
+   'token_str': 'streets'}]
+ >>> pprint(unmasker("Me encanta la<mask>de Barcelona.", top_k=3))
+ [{'score': 0.17127428948879242,
+   'sequence': 'Me encanta la historia de Barcelona.',
+   'token': 10559,
+   'token_str': 'historia'},
+  {'score': 0.14173351228237152,
+   'sequence': 'Me encanta la ciudad de Barcelona.',
+   'token': 19587,
+   'token_str': 'ciudad'},
+  {'score': 0.06284074485301971,
+   'sequence': 'Me encanta la vida de Barcelona.',
+   'token': 5019,
+   'token_str': 'vida'}]
+ >>> pprint(unmasker("M'encanta la<mask>de Barcelona.", top_k=3))
+ [{'score': 0.35796159505844116,
+   'sequence': "M'encanta la ciutat de Barcelona.",
+   'token': 17128,
+   'token_str': 'ciutat'},
+  {'score': 0.10453521460294724,
+   'sequence': "M'encanta la història de Barcelona.",
+   'token': 35763,
+   'token_str': 'història'},
+  {'score': 0.07609806954860687,
+   'sequence': "M'encanta la gent de Barcelona.",
+   'token': 15151,
+   'token_str': 'gent'}]
+ ```
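If you only need scores for a handful of candidate words, the fill-mask pipeline also accepts a `targets` argument. A minimal sketch reusing the `unmasker` object from the snippet above (the candidate list is purely illustrative):

```python
# Score only a few illustrative candidate words instead of the whole vocabulary.
candidates = ["city", "beaches", "food"]
for prediction in unmasker("I love the<mask>of Barcelona.", targets=candidates):
    print(prediction["token_str"], round(prediction["score"], 4))
```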
+
+ Alternatively, you can extract the logits associated with each sequence and perform the calculations by hand:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ model = AutoModelForMaskedLM.from_pretrained("BSC-LT/mRoBERTa")
+ tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
+
+ # Logits over the vocabulary for every position of the input sequence.
+ outputs = model(**tokenizer("The capital of Spain is<mask>", return_tensors="pt")).logits
+
+ # The "<mask>" token is at index -2, given that position -1 is the EOS token "</s>".
+ predicted_token = tokenizer.decode(torch.argmax(outputs[0, -2, :]))
+
+ print(f'The decoded element is "{predicted_token}".')  # This will give "Madrid"
+ ```
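To take the by-hand computation one step further, the same mask-position logits can be normalized with a softmax and ranked. A minimal sketch reusing `model`, `tokenizer`, and `outputs` from the snippet above:

```python
import torch

# Probability distribution over the vocabulary at the "<mask>" position (index -2).
probs = torch.softmax(outputs[0, -2, :], dim=-1)

# Decode the three most likely fillers, mirroring the pipeline's top_k=3 output.
top = torch.topk(probs, k=3)
for score, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode(token_id)!r}: {score:.4f}")
```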
+
+ In most of the evaluations presented below, the model is adapted to each use case, using task-specific logits to encode the text.
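As a purely hypothetical sketch of such an adaptation (the task, label count, and classification head below are assumptions, not details of the actual evaluation setup), the same checkpoint can be loaded with a task head and then fine-tuned on labeled data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/mRoBERTa")
# Hypothetical two-label task: the classification head is randomly initialized
# and still has to be fine-tuned on task-specific data before any evaluation.
model = AutoModelForSequenceClassification.from_pretrained("BSC-LT/mRoBERTa", num_labels=2)

batch = tokenizer(["I love the city of Barcelona."], return_tensors="pt")
print(model(**batch).logits.shape)  # torch.Size([1, 2])
```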
 
 ## Data
 
 
@@ -115,7 +179,7 @@ The following multilingual benchmarks have been considered:
 | Benchmark | Description | Languages | Source |
 |------------------|-------------|-----------|--------------|
 | XTREME| Benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models | bg,ca,de,el,en,es,et,eu,fi,fr,hu,it,lt,nl,pl,pt,ro,ru,uk | [LINK](https://github.com/google-research/xtreme) |
- | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://club.aina.bsc.es/datasets.html) |
+ | CLUB | Human-Annotated Catalan Benchmark | ca | [LINK](https://github.com/projecte-aina/club) |
 | Basque Custom Benchmark | A set of NER, POS and TC evaluation tasks to assess the performance in the Basque language. | eu | [LINK](https://huggingface.co/datasets/orai-nlp/basqueGLUE) |
 | Galician Custom Benchmark | NER and POS evaluation tasks to assess the performance in the Galician language. | gl | [LINK](https://github.com/xavier-gz/SLI_Galician_Corpora/blob/main/SLI_NERC_Galician_Gold_CoNLL.1.0.tar.gz) [LINK](https://huggingface.co/datasets/universal-dependencies/universal_dependencies)|