Update README.md
README.md CHANGED
@@ -230,6 +230,29 @@ We ablate how performance changes when not using task-specific instructions for
|Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|


+#### Methodology for Multilingual Evaluations (European languages)
+* Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval, etc.). A task can have N subsets for different languages, and the subsets themselves can also contain N languages, e.g. in translation-related tasks. The base script comes from [gritlm/evaluation/eval_mteb.py at main · ContextualAI/gritlm](https://github.com/ContextualAI/gritlm/blob/main/evaluation/eval_mteb.py) and includes Medi2-style instructions for many MTEB tasks. The instructions are all in English. All evaluations use Medi2-style instructions except for the “no instructions” case (see above); if a task does not have Medi2-style instructions, we skip the task (see the instruction-handling sketch after this list). German, Italian, Spanish, Portuguese, and French were used as the European languages for the MTEB tests.
+* For our Multilingual Evaluations (European languages) we use the tasks from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchmark/mteb](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/task_selection/europe_tasks.csv) and then filter for tasks where there is at least one subset with at least one of the European languages (see the selection sketch after this list).
+* We skip BibleNLPBitextMining and FloresBitextMining because they don’t have ‘test’ splits, only a ‘train’ split, which we don’t want to use for evaluation (→ training data contamination likely)
+* We evaluate subsets which contain at least one of the European languages → that’s why there is also an “English” language column: some subsets are e.g. En ↔︎ De and are thus included
+* The tasks that remain are
+  - AmazonCounterfactualClassification
+  - BUCC.v2
+  - DiaBlaBitextMining
+  - MassiveScenarioClassification
+  - NTREXBitextMining
+  - STS17
+* For NTREXBitextMining the subsets are further filtered down to only pairs of the European languages instead of at least one European language (see the pair sketch after this list)
+  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
+  - this is done because otherwise there are 250 translation pair subsets, which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
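Below is a minimal sketch of the instruction handling described in the first bullet. It is an illustration, not the actual eval script: `INSTRUCTIONS` stands in for the English-only Medi2-style mapping shipped with `eval_mteb.py`, and the placeholder strings and the `select_runnable` helper are hypothetical names.

```python
# Sketch only: INSTRUCTIONS stands in for the English-only Medi2-style
# mapping shipped with gritlm's eval_mteb.py; the strings are placeholders.
INSTRUCTIONS = {
    "AmazonCounterfactualClassification": "<Medi2-style instruction>",
    "BUCC.v2": "<Medi2-style instruction>",
    "STS17": "<Medi2-style instruction>",
}

def select_runnable(task_names, use_instructions=True):
    """Yield (task, instruction) pairs, skipping tasks with no instruction."""
    for name in task_names:
        if not use_instructions:      # the "no instructions" ablation case
            yield name, ""
            continue
        instruction = INSTRUCTIONS.get(name)
        if instruction is None:       # no Medi2-style instruction -> skip task
            print(f"skipping {name}")
            continue
        yield name, instruction

for task, instruction in select_runnable(["BUCC.v2", "SomeUncoveredTask"]):
    print(task, instruction)
```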
+
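The task selection could look roughly like the following sketch. Treat the CSV column name (`task`), the `mteb.get_task` accessor, and the `eval_splits`/`eval_langs` metadata attributes as assumptions taken from recent `mteb` releases; they may differ in the version actually used.

```python
# Sketch of the task selection from europe_tasks.csv. Column and attribute
# names are assumptions, not a pinned API.
import csv
import urllib.request

import mteb

EURO_LANGS = {"deu", "ita", "spa", "por", "fra"}  # ISO 639-3 codes

CSV_URL = ("https://raw.githubusercontent.com/embeddings-benchmark/mteb/"
           "main/scripts/task_selection/europe_tasks.csv")

def task_names():
    with urllib.request.urlopen(CSV_URL) as f:
        return [row["task"] for row in csv.DictReader(l.decode() for l in f)]

def keep(name):
    meta = mteb.get_task(name).metadata
    if "test" not in meta.eval_splits:
        # e.g. BibleNLPBitextMining / FloresBitextMining only ship a 'train'
        # split -> skipped to avoid likely training-data contamination.
        return False
    langs = meta.eval_langs
    # Multilingual tasks map subset -> ["deu-Latn", ...]; flat tasks are lists.
    flat = (l for ls in langs.values() for l in ls) if isinstance(langs, dict) else langs
    return any(l.split("-")[0] in EURO_LANGS for l in flat)

print([n for n in task_names() if keep(n)])
```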
+
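And the pair filtering for NTREXBitextMining, with the arithmetic from the bullet above; the `src-tgt` subset-name format is an assumption for illustration.

```python
# Sketch of the NTREXBitextMining subset filter: keep only directed pairs
# where BOTH sides are one of the five European languages.
from itertools import permutations

EURO_LANGS = ["deu", "fra", "ita", "por", "spa"]
MISSING = {("ita", "deu"), ("deu", "ita")}  # Italian <-> German doesn't exist

pairs = [f"{src}-{tgt}"
         for src, tgt in permutations(EURO_LANGS, 2)
         if (src, tgt) not in MISSING]

print(len(pairs))  # 5 * 4 = 20 directed pairs, minus the 2 missing -> 18
```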
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->