peralp24 committed · Commit 2d3b9ae · verified · 1 Parent(s): 4a6fca0

Update README.md

Files changed (1): README.md (+23 -0)
README.md CHANGED
@@ -230,6 +230,29 @@ We ablate how performance changes when not using task-specific instructions for
 |Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|


+ #### Methodology for Multilingual Evaluations (European languages)
+ * Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval, etc.). A task can have
+ several subsets for different languages, and a subset can itself cover multiple languages, e.g. in translation-related tasks. Our base script
+ comes from [gritlm/evaluation/eval_mteb.py at main · ContextualAI/gritlm](https://github.com/ContextualAI/gritlm/blob/main/evaluation/eval_mteb.py) and
+ includes Medi2-style instructions for many MTEB tasks. The instructions are all in English. All evaluations use Medi2-style instructions except for
+ the “no instructions” case (see above). If a task has no Medi2-style instructions, we skip it (see the first sketch after this list). As the European languages for the
+ MTEB tests we used German, Italian, Spanish, Portuguese and French.
+ * For our Multilingual Evaluations (European languages) we use the tasks
+ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchmark/mteb](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/task_selection/europe_tasks.csv) and then filter for tasks that have at least one subset covering at least one of the European languages (see the selection sketch below).
+ * We skip BibleNLPBitextMining and FloresBitextMining because they have no ‘test’ split, only a ‘train’ split, which we don’t want to use for evaluation (→ training-data contamination is likely)
+ * We evaluate every subset that contains at least one of the European languages → this is also why there is an “English” language column: subsets such as En ↔︎ De contain a European language and are therefore included
+ * The tasks that remain are
+ - AmazonCounterfactualClassification
+ - BUCC.v2
+ - DiaBlaBitextMining
+ - MassiveScenarioClassification
+ - NTREXBitextMining
+ - STS17
+ * For NTREXBitextMining the subsets are further filtered down to only pairs of the European languages, instead of pairs with at least one European language (see the pair-enumeration sketch below)
+ - i.e. this gives 20 - 2 = 18 translation-pair subsets between the 5 languages (5 × 4 = 20 ordered pairs, minus 2 because Italian ↔︎ German doesn’t exist in either direction)
+ - this is done because otherwise there would be 250 translation-pair subsets, most of which are not as relevant (e.g. Vietnamese ↔︎ Portuguese)
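+
+ To make the skip rule concrete, here is a minimal sketch; the `MEDI2_INSTRUCTIONS` mapping is a hypothetical stand-in for the instruction table shipped with `eval_mteb.py`, not its actual name:
+ ```python
+ # Hypothetical instruction table; the real one lives in eval_mteb.py
+ # and maps MTEB task names to English Medi2-style instructions.
+ MEDI2_INSTRUCTIONS: dict[str, str] = {
+     "STS17": "Represent the sentence for semantic similarity.",  # illustrative entry only
+     # ...
+ }
+
+ def instruction_for(task_name: str) -> str | None:
+     """Return the Medi2-style instruction for a task, or None.
+
+     Tasks for which this returns None are skipped entirely.
+     """
+     return MEDI2_INSTRUCTIONS.get(task_name)
+ ```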
+
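+ The task-selection step can be sketched as follows; the `(task_name, subset_langs)` record shape is a simplification for illustration, not the actual schema of europe_tasks.csv:
+ ```python
+ # ISO 639-3 codes for the five European languages used in our tests.
+ EURO_LANGS = {"deu", "ita", "spa", "por", "fra"}
+
+ # Skipped because they only ship a 'train' split (contamination risk).
+ NO_TEST_SPLIT = {"BibleNLPBitextMining", "FloresBitextMining"}
+
+ def select_tasks(rows):
+     """Keep tasks with at least one subset containing a European language.
+
+     `rows`: iterable of (task_name, subset_langs) pairs, where
+     subset_langs is the set of language codes of one subset.
+     """
+     selected = set()
+     for task_name, subset_langs in rows:
+         if task_name in NO_TEST_SPLIT:
+             continue
+         if subset_langs & EURO_LANGS:
+             selected.add(task_name)
+     return sorted(selected)
+ ```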
+
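+ The NTREX pair filtering is easy to reproduce; the enumeration below yields exactly the 18 ordered pairs counted above (the Flores-style `xxx_Latn` codes are an assumption about the subset naming):
+ ```python
+ from itertools import permutations
+
+ EURO = ["deu_Latn", "fra_Latn", "ita_Latn", "por_Latn", "spa_Latn"]
+
+ # 5 * 4 = 20 ordered pairs; Italian <-> German does not exist in
+ # NTREXBitextMining, so both directions are removed, leaving 18.
+ MISSING = {("ita_Latn", "deu_Latn"), ("deu_Latn", "ita_Latn")}
+
+ pairs = [f"{src}-{tgt}" for src, tgt in permutations(EURO, 2)
+          if (src, tgt) not in MISSING]
+ assert len(pairs) == 18
+ ```
+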
 ## Bias, Risks, and Limitations

 <!-- This section is meant to convey both technical and sociotechnical limitations. -->