Update README.md
README.md CHANGED
@@ -230,6 +230,29 @@ We ablate how performance changes when not using task-specific instructions for
|Relative Δ|-1.71%|-2.32%|-0.13%|-0.01%|-2.31%|-7.59%|-0.64%|-0.91%|-0.80%|2.22%|0.16%|-2.97%|**-1.09%**|


+#### Methodology for Multilingual Evaluations (European languages)
+* Context: MTEB is a collection of tasks across many task types (e.g. classification, retrieval, etc.). A task can have N subsets for different languages, and the subsets themselves can also contain N languages, e.g. in translation-related tasks. The base script comes from [gritlm/evaluation/eval_mteb.py at main · ContextualAI/gritlm](https://github.com/ContextualAI/gritlm/blob/main/evaluation/eval_mteb.py) and includes Medi2-style instructions for many MTEB tasks. The instructions are all in English. All evaluations use Medi2-style instructions except for the “no instructions” case (see above); if a task does not have Medi2-style instructions, we skip the task (see the instruction-handling sketch after this list). German, Italian, Spanish, Portuguese, and French were used as the European languages for the MTEB tests.
+* For our Multilingual Evaluations (European languages) we use the tasks from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchmark/mteb](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/task_selection/europe_tasks.csv) and then filter for tasks where there is at least one subset with at least one of the European languages (see the selection sketch after this list).
+* We skip BibleNLPBitextMining and FloresBitextMining because they don’t have ‘test’ splits, only a ‘train’ split, which we don’t want to use for evaluation (→ training data contamination likely)
+* We evaluate subsets which contain at least one of the European languages → that’s why there is also an “English” language column: some subsets are e.g. En ↔︎ De and are thus included
+* The tasks that remain are
+  - AmazonCounterfactualClassification
+  - BUCC.v2
+  - DiaBlaBitextMining
+  - MassiveScenarioClassification
+  - NTREXBitextMining
+  - STS17
+* For NTREXBitextMining the subsets are further filtered down to only pairs of the European languages instead of at least one European language (see the pair sketch after this list)
+  - i.e. this gives 20-2=18 translation pair subsets between the 5 languages. -2 because Italian ↔︎ German doesn’t exist.
+  - this is done because otherwise there are 250 translation pair subsets, which are not as relevant (e.g. they contain Vietnamese ↔︎ Portuguese)
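Below is a minimal sketch of the instruction handling described in the first bullet. It is an illustration, not the actual eval script: `INSTRUCTIONS` stands in for the English-only Medi2-style mapping shipped with `eval_mteb.py`, and the placeholder strings and the `select_runnable` helper are hypothetical names.

```python
# Sketch only: INSTRUCTIONS stands in for the English-only Medi2-style
# mapping shipped with gritlm's eval_mteb.py; the strings are placeholders.
INSTRUCTIONS = {
    "AmazonCounterfactualClassification": "<Medi2-style instruction>",
    "BUCC.v2": "<Medi2-style instruction>",
    "STS17": "<Medi2-style instruction>",
}

def select_runnable(task_names, use_instructions=True):
    """Yield (task, instruction) pairs, skipping tasks with no instruction."""
    for name in task_names:
        if not use_instructions:      # the "no instructions" ablation case
            yield name, ""
            continue
        instruction = INSTRUCTIONS.get(name)
        if instruction is None:       # no Medi2-style instruction -> skip task
            print(f"skipping {name}")
            continue
        yield name, instruction

for task, instruction in select_runnable(["BUCC.v2", "SomeUncoveredTask"]):
    print(task, instruction)
```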
+
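The task selection could look roughly like the following sketch. Treat the CSV column name (`task`), the `mteb.get_task` accessor, and the `eval_splits`/`eval_langs` metadata attributes as assumptions taken from recent `mteb` releases; they may differ in the version actually used.

```python
# Sketch of the task selection from europe_tasks.csv. Column and attribute
# names are assumptions, not a pinned API.
import csv
import urllib.request

import mteb

EURO_LANGS = {"deu", "ita", "spa", "por", "fra"}  # ISO 639-3 codes

CSV_URL = ("https://raw.githubusercontent.com/embeddings-benchmark/mteb/"
           "main/scripts/task_selection/europe_tasks.csv")

def task_names():
    with urllib.request.urlopen(CSV_URL) as f:
        return [row["task"] for row in csv.DictReader(l.decode() for l in f)]

def keep(name):
    meta = mteb.get_task(name).metadata
    if "test" not in meta.eval_splits:
        # e.g. BibleNLPBitextMining / FloresBitextMining only ship a 'train'
        # split -> skipped to avoid likely training-data contamination.
        return False
    langs = meta.eval_langs
    # Multilingual tasks map subset -> ["deu-Latn", ...]; flat tasks are lists.
    flat = (l for ls in langs.values() for l in ls) if isinstance(langs, dict) else langs
    return any(l.split("-")[0] in EURO_LANGS for l in flat)

print([n for n in task_names() if keep(n)])
```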
+
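And the pair filtering for NTREXBitextMining, with the arithmetic from the bullet above; the `src-tgt` subset-name format is an assumption for illustration.

```python
# Sketch of the NTREXBitextMining subset filter: keep only directed pairs
# where BOTH sides are one of the five European languages.
from itertools import permutations

EURO_LANGS = ["deu", "fra", "ita", "por", "spa"]
MISSING = {("ita", "deu"), ("deu", "ita")}  # Italian <-> German doesn't exist

pairs = [f"{src}-{tgt}"
         for src, tgt in permutations(EURO_LANGS, 2)
         if (src, tgt) not in MISSING]

print(len(pairs))  # 5 * 4 = 20 directed pairs, minus the 2 missing -> 18
```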
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->