peralp24 committed
Commit aeaca1b · verified · 1 Parent(s): 3dc62e1

Update README.md

Files changed (1)
  1. README.md +37 -1
README.md CHANGED
@@ -274,7 +274,8 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma

  ### Model architecture

- |:-------:|:-------:|
+ | | |
+ |-------|-------|
  |Number of layers|27|
  |Number of attention heads|36|
  |Head size|128|
@@ -286,6 +287,41 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
  |Rotary base|1,000,000|
  |Total parameter count|7,041,544,704|

+ ### Training
+
+ Pharia-1-Embedding-4608-control is an adapter on top of Pharia-1-LLM-7B-control, trained with a context window
+ of 2048 tokens. Pharia-1-Embedding-4608-control was trained with representational instruction-tuning (inspired by the
+ approach of GritLM) and a contrastive learning approach. The final layer is an embedding head with weighted mean pooling.
+ The training set consisted of a blend of open-source and proprietary datasets, and further postprocessing was used to optimize
+ for downstream use and multilinguality.
+
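The pooling step mentioned above can be made concrete with a short sketch. The following is a minimal illustration of weighted mean pooling over last-layer token embeddings, assuming Hugging Face-style `(batch, seq_len, hidden)` outputs and a 0/1 attention mask; the linearly increasing position weights are an illustrative choice, not the confirmed weighting used by the embedding head.

```python
import torch

def weighted_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Collapse per-token embeddings into one embedding per sequence.

    hidden_states:  (batch, seq_len, hidden) last-layer outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Illustrative weighting: weight grows linearly with token position,
    # and padding positions are zeroed out via the attention mask.
    positions = torch.arange(1, hidden_states.size(1) + 1, device=hidden_states.device)
    weights = (positions.unsqueeze(0) * attention_mask).unsqueeze(-1).to(hidden_states.dtype)

    summed = (hidden_states * weights).sum(dim=1)          # (batch, hidden)
    return summed / weights.sum(dim=1).clamp(min=1e-9)     # weighted mean
```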
+ ### Tokenization
+
+ Our tokenizer has a vocabulary size of 128,000 and was trained with the Unigram algorithm, using the implementation provided by the SentencePiece library. The tokenizer training set was a small subset of our high-quality data. After the training procedure, we performed some additional cleaning steps:
+ - Split whole-number tokens (e.g. 12345) into individual digit tokens.
+ - Remove tokens that contain a double space ("  ").
+ - Remove tokens that contain a zero-width space (except the zero-width space token itself).
+ - Remove tokens in which a character is repeated more than 3 times in a row (e.g. bananaaaa, caaaar).
+ - Remove any token that contains "\n" and is not exactly "\n" or "\r".
+
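To make the cleaning rules concrete, here is a hedged sketch that expresses them as a single vocabulary filter; the function name, regular expressions, and example vocabulary are illustrative assumptions, not the pipeline that was actually run.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"

def keep_token(token: str) -> bool:
    """Return True if a candidate vocabulary token survives the cleaning rules."""
    # Whole multi-digit numbers are split into digits, so multi-digit tokens are dropped here.
    if re.fullmatch(r"\d{2,}", token):
        return False
    # Drop tokens containing a double space.
    if "  " in token:
        return False
    # Drop tokens containing a zero-width space, except the zero-width-space token itself.
    if ZERO_WIDTH_SPACE in token and token != ZERO_WIDTH_SPACE:
        return False
    # Drop tokens with a character repeated more than 3 times in a row.
    if re.search(r"(.)\1{3,}", token):
        return False
    # Drop tokens containing "\n" unless the token is exactly "\n" or "\r".
    if "\n" in token and token not in ("\n", "\r"):
        return False
    return True

# Example: filter a small, made-up vocabulary list.
vocab = ["12345", "ban", "bananaaaa", "\n", "hello\nworld", "hello"]
print([t for t in vocab if keep_token(t)])  # ['ban', '\n', 'hello']
```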
+ ### Tokenizer fertility
+
+ Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer's ability to
+ represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that
+ same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
+ than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
+ The Pharia-1-LLM-7B tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
+ therefore more cost-efficient at inference time.
+
+ |Tokenizer fertility|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
+ |--|--|--|--|
+ |de|2.011|2.546|2.241|
+ |fr|1.896|2.105|1.836|
+ |es|1.673|2.030|1.749|
+ |en|1.633|1.681|1.410|
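As a worked illustration of the metric reported above, the sketch below computes fertility as tokens divided by whitespace-separated words; the model id passed to `AutoTokenizer.from_pretrained` and the naive word split are assumptions for demonstration only. Lower values mean fewer tokens per word, hence cheaper inference for the same text.

```python
from transformers import AutoTokenizer

def tokenizer_fertility(tokenizer, text: str) -> float:
    """Fertility = number of tokens produced for a text / number of words in it."""
    n_tokens = len(tokenizer.tokenize(text))
    n_words = len(text.split())  # naive whitespace word count (assumption)
    return n_tokens / n_words

# Illustrative usage: the model id is an assumption; any Hugging Face tokenizer works here.
tok = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control")
sample = "Der schnelle braune Fuchs springt über den faulen Hund."
print(round(tokenizer_fertility(tok, sample), 3))
```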