Update README.md
### Model architecture

| Parameter | Value |
|-------|-------|
| Number of layers | 27 |
| Number of attention heads | 36 |
| Head size | 128 |
| … | … |
| Rotary base | 1,000,000 |
| Total parameter count | 7,041,544,704 |
### Training
Pharia-1-Embedding-4608-control is an adapter on top of Pharia-1-LLM-7B-control, trained with a context window of 2048 tokens. Pharia-1-Embedding-4608-control was trained with representational instruction-tuning (inspired by the approach of GritLM) and a contrastive learning approach. The final layer is an embedding head with weighted mean pooling. The training set consisted of a blend of open-source and proprietary datasets. Further postprocessing was used to optimize for downstream use and multilinguality.
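
For illustration, here is a minimal PyTorch sketch of weighted mean pooling. The position-proportional weighting is an assumption (borrowed from SGPT-style pooling); the exact weighting scheme used by Pharia-1-Embedding-4608-control is not specified here.

```python
import torch

def weighted_mean_pooling(hidden_states: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """Pool per-token embeddings into one vector per sequence.

    hidden_states:  (batch, seq_len, dim) final-layer token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Assumed position-proportional weights (later tokens weigh more);
    # the actual Pharia weighting scheme is not documented in this card.
    positions = torch.arange(1, hidden_states.size(1) + 1,
                             device=hidden_states.device,
                             dtype=hidden_states.dtype)
    weights = positions * attention_mask              # zero out padding
    weights = weights / weights.sum(dim=1, keepdim=True)
    return (hidden_states * weights.unsqueeze(-1)).sum(dim=1)
```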
### Tokenization
Our tokenizer has a vocabulary size of 128,000 and was trained via the Unigram algorithm, using the implementation provided by the SentencePiece library. The tokenizer training set was a small subset of our high-quality data.
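
A minimal sketch of how such a tokenizer could be trained with SentencePiece; the corpus path is a placeholder, and any additional training options the team used are not documented here.

```python
import sentencepiece as spm

# Train a Unigram tokenizer; "corpus.txt" is a hypothetical path to the
# high-quality training subset mentioned above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="pharia_unigram",
    vocab_size=128000,
    model_type="unigram",
)
```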
After the training procedure, we performed some additional cleaning steps (sketched in code after the list):

- Split whole-number tokens (e.g. `12345`) into individual digit tokens
- Remove double spaces: drop any token that contains two consecutive spaces
- Remove tokens that contain a zero-width space (except the zero-width-space token itself)
- Remove tokens with more than 3 repeated characters in a substring (e.g. `bananaaaa`, `caaaar`)
- Remove any token that contains "\n" and is not exactly "\n" or "\r"
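
The filters above could be expressed as a predicate like the following. This is a hypothetical re-implementation for clarity, not the team's actual cleaning script; note that digit splitting is handled elsewhere as a rewrite, so here multi-digit tokens are simply dropped.

```python
import re

ZWSP = "\u200b"  # zero-width space

def keep_token(token: str) -> bool:
    """Return True if a candidate vocabulary token survives the cleaning rules."""
    if token.isdigit() and len(token) > 1:         # whole numbers become digit tokens
        return False
    if "  " in token:                              # double spaces
        return False
    if ZWSP in token and token != ZWSP:            # zero-width space (except itself)
        return False
    if re.search(r"(.)\1{3,}", token):             # 4+ repeats, e.g. "aaaa"
        return False
    if "\n" in token and token not in ("\n", "\r"):
        return False
    return True
```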
### Tokenizer fertility
Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer's ability to represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages. The Pharia-1-LLM-7B model's tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is therefore more cost-efficient at inference time.
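
Computing fertility is straightforward; the sketch below uses a naive whitespace word count, which is a simplification (the cited paper may segment words differently), and the tokenizer checkpoint is just a publicly available stand-in.

```python
from transformers import AutoTokenizer

def fertility(text: str, tokenizer) -> float:
    """Tokens per word; lower values mean more compact tokenization."""
    n_tokens = len(tokenizer.tokenize(text))
    n_words = len(text.split())  # naive whitespace word count (assumption)
    return n_tokens / n_words

# Example usage; substitute any tokenizer you want to compare.
tok = AutoTokenizer.from_pretrained("gpt2")
print(fertility("Die Tokenizer-Fertilität misst Tokens pro Wort.", tok))
```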

| Tokenizer fertility | Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned | Mistral-7B-Instruct-v0.3 | llama-3.1-8b-instruct |
|--|--|--|--|
| de | 2.011 | 2.546 | 2.241 |
| fr | 1.896 | 2.105 | 1.836 |
| es | 1.673 | 2.030 | 1.749 |
| en | 1.633 | 1.681 | 1.410 |