peralp24 committed
Commit aeaca1b · verified · 1 Parent(s): 3dc62e1

Update README.md

Files changed (1)
  1. README.md +37 -1
README.md CHANGED
@@ -274,7 +274,8 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma

  ### Model architecture

- |:-------:|:-------:|
+ | | |
+ |-------|-------|
  |Number of layers|27|
  |Number of attention heads|36|
  |Head size|128|
@@ -286,6 +287,41 @@ from [mteb/scripts/task_selection/europe_tasks.csv at main · embeddings-benchma
  |Rotary base|1,000,000|
  |Total parameter count|7,041,544,704|

+ ### Training
+
+ Pharia-1-Embedding-4608-control is an adapter on top of Pharia-1-LLM-7B-control, trained with a context window
+ of 2048 tokens. Pharia-1-Embedding-4608-control was trained with representational instruction-tuning (inspired by the
+ approach of GritLM) and a contrastive learning approach. The final layer is an embedding head with weighted mean pooling.
+ The training set consisted of a blend of open-source and proprietary datasets, and further postprocessing was used to optimize
+ for downstream use and multilinguality.
+
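The pooling step mentioned above can be made concrete with a short sketch. The following is a minimal illustration of weighted mean pooling over last-layer token embeddings, assuming Hugging Face-style `(batch, seq_len, hidden)` outputs and a 0/1 attention mask; the linearly increasing position weights are an illustrative choice, not the confirmed weighting used by the embedding head.

```python
import torch

def weighted_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Collapse per-token embeddings into one embedding per sequence.

    hidden_states:  (batch, seq_len, hidden) last-layer outputs
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # Illustrative weighting: weight grows linearly with token position,
    # and padding positions are zeroed out via the attention mask.
    positions = torch.arange(1, hidden_states.size(1) + 1, device=hidden_states.device)
    weights = (positions.unsqueeze(0) * attention_mask).unsqueeze(-1).to(hidden_states.dtype)

    summed = (hidden_states * weights).sum(dim=1)          # (batch, hidden)
    return summed / weights.sum(dim=1).clamp(min=1e-9)     # weighted mean
```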
+ ### Tokenization
+
+ Our tokenizer has a vocabulary size of 128,000 and was trained with the Unigram algorithm, using the implementation provided by the SentencePiece library. The tokenizer training set was a small subset of our high-quality data. After the training procedure, we performed some additional cleaning steps:
+ - Split whole-number tokens (e.g. 12345) into individual digit tokens.
+ - Remove tokens that contain a double space ("  ").
+ - Remove tokens that contain a zero-width space (except the zero-width space token itself).
+ - Remove tokens in which a character is repeated more than 3 times in a row (e.g. bananaaaa, caaaar).
+ - Remove any token that contains "\n" and is not exactly "\n" or "\r".
+
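To make the cleaning rules concrete, here is a hedged sketch that expresses them as a single vocabulary filter; the function name, regular expressions, and example vocabulary are illustrative assumptions, not the pipeline that was actually run.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"

def keep_token(token: str) -> bool:
    """Return True if a candidate vocabulary token survives the cleaning rules."""
    # Whole multi-digit numbers are split into digits, so multi-digit tokens are dropped here.
    if re.fullmatch(r"\d{2,}", token):
        return False
    # Drop tokens containing a double space.
    if "  " in token:
        return False
    # Drop tokens containing a zero-width space, except the zero-width-space token itself.
    if ZERO_WIDTH_SPACE in token and token != ZERO_WIDTH_SPACE:
        return False
    # Drop tokens with a character repeated more than 3 times in a row.
    if re.search(r"(.)\1{3,}", token):
        return False
    # Drop tokens containing "\n" unless the token is exactly "\n" or "\r".
    if "\n" in token and token not in ("\n", "\r"):
        return False
    return True

# Example: filter a small, made-up vocabulary list.
vocab = ["12345", "ban", "bananaaaa", "\n", "hello\nworld", "hello"]
print([t for t in vocab if keep_token(t)])  # ['ban', '\n', 'hello']
```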
+ ### Tokenizer fertility
+
+ Tokenizer fertility is a metric used to evaluate tokenizer performance. It measures a tokenizer's ability to
+ represent text and is calculated by dividing the number of tokens in a text (after tokenizing) by the number of words in that
+ same text [(https://arxiv.org/pdf/2310.08754)](https://arxiv.org/pdf/2310.08754). The tokenizer fertility of the Pharia-1-Embedding-4608-control model is lower
+ than that of Mistral-7B-Instruct-v0.3 and llama-3.1-8b-instruct for 4 out of the 7 supported European languages.
+ The Pharia-1-LLM-7B tokenizer can thus represent the same text more efficiently, i.e. with fewer tokens, and is
+ therefore more cost-efficient at inference time.
+
+ |Tokenizer fertility|Pharia-1-LLM-7B-control, Pharia-1-LLM-7B-control-aligned|Mistral-7B-Instruct-v0.3|llama-3.1-8b-instruct|
+ |--|--|--|--|
+ |de|2.011|2.546|2.241|
+ |fr|1.896|2.105|1.836|
+ |es|1.673|2.030|1.749|
+ |en|1.633|1.681|1.410|
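As a worked illustration of the metric reported above, the sketch below computes fertility as tokens divided by whitespace-separated words; the model id passed to `AutoTokenizer.from_pretrained` and the naive word split are assumptions for demonstration only. Lower values mean fewer tokens per word, hence cheaper inference for the same text.

```python
from transformers import AutoTokenizer

def tokenizer_fertility(tokenizer, text: str) -> float:
    """Fertility = number of tokens produced for a text / number of words in it."""
    n_tokens = len(tokenizer.tokenize(text))
    n_words = len(text.split())  # naive whitespace word count (assumption)
    return n_tokens / n_words

# Illustrative usage: the model id is an assumption; any Hugging Face tokenizer works here.
tok = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control")
sample = "Der schnelle braune Fuchs springt über den faulen Hund."
print(round(tokenizer_fertility(tok, sample), 3))
```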