vrashad committed
Commit 58a4e3e · verified · 1 parent: c9ca88a

Update README.md

Files changed (1): README.md (+1, −1)
README.md CHANGED
@@ -43,7 +43,7 @@ A custom bilingual (Azerbaijani-English) SentencePiece Unigram tokenizer with a
 
  * **Base Architecture:** `sentence-transformers/all-MiniLM-L6-v2` (6 layers, 384 hidden dimension, 12 attention heads)
  * **Parameters:** ~30.2 Million (after vocabulary expansion)
- * **Tokenizer:** Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at [LocalDoc/az-en-unigram-tokenizer-50k](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k).
+ * **Tokenizer:** Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at [LocalDoc/az-en-unigram-tokenizer-50k](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k). The training code is available at https://github.com/vrashad/azerbaijani_tokenizer
  * **Output Dimension:** 384
  * **Max Sequence Length:** 512 tokens
  * **Training:** Fine-tuned for 3 epochs on a parallel corpus of ~4.14 million Azerbaijani-English sentence pairs using MSELoss for knowledge distillation from `BAAI/bge-small-en-v1.5`.
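The MSELoss distillation objective from the **Training** bullet can be sketched in PyTorch. Everything here is a placeholder: the `student_head` linear layer stands in for the vocabulary-expanded all-MiniLM-L6-v2 student, and the random tensors stand in for frozen `BAAI/bge-small-en-v1.5` teacher embeddings and student features; the real run used the actual models over ~4.14M sentence pairs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholders (NOT the actual models): in the real setup the student is the
# vocabulary-expanded all-MiniLM-L6-v2 encoder and the teacher is the frozen
# BAAI/bge-small-en-v1.5 model producing 384-dim sentence embeddings.
student_head = nn.Linear(384, 384)             # stand-in for the student encoder
student_inputs = torch.randn(64, 384)          # stand-in for student features
teacher_embeddings = torch.randn(64, 384)      # stand-in for frozen teacher outputs

# Knowledge distillation: regress the student's embeddings onto the teacher's.
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(student_head.parameters(), lr=0.5)

losses = []
for epoch in range(3):  # the README reports 3 epochs over the parallel corpus
    optimizer.zero_grad()
    loss = loss_fn(student_head(student_inputs), teacher_embeddings)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Because only the loss target changes, this objective lets a student with a completely different tokenizer and vocabulary inherit the teacher's embedding space, which is why the custom AZ-EN tokenizer can be swapped in before distillation.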