Update README.md
* **Base Architecture:** `sentence-transformers/all-MiniLM-L6-v2` (6 layers, hidden dimension 384, 12 attention heads)
* **Parameters:** ~30.2 million (after vocabulary expansion)
* **Tokenizer:** Custom bilingual (AZ-EN) SentencePiece Unigram, vocab size ~50k. Available at [LocalDoc/az-en-unigram-tokenizer-50k](https://huggingface.co/LocalDoc/az-en-unigram-tokenizer-50k); the tokenizer training code is available at [vrashad/azerbaijani_tokenizer](https://github.com/vrashad/azerbaijani_tokenizer). See the loading sketch after this list.
* **Output Dimension:** 384
* **Max Sequence Length:** 512 tokens
* **Training:** Fine-tuned for 3 epochs on a parallel corpus of ~4.14 million Azerbaijani-English sentence pairs using MSELoss for knowledge distillation from `BAAI/bge-small-en-v1.5`. A distillation sketch follows below.
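
For reference, a minimal sketch of loading the published tokenizer, assuming the Hub repo ships `transformers`-compatible tokenizer files:

```python
from transformers import AutoTokenizer

# Load the published bilingual tokenizer from the Hugging Face Hub
# (assumes the repo provides transformers-compatible tokenizer files).
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

print(len(tokenizer))  # vocabulary size, expected to be ~50k
# Unigram segmentation of an Azerbaijani sentence
print(tokenizer.tokenize("Azərbaycan dili üçün xüsusi tokenizator."))
```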
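
And a minimal sketch of the MSELoss distillation setup described above, using `sentence-transformers`. The toy sentence pair and the batch size are illustrative stand-ins (not the actual training script), and the vocabulary swap to the custom tokenizer is omitted:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Teacher and student base are the ones named in this README.
teacher = SentenceTransformer("BAAI/bge-small-en-v1.5")                  # 384-dim teacher
student = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # 384-dim student

# Parallel (AZ, EN) pairs: the regression target for the Azerbaijani
# sentence is the teacher's embedding of its English translation.
pairs = [("Salam, dünya!", "Hello, world!")]  # stand-in for the ~4.14M-pair corpus
train_examples = [
    InputExample(texts=[az], label=teacher.encode(en))
    for az, en in pairs
]

train_loader = DataLoader(train_examples, shuffle=True, batch_size=64)
loss = losses.MSELoss(model=student)  # MSE between student and teacher embeddings
student.fit(train_objectives=[(train_loader, loss)], epochs=3)
```

MSELoss works here because teacher and student both produce 384-dimensional embeddings, so the student's output can be regressed directly onto the teacher's.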