monsoon-nlp
/

dv-labse

Model card Files Files and versions

dv-labse

This is an experiment in cross-lingual transfer learning, to insert Dhivehi word and word-piece tokens into Google's LaBSE model.

Original model weights: https://huggingface.co/setu4993/LaBSE
Original model announcement: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

This currently outperforms dv-wave and dv-MuRIL (a similar transfer learning model) on the Maldivian News Classification task https://github.com/Sofwath/DhivehiDatasets

mBERT: 52%
dv-wave (ELECTRA): 89%
dv-muril: 90.7%
dv-labse: 91.3-91.5% (may continue training)

Training

Start with LaBSE (similar to mBERT) with no Thaana vocabulary
Based on PanLex dictionaries, attach 1,100 Dhivehi words to Sinhalese or English embeddings
Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
Continue BERT pretraining on Dhivehi text

CoLab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi

Downloads last month: 13

Safetensors

Model size

0.5B params

Tensor type

I64

·

F32

·