Instructions to use monsoon-nlp/dv-labse with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use monsoon-nlp/dv-labse with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("fill-mask", model="monsoon-nlp/dv-labse")# Load model directly from transformers import AutoTokenizer, AutoModelForMaskedLM tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/dv-labse") model = AutoModelForMaskedLM.from_pretrained("monsoon-nlp/dv-labse") - Notebooks
- Google Colab
- Kaggle
dv-labse
This is an experiment in cross-lingual transfer learning, to insert Dhivehi word and word-piece tokens into Google's LaBSE model.
- Original model weights: https://huggingface.co/setu4993/LaBSE
- Original model announcement: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html
This currently outperforms dv-wave and dv-MuRIL (a similar transfer learning model) on the Maldivian News Classification task https://github.com/Sofwath/DhivehiDatasets
- mBERT: 52%
- dv-wave (ELECTRA): 89%
- dv-muril: 90.7%
- dv-labse: 91.3-91.5% (may continue training)
Training
- Start with LaBSE (similar to mBERT) with no Thaana vocabulary
- Based on PanLex dictionaries, attach 1,100 Dhivehi words to Sinhalese or English embeddings
- Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
- Continue BERT pretraining on Dhivehi text
CoLab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi
- Downloads last month
- 26