dsfsi/PuoBERTaJW300

vukosi committed on Oct 12, 2023

Commit eb0affd · 1 Parent(s): dccfb9a

Update README.md

Files changed (1)

README.md +107 -0

@@ -1,3 +1,110 @@
 ---
 license: cc-by-4.0
+datasets:
+- dsfsi/vukuzenzele-monolingual
+- nchlt
+- dsfsi/PuoData
+language:
+- tn
+library_name: transformers
+pipeline_tag: fill-mask
+tags:
+- masked language model
+- setswana
 ---
+# PuoBERTaJW300: A curated Setswana Language Model (trained on PuoData + JW300 Setswana)
+A RoBERTa-based language model designed specifically for Setswana, trained on the new PuoData dataset and the JW300 Setswana corpus.
+## Model Details
+### Model Description
+This is a masked language model trained on Setswana corpora. It can serve as a base for a range of downstream applications, from translation to content creation. It is trained on PuoData, a curated dataset intended to keep the training text accurate and culturally relevant, together with the JW300 Setswana corpus.
+- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co/@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co/@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
+- **Model type:** RoBERTa Model
+- **Language(s) (NLP):** Setswana
+- **License:** CC BY 4.0
+### Usage
+Use this model to fill in masked tokens, or fine-tune it for downstream tasks. Here’s a simple example that loads the model and tokenizer for masked prediction:
+```python
+from transformers import RobertaTokenizer, RobertaForMaskedLM
+# Load the model with its masked-language-modeling head, plus the tokenizer
+model = RobertaForMaskedLM.from_pretrained('dsfsi/PuoBERTaJW300')
+tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTaJW300')
+```
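To go from a loaded checkpoint to actual mask predictions, one option is the `fill-mask` pipeline from `transformers`. The following is a minimal sketch, not the authors' own code: the helper names `load_fill_mask` and `top_predictions` are hypothetical, and it assumes the checkpoint on the Hub loads through the standard pipeline API.

```python
from typing import Callable, Dict, List, Tuple


def load_fill_mask(model_id: str = "dsfsi/PuoBERTaJW300"):
    """Build a fill-mask pipeline; downloads the checkpoint on first call."""
    # Imported lazily so top_predictions can be used without loading transformers.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(
    fill_mask: Callable[..., List[Dict]], text: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring fillers for the single mask token in `text`."""
    results = fill_mask(text, top_k=k)
    # Each result is a dict; keep only the predicted token and its score.
    return [(r["token_str"].strip(), float(r["score"])) for r in results]
```

A typical call would be `fill_mask = load_fill_mask()` followed by `top_predictions(fill_mask, sentence)`, where `sentence` is a Setswana sentence of your choice containing exactly one `fill_mask.tokenizer.mask_token`.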
+### Downstream Use
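The downstream results reported in this card are for token-level tasks (part-of-speech tagging and named entity recognition). When fine-tuning for such tasks, word-level labels usually have to be mapped onto subword tokens. The sketch below shows one common convention, labelling only the first subword of each word. It is an illustration, not necessarily the procedure behind the reported scores; `align_labels` is a hypothetical helper, and it assumes a fast tokenizer whose encodings expose `word_ids()`.

```python
from typing import List, Optional

# Label value that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to subword tokens.

    `word_ids` maps each token to the index of its source word
    (None for special tokens). Only the first subword of each word
    keeps the label; all other positions get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned
```

With a fast tokenizer, `word_ids` would come from `tokenizer(words, is_split_into_words=True).word_ids()`, and the aligned labels are passed to the token-classification model as `labels`.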
+## Downstream Performance
+### MasakhaPOS
+Performance of models on the MasakhaPOS downstream task.
+| Model | Test Performance |
+|---|---|
+| **Multilingual Models** |  |
+| AfroLM | 83.8 |
+| AfriBERTa | 82.5 |
+| AfroXLMR-base | 82.7 |
+| AfroXLMR-large | 83.0 |
+| **Monolingual Models** |  |
+| NCHLT TSN RoBERTa | 82.3 |
+| PuoBERTa | 83.4 |
+| PuoBERTa+JW300 | **84.1** |
+### MasakhaNER
+Performance of models on the MasakhaNER downstream task.
+| Model | Test Performance (F1 score) |
+|---|---|
+| **Multilingual Models** |  |
+| AfriBERTa | 83.2 |
+| AfroXLMR-base | 87.7 |
+| AfroXLMR-large | 89.4 |
+| **Monolingual Models** |  |
+| NCHLT TSN RoBERTa | 74.2 |
+| PuoBERTa | 78.2 |
+| PuoBERTa+JW300 | **80.2** |
+## Dataset
+We used the PuoData dataset, a curated source of Setswana text, together with the JW300 Setswana corpus.
+## Citation Information
+BibTeX Reference
+```
+@article{marivatePuoBERTa2023,
+  title={PuoBERTa: Training and evaluation of a curated language model for Setswana},
+  author={Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
+  journal={ArXiv},
+}
+```
+## Contributing
+Your contributions are welcome! Feel free to improve the model.
+## Model Card Authors
+Vukosi Marivate
+## Model Card Contact
+For more details, reach out or check our [website](https://dsfsi.github.io/).
+Email: [email protected]
+**Enjoy exploring Setswana through AI!**