---
license: cc-by-4.0
library_name: transformers
pipeline_tag: fill-mask
language:
- sw
- xh
- tn
- rw
- zu
- lg
- sot
- ln
- om
- ss
- rn
- nso
- ny
- sn
- nr
- ts
- ve
datasets:
- Helsinki-NLP/opus-100
- statmt/cc100
- legacy-datasets/mc4
tags:
- africannlp
- africanlp
- dsfsi
---


# BantuBERTa: Using Language Family Grouping in Multilingual Language Modeling for Bantu Languages


Give Feedback 📑: [DSFSI Resource Feedback Form](https://docs.google.com/forms/d/e/1FAIpQLSf7S36dyAUPx2egmXbFpnTBuzoRulhL5Elu-N1eoMhaO7v10w/formResponse)

## Model Details

### Mini-Dissertation Abstract

This work investigated whether a multilingual Bantu pretraining corpus could be created from freely available data. To create the dataset, Bantu text was extracted from datasets that are freely available online (mainly from Hugging Face). The resulting multilingual language model (BantuBERTa), pretrained on this data, proved to be predictive across multiple Bantu languages on a higher-order NLP task (NER) and on a simpler NLP task (classification). This shows that the dataset can be used for Bantu multilingual pretraining and transfer to multiple Bantu languages. Additionally, the work investigated whether using this Bantu dataset could benefit transfer learning in downstream NLP tasks. BantuBERTa underperformed relative to other models (XLM-R, mBERT, and AfriBERTa) benchmarked on MasakhaNER’s Bantu language tests (Swahili, Luganda, and Kinyarwanda). In contrast, it produced state-of-the-art results for the Bantu language benchmarks (Zulu and Lingala) in the African News Topic Classification dataset. It was surmised that the pretraining dataset size (which was 30% smaller than AfriBERTa’s) and dataset quality were the main causes of the poor performance in the NER test. We believe this is a case-specific failure due to poor data quality resulting from a pretraining dataset consisting mainly of web-scraped pages; the resulting dataset consisted mainly of mC4 and CC100 Bantu text. However, on lower-order NLP tasks, like classification, pretraining solely on languages within the language family seemed to benefit transfer to other similar languages within the family. This potentially opens a method for effectively including low-resourced languages in low-level NLP tasks.

### Model Description

- **Developed by:** Jesse Parvess, Vukosi Marivate ([@vukosi](https://huggingface.co/@vukosi)), Verrah Akinyi
- **Model type:** RoBERTa Model
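- **Language(s) (NLP):** Swahili (sw), Xhosa (xh), Tswana (tn), Kinyarwanda (rw), Zulu (zu), Luganda (lg), Sesotho (sot), Lingala (ln), Oromo (om), Swati (ss), Kirundi (rn), Northern Sotho (nso), Chichewa (ny), Shona (sn), Southern Ndebele (nr), Tsonga (ts), Venda (ve)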
- **License:** CC BY 4.0


### Usage

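Use this model to fill in masked tokens, or fine-tune it for downstream tasks. Here’s a simple example for masked prediction (the Swahili input is only an illustration):

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

# Load the tokenizer and the model with its masked-language-modeling head
model = RobertaForMaskedLM.from_pretrained('dsfsi/BantuBERTA')
tokenizer = RobertaTokenizer.from_pretrained('dsfsi/BantuBERTA')

# Predict the masked token; the mask string is read from the tokenizer
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill_mask(f"Habari ya {tokenizer.mask_token}"))
```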
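For fine-tuning on a downstream task such as topic classification, a minimal sketch might look like the following. It is an assumption-laden starting point, not the authors' training setup: the helper names `build_label_maps` and `load_for_classification` are hypothetical, the label names passed in are up to you, and the `Auto*` classes are assumed to resolve this checkpoint correctly. The classification head is newly initialized, so the returned model still has to be trained on labeled data before its predictions mean anything.

```python
from typing import Dict, List, Tuple

MODEL_ID = "dsfsi/BantuBERTA"  # same identifier as in the usage example


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Return (id2label, label2id) for a list of class names."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_for_classification(labels: List[str], model_id: str = MODEL_ID):
    """Load the pretrained encoder with a fresh, untrained classification head."""
    # Imported lazily so the label helper above can be used on its own.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(labels),  # size of the new output layer
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_classification(["politics", "sports", "health"])` (placeholder labels, for illustration only) downloads the checkpoint and returns a tokenizer and model that can be passed to a standard training loop or to the `Trainer` API.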

## Citation Information

BibTeX references (please cite both):

```
@misc{parvess2024bantuberta,
  title   = {BantuBERTa Model},
  author  = {Jesse Parvess and Vukosi Marivate and Verrah Akinyi},
  year    = {2024},
  publisher  = {Hugging Face},
  keywords = {NLP},
  software_url = {https://huggingface.co/dsfsi/BantuBERTa},
  doi = {10.57967/hf/3067},
}
```

```
@mastersthesis{parvess2023thesis,
  title        = {BantuBERTa: Using Language Family Grouping in Multilingual Language Modeling for Bantu Languages},
  author       = {Jesse Parvess},
  year         = 2023,
  address      = {Pretoria, South Africa},
  note         = {Available at \url{https://repository.up.ac.za/handle/2263/92766}},
  school       = {University of Pretoria},
  type         = {Master's mini-dissertation},
  url          = {https://repository.up.ac.za/handle/2263/92766}
}
```

## Contributing

Your contributions are welcome! Feel free to improve the model.

## Model Card Authors

Vukosi Marivate

## Model Card Contact

For more details, reach out or check our [website](https://dsfsi.github.io/).

Email: [email protected]

**Enjoy exploring African Languages through AI!**