vukosi committed
Commit eb0affd · 1 Parent(s): dccfb9a

Update README.md

Files changed (1)
  1. README.md +107 -0
README.md CHANGED
---
license: cc-by-4.0
datasets:
- dsfsi/vukuzenzele-monolingual
- nchlt
- dsfsi/PuoData
language:
- tn
library_name: transformers
pipeline_tag: fill-mask
tags:
- masked language model
- setswana
---
# PuoBERTaJW300: A Curated Setswana Language Model (trained on PuoData + JW300 Setswana)

A RoBERTa-based language model designed specifically for Setswana, trained on the new PuoData dataset together with the JW300 Setswana corpus.

## Model Details

### Model Description

This is a masked language model trained on Setswana corpora, making it a valuable tool for a range of downstream applications, from translation to content creation. It is powered by the PuoData dataset to ensure accuracy and cultural relevance.

- **Developed by:** Vukosi Marivate ([@vukosi](https://huggingface.co/@vukosi)), Moseli Mots'Oehli ([@MoseliMotsoehli](https://huggingface.co/@MoseliMotsoehli)), Valencia Wagner, Richard Lastrucci and Isheanesu Dzingirai
- **Model type:** RoBERTa Model
- **Language(s) (NLP):** Setswana
- **License:** CC BY 4.0

### Usage

Use this model to fill in masks, or fine-tune it for downstream tasks. Here's a simple example of loading the model and tokenizer for masked prediction:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked language model
tokenizer = AutoTokenizer.from_pretrained('dsfsi/PuoBERTaJW300')
model = AutoModelForMaskedLM.from_pretrained('dsfsi/PuoBERTaJW300')
```
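
The model card is tagged for the fill-mask pipeline, so masked predictions can also be obtained directly through the pipeline API. A minimal sketch (the Setswana example sentence is illustrative only):

```python
from transformers import pipeline

# Fill-mask pipeline; RoBERTa-style checkpoints use <mask> as the mask token
unmasker = pipeline('fill-mask', model='dsfsi/PuoBERTaJW300')

# Illustrative Setswana input: "Ke rata go bua ..." ("I like to speak ...")
for prediction in unmasker('Ke rata go bua <mask>.'):
    print(prediction['token_str'], round(prediction['score'], 3))
```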

### Downstream Use

The model can be fine-tuned for token-level downstream tasks such as part-of-speech tagging and named entity recognition, the same tasks used in the evaluations below; a fine-tuning sketch follows.
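
A minimal sketch of such fine-tuning, assuming a hypothetical POS label set and a tokenized dataset of your own (this is not the authors' exact training setup):

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

# Hypothetical label set for a POS-style task; replace with your dataset's labels
label_list = ['NOUN', 'VERB', 'PRON', 'ADP', 'PUNCT']

tokenizer = AutoTokenizer.from_pretrained('dsfsi/PuoBERTaJW300')
model = AutoModelForTokenClassification.from_pretrained(
    'dsfsi/PuoBERTaJW300', num_labels=len(label_list))

training_args = TrainingArguments(
    output_dir='puoberta-tsn-pos',        # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset,  # your tokenized, label-aligned data
#                   eval_dataset=eval_dataset)
# trainer.train()
```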

## Downstream Performance

### MasakhaPOS

Performance of models on the MasakhaPOS downstream task.

| Model | Test Performance |
|---|---|
| **Multilingual Models** | |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| **Monolingual Models** | |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | 83.4 |
| PuoBERTa+JW300 | **84.1** |

### MasakhaNER

Performance of models on the MasakhaNER downstream task.

| Model | Test Performance (F1 score) |
|---|---|
| **Multilingual Models** | |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| **Monolingual Models** | |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | 78.2 |
| PuoBERTa+JW300 | **80.2** |

## Dataset

We used the PuoData dataset, a rich source of Setswana text, together with the JW300 Setswana corpus, ensuring that the model is well-trained and culturally attuned.

## Citation Information

BibTeX reference:

```bibtex
@article{marivatePuoBERTa2023,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  journal = {ArXiv},
  year    = {2023},
}
```

## Contributing

Your contributions are welcome! Feel free to improve the model.

## Model Card Authors

Vukosi Marivate

## Model Card Contact

For more details, reach out or check our [website](https://dsfsi.github.io/).

**Enjoy exploring Setswana through AI!**