---
license: apache-2.0
language:
- en
- yo
- ha
- ig
- pcm
---

# naija-bert-base

NaijaBERT was created by pre-training a [BERT model with token dropping](https://aclanthology.org/2022.acl-long.262/) on texts in five Nigerian languages (English, Hausa, Igbo, Naija, and Yoruba) for about 100K steps.
It uses the BERT-base architecture and was trained with the [TensorFlow Model Garden](https://github.com/tensorflow/models/tree/master/official/projects).

### Pre-training corpus
A mix of WURA, Wikipedia, and MT560 data.

#### How to use
You can use this model with the Transformers *pipeline* for masked token prediction.
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='Davlan/naija-bert-base')
>>> unmasker("Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba")
```
```
[{'score': 0.9981744289398193, 'token': 3785, 'token_str': 'osu', 'sequence': 'ojo kesan - an, osu kejo ni won ri oku baba'},
 {'score': 0.0015279919607564807, 'token': 3355, 'token_str': 'ojo', 'sequence': 'ojo kesan - an, ojo kejo ni won ri oku baba'},
 {'score': 0.0001734074903652072, 'token': 11780, 'token_str': 'osun', 'sequence': 'ojo kesan - an, osun kejo ni won ri oku baba'},
 {'score': 9.066923666978255e-05, 'token': 21579, 'token_str': 'oṣu', 'sequence': 'ojo kesan - an, oṣu kejo ni won ri oku baba'},
 {'score': 1.816015355871059e-05, 'token': 3387, 'token_str': 'odun', 'sequence': 'ojo kesan - an, odun kejo ni won ri oku baba'}]
```
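
You can also load the model directly for masked-token scoring or to extract hidden states. A minimal sketch using the standard Transformers auto classes (this assumes the checkpoint loads as a regular BERT masked LM, as the description above suggests):
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('Davlan/naija-bert-base')
model = AutoModelForMaskedLM.from_pretrained('Davlan/naija-bert-base')

text = "Ọjọ kẹsan-an, [MASK] Kẹjọ ni wọn ri oku Baba"
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list the five highest-scoring tokens
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```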

### Acknowledgment
We thank [@stefan-it](https://github.com/stefan-it) for providing the pre-processing and pre-training scripts. Finally, we would like to thank Google Cloud for giving us access to a TPU v3-8 through free cloud credits. The model was trained in Flax before being converted to PyTorch.
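
For reference, a Flax-to-PyTorch conversion of this kind can be done with Transformers' built-in `from_flax` support. A minimal sketch, assuming a hypothetical local directory containing the Flax checkpoint:
```python
from transformers import BertForMaskedLM

# Load the Flax weights into an equivalent PyTorch model, then save them
# in PyTorch format. './naija-bert-flax' is a hypothetical local path.
model = BertForMaskedLM.from_pretrained('./naija-bert-flax', from_flax=True)
model.save_pretrained('./naija-bert-pytorch')
```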


### BibTeX entry and citation info
```
@misc{david_adelani_2025,
  author    = { David Adelani },
  title     = { naija-bert-base (Revision 22c83d8) },
  year      = 2025,
  url       = { https://huggingface.co/Davlan/naija-bert-base },
  doi       = { 10.57967/hf/5864 },
  publisher = { Hugging Face }
}
```