language:

  • om
  • am
  • rw
  • rn
  • ha
  • ig
  • pcm
  • so
  • sw
  • ti
  • yo
  • multilingual

afriteva_base

Model description

AfriTeVa base is a multilingual sequence-to-sequence model pretrained on 10 African languages.

Languages

Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)

More information on the model and dataset:

The model

  • 229M-parameter encoder-decoder architecture (T5-like)
  • 12 layers, 12 attention heads, and a 512-token sequence length (see the sketch below)
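
These architecture details can be checked directly from the released checkpoint; below is a minimal sketch, assuming the model exposes a standard T5-style configuration with num_layers, num_heads, and d_model attributes (these attribute names are assumptions, not confirmed by the authors).

>>> from transformers import AutoConfig, AutoModelForSeq2SeqLM

>>> config = AutoConfig.from_pretrained("castorini/afriteva_base")
>>> config.num_layers, config.num_heads, config.d_model  # layers, attention heads, hidden size
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")
>>> f"{model.num_parameters():,} parameters"  # should come out to roughly 229M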

The dataset

  • Multilingual: 10 African languages listed above
  • 143 million tokens (1 GB of text data)
  • Tokenizer vocabulary size: 70,000 tokens (see the check below)
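
The tokenizer statistics can be verified in the same way; a short sketch using the standard vocab_size attribute of Hugging Face tokenizers:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> tokenizer.vocab_size  # expected to be about 70,000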

Intended uses & limitations

afriteva_base is a pretrained model, primarily intended to be fine-tuned on multilingual sequence-to-sequence tasks. For example:

>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_base")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_base")

>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"

>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids

>>> model(**model_inputs, labels=labels) # forward pass
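
After fine-tuning, the same model can be used for generation; a minimal sketch reusing model_inputs from above (max_length and num_beams are illustrative values, not settings recommended by the authors):

>>> generated = model.generate(**model_inputs, max_length=64, num_beams=4)
>>> tokenizer.batch_decode(generated, skip_special_tokens=True)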

Training Procedure

For information on training procedures, please refer to the AfriTeVa paper or repository.

BibTeX entry and citation info

coming soon ...
