# Sarufi: Swahili WordPiece Tokenizer
Sarufi is a WordPiece tokenizer trained on Swahili news data for natural language processing tasks.
## Details
- Vocabulary size: 25,000 tokens
- Training data: Swahili news dataset
- Special tokens: [UNK], [PAD], [CLS], [SEP], [MASK]
- Trained with the WordPiece algorithm (see the quick check below)
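
These properties can be verified once the tokenizer is loaded. A minimal check, assuming the same `{repo_id}` placeholder used in the Usage section below:

```python
from transformers import AutoTokenizer

# Replace {repo_id} with the repository id of this tokenizer on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("{repo_id}")

print(tokenizer.vocab_size)          # expected: 25000
print(tokenizer.all_special_tokens)  # expected to include [UNK], [PAD], [CLS], [SEP], [MASK]
```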
## Usage

```python
from transformers import AutoTokenizer

# Replace {repo_id} with the repository id of this tokenizer on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("{repo_id}")

# Example: encode a Swahili sentence
text = "Baba na mama wanapendana"
print(tokenizer(text))
```
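
For a quick sanity check, the subword splits and the round trip through token ids can also be inspected. The exact pieces depend on the trained vocabulary, so no output is shown here; `{repo_id}` is the same placeholder as above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("{repo_id}")  # same placeholder as above

text = "Baba na mama wanapendana"
tokens = tokenizer.tokenize(text)                         # WordPiece subword pieces
ids = tokenizer(text)["input_ids"]                        # integer ids (may include special tokens)
decoded = tokenizer.decode(ids, skip_special_tokens=True)

print(tokens)
print(decoded)  # should roughly match the lowercased input, given the lowercasing normalizer
```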
## Training Process
This tokenizer was trained on the Swahili news dataset using the Hugging Face tokenizers library with normalization (NFD, lowercase, strip accents) and whitespace pre-tokenization.
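
A minimal sketch of that setup, following the Hugging Face tokenizers quicktour API. The corpus file name `swahili_news.txt` is a hypothetical placeholder for the actual training files, and any trainer options beyond the vocabulary size and special tokens listed above are assumptions.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# WordPiece model with the unknown token listed in Details
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization: NFD Unicode decomposition, lowercasing, accent stripping
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# Whitespace pre-tokenization
tokenizer.pre_tokenizer = Whitespace()

# Trainer with the vocabulary size and special tokens from the Details section
trainer = WordPieceTrainer(
    vocab_size=25_000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)

# "swahili_news.txt" is a hypothetical placeholder for the Swahili news corpus
tokenizer.train(["swahili_news.txt"], trainer)
tokenizer.save("sarufi-tokenizer.json")
```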
## About the Name
"Sarufi" means "grammar" in Swahili, reflecting this tokenizer's purpose in processing the grammatical structures of the Swahili language.