# DF-Arc v1.1: Morphology-Aware Arabic Tokenizer
DF-Arc is a specialized tokenizer for Arabic LLMs that minimizes the "Arabic Token Tax". By combining morphological pre-tokenization with PMI-based phrase merging, it reaches a fertility of 0.83 tokens per word on dialectal text (below 1.0 because multi-word phrases can merge into single tokens), while preserving semantic coherence better than the GPT-4 tokenizer or standard BERT tokenizers.
## New in v1.1
- PMI-Powered Phrase Merging: Learns phrases based on statistical coupling (Pointwise Mutual Information) rather than raw frequency alone (see the sketch after this list).
- Embedded Protections: Built-in protection for sensitive entities (e.g., "Allah", "Mohamed") and common trademarks without external files.
- Enhanced Dialect Support: Trained on a broader corpus including Egyptian dialogue, songs, and feedback datasets.
- Self-Contained: No extra config files needed; just load and go.
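To illustrate the idea behind PMI-based phrase merging, here is a minimal sketch that scores adjacent token pairs by pointwise mutual information and keeps pairs above a threshold as merge candidates. This is an illustrative reconstruction, not DF-Arc's internal implementation; the `min_pmi` and `min_count` thresholds are assumptions.

```python
import math
from collections import Counter

def pmi_merge_candidates(corpus_tokens, min_pmi=3.0, min_count=5):
    """Score adjacent token pairs by pointwise mutual information.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from counts.
    Pairs with high PMI co-occur far more often than chance would
    predict, so they are good candidates to merge into phrase tokens.
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n = len(corpus_tokens)
    n_pairs = max(n - 1, 1)

    candidates = {}
    for (x, y), c_xy in bigrams.items():
        if c_xy < min_count:
            continue  # skip rare pairs: their PMI estimates are noisy
        pmi = math.log((c_xy / n_pairs) / ((unigrams[x] / n) * (unigrams[y] / n)))
        if pmi >= min_pmi:
            candidates[(x, y)] = pmi
    return candidates
```

The design point is that a frequency-only criterion would merge pairs that are merely common but weakly coupled, whereas PMI favors pairs like "بسم الله" whose components strongly predict each other.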
## Performance
| Model | Fertility (tokens per word; lower is better) | Token reduction vs GPT-4 |
|---|---|---|
| DF-Arc v1.1 | 0.83 | +77.6% |
| GPT-4 (cl100k) | 3.69 | Baseline |
| AraBERT v2 | 1.56 | - |
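Fertility here is the average number of tokens produced per word. A minimal sketch of how such a figure can be measured, assuming whitespace-delimited words; the reported numbers may use a different evaluation corpus and word-splitting rules:

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)
print(fertility(tokenizer, ["انا بحب الذكاء الاصطناعي جدا"]))
```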
## Usage
```python
from transformers import AutoTokenizer

# trust_remote_code=True is required for the custom tokenization logic
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)

# Example: Dialectal + MSA
# "In the name of Allah, the Most Gracious, the Most Merciful; I really love artificial intelligence"
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
# Note: "الله" is preserved intact, and phrases like "بسم الله" are handled naturally.
```
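The underscores mark morpheme boundaries within a word. If you need the surface string back, the standard `encode`/`decode` round-trip applies, assuming the custom tokenizer implements `decode` like any `PreTrainedTokenizer`:

```python
ids = tokenizer.encode(text)
print(tokenizer.decode(ids, skip_special_tokens=True))
# Expected to reproduce the original string (modulo normalization)
```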
## Citation
If you use DF-Arc, please cite our paper: *The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models* (Dataflare Lab, 2026).