DF-Arc v1.1: Morphology-Aware Arabic Tokenizer

DF-Arc is a specialized tokenizer for Arabic LLMs that minimizes the "Arabic Token Tax". By combining morphological pre-tokenization with PMI-based phrase merging, it achieves near 1:1 fertility (0.83 tokens per word on dialectal text), preserving semantic coherence better than GPT-4's cl100k tokenizer or standard BERT tokenizers.
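The exact morphological rules are not documented on this card; as a rough sketch of what clitic-level pre-tokenization looks like, here is a toy greedy splitter. The prefix inventory and the greedy strategy are assumptions for illustration; only the `_` morpheme marker is taken from the output shown under Usage.

```python
# Hypothetical clitic inventory; DF-Arc's real rules are richer and
# protect entities such as "الله" from splitting (see "Embedded Protections").
PREFIXES = ("ال", "و", "ف", "ب", "ل", "ك")

def split_clitics(word: str) -> str:
    """Greedily strip common proclitics, joining morphemes with '_'."""
    parts = []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            # Only strip when a plausible stem (>= 2 chars) remains.
            if word.startswith(p) and len(word) - len(p) >= 2:
                parts.append(p)
                word = word[len(p):]
                stripped = True
                break
    parts.append(word)
    return "_".join(parts)

print(split_clitics("بحب"))     # -> ب_حب
print(split_clitics("الرحمن"))  # -> ال_رحمن
```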

New in v1.1

  • PMI-Powered Phrase Merging: Learns phrases from statistical association (pointwise mutual information) rather than raw frequency; a toy version is sketched after this list.
  • Embedded Protections: Built-in protection for sensitive entities (e.g., "Allah", "Mohamed") and common trademarks without external files.
  • Enhanced Dialect Support: Trained on a broader corpus including Egyptian dialogue, songs, and feedback datasets.
  • Self-Contained: No extra config files needed; just load and go.
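The sketch below assumes the standard definition PMI(a, b) = log2(p(a, b) / (p(a) · p(b))) over adjacent token pairs; the `min_count` and `threshold` values are illustrative, not DF-Arc's actual training settings.

```python
import math
from collections import Counter

def pmi_merge_candidates(tokens, min_count=5, threshold=3.0):
    """Score adjacent token pairs by PMI and return those worth merging."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    candidates = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # too rare for a reliable PMI estimate
        # PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) )
        pmi = math.log2((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= threshold:
            candidates[(a, b)] = pmi
    return candidates
```

Pairs whose PMI clears the threshold (e.g. strongly coupled formulae like "بسم الله") become merge candidates even when they are rarer than generic high-frequency pairs, which is the point of scoring by association rather than count.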

Performance

| Model | Fertility (lower is better) | Efficiency vs GPT-4 |
|---|---|---|
| DF-Arc v1.1 | 0.83 | +77.6% |
| GPT-4 (cl100k) | 3.69 | Baseline |
| AraBERT v2 | 1.56 | - |
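Fertility here is the average number of tokens produced per word. Assuming whitespace-delimited words (the card does not state the exact convention or evaluation corpus), the metric can be reproduced as:

```python
def fertility(tokenizer, texts):
    """Average tokens per whitespace-delimited word (lower is better)."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words
```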

Usage

```python
from transformers import AutoTokenizer

# trust_remote_code=True is required because DF-Arc ships custom tokenization logic
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)

# Example: Dialectal + MSA
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
# Note that "الله" is kept intact and phrases like "بسم الله" are segmented naturally.
```
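To see the "Arabic Token Tax" on your own text, you can compare against GPT-4's cl100k_base encoding via the tiktoken package; counts will vary with the input, and `tokenizer` is the DF-Arc tokenizer loaded above.

```python
import tiktoken  # pip install tiktoken

text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

print("cl100k_base:", len(enc.encode(text)), "tokens")
print("DF-Arc:", len(tokenizer.tokenize(text)), "tokens")
```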

Citation

If you use DF-Arc, please cite our paper: The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models (Dataflare Lab, 2026).
