DF-Arc v1.1: Morphology-Aware Arabic Tokenizer

DF-Arc is a specialized tokenizer for Arabic LLMs that minimizes the "Arabic Token Tax". By combining morphological pre-tokenization with PMI-based phrase merging, it achieves near 1:1 fertility (0.83 tokens per word on dialectal text), preserving semantic coherence better than GPT-4's cl100k tokenizer or standard BERT tokenizers.
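The exact morphological rules are not documented on this card; as a rough sketch of what clitic-level pre-tokenization looks like, here is a toy greedy splitter. The prefix inventory and the greedy strategy are assumptions for illustration; only the `_` morpheme marker is taken from the output shown under Usage.

```python
# Hypothetical clitic inventory; DF-Arc's real rules are richer and
# protect entities such as "الله" from splitting (see "Embedded Protections").
PREFIXES = ("ال", "و", "ف", "ب", "ل", "ك")

def split_clitics(word: str) -> str:
    """Greedily strip common proclitics, joining morphemes with '_'."""
    parts = []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            # Only strip when a plausible stem (>= 2 chars) remains.
            if word.startswith(p) and len(word) - len(p) >= 2:
                parts.append(p)
                word = word[len(p):]
                stripped = True
                break
    parts.append(word)
    return "_".join(parts)

print(split_clitics("بحب"))     # -> ب_حب
print(split_clitics("الرحمن"))  # -> ال_رحمن
```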

New in v1.1

  • PMI-Powered Phrase Merging: Learns phrases from statistical association (pointwise mutual information) rather than raw frequency; a toy version is sketched after this list.
  • Embedded Protections: Built-in protection for sensitive entities (e.g., "Allah", "Mohamed") and common trademarks without external files.
  • Enhanced Dialect Support: Trained on a broader corpus including Egyptian dialogue, songs, and feedback datasets.
  • Self-Contained: No extra config files needed; just load and go.
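The sketch below assumes the standard definition PMI(a, b) = log2(p(a, b) / (p(a) · p(b))) over adjacent token pairs; the `min_count` and `threshold` values are illustrative, not DF-Arc's actual training settings.

```python
import math
from collections import Counter

def pmi_merge_candidates(tokens, min_count=5, threshold=3.0):
    """Score adjacent token pairs by PMI and return those worth merging."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    candidates = {}
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # too rare for a reliable PMI estimate
        # PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) )
        pmi = math.log2((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= threshold:
            candidates[(a, b)] = pmi
    return candidates
```

Pairs whose PMI clears the threshold (e.g. strongly coupled formulae like "بسم الله") become merge candidates even when they are rarer than generic high-frequency pairs, which is the point of scoring by association rather than count.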

Performance

| Model | Fertility (lower is better) | Efficiency vs GPT-4 |
|---|---|---|
| DF-Arc v1.1 | 0.83 | +77.6% |
| GPT-4 (cl100k) | 3.69 | Baseline |
| AraBERT v2 | 1.56 | - |
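Fertility here is the average number of tokens produced per word. Assuming whitespace-delimited words (the card does not state the exact convention or evaluation corpus), the metric can be reproduced as:

```python
def fertility(tokenizer, texts):
    """Average tokens per whitespace-delimited word (lower is better)."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words
```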

Usage

```python
from transformers import AutoTokenizer

# trust_remote_code=True is required because DF-Arc ships custom tokenization logic
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc", trust_remote_code=True)

# Example: Dialectal + MSA
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
# Note that "الله" is kept intact and phrases like "بسم الله" are segmented naturally.
```
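To see the "Arabic Token Tax" on your own text, you can compare against GPT-4's cl100k_base encoding via the tiktoken package; counts will vary with the input, and `tokenizer` is the DF-Arc tokenizer loaded above.

```python
import tiktoken  # pip install tiktoken

text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"
enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoding

print("cl100k_base:", len(enc.encode(text)), "tokens")
print("DF-Arc:", len(tokenizer.tokenize(text)), "tokens")
```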

Citation

If you use DF-Arc, please cite our paper: The Arabic Token Tax: Quantifying Tokenization Inefficiency in Large Language Models (Dataflare Lab, 2026).
