EZ-Tokenizer: 3.47 Chars/Token with 100% Reconstruction

"Go ahead, try to break it. I dare you." - A tokenizer so efficient, it feels like cheating.

πŸš€ Performance Highlights

  • 3.47 characters per token (measured on the 1.7M-character test corpus below)
  • 100% perfect reconstruction on all test cases
  • 50K vocab size (smaller, smarter, faster)
  • 264K tokens/second processing speed

πŸ’₯ Benchmark This!

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("johnnyman1100/EZ-Tokenizer_The_Tokenizer")

# Test it yourself
text = "Your text here"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded.ids)

assert text == decoded  # Try to make this fail, I'll wait...
print(f"Compression: {len(text)/len(encoded.ids):.2f} chars/token")

πŸ† Challenge

Find any text where this tokenizer:

  1. Fails to reconstruct perfectly, or
  2. Gets worse compression than DeepSeek/others

First to report a verified case gets a shoutout!
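If you want to probe the two failure modes above systematically, here is a minimal, tokenizer-agnostic harness sketch. It takes plain `encode`/`decode` callables so it works with any tokenizer; the byte-level stand-in below is only for illustration (trivially lossless, but poor compression), not the EZ-Tokenizer itself.

```python
def roundtrip_report(encode, decode, texts):
    """Check lossless reconstruction and report compression per text.

    encode: str -> list of token ids; decode: list of ids -> str.
    Returns a list of (text, ok, chars_per_token) tuples.
    """
    results = []
    for text in texts:
        ids = encode(text)
        ok = decode(ids) == text
        cpt = len(text) / len(ids) if ids else float("inf")
        results.append((text, ok, cpt))
    return results

# Illustrative stand-in tokenizer: one token per UTF-8 byte.
def byte_encode(s):
    return list(s.encode("utf-8"))

def byte_decode(ids):
    return bytes(ids).decode("utf-8")

# Edge cases that commonly trip tokenizers: accents, emoji,
# mixed scripts, tabs, and zero-width characters.
edge_cases = [
    "naïve café 🤖",
    "def f(x):\n\treturn x",
    "日本語とEnglishの混在",
    "\u200bzero-width\u200bspace",
]

for text, ok, cpt in roundtrip_report(byte_encode, byte_decode, edge_cases):
    print(f"{'OK ' if ok else 'FAIL'} {cpt:.2f} chars/token: {text!r}")
```

To test EZ-Tokenizer itself, pass `lambda s: tokenizer.encode(s).ids` and `tokenizer.decode` from the snippet above in place of the byte-level functions.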

πŸ“Š Technical Details

  • Vocabulary: 50,000 tokens
  • Tested on: 1.7M+ characters of mixed content
  • Perfect reconstruction on all test cases
  • 1.23x faster than DeepSeek's tokenizer
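Throughput figures like the 264K tokens/second above depend on hardware, so a timing sketch helps you reproduce them locally. This is a generic helper that times any `encode` callable; the whitespace-splitting stand-in and the repeated corpus are placeholders, not the actual benchmark setup.

```python
import time

def tokens_per_second(encode, corpus, repeats=3):
    """Time encode() over the corpus; return the best tokens/sec of several runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        n_tokens = sum(len(encode(text)) for text in corpus)
        elapsed = time.perf_counter() - start
        best = max(best, n_tokens / elapsed)
    return best

# Illustrative stand-in: whitespace splitting instead of a real tokenizer.
corpus = ["the quick brown fox jumps over the lazy dog"] * 10_000
rate = tokens_per_second(lambda s: s.split(), corpus)
print(f"{rate:,.0f} tokens/second")
```

Swap in `lambda s: tokenizer.encode(s).ids` and a real corpus to compare against the numbers reported here.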

πŸ€” Why This Matters

Because in a world of bloated models, efficiency still wins. This tokenizer shows you don't need a 100K+-token vocabulary to achieve perfect reconstruction and strong compression.

βš–οΈ License

MIT


"I didn't believe it either until I saw the benchmarks." - You, probably
