zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
Abstract
zip2zip is a framework that dynamically adjusts the token vocabulary of LLMs at inference time using LZW compression, reducing token sequence length and improving inference speed.
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust their token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20–60%, with significant improvements in inference latency.
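To make component (1) concrete, below is a minimal Python sketch of LZW-style compression over base token IDs, in which recurring runs of tokens are merged into new hypertoken IDs placed above the base vocabulary. The function name `hypertokenize`, the `BASE_VOCAB_SIZE` constant, and the `max_hyper_len` cap are illustrative assumptions, not taken from the zip2zip implementation.

```python
# Minimal sketch of LZW-style hypertokenization over base token IDs.
# BASE_VOCAB_SIZE and max_hyper_len are hypothetical parameters for illustration.

BASE_VOCAB_SIZE = 32_000  # assumed size of the model's base vocabulary


def hypertokenize(token_ids, max_hyper_len=4):
    """Greedily merge recurring token runs into hypertokens, LZW-style.

    Returns the compressed sequence and a table mapping each new
    hypertoken ID to the tuple of base token IDs it stands for.
    """
    table = {}                 # (t1, t2, ...) -> hypertoken id
    next_id = BASE_VOCAB_SIZE  # hypertoken ids start above the base vocabulary
    output = []
    current = ()

    for tok in token_ids:
        candidate = current + (tok,)
        if len(candidate) == 1 or candidate in table:
            current = candidate  # keep extending the longest known match
        else:
            # emit the longest known match (a base token or a hypertoken)
            output.append(table[current] if len(current) > 1 else current[0])
            # register the failed extension as a new hypertoken
            if len(candidate) <= max_hyper_len:
                table[candidate] = next_id
                next_id += 1
            current = (tok,)

    if current:
        output.append(table[current] if len(current) > 1 else current[0])
    return output, {hid: seq for seq, hid in table.items()}
```

For example, `hypertokenize([5, 17, 9, 5, 17, 9, 5, 17, 9])` returns six IDs instead of nine: the later repeats of the runs `(5, 17)`, `(9, 5)`, and `(17, 9)` are each collapsed into a single hypertoken, which is where the compression on repetitive domain-specific text comes from.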
Community
🚀 Here comes the first dynamic tokenizer!
Token counts are inflating—inputs are too long, outputs are slow to generate, and costs are rising. Non-English languages suffer even more under fixed tokenizers.
We’re introducing zip2zip — a framework that enables large language models to dynamically adapt their tokenizer at inference time via LZW-style compression. Fewer tokens, faster inference, and lower cost — all without sacrificing model quality.
🔍 How?
• We compress tokens into reusable hypertokens on the fly
• Embed them dynamically at runtime (see the sketch below)
• Train the model to reason in compressed space
✅ Works with existing LLMs
📉 Cuts token counts by 20–60%
⚡ Improves inference latency
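As a rough illustration of the "embed them dynamically at runtime" step, the sketch below composes a hypertoken's embedding from the embeddings of the base tokens it replaces. The mean-pool-plus-projection composer and the class name `HypertokenEmbedding` are assumptions for illustration; the actual zip2zip embedding module may be architected differently.

```python
# Illustrative sketch: compute hypertoken embeddings at runtime by pooling the
# base-token embeddings they replace. Not the paper's actual architecture.
import torch
import torch.nn as nn


class HypertokenEmbedding(nn.Module):
    def __init__(self, base_embedding: nn.Embedding, hidden_size: int):
        super().__init__()
        self.base_embedding = base_embedding              # the LLM's input embedding table
        self.proj = nn.Linear(hidden_size, hidden_size)   # small trainable composer

    def forward(self, hypertoken_to_base: dict) -> dict:
        """Map each hypertoken ID to a vector built from its constituent base tokens."""
        out = {}
        for hid, base_ids in hypertoken_to_base.items():
            base_vecs = self.base_embedding(torch.tensor(base_ids))  # (n, hidden)
            out[hid] = self.proj(base_vecs.mean(dim=0))              # (hidden,)
        return out
```

With a Hugging Face model, something like `HypertokenEmbedding(model.get_input_embeddings(), model.config.hidden_size)` would produce one vector per entry of the table returned by the tokenizer sketch above; those vectors could then stand in for the hypertoken rows of the input embedding (and output projection) for the duration of the current context.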
Check out our paper and stay tuned — code and models coming soon!
📄 arXiv preprint https://arxiv.org/abs/2506.01084
🔗 https://github.com/epfl-dlab/zip2zip
From EPFL Data Science Lab 🇨🇭
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models (2025)
- Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning (2025)
- Embedding-to-Prefix: Parameter-Efficient Personalization for Pre-Trained Large Language Models (2025)
- HAMburger: Accelerating LLM Inference via Token Smashing (2025)
- Multi-Sense Embeddings for Language Models and Knowledge Distillation (2025)
- FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension (2025)
- Multi-Token Prediction Needs Registers (2025)