arxiv:2506.01084

zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression

Published on Jun 1
· Submitted by Saibo-creator on Jun 3

AI-generated summary

zip2zip is a framework that dynamically adjusts the token vocabulary of LLMs at inference time using LZW compression, reducing token sequence length and improving inference speed.

Abstract

Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60%, with significant improvements in inference latency.
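To make component (1) concrete, here is a minimal, illustrative sketch of LZW-style compression over base token IDs, where repeated token n-grams are folded into fresh hypertoken IDs on the fly. This is a toy under stated assumptions, not the paper's implementation: the function name is hypothetical, and zip2zip's actual dictionary management, hypertoken length limits, and ID assignment may differ.

```python
def lzw_hypertokenize(token_ids, base_vocab_size):
    """Compress base token IDs into a mix of base tokens and dynamically
    created 'hypertokens' (IDs >= base_vocab_size). Illustrative sketch only."""
    table = {}                 # tuple of base token IDs -> hypertoken ID
    next_id = base_vocab_size  # hypertoken IDs live above the base vocabulary
    output = []
    phrase = ()                # current phrase as a tuple of base token IDs

    for tok in token_ids:
        candidate = phrase + (tok,)
        if len(candidate) == 1 or candidate in table:
            phrase = candidate              # keep extending a known phrase
        else:
            # Emit the longest known phrase, then register the extension
            # as a new hypertoken that later repetitions can reuse.
            output.append(phrase[0] if len(phrase) == 1 else table[phrase])
            table[candidate] = next_id
            next_id += 1
            phrase = (tok,)
    if phrase:
        output.append(phrase[0] if len(phrase) == 1 else table[phrase])
    return output, table


# Repeated patterns shrink: 7 base tokens -> 5 (hyper)tokens here.
compressed, table = lzw_hypertokenize([5, 7, 5, 7, 5, 7, 9], base_vocab_size=32000)
print(compressed)  # [5, 7, 32000, 32000, 9]
```

A convenient property of LZW is that the decoder can rebuild the same dictionary as it goes, so a compressed sequence can be expanded back into base tokens, and then into text, without transmitting the table.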

Community

Paper author and submitter

🚀 Here comes the first dynamic tokenizer!

Token counts are inflating—inputs are too long, outputs are slow to generate, and costs are rising. Non-English languages suffer even more under fixed tokenizers.

We’re introducing zip2zip — a framework that enables large language models to dynamically adapt their tokenizer at inference time via LZW-style compression. Fewer tokens, faster inference, and lower cost — all without sacrificing model quality.

🔍 How?
• We compress tokens into reusable hypertokens on the fly
• Embed them dynamically at runtime (see the sketch after this list)
• Train the model to reason in compressed space
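One simple way to realize the second bullet above (dynamic embedding of hypertokens) is to derive each hypertoken's embedding at runtime from the embeddings of the base tokens it expands to. The sketch below uses mean pooling plus a small learned projection purely as an assumption; `HypertokenEmbedder`, `table_inv`, and the pooling choice are illustrative, and zip2zip's actual hypertoken embedding module may be more elaborate.

```python
import torch
import torch.nn as nn

class HypertokenEmbedder(nn.Module):
    """Illustrative sketch: embed hypertokens on the fly from the base-token
    embeddings they stand for (mean pooling + projection is an assumption)."""

    def __init__(self, base_embedding: nn.Embedding):
        super().__init__()
        self.base_embedding = base_embedding        # the LLM's input embeddings
        d = base_embedding.embedding_dim
        self.proj = nn.Linear(d, d)                 # trained during finetuning

    def embed_hypertoken(self, constituent_ids):
        """Embed one hypertoken given the base token IDs it expands to."""
        base = self.base_embedding(torch.tensor(constituent_ids))  # (len, d)
        return self.proj(base.mean(dim=0))                         # (d,)

    def embed_sequence(self, ids, table_inv, base_vocab_size):
        """Embed a compressed sequence: base tokens go through the normal
        embedding table; hypertokens are embedded from their constituents.
        `table_inv` maps hypertoken ID -> tuple of base token IDs."""
        rows = [
            self.base_embedding(torch.tensor(tok)) if tok < base_vocab_size
            else self.embed_hypertoken(list(table_inv[tok]))
            for tok in ids
        ]
        return torch.stack(rows)                                   # (seq_len, d)
```

Deriving hypertoken embeddings from the base embeddings keeps them in roughly the same representation space as ordinary tokens, which is consistent with the paper's finding that an existing LLM can be zip2zip-fied with parameter-efficient finetuning in about 10 GPU-hours.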

✅ Works with existing LLMs
📉 Cuts token counts by 20–60%
⚡ Improves inference latency

Check out our paper and stay tuned — code and models coming soon!
📄 arXiv preprint https://arxiv.org/abs/2506.01084
🔗 https://github.com/epfl-dlab/zip2zip

From EPFL Data Science Lab 🇨🇭


