Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Abstract
The DateAugBench benchmark reveals how modern tokenizers fragment dates, degrading the accuracy of temporal reasoning in large language models; larger models compensate for this fragmentation more effectively.
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6,500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates, such as historical and futuristic ones. Further, we find that the larger the model, the faster this emergent date abstraction heals date fragments. Lastly, we observe a reasoning path (year → month → day) that LLMs follow to assemble date fragments, which typically differs from human interpretation.
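The abstract does not spell out how the date fragmentation ratio is computed, but a minimal sketch of the idea might look like the following. It assumes (this is our assumption, not the authors' definition) that a date component counts as fragmented when its character span is split across more than one token, and reports the fraction of fragmented components. The helper `fragmentation_ratio`, the hard-coded component spans, and the choice of GPT-2's tokenizer are all illustrative.

```python
# Sketch of a date fragmentation ratio (assumed definition, not the paper's exact one):
# a component (year, month, day) is "fragmented" if more than one token covers its span.
# Requires a fast tokenizer so that character offsets are available.
from transformers import AutoTokenizer

def fragmentation_ratio(tokenizer, date_str, component_spans):
    """component_spans: list of (start, end) character spans for year, month, day."""
    enc = tokenizer(date_str, add_special_tokens=False, return_offsets_mapping=True)
    offsets = enc["offset_mapping"]
    fragmented = 0
    for start, end in component_spans:
        # Tokens whose character range overlaps this component.
        covering = [(s, e) for s, e in offsets if s < end and e > start]
        if len(covering) > 1:
            fragmented += 1
    return fragmented / len(component_spans)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")   # any BPE tokenizer with a fast variant
    date = "20250312"                             # YYYYMMDD
    spans = [(0, 4), (4, 6), (6, 8)]              # year, month, day
    print(tok.tokenize(date))                     # inspect the raw fragments
    print(fragmentation_ratio(tok, date, spans))  # 1.0 means every component was split
```

A ratio of 0 would mean each of the year, month, and day survives as (part of) a single token, while higher values indicate the kind of fragmentation (e.g., 202 / 503 / 12) the paper links to accuracy drops.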
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits (2025)
- ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy (2025)
- Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models (2025)
- Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework (2025)
- Scaling Reasoning can Improve Factuality in Large Language Models (2025)
- Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs (2025)
- Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM (2025)