Retrofitting (Large) Language Models with Dynamic Tokenization Paper • 2411.18553 • Published Nov 27, 2024 • 2
Cross-Tokenizer Distillation via Approximate Likelihood Matching Paper • 2503.20083 • Published Mar 25, 2025 • 1
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation Paper • 2406.16678 • Published Jun 24, 2024 • 16
Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation Paper • 2305.18893 • Published May 30, 2023 • 2
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models Paper • 2305.14214 • Published May 23, 2023
HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response Paper • 2210.04573 • Published Oct 10, 2022
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models Paper • 2112.06598 • Published Dec 13, 2021 • 1