Tokenizer splits single-letter words on bytes
#1 opened by nshmyrevgmail
#!/usr/bin/env python3
from transformers import AutoTokenizer, AutoModel
device = 'cpu'
model = AutoModel.from_pretrained("RuModernBERT-small").to(device)
tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")
model.eval()
text = "а и у"
# Decode each token id individually to inspect the tokenization
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
The result of this code is:
['[CLS]', '�', '�', ' ', '�', '�', ' у', '[SEP]']
It splits «а» and «и» into separate bytes. Is this intentional tokenization, or a bug in transformers?
Transformers version: 4.51.3
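For reference, here is a small extra check (a sketch, assuming this is the usual byte-level BPE tokenizer shipped with the model): decoding each id on its own cannot reassemble a multi-byte UTF-8 character, so single-byte tokens print as '�', while decoding the whole sequence should reassemble the bytes back into the original text.

#!/usr/bin/env python3
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")

text = "а и у"
ids = tokenizer(text)['input_ids']

# Per-token decode: a Cyrillic letter split across byte-level tokens
# shows up as '�' because each token alone is an incomplete UTF-8 sequence.
print([tokenizer.decode(i) for i in ids])

# Raw token strings from the vocabulary (byte-level symbols, no decoding).
print(tokenizer.convert_ids_to_tokens(ids))

# Decoding the full sequence joins the bytes before converting to text.
print(tokenizer.decode(ids, skip_special_tokens=True))

If the split is just the byte-level fallback for characters that have no merged token in the vocabulary, the full-sequence decode should round-trip and the per-token '�' would only be a display artifact; my question is whether splitting single Cyrillic letters into bytes is expected for this tokenizer.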