
Tokenizer splits single-letter words into bytes

#1 by nshmyrevgmail - opened
#!/usr/bin/env python3

from transformers import AutoTokenizer, AutoModel

device = 'cpu'
model = AutoModel.from_pretrained("RuModernBERT-small").to(device)
tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")
model.eval()

text = "а и у"
# Decode each token id individually to see how the text was split.
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])

The output of this code is:

['[CLS]', '�', '�', ' ', '�', '�', ' у', '[SEP]']

It splits а and и into separate bytes. Is this intentional tokenization, or a bug in transformers?

transformers version 4.51.3
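
In my understanding, the individual byte tokens should recombine into valid UTF-8 when the whole id sequence is decoded at once, so the question is mainly about the per-character byte splitting itself. A minimal sketch to check that (assuming the same RuModernBERT-small checkpoint as above):

#!/usr/bin/env python3

# Sketch: compare per-id decoding with decoding the full sequence.
# Assumes the same "RuModernBERT-small" checkpoint as in the snippet above.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")

text = "а и у"
ids = tokenizer(text)['input_ids']

# Raw token strings: multi-byte UTF-8 characters may appear as separate byte-level tokens.
print(tokenizer.convert_ids_to_tokens(ids))

# Decoding all ids together reassembles the bytes into readable text.
print(tokenizer.decode(ids, skip_special_tokens=True))

If the full decode comes back as "а и у", the � glyphs above are just an artifact of decoding each byte token in isolation.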
