Tokenizer splits single-letter words on bytes
#1 opened by nshmyrevgmail
#!/usr/bin/env python3
from transformers import AutoTokenizer, AutoModel
device = 'cpu'
model = AutoModel.from_pretrained("RuModernBERT-small").to(device)
tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")
model.eval()
text = "а и у"
# Decode each token id individually to inspect the tokenization
print([tokenizer.decode(x) for x in tokenizer(text)['input_ids']])
The result of this code is:
['[CLS]', '�', '�', ' ', '�', '�', ' у', '[SEP]']
It splits «а» and «и» into separate bytes. Is this intentional tokenization, or a bug in transformers?
Transformers version: 4.51.3
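For reference, here is a small extra check (a sketch, assuming this is the usual byte-level BPE tokenizer shipped with the model): decoding each id on its own cannot reassemble a multi-byte UTF-8 character, so single-byte tokens print as '�', while decoding the whole sequence should reassemble the bytes back into the original text.

#!/usr/bin/env python3
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RuModernBERT-small")

text = "а и у"
ids = tokenizer(text)['input_ids']

# Per-token decode: a Cyrillic letter split across byte-level tokens
# shows up as '�' because each token alone is an incomplete UTF-8 sequence.
print([tokenizer.decode(i) for i in ids])

# Raw token strings from the vocabulary (byte-level symbols, no decoding).
print(tokenizer.convert_ids_to_tokens(ids))

# Decoding the full sequence joins the bytes before converting to text.
print(tokenizer.decode(ids, skip_special_tokens=True))

If the split is just the byte-level fallback for characters that have no merged token in the vocabulary, the full-sequence decode should round-trip and the per-token '�' would only be a display artifact; my question is whether splitting single Cyrillic letters into bytes is expected for this tokenizer.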