Bug in the tokenizer: Special tokens are not being encoded

#14 opened by cassanof

Hello, I found a bug in the tokenizer: special tokens are not being encoded as such:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2-Instruct")
>>> print(tokenizer.eos_token)
[EOS]
>>> print(tokenizer.eos_token_id)
163585
>>> print(tokenizer.encode('[EOS]'))
[58, 85521, 60]

This will cause many issues downstream.
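The failure mode above can be sketched without the real model: when a special token like `[EOS]` is not registered in the encoder's vocabulary, a greedy matcher falls back to splitting it into ordinary pieces. This is a toy illustration only; the piece IDs (58, 85521, 60) are taken from the output above, and the encoder logic is a hypothetical stand-in, not Kimi-K2's actual tokenizer.

```python
# Toy greedy longest-match encoder illustrating the reported bug.
# IDs mirror the REPL output above; the vocab and logic are hypothetical.
SPECIALS = {"[EOS]": 163585}            # registered special tokens
PIECES = {"[": 58, "EOS": 85521, "]": 60}  # ordinary vocabulary pieces

def encode(text: str, split_special_tokens: bool = False) -> list[int]:
    """Greedy longest-match encoding over a merged vocabulary."""
    vocab = dict(PIECES)
    if not split_special_tokens:
        vocab.update(SPECIALS)          # fixed behavior: specials win
    ids, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"unencodable character: {text[i]!r}")
    return ids

print(encode("[EOS]"))                             # → [163585]
print(encode("[EOS]", split_special_tokens=True))  # → [58, 85521, 60]
```

With the special token registered, `encode("[EOS]")` yields the single ID `163585`; without it, the string degrades into the three ordinary pieces seen in the bug report.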

Moonshot AI org

@cassanof Hi, thanks a lot for pointing out this bug. We have updated our code, and it should now work as expected.

Moonshot AI org

@bigmoyan I think it's a feature, but too much downstream code depends on it being a bug, so it's OK.
