Is the EOS token in tokenizer.json and tokenizer_config.json correct?

#4
by Noeda - opened

For context see: https://github.com/ggml-org/llama.cpp/issues/14044#issuecomment-2952938086 where I noticed this.

Are the token names correctly set in the tokenizer.json and tokenizer_config.json?

In llama.cpp I saw this (llama-tokenize is a tool that tokenizes a prompt so you can see the numerical token IDs):

```
$ llama-tokenize --model dots1-instruct-q4.gguf --prompt "<|userprompt|><|endofuserprompt|><|response|><|endofresponse|><|im_end|><|im_start|>" --log-disable
151646 -> '<|userprompt|>'
151647 -> '<|endofuserprompt|>'
151648 -> '<|response|>'
151649 -> '<|endofresponse|>'
151645 -> '<|im_end|>'
151644 -> '<|im_start|>'
```

But when I then compare with tokenizer_config.json (https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/main/tokenizer_config.json):

"151646" -> I can't find this in tokenizer_config.json
"151647" -> "<|userprompt|>" (this is <|endofuserprompt|> in llama.cpp side)
"151648" -> "<|endofuserprompt|>" (this is <|response|> in llama.cpp side)
"151649" -> "<|response|>" (this is <|endofresponse|> in llama.cpp side)
"151645" -> "<|im_end|>"
"151644" -> "<|im_start|>"

There might be more tokens off; these are just what I saw on an initial check. I was hacking together llama.cpp support, and the WIP .gguf converter code, using defaults, picked up 151650 as the stopping token (which here in tokenizer_config.json is <|endofresponse|>, but llama.cpp has 151650 -> '<|system|>' instead). This made llama.cpp not realize when to stop generating.
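For reference, here's how I'd check which EOS token ID a converted .gguf actually picked up. This is a sketch against the gguf-py package's GGUFReader; the field layout may differ between gguf-py versions, so treat the indexing as an assumption:

```python
from gguf import GGUFReader  # gguf-py, ships with llama.cpp

reader = GGUFReader("dots1-instruct-q4.gguf")  # file name from the example above
field = reader.fields["tokenizer.ggml.eos_token_id"]
# For a scalar field, the value sits in one of the raw parts;
# field.data holds the index of the part containing the value.
eos_id = int(field.parts[field.data[0]][0])
print("GGUF eos_token_id:", eos_id)
```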

Empirically, I saw that the model wanted to output 151649, which is why I think it is perhaps meant to be <|endofresponse|> instead of <|response|>.

The base model files might be correct? 151649 is <|endofresponse|> on the base model side here: https://huggingface.co/rednote-hilab/dots.llm1.base/blob/main/tokenizer_config.json

Edit: One addition: I also see 151645 as eos_token_id in config.json and generation_config.json, which is <|im_end|>. That's a ChatML-style ending tag, if my memory is correct. But this model does not use ChatML, does it? Off the top of my head I don't remember whether the llama.cpp conversion scripts inspect the token ID fields in these files, but I could imagine other tooling does. Should it be set to the <|endofresponse|> token (151649) as well? Code links: https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/8a162c1ff5a8a22fa21fdc1bf222ce7cc08e2539/config.json#L8 and https://huggingface.co/rednote-hilab/dots.llm1.inst/blob/8a162c1ff5a8a22fa21fdc1bf222ce7cc08e2539/generation_config.json#L7
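A quick way to dump the EOS-related fields from all three files side by side (plain JSON reads; assumes local copies with the repo's file names):

```python
import json

# Print every key mentioning "eos" from each config file.
for name in ("config.json", "generation_config.json", "tokenizer_config.json"):
    with open(name) as f:
        data = json.load(f)
    eos_fields = {k: v for k, v in data.items() if "eos" in k.lower()}
    print(name, "->", eos_fields)
```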

I can confirm this one; it seems the tokenizer in https://huggingface.co/rednote-hilab/dots.llm1.base/blob/main/tokenizer_config.json#L60 has the right mapping, unlike the inst one.

rednote-hilab org • edited Jun 9

@Noeda @Luni Thank you for reporting this issue. We use <|endofresponse|> as the stop token. We have now corrected the indices for some special tokens.

P.S. The transformers tokenizer is unaffected by this change.
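For anyone who wants to verify on the transformers side, a sketch like the following should show <|endofresponse|> resolving to 151649 (whether trust_remote_code is needed for this model is an assumption; add it if loading complains):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("rednote-hilab/dots.llm1.inst")

# Round-trip the special tokens in question.
for token in ("<|response|>", "<|endofresponse|>", "<|im_end|>"):
    print(token, "->", tok.convert_tokens_to_ids(token))

print("eos_token:", tok.eos_token, "->", tok.eos_token_id)
```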

Awesome. When I have the time, I'll confirm on the llama.cpp side that it's all good (likely today in a few hours, but if not, then later in the week; I have some international travel happening in two days that's taking a lot of my time). Looking at the changes just now, they look good, but I have not tried them with the llama.cpp conversion code yet. These changes help in that I don't have to add any special hacks to llama.cpp to work around the tokens.

But once I do have time to actually test, I'll come back here and close this discussion to confirm all is good, assuming I don't find anything else token-related :)

Thanks!

Just now I tested end-to-end: HF safetensors -> bf16 .gguf files (using the WIP llama.cpp convert_hf_to_gguf.py) -> quantized to q4, then ran llama-cli and llama-server, and I saw the tokens picked up correctly. I already saw this in the metadata during the conversion phase, but wanted an actual inference run before declaring it all good.

To my knowledge, the EOS tokens are now set correctly in this model here on Hugging Face. Thanks for the fixes! If I discover something else, I'll poke you again with a new issue or discussion.

Noeda changed discussion status to closed
