When using vllm to run inference with 'Llama3-ChatQA-1.5-8B', generation does not stop when the special token '<|im_end|>' is encountered, as shown in the figure below. This PR adds <|im_end|> to the tokenizer; the corresponding mapping also needs to be added to generation_config.json.
@zjyhf To be clear, are you saying this model has an incorrect mapping of token id 128010 to the string value "<|reserved_special_token_5|>"? If the mapping is not incorrect, then you can use vllm's "stop" param to pass extra tokens you want to treat as stop tokens in addition to EOS.
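For reference, a minimal sketch of what the "stop" approach looks like with vllm's offline API (the model path, prompt, and the exact stop string / token id are illustrative and depend on your setup):

```python
# Sketch: stopping on an extra string via SamplingParams, in addition to EOS.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama3-ChatQA-1.5-8B")  # illustrative model path

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=256,
    # Stop generation when this string is produced, in addition to the model's EOS.
    stop=["<|im_end|>"],
    # Or, if the relevant token id is known (e.g. 128010 from the discussion above),
    # stop on the id directly instead:
    # stop_token_ids=[128010],
)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)
```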