Question about the PR [update additional_special_tokens (#8)]

#10
by Qingyun - opened

This PR added additional_special_tokens, which seems to result in a mismatch between the tokenizer length and the vocabulary size with transformers==4.31.0:

  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|action_start|>",
    "<|action_end|>",
    "<|interpreter|>",
    "<|plugin|>"
  ],
ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|action_start|>', '<|action_end|>', '<|interpreter|>', '<|plugin|>']}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92550
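The discrepancy above (92544 vs. 92550) comes from how transformers counts tokens: `vocab_size` reflects only the base (sentencepiece) vocabulary, while `len(tokenizer)` also counts added special tokens, which receive fresh IDs appended on top. A minimal toy sketch of that bookkeeping (this `ToyTokenizer` is hypothetical, not the real `InternLM2Tokenizer`):

```python
# Hypothetical minimal model of the tokenizer bookkeeping: the base
# vocab_size is fixed, and each added special token gets a new id
# starting at vocab_size.
class ToyTokenizer:
    def __init__(self, vocab_size, additional_special_tokens=()):
        self.vocab_size = vocab_size  # size of the base vocabulary
        self.added_tokens = {}        # token -> id, ids start at vocab_size
        for tok in additional_special_tokens:
            self.added_tokens[tok] = self.vocab_size + len(self.added_tokens)

    def __len__(self):
        # mirrors transformers: len(tokenizer) = vocab_size + added tokens
        return self.vocab_size + len(self.added_tokens)


special = ["<|im_start|>", "<|im_end|>", "<|action_start|>",
           "<|action_end|>", "<|interpreter|>", "<|plugin|>"]
tok = ToyTokenizer(vocab_size=92544, additional_special_tokens=special)
print(tok.vocab_size, len(tok))  # 92544 92550
```

So any ID in the 92544–92549 range points past the end of an embedding matrix sized to `vocab_size`.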

It seems that the additional special tokens are assigned new token IDs beyond vocab_size, which leaves len(tokenizer) out of sync with the model's input embeddings. However, this PR appears to work correctly with transformers 4.33.2, where the bug is resolved as described in this issue.
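If you must stay on an older transformers version, the usual workaround after adding special tokens is to resize the model's input embeddings to `len(tokenizer)` via `model.resize_token_embeddings`. A sketch, shown here on a tiny randomly initialized GPT-2 config purely to keep it self-contained (the same call applies to the real model loaded with `AutoModelForCausalLM`):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model: base vocabulary of 10 tokens.
config = GPT2Config(vocab_size=10, n_embd=8, n_layer=1, n_head=1)
model = GPT2LMHeadModel(config)

# Pretend the tokenizer grew by 6 added special tokens: 10 -> 16.
new_len = 16  # in practice: len(tokenizer)
model.resize_token_embeddings(new_len)

# The embedding matrix now covers every token id the tokenizer can emit.
print(model.get_input_embeddings().weight.shape[0])  # 16
```

Note that the newly added rows are randomly initialized, so a model that was not trained with these tokens will not produce meaningful outputs for them without further fine-tuning.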

Qingyun changed discussion status to closed