Fix BaichuanTokenizer to fit transformers>=4.34

#8 · opened by xu-song
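With transformers>=4.34, loading the tokenizer through AutoTokenizer fails during initialization:
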
$ python predict_baichuan.py
Traceback (most recent call last):
  File "/workspace/baichuan/predict/predict_baichuan.py", line 14, in <module>
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False, trust_remote_code=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 755, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/tokenization_baichuan.py", line 75, in __init__
    super().__init__(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 367, in __init__
    self._add_tokens(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
    current_vocab = self.get_vocab().copy()
  File "/root/.cache/huggingface/modules/transformers_modules/tokenization_baichuan.py", line 109, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
  File "/root/.cache/huggingface/modules/transformers_modules/tokenization_baichuan.py", line 105, in vocab_size
    return self.sp_model.get_piece_size()
AttributeError: 'BaichuanTokenizer' object has no attribute 'sp_model'
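From the call chain, the failure appears to come from a behavior change in transformers 4.34: PreTrainedTokenizer.__init__ now calls self._add_tokens(), which in turn calls get_vocab() and vocab_size, before the subclass has finished its own __init__. In the current tokenization_baichuan.py, self.sp_model is only created after super().__init__() returns, so vocab_size hits the missing attribute.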

Related issue https://github.com/InternLM/InternLM/pull/419/files
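The same reordering used in the InternLM fix linked above should work here: build the SentencePiece model before calling super().__init__(), so that get_vocab()/vocab_size are already usable when the base constructor needs them. A minimal sketch of the intended shape (the argument list and helper methods are abbreviated and may not match the actual tokenization_baichuan.py exactly):

```python
# Sketch only: the point is that sp_model is created *before* super().__init__().
import sentencepiece as spm
from transformers import PreTrainedTokenizer


class BaichuanTokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}

    def __init__(
        self,
        vocab_file,
        unk_token="<unk>",
        bos_token="<s>",
        eos_token="</s>",
        pad_token=None,
        sp_model_kwargs=None,
        **kwargs,
    ):
        self.vocab_file = vocab_file
        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
        # Load the SentencePiece model first: with transformers>=4.34 the base
        # constructor calls _add_tokens(), which needs get_vocab()/vocab_size,
        # so self.sp_model must already exist at that point.
        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
        self.sp_model.Load(vocab_file)
        super().__init__(
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text):
        return self.sp_model.encode(text, out_type=str)

    def _convert_token_to_id(self, token):
        return self.sp_model.PieceToId(token)

    def _convert_id_to_token(self, index):
        return self.sp_model.IdToPiece(index)

    # Remaining methods (save_vocabulary, special-token handling, etc.) stay as
    # they are in the current tokenization_baichuan.py.
```

Only the ordering inside __init__ changes; this keeps the tokenizer working on older transformers releases as well, since nothing there depends on sp_model being set late.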

GradientGuru changed pull request status to merged