Tokenizer can't be loaded - possibly related to recent Transformers versions
Trying to load the tokenizer from this model in Transformers 4.35.0 results in the following error:
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM-1B", trust_remote_code=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 755, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
return cls._from_pretrained(
File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/workspace/huggingface/modules/transformers_modules/OpenNLPLab/TransNormerLLM-1B/cf951417e7539e292188864a12171e2e2051917f/tokenization_baichuan.py", line 76, in __init__
super().__init__(
File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 367, in __init__
self._add_tokens(
File "/workspace/venv/pytorch2/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 467, in _add_tokens
current_vocab = self.get_vocab().copy()
File "/workspace/huggingface/modules/transformers_modules/OpenNLPLab/TransNormerLLM-1B/cf951417e7539e292188864a12171e2e2051917f/tokenization_baichuan.py", line 112, in get_vocab
for i in range(self.vocab_size)
File "/workspace/huggingface/modules/transformers_modules/OpenNLPLab/TransNormerLLM-1B/cf951417e7539e292188864a12171e2e2051917f/tokenization_baichuan.py", line 106, in vocab_size
return self.sp_model.get_piece_size()
AttributeError: 'BaiChuanTokenizer' object has no attribute 'sp_model'
>>> import transformers
>>> print(transformers.__version__)
4.35.0
>>>
I haven't tested earlier Transformers versions, but this "no attribute 'sp_model'" error is identical to one I had with another model, which turned out to be related to recent Transformers versions.
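In case it helps: judging by the traceback, recent Transformers versions (4.34 and later, I believe) register the special tokens inside PreTrainedTokenizer.__init__(), and that code path calls self.get_vocab() -> self.vocab_size -> self.sp_model before the Baichuan-style tokenizer has assigned self.sp_model, since that only happens after super().__init__() returns. A minimal sketch of the failing pattern (hypothetical class, not the actual tokenization_baichuan.py):

import sentencepiece as spm
from transformers import PreTrainedTokenizer

class BrokenSPTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # In Transformers >= 4.34, super().__init__() registers special tokens,
        # which calls self.get_vocab() -> self.vocab_size -> self.sp_model ...
        super().__init__(**kwargs)
        # ... but sp_model is only assigned afterwards, hence the AttributeError.
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        return {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}

    def _convert_id_to_token(self, index):
        return self.sp_model.IdToPiece(index)

# Instantiating BrokenSPTokenizer("tokenizer.model") on Transformers 4.35.0
# raises the same AttributeError shown in the traceback above.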
Note that your other model, TransNormerLLM-7B, does not have this problem:
>>> tokenizer = AutoTokenizer.from_pretrained("OpenNLPLab/TransNormerLLM-7B", trust_remote_code=True)
A new version of the following files was downloaded from https://huggingface.co/OpenNLPLab/TransNormerLLM-7B:
- tokenization_baichuan.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
>>>
Could you fix the tokenizer of this model so it works with recent Transformers versions, like TransNormerLLM 7B does?
Thanks in advance
TheBloke
Thanks for flagging this problem.
The root of the issue is the Transformers version. We'll be updating the tokenizer file for both the TransNormerLLM-1B and 385M models.
For a swift solution, check this link: https://github.com/baichuan-inc/Baichuan2/issues/204
We've updated the associated files to resolve the problem stemming from the Transformers version.
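Roughly, the fix described in that issue is to create self.sp_model before calling super().__init__() in tokenization_baichuan.py, so that get_vocab() already works when the base class registers special tokens during initialization on newer Transformers. A simplified, illustrative sketch (not the exact contents of the updated file):

import sentencepiece as spm
from transformers import PreTrainedTokenizer

class BaiChuanTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        self.vocab_file = vocab_file
        # Load the SentencePiece model first ...
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)
        # ... then let the base class register special tokens; get_vocab() now works.
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return self.sp_model.get_piece_size()

    def get_vocab(self):
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text):
        return self.sp_model.encode(text, out_type=str)

    def _convert_token_to_id(self, token):
        return self.sp_model.piece_to_id(token)

    def _convert_id_to_token(self, index):
        return self.sp_model.IdToPiece(index)

As a stopgap before pulling the updated files, pinning an older transformers release should also avoid the error, if the incompatibility indeed starts with the 4.34 tokenizer refactor.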