pyo3_runtime.PanicException: AddedVocabulary bad split

#14 by vberry

I hit this error (pyo3_runtime.PanicException: AddedVocabulary bad split) when I try to create embeddings from a local instance of the fastchat-t5 LLM.

Everything runs cleanly until I try to instantiate the vector DB, at which point I get the error. It seems to have something to do with how the tokenizer is constructed, but I can't see how. Can anyone help?

---- CODE ----
model_name = "./aiml/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
model_name=model_name,
model_kwargs=model_kwargs,
encode_kwargs=encode_kwargs,
)

vectordb = Chroma.from_documents(
documents=texts,
embedding=embeddings,
persist_directory=persist_directory
)

Did you manage to solve it?

I get the same error:

PanicException                            Traceback (most recent call last)
<ipython-input> in <cell line: 1>()
----> 1 res = pipe(prompt)

9 frames
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    502         )
    503
--> 504         encodings = self._tokenizer.encode_batch(
    505             batch_text_or_text_pairs,
    506             add_special_tokens=add_special_tokens,

PanicException: AddedVocabulary bad split

Instead of using AutoTokenizer, use T5Tokenizer!
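
A minimal sketch of that workaround, assuming the local model path from the original post and a text2text-generation pipeline (the prompt is a placeholder): loading the slow, SentencePiece-based T5Tokenizer sidesteps the Rust-backed fast tokenizer whose AddedVocabulary handling raises the panic.

---- CODE ----
from transformers import AutoModelForSeq2SeqLM, T5Tokenizer, pipeline

model_path = "./aiml/models--lmsys--fastchat-t5-3b-v1.0/snapshots/0b1da230a891854102d749b93f7ddf1f18a81024"

# T5Tokenizer is the slow (Python/SentencePiece) implementation; unlike
# the Rust-backed fast tokenizer, it does not panic on this added vocab.
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
res = pipe("What is a vector database?")  # placeholder prompt
print(res[0]["generated_text"])

Passing use_fast=False to AutoTokenizer.from_pretrained should select the same slow tokenizer class.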
