Has the tokenizer of the base model (Mistral-7B-v0.1) been retrained?
#37 opened by LH0521
Hi,
I noticed that Mistral-7B-v0.1 is used as the base model. However, the original Mistral-7B-v0.1 uses BPE tokenization, whereas NV-Embed-v1 appears to use a word-by-word mapping instead.
Did you retrain the tokenizer? If so, was it because the latent layer needs to integrate words better?
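For reference, this is roughly how I compared the two tokenizers (a minimal sketch; the sample sentence is just for illustration, and the NV-Embed-v1 checkpoint may additionally require accepting its license / `trust_remote_code` for the model itself):

```python
from transformers import AutoTokenizer

# Load the tokenizers of both checkpoints from the Hub
base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
embed_tok = AutoTokenizer.from_pretrained("nvidia/NV-Embed-v1")

text = "Tokenization differences between the two models."

# Compare the token pieces each tokenizer produces for the same input
print("Mistral-7B-v0.1:", base_tok.tokenize(text))
print("NV-Embed-v1:    ", embed_tok.tokenize(text))

# Also check whether the vocabulary sizes differ
print("Vocab sizes:", len(base_tok), len(embed_tok))
```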
Thanks!