tokenizer

#5
by ctranslate2-4you - opened

Can we please get an explanation other than "we don't plan to release the tokenizer," as you stated before closing a prior issue? Is this just to serve LM Studio and all those sycophants, or what? According to this, y'all freely give out the tokenizers.

(screenshot: image.png)

Mistral AI_ org

Hi there, we do provide the tokenizers, just not necessarily in other formats. mistral-common is our official implementation of our tokenizers, with the exact tokenization process that we can be sure works as expected. It's not that we won't release tokenizers; it's that we release the tokenizer in the format we are confident is most accurate. I hope this helps!

Mistral AI_ org

Just in case, this is part of the tokenizer: https://huggingface.co/mistralai/Magistral-Small-2506/blob/main/tekken.json
But it goes together with mistral-common the same way the HF tokenizer format goes with the transformers implementation. The tokenizer is available, just in a different format 👍
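To make the point above concrete: the same text can tokenize differently depending on which vocabulary and matching rules an implementation uses, which is why pairing tekken.json with the official mistral-common implementation matters. Below is a toy, stdlib-only sketch (the vocabularies and the greedy longest-match rule are illustrative assumptions, not Mistral's actual algorithm):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary.

    Falls back to single characters when no vocab entry matches.
    This is a toy scheme for illustration, not Mistral's real tokenizer.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(text[i])  # no match: emit one character
            i += 1
    return tokens


# Two hypothetical vocabularies produce different splits of the same word:
vocab_a = {"un", "happy", "unhappy"}
vocab_b = {"un", "ha", "ppy"}

print(greedy_tokenize("unhappy", vocab_a))  # ['unhappy']
print(greedy_tokenize("unhappy", vocab_b))  # ['un', 'ha', 'ppy']
```

A model fine-tuned against one split but served with another would see different token IDs for the same text, so shipping the exact implementation (mistral-common) alongside the vocabulary file (tekken.json) removes that ambiguity.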

That does help clarify things. Another post by a teammate didn't go into that level of detail, but you've cleared it up. Thanks. Am I correct in understanding that when people like Unsloth upload their models, they're basically creating an HF-compatible tokenizer? I noticed that they have the traditional .json files, for example.

Would there possibly be better performance using your official tokenizer format, or not?

Thanks again.

Does anyone know if the tokenizer is the same one used in Mistral Small 3 2503, so that we can just copy-paste it?

Edit: I just fine-tuned and served using the Mistral Small 3 2503 tokenizer, and it works. I just copy-pasted tokenizer.json and special_tokens_map.json.
