Custom Quantization Types
I would like to make my own quants, but the vocab is reported as incomplete when converting with llama.cpp. This likely means the tokenizer implementation is unsupported on llama.cpp:main. Is there a llama.cpp PR or a fork that does support it, like the nous-llama.cpp repo?
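For reference, this is roughly how the mismatch can be checked before converting; a minimal sketch with the Hugging Face transformers API (the model path is a placeholder):

from transformers import AutoConfig, AutoTokenizer

model_dir = "path/to/model"  # placeholder for the local checkpoint
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# The converter compares the embedding row count against the number of
# tokens the tokenizer defines and aborts when they disagree.
print("config vocab_size:", config.vocab_size)
print("tokenizer tokens :", len(tokenizer))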
What quant format? I was able to make AWQ without any issue.
Use the --pad-vocab option when converting to GGUF; this resolves the issue:
python convert.py $model --pad-vocab --outtype f16
Padding the vocab means dummy tokens end up being generated fairly frequently at inference time, so it isn't a useful workaround.
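If you want to check whether a converted file actually picked up padding entries, here is a sketch using the gguf Python package that ships with llama.cpp (pip install gguf; the filename is a placeholder):

from gguf import GGUFReader

reader = GGUFReader("model-f16.gguf")  # placeholder filename
field = reader.fields["tokenizer.ggml.tokens"]
# For array-valued fields, field.data indexes the parts holding each element.
tokens = [bytes(field.parts[i]).decode("utf-8") for i in field.data]
print(len(tokens))
print(tokens[-8:])  # padding entries, if any, sit at the end of the list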
You can use convert-hf-to-gguf.py as well.
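For example (flags as in llama.cpp's convert-hf-to-gguf.py; the output name is a placeholder):

python convert-hf-to-gguf.py $model --outtype f16 --outfile model-f16.gguf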
Is this all good now? We resized the vocab to a multiple of 32 even though we only added 2 tokens, because that causes fewer issues with tensor parallelism and should make model inference faster.
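In case it helps anyone reproduce this, a minimal sketch of that kind of resize with the transformers API (the path and token names are placeholders, not the actual tokens that were added):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Two new special tokens (placeholder names), then round the embedding
# matrix up to a multiple of 32 instead of resizing to exactly len(tokenizer).
tokenizer.add_special_tokens({"additional_special_tokens": ["<tok_a>", "<tok_b>"]})
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)

model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)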
It was fine to begin with. All anyone had to do to successfully convert the model to GGUF was use --pad-vocab. Doing so resolves the vocab size mismatch and lets the conversion complete. 😁