Custom Quantization Types
I would like to make my own quants, but the vocab is reported as incomplete when converting with llama.cpp. This likely means the tokenizer implementation is unsupported on llama.cpp:main. Is there a llama.cpp PR or a fork that does support it, like the nous-llama.cpp repo?
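For reference, this is roughly how the mismatch can be checked before converting; a minimal sketch with the Hugging Face transformers API (the model path is a placeholder):

from transformers import AutoConfig, AutoTokenizer

model_dir = "path/to/model"  # placeholder for the local checkpoint
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# The converter compares the embedding row count against the number of
# tokens the tokenizer defines and aborts when they disagree.
print("config vocab_size:", config.vocab_size)
print("tokenizer tokens :", len(tokenizer))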
What quant format? I was able to make AWQ without any issue.
Use the --pad-vocab option when converting to GGUF; this resolves the issue:
python convert.py $model --pad-vocab --outtype f16
Padding the vocab means dummy tokens end up being generated fairly frequently at inference time, so it isn't a useful workaround.
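If you want to check whether a converted file actually picked up padding entries, here is a sketch using the gguf Python package that ships with llama.cpp (pip install gguf; the filename is a placeholder):

from gguf import GGUFReader

reader = GGUFReader("model-f16.gguf")  # placeholder filename
field = reader.fields["tokenizer.ggml.tokens"]
# For array-valued fields, field.data indexes the parts holding each element.
tokens = [bytes(field.parts[i]).decode("utf-8") for i in field.data]
print(len(tokens))
print(tokens[-8:])  # padding entries, if any, sit at the end of the list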
You can use convert-hf-to-gguf.py as well.
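For example (flags as in llama.cpp's convert-hf-to-gguf.py; the output name is a placeholder):

python convert-hf-to-gguf.py $model --outtype f16 --outfile model-f16.gguf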
Is this all good now? We resized the vocab to a multiple of 32 even though we only added 2 tokens, because that causes fewer issues with tensor parallelism and should make model inference faster.
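In case it helps anyone reproduce this, a minimal sketch of that kind of resize with the transformers API (the path and token names are placeholders, not the actual tokens that were added):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Two new special tokens (placeholder names), then round the embedding
# matrix up to a multiple of 32 instead of resizing to exactly len(tokenizer).
tokenizer.add_special_tokens({"additional_special_tokens": ["<tok_a>", "<tok_b>"]})
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=32)

model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)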
It was fine to begin with. All anyone had to do to successfully convert the model to GGUF was use --pad-vocab. Doing so resolves the vocab size mismatch and lets the conversion complete. 😁