Draft Model for Speculative Decoding
Do you have any suggestions for draft models that would play nicely with this mode? BTW, Qwen2.5 7B Instruct seems to have a different vocab size and isn't working. Maybe I'm doing something wrong.
According to the models' config.json files (https://huggingface.co/Nexusflow/Athene-V2-Chat/blob/main/config.json and https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json), the vocab_size is the same. Since this is a finetune of Qwen 2.5 72B, another Qwen 2.5 model as the draft model makes the most sense. Are you perhaps using quantized versions that report a different vocabulary size?
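A quick sanity check before pairing two models is to compare the `vocab_size` fields of their configs. This is a hypothetical helper (not part of vLLM), and the config fragments below are illustrative stand-ins for the real config.json contents:

```python
def vocab_sizes_match(target_cfg: dict, draft_cfg: dict) -> bool:
    """Return True if the target and draft model configs report the same vocab_size."""
    return target_cfg.get("vocab_size") == draft_cfg.get("vocab_size")

# Illustrative config fragments (real config.json files have many more fields).
# A quantized or repacked checkpoint may report a padded/different vocab_size,
# which would make this check fail even though the base models match.
target = {"vocab_size": 152064}  # e.g. the 72B target model
draft = {"vocab_size": 152064}   # e.g. a Qwen 2.5 draft model
print(vocab_sizes_match(target, draft))  # True
```

If this check fails for your quantized checkpoints, that mismatch is the likely reason speculative decoding refuses to start.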
Ah, that makes sense. I was using an AWQ version with vLLM (as I have limited GPU). Does that mean I should try an AWQ version of Qwen 7B? Not sure that will bring any improvement.
I can't speak for this model, but I tried Llama 3.3 70B AWQ with Llama 3.2 3B AWQ as the draft model, and while it ran on vLLM, I got fewer tokens/sec than without the draft model. I'm not sure why yet. The acceptance rate was okay-ish at 0.72.
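For reference, this is roughly how a target/draft pair can be wired up at serve time. This is a sketch, not a verified config: the flag names below existed around vLLM 0.6, and newer vLLM releases moved speculative settings into a single `--speculative-config` JSON argument, so check the docs for your installed version. The AWQ repo names are assumptions based on Qwen's official AWQ releases:

```shell
# Sketch: serve an AWQ target model with an AWQ draft model for speculative decoding.
# Flag names are from vLLM ~0.6 and may differ in your version.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --speculative-model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --num-speculative-tokens 5
```

Even with a decent acceptance rate, the draft model's own forward passes and the verification step add overhead, so if the target model is already fast (or the GPU is saturated), end-to-end tokens/sec can drop, which may explain the slowdown above.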