Draft Model for Speculative Decoding
Do you have any suggestions for draft models that would play nicely with this mode? BTW, Qwen2.5 7B Instruct seems to have a different vocab size and isn't working. Maybe I'm doing something wrong.
According to the models' config.json files (https://huggingface.co/Nexusflow/Athene-V2-Chat/blob/main/config.json and https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/blob/main/config.json), the vocab_size is the same. Since this is a finetune of Qwen 2.5 72B, another Qwen 2.5 model as the draft model makes the most sense. Are you perhaps using quantized versions that report a different vocabulary size?
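A quick sanity check before pairing two models is to compare the `vocab_size` fields of their configs. This is a hypothetical helper (not part of vLLM), and the config fragments below are illustrative stand-ins for the real config.json contents:

```python
def vocab_sizes_match(target_cfg: dict, draft_cfg: dict) -> bool:
    """Return True if the target and draft model configs report the same vocab_size."""
    return target_cfg.get("vocab_size") == draft_cfg.get("vocab_size")

# Illustrative config fragments (real config.json files have many more fields).
# A quantized or repacked checkpoint may report a padded/different vocab_size,
# which would make this check fail even though the base models match.
target = {"vocab_size": 152064}  # e.g. the 72B target model
draft = {"vocab_size": 152064}   # e.g. a Qwen 2.5 draft model
print(vocab_sizes_match(target, draft))  # True
```

If this check fails for your quantized checkpoints, that mismatch is the likely reason speculative decoding refuses to start.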
Ah, that makes sense. I was using an AWQ version with vLLM (as I have limited GPU). Does that mean I should try an AWQ version of Qwen 7B? Not sure that will bring any improvement.
I can't speak for this model, but I tried Llama 3.3 70B AWQ with Llama 3.2 3B AWQ as the draft model, and while it ran on vLLM, I got fewer tokens/sec than without the draft model. I'm not sure why yet. The acceptance rate was okay-ish at 0.72.
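For reference, this is roughly how a target/draft pair can be wired up at serve time. This is a sketch, not a verified config: the flag names below existed around vLLM 0.6, and newer vLLM releases moved speculative settings into a single `--speculative-config` JSON argument, so check the docs for your installed version. The AWQ repo names are assumptions based on Qwen's official AWQ releases:

```shell
# Sketch: serve an AWQ target model with an AWQ draft model for speculative decoding.
# Flag names are from vLLM ~0.6 and may differ in your version.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --speculative-model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --num-speculative-tokens 5
```

Even with a decent acceptance rate, the draft model's own forward passes and the verification step add overhead, so if the target model is already fast (or the GPU is saturated), end-to-end tokens/sec can drop, which may explain the slowdown above.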