
A 0.6B parameter draft (speculative decoding) model for use with Kimi-K2-Instruct.

See Kimi-K2-Instruct-DRAFT-0.6B-v3.0 for the models in transformers format, and a detailed explanation of how the model was created.


I've included the Q4_0 quants for 3 different context lengths; see the repository's file list for the exact variants.


NOTES:

  • The 14 attention heads of Qwen2.5-0.5B don't allow any of the other 4-bit quants to be made (the K-quants' 256-element super-blocks require tensor row sizes divisible by 256, and 14 × 64 = 896 is not), and experimentation has shown that using more or fewer than 4 bits for speculative decoding is a waste of time anyway.
  • Because llama.cpp uses "static-YaRN", the scaling factor remains constant regardless of input length! Only use the longer-context versions when you actually need to process long contexts; you can check the baked-in scaling factor as shown in the sketch after this list.
  • If you want to recreate these, the TikToken / SentencePiece tokenizer mismatch requires a small hack to convert_hf_to_gguf.py (see the main model page for details).