Draft Models
Tiny "draft" models for speculative decoding.
NOTE: See jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-GGUF for a Q4_0 model in GGUF format.
So this is a bit of a Frankenstein's monster due to Kimi-K2-Instruct using the TikToken tokenizer and Qwen2.5-Coder-0.5B-Instruct using the SentencePiece tokenizer...

I used transplant-vocab with this command line:
```bash
python3 ./transplant_vocab.py \
    ./Qwen2.5-Coder-0.5B-Instruct \
    ./Kimi-K2-Instruct-BF16 \
    ./Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED \
    --trust-remote-code \
    --overwrite \
    --override "[BOS]" "<|endoftext|>" \
    --override "[EOS]" "<|im_end|>" \
    --override "<|im_end|>" "<|im_end|>" \
    --override "<|im_user|>" "<|im_start|>user" \
    --override "<|im_assistant|>" "<|im_start|>assistant" \
    --override "<|start_header_id|>" "<|im_start|>" \
    --override "<|end_header_id|>" "<|im_end|>" \
    --override "[EOT]" "<|endoftext|>" \
    --override "<|im_system|>" "<|im_start|>system" \
    --override "<|tool_calls_section_begin|>" "<tool_call>" \
    --override "<|tool_calls_section_end|>" "</tool_call>" \
    --override "<|tool_call_begin|>" "<tool_call>" \
    --override "<|tool_call_argument_begin|>" "<tool_call>" \
    --override "<|tool_call_end|>" "</tool_call>" \
    --override "<|im_middle|>" "\\n" \
    --override "[UNK]" "<|endoftext|>" \
    --override "[PAD]" "<|endoftext|>"
```
There were still some odd-looking things in the config files after this, so I:

- Copied `tokenizer_config.json` from Kimi-K2-Instruct over the top of the transplant-vocab-generated one.
- Copied `config.json` from Qwen2.5-Coder-0.5B-Instruct over the top of the transplant-vocab-generated one.
- Edited `config.json` to use `"bos_token_id": 163584`, `"eos_token_id": 163585` and `"tie_word_embeddings": false` (sketched just below).
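For reference, that `config.json` edit amounts to something like the following. This is just a sketch of the three overrides listed above, with the output path taken from the transplant command:

```python
# Sketch: patch the generated config.json with the BOS/EOS ids and untie the embeddings.
import json

path = "./Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED/config.json"
with open(path) as f:
    config = json.load(f)

config["bos_token_id"] = 163584        # value listed above
config["eos_token_id"] = 163585        # value listed above
config["tie_word_embeddings"] = False

with open(path, "w") as f:
    json.dump(config, f, indent=2)
```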
I then ran into another problem with llama.cpp, where `convert_hf_to_gguf.py` didn't like this, so I had to temporarily hack this change in:
```python
@ModelBase.register("Qwen2Model", "Qwen2ForCausalLM", "Qwen2AudioForConditionalGeneration")
class Qwen2Model(TextModel):
    model_arch = gguf.MODEL_ARCH.QWEN2

    #def set_vocab(self):
    #    try:
    #        self._set_vocab_sentencepiece()
    #    except FileNotFoundError:
    #        self._set_vocab_gpt2()

    def set_vocab(self):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
        tokpre = self.get_vocab_base_pre(tokenizer)

        # Build merges list using the approach similar to HunYuanMoE
        merges = []
        vocab = {}
        mergeable_ranks = tokenizer.model._mergeable_ranks
        for token, rank in mergeable_ranks.items():
            vocab[QwenModel.token_bytes_to_string(token)] = rank
            if len(token) == 1:
                continue
            merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
            if len(merged) == 2:
                merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))

        # Build token list
        vocab_size = self.hparams["vocab_size"]
        special_tokens = tokenizer.special_tokens
        reverse_vocab = {id_ : encoded_tok for encoded_tok, id_ in {**vocab, **special_tokens}.items()}
        tokens: list[str] = []
        toktypes: list[int] = []

        for i in range(vocab_size):
            if i not in reverse_vocab:
                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.UNUSED)
            else:
                token = reverse_vocab[i]
                tokens.append(token)
                if i in special_tokens.values():
                    toktypes.append(gguf.TokenType.CONTROL)
                else:
                    toktypes.append(gguf.TokenType.NORMAL)

        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre(tokpre)
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)
        self.gguf_writer.add_token_merges(merges)

        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
        special_vocab.add_to_gguf(self.gguf_writer)
```
This let me run:

```bash
~/llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED
```

and:

```bash
~/llama.cpp/build/bin/llama-quantize Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-Q4_0.gguf Q4_0 44
```
I'm unsure yet whether the Frankensteined transformers model can be trained, but even this untrained version gives some improvement for me on a highly "draftable" refactoring prompt:
Without the draft model:

```
prompt eval time = 55366.84 ms / 1832 tokens ( 30.22 ms per token, 33.09 tokens per second)
       eval time = 241439.49 ms / 1618 tokens ( 149.22 ms per token, 6.70 tokens per second)
      total time = 296806.34 ms / 3450 tokens
```

With the draft model:

```
prompt eval time = 55209.16 ms / 1832 tokens ( 30.14 ms per token, 33.18 tokens per second)
       eval time = 169682.33 ms / 1524 tokens ( 111.34 ms per token, 8.98 tokens per second)
      total time = 224891.50 ms / 3356 tokens
draft acceptance rate = 0.59985 ( 814 accepted / 1357 generated)
```
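That works out to roughly a 1.34x speedup in generation (8.98 vs 6.70 tokens per second on eval), with about 60% of the drafted tokens being accepted.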
These numbers were produced using the following script:
```bash
#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo  # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
~/llama.cpp/build/bin/llama-server \
    --host "$host_address" \
    --port "$port_number" \
    --alias "Kimi-K2-Instruct" \
    --jinja \
    --chat-template-file ~/models/Kimi-K2-Instruct.jinja \
    --model ~/models/gguf/Kimi-K2-Instruct-Q6_K_X.gguf \
    --n-gpu-layers 99 \
    --numa distribute \
    --override-tensor exps=CPU \
    --flash-attn \
    --ctx_size 32768 \
    --model-draft ~/models/gguf/draft_models/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-Q4_0.gguf \
    --gpu-layers-draft 99 \
    --top-k 1 \
    --samplers "top_k"
```
Base model: Qwen/Qwen2.5-0.5B