NOTE: See jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-GGUF for a Q4_0 version of this model in GGUF format.


So this is a bit of a Frankenstein's monster, due to Kimi-K2-Instruct using a TikToken tokenizer while Qwen2.5-Coder-0.5B-Instruct uses a SentencePiece tokenizer...
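
You can see the mismatch by loading the two tokenizers side by side (a minimal sketch, assuming both models have been downloaded into the directory names used below):

from transformers import AutoTokenizer

# Donor model: provides the architecture and weights.
donor = AutoTokenizer.from_pretrained("./Qwen2.5-Coder-0.5B-Instruct")

# Target model: provides the vocabulary the draft model has to match.
target = AutoTokenizer.from_pretrained("./Kimi-K2-Instruct-BF16", trust_remote_code=True)

print(type(donor).__name__, len(donor))    # Qwen's tokenizer (roughly 152k entries)
print(type(target).__name__, len(target))  # Kimi's TikToken-based tokenizer (roughly 164k entries)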

I used transplant-vocab with the following command line (each --override maps one of Kimi-K2's special tokens onto a roughly-equivalent Qwen2.5 token, so those rows get sensible initialisations):

python3 ./transplant_vocab.py \
    ./Qwen2.5-Coder-0.5B-Instruct \
    ./Kimi-K2-Instruct-BF16 \
    ./Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED \
    --trust-remote-code \
    --overwrite \
    --override "[BOS]" "<|endoftext|>" \
    --override "[EOS]" "<|im_end|>" \
    --override "<|im_end|>" "<|im_end|>" \
    --override "<|im_user|>" "<|im_start|>user" \
    --override "<|im_assistant|>" "<|im_start|>assistant" \
    --override "<|start_header_id|>" "<|im_start|>" \
    --override "<|end_header_id|>" "<|im_end|>" \
    --override "[EOT]" "<|endoftext|>" \
    --override "<|im_system|>" "<|im_start|>system" \
    --override "<|tool_calls_section_begin|>" "<tool_call>" \
    --override "<|tool_calls_section_end|>" "</tool_call>" \
    --override "<|tool_call_begin|>" "<tool_call>" \
    --override "<|tool_call_argument_begin|>" "<tool_call>" \
    --override "<|tool_call_end|>" "</tool_call>" \
    --override "<|im_middle|>" "\\n" \
    --override "[UNK]" "<|endoftext|>" \
    --override "[PAD]" "<|endoftext|>"

There were still some odd-looking things in the config files after this, so I:

  • Copied tokenizer_config.json from Kimi-K2-Instruct over the top of the transplant-vocab-generated one.
  • Copied config.json from Qwen2.5-Coder-0.5B-Instruct over the top of the transplant-vocab-generated one.
  • Edited config.json to use "bos_token_id": 163584, "eos_token_id": 163585 and "tie_word_embeddings": false (see the sanity-check sketch after this list).
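
If you repeat this, here is a quick check that the hand-edited files came out consistent (a minimal sketch, run against the output directory from the transplant step above):

import json
from transformers import AutoTokenizer

model_dir = "./Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED"

# The hand-edited values from the list above.
with open(f"{model_dir}/config.json") as f:
    cfg = json.load(f)
assert cfg["bos_token_id"] == 163584
assert cfg["eos_token_id"] == 163585
assert cfg["tie_word_embeddings"] is False

# The copied-over Kimi tokenizer config should resolve the same IDs.
tok = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
print(tok.convert_tokens_to_ids("[BOS]"), tok.convert_tokens_to_ids("[EOS]"))  # expect 163584 163585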

I then ran into another problem: llama.cpp's convert_hf_to_gguf.py didn't like this, so I had to temporarily hack in this change:

@ModelBase.register("Qwen2Model", "Qwen2ForCausalLM", "Qwen2AudioForConditionalGeneration")
class Qwen2Model(TextModel):
    model_arch = gguf.MODEL_ARCH.QWEN2

    #def set_vocab(self):
    #    try:
    #        self._set_vocab_sentencepiece()
    #    except FileNotFoundError:
    #        self._set_vocab_gpt2()
    # Replacement: load Kimi's TikToken-based tokenizer via trust_remote_code and
    # rebuild a GPT-2-style vocab + merges list for the GGUF writer.
    def set_vocab(self):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(self.dir_model, trust_remote_code=True)
        tokpre = self.get_vocab_base_pre(tokenizer)
    
        # Build the merges list, using an approach similar to the HunYuanMoE converter
        merges = []
        vocab = {}
        mergeable_ranks = tokenizer.model._mergeable_ranks
        for token, rank in mergeable_ranks.items():
            vocab[QwenModel.token_bytes_to_string(token)] = rank
            if len(token) == 1:
                continue
            merged = QwenModel.bpe(mergeable_ranks, token, max_rank=rank)
            if len(merged) == 2:
                merges.append(' '.join(map(QwenModel.token_bytes_to_string, merged)))

        # Build the full token list, padding any unused IDs
        vocab_size = self.hparams["vocab_size"]
        special_tokens = tokenizer.special_tokens
        reverse_vocab = {id_: encoded_tok for encoded_tok, id_ in {**vocab, **special_tokens}.items()}
        tokens: list[str] = []
        toktypes: list[int] = []
    
        for i in range(vocab_size):
            if i not in reverse_vocab:
                tokens.append(f"[PAD{i}]")
                toktypes.append(gguf.TokenType.UNUSED)
            else:
                token = reverse_vocab[i]
                tokens.append(token)
                if i in special_tokens.values():
                    toktypes.append(gguf.TokenType.CONTROL)
                else:
                    toktypes.append(gguf.TokenType.NORMAL)
    
        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre(tokpre)
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)
        self.gguf_writer.add_token_merges(merges)

        special_vocab = gguf.SpecialVocab(self.dir_model, load_merges=False)
        special_vocab.add_to_gguf(self.gguf_writer)
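
The non-obvious part of the hack is the merges recovery: tiktoken ships only a token-to-rank table, so each multi-byte token is re-split by running BPE over its own bytes while only permitting merges of strictly lower rank, which exposes the final pair that produced it. Here is a toy re-implementation of that idea (illustrative only; the real conversion calls llama.cpp's QwenModel.bpe, and the rank table below is made up):

# Toy re-implementation of merge recovery from a tiktoken-style rank table.
def bpe(mergeable_ranks: dict[bytes, int], token: bytes, max_rank: int) -> list[bytes]:
    parts = [bytes([b]) for b in token]
    while True:
        # Find the lowest-ranked adjacent pair that merges below max_rank.
        best = None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and rank < max_rank and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break
        i = best[1]
        parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]
    return parts

# Made-up rank table: "ab" was merged first, then "ab" + "c" -> "abc".
ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
print(bpe(ranks, b"abc", max_rank=4))  # [b'ab', b'c'] -> recovers the merge "ab c"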

This let me run:

~/llama.cpp/convert_hf_to_gguf.py --outtype auto --outfile Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED

and:

~/llama.cpp/build/bin/llama-quantize Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-BF16.gguf Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-Q4_0.gguf Q4_0 44

I'm not sure yet whether the Frankensteined transformers model can be trained, but even this untrained version gives some improvement for me on a highly "draftable" refactoring prompt (first block: without the draft model; second block: with it):

prompt eval time =   55366.84 ms /  1832 tokens (   30.22 ms per token,    33.09 tokens per second)
       eval time =  241439.49 ms /  1618 tokens (  149.22 ms per token,     6.70 tokens per second)
      total time =  296806.34 ms /  3450 tokens

prompt eval time =   55209.16 ms /  1832 tokens (   30.14 ms per token,    33.18 tokens per second)
       eval time =  169682.33 ms /  1524 tokens (  111.34 ms per token,     8.98 tokens per second)
      total time =  224891.50 ms /  3356 tokens
draft acceptance rate = 0.59985 (  814 accepted /  1357 generated)
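
That works out to roughly a 1.34x generation speedup at a ~60% acceptance rate, which you can sanity-check from the numbers above:

# Quick arithmetic on the llama-server timings above.
baseline_tps = 6.70   # eval speed without the draft model
draft_tps    = 8.98   # eval speed with the draft model
print(f"speedup:    {draft_tps / baseline_tps:.2f}x")  # ~1.34x
print(f"acceptance: {814 / 1357:.5f}")                 # 0.59985, matching the log line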

using the following launch script (note --top-k 1 with --samplers "top_k": greedy sampling tends to maximise the draft acceptance rate):

#!/bin/bash

host_address=192.168.1.1
port_number=8080

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo    # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
~/llama.cpp/build/bin/llama-server \
        --host "$host_address" \
        --port "$port_number" \
        --alias "Kimi-K2-Instruct" \
        --jinja \
        --chat-template-file ~/models/Kimi-K2-Instruct.jinja \
        --model ~/models/gguf/Kimi-K2-Instruct-Q6_K_X.gguf \
        --n-gpu-layers 99 \
        --numa distribute \
        --override-tensor exps=CPU \
        --flash-attn \
        --ctx_size 32768 \
        --model-draft ~/models/gguf/draft_models/Kimi-K2-Instruct-DRAFT-0.6B-UNTRAINED-Q4_0.gguf \
        --gpu-layers-draft 99 \
        --top-k 1 \
        --samplers "top_k"
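
Once the server is up, a quick end-to-end test (a minimal sketch against llama-server's OpenAI-compatible endpoint, using the host/port from the script above):

import json
import urllib.request

# Hit llama-server's OpenAI-compatible chat endpoint.
req = urllib.request.Request(
    "http://192.168.1.1:8080/v1/chat/completions",
    data=json.dumps({
        "model": "Kimi-K2-Instruct",
        "messages": [{"role": "user", "content": "Refactor this function: ..."}],
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])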