Great! Others with 3B/7B/1.5B planned?
Thank you, a draft collection is great! I have been missing exactly that so far, especially for these large models.
In my first test it doesn't give a speed advantage (it always depends a bit on the system), so it would be interesting to have other sizes like a 7B and 3B.
In my experience with other models (72B, 32B), the best draft models were the ones that combined good prediction (i.e. not too small, like 0.5B) with being small enough not to steal too many resources (unlike >=14B). Mostly these were 3B and 7B; 1.5B was still partly OK.
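For reference, pairing a draft model in llama.cpp just means passing it alongside the main model. A minimal sketch with placeholder GGUF filenames (the `-md`, `--draft-max`/`--draft-min` and `-ngld` flags are from recent `llama-server` builds and may differ in older ones):

```bash
# Hypothetical filenames - pair a 72B target with a 3B draft from the same
# model family so the vocabularies match.
llama-server -m  Qwen2.5-72B-Instruct-Q4_K_M.gguf -ngl 99 \
             -md Qwen2.5-3B-Instruct-Q4_K_M.gguf  -ngld 99 \
             --draft-max 8 --draft-min 1
```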
Yeah, I'm just trying fine-tuning the `0.5B` first and will then test larger sizes. The `0.5B` started off with only around 33% top-1 accuracy, so it will be interesting to compare what it ends up at after fine-tuning on around 500M-1B tokens.
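For anyone wondering, "top-1 accuracy" here means how often the draft's greedy next-token guess matches the token the big model actually produced. A minimal sketch of measuring it (not the author's actual eval harness; the model path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the draft model being evaluated.
tok = AutoTokenizer.from_pretrained("path/to/draft-model")
draft = AutoModelForCausalLM.from_pretrained(
    "path/to/draft-model", torch_dtype=torch.bfloat16
).eval()

def top1_accuracy(target_outputs: list[str]) -> float:
    """Fraction of next tokens the draft predicts correctly, where
    `target_outputs` are texts already generated by the big model."""
    hits, total = 0, 0
    for text in target_outputs:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = draft(ids).logits            # [1, seq_len, vocab]
        preds = logits[0, :-1].argmax(dim=-1)     # draft's guess for token i+1
        hits += (preds == ids[0, 1:]).sum().item()
        total += ids.shape[1] - 1
    return hits / total
```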
Just a quick update as I saw this was posted on Reddit today:
- I've pre-trained a `0.5b`, `1b`, `1.5b` and `3b`, each using ~650M tokens of the common-crawl-sample.
- I'm currently training the `0.5b` on ~2.4B tokens of actual `deepseek-r1` outputs (from OpenThoughts-Unverified-173k and dolphin-r1:reasoning-deepseek).
I'm not sure if this is overkill as @alamios looks to be getting quite good results with a lot less data than this (I'm also using 32k context, which likely slows things down quite significantly too, but felt it was important as the reasoning models can generate so much output for a single query - I may even retry with the YaRN settings in `qwen-0.5b` and try 128k context if this shows a lot of promise, but that will be really slow to train I think).
I'm about 1/2 way through the final stage of the `0.5b` model and it's getting ~75% top-1 on the evaluation sets for those two datasets, but there is likely some data-leakage due to the way I have just concatenated the samples to all be 32k context.
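To illustrate the leakage risk: with naive end-to-end packing a sample can straddle two 32k windows, so if the train/eval split is made over the packed windows, pieces of the same sample can land on both sides. A rough sketch of that kind of packing (not the author's exact pipeline):

```python
from itertools import chain

def pack_naively(token_lists, ctx_len=32768):
    """Concatenate every sample end-to-end and chop into fixed-size windows.
    Samples that straddle a window boundary are where leakage can creep in
    if train/eval splitting happens after packing."""
    stream = list(chain.from_iterable(token_lists))
    return [stream[i:i + ctx_len]
            for i in range(0, len(stream) - ctx_len + 1, ctx_len)]
```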
After this I will try the `1b` based off `llama-3.2-instruct:1b` to see what that gains, but I have a feeling it will be diminishing returns and won't go further than that unless it looks like I can get a significantly higher top-1 hit-rate.
I tried this paired with the Unsloth dynamic quant: there's a token mismatch; token 128815 exists there as "PAD_TOKEN".
I think I may have fixed this in the new version, but will double check before I quantize the GGUFs for the new versions.
Sorry I don't have a Reddit account anymore so can't reply there:
I wonder about the chosen approach - whether this model will predict the full R1's tokens better than the existing small R1 distill models do. Yet even if it only matches maybe 30% of the tokens, you can run it with `--draft-max 2` or `3` and still get 25% more TPS or so.
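Rough back-of-the-envelope for that: assuming each drafted token is accepted independently with probability equal to the top-1 hit rate, and ignoring the draft model's own cost, the expected tokens generated per forward pass of the big model follows the standard speculative-decoding estimate:

```python
def expected_tokens_per_pass(acc: float, draft_max: int) -> float:
    # 1 token always comes from the target's own verification pass, plus
    # however many drafted tokens are accepted before the first rejection:
    # E = (1 - acc^(k+1)) / (1 - acc)
    return (1 - acc ** (draft_max + 1)) / (1 - acc)

print(expected_tokens_per_pass(0.30, 2))   # ~1.39
print(expected_tokens_per_pass(0.30, 3))   # ~1.42
```

So a ~30% hit rate gives roughly 1.4 tokens per big-model pass in this idealised model, which is consistent with ending up around +25% TPS once the draft overhead is subtracted.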
I'm hoping it should be a significantly better match than the distilled models (which from my tests didn't actually work very well after just using `transplant_vocab` alone).
This post seems to confirm this:
As the `llama-70b-r1-distill` shows, the match isn't great and it has retained a lot of its old word distribution...
I'm hoping that my draft will match the word distribution much better, but at the cost of probably being a useless/broken model if used on its own.
I've got the `0.5b` (actually `0.6b`) version working now and the `1b` (actually `1.4b`) is about 2 days away.
I'm not sure it's any better though, nor whether the `1b` will be worth the extra compute (at least in `llama.cpp`).
I'll upload both when they are done, and then I'm going to look at training a coder-specific tiny model for `deepseek-v3`, as I think that will have more potential.
A coder-specific draft for DeepSeek V3 would be excellent. Anything to speed up local generation is welcome!
It turns out that larger models don't seem to gain much for this, so I'm now trying a new option to trim down to even smaller models (i.e. keeping 16 of the 24 layers of `qwen-2.5`, etc.):
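(For illustration only - this is an assumption about the approach, not the author's actual tooling - a rough sketch of what trimming a 24-layer `Qwen/Qwen2.5-0.5B` down to 16 layers could look like with plain Hugging Face APIs:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"   # 24 decoder layers
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

keep = 16                                        # drop the top 8 layers
model.model.layers = model.model.layers[:keep]   # truncate the decoder stack
model.config.num_hidden_layers = keep            # keep the config consistent

model.save_pretrained("qwen2.5-0.5b-16layer")    # placeholder output path
AutoTokenizer.from_pretrained(name).save_pretrained("qwen2.5-0.5b-16layer")
```

The truncated model then needs further fine-tuning (as above) to recover, since chopping layers on its own degrades it badly.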
If you make a new version with the Unsloth quants in mind (token 128815 exists there as "PAD_TOKEN"), can you let us know? :) Thanks in advance.
Can you paste in the output of `llama.cpp` when you run it that shows all the data, like this:
main: loading model
srv load_model: loading model '/home/juk/models/gguf/deepseek-r1-mla-Q8_0+BF16.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX 5000 Ada Generation) - 31921 MiB free
llama_model_loader: loaded meta data with 46 key-value pairs and 1086 tensors from /home/juk/models/gguf/deepseek-r1-mla-Q8_0+BF16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Bf16
llama_model_loader: - kv 3: general.size_label str = 256x20B
llama_model_loader: - kv 4: general.license str = mit
llama_model_loader: - kv 5: deepseek2.block_count u32 = 61
llama_model_loader: - kv 6: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 7: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 8: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 9: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 10: deepseek2.attention.head_count_kv u32 = 1
llama_model_loader: - kv 11: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 13: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 14: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 15: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 16: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 17: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 18: deepseek2.attention.key_length u32 = 576
llama_model_loader: - kv 19: deepseek2.attention.value_length u32 = 512
llama_model_loader: - kv 20: deepseek2.attention.key_length_mla u32 = 192
llama_model_loader: - kv 21: deepseek2.attention.value_length_mla u32 = 128
llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 44: general.quantization_version u32 = 2
llama_model_loader: - kv 45: general.file_type u32 = 7
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 603 tensors
llama_model_loader: - type bf16: 122 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 665.19 GiB (8.52 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 1
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 576
print_info: n_embd_head_v = 512
print_info: n_gqa = 128
print_info: n_embd_k_gqa = 576
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = DeepSeek R1 Bf16
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 192
print_info: n_embd_head_v_mla = 128
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 1 '<|end▁of▁sentence|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CUDA0 model buffer size = 17621.99 MiB
load_tensors: CPU_Mapped model buffer size = 680199.08 MiB
Then I can be sure the other tokens match up, and I will make some Unsloth-specific GGUF files for the new version.
Hey, I switched from Unsloth to the version from Cognitive Computations (cognitivecomputations/DeepSeek-R1-AWQ), so I guess it is not a problem for me right now. However, I will get back to you with the output of llama.cpp that you asked for, for an Unsloth-specific version.
Here is the output from the Unsloth DeepSeek R1 in case anyone needs it. You can see that the PAD token differs here:
main: load the model and apply lora adapter, if any
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 256x20B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000,000000
llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 15: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 16: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 17: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 18: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 21: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 22: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 23: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 24: deepseek2.expert_weights_scale f32 = 2,500000
llama_model_loader: - kv 25: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 27: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 28: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 29: deepseek2.rope.scaling.factor f32 = 40,000000
llama_model_loader: - kv 30: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 31: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0,100000
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 128815
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 42: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 43: general.quantization_version u32 = 2
llama_model_loader: - kv 44: general.file_type u32 = 10
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.tensors.count i32 = 1025
llama_model_loader: - kv 47: split.count u16 = 5
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q2_K: 171 tensors
llama_model_loader: - type q3_K: 3 tensors
llama_model_loader: - type q4_K: 306 tensors
llama_model_loader: - type q6_K: 184 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q2_K - Medium
print_info: file size = 211,03 GiB (2,70 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 819
load: token to piece cache size = 0,8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 128
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 24576
print_info: n_embd_v_gqa = 16384
print_info: f_norm_eps = 0,0e+00
print_info: f_norm_rms_eps = 1,0e-06
print_info: f_clamp_kqv = 0,0e+00
print_info: f_max_alibi_bias = 0,0e+00
print_info: f_logit_scale = 0,0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000,0
print_info: freq_scale_train = 0,025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671,03 B
print_info: general.name = DeepSeek R1 BF16
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2,5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0,1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 128815 '<|PAD▁TOKEN|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
I think that to fix this, all you need to do is add `--override-kv "tokenizer.ggml.padding_token_id=int:1"` to the command line.
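For example, something like this (the main GGUF path is the one from the log above; the draft-model filename is a placeholder, and the `-md`/`--draft-max`/`--draft-min` speculative flags assume a recent `llama-server` build):

```bash
llama-server \
  -m  DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
  -md deepseek-r1-draft-0.6b-Q8_0.gguf \
  --override-kv "tokenizer.ggml.padding_token_id=int:1" \
  --draft-max 3 --draft-min 1
```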