
A 0.5B-parameter draft model (for speculative decoding) for use with deepseek-ai/DeepSeek-R1.

NOTE: This is a draft model for the full-sized DeepSeek-R1 model and not the smaller "distilled" models!

See jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0 for the non-GGUF version and a detailed explanation of how the model was created.
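A minimal usage sketch with llama.cpp's llama-server (the file paths and draft parameters below are illustrative assumptions, not values from this card; check `llama-server --help` for the exact flags in your build):

```bash
# Serve the full-sized target model, with this 0.5B model drafting tokens
# that the target then verifies (speculative decoding).
# -md/--model-draft selects the draft model; --draft-max/--draft-min bound
# how many tokens are drafted per step (values here are illustrative).
llama-server \
    -m DeepSeek-R1-Q4_K_M.gguf \
    -md DeepSeek-R1-DRAFT-0.5B-IQ4_XS.gguf \
    --draft-max 16 --draft-min 4
```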


Without imatrix

| Link | Type | PPL | PPL vs BF16 |
|------|------|-----|-------------|
| DeepSeek-R1-DRAFT-0.5B-BF16.gguf | BF16 | 11.0267 ± 0.08658 | --- |
| DeepSeek-R1-DRAFT-0.5B-F16.gguf | F16 | 11.0294 ± 0.08660 | +0.02% |
| DeepSeek-R1-DRAFT-0.5B-Q8_0.gguf | Q8_0 | 11.0450 ± 0.08675 | +0.17% |
| DeepSeek-R1-DRAFT-0.5B-Q6_K.gguf | Q6_K | 11.1231 ± 0.08732 | +0.87% |
| DeepSeek-R1-DRAFT-0.5B-Q5_K_M.gguf | Q5_K_M | 11.2727 ± 0.08902 | +2.23% |
| DeepSeek-R1-DRAFT-0.5B-Q5_K_S.gguf | Q5_K_S | 11.2803 ± 0.08888 | +2.30% |
| DeepSeek-R1-DRAFT-0.5B-Q4_K_M.gguf | Q4_K_M | 11.8171 ± 0.09319 | +7.17% |
| DeepSeek-R1-DRAFT-0.5B-Q4_K_S.gguf | Q4_K_S | 11.9379 ± 0.09380 | +8.26% |
| DeepSeek-R1-DRAFT-0.5B-IQ4_NL.gguf | IQ4_NL | 11.8497 ± 0.09445 | +7.46% |
| DeepSeek-R1-DRAFT-0.5B-IQ4_XS.gguf | IQ4_XS | 11.8600 ± 0.09464 | +7.56% |
| DeepSeek-R1-DRAFT-0.5B-Q5_1.gguf | Q5_1 | 11.3624 ± 0.08926 | +3.05% |
| DeepSeek-R1-DRAFT-0.5B-Q5_0.gguf | Q5_0 | 11.5217 ± 0.09124 | +4.49% |
| DeepSeek-R1-DRAFT-0.5B-Q4_1.gguf | Q4_1 | 12.3107 ± 0.09765 | +11.64% |
| DeepSeek-R1-DRAFT-0.5B-Q4_0.gguf | Q4_0 | 12.6168 ± 0.10021 | +14.42% |

With imatrix

| Link | Type | PPL | PPL vs BF16 |
|------|------|-----|-------------|
| DeepSeek-R1-DRAFT-0.5B-iQ6_K.gguf | Q6_K | 11.0940 ± 0.08714 | +0.61% |
| DeepSeek-R1-DRAFT-0.5B-iQ5_K_M.gguf | Q5_K_M | 11.2333 ± 0.08819 | +1.87% |
| DeepSeek-R1-DRAFT-0.5B-iQ5_K_S.gguf | Q5_K_S | 11.2238 ± 0.08798 | +1.79% |
| DeepSeek-R1-DRAFT-0.5B-iQ4_K_M.gguf | Q4_K_M | 11.6273 ± 0.09165 | +5.45% |
| DeepSeek-R1-DRAFT-0.5B-iQ4_K_S.gguf | Q4_K_S | 11.7004 ± 0.09225 | +6.11% |
| DeepSeek-R1-DRAFT-0.5B-iIQ4_NL.gguf | IQ4_NL | 11.6495 ± 0.09192 | +5.65% |
| DeepSeek-R1-DRAFT-0.5B-iIQ4_XS.gguf | IQ4_XS | 11.6924 ± 0.09246 | +6.04% |
| DeepSeek-R1-DRAFT-0.5B-iQ5_1.gguf | Q5_1 | 11.2001 ± 0.08792 | +1.57% |
| DeepSeek-R1-DRAFT-0.5B-iQ5_0.gguf | Q5_0 | 11.3579 ± 0.08961 | +3.00% |
| DeepSeek-R1-DRAFT-0.5B-iQ4_1.gguf | Q4_1 | 11.7469 ± 0.09250 | +6.53% |
| DeepSeek-R1-DRAFT-0.5B-iQ4_0.gguf | Q4_0 | 12.1546 ± 0.09619 | +10.23% |
  • Based on these results, my suggestion is to use IQ4_XS unless you have a good reason not to (e.g. use Q4_K_S if IQ4_XS runs slowly on your hardware, Q4_0 may be a better choice when running on CPU, and so on).
  • I am not sure whether the versions created using the imatrix file are actually better or worse in practice (more thorough testing is needed; PPL might not be a good predictor of actual draft acceptance rates!).
  • Both deepseek-r1 and qwen-2.5 use YaRN as their context-window extension method. To get the best-quality output, it is advised to use smaller contexts (e.g. 16k) when you can. Due to the way YaRN is implemented in llama.cpp, just setting the context to a massive value will degrade both the draft and target models' outputs (see the example command after this list)!
  • Do not use quants smaller than 4-bit for speculative decoding - the large drop in quality will not be offset by the reduced memory-bandwidth requirements!
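For example, following the YaRN note above, one would request a modest context explicitly rather than the model's maximum (a sketch; the filenames and the 16k value are illustrative):

```bash
# 16k context for both target and draft; avoid simply requesting
# the maximum (e.g. -c 131072), which degrades both models' outputs.
llama-server -m DeepSeek-R1-Q4_K_M.gguf \
    -md DeepSeek-R1-DRAFT-0.5B-IQ4_XS.gguf \
    -c 16384
```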

I have included the imatrix file used to generate the Q4_0-Q6_K quants, along with the 1MB sample of the fine-tuning data used to create it.
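For reference, an imatrix-based quant of this kind is typically produced along these lines; the exact commands used for these files are not given in the card, and the filenames here are assumptions:

```bash
# Compute an importance matrix over the 1MB fine-tuning sample.
llama-imatrix -m DeepSeek-R1-DRAFT-0.5B-BF16.gguf -f sample.txt -o imatrix.dat

# Quantize with that matrix (Q4_K_S shown; same pattern for the other
# Q4_0-Q6_K imatrix quants above).
llama-quantize --imatrix imatrix.dat \
    DeepSeek-R1-DRAFT-0.5B-BF16.gguf DeepSeek-R1-DRAFT-0.5B-iQ4_K_S.gguf Q4_K_S
```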

I have also included the 1MB sample of the fine-tuning data used to calculate the PPL figures above, using llama-perplexity's default settings.
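A sketch of the equivalent invocation (the filenames are assumptions; the tool's defaults are used, as stated above):

```bash
# Measure perplexity of one quant over the included 1MB sample,
# using llama-perplexity's default settings.
llama-perplexity -m DeepSeek-R1-DRAFT-0.5B-Q4_K_S.gguf -f sample.txt
```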

Format: GGUF
Model size: 501M params
Architecture: qwen2


Base model: Qwen/Qwen2.5-0.5B (this model is a quantized derivative)
