---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
tags:
- draft
- speculative-decoding
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

![image.webp](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/KL97x9lVuhmIPXbbKgvyY.webp)

***NOTE***: *This is just a slightly improved version that I trained using `"max_position_embeddings": 65536` + `"rope_scaling": {"factor": 2.0, ...` rather than setting the `rope_scaling` after training...*

---

A `0.6B` parameter draft (speculative decoding) model for use with [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) and [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3).

**NOTES**:

- Unlike the previous [jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0](https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0), this version was trained using only the [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample) and [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl) datasets.
- Unlike the previous [jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0](https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0), this version doesn't trim the attention heads down from 14 to 12.

See [jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0-GGUF) for the models in GGUF format.

---

# How the model was created

## 1. The initial model was created from [Qwen/Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
python ./transplant_vocab.py \
    Qwen2.5-Coder-0.5B-Instruct \
    DeepSeek-V3-0324-BF16 \
    DeepSeek-V3-0324-CODER-DRAFT-0.6B-UNTRAINED \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "" \
    --override "<|tool▁call▁begin|>" "" \
    --override "<|tool▁outputs▁begin|>" "" \
    --override "<|tool▁output▁begin|>" "" \
    --override "<|tool▁calls▁end|>" "" \
    --override "<|tool▁call▁end|>" "" \
    --override "<|tool▁outputs▁end|>" "" \
    --override "<|tool▁output▁end|>" "" \
    --override "<|tool▁sep|>" ""
```

## 2. The following datasets were merged to create a fine-tuning dataset of ~2.5B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)

Each sample was formatted just between `<|end▁of▁sentence|>` tags (a rough sketch of this step follows below).
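The exact preprocessing script isn't published in this repo, so the following is only a minimal sketch of what it could look like: stream both datasets, wrap each sample in the `<|end▁of▁sentence|>` tag, and write plain-text shards matching the `mixed_data/*.txt` glob used by the qlora-pipe config in step 3. The column names (`"text"` / `"content"`), shard size, and exact tag placement are assumptions.

```python
# Minimal sketch only -- the real preprocessing script is not included here.
# Column names ("text" / "content"), shard size and the exact tag placement
# are assumptions; some datasets may also need a data_dir or config name.
import os
from datasets import load_dataset

EOS = "<|end▁of▁sentence|>"   # DeepSeek-V3 end-of-sentence token
SAMPLES_PER_SHARD = 100_000   # arbitrary shard size

sources = [
    ("agentlans/common-crawl-sample", "text"),  # web text
    ("bigcode/the-stack-smol-xl", "content"),   # source code
]

os.makedirs("mixed_data", exist_ok=True)

def flush(shard_idx, samples):
    """Write one plain-text shard that qlora-pipe's 'textfile' loader can read."""
    with open(f"mixed_data/shard_{shard_idx:05d}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(samples))

shard, buffer = 0, []
for repo, field in sources:
    for row in load_dataset(repo, split="train", streaming=True):
        # Wrap every sample between <|end▁of▁sentence|> tags, as described above.
        buffer.append(f"{EOS}{row[field]}{EOS}")
        if len(buffer) >= SAMPLES_PER_SHARD:
            flush(shard, buffer)
            shard, buffer = shard + 1, []

if buffer:
    flush(shard, buffer)
```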
## 3. The model was then trained using [qlora-pipe](https://github.com/tdrussell/qlora-pipe) for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

```toml
# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-V3-0324-CODER-DRAFT-0.6B-UNTRAINED'
output_dir = 'DeepSeek-V3-0324-CODER-DRAFT-0.6B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 5e-5
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
```

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 20,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```

I used six `RTX A6000` GPUs over three nodes, hence the batch size of `120` (`6 x 20 gradient accumulation steps = 120`).
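As a usage note: with `transformers`, a draft model like this can be passed as the `assistant_model` for speculative ("assisted") generation, which only works because the vocab transplant makes the draft share DeepSeek-V3's tokenizer. The sketch below is hypothetical: the draft repo id is assumed, and it presumes you can actually host DeepSeek-V3-0324 itself; in practice most people will instead pair the GGUF build linked above with llama.cpp's draft-model support.

```python
# Hypothetical sketch of speculative ("assisted") generation with transformers.
# Assumes enough hardware to host DeepSeek-V3-0324 and that the draft repo id
# below is correct -- adjust both to your setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "deepseek-ai/DeepSeek-V3-0324"
draft_id = "jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0"  # assumed repo id

# After the vocab transplant the draft uses DeepSeek-V3's tokenizer, so one
# tokenizer serves both models.
tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype="auto", device_map="auto",
    trust_remote_code=True,  # may be needed depending on your transformers version
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(target.device)

# Passing the draft as `assistant_model` enables speculative decoding.
output_ids = target.generate(input_ids, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```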