metadata
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
tags:
- draft
- speculative-decoding
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
NOTE: This is a slightly improved version that I trained with "max_position_embeddings": 65536 and "rope_scaling": {"factor": 2.0, ... already set, rather than setting the rope_scaling after training.
A 0.6B parameter draft (speculative decoding) model for use with deepseek-ai/DeepSeek-V3-0324 and deepseek-ai/DeepSeek-V3.
NOTES:
- This version (unlike the previous jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0) was trained using only the agentlans/common-crawl-sample and bigcode/the-stack-smol-xl datasets.
- This version (unlike the previous jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0) doesn't trim the heads down from 14 to 12.
See jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0-GGUF for the models in GGUF format.
How the model was created
1. The initial model was created from Qwen/Qwen2.5-Coder-0.5B-Instruct using transplant-vocab:
python ./transplant_vocab.py \
Qwen2.5-Coder-0.5B-Instruct \
DeepSeek-V3-0324-BF16 \
DeepSeek-V3-0324-CODER-DRAFT-0.6B-UNTRAINED \
--override "<|▁pad▁|>" "<|endoftext|>" \
--override "<|fim▁hole|>" "<|fim_middle|>" \
--override "<|fim▁begin|>" "<|fim_prefix|>" \
--override "<|fim▁end|>" "<|fim_suffix|>" \
--override "<|User|>" "<|im_start|>user\\n" \
--override "<|Assistant|>" "<|im_start|>assistant\\n" \
--override "<|EOT|>" "<|endoftext|>" \
--override "<|tool▁calls▁begin|>" "<tool_call>" \
--override "<|tool▁call▁begin|>" "<tool_call>" \
--override "<|tool▁outputs▁begin|>" "<tool_call>" \
--override "<|tool▁output▁begin|>" "<tool_call>" \
--override "<|tool▁calls▁end|>" "</tool_call>" \
--override "<|tool▁call▁end|>" "</tool_call>" \
--override "<|tool▁outputs▁end|>" "</tool_call>" \
--override "<|tool▁output▁end|>" "</tool_call>" \
--override "<|tool▁sep|>" "</tool_call>"
2. The following datasets were merged to create a fine-tuning dataset of ~2.5B tokens:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
The documents were formatted just between <|end▁of▁sentence|> tags.
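The formatting step can be sketched roughly as follows. This is only an illustration, assuming each document is plain text and that a single <|end▁of▁sentence|> tag is written between consecutive documents. The helper name join_documents and the sample strings are hypothetical, and the exact layout of the real mixed_data/*.txt files may differ.

```python
# Hypothetical helper: the tag placement is an assumption, not the author's actual script.
EOS_TAG = "<|end▁of▁sentence|>"


def join_documents(documents):
    """Concatenate non-empty documents with the EOS tag between them."""
    cleaned = [doc.strip() for doc in documents if doc.strip()]
    return EOS_TAG.join(cleaned)


# Made-up sample documents: one code snippet, one blank entry, one web-text snippet.
sample = ["def add(a, b):\n    return a + b", "  ", "Plain web text."]
merged = join_documents(sample)
print(merged)
```

Blank documents are dropped here so that two tags never end up adjacent; that is a design choice of this sketch rather than something stated above.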
3. The model was then trained using qlora-pipe for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):
# Resume a prior run
resume_from_checkpoint = false
# Paths
model = 'DeepSeek-V3-0324-CODER-DRAFT-0.6B-UNTRAINED'
output_dir = 'DeepSeek-V3-0324-CODER-DRAFT-0.6B'
# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100
# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'
# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20
[optimizer]
type = 'adamw_kahan'
lr = 5e-5
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01
[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 20,
"gradient_clipping": 1.0,
"steps_per_print": 1
}
I used six RTX A6000 GPUs over three nodes, hence the batch size of 120 (6 GPUs x 20 gradient accumulation steps = 120).
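The batch-size and tokens-per-step figures can be checked with a few lines of arithmetic. The sketch below uses only the values from the configuration above. The step-count estimate is an assumption: it treats the ~2.5B-token dataset size as exact, ignores the 1% evaluation split, and assumes every sequence is filled to the full 32768 tokens.

```python
# Values taken from the configuration above.
num_gpus = 6
micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 20
sequence_len = 32768

# With pipeline_stages = 1, each GPU acts as one data-parallel worker.
sequences_per_step = num_gpus * micro_batch_size_per_gpu * gradient_accumulation_steps
tokens_per_step = sequences_per_step * sequence_len

# Assumption: ~2.5B tokens taken as exactly 2.5e9, eval split ignored.
approx_dataset_tokens = 2_500_000_000
approx_steps = approx_dataset_tokens // tokens_per_step

print(sequences_per_step)  # 120
print(tokens_per_step)     # 3932160, i.e. ~4M tokens per step
print(approx_steps)        # 635
```

Under these assumptions, one epoch works out to roughly 635 optimizer steps.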