---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-Coder-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
tags:
- draft
- speculative-decoding
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

![image.webp](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/KL97x9lVuhmIPXbbKgvyY.webp)

***NOTE***: *This is just a slightly improved version that I trained using `"max_position_embeddings": 65536` + `"rope_scaling": {"factor": 2.0, ...` rather than setting the `rope_scaling` after training...*

---

A `0.6B` parameter draft (speculative decoding) model for use with [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) and [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3).

**NOTES**:

- Unlike the previous [jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0](https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0), this version was trained using only the [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample) and [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl) datasets.
- Unlike the previous [jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0](https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0), this version doesn't trim the attention heads down from 14 to 12.

See [jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0-GGUF) for the models in GGUF format.

---

# How the model was created

## 1. The initial model was created from [Qwen/Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
python ./transplant_vocab.py \
    Qwen2.5-Coder-0.5B-Instruct \
    DeepSeek-V3-0324-BF16 \
    DeepSeek-V3-0324-CODER-DRAFT-0.6B-UNTRAINED \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "" \
    --override "<|tool▁call▁begin|>" "" \
    --override "<|tool▁outputs▁begin|>" "" \
    --override "<|tool▁output▁begin|>" "" \
    --override "<|tool▁calls▁end|>" "" \
    --override "<|tool▁call▁end|>" "" \
    --override "<|tool▁outputs▁end|>" "" \
    --override "<|tool▁output▁end|>" "" \
    --override "<|tool▁sep|>" ""
```

## 2. The following datasets were merged to create a fine-tuning dataset of ~2.5B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)

Each sample was formatted just between `<|end▁of▁sentence|>` tags (a rough sketch of this step follows below).
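The exact preprocessing script isn't published in this repo, so the following is only a minimal sketch of what it could look like: stream both datasets, wrap each sample in the `<|end▁of▁sentence|>` tag, and write plain-text shards matching the `mixed_data/*.txt` glob used by the qlora-pipe config in step 3. The column names (`"text"` / `"content"`), shard size, and exact tag placement are assumptions.

```python
# Minimal sketch only -- the real preprocessing script is not included here.
# Column names ("text" / "content"), shard size and the exact tag placement
# are assumptions; some datasets may also need a data_dir or config name.
import os
from datasets import load_dataset

EOS = "<|end▁of▁sentence|>"   # DeepSeek-V3 end-of-sentence token
SAMPLES_PER_SHARD = 100_000   # arbitrary shard size

sources = [
    ("agentlans/common-crawl-sample", "text"),  # web text
    ("bigcode/the-stack-smol-xl", "content"),   # source code
]

os.makedirs("mixed_data", exist_ok=True)

def flush(shard_idx, samples):
    """Write one plain-text shard that qlora-pipe's 'textfile' loader can read."""
    with open(f"mixed_data/shard_{shard_idx:05d}.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(samples))

shard, buffer = 0, []
for repo, field in sources:
    for row in load_dataset(repo, split="train", streaming=True):
        # Wrap every sample between <|end▁of▁sentence|> tags, as described above.
        buffer.append(f"{EOS}{row[field]}{EOS}")
        if len(buffer) >= SAMPLES_PER_SHARD:
            flush(shard, buffer)
            shard, buffer = shard + 1, []

if buffer:
    flush(shard, buffer)
```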
## 3. The model was then trained using [qlora-pipe](https://github.com/tdrussell/qlora-pipe) for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

```toml
# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-V3-0324-CODER-DRAFT-0.6B-UNTRAINED'
output_dir = 'DeepSeek-V3-0324-CODER-DRAFT-0.6B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 5e-5
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01
```

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 20,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```

I used six `RTX A6000` GPUs over three nodes, hence the batch size of `120` (`6 x 20 gradient accumulation steps = 120`).
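As a usage note: with `transformers`, a draft model like this can be passed as the `assistant_model` for speculative ("assisted") generation, which only works because the vocab transplant makes the draft share DeepSeek-V3's tokenizer. The sketch below is hypothetical: the draft repo id is assumed, and it presumes you can actually host DeepSeek-V3-0324 itself; in practice most people will instead pair the GGUF build linked above with llama.cpp's draft-model support.

```python
# Hypothetical sketch of speculative ("assisted") generation with transformers.
# Assumes enough hardware to host DeepSeek-V3-0324 and that the draft repo id
# below is correct -- adjust both to your setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "deepseek-ai/DeepSeek-V3-0324"
draft_id = "jukofyork/DeepSeek-V3-0324-CODER-DRAFT-0.6B-v1.0"  # assumed repo id

# After the vocab transplant the draft uses DeepSeek-V3's tokenizer, so one
# tokenizer serves both models.
tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype="auto", device_map="auto",
    trust_remote_code=True,  # may be needed depending on your transformers version
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(target.device)

# Passing the draft as `assistant_model` enables speculative decoding.
output_ids = target.generate(input_ids, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True))
```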