danielhanchen commited on
Commit
19b180d
·
verified ·
1 Parent(s): a587207

Add files using upload-large-folder tool

Browse files
Files changed (6) hide show
  1. .gitattributes +1 -1
  2. LICENSE +21 -0
  3. README.md +71 -44
  4. config.json +6 -5
  5. figures/benchmark.jpg +3 -0
  6. generation_config.json +2 -4
.gitattributes CHANGED
@@ -33,4 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- tokenizer.json filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ figures/benchmark.jpg filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ MIT License
2
+
3
+ Copyright (c) 2023 DeepSeek
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,52 +1,50 @@
1
  ---
2
- base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
3
- language:
4
- - en
5
- license: llama3.3
6
  library_name: transformers
7
- tags:
8
- - deepseek
9
- - unsloth
10
- - transformers
11
- - llama
12
- - llama-3
13
- - meta
14
  ---
 
 
 
 
15
 
16
- ## ***See [our collection](https://huggingface.co/collections/unsloth/deepseek-r1-all-versions-678e1c48f5d2fce87892ace5) for versions of Deepseek-R1 including GGUF and original formats.***
17
-
18
-
19
- # Finetune LLMs 2-5x faster with 70% less memory via Unsloth!
20
- We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb
21
-
22
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/Discord%20button.png" width="200"/>](https://discord.gg/unsloth)
23
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
24
-
25
-
26
- ## ✨ Finetune for Free
27
-
28
- All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
29
-
30
- | Unsloth supports | Free Notebooks | Performance | Memory use |
31
- |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------|
32
- | **Llama-3.2 (3B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) | 2.4x faster | 58% less |
33
- | **Llama-3.2 (11B vision)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb) | 2x faster | 60% less |
34
- | **Qwen2 VL (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2_VL_(7B)-Vision.ipynb) | 1.8x faster | 60% less |
35
- | **Qwen2.5 (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(7B)-Alpaca.ipynb) | 2x faster | 60% less |
36
- | **Llama-3.1 (8B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2.4x faster | 58% less |
37
- | **Phi-3.5 (mini)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_3.5_Mini-Conversational.ipynb) | 2x faster | 50% less |
38
- | **Gemma 2 (9B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma2_(9B)-Alpaca.ipynb) | 2.4x faster | 58% less |
39
- | **Mistral (7B)** | [▶️ Start on Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-Conversational.ipynb) | 2.2x faster | 62% less |
40
 
41
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="200"/>](https://docs.unsloth.ai)
 
 
 
 
 
 
 
 
 
 
42
 
43
- - This [Llama 3.2 conversational notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb) is useful for ShareGPT ChatML / Vicuna templates.
44
- - This [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr.
45
- - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
 
 
46
 
47
- ## Special Thanks
48
- A huge thank you to the DeepSeek team for creating and releasing these models.
49
 
 
 
 
50
 
51
 
52
  ## 1. Introduction
@@ -59,6 +57,8 @@ we introduce DeepSeek-R1, which incorporates cold-start data before RL.
59
  DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
60
  To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
61
 
 
 
62
  <p align="center">
63
  <img width="80%" src="figures/benchmark.jpg">
64
  </p>
@@ -95,7 +95,7 @@ To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSe
95
  </div>
96
 
97
  DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base.
98
- For more details regrading the model architecture, please refer to [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repository.
99
 
100
  ### DeepSeek-R1-Distill Models
101
 
@@ -184,6 +184,8 @@ We also provide OpenAI-Compatible API at DeepSeek Platform: [platform.deepseek.c
184
 
185
  Please visit [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running DeepSeek-R1 locally.
186
 
 
 
187
  ### DeepSeek-R1-Distill Models
188
 
189
  DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
@@ -194,7 +196,23 @@ For instance, you can easily start a service using [vLLM](https://github.com/vll
194
  vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
195
  ```
196
 
197
- **NOTE: We recommend setting an appropriate temperature (between 0.5 and 0.7) when running these models, otherwise you may encounter issues with endless repetition or incoherent output.**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
198
 
199
  ## 7. License
200
  This code repository and the model weights are licensed under the [MIT License](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE).
@@ -205,8 +223,17 @@ DeepSeek-R1 series support commercial use, allow for any modifications and deriv
205
 
206
  ## 8. Citation
207
  ```
 
 
 
 
 
 
 
 
 
208
 
209
  ```
210
 
211
  ## 9. Contact
212
- If you have any questions, please raise an issue or contact us at [[email protected]]([email protected]).
 
1
  ---
2
+ license: mit
 
 
 
3
  library_name: transformers
 
 
 
 
 
 
 
4
  ---
5
+ # DeepSeek-R1
6
+ <!-- markdownlint-disable first-line-h1 -->
7
+ <!-- markdownlint-disable html -->
8
+ <!-- markdownlint-disable no-duplicate-header -->
9
 
10
+ <div align="center">
11
+ <img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/logo.svg?raw=true" width="60%" alt="DeepSeek-V3" />
12
+ </div>
13
+ <hr>
14
+ <div align="center" style="line-height: 1;">
15
+ <a href="https://www.deepseek.com/" target="_blank" style="margin: 2px;">
16
+ <img alt="Homepage" src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/badge.svg?raw=true" style="display: inline-block; vertical-align: middle;"/>
17
+ </a>
18
+ <a href="https://chat.deepseek.com/" target="_blank" style="margin: 2px;">
19
+ <img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20R1-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
20
+ </a>
21
+ <a href="https://huggingface.co/deepseek-ai" target="_blank" style="margin: 2px;">
22
+ <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
23
+ </a>
24
+ </div>
 
 
 
 
 
 
 
 
 
25
 
26
+ <div align="center" style="line-height: 1;">
27
+ <a href="https://discord.gg/Tc7c45Zzu5" target="_blank" style="margin: 2px;">
28
+ <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
29
+ </a>
30
+ <a href="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/qr.jpeg?raw=true" target="_blank" style="margin: 2px;">
31
+ <img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
32
+ </a>
33
+ <a href="https://twitter.com/deepseek_ai" target="_blank" style="margin: 2px;">
34
+ <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
35
+ </a>
36
+ </div>
37
 
38
+ <div align="center" style="line-height: 1;">
39
+ <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE" style="margin: 2px;">
40
+ <img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
41
+ </a>
42
+ </div>
43
 
 
 
44
 
45
+ <p align="center">
46
+ <a href="https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf"><b>Paper Link</b>👁️</a>
47
+ </p>
48
 
49
 
50
  ## 1. Introduction
 
57
  DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
58
  To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks, achieving new state-of-the-art results for dense models.
59
 
60
+ **NOTE: Before running DeepSeek-R1 series models locally, we kindly recommend reviewing the [Usage Recommendation](#usage-recommendations) section.**
61
+
62
  <p align="center">
63
  <img width="80%" src="figures/benchmark.jpg">
64
  </p>
 
95
  </div>
96
 
97
  DeepSeek-R1-Zero & DeepSeek-R1 are trained based on DeepSeek-V3-Base.
98
+ For more details regarding the model architecture, please refer to [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repository.
99
 
100
  ### DeepSeek-R1-Distill Models
101
 
 
184
 
185
  Please visit [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running DeepSeek-R1 locally.
186
 
187
+ **NOTE: Hugging Face's Transformers has not been directly supported yet.**
188
+
189
  ### DeepSeek-R1-Distill Models
190
 
191
  DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
 
196
  vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager
197
  ```
198
 
199
+ You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang)
200
+
201
+ ```bash
202
+ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2
203
+ ```
204
+
205
+ ### Usage Recommendations
206
+
207
+ **We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:**
208
+
209
+ 1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
210
+ 2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**
211
+ 3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
212
+ 4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.
213
+
214
+ Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "\<think\>\n\n\</think\>") when responding to certain queries, which can adversely affect the model's performance.
215
+ **To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "\<think\>\n" at the beginning of every output.**
216
 
217
  ## 7. License
218
  This code repository and the model weights are licensed under the [MIT License](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/LICENSE).
 
223
 
224
  ## 8. Citation
225
  ```
226
+ @misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
227
+ title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
228
+ author={DeepSeek-AI},
229
+ year={2025},
230
+ eprint={2501.12948},
231
+ archivePrefix={arXiv},
232
+ primaryClass={cs.CL},
233
+ url={https://arxiv.org/abs/2501.12948},
234
+ }
235
 
236
  ```
237
 
238
  ## 9. Contact
239
+ If you have any questions, please raise an issue or contact us at [[email protected]]([email protected]).
config.json CHANGED
@@ -1,12 +1,15 @@
1
  {
2
- "_name_or_path": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
6
  "attention_bias": false,
7
  "attention_dropout": 0.0,
8
  "bos_token_id": 128000,
9
- "eos_token_id": 128001,
 
 
 
 
10
  "head_dim": 128,
11
  "hidden_act": "silu",
12
  "hidden_size": 8192,
@@ -18,7 +21,6 @@
18
  "num_attention_heads": 64,
19
  "num_hidden_layers": 80,
20
  "num_key_value_heads": 8,
21
- "pad_token_id": 128004,
22
  "pretraining_tp": 1,
23
  "rms_norm_eps": 1e-05,
24
  "rope_scaling": {
@@ -31,8 +33,7 @@
31
  "rope_theta": 500000.0,
32
  "tie_word_embeddings": false,
33
  "torch_dtype": "bfloat16",
34
- "transformers_version": "4.49.0.dev0",
35
- "unsloth_fixed": true,
36
  "use_cache": true,
37
  "vocab_size": 128256
38
  }
 
1
  {
 
2
  "architectures": [
3
  "LlamaForCausalLM"
4
  ],
5
  "attention_bias": false,
6
  "attention_dropout": 0.0,
7
  "bos_token_id": 128000,
8
+ "eos_token_id": [
9
+ 128001,
10
+ 128008,
11
+ 128009
12
+ ],
13
  "head_dim": 128,
14
  "hidden_act": "silu",
15
  "hidden_size": 8192,
 
21
  "num_attention_heads": 64,
22
  "num_hidden_layers": 80,
23
  "num_key_value_heads": 8,
 
24
  "pretraining_tp": 1,
25
  "rms_norm_eps": 1e-05,
26
  "rope_scaling": {
 
33
  "rope_theta": 500000.0,
34
  "tie_word_embeddings": false,
35
  "torch_dtype": "bfloat16",
36
+ "transformers_version": "4.47.0.dev0",
 
37
  "use_cache": true,
38
  "vocab_size": 128256
39
  }
figures/benchmark.jpg ADDED

Git LFS Details

  • SHA256: 96fa3297b31b53a21a283886d9f6d5759433e7523bd63b09d5ac43e9422dae97
  • Pointer size: 131 Bytes
  • Size of remote file: 777 kB
generation_config.json CHANGED
@@ -1,11 +1,9 @@
1
  {
2
  "_from_model_config": true,
3
  "bos_token_id": 128000,
4
- "do_sample": true,
5
  "eos_token_id": 128001,
6
- "max_length": 131072,
7
- "pad_token_id": 128004,
8
  "temperature": 0.6,
9
  "top_p": 0.95,
10
- "transformers_version": "4.49.0.dev0"
11
  }
 
1
  {
2
  "_from_model_config": true,
3
  "bos_token_id": 128000,
 
4
  "eos_token_id": 128001,
5
+ "do_sample": true,
 
6
  "temperature": 0.6,
7
  "top_p": 0.95,
8
+ "transformers_version": "4.39.3"
9
  }