luofuli committed on
Commit 89a2dbc
1 Parent(s): 1e90cb3

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -52,16 +52,14 @@
  # DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

  ## 1. Introduction
- Today, we’re introducing DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token. Compared with DeepSeek 67B, DeepSeek-V2 achieves stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times.

- <p align="center">
- <div style="display: flex; justify-content: center;">
- <img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/activationparameters.png?raw=true" style="height:300px; width:auto; margin-right:10px">
- <img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/trainingcost.png?raw=true" style="height:300px; width:auto; margin-left:10px">
- </div>
- </p>
+ Last week, the release and buzz around DeepSeek-V2 have ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. And now DeepSeek-V2-Lite comes out:

- We pretrained DeepSeek-V2 on a diverse and high-quality corpus comprising 8.1 trillion tokens. This comprehensive pretraining was followed by a process of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unleash the model's capabilities. The evaluation results validate the effectiveness of our approach as DeepSeek-V2 achieves remarkable performance on both standard benchmarks and open-ended generation evaluation.
+ - 16B total params, 2.4B active params, 5.7T training tokens
+ - Outperforms 7B dense and 16B MoE on many benchmarks
+ - Deployable on single 40G GPU, fine-tunable on 8x80G GPUs
+
+ DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.

  ## 2. News

@@ -70,6 +68,8 @@ We pretrained DeepSeek-V2 on a diverse and high-quality corpus comprising 8.1 tr

  ## 3. Model Downloads

+ With DeepSeek-V2, we are open-sourcing base and chat models across two sizes:
+
  <div align="center">

  | **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** |
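The updated introduction above credits MLA with keeping inference efficient by compressing the Key-Value (KV) cache into a latent vector. As a rough illustration of that idea, here is a minimal PyTorch sketch of low-rank KV compression. It is not the DeepSeek-V2 implementation: the class name `LatentKVAttention`, the dimension names, and the omission of RoPE and the paper's decoupled key path are simplifying assumptions.

```python
# Minimal sketch of low-rank KV-cache compression in the spirit of MLA.
# NOT the DeepSeek-V2 implementation; names, shapes, and the omission of
# RoPE / the decoupled key path are simplifications for illustration only.
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentKVAttention(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int, head_dim: int, kv_latent_dim: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q_proj = nn.Linear(hidden_dim, num_heads * head_dim, bias=False)
        # Down-projection: only this small latent vector is cached per token,
        # instead of full per-head keys and values.
        self.kv_down = nn.Linear(hidden_dim, kv_latent_dim, bias=False)
        # Up-projections rebuild keys/values from the cached latent at attention time.
        self.k_up = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * head_dim, hidden_dim, bias=False)

    def forward(
        self, x: torch.Tensor, kv_cache: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        b, t, _ = x.shape
        latent = self.kv_down(x)                      # (b, t, kv_latent_dim)
        if kv_cache is not None:                      # grow the compressed cache
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        out = out.transpose(1, 2).reshape(b, t, self.num_heads * self.head_dim)
        # The caller stores `latent` as the KV cache for the next decoding step.
        return self.o_proj(out), latent
```

The point is the cache footprint: each cached token stores `kv_latent_dim` values instead of `2 * num_heads * head_dim`, which is where the large KV-cache reduction comes from.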
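The Model Downloads table links to Hugging Face checkpoints, and the bullets note that the Lite model fits on a single 40G GPU. A common way to try it is the standard `transformers` loading pattern sketched below; the repository id, dtype, and generation settings are assumptions rather than an official recipe, so check the model card for the recommended configuration.

```python
# Generic Hugging Face `transformers` loading sketch for the Lite checkpoint.
# The repo id, dtype, and sampling settings are assumptions, not an official
# recommendation; consult the model card before use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"  # assumed repo id; see the download table

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,       # DeepSeek-V2 models ship custom modeling code
    torch_dtype=torch.bfloat16,   # bf16 keeps the 16B Lite model within a 40G GPU
    device_map="auto",
)

inputs = tokenizer("Multi-head Latent Attention is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```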