luofuli committed
Commit 4621eeb
1 Parent(s): 89a2dbc

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -55,8 +55,8 @@
 
 Last week, the release and buzz around DeepSeek-V2 have ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. And now DeepSeek-V2-Lite comes out:
 
-- 16B total params, 2.4B active params, 5.7T training tokens
-- Outperforms 7B dense and 16B MoE on many benchmarks
+- 16B total params, 2.4B active params, scratch training with 5.7T tokens
+- Outperforms 7B dense and 16B MoE on many English & Chinese benchmarks
 - Deployable on single 40G GPU, fine-tunable on 8x80G GPUs
 
 DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.
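The "deployable on a single 40G GPU" bullet can be illustrated with a standard Hugging Face Transformers loading flow. The sketch below is not part of this commit; it assumes the checkpoint id `deepseek-ai/DeepSeek-V2-Lite` and that the 16B parameters in bfloat16 (roughly 32 GB of weights) fit on one 40G card:

```python
# Minimal sketch: load DeepSeek-V2-Lite on a single ~40GB GPU with Transformers.
# The repo id and memory estimate are assumptions, not taken from the diff above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 16B params in bf16 ~= 32 GB of weights
    device_map="auto",           # let Accelerate place the weights on the GPU
    trust_remote_code=True,      # custom MLA/MoE modeling code ships with the repo
)

inputs = tokenizer("An attention mechanism is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

`trust_remote_code=True` is typically required here because the MLA and DeepSeekMoE layers are implemented in modeling code distributed with the model repository rather than in the transformers library itself.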