---
license: cc0-1.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- bigcode/the-stack-v2
language:
- en
- zh
- ru
- de
- ja
- es
- fr
- it
- pt
- pl
- nl
- id
- tr
- cs
- ko
- ar
- hu
- fa
- ro
- vi
- uk
- 'no'
- th
- el
- sv
- da
- sk
- hr
- hi
- lt
- bs
- he
- bn
- sl
- et
- ca
- lv
pipeline_tag: text-generation
---

Nous Consilience 40B is a generative text model, pretrained from scratch in a decentralized fashion over the internet.
This model is automatically updated every 500 training steps, with the latest checkpoint uploaded here from the [ongoing pretraining dashboard](https://psyche.network/).

For more information, read the [blog post](https://nousresearch.com/nous-psyche/).
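
Since checkpoints are pushed to this repository as training progresses, the latest snapshot can be loaded like any other causal language model with 🤗 Transformers. The snippet below is a minimal sketch, not an official example: the repository id is a placeholder for this repo's actual id, and the dtype and device settings are assumptions to adapt to your hardware.

```python
# Minimal sketch: load the latest checkpoint and generate a continuation.
# NOTE: "NousResearch/Consilience-40B" is a placeholder repo id -- replace it
# with this repository's actual id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "NousResearch/Consilience-40B"  # placeholder, assumption

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # 40B parameters: use reduced precision and shard across GPUs
    device_map="auto",
)

# This is a raw, un-annealed base model, so prompt it as a text continuation
# rather than with a chat template.
prompt = "The decentralized training run began with"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```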
52
+
53
+ # Model Details
54
+
55
+ **Model Type:** Decoder-only transformer
56
+ **Parameters:** 40 billion
57
+ **Architecture:** DeepSeek v3 + MLA (Dense version without MoE routers)
58
+ **Pretraining Data:** 20T tokens, Merge of [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), [FineWeb 2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2)
59
+ **Training Duration:** TBD
60
+ **Optimizer:** [DisTrO](https://github.com/NousResearch/DisTrO), decentralized version

# Pretraining Dataset
For training data, we combined FineWeb (14T tokens), FineWeb-2 with some less common languages removed (4T), and The Stack v2 (~0.2T, upsampled to 1T tokens). We chose these datasets over more specialized pretraining datasets that aim purely to increase benchmark performance. Our goal with Consilience is to make a true "base" model -- one representative of the entirety of humanity's creative output, not one that merely tries to win the benchmaxxing game.
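
As a back-of-the-envelope check on the resulting mixture (an illustrative sketch only, assuming tokens are drawn in proportion to each component's post-upsampling size quoted above):

```python
# Rough mixture implied by the figures above (trillions of tokens, post-upsampling).
# Assumption: sampling is proportional to each component's effective size.
components = {
    "FineWeb": 14.0,
    "FineWeb-2 (less common languages removed)": 4.0,
    "The Stack v2 (upsampled from ~0.2T)": 1.0,
}

total = sum(components.values())  # ~19T effective tokens
for name, tokens in components.items():
    print(f"{name}: {tokens:g}T (~{100 * tokens / total:.0f}% of the mixture)")

# Upsampling factor implied for The Stack v2: ~1T / ~0.2T
print(f"Implied Stack v2 upsampling: ~{1.0 / 0.2:.0f}x")
```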

Additionally, we're training this model continuously, without a final data "annealing" step. While annealing helps base models respond more accurately to benchmarks and improves usability, it may also constrain creativity and interesting behaviors. Our solution is simply to release both versions: the raw, un-annealed base model first, followed by an annealed version to aid usability.

# License
Because this model represents the vast and diverse creative output of humankind, we release it under a dual license: [**CC0**](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/cc0-1.0.md) by default -- dedicating it to the public domain -- while also permitting use under the [**MIT** license](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md) for users who require permissive terms with attribution and warranty disclaimers.