Fixed typo
README.md
CHANGED
@@ -37,20 +37,7 @@ SOLAR-KOEN-10.8B is an auto-regressive language model that leverages an optimize
 
 | |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
 |---|---|---|---|---|---|---|
-|SOLAR-KOEN-10.8B|*A curated mix of Korean+English Corpora*|10.8B|4k|O|>
-
-**Training Corpus**
-
-The model was trained using selected datasets from AIHub and Modu Corpus. Detailed information about the training datasets is available below:
-
-- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
-  - Only the `Training` segment of the data was used.
-  - The `Validation` and `Test` segments were deliberately excluded.
-- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
-
-The final JSONL dataset used to train this model is approximately 61GB in size.
-
-Total token count: Approximately 15 billion tokens (*using the expanded tokenizer. With the original SOLAR tokenizer, >60 billion tokens.)
+|SOLAR-KOEN-10.8B|*A curated mix of Korean+English Corpora*|10.8B|4k|O|>60B*|5e<sup>-5</sup>|
 
 **Vocab Expansion**
 
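The removed Training Corpus text reports ~15 billion tokens for the ~61GB corpus under the expanded tokenizer, versus >60 billion under the original SOLAR tokenizer. A minimal sketch of that arithmetic (the ~4x ratio is an inference from those two figures, not a number stated in the README):

```python
# Token counts reported for the same ~61GB JSONL corpus.
expanded_tokens = 15e9   # expanded (Korean+English) tokenizer
original_tokens = 60e9   # original SOLAR tokenizer (lower bound: ">60B")

# The expanded vocabulary encodes the corpus in roughly a quarter of
# the tokens, which proportionally cuts training steps per epoch.
compression = original_tokens / expanded_tokens
print(f"~{compression:.0f}x fewer tokens with the expanded tokenizer")
```

This is why the table's `Tokens` column cites >60B*: it counts tokens under the original SOLAR tokenizer, with the asterisk pointing to the expanded-tokenizer figure.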