---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: cc-by-nc-sa-4.0
---

**Update Log**

- 2024.02.19: Initial test version release of SOLAR-KOEN-10.8B

# **SOLAR-KOEN** ⭐🇰🇷

Solar-KoEn is a continually pretrained variant of the upstage/SOLAR-10.7B-v1.0 model, with an expanded Korean vocabulary and additional pretraining on a Korean+English corpus.
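
A minimal usage sketch with the 🤗 Transformers library is shown below. The hub ID `beomi/SOLAR-KOEN-10.8B` is taken from the citation at the bottom of this card; the dtype, device placement, and generation settings are illustrative assumptions, not official recommendations.

```python
# Minimal sketch: loading SOLAR-KOEN-10.8B with 🤗 Transformers and generating text.
# The hub ID, fp16 dtype, and sampling settings are assumptions; adjust to your setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/SOLAR-KOEN-10.8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # roughly 22 GB of weights in fp16 for a 10.8B model
    device_map="auto",
)

prompt = "안녕하세요, 오늘은 날씨가 좋네요."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```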

## Model Details

**Model Developers:** Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon)

**Variations:** Solar-KoEn is available in a single parameter size: 10.8B, as a continually pretrained model.

**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:** 

SOLAR-KOEN-10.8B is an auto-regressive language model built on an optimized transformer architecture derived from Llama-2.

| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|SOLAR-KOEN-10.8B|*A curated mix of Korean+English Corpora*|10.8B|2k|O|>15B*|5e-5|

**Training Corpus**

The model was trained using selected datasets from AIHub and Modu Corpus. Detailed information about the training datasets is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
  - Only the `Training` segment of the data was used.
  - The `Validation` and `Test` segments were deliberately excluded.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is approximately 61GB in size.

Total token count: approximately 15 billion tokens (*counted with the expanded tokenizer; the same corpus corresponds to >60 billion tokens with the original SOLAR tokenizer).

**Vocab Expansion**

| Model Name | Vocabulary Size | Description | 
| --- | --- | --- |
| Original SOLAR | 32000 | SentencePiece BPE |
| **Expanded SOLAR-KOEN-10.8B** | 46336 | SentencePiece BPE, with added Korean vocab and merges |
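
The expanded vocabulary can be verified directly from the tokenizer, for example with the short sketch below (the hub IDs `upstage/SOLAR-10.7B-v1.0` and `beomi/SOLAR-KOEN-10.8B` are taken from elsewhere on this card and are assumptions as far as this snippet is concerned):

```python
# Sketch: checking the original vs. expanded vocabulary sizes.
# len(tokenizer) counts the full vocabulary including any added tokens.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")

print(len(base))      # expected: 32000
print(len(expanded))  # expected: 46336
```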

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

- SOLAR-10.7B: 26 tokens
- SOLAR-KOEN-10.8B: 10 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| SOLAR-KOEN-10.8B | `['▁안', '녕', '하세요', ',', '▁오늘', '은', '▁날', '씨가', '▁좋네요', '.']` |

**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**

- SOLAR-10.7B: 22 tokens
- SOLAR-KOEN-10.8B: 22 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| SOLAR-KOEN-10.8B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
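
Both token-count comparisons above can be reproduced with the same two tokenizers; the following is a sketch rather than an official script:

```python
# Sketch: re-tokenizing the two example sentences above with both tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]:
    print(text)
    print("  SOLAR-10.7B      :", len(base.tokenize(text)), "tokens")
    print("  SOLAR-KOEN-10.8B :", len(expanded.tokenize(text)), "tokens")
```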

# LICENSE

Apache 2.0

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) (polyglot branch)
- 5-shot scores

|       Task        |Version|   Metric   | Value |   |Stderr|
|-------------------|------:|------------|------:|---|-----:|
|klue_mrc           |      0|exact       |50.2140|   |      |
|                   |       |f1          |54.0330|   |      |
|                   |       |HasAns_exact|73.1786|   |      |
|                   |       |HasAns_f1   |78.7442|   |      |
|                   |       |best_exact  |56.9594|   |      |
|                   |       |best_f1     |60.3743|   |      |
|korquad            |      1|exact_match |81.0530|   |      |
|                   |       |f1          |87.6418|   |      |
|klue_nli           |      0|acc         | 0.4540|±  |0.0091|
|klue_sts           |      0|acc         | 0.3410|±  |0.0208|
|                   |       |f1          | 0.4896|±  |0.0237|
|klue_ynat          |      0|acc         | 0.6308|±  |0.0051|
|                   |       |macro_f1    | 0.6086|±  |0.0057|
|kobest_boolq       |      0|acc         | 0.8711|±  |0.0089|
|                   |       |macro_f1    | 0.8705|±  |0.0090|
|kobest_copa        |      0|acc         | 0.8500|±  |0.0113|
|                   |       |macro_f1    | 0.8498|±  |0.0113|
|kobest_hellaswag   |      0|acc         | 0.5180|±  |0.0224|
|                   |       |acc_norm    | 0.6180|±  |0.0218|
|                   |       |macro_f1    | 0.5138|±  |0.0224|
|kobest_sentineg    |      0|acc         | 0.9723|±  |0.0082|
|                   |       |macro_f1    | 0.9723|±  |0.0083|
|kobest_wic         |      0|acc         | 0.5825|±  |0.0139|
|                   |       |macro_f1    | 0.4952|±  |0.0140|
|kohatespeech_apeach|      0|acc         | 0.7034|±  |0.0074|
|                   |       |macro_f1    | 0.7033|±  |0.0074|
|nsmc               |      0|acc         | 0.8738|±  |0.0015|
|pawsx_ko           |      0|acc         | 0.5510|±  |0.0111|
|kmmlu_direct       |      0|exact_match | 0.4220|±  |0.0909|
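
As a rough guide, scores of this kind can be obtained through the harness's `simple_evaluate` entry point. The sketch below is an assumption about how such a run might look: the `gpt2` model-adapter name, the exact task identifiers, and the `simple_evaluate` signature can differ between revisions of the polyglot branch, so it is not the exact command used for the table above.

```python
# Sketch: 5-shot evaluation on a few of the Korean tasks listed above, using the
# polyglot branch of EleutherAI/lm-evaluation-harness. The "gpt2" adapter (the
# generic HF causal-LM wrapper in older harness revisions) and the task names
# are assumptions; check the revision you install.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gpt2",
    model_args="pretrained=beomi/SOLAR-KOEN-10.8B",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag", "kobest_sentineg"],
    num_fewshot=5,
)
print(results["results"])
```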


## Citation

```
@misc {solar_koen_junbum_taekyoon_2024,
    author       = { {L. Junbum, Taekyoon Choi} },
    title        = { Solar-KoEn-10.8b },
    year         = 2024,
    url          = { https://huggingface.co/beomi/SOLAR-KOEN-10.8B },
    publisher    = { Hugging Face }
}

```

## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.