---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: cc-by-nc-sa-4.0
---
**Update Log**
- 2024.02.19: Initial test-version release of SOLAR-KOEN-10.8B
# **SOLAR-KOEN** ⭐🇰🇷
Solar-KoEn is an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, featuring an expanded vocabulary and continual pretraining on a Korean+English corpus.
## Model Details
**Model Developers:** Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon)
**Variations:** Solar-KoEn is available in a single parameter size: 10.8B, as a continually pretrained version.
**Input:** The model accepts only text input.
**Output:** The model produces text output exclusively.
**Model Architecture:**
SOLAR-KOEN-10.8B is an auto-regressive language model that uses an optimized transformer architecture derived from Llama-2.
| |Training Data|Parameters|Context Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|SOLAR-KOEN-10.8B|*A curated mix of Korean+English corpora*|10.8B|2k|Yes|>15B*|5e-5|
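The checkpoint can be loaded with the standard `transformers` causal-LM API. The snippet below is a minimal sketch; the `torch_dtype`, `device_map`, and generation settings are illustrative assumptions, not settings prescribed by this card.

```python
# Minimal usage sketch: load SOLAR-KOEN-10.8B with the standard transformers causal-LM API.
# torch_dtype/device_map and the sample prompt are illustrative assumptions, not prescribed settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/SOLAR-KOEN-10.8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: bf16 to keep the 10.8B weights within GPU memory
    device_map="auto",
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```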
**Training Corpus**
The model was trained using selected datasets from AIHub and Modu Corpus. Detailed information about the training datasets is available below:
- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
- Only the `Training` segment of the data was used.
- The `Validation` and `Test` segments were deliberately excluded.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
The final JSONL dataset used to train this model is approximately 61GB in size.
Total token count: approximately 15 billion tokens (*measured with the expanded tokenizer; the same corpus corresponds to >60 billion tokens under the original SOLAR tokenizer).
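For reference, a corpus-level token count like the one above can be obtained by streaming the JSONL files through the tokenizer. The sketch below is illustrative only: the file name `corpus.jsonl` and the `"text"` field are hypothetical placeholders, not the actual training artifacts.

```python
# Illustrative sketch only: how a corpus-level token count could be measured with the expanded tokenizer.
# "corpus.jsonl" and the "text" field are hypothetical placeholders, not the actual training files.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line)["text"]
        total_tokens += len(tokenizer(text, add_special_tokens=False)["input_ids"])

print(f"Total tokens: {total_tokens:,}")
```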
**Vocab Expansion**
| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original SOLAR | 32,000 | SentencePiece BPE |
| **Expanded SOLAR-KOEN-10.8B** | 46,336 | SentencePiece BPE, with added Korean vocabulary and merges |
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."** ("Hello, the weather is nice today.")
- SOLAR-10.7B: 26 tokens
- SOLAR-KOEN-10.8B: 10 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| SOLAR-KOEN-10.8B | `['▁안', '녕', '하세요', ',', '▁오늘', '은', '▁날', '씨가', '▁좋네요', '.']` |
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
- SOLAR-10.7B: 22 tokens
- SOLAR-KOEN-10.8B: 22 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| SOLAR-KOEN-10.8B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
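Both comparisons can be reproduced from the published tokenizers. The following is a small sketch assuming the Hub ids shown on this card; exact token strings may vary slightly across `tokenizers` versions, but the counts should match the tables above.

```python
# Reproduce the tokenization comparison between the original SOLAR tokenizer and the expanded one.
from transformers import AutoTokenizer

original = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")  # 32,000-token vocabulary
expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")    # 46,336-token vocabulary

print(len(original), len(expanded))  # vocabulary sizes, including added tokens

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]:
    print(text)
    print("  original:", len(original.tokenize(text)), "tokens")
    print("  expanded:", len(expanded.tokenize(text)), "tokens")
```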
# LICENSE
CC-BY-NC-SA-4.0
# **Model Benchmark**
## LM Eval Harness - Korean (polyglot branch)
- Evaluated with EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) (polyglot branch)
- 5-shot scores
| Task |Version| Metric | Value | |Stderr|
|-------------------|------:|------------|------:|---|-----:|
|klue_mrc | 0|exact |50.2140| | |
| | |f1 |54.0330| | |
| | |HasAns_exact|73.1786| | |
| | |HasAns_f1 |78.7442| | |
| | |best_exact |56.9594| | |
| | |best_f1 |60.3743| | |
|korquad | 1|exact_match |81.0530| | |
| | |f1 |87.6418| | |
|klue_nli | 0|acc | 0.4540|± |0.0091|
|klue_sts | 0|acc | 0.3410|± |0.0208|
| | |f1 | 0.4896|± |0.0237|
|klue_ynat | 0|acc | 0.6308|± |0.0051|
| | |macro_f1 | 0.6086|± |0.0057|
|kobest_boolq | 0|acc | 0.8711|± |0.0089|
| | |macro_f1 | 0.8705|± |0.0090|
|kobest_copa | 0|acc | 0.8500|± |0.0113|
| | |macro_f1 | 0.8498|± |0.0113|
|kobest_hellaswag | 0|acc | 0.5180|± |0.0224|
| | |acc_norm | 0.6180|± |0.0218|
| | |macro_f1 | 0.5138|± |0.0224|
|kobest_sentineg | 0|acc | 0.9723|± |0.0082|
| | |macro_f1 | 0.9723|± |0.0083|
|kobest_wic | 0|acc | 0.5825|± |0.0139|
| | |macro_f1 | 0.4952|± |0.0140|
|kohatespeech_apeach| 0|acc | 0.7034|± |0.0074|
| | |macro_f1 | 0.7033|± |0.0074|
|nsmc | 0|acc | 0.8738|± |0.0015|
|pawsx_ko | 0|acc | 0.5510|± |0.0111|
|kmmlu_direct | 0|exact_match | 0.4220|± |0.0909|
## Citation
```
@misc{solar_koen_junbum_taekyoon_2024,
  author    = {L. Junbum and Taekyoon Choi},
  title     = {SOLAR-KOEN-10.8B},
  year      = {2024},
  url       = {https://huggingface.co/beomi/SOLAR-KOEN-10.8B},
  publisher = {Hugging Face}
}
```
## Acknowledgements
- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.