---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: cc-by-nc-sa-4.0
---
Update Log
- 2024.02.19: Initial test version release of SOLAR-KOEN-10.8B
SOLAR-KOEN ⭐🇰🇷
Solar-KoEn represents an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, featuring an expanded vocabulary and the inclusion of a Korean+English corpus for enhanced pretraining.
Model Details
Model Developers: Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon)
Variations: Solar-KoEn is available in a single parameter size: 10.8B, as a continually pretrained version.
Input: The model accepts only text input.
Output: The model produces text output exclusively.
Model Architecture:
SOLAR-KOEN-10.8B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| SOLAR-KOEN-10.8B | A curated mix of Korean+English corpora | 10.8B | 2k | O | >15B* | 5e-5 |
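The model can be loaded with the standard Hugging Face transformers API. The snippet below is a minimal sketch, not an official recipe: the repository id is taken from the citation section, and the prompt, dtype, and generation settings are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "beomi/SOLAR-KOEN-10.8B"  # repo id from the citation section below

# Load the expanded tokenizer and the continually pretrained model.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # illustrative; pick a dtype that fits your hardware
    device_map="auto",           # requires `accelerate` to be installed
)

# This is a base (pretrained) model, so prompt it as plain text continuation.
prompt = "대한민국의 수도는"  # "The capital of South Korea is" (illustrative prompt)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```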
Training Corpus
The model was trained using selected datasets from AIHub and Modu Corpus. Detailed information about the training datasets is available below:
- AI Hub: corpus/AI_HUB
  - Only the `Training` segment of the data was used.
  - The `Validation` and `Test` segments were deliberately excluded.
- Modu Corpus: corpus/MODU_CORPUS
The final JSONL dataset used to train this model is approximately 61GB in size.
Total token count: approximately 15 billion tokens (*counted with the expanded tokenizer; with the original SOLAR tokenizer, >60 billion tokens).
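As an illustration of how such a count can be reproduced, here is a hedged sketch that streams a JSONL corpus and sums token counts with the expanded tokenizer. The file path and the `text` field name are hypothetical placeholders, not the actual dataset schema.

```python
import json
from transformers import AutoTokenizer

# Hypothetical corpus path and field name, for illustration only.
CORPUS_PATH = "corpus.jsonl"
TEXT_FIELD = "text"

tokenizer = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")

total_tokens = 0
with open(CORPUS_PATH, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # Count tokens without BOS/EOS so the sum reflects raw corpus length.
        total_tokens += len(
            tokenizer.encode(record[TEXT_FIELD], add_special_tokens=False)
        )

print(f"Total tokens: {total_tokens:,}")
```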
Vocab Expansion
| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | Sentencepiece BPE |
| Expanded SOLAR-KO-10.7B | 46336 | Sentencepiece BPE. Added Korean vocab and merges |
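The card does not document the exact expansion procedure (the real expansion added Korean vocabulary and merges at the SentencePiece level). As a rough sketch of the general mechanics, the transformers API below shows how added tokens require resizing the embedding matrix before continual pretraining; the token list is a made-up placeholder and `add_tokens` does not replicate SentencePiece merges.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from the original base model and tokenizer.
base_tokenizer = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
model = AutoModelForCausalLM.from_pretrained("upstage/SOLAR-10.7B-v1.0")

# Placeholder Korean tokens, for illustration only.
new_tokens = ["안녕하세요", "좋네요"]
num_added = base_tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids have rows to train
# during continual pretraining.
model.resize_token_embeddings(len(base_tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(base_tokenizer)}")
```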
Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."
- SOLAR-10.7B: 26 tokens
- SOLAR-KO-10.7B: 10 tokens

| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.'] |
| SOLAR-KO-10.7B | ['▁안', '녕', '하세요', ',', '▁오늘', '은', '▁날', '씨가', '▁좋네요', '.'] |
Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"
- SOLAR-10.7B: 22 tokens
- SOLAR-KO-10.7B: 22 tokens

| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
| SOLAR-KO-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
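These comparisons can be reproduced with a short sketch like the one below, assuming both tokenizers are available on the Hugging Face Hub (the upstage/SOLAR-10.7B-v1.0 id is taken from the introduction above).

```python
from transformers import AutoTokenizer

# Original SOLAR tokenizer vs. the expanded Korean+English tokenizer.
original = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")

print("vocab sizes:", len(original), len(expanded))

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]:
    orig_tokens = original.tokenize(text)
    exp_tokens = expanded.tokenize(text)
    print(f"{text!r}: original={len(orig_tokens)} tokens, expanded={len(exp_tokens)} tokens")
```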
LICENSE
Apache 2.0
Model Benchmark
LM Eval Harness - Korean (polyglot branch)
- Used EleutherAI's lm-evaluation-harness https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot
- 5-shot scores
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| klue_mrc | 0 | exact | 50.2140 | | |
| | | f1 | 54.0330 | | |
| | | HasAns_exact | 73.1786 | | |
| | | HasAns_f1 | 78.7442 | | |
| | | best_exact | 56.9594 | | |
| | | best_f1 | 60.3743 | | |
| korquad | 1 | exact_match | 81.0530 | | |
| | | f1 | 87.6418 | | |
| klue_nli | 0 | acc | 0.4540 | ± | 0.0091 |
| klue_sts | 0 | acc | 0.3410 | ± | 0.0208 |
| | | f1 | 0.4896 | ± | 0.0237 |
| klue_ynat | 0 | acc | 0.6308 | ± | 0.0051 |
| | | macro_f1 | 0.6086 | ± | 0.0057 |
| kobest_boolq | 0 | acc | 0.8711 | ± | 0.0089 |
| | | macro_f1 | 0.8705 | ± | 0.0090 |
| kobest_copa | 0 | acc | 0.8500 | ± | 0.0113 |
| | | macro_f1 | 0.8498 | ± | 0.0113 |
| kobest_hellaswag | 0 | acc | 0.5180 | ± | 0.0224 |
| | | acc_norm | 0.6180 | ± | 0.0218 |
| | | macro_f1 | 0.5138 | ± | 0.0224 |
| kobest_sentineg | 0 | acc | 0.9723 | ± | 0.0082 |
| | | macro_f1 | 0.9723 | ± | 0.0083 |
| kobest_wic | 0 | acc | 0.5825 | ± | 0.0139 |
| | | macro_f1 | 0.4952 | ± | 0.0140 |
| kohatespeech_apeach | 0 | acc | 0.7034 | ± | 0.0074 |
| | | macro_f1 | 0.7033 | ± | 0.0074 |
| nsmc | 0 | acc | 0.8738 | ± | 0.0015 |
| pawsx_ko | 0 | acc | 0.5510 | ± | 0.0111 |
| kmmlu_direct | 0 | exact_match | 0.4220 | ± | 0.0909 |
Citation
@misc{solar_koen_junbum_taekyoon_2024,
  author = { {L. Junbum, Taekyoon Choi} },
  title = { Solar-KoEn-10.8b },
  year = 2024,
  url = { https://huggingface.co/beomi/SOLAR-KOEN-10.8B },
  publisher = { Hugging Face }
}
Acknowledgements
- Training support was provided by the TPU Research Cloud program.