---
language:
  - ko
  - en
pipeline_tag: text-generation
inference: false
tags:
  - solar
  - mistral
  - pytorch
  - solar-ko
library_name: transformers
license: cc-by-nc-sa-4.0
---

Update Log

  • 2024.02.19: Initial test version release of SOLAR-KOEN-10.8B

SOLAR-KOEN ⭐🇰🇷

Solar-KoEn represents an advanced iteration of the upstage/SOLAR-10.7B-v1.0 model, featuring an expanded vocabulary and the inclusion of a Korean+English corpus for enhanced pretraining.

Model Details

Model Developers: Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon)

Variations: Solar-KoEn is available in a single parameter size: 10.8B, as a continually pretrained version.

Input: The model accepts only text input.

Output: The model produces text output exclusively.

Model Architecture:

SOLAR-KOEN-10.8B is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

|                  | Training Data                            | Parameters | Content Length | GQA | Tokens | Learning Rate |
| ---------------- | ---------------------------------------- | ---------- | -------------- | --- | ------ | ------------- |
| SOLAR-KOEN-10.8B | A curated mix of Korean+English corpora  | 10.8B      | 2k             | O   | >15B*  | 5e-5          |
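Below is a minimal usage sketch with 🤗 Transformers (not part of the original card; the dtype and device settings are assumptions to adjust for your hardware):

```python
# Minimal sketch: load SOLAR-KOEN-10.8B and generate text.
# torch_dtype / device_map values are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/SOLAR-KOEN-10.8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "대한민국의 수도는"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```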

Training Corpus

The model was trained using selected datasets from AIHub and Modu Corpus. Detailed information about the training datasets is available below:

  • AI Hub: corpus/AI_HUB
    • Only the Training segment of the data was used.
    • The Validation and Test segments were deliberately excluded.
  • Modu Corpus: corpus/MODU_CORPUS

The final JSONL dataset used to train this model is approximately 61GB in size.

Total token count: approximately 15 billion tokens (*counted with the expanded tokenizer; with the original SOLAR tokenizer, the same corpus amounts to >60 billion tokens).
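As a rough illustration of why the count depends on the tokenizer, a hedged sketch for sampling a JSONL corpus is shown below (the file path and `text` field are hypothetical, not the actual training layout):

```python
# Hypothetical sketch: estimate corpus token counts under two tokenizers.
# "corpus.jsonl" and the "text" field are illustrative only.
import json
from transformers import AutoTokenizer

expanded = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")
original = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")

def count_tokens(path, tokenizer, limit=1000):
    total = 0
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:  # sample only; the full corpus is ~61GB
                break
            total += len(tokenizer(json.loads(line)["text"])["input_ids"])
    return total

print("expanded tokenizer :", count_tokens("corpus.jsonl", expanded))
print("original tokenizer :", count_tokens("corpus.jsonl", original))
```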

Vocab Expansion

| Model Name                | Vocabulary Size | Description                                        |
| ------------------------- | --------------- | -------------------------------------------------- |
| Original Solar            | 32000           | Sentencepiece BPE                                  |
| Expanded SOLAR-KOEN-10.8B | 46336           | Sentencepiece BPE. Added Korean vocab and merges   |

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

  • SOLAR-10.7B: 26 tokens
  • SOLAR-KOEN-10.8B: 10 tokens

| Model            | Tokens |
| ---------------- | ------ |
| SOLAR-10.7B      | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| SOLAR-KOEN-10.8B | `['▁안', '녕', '하세요', ',', '▁오늘', '은', '▁날', '씨가', '▁좋네요', '.']` |

Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

  • SOLAR-10.7B: 22 tokens
  • SOLAR-KOEN-10.8B: 22 tokens

| Model            | Tokens |
| ---------------- | ------ |
| SOLAR-10.7B      | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| SOLAR-KOEN-10.8B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
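The comparisons above can be reproduced with a short sketch like the following (the tokenizer repositories are the public Hub IDs; everything else is illustrative):

```python
# Sketch: compare tokenization of the same sentences under both tokenizers.
from transformers import AutoTokenizer

solar = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
solar_koen = AutoTokenizer.from_pretrained("beomi/SOLAR-KOEN-10.8B")

print(len(solar), len(solar_koen))  # expected: 32000 vs 46336

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]:
    print(text)
    print("  SOLAR-10.7B     :", len(solar.tokenize(text)), solar.tokenize(text))
    print("  SOLAR-KOEN-10.8B:", len(solar_koen.tokenize(text)), solar_koen.tokenize(text))
```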

LICENSE

Apache 2.0

Model Benchmark

LM Eval Harness - Korean (polyglot branch)

| Task                | Version | Metric        | Value   | Stderr   |
| ------------------- | ------- | ------------- | ------- | -------- |
| klue_mrc            | 0       | exact         | 50.2140 |          |
|                     |         | f1            | 54.0330 |          |
|                     |         | HasAns_exact  | 73.1786 |          |
|                     |         | HasAns_f1     | 78.7442 |          |
|                     |         | best_exact    | 56.9594 |          |
|                     |         | best_f1       | 60.3743 |          |
| korquad             | 1       | exact_match   | 81.0530 |          |
|                     |         | f1            | 87.6418 |          |
| klue_nli            | 0       | acc           | 0.4540  | ± 0.0091 |
| klue_sts            | 0       | acc           | 0.3410  | ± 0.0208 |
|                     |         | f1            | 0.4896  | ± 0.0237 |
| klue_ynat           | 0       | acc           | 0.6308  | ± 0.0051 |
|                     |         | macro_f1      | 0.6086  | ± 0.0057 |
| kobest_boolq        | 0       | acc           | 0.8711  | ± 0.0089 |
|                     |         | macro_f1      | 0.8705  | ± 0.0090 |
| kobest_copa         | 0       | acc           | 0.8500  | ± 0.0113 |
|                     |         | macro_f1      | 0.8498  | ± 0.0113 |
| kobest_hellaswag    | 0       | acc           | 0.5180  | ± 0.0224 |
|                     |         | acc_norm      | 0.6180  | ± 0.0218 |
|                     |         | macro_f1      | 0.5138  | ± 0.0224 |
| kobest_sentineg     | 0       | acc           | 0.9723  | ± 0.0082 |
|                     |         | macro_f1      | 0.9723  | ± 0.0083 |
| kobest_wic          | 0       | acc           | 0.5825  | ± 0.0139 |
|                     |         | macro_f1      | 0.4952  | ± 0.0140 |
| kohatespeech_apeach | 0       | acc           | 0.7034  | ± 0.0074 |
|                     |         | macro_f1      | 0.7033  | ± 0.0074 |
| nsmc                | 0       | acc           | 0.8738  | ± 0.0015 |
| pawsx_ko            | 0       | acc           | 0.5510  | ± 0.0111 |
| kmmlu_direct        | 0       | exact_match   | 0.4220  | ± 0.0909 |
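
The scores above come from the Korean (polyglot) branch of lm-evaluation-harness. A hedged sketch of such a run via its Python API is shown below; the model type string, argument names, and task identifiers depend on the branch version and are assumptions here, not the exact command used for this card.

```python
# Hedged sketch (assumes the polyglot branch of lm-evaluation-harness is installed);
# argument names and task identifiers may differ by branch/version.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=beomi/SOLAR-KOEN-10.8B",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag", "nsmc"],
    num_fewshot=0,
)
print(results["results"])
```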

Citation

@misc {solar_koen_junbum_taekyoon_2024,
    author       = { {L. Junbum, Taekyoon Choi} },
    title        = { SOLAR-KOEN-10.8B },
    year         = 2024,
    url          = { https://huggingface.co/beomi/SOLAR-KOEN-10.8B },
    publisher    = { Hugging Face }
}

Acknowledgements