Experimental-Neo-TinyStories-Korean-800K-20240819
A new model trained on new datasets while keeping an almost identical architecture to the previous version. The only architectural difference is that the context length has been increased to 1024 tokens.
- Architecture: Llama
- Vocab size: 4096
- Hidden size: 64
- Layers: 5
- Heads: 8 (MHA)
- Context length: up to 1024 tokens
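For reference, below is a minimal sketch of a matching configuration in transformers. Only the values listed above are taken from this card; intermediate_size and the printed parameter count are assumptions used purely for illustration, not the actual training configuration.
from transformers import LlamaConfig, LlamaForCausalLM
# Sketch of a config with the dimensions listed above; intermediate_size is an
# assumption (the card does not state it), chosen here as 4x the hidden size.
config = LlamaConfig(
    vocab_size=4096,
    hidden_size=64,
    num_hidden_layers=5,
    num_attention_heads=8,         # MHA: key/value heads default to the attention head count
    max_position_embeddings=1024,
    intermediate_size=256,
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # rough sanity check against the ~800K in the model name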
Improvements
Even though this model is exceptionally small for a language model, it generates noticeably more accurate and logical sentences than its predecessor. These improvements were made possible by training on very simple stories generated by more powerful language models, inspired by the TinyStories paper. Instead of using a dataset translated from English, a much higher-quality dataset was obtained by generating new synthetic data in line with the methodology outlined in the paper.
This model was intentionally kept the same size as its predecessor to demonstrate the impact of dataset quality on model performance. In fact, despite being trained on only about 10% of the tokens used by the previous model, it exhibits significantly better performance. The dataset will be released along with a larger version of the Neo-TinyStories-Korean model once its creation and validation are complete.
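The data generation pipeline has not been released yet, so the snippet below is only a hypothetical sketch of the approach described above: asking a stronger model for very simple Korean stories. The client library, model name, and prompt wording are all assumptions, not the actual setup used for this dataset.
from openai import OpenAI
# Hypothetical sketch, not the actual pipeline: request one very simple Korean story
# from a stronger model, in the spirit of the TinyStories methodology.
client = OpenAI()
PROMPT = (
    "Write a very short story in Korean for a 3-4 year old child. "
    "Use only simple, common words and keep it to a few short sentences."
)
def generate_story() -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any sufficiently capable model could be used
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,       # higher temperature for more diverse stories
    )
    return response.choices[0].message.content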
Generation Examples
Result of this model when given only a single <s> token:
νΈλ₯΄λ₯Έ νλ μλ λλΆμ νμ΄μ΄ κ°λν λ μ΄μμ΄μ. μκΈ° κ³°μ μλ§ κ³°κ³Ό ν¨κ» μ²μΌλ‘ λλ¬ κ°μ΄μ. μ² μμλ μμ κ½λ€μ΄ νΌμ΄ μμκ³ , μκΈ° κ³°μ μ μ΄ λμ κ½μ ꡬ경νμ΄μ. μκΈ° κ³°μ κ½μ 보며 μ μ΄ λ¬μ΄μ. κ°μκΈ° μκΈ° κ³°μ μλ§ κ³°μκ² λ¬λ €κ° "μλ§, μ κ½ μλ»μ!"λΌκ³ λ§νμ΄μ. μλ§ κ³°μ μκΈ° κ³°μκ² "κ·Έλ, μμ κ½μ΄μΌ!"λΌκ³ λ§νλ©° μκΈ° κ³°μ κΌ μμ μ¬λ Έμ΄μ. μκΈ° κ³°μ μλ§ κ³°μ λ§μ λ£κ³ κ½μ κΊΎμ§ μκ³ μμκ² λ°λΌλ³΄λ κ²μ λ°°μ μ΄μ. μκΈ° κ³°μ μλ§ κ³°μ λ§μ μ λ£λ κ²μ΄ μ€μνλ€λ κ²μ μμμ΄μ.
Result of the previous model when given only a single <s> token:
μλ μλ , ν° μ²μμ ν° λ무λ€μ΄ μμ μ²μ μ΄κ³ μμμ΅λλ€. κ·Έ λ무λ€μ λ§€μ° ν볡νμ΅λλ€. μ΄λ λ , μμ μλ€μ΄ κ·Έ λ무μ μμ΄μ. κ·Έ λ무λ μμ μλ₯Ό λ³΄κ³ λ§€μ° κΈ°λ»νμ΅λλ€. "μλ , μμ μμΌ! μ΄λ»κ² ν΄μΌ?"λΌκ³ μκ° λ§νμ΅λλ€. "λλ μμ΄ μμ΄μ κ·Έλμ." μμ μλ "μ λ ν¬κ³ κ°ν΄μ§λλ°, μ λ μ λ§ κ°ν΄μ."λΌκ³ λ§νμ΅λλ€. μμ μλ λ μμ¬λΌ μμ λ¨Ήκ³ μΆμ΄ νμ΅λλ€. μλ μμ μ‘κ³ μΆμ΄ νμ΅λλ€. μμ μλ μμ μ°Ύμμ μμ μ°Ύμ λμ°μ΅λλ€. λ§μΉ¨λ΄, μμ μλ μμ μ°Ύμ ν° μμ μ°Ύμμ΅λλ€. μμ μλ μμ¬κ·λ€μ μ°Ύμμ μμ μ°Ύμλ΄μ΄ μμ λ½μλμ΅λλ€. μλ μμ λ°μ μμ λ€μ λ½μλμ΅λλ€. μμ μλ λ§€μ° ν볡νμ΅λλ€. κ·Έλ€μ μ’ μΌ μμΌλ‘ λμκ³ , κ·Έ λ μ΄νλ‘ κ·Έλ€μ νμ ν¨κ» λμμ΅λλ€.
Result of this model when given the start sequence 먼 옛날, 고양이:
λ¨Ό μλ , κ³ μμ΄ ν λ§λ¦¬κ° μ΄μμ΅λλ€. κ³ μμ΄λ λ°₯μ λ¨Ήκ³ μΆμμ΅λλ€. κ³ μμ΄λ λ°₯μ μ°Ύμ μ² μμΌλ‘ λ€μ΄κ°μ΅λλ€. κ³ μμ΄λ λ§μλ κ³ΌμΌμ μ°Ύμμ΅λλ€. κ³ μμ΄λ κ³ΌμΌμ λ¨Ήκ³ μΆμμ΅λλ€. κ³ μμ΄λ κ³ΌμΌμ λ¨ΉμΌλ €κ³ νμ§λ§, κ³ μμ΄λ λ°°μ μ¬λΌκ° μ μμμ΅λλ€. κ³ μμ΄λ μ¬νμ΅λλ€. κ³ μμ΄λ λ°₯μ λͺ» λ¨Ήμμ΅λλ€. κ³ μμ΄λ μ¬νμ΅λλ€. κ³ μμ΄λ μλ§μκ² λ§νμ΅λλ€. "μλ§, λ°₯μ΄ μμ΄μ!" κ³ μμ΄λ μ¬νμ΅λλ€. κ³ μμ΄λ μλ§μκ² λ°₯μ μ£Όμμ΅λλ€. μλ§λ κ³ μμ΄μκ² λ°₯μ μ£Όμμ΅λλ€. κ³ μμ΄λ λ°₯μ λ§μκ² λ¨Ήμμ΅λλ€. κ³ μμ΄λ μλ§μκ² κ³ λ§λ€κ³ λ§νμ΅λλ€. κ³ μμ΄λ μλ§ λ§μ μ λ€μμ΅λλ€.
Result of the previous model when given the start sequence 먼 옛날, 고양이:
λ¨Ό μλ , κ³ μμ΄, 보λΌμμΈ μ, ν°μ 곡μμμ λκ³ μμμ΅λλ€. κ·Έλ ν° λ무λ₯Ό λ³΄κ³ μ¬λΌκ°κ³ μΆμ΄ νμ΅λλ€. κ·Έλ μ¬λΌκ°κ³ μ¬λΌκ°κΈ° μμνμ΅λλ€. κ·Έλ μ¬λΌκ°μ λ무λ₯Ό μ€λ₯΄κΈ° μμνμ΅λλ€. ν°μ΄ μ¬λΌ κ°λ©΄μ κ·Έλ ν° λ무λ₯Ό 보μμ΅λλ€. κ·Έλ λ무λ₯Ό μ€λ₯΄κΈ° μμνμ΅λλ€. ν°μ λ무λ₯Ό μ€λ₯΄κΈ° μμνμ΅λλ€. κ·Έλ μμ μ‘μμ΅λλ€. κ·Έλ λ무 κΌλκΈ°κΉμ§ μ¬λΌκ°μ΅λλ€. κ·Έ λ무 κΌλκΈ°μ λλ¬νμ΅λλ€. ν°μ λ§€μ° ν볡νμ΅λλ€. κ·Έλ λ무 μλ ν° λ무λ₯Ό 보μμ΅λλ€. κ·Έλ λ무μμ λ΄λ €μ€λ©° λ무λ₯Ό μ€λ₯Ό μ μμμ΅λλ€. κ·Έλ λ무λ₯Ό μ€λ₯΄κ³ μΆμμ΅λλ€. νμ§λ§ ν°μ λ무λ₯Ό μ€λ₯΄κΈ° μμνμ΅λλ€. κ·Έλ ν° λ무λ₯Ό μ€λ₯΄κΈ° μν΄ μ¬λΌκ° μ¬λΌκ°μ΅λλ€. ν°μ λ무 κΌλκΈ°μ λλ¬νμ΅λλ€. κ·Έλ λ무λ₯Ό μ€λ₯΄λ κ²μ΄ λ무 기뻀μ΅λλ€. κ·Έλ λ무 κΌλκΈ°κΉμ§ λλ¬νμ΅λλ€. λ무 κΌλκΈ°μλ ν° λλ¬΄κ° μλ κ²μ 보μμ΅λλ€. κ·Έλ λ무 μλ‘ μ¬λΌκ°κ³ λ무λ₯Ό μ€λ₯΄κΈ° μμνμ΅λλ€. λ무λ λ무 κΌλκΈ°μ μ¬λΌκ°μ΅λλ€. ν°μ λ§€μ° κΈ°λ»€μ΅λλ€. κ·Έλ λ무μ λͺ¨λ μΉκ΅¬λ€μ λ°©λ¬Έν΄ μ€λ«λμ λ΄λ €μμ΅λλ€. κ·Έλ€μ ν¨κ» λ§μ μ¬λ―Έλ₯Ό λκΌμ΅λλ€.
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model and tokenizer from the Hub
model = AutoModelForCausalLM.from_pretrained('north-wind/Experimental-Neo-TinyStories-Korean-800K-20240819')
tokenizer = AutoTokenizer.from_pretrained('north-wind/Experimental-Neo-TinyStories-Korean-800K-20240819')
# An empty prompt means generation starts from the <s> (BOS) token alone,
# as in the unconditional examples above
input_text = ''
input_ids = tokenizer(input_text, return_tensors='pt').input_ids
# Sample a story up to the model's 1024-token context length
output = model.generate(input_ids, max_length=1024, do_sample=True, temperature=0.5)
print(tokenizer.decode(output[0]))
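To reproduce the prompted examples above, pass a start sequence instead of the empty string. The sampling settings below are the same illustrative ones as above, not necessarily those used to produce the samples shown earlier.
# Conditional generation from a Korean start sequence
input_text = '먼 옛날, 고양이'
input_ids = tokenizer(input_text, return_tensors='pt').input_ids
output = model.generate(input_ids, max_length=1024, do_sample=True, temperature=0.5)
print(tokenizer.decode(output[0]))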
Further plans
- Generate more data to match the quantity of the original English TinyStories dataset
- Apply quality filter to generated dataset
- Train a larger model on the final dataset
- Release the model and dataset