Chinese-Mixtral-8x7B

🚀 Introduction

This project performs Chinese vocabulary-extension incremental pretraining on Mixtral-8x7B, the model released by Mistral, in the hope of further advancing research on MoE models in the Chinese NLP community. The extended vocabulary significantly improves the model's encoding and decoding efficiency for Chinese, and incremental pretraining on large-scale open-source corpora gives the model strong Chinese generation and comprehension abilities.

Open-sourced in this project:

  • The vocabulary-extended Chinese Mixtral-8x7B model
  • The code for vocabulary-extension incremental pretraining

Please note that Chinese-Mixtral-8x7B may still produce misleading replies containing factual errors, or harmful content containing bias or discrimination. Use the generated content with care, and do not spread harmful generated content on the internet.

📥 Model Download

This project was trained with QLoRA. The LoRA weights and the merged-weight model are open-sourced separately; you can download whichever suits your needs:

| Model | Size | Download | Notes |
| --- | --- | --- | --- |
| Chinese-Mixtral-8x7B | 88GB | 🤗HuggingFace | Full vocabulary-extended model; can be used directly |
| Chinese-Mixtral-8x7B-adapter | 2.7GB | 🤗HuggingFace | LoRA weights; must be merged with the original Mixtral-8x7B before use (see here for the merge script; a generic sketch follows below) |
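
The merge script linked above is the authoritative way to combine the adapter with the base model. Purely as an illustration, a generic peft-based merge might look like the following sketch; the adapter repo id and the assumption that the extended tokenizer ships with the adapter are ours, not guarantees of the project.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Extended tokenizer (assumed here to be distributed with the adapter).
tokenizer = AutoTokenizer.from_pretrained("HIT-SCIR/Chinese-Mixtral-8x7B-adapter")

# Load the original base model and grow its embedding/lm_head to the new vocabulary.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16, device_map="auto"
)
base.resize_token_embeddings(len(tokenizer))

# Apply the LoRA weights, fold them into the base weights, and save the result.
model = PeftModel.from_pretrained(base, "HIT-SCIR/Chinese-Mixtral-8x7B-adapter")
merged = model.merge_and_unload()
merged.save_pretrained("./Chinese-Mixtral-8x7B-merged")
tokenizer.save_pretrained("./Chinese-Mixtral-8x7B-merged")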

💻 Model Inference

Chinese-Mixtral-8x7B supports the complete Mixtral-8x7B model ecosystem, including acceleration with vLLM or Flash Attention 2 and model quantization with bitsandbytes. Below are code examples for running inference with Chinese-Mixtral-8x7B; a vLLM sketch follows the 4-bit quantization example.

Using Flash Attention 2:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16, device_map="auto")

text = "ๆˆ‘็š„ๅๅญ—ๆ˜ฏ"
inputs = tokenizer(text, return_tensors="pt").to(0)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using 4-bit quantization:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

text = "ๆˆ‘็š„ๅๅญ—ๆ˜ฏ"
inputs = tokenizer(text, return_tensors="pt").to(0)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
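
For higher-throughput serving, the same model can be loaded with vLLM. A minimal sketch, assuming a multi-GPU machine; the tensor_parallel_size below is a placeholder, not a project recommendation:

from vllm import LLM, SamplingParams

# Load the merged model with vLLM; the 88GB of weights require several GPUs.
llm = LLM(model="HIT-SCIR/Chinese-Mixtral-8x7B", dtype="bfloat16", tensor_parallel_size=4)

params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=20)
outputs = llm.generate(["我的名字是"], params)
print(outputs[0].outputs[0].text)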

Please note that Chinese-Mixtral-8x7B is a base model and has not been instruction-tuned, so its instruction-following ability is limited. See the Fine-tuning section for how to fine-tune the model.

📈 Model Performance

Overall Capability

We evaluate Chinese-Mixtral-8x7B on the following benchmarks:

  • C-Eval: a comprehensive Chinese evaluation suite for foundation models, containing 13,948 multiple-choice questions across 52 disciplines and four difficulty levels.
  • CMMLU: a comprehensive Chinese benchmark designed to evaluate the knowledge and reasoning abilities of language models in Chinese contexts, covering 67 topics from basic disciplines to advanced professional levels.
  • MMLU: an English benchmark of 57 multiple-choice tasks covering elementary mathematics, US history, computer science, law, and more, with difficulty ranging from high-school to expert level; one of the mainstream LLM benchmarks today.
  • HellaSwag: a highly challenging English NLI benchmark in which every question requires deep understanding of the context and cannot be answered from common sense alone.

ๆ นๆฎMistralๅ‘ๅธƒ็š„ๆŠ€ๆœฏๆŠฅๅ‘Š๏ผŒMixtral-8x7BๅœจๆŽจ็†ๆ—ถๅฐ†ๆฟ€ๆดป13Bๅ‚ๆ•ฐใ€‚ไธ‹่กจไธบChinese-Mixtral-8x7BไธŽๅ…ถไป–13B่ง„ๆจก็š„ไธญๆ–‡ๆ‰ฉ่ฏ่กจๆจกๅž‹ๅœจๅ„ไธช่ฏ„ๆต‹ๆ•ฐๆฎ้›†ไธŠ็š„5-shot็ป“ๆžœ๏ผš

| Model | Incremental training corpus | C-Eval (zh) | CMMLU (zh) | MMLU (en) | HellaSwag (en) |
| --- | --- | --- | --- | --- | --- |
| IDEA-CCNL/Ziya2-13B-Base | 650B tokens | 59.29 | 60.93 | 59.86 | 58.90 |
| TigerResearch/tigerbot-13b-base-v3 | 500B tokens | 50.52 | 51.65 | 53.46 | 59.16 |
| Linly-AI/Chinese-LLaMA-2-13B-hf | 11B tokens | 42.57 | 41.95 | 51.32 | 59.05 |
| hfl/chinese-llama-2-13b | ~30B tokens (120GB) | 41.90 | 42.08 | 51.92 | 59.28 |
| Chinese-Mixtral-8x7B (this project) | 42B tokens | 52.08 | 51.08 | 69.80 | 65.69 |

In Chinese knowledge and comprehension, our Chinese-Mixtral-8x7B is on par with TigerBot-13B-Base-v3. Since Chinese-Mixtral-8x7B was trained on only 8% as much data as TigerBot-13B-Base-v3, our model still has room for further improvement. At the same time, thanks to the strong performance of the original Mixtral-8x7B, our Chinese-Mixtral-8x7B achieves the best English results among the vocabulary-extended models.

Because different versions of the evaluation scripts differ in minor implementation details, all of our evaluations uniformly use EleutherAI's lm-evaluation-harness at commit hash 28ec7fa to ensure consistent and fair results.

Generation Quality

The table below shows the generation quality of each vocabulary-extended model. Since some models' pretraining corpora are not separated with eos_token, we truncate the generated text at max_tokens = 100. Our sampling parameters are temperature = 0.8 and top_p = 0.9.
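
For reference, the sampling setup described above can be reproduced with transformers roughly as follows (a sketch; the prompt is arbitrary):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("我的名字是", return_tensors="pt").to(0)
outputs = model.generate(
    **inputs,
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=100,  # truncate, since some models never emit eos_token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))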

Chinese Encoding/Decoding Efficiency

To measure Chinese encoding/decoding efficiency, we encode a slice of the SkyPile dataset (2023-06_zh_head_0000.jsonl) with each vocabulary-extended model's tokenizer and compare the number of tokens each tokenizer produces for the Chinese text:

| Model | Family | Vocab size | Tokens for Chinese text | Efficiency |
| --- | --- | --- | --- | --- |
| meta-llama/Llama-2-13B-hf | LLaMA | 32000 | 780M | Low |
| mistralai/Mixtral-8x7B-v0.1 | Mixtral | 32000 | 606M | Low |
| Linly-AI/Chinese-LLaMA-2-13B-hf | LLaMA | 40076 | 532M | Medium |
| IDEA-CCNL/Ziya2-13B-Base | LLaMA | 39424 | 532M | Medium |
| hfl/chinese-llama-2-13b | LLaMA | 55296 | 365M | High |
| TigerResearch/tigerbot-13b-base-v3 | LLaMA | 65112 | 342M | High |
| Chinese-Mixtral-8x7B (this project) | Mixtral | 57000 | 355M | High |

On roughly 1.4GB of test text, the Chinese encoding/decoding efficiency of Chinese-Mixtral-8x7B is second only to TigerBot-13B-Base-v3, a 41.5% improvement over the original model. This speeds up inference on Chinese text and saves sequence length in scenarios such as In-Context Learning and Chain-of-Thought, which helps improve performance on complex reasoning tasks.
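
The measurement itself is straightforward to reproduce. A rough sketch, assuming each JSON line stores its content under a "text" field:

import json
from transformers import AutoTokenizer

# Tokenize the same SkyPile slice with each tokenizer and compare total token counts.
def count_tokens(tokenizer_name, path):
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(tok(json.loads(line)["text"]).input_ids)
    return total

path = "2023-06_zh_head_0000.jsonl"
for name in ["mistralai/Mixtral-8x7B-v0.1", "HIT-SCIR/Chinese-Mixtral-8x7B"]:
    print(name, count_tokens(name, path))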

⚙️ Training Details

่ฏ่กจๆ‰ฉๅ……

We use sentencepiece to train a Chinese BPE vocabulary on 12GB of Zhihu data and 2GB of WuDao data. During vocabulary training we separately enumerated the number of single-character Chinese tokens and the total number of Chinese tokens, then combined the two, producing several hundred vocabularies of varying size and content. To find the most suitable one, we score each vocabulary's Chinese lexical capacity with ALP, proposed by Zheng Bo et al. ALP measures the subword segmentation granularity of a specific language and penalizes a vocabulary's low- and mid-frequency subwords, making it a quick and convenient indicator of lexical capacity for a given language.
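
As a rough illustration of our reading of the metric (not the reference implementation), ALP can be approximated by segmenting a corpus with a candidate vocabulary, estimating unigram subword probabilities on that segmentation, and averaging the sentence log-probabilities; rare subwords contribute large negative terms, penalizing vocabularies full of seldom-used tokens:

import math
from collections import Counter

def alp(sentences, tokenize):
    """Approximate ALP for one language: mean sum of unigram log-probs per sentence."""
    segmented = [tokenize(s) for s in sentences]
    counts = Counter(tok for sent in segmented for tok in sent)
    total = sum(counts.values())
    logp = {t: math.log(c / total) for t, c in counts.items()}
    # Sentences segmented into many rare subwords score low.
    return sum(sum(logp[t] for t in sent) for sent in segmented) / len(sentences)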

We evaluated the ALP of the different vocabularies on book and encyclopedia corpora. In the figure, the four curves correspond to vocabularies with four different numbers of single-character Chinese tokens (4451, 5435, 6414, and 7434). To avoid a vocabulary so small that the Chinese compression rate suffers, or so large that the embedding layer becomes too sparse, we take the knee point of the ALP curves, which corresponds to adding 25,000 Chinese tokens to the vocabulary. Among the four curves we then choose the one with the highest ALP, i.e. the vocabulary adding 6414 single-character Chinese tokens, as the final vocabulary for Chinese-Mixtral-8x7B.

After obtaining the new vocabulary, the embedding and lm_head layers must be extended and initialized. We initialize each added row with the mean of the new token's word embeddings in the old embedding layer. In our preliminary experiments, this slightly outperformed HuggingFace's default implementation, which initializes new rows from a fixed normal distribution.
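
A minimal sketch of this mean initialization (the function and variable names are ours, not the project's code): for each new token, average the old embeddings of the pieces the original tokenizer splits it into.

import torch

@torch.no_grad()
def mean_init_new_tokens(model, old_tokenizer, new_tokenizer):
    old_size = len(old_tokenizer)
    model.resize_token_embeddings(len(new_tokenizer))
    emb = model.get_input_embeddings().weight    # embedding layer
    head = model.get_output_embeddings().weight  # lm_head layer
    for token_id in range(old_size, len(new_tokenizer)):
        text = new_tokenizer.convert_ids_to_tokens(token_id).replace("▁", " ")
        piece_ids = old_tokenizer(text, add_special_tokens=False).input_ids
        if piece_ids:  # average the constituent pieces' old embeddings
            emb[token_id] = emb[piece_ids].mean(dim=0)
            head[token_id] = head[piece_ids].mean(dim=0)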

Incremental Pretraining

Mixtral-8x7B has 46.7B parameters; full-parameter training requires combining multiple parallelism strategies, and the time cost is prohibitive with limited training resources. We therefore follow HuggingFace's officially recommended approach and train the model with QLoRA. Building on LoRA's low-rank decomposition, QLoRA introduces 4-bit quantization, double quantization, and paging via NVIDIA unified memory, further reducing the memory required for training while maintaining performance comparable to full-parameter training.

ๆˆ‘ไปฌๅ‚่€ƒYiming Cui็ญ‰ไบบๅฏนLoRA็š„่ฎพ็ฝฎ๏ผŒๅฏนๅŽŸๆจกๅž‹ๆ‰€ๆœ‰Linearๅฑ‚ๅบ”็”จไฝŽ็งฉๅˆ†่งฃ๏ผŒๅนถๅฐ†ๆ‰ฉๅขžๅŽ็š„embeddingๅ’Œlm_headๅฑ‚็š„ๅ‚ๆ•ฐ่ฎพ็ฝฎไธบๅฏ่ฎญ็ปƒใ€‚ๅฏนไบŽๆจกๅž‹ไธปไฝ“๏ผŒๆˆ‘ไปฌ้‡‡็”จNF4ๆ ผๅผ่ฟ›่กŒ้‡ๅŒ–๏ผŒ่ฟ™็งๆ ผๅผๅฏไปฅไฝฟๅพ—้‡ๅŒ–ๅŽ็š„ๆ•ฐๆฎไธŽ้‡ๅŒ–ๅ‰ๅ…ทๆœ‰ๅŒ็ญ‰็š„ๆ•ฐๆฎๅˆ†ๅธƒ๏ผŒๆจกๅž‹็š„ๆƒ้‡ไฟกๆฏๆŸๅคฑๆ›ดๅฐ‘ใ€‚

Environment Setup

We recommend Python 3.10 + torch 2.0.1.

# PyTorch + Transformers
$ pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
$ pip install transformers==4.36.2 datasets evaluate peft accelerate gradio optimum sentencepiece
$ pip install jupyterlab scikit-learn pandas matplotlib tensorboard nltk rouge bitsandbytes fire
# DeepSpeed
$ git clone https://github.com/microsoft/DeepSpeed.git
$ cd DeepSpeed
$ DS_BUILD_FUSED_ADAM=1 pip3 install .
# Flash Attention
$ pip install flash-attn --no-build-isolation

ๆ•ฐๆฎ้›†ไธ‹่ฝฝ

We trained Chinese-Mixtral-8x7B on existing open-source datasets, including:

ๆ•ฐๆฎ้›†ๅ็งฐ ๆ•ฐๆฎ้›†่ฏญ่จ€ ไฝฟ็”จๆ•ฐๆฎ้‡ ๅค‡ๆณจ
Skywork/SkyPile-150B ไธญๆ–‡ 30B ไป…ไฝฟ็”จ2022 + 2023ๅนด็š„ๆ•ฐๆฎ
DKYoon/SlimPajama-6B ่‹ฑๆ–‡ 12B ๆ•ฐๆฎ้›†้‡ๅค2 Epoch

Download the datasets into data with data/download.py. For the SlimPajama dataset, use data/parquet2jsonl.py to convert the raw dataset to jsonl format.

The downloaded dataset consists of multiple jsonl shards; merge the shards into a single jsonl file with cat.

$ cat *.jsonl > all.jsonl

Split the jsonl into train and valid sets with split. In this project the ratio of train to valid lines is 999:1.

$ wc -l all.jsonl                          # count the total number of lines in the dataset
$ split -l <lines> all.jsonl               # compute the train/valid line counts at 999:1 and split
$ mv xaa DKYoon-SlimPajama-6B-train.jsonl  # rename
$ mv xab DKYoon-SlimPajama-6B-dev.jsonl

ๆ•ฐๆฎ้›†้ข„ๅค„็†

ๅฐ†ๆ•ฐๆฎ้›†ๅ็งฐๅ’Œ่ทฏๅพ„ๆณจๅ†Œๅˆฐdata/datasets.tomlไธญ๏ผš

[DKYoon-SlimPajama-6B]              # dataset name
splits = ["train", "dev"]           # dataset train/valid splits
root = "{DATA_DIR}/en/{name}"       # dataset root directory
doc = "{name}-{split}"              # dataset file name
encoded = "encoded-{name}-{split}"  # where preprocessed output is saved

Use data/preprocess_datasets.py to pre-tokenize the datasets into subwords, which speeds up training.

$ python data/preprocess_datasets.py --ds_name SkyPile-150B-2023 --tokenizer_name_or_path tokenizer/Mixtral-8x7B-v0.1-vocab
$ python data/preprocess_datasets.py --ds_name DKYoon-SlimPajama-6B --tokenizer_name_or_path tokenizer/Mixtral-8x7B-v0.1-vocab

After subword segmentation, you can inspect the total token count of each dataset with data/utils.py:

$ python data/utils.py

Start Training

The training launch script is scripts/train.sh. Edit TRAIN_DATASETS in it to change the training datasets and their sampling ratios:

TRAIN_DATASETS=(
    1:SkyPile-150B-2022     # use all of SkyPile-150B-2022
    0.1:SkyPile-150B-2023   # use 10% of SkyPile-150B-2023
    1:DKYoon-SlimPajama-6B  # use all of DKYoon-SlimPajama-6B
)

If you use the SLURM cluster manager, submit the job with sbatch:

$ sbatch scripts/train.sh

Without SLURM, or to launch training from the command line, you can extract the torchrun command from scripts/train.sh and run it directly.

Fine-tuning

ๆœฌ้กน็›ฎๅ‘ๅธƒ็š„Chinese-Mixtral-8x7BไธบๅŸบๅบงๆจกๅž‹๏ผŒๆฒกๆœ‰็ป่ฟ‡ๅพฎ่ฐƒใ€‚ๅฆ‚ๆžœๆ‚จๅธŒๆœ›ไฝฟ็”จChinese-Mixtral-8x7B่ฟ›่กŒไธ‹ๆธธไปปๅŠกๅพฎ่ฐƒๆˆ–SFT๏ผŒๅฏไปฅๅ‚่€ƒHuggingFace็ป™ๅ‡บMixtral-8x7B็š„QLoRAๅพฎ่ฐƒ่„šๆœฌ่ฟ›่กŒ่ฎญ็ปƒ๏ผšHuggingFace็š„ๅฎ˜ๆ–น็คบไพ‹ไปฃ็ ใ€‚

✒️ Citation

If this project helps your research, or you use its code, please cite it:

@misc{Chinese-Mixtral-8x7B,
    author = {HIT-SCIR},
    title = {Chinese-Mixtral-8x7B: An Open-Source Mixture-of-Experts LLM},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/HIT-SCIR/Chinese-Mixtral-8x7B}}
}

🌟 Star History

Star History Chart

Open LLM Leaderboard Evaluation Results

Detailed results can be found here.

| Metric | Value |
| --- | --- |
| Avg. | 66.69 |
| AI2 Reasoning Challenge (25-shot) | 63.57 |
| HellaSwag (10-shot) | 85.98 |
| MMLU (5-shot) | 70.95 |
| TruthfulQA (0-shot) | 45.86 |
| Winogrande (5-shot) | 82.08 |
| GSM8k (5-shot) | 51.71 |