license: llama2

Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment

Step up your LLM alignment with Xwin-LM!

Xwin-LM aims to develop and open-source alignment technologies for large language models, including supervised fine-tuning (SFT), reward models, rejection sampling, reinforcement learning, etc. Our first release, built upon the Llama 2 base models, ranked TOP-1 on AlpacaEval. Notably, it is the first model to surpass GPT-4 on this benchmark. The project will be continuously updated.

News

  • :boom: [Sep, 2023] We released Xwin-LM-70B-V0.1, which achieved a 95.57% win-rate against Text-Davinci-003 on the AlpacaEval benchmark, ranking TOP-1 on AlpacaEval. It was the FIRST model to surpass GPT-4 on AlpacaEval. Its win-rate against GPT-4 is 60.61%.
  • :boom: [Sep, 2023] We released Xwin-LM-13B-V0.1, which achieved a 91.76% win-rate on AlpacaEval, ranking top-1 among all 13B models.
  • :boom: [Sep, 2023] We released Xwin-LM-7B-V0.1, which achieved an 87.82% win-rate on AlpacaEval, ranking top-1 among all 7B models.

Model Card

| Model | Checkpoint | Report | License |
|---|---|---|---|
| Xwin-LM-7B-V0.1 | 🤗 HF Link | 📃 Coming soon (Stay tuned) | Llama 2 License |
| Xwin-LM-13B-V0.1 | 🤗 HF Link | | Llama 2 License |
| Xwin-LM-70B-V0.1 | 🤗 HF Link | | Llama 2 License |

Benchmarks

Xwin-LM performance on AlpacaEval.

The table below shows the performance of Xwin-LM on AlpacaEval, which evaluates a model's win-rate against Text-Davinci-003 across 805 questions. To provide a more comprehensive evaluation, we also present, for the first time, win-rates against ChatGPT and GPT-4. The Xwin-LM model family establishes new state-of-the-art results across all three metrics. Notably, Xwin-LM-70B-V0.1 surpasses GPT-4 for the first time, achieving a win-rate of 95.57% against Text-Davinci-003 and 60.61% against GPT-4.

| Model | AlpacaEval win-rate (%) vs. Text-Davinci-003 | AlpacaEval win-rate (%) vs. ChatGPT | AlpacaEval win-rate (%) vs. GPT-4 |
|---|---|---|---|
| Xwin-LM-70B-V0.1 | 95.57 | 87.50 | 60.61 |
| GPT-4 | 95.28 | 84.66 | 50.00 |
| WizardLM-70B-V1.0 | 92.91 | 80.19 | 46.70 |
| Llama-2-70B-Chat | 92.66 | 81.97 | 51.19 |
| Xwin-LM-13B-V0.1 | 91.76 | 81.79 | 55.30 |
| ChatGPT | 89.37 | 50.00 | 16.60 |
| WizardLM-13B-V1.2 | 89.17 | 75.75 | 41.91 |
| Xwin-LM-7B-V0.1 | 87.35 | 76.40 | 47.57 |
| Llama-2-13B-Chat | 81.09 | 64.22 | 30.92 |
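Each win-rate above is an average over per-question pairwise judgments against a reference model. The sketch below shows that arithmetic under an assumption of ours (not taken from the AlpacaEval code): each of the 805 comparisons is labeled win, tie, or loss and scored 1, 0.5, or 0. The `SCORES` mapping and the `win_rate` helper are hypothetical names used only for illustration.

```python
from typing import Sequence

# Hypothetical scoring scheme (assumption): a win counts 1, a tie 0.5, a loss 0.
SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(outcomes: Sequence[str]) -> float:
    """Return the win-rate in percent over a sequence of pairwise outcomes."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    total = sum(SCORES[outcome] for outcome in outcomes)
    return 100.0 * total / len(outcomes)


# Under this scheme, a model judged against itself ties on every question,
# which is consistent with the 50.00 entries for GPT-4 vs. GPT-4 and
# ChatGPT vs. ChatGPT in the table.
self_comparison = ["tie"] * 805
print(f"{win_rate(self_comparison):.2f}")  # 50.00

# Illustrative (made-up) outcomes: 769 wins and 36 losses over 805 questions.
example = ["win"] * 769 + ["loss"] * 36
print(f"{win_rate(example):.2f}")
```

Counting a tie as half a win keeps the metric symmetric: the win-rates of two models measured against each other sum to 100%.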