UPDATE: We have since released the full Shisa V2 family of models. See our announcement at https://shisa.ai/posts/shisa-v2/
Per the Llama 3.1 Community License Agreement, the official name of this model is "Llama 3.1 shisa-v2-llama3.1-8b-preview"
shisa-v2-llama3.1-8b-preview
This is a preview release of the Shisa V2 bilingual Japanese and English (JA/EN) model.
It is a fine-tune of meta-llama/Llama-3.1-8B-Instruct and inherits its tokenizer (and JA token efficiency) and 128K context length.
While we're still working hard on this model (including integrating additional datasets and applying several more post-training stages), it already shows a significant performance leap over our prior published models, beating our Llama 3 70B tune from last year almost across the board on our evals. We're releasing this WIP preview to celebrate, since we also just noticed that shisa-gamma-7b-v1 hit 1 million downloads! 🥳 (OK, it's not 1 billion, but it's still nothing to sneeze at!)
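For convenience, here is a minimal inference sketch using Hugging Face transformers (version 4.49.0 per the framework versions below). The prompts and generation settings are illustrative, and we assume the chat template inherited from Llama 3.1 Instruct:

```python
# Minimal inference sketch; prompts and sampling settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v2-llama3.1-8b-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "あなたは役に立つアシスタントです。"},  # "You are a helpful assistant."
    {"role": "user", "content": "日本の首都はどこですか?"},               # "What is the capital of Japan?"
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```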
Evals
Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | 62.46 | 70.35 | 54.58 | 7.18 | 7.55 | 8.57 | 9.03 | 7.33 | 0.19 | 0.56 | 4.67 | 0.46 | 32.00 | 0.79 | 0.62 |
2024-05 | shisa-ai/shisa-v1-llama3-70b | 60.70 | 68.45 | 52.95 | 6.82 | 7.33 | 8.06 | 8.88 | 6.65 | 0.24 | 0.56 | 4.51 | 0.56 | 27.40 | 0.65 | 0.63 |
2024-05 | shisa-ai/shisa-v1-llama3-8b | 50.09 | 57.37 | 42.80 | 6.30 | 6.46 | 7.54 | 8.03 | 6.41 | 0.09 | 0.23 | 4.26 | 0.36 | 20.20 | 0.63 | 0.52 |
2023-12 | augmxnt/shisa-gamma-7b-v1 | 37.37 | 53.94 | 20.80 | 5.49 | 5.74 | 5.93 | 7.28 | 5.87 | 0.13 | 0.52 | 3.22 | 0.26 | 2.20 | 0.37 | 0.18 |
2023-12 | augmxnt/shisa-7b-v1 | 29.54 | 36.80 | 22.28 | 3.51 | 4.03 | 3.66 | 3.25 | 5.11 | 0.15 | 0.46 | 1.82 | 0.20 | 16.50 | 0.27 | 0.26 |
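As an aside, the Overall Avg column works out to be the simple mean of the JA Avg and EN Avg columns; a quick check against the rows above (values copied from the table):

```python
# Sanity check: Overall Avg == mean(JA Avg, EN Avg), values copied from the table above.
rows = {
    "shisa-v2-llama3.1-8b-preview": (62.46, 70.35, 54.58),
    "shisa-v1-llama3-70b":          (60.70, 68.45, 52.95),
    "shisa-v1-llama3-8b":           (50.09, 57.37, 42.80),
    "shisa-gamma-7b-v1":            (37.37, 53.94, 20.80),
    "shisa-7b-v1":                  (29.54, 36.80, 22.28),
}
for name, (overall, ja, en) in rows.items():
    assert abs(overall - (ja + en) / 2) < 0.01, name  # allow for rounding to 2 decimals
print("Overall Avg matches mean(JA Avg, EN Avg) for all rows")
```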
Not only is this our best model yet, but from our testing it is also currently SOTA in overall JA/EN performance among all 7-8B class models:
Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | 62.46 | 70.35 | 54.58 | 7.18 | 7.55 | 8.57 | 9.03 | 7.33 | 0.19 | 0.56 | 4.67 | 0.46 | 32.00 | 0.79 | 0.62 |
2024-07 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 60.46 | 66.97 | 53.95 | 6.98 | 7.57 | 8.26 | 8.61 | 7.28 | 0.22 | 0.39 | 4.53 | 0.46 | 30.40 | 0.77 | 0.62 |
2025-03 | allenai/Llama-3.1-Tulu-3.1-8B | 59.81 | 65.41 | 54.20 | 6.56 | 6.84 | 7.69 | 8.61 | 6.52 | 0.22 | 0.52 | 4.39 | 0.40 | 31.30 | 0.82 | 0.63 |
2024-07 | meta-llama/Llama-3.1-8B-Instruct | 56.78 | 59.66 | 53.90 | 6.48 | 6.95 | 7.67 | 8.36 | 6.40 | 0.16 | 0.25 | 4.15 | 0.44 | 27.70 | 0.80 | 0.63 |
2024-12 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 56.69 | 71.18 | 42.20 | 7.18 | 7.90 | 8.25 | 9.16 | 7.36 | 0.25 | 0.56 | 4.52 | 0.30 | 26.40 | 0.64 | 0.48 |
2025-02 | sbintuitions/sarashina2.2-3b-instruct-v0.1 | 53.90 | 68.70 | 39.10 | 7.28 | 7.91 | 8.46 | 9.28 | 7.43 | 0.22 | 0.42 | 4.30 | 0.28 | 21.20 | 0.50 | 0.57 |
2024-06 | elyza/Llama-3-ELYZA-JP-8B | 53.32 | 67.55 | 39.10 | 7.01 | 7.80 | 8.05 | 9.00 | 7.11 | 0.24 | 0.41 | 4.39 | 0.34 | 17.50 | 0.62 | 0.43 |
2024-08 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 42.14 | 57.26 | 27.03 | 6.30 | 7.00 | 7.10 | 8.11 | 6.51 | 0.17 | 0.24 | 3.67 | 0.24 | 14.40 | 0.38 | 0.32 |
2025-02 | llm-jp/llm-jp-3-7.2b-instruct3 | 42.69 | 61.93 | 23.45 | 6.79 | 6.99 | 7.70 | 9.16 | 6.79 | 0.20 | 0.47 | 3.03 | 0.22 | 5.20 | 0.49 | 0.18 |
Testing notes:
- JA functional tests are done with the shisa-ai/shaberi fork using a PoLL (panel of LLM judges) of Tulu 3 405B FP8, Llama 3.3 70B, and Athene-V2 that was tested to be roughly comparable to gpt-4-1106-preview for scoring
- Gemma 2 9B models aren't tested at the moment, since their lack of system prompt support breaks multiple evals
- Dynamic RoPE extension is used when necessary for testing models with a 4K context window (a rough sketch of one way to do this follows these notes)
- Sarashina2.2-Instruct 3B is included since it claims to achieve 7-8B class performance and we were curious whether that panned out (it seems so; it kicks butt on the Shaberi functional tests)
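For the dynamic RoPE note above, this is a hedged sketch of one way to extend a 4K-context, Llama-style model at load time with transformers; the model id and scaling factor are hypothetical placeholders, and the actual eval harness configuration may differ:

```python
# Illustrative sketch only: enable dynamic (NTK-aware) RoPE scaling when loading a model
# whose native context window is 4K, so longer eval prompts still fit.
# "example-org/4k-context-model" is a hypothetical placeholder, not a real checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "example-org/4k-context-model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # For Llama-style configs: scale RoPE dynamically; factor=2.0 targets ~8K effective context.
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
)
```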
Data
Our final release will have full details, but the current model is largely based on work done for Shisa V1, refined, filtered, regenerated, annotated, rated, and re-selected. This is augmented by additional datasets focused on translation, multi-turn chat, role-play, and other real-world tasks. All synthetic data was regenerated from open-weight models. This model currently has only a single DPO stage, using a placeholder (but surprisingly good!) EN preference set and a custom RP DPO mix.
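For readers unfamiliar with what a DPO stage looks like, here is a minimal sketch in the spirit of TRL's DPOTrainer (version 0.15.1 per the framework versions below). The actual run was done through Axolotl, and the dataset name and hyperparameters here are placeholders, not our recipe:

```python
# Hedged sketch of a single DPO stage; dataset id and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # in practice the SFT checkpoint would go here
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("example-org/preference-mix", split="train")

training_args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                        # strength of the preference penalty
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```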
Credits
Trained by Shisa.AI: Leonard Lin and Adam Lensenmayer
Compute sponsored by Ubitus K.K. and METI GENIAC.
Trained with Axolotl.
Framework versions
- TRL: 0.15.1
- Transformers: 4.49.0
- Pytorch: 2.6.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0