
UPDATE: We have since released the full Shisa V2 family of models. See our announcement at https://shisa.ai/posts/shisa-v2/


Per the Llama 3.1 Community License Agreement, the official name of this model is "Llama 3.1 shisa-v2-llama3.1-8b-preview"

shisa-v2-llama3.1-8b-preview

This is a preview release of the Shisa V2 bilingual Japanese and English (JA/EN) model.

It is a fine tune of meta-llama/Llama-3.1-8B-Instruct and inherits the tokenizer (JA efficiency) and context length (128K).
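Because the tokenizer is inherited from Llama 3.1, the model also uses the standard Llama 3.1 chat format. As a rough sketch of that layout (in practice, let the bundled tokenizer's `apply_chat_template()` render it for you; the helper below is illustrative, not part of the model's API):

```python
# Illustrative sketch of the Llama 3.1 chat prompt layout this model inherits.
# Real usage: tokenizer.apply_chat_template(messages, add_generation_prompt=True)

def build_llama31_prompt(messages):
    """Render a list of {"role", "content"} dicts as a Llama 3.1 prompt string."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    # Leave the assistant header open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama31_prompt([
    {"role": "system", "content": "あなたは役立つアシスタントです。"},
    {"role": "user", "content": "日本の首都はどこですか？"},
])
```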

While we're still working hard on this model (including integrating additional datasets and applying several more post-training stages), it already shows a significant performance leap over our prior published models, beating our Llama 3 70B tune from last year almost across the board on our evals. We're releasing this WIP preview to celebrate, since we also just noticed that shisa-gamma-7b-v1 hit 1 million downloads! 🥳 (OK, it's not 1 billion, but it's still nothing to sneeze at!)

Evals

| Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | 62.46 | 70.35 | 54.58 | 7.18 | 7.55 | 8.57 | 9.03 | 7.33 | 0.19 | 0.56 | 4.67 | 0.46 | 32.00 | 0.79 | 0.62 |
| 2024-05 | shisa-ai/shisa-v1-llama3-70b | 60.70 | 68.45 | 52.95 | 6.82 | 7.33 | 8.06 | 8.88 | 6.65 | 0.24 | 0.56 | 4.51 | 0.56 | 27.40 | 0.65 | 0.63 |
| 2024-05 | shisa-ai/shisa-v1-llama3-8b | 50.09 | 57.37 | 42.80 | 6.30 | 6.46 | 7.54 | 8.03 | 6.41 | 0.09 | 0.23 | 4.26 | 0.36 | 20.20 | 0.63 | 0.52 |
| 2023-12 | augmxnt/shisa-gamma-7b-v1 | 37.37 | 53.94 | 20.80 | 5.49 | 5.74 | 5.93 | 7.28 | 5.87 | 0.13 | 0.52 | 3.22 | 0.26 | 2.20 | 0.37 | 0.18 |
| 2023-12 | augmxnt/shisa-7b-v1 | 29.54 | 36.80 | 22.28 | 3.51 | 4.03 | 3.66 | 3.25 | 5.11 | 0.15 | 0.46 | 1.82 | 0.20 | 16.50 | 0.27 | 0.26 |

Not only is this our best model yet; based on our testing, it is also currently SOTA in overall JA/EN performance among all 7-8B class models:

| Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | 62.46 | 70.35 | 54.58 | 7.18 | 7.55 | 8.57 | 9.03 | 7.33 | 0.19 | 0.56 | 4.67 | 0.46 | 32.00 | 0.79 | 0.62 |
| 2024-07 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 60.46 | 66.97 | 53.95 | 6.98 | 7.57 | 8.26 | 8.61 | 7.28 | 0.22 | 0.39 | 4.53 | 0.46 | 30.40 | 0.77 | 0.62 |
| 2025-03 | allenai/Llama-3.1-Tulu-3.1-8B | 59.81 | 65.41 | 54.20 | 6.56 | 6.84 | 7.69 | 8.61 | 6.52 | 0.22 | 0.52 | 4.39 | 0.40 | 31.30 | 0.82 | 0.63 |
| 2024-07 | meta-llama/Llama-3.1-8B-Instruct | 56.78 | 59.66 | 53.90 | 6.48 | 6.95 | 7.67 | 8.36 | 6.40 | 0.16 | 0.25 | 4.15 | 0.44 | 27.70 | 0.80 | 0.63 |
| 2024-12 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 56.69 | 71.18 | 42.20 | 7.18 | 7.90 | 8.25 | 9.16 | 7.36 | 0.25 | 0.56 | 4.52 | 0.30 | 26.40 | 0.64 | 0.48 |
| 2025-02 | sbintuitions/sarashina2.2-3b-instruct-v0.1 | 53.90 | 68.70 | 39.10 | 7.28 | 7.91 | 8.46 | 9.28 | 7.43 | 0.22 | 0.42 | 4.30 | 0.28 | 21.20 | 0.50 | 0.57 |
| 2024-06 | elyza/Llama-3-ELYZA-JP-8B | 53.32 | 67.55 | 39.10 | 7.01 | 7.80 | 8.05 | 9.00 | 7.11 | 0.24 | 0.41 | 4.39 | 0.34 | 17.50 | 0.62 | 0.43 |
| 2025-02 | llm-jp/llm-jp-3-7.2b-instruct3 | 42.69 | 61.93 | 23.45 | 6.79 | 6.99 | 7.70 | 9.16 | 6.79 | 0.20 | 0.47 | 3.03 | 0.22 | 5.20 | 0.49 | 0.18 |
| 2024-08 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 42.14 | 57.26 | 27.03 | 6.30 | 7.00 | 7.10 | 8.11 | 6.51 | 0.17 | 0.24 | 3.67 | 0.24 | 14.40 | 0.38 | 0.32 |

Testing notes:

  • JA functional tests are run with the shisa-ai/shaberi fork using a PoLL (panel of LLM judges) of Tulu 3 405B FP8, Llama 3.3 70B, and Athene-V2, which was validated to score roughly comparably to gpt-4-1106-preview
  • Gemma 2 9B models aren't tested at the moment, since their lack of system prompt support breaks multiple evals
  • Dynamic RoPE extension is used when necessary for testing models with a 4K context window
  • Sarashina2.2-Instruct 3B is included since its authors claim 7-8B class performance and we were curious whether that panned out (it seems so; it kicks butt on the Shaberi functional tests)
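For reference, "dynamic RoPE extension" here presumably refers to dynamic NTK-style scaling (the "dynamic" `rope_scaling` scheme in transformers), which enlarges the rotary base once a sequence exceeds the model's native window. A minimal sketch of that rescaling rule, with an illustrative function name and example numbers:

```python
# Sketch of dynamic NTK RoPE scaling: once the running sequence length exceeds
# the model's native window, the rotary base is rescaled so that positional
# frequencies stretch to cover the longer context. Illustrative only.

def dynamic_ntk_base(base, dim, seq_len, orig_max_pos, factor):
    """Return the adjusted RoPE base for sequences longer than orig_max_pos."""
    if seq_len <= orig_max_pos:
        return base  # within the native window: no change
    scale = (factor * seq_len / orig_max_pos) - (factor - 1)
    return base * scale ** (dim / (dim - 2))

# A 4K-context model evaluated at 8K tokens gets a larger rotary base:
adjusted = dynamic_ntk_base(base=10000.0, dim=128, seq_len=8192,
                            orig_max_pos=4096, factor=2.0)
```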

Data

Our final release will have full details, but this current model is largely based on work done for Shisa V1, refined, filtered, regenerated, annotated, rated, and re-selected. This is augmented by additional datasets focused on translation, multi-turn chat, role-play, and other real-world tasks. All synthetic data was regenerated using open-weight models. This model currently has only a single DPO stage, using a placeholder (but surprisingly good!) EN preference set and a custom RP DPO mix.

Credits

Trained by Shisa.AI: Leonard Lin and Adam Lensenmayer

Compute sponsored by Ubitus K.K. and METI GENIAC.

Trained with Axolotl.


Framework versions

  • TRL: 0.15.1
  • Transformers: 4.49.0
  • PyTorch: 2.6.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0