UPDATE: We have since released the full Shisa V2 family of models. See our announcement at https://shisa.ai/posts/shisa-v2/
Per the Llama 3.1 Community License Agreement, the official name of this model is "Llama 3.1 shisa-v2-llama3.1-8b-preview"
shisa-v2-llama3.1-8b-preview
This is a preview release of the Shisa V2 bilingual Japanese and English (JA/EN) model.
It is a fine-tune of meta-llama/Llama-3.1-8B-Instruct and inherits its tokenizer (and JA token efficiency) and 128K context length.
While we're still working hard on this model (including integrating additional datasets and applying several more post-training stages), it already shows a significant performance leap over our prior published models, beating our Llama 3 70B tune from last year almost across the board on our evals. We're releasing this WIP preview to celebrate, since we also just noticed that shisa-gamma-7b-v1 hit 1 million downloads! 🥳 (OK, it's not 1 billion, but it's still nothing to sneeze at!)
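For convenience, here is a minimal inference sketch using Hugging Face transformers (version 4.49.0 per the framework versions below). The prompts and generation settings are illustrative, and we assume the chat template inherited from Llama 3.1 Instruct:

```python
# Minimal inference sketch; prompts and sampling settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-v2-llama3.1-8b-preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "あなたは役に立つアシスタントです。"},  # "You are a helpful assistant."
    {"role": "user", "content": "日本の首都はどこですか?"},               # "What is the capital of Japan?"
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```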
Evals
Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | 62.46 | 70.35 | 54.58 | 7.18 | 7.55 | 8.57 | 9.03 | 7.33 | 0.19 | 0.56 | 4.67 | 0.46 | 32.00 | 0.79 | 0.62 |
2024-05 | shisa-ai/shisa-v1-llama3-70b | 60.70 | 68.45 | 52.95 | 6.82 | 7.33 | 8.06 | 8.88 | 6.65 | 0.24 | 0.56 | 4.51 | 0.56 | 27.40 | 0.65 | 0.63 |
2024-05 | shisa-ai/shisa-v1-llama3-8b | 50.09 | 57.37 | 42.80 | 6.30 | 6.46 | 7.54 | 8.03 | 6.41 | 0.09 | 0.23 | 4.26 | 0.36 | 20.20 | 0.63 | 0.52 |
2023-12 | augmxnt/shisa-gamma-7b-v1 | 37.37 | 53.94 | 20.80 | 5.49 | 5.74 | 5.93 | 7.28 | 5.87 | 0.13 | 0.52 | 3.22 | 0.26 | 2.20 | 0.37 | 0.18 |
2023-12 | augmxnt/shisa-7b-v1 | 29.54 | 36.80 | 22.28 | 3.51 | 4.03 | 3.66 | 3.25 | 5.11 | 0.15 | 0.46 | 1.82 | 0.20 | 16.50 | 0.27 | 0.26 |
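As an aside, the Overall Avg column works out to be the simple mean of the JA Avg and EN Avg columns; a quick check against the rows above (values copied from the table):

```python
# Sanity check: Overall Avg == mean(JA Avg, EN Avg), values copied from the table above.
rows = {
    "shisa-v2-llama3.1-8b-preview": (62.46, 70.35, 54.58),
    "shisa-v1-llama3-70b":          (60.70, 68.45, 52.95),
    "shisa-v1-llama3-8b":           (50.09, 57.37, 42.80),
    "shisa-gamma-7b-v1":            (37.37, 53.94, 20.80),
    "shisa-7b-v1":                  (29.54, 36.80, 22.28),
}
for name, (overall, ja, en) in rows.items():
    assert abs(overall - (ja + en) / 2) < 0.01, name  # allow for rounding to 2 decimals
print("Overall Avg matches mean(JA Avg, EN Avg) for all rows")
```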
Not only is this our best model yet, but from our testing it is also currently SOTA in overall JA/EN performance among all 7-8B class models:
Release Date | Model Name | Overall Avg | JA Avg | EN Avg | Shaberi Avg | ELYZA 100 | JA MT Bench | Rakuda | Tengu | shisa-jp-ifeval | llm-jp-eval | shisa-jp-rp-bench | MixEval | LiveBench | IFEval | EvalPlus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2025-03 | shisa-ai/shisa-v2-llama3.1-8b-preview | 62.46 | 70.35 | 54.58 | 7.18 | 7.55 | 8.57 | 9.03 | 7.33 | 0.19 | 0.56 | 4.67 | 0.46 | 32.00 | 0.79 | 0.62 |
2024-07 | AXCXEPT/Llama-3.1-8B-EZO-1.1-it | 60.46 | 66.97 | 53.95 | 6.98 | 7.57 | 8.26 | 8.61 | 7.28 | 0.22 | 0.39 | 4.53 | 0.46 | 30.40 | 0.77 | 0.62 |
2025-03 | allenai/Llama-3.1-Tulu-3.1-8B | 59.81 | 65.41 | 54.20 | 6.56 | 6.84 | 7.69 | 8.61 | 6.52 | 0.22 | 0.52 | 4.39 | 0.40 | 31.30 | 0.82 | 0.63 |
2024-07 | meta-llama/Llama-3.1-8B-Instruct | 56.78 | 59.66 | 53.90 | 6.48 | 6.95 | 7.67 | 8.36 | 6.40 | 0.16 | 0.25 | 4.15 | 0.44 | 27.70 | 0.80 | 0.63 |
2024-12 | tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.3 | 56.69 | 71.18 | 42.20 | 7.18 | 7.90 | 8.25 | 9.16 | 7.36 | 0.25 | 0.56 | 4.52 | 0.30 | 26.40 | 0.64 | 0.48 |
2025-02 | sbintuitions/sarashina2.2-3b-instruct-v0.1 | 53.90 | 68.70 | 39.10 | 7.28 | 7.91 | 8.46 | 9.28 | 7.43 | 0.22 | 0.42 | 4.30 | 0.28 | 21.20 | 0.50 | 0.57 |
2024-06 | elyza/Llama-3-ELYZA-JP-8B | 53.32 | 67.55 | 39.10 | 7.01 | 7.80 | 8.05 | 9.00 | 7.11 | 0.24 | 0.41 | 4.39 | 0.34 | 17.50 | 0.62 | 0.43 |
2024-08 | weblab-GENIAC/Tanuki-8B-dpo-v1.0 | 42.14 | 57.26 | 27.03 | 6.30 | 7.00 | 7.10 | 8.11 | 6.51 | 0.17 | 0.24 | 3.67 | 0.24 | 14.40 | 0.38 | 0.32 |
2025-02 | llm-jp/llm-jp-3-7.2b-instruct3 | 42.69 | 61.93 | 23.45 | 6.79 | 6.99 | 7.70 | 9.16 | 6.79 | 0.20 | 0.47 | 3.03 | 0.22 | 5.20 | 0.49 | 0.18 |
Testing notes:
- JA functional tests are done with the shisa-ai/shaberi fork using a PoLL (panel of LLM judges) of Tulu 3 405B FP8, Llama 3.3 70B, and Athene-V2 that was tested to be roughly comparable to gpt-4-1106-preview for scoring
- Gemma 2 9B models aren't tested at the moment, since their lack of system prompt support breaks multiple evals
- Dynamic RoPE extension is used when necessary for testing models with a 4K context window (a rough sketch of one way to do this follows these notes)
- Sarashina2.2-Instruct 3B is included since it claims to achieve 7-8B class performance and we were curious whether that panned out (it seems so; it kicks butt on the Shaberi functional tests)
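For the dynamic RoPE note above, this is a hedged sketch of one way to extend a 4K-context, Llama-style model at load time with transformers; the model id and scaling factor are hypothetical placeholders, and the actual eval harness configuration may differ:

```python
# Illustrative sketch only: enable dynamic (NTK-aware) RoPE scaling when loading a model
# whose native context window is 4K, so longer eval prompts still fit.
# "example-org/4k-context-model" is a hypothetical placeholder, not a real checkpoint.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "example-org/4k-context-model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    # For Llama-style configs: scale RoPE dynamically; factor=2.0 targets ~8K effective context.
    rope_scaling={"rope_type": "dynamic", "factor": 2.0},
)
```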
Data
Our final release will have full details, but the current model is largely based on work done for Shisa V1, refined, filtered, regenerated, annotated, rated, and re-selected. This is augmented by additional datasets focused on translation, multi-turn chat, role-play, and other real-world tasks. All synthetic data was regenerated from open-weight models. This model currently has only a single DPO stage, using a placeholder (but surprisingly good!) EN preference set and a custom RP DPO mix.
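For readers unfamiliar with what a DPO stage looks like, here is a minimal sketch in the spirit of TRL's DPOTrainer (version 0.15.1 per the framework versions below). The actual run was done through Axolotl, and the dataset name and hyperparameters here are placeholders, not our recipe:

```python
# Hedged sketch of a single DPO stage; dataset id and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # in practice the SFT checkpoint would go here
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference dataset with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("example-org/preference-mix", split="train")

training_args = DPOConfig(
    output_dir="dpo-output",
    beta=0.1,                        # strength of the preference penalty
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    num_train_epochs=1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```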
Credits
Trained by Shisa.AI: Leonard Lin and Adam Lensenmayer
Compute sponsored by Ubitus K.K. and METI GENIAC.
Trained with Axolotl.
Framework versions
- TRL: 0.15.1
- Transformers: 4.49.0
- Pytorch: 2.6.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0