Shisa Gamma 7B V2new
The original augmxnt/shisa-gamma-7b-v1 model actually still gets far more downloads per month (30K as of this writing) than all of our Shisa V2 models combined!
As part of our Shisa V2 follow-up work, I was curious how our old Shisa V1 models would perform if put through our latest Shisa V2 recipe. Can they be competitive? How much uplift do they get versus models built on newer, better-trained base models?
First, we can indeed greatly improve performance in both absolute and relative terms:
(Shaberi results as judged by GPT-4.1)
| Model | Average | ELYZA 100 | JA-MT | Rakuda | Tengu |
|---|---|---|---|---|---|
| 029-shisa-gamma-7b-v1-v2new-dpo405b | 5.64 | 6.42 | 5.70 | 4.48 | 5.98 |
| 027-shisa-7b-v1-v2new-dpo405b | 5.48 | 5.34 | 5.07 | 6.43 | 5.09 |
| 026-shisa-7b-v1-v2new-sft | 5.27 | 4.92 | 5.28 | 5.93 | 4.95 |
| 028-shisa-gamma-7b-v1-v2new-sft | 5.19 | 6.24 | 4.58 | 4.38 | 5.58 |
| augmxnt/shisa-gamma-7b-v1 | 4.80 | 5.86 | 4.07 | 4.55 | 4.72 |
| augmxnt/shisa-7b-v1 | 3.95 | 4.36 | 3.75 | 3.88 | 3.83 |
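For reference, the Average column appears to be the unweighted mean of the four Shaberi benchmark scores; here is a quick sanity-check sketch (model names and numbers copied from the table above, the averaging formula is our assumption):

```python
# Sanity check: Average looks like the plain mean of (ELYZA 100, JA-MT, Rakuda, Tengu).
scores = {
    "029-shisa-gamma-7b-v1-v2new-dpo405b": (6.42, 5.70, 4.48, 5.98),
    "augmxnt/shisa-gamma-7b-v1": (5.86, 4.07, 4.55, 4.72),
}

for model, subscores in scores.items():
    avg = sum(subscores) / len(subscores)
    print(f"{model}: {avg:.2f}")  # prints 5.64 and 4.80, matching the table
```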
However, these improvements still don't come close to matching the performance of running the exact same training recipe on stronger base models:
| Model | Average | ELYZA 100 | JA-MT | Rakuda | Tengu |
|---|---|---|---|---|---|
| 025-qwen3-8b-v2new-dpo405b | 7.87 | 8.22 | 8.35 | 8.05 | 6.87 |
| 024-llama3.1-8b-v2new-dpo405b | 7.60 | 7.58 | 7.57 | 8.25 | 7.01 |
| Qwen/Qwen3-8B | 7.47 | 7.58 | 8.05 | 7.65 | 6.60 |
| shisa-ai/shisa-v2-llama3.1-8b | 7.14 | 7.54 | 6.83 | 7.85 | 6.34 |
| meta-llama/Llama-3.1-8B-Instruct | 5.89 | 6.96 | 5.68 | 5.20 | 5.73 |
These V2new models are suitable as "drop-in" replacements for the V1 models and are likewise released under the Apache 2.0 license, but our released Shisa V2 models are stronger across the board.
This model was trained with OpenRLHF and took 66 hours for SFT and 7 hours for DPO on a single MI300X.
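If you want to try the model as a drop-in replacement, here is a minimal inference sketch using Hugging Face transformers. The chat-template call assumes the tokenizer ships a chat template, and the sampling settings are illustrative defaults rather than recommended values:

```python
# Minimal inference sketch (assumes a recent transformers release and a bf16-capable GPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/029-shisa-gamma-7b-v1-v2new-dpo405b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "What is the capital of Japan?"
messages = [{"role": "user", "content": "日本の首都はどこですか？"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```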
Base model: stabilityai/japanese-stablelm-base-gamma-7b