This is just one of the additional fun experiments we're doing on the path to Shisa V2.1: an SFT of the tokyotech-llm/Llama-3.1-Swallow-8B-v0.5 base model.
This came out of curiosity to see if a well-curated CPT base can beat simply using strong Instruct models as base models in Japanese (tentatively yes, though we haven't run our full suite or analyzed English capabilities), and if our Shisa V2 datasets are better than the Swallow v0.5 Instruct approach (a knowledge distillation of Gemma 3 27B) - also, tentatively yes.
Anyway, we're throwing this model out there so we can cheekily reclaim SOTA: running our latest Shisa V2 SFT dataset (which we'll publish with the official V2.1 releases) manages to edge out the (very strong, congrats!) tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5 on our GPT-4.1-judged Shaberi results:
| Model | Average | ELYZA 100 | JA-MT | Rakuda | Tengu |
|---|---|---|---|---|---|
| 030-swallow-8b-0.5-base-v2new-sft | 8.00 | 8.00 | 7.82 | 8.98 | 7.19 |
| tokyotech-llm/Llama-3.1-Swallow-8B-Instruct-v0.5 | 7.96 | 7.78 | 8.03 | 8.95 | 7.06 |
| 025-qwen3-8b-v2new-dpo405b | 7.87 | 8.22 | 8.35 | 8.05 | 6.87 |
| 021-qwen3-8b-v2new-sft | 7.79 | 7.98 | 8.10 | 8.05 | 7.04 |
| 024-llama3.1-8b-v2new-dpo405b | 7.60 | 7.58 | 7.57 | 8.25 | 7.01 |
| 022-llama3.1-8b-v2new-sft | 7.48 | 7.62 | 7.53 | 7.90 | 6.88 |
| Qwen/Qwen3-8B | 7.47 | 7.58 | 8.05 | 7.65 | 6.60 |
| meta-llama/Llama-3.1-8B-Instruct | 5.89 | 6.96 | 5.68 | 5.20 | 5.73 |
- 030 is a Shisa V2 SFT of Swallow 8B v0.5 (base) and is SOTA... maybe only until tomorrow, when our 031 DPO finishes
- 030 can be compared to 021, the SFT on Qwen 3 8B (Instruct), and 022, the SFT on Llama 3.1 8B (Instruct)
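If you want to try it yourself, here is a minimal inference sketch. It assumes the standard transformers chat-template workflow; the dtype, device mapping, and sampling settings are reasonable defaults for illustration, not the exact configuration we used.

```python
# Minimal inference sketch (assumes the tokenizer ships a chat template;
# adjust dtype, device_map, and sampling parameters to taste).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/030-swallow-8b-0.5-base-v2new-sft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# "Please briefly explain Japan's four seasons."
messages = [
    {"role": "user", "content": "日本の四季について簡単に説明してください。"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```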
Note: the Swallow team released the upstream model under the Llama 3.3 Community License and the Gemma Terms of Use, and as a derived work this model inherits those licenses. Please read the licenses carefully and respect them if they apply to you.