|
--- |
|
base_model: shisa-ai/shisa-v2-llama3.1-405b |
|
datasets: |
|
- shisa-ai/shisa-v2-sharegpt |
|
- shisa-ai/deepseekv3-ultrafeedback-armorm-dpo |
|
language: |
|
- ja |
|
- en |
|
- ko |
|
- zh |
|
library_name: transformers |
|
license: llama3.1 |
|
model_name: shisa-v2-llama3.1-405b |
|
quantized_by: leonardlin |
|
--- |
|
|
|
## About |
|
This repo contains select GGUF quants of [shisa-ai/shisa-v2-llama3.1-405b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b).
|
- All quants were created with build `b5503` of upstream [llama.cpp](https://github.com/ggerganov/llama.cpp)
|
- All quants are weighted/imatrix quants, with the imatrix computed from our [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt) bilingual dataset against the fp16 model; the Q8_0 is the exception and was made without an imatrix
|
- Files are pre-split at 45GB (below HF's 50GB upload limit). Modern llama.cpp builds load the sequential shards automatically, but you can use `llama-gguf-split --merge` if you want to merge them back into a single file (see the example below)
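
For example, to reassemble a split quant into a single file (shard names and count here are illustrative; point `--merge` at the first shard of whichever quant you downloaded):

```
# Merge split GGUF shards back into one file (illustrative shard names)
llama-gguf-split --merge shisa-v2-llama3.1-405b-IQ3_M-00001-of-00004.gguf shisa-v2-llama3.1-405b-IQ3_M.gguf
```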
|
|
|
## Provided Quants |
|
|
|
| Type    | Size (GiB) |
|:--------|-----------:|
| IQ2_XXS |        100 |
| IQ3_XS  |        155 |
| IQ3_M   |        170 |
| IQ4_XS  |        202 |
| Q4_K_M  |        227 |
| Q8_0    |        402 |
|
|
|
|
|
## Quant Quality |
|
All quants have been tested with JA MT-Bench (judged by GPT-4.1) as a rough guide for quality: |
|
|
|
| Quant        | Size (GiB) | % Diff | Overall  | Writing  | Roleplay | Reasoning | Math     | Coding   | Extraction | STEM     | Humanities |
|--------------|-----------:|-------:|---------:|---------:|---------:|----------:|---------:|---------:|-----------:|---------:|-----------:|
| Full FP16    |        810 |        | **9.13** | 9.25     | **9.55** | 8.15      | 8.90     | 9.10     | 9.65       | 9.10     | 9.35       |
| IQ3_M        |        170 |  -0.99 | 9.04     | 8.90     | 9.45     | 7.75      | 8.95     | 8.95     | 9.70       | **9.15** | 9.50       |
| Q4_K_M       |        227 |  -1.10 | 9.03     | **9.40** | 9.00     | 8.25      | 8.85     | **9.10** | 9.50       | 8.90     | 9.25       |
| Q8_0         |        405 |  -1.20 | 9.02     | **9.40** | 9.05     | **8.30**  | **9.20** | 8.70     | 9.50       | 8.45     | 9.55       |
| W8A8-INT8    |        405 |  -1.42 | 9.00     | 9.20     | 9.35     | 7.80      | 8.75     | 9.00     | 9.80       | 8.65     | 9.45       |
| FP8-Dynamic  |        405 |  -3.29 | 8.83     | 8.70     | 9.20     | 7.85      | 8.80     | 8.65     | 9.30       | 8.80     | 9.35       |
| IQ3_XS       |        155 |  -3.50 | 8.81     | 8.70     | 9.05     | 7.70      | 8.60     | 8.95     | 9.35       | 8.70     | 9.45       |
| IQ4_XS       |        202 |  -3.61 | 8.80     | 8.85     | **9.55** | 6.90      | 8.35     | 8.60     | **9.90**   | 8.65     | **9.60**   |
| *70B FP16*   |        140 |  -7.89 | 8.41     | 7.95     | 9.05     | 6.25      | 8.30     | 8.25     | 9.70       | 8.70     | 9.05       |
| IQ2_XXS      |        100 | -18.18 | 7.47     | 7.50     | 6.80     | 5.15      | 7.55     | 7.30     | 9.05       | 7.65     | 8.80       |
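
The % Diff column is just the relative change in each quant's Overall score versus the full FP16 Overall (9.13); a quick sanity check of that arithmetic:

```
# % Diff = (quant Overall - FP16 Overall) / FP16 Overall * 100
awk 'BEGIN { printf "%.2f\n", (9.04 - 9.13) / 9.13 * 100 }'   # IQ3_M   -> -0.99
awk 'BEGIN { printf "%.2f\n", (7.47 - 9.13) / 9.13 * 100 }'   # IQ2_XXS -> -18.18
```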
|
|
|
Given the margin of error, you could fairly say that the IQ3_M, Q4_K_M, and Q8_0 GGUFs have almost no functional loss versus the FP16.
|
|
|
Interestingly, while roleplay takes one of the biggest hits, writing actually seems to improve on the Q4 and Q8. You'd really need to test more (more samples, more runs, more evals) to see what's actually going on. Also of note: the XS quants track each other pretty consistently, with the IQ4_XS doing worse than the IQ3_M.
|
|
|
The IQ2_XXS scores extremely poorly. I included the 70B Full FP16 scores as a baseline; I'd expect you'd be better off running a decent Shisa V2 70B Q4_K_M (40GB) or IQ3_M (32GB) than the 405B IQ2_XXS.
|
|
|
In an ideal world, of course, you should test different quants on your downstream tasks, but I understand that's not always an option. Based on this testing though, if you had to pick one bang-for-the-buck quant blind, I'd start with the IQ3_M (a quick way to grab and run it is sketched below).
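
Here's a minimal sketch for downloading and serving the IQ3_M with llama.cpp (the shard names are illustrative - check the repo file listing - and `-ngl`/`-c` should be adjusted for your hardware):

```
# Download just the IQ3_M shards from this repo
huggingface-cli download shisa-ai/shisa-v2-llama3.1-405b-GGUF --include "shisa-v2-llama3.1-405b-IQ3_M-*.gguf" --local-dir .

# Point llama.cpp at the first shard; the remaining shards are loaded automatically
build/bin/llama-server -m shisa-v2-llama3.1-405b-IQ3_M-00001-of-00004.gguf -ngl 99 -c 8192
```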
|
|
|
|
|
## Making Quants |
|
``` |
|
# First you need an fp16 GGUF - set up the llama.cpp Python env and run something like:
|
python convert_hf_to_gguf.py ~/.cache/huggingface/hub/models--shisa-ai--shisa-v2-llama3.1-405b/snapshots/71b83a7cb998c3a44f59c83a9928596ac348b9b5 --outfile shisa-v2-llama3.1-405b-fp16.gguf |
|
|
|
# Create imatrix: using 4 x H200 you can load 88 layers, takes about 1h15m |
|
CUDA_VISIBLE_DEVICES=4,5,6,7 build/bin/llama-imatrix -m shisa-v2-llama3.1-405b-fp16.gguf -f /data/quantize/shisa-v2-llama-3.1-405b/gguf/calibration_chat.txt -o imatrix.dat -c 512 -b 512 --chunks 100 -ngl 88 |
|
|
|
# Create your imatrix quants
|
build/bin/llama-quantize --imatrix imatrix.dat shisa-v2-llama3.1-405b-fp16.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf IQ3_XS |
|
|
|
# Split the quants at 45GB per shard
|
build/bin/llama-gguf-split --split-max-size 45G shisa-v2-llama3.1-405b-IQ3_XS.gguf shisa-v2-llama3.1-405b-IQ3_XS |
|
|
|
# Upload the shards (bash loop)
|
for f in shisa-v2-llama3.1-405b-IQ3_XS-0000*; do huggingface-cli upload shisa-ai/shisa-v2-llama3.1-405b-GGUF "$f"; done |
|
``` |