---
base_model: shisa-ai/shisa-v2-llama3.1-405b
datasets:
- shisa-ai/shisa-v2-sharegpt
- shisa-ai/deepseekv3-ultrafeedback-armorm-dpo
language:
- ja
- en
- ko
- zh
library_name: transformers
license: llama3.1
model_name: shisa-v2-llama3.1-405b
quantized_by: leonardlin
---

## About
This repo contains select GGUF quants of [shisa-ai/shisa-v2-llama3.1-405b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b)
- All quants were created with `b5503` of upstream [llama.cpp](https://github.com/ggerganov/llama.cpp)
- All quants except the Q8_0 are weighted/imatrix quants; the imatrix was computed on the fp16 model using our [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt) bilingual dataset
- Files are pre-split at 45GB (below HF's 50GB upload limit). Modern llama.cpp builds should be able to load the sequential files automatically, but you can use `llama-gguf-split --merge` if you want to merge them back together
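
As a rough sketch of the download and merge steps, assuming `huggingface-cli` and llama.cpp's `llama-gguf-split` are both on your `PATH`: the helper names `fetch_quant` and `merge_quant` are hypothetical, and the `*QUANT*` filename pattern is an assumption about how the files in this repo are named.

```shell
# Sketch only: fetch_quant / merge_quant are hypothetical helper names.
# Assumes huggingface-cli and llama-gguf-split are on PATH.

# Download only the files whose names contain the given quant type
# (assumed naming pattern), into the given directory (default: current dir).
fetch_quant() {
  huggingface-cli download shisa-ai/shisa-v2-llama3.1-405b-GGUF \
    --include "*$1*" --local-dir "${2:-.}"
}

# Merge a split GGUF back into a single file.
# $1 is the FIRST split file, $2 is the merged output path.
merge_quant() {
  llama-gguf-split --merge "$1" "$2"
}
```

For example, `fetch_quant IQ3_M ./models` would pull just the IQ3_M files, and `merge_quant` takes whichever file is the first of the sequence. Merging is optional: pointing llama.cpp at the first split file should be enough for it to pick up the rest.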

## Provided Quants

| Type    | Size (GiB) |
|:--------|----------:|
| IQ2_XXS | 100       |
| IQ3_XS  | 155       |
| IQ3_M   | 170       |
| IQ4_XS  | 202       |
| Q4_K_M  | 227       |
| Q8_0    | 402       |


## Quant Quality
All quants have been tested with JA MT-Bench (judged by GPT-4.1) as a rough guide for quality:

| Quant        | Size (GiB)| % Diff       | Overall  | Writing   | Roleplay | Reasoning | Math     | Coding   | Extraction | STEM     | Humanities |
|--------------|--------:|-------------:|---------:|----------:|---------:|----------:|---------:|---------:|-----------:|---------:|-----------:|
| Full FP16    | 810     |              | **9.13** | 9.25      | **9.55** | 8.15      | 8.90     | 9.10     | 9.65       | 9.10     | 9.35       |
| IQ3_M        | 170     | -0.99        | 9.04     | 8.90      | 9.45     | 7.75      | 8.95     | 8.95     | 9.70       | **9.15** | 9.50       |
| Q4_K_M       | 227     | -1.10        | 9.03     | **9.40**  | 9.00     | 8.25      | 8.85     | **9.10** | 9.50       | 8.90     | 9.25       |
| Q8_0         | 402     | -1.20        | 9.02     | **9.40**  | 9.05     | **8.30**  | **9.20** | 8.70     | 9.50       | 8.45     | 9.55       |
| W8A8-INT8    | 405     | -1.42        | 9.00     | 9.20      | 9.35     | 7.80      | 8.75     | 9.00     | 9.80       | 8.65     | 9.45       |
| FP8-Dynamic  | 405     | -3.29        | 8.83     | 8.70      | 9.20     | 7.85      | 8.80     | 8.65     | 9.30       | 8.80     | 9.35       |
| IQ3_XS       | 155     | -3.50        | 8.81     | 8.70      | 9.05     | 7.70      | 8.60     | 8.95     | 9.35       | 8.70     | 9.45       |
| IQ4_XS       | 202     | -3.61        | 8.80     | 8.85      | **9.55** | 6.90      | 8.35     | 8.60     | **9.90**   | 8.65     | **9.60**   |
| *70B FP16*   | 140     | -7.89        | 8.41     | 7.95      | 9.05     | 6.25      | 8.30     | 8.25     | 9.70       | 8.70     | 9.05       |
| IQ2_XXS      | 100     | -18.18       | 7.47     | 7.50      | 6.80     | 5.15      | 7.55     | 7.30     | 9.05       | 7.65     | 8.80       |
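
The % Diff column appears to be the relative change in Overall score against the Full FP16 baseline (9.13). That is inferred from the numbers in the table rather than stated anywhere, so treat the following as a sketch; `pct_diff` is a hypothetical helper name.

```shell
# Sketch only: pct_diff is a hypothetical helper.
# Prints the relative change of a score versus a baseline, in percent,
# rounded to two decimals: (score - baseline) / baseline * 100
pct_diff() {
  awk -v s="$1" -v b="$2" 'BEGIN { printf "%.2f\n", (s - b) / b * 100 }'
}

pct_diff 9.04 9.13   # IQ3_M vs Full FP16, prints -0.99
```

The same call with the IQ2_XXS overall score (`pct_diff 7.47 9.13`) gives -18.18, matching the last row of the table.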

Given the margin of error, it's probably fair to say that the IQ3_M, Q4_K_M, and Q8_0 GGUFs have almost no functional loss versus the FP16.

Interestingly, on the Q4_K_M and Q8_0, roleplay takes one of the biggest hits (9.55 down to 9.00 and 9.05) while writing actually scores higher (9.25 up to 9.40). You'd need to test more (more samples, more runs, more evals) to really see what's going on. Also interesting: the two XS quants track each other closely (8.81 for the IQ3_XS, 8.80 for the IQ4_XS), which means the IQ4_XS does worse than the smaller IQ3_M (9.04).

The IQ2_XXS scores extremely poorly (7.47 overall, below the 70B FP16's 8.41). I included the 70B Full FP16 scores as a baseline, and I'd expect you'd be better off running a decent Shisa V2 70B Q4_K_M (40GB) or IQ3_M (32GB) than the IQ2_XXS.

In an ideal world, of course, you should test different quants on your own downstream tasks, but I understand that's not always an option. Based on this testing, though, if you had to pick one bang-for-the-buck quant blind, I'd start with the IQ3_M.


## Making Quants
```
# First you need an fp16 GGUF: set up the llama.cpp Python env and run something like
python convert_hf_to_gguf.py ~/.cache/huggingface/hub/models--shisa-ai--shisa-v2-llama3.1-405b/snapshots/71b83a7cb998c3a44f59c83a9928596ac348b9b5 --outfile shisa-v2-llama3.1-405b-fp16.gguf

# Create the imatrix: with 4 x H200 you can offload 88 layers (-ngl 88); takes about 1h15m
CUDA_VISIBLE_DEVICES=4,5,6,7 build/bin/llama-imatrix -m shisa-v2-llama3.1-405b-fp16.gguf -f /data/quantize/shisa-v2-llama-3.1-405b/gguf/calibration_chat.txt -o imatrix.dat -c 512 -b 512 --chunks 100 -ngl 88

# create your imatrix quants
build/bin/llama-quantize --imatrix imatrix.dat shisa-v2-llama3.1-405b-fp16.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf IQ3_XS

# split the quants
build/bin/llama-gguf-split --split-max-size 45G shisa-v2-llama3.1-405b-IQ3_XS.gguf shisa-v2-llama3.1-405b-IQ3_XS

# upload (bash loop)
for f in shisa-v2-llama3.1-405b-IQ3_XS-0000*; do huggingface-cli upload shisa-ai/shisa-v2-llama3.1-405b-GGUF "$f"; done
```
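
The commands above handle a single quant type. A minimal sketch of looping the quantize and split steps over several types could look like the following, assuming the fp16 GGUF and `imatrix.dat` from the earlier steps sit in the current directory; `make_imatrix_quants` is a hypothetical helper name, not part of llama.cpp.

```shell
# Sketch only: make_imatrix_quants is a hypothetical helper.
# Assumes shisa-v2-llama3.1-405b-fp16.gguf and imatrix.dat exist in the
# current directory and llama.cpp is built under build/bin.
make_imatrix_quants() {
  base="shisa-v2-llama3.1-405b"
  for q in "$@"; do
    # Quantize with the imatrix, stopping at the first failure
    build/bin/llama-quantize --imatrix imatrix.dat \
      "$base-fp16.gguf" "$base-$q.gguf" "$q" || return 1
    # Split the result into 45G chunks for upload
    build/bin/llama-gguf-split --split-max-size 45G \
      "$base-$q.gguf" "$base-$q" || return 1
  done
}
```

Usage would be something like `make_imatrix_quants IQ2_XXS IQ3_XS IQ3_M IQ4_XS Q4_K_M`. The Q8_0 is not an imatrix quant, so it would be quantized separately without the `--imatrix` flag.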