|
--- |
|
base_model: shisa-ai/shisa-v2-llama3.1-405b |
|
datasets: |
|
- shisa-ai/shisa-v2-sharegpt |
|
- shisa-ai/deepseekv3-ultrafeedback-armorm-dpo |
|
language: |
|
- ja |
|
- en |
|
- ko |
|
- zh |
|
library_name: transformers |
|
license: llama3.1 |
|
model_name: shisa-v2-llama3.1-405b |
|
quantized_by: leonardlin |
|
--- |
|
|
|
## About |
|
This repo contains select GGUF quants of [shisa-ai/shisa-v2-llama3.1-405b](https://huggingface.co/shisa-ai/shisa-v2-llama3.1-405b).
|
- All quants were created with build `b5503` of upstream [llama.cpp](https://github.com/ggerganov/llama.cpp)
|
- All quants are weighted/imatrix quants, with the imatrix computed from our [shisa-ai/shisa-v2-sharegpt](https://huggingface.co/datasets/shisa-ai/shisa-v2-sharegpt) bilingual dataset against the fp16 model; the Q8_0 is the exception and was made without an imatrix
|
- Files are pre-split at 45GB (below HF's 50GB upload limit). Modern llama.cpp builds load the sequential shards automatically, but you can use `llama-gguf-split --merge` if you want to merge them back into a single file (see the example below)
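
For example, to reassemble a split quant into a single file (shard names and count here are illustrative; point `--merge` at the first shard of whichever quant you downloaded):

```
# Merge split GGUF shards back into one file (illustrative shard names)
llama-gguf-split --merge shisa-v2-llama3.1-405b-IQ3_M-00001-of-00004.gguf shisa-v2-llama3.1-405b-IQ3_M.gguf
```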
|
|
|
## Provided Quants |
|
|
|
| Type    | Size (GiB) |
|:--------|-----------:|
| IQ2_XXS |        100 |
| IQ3_XS  |        155 |
| IQ3_M   |        170 |
| IQ4_XS  |        202 |
| Q4_K_M  |        227 |
| Q8_0    |        402 |
|
|
|
|
|
## Quant Quality |
|
All quants have been tested with JA MT-Bench (judged by GPT-4.1) as a rough guide for quality: |
|
|
|
| Quant        | Size (GiB) | % Diff | Overall  | Writing  | Roleplay | Reasoning | Math     | Coding   | Extraction | STEM     | Humanities |
|--------------|-----------:|-------:|---------:|---------:|---------:|----------:|---------:|---------:|-----------:|---------:|-----------:|
| Full FP16    |        810 |        | **9.13** | 9.25     | **9.55** | 8.15      | 8.90     | 9.10     | 9.65       | 9.10     | 9.35       |
| IQ3_M        |        170 |  -0.99 | 9.04     | 8.90     | 9.45     | 7.75      | 8.95     | 8.95     | 9.70       | **9.15** | 9.50       |
| Q4_K_M       |        227 |  -1.10 | 9.03     | **9.40** | 9.00     | 8.25      | 8.85     | **9.10** | 9.50       | 8.90     | 9.25       |
| Q8_0         |        405 |  -1.20 | 9.02     | **9.40** | 9.05     | **8.30**  | **9.20** | 8.70     | 9.50       | 8.45     | 9.55       |
| W8A8-INT8    |        405 |  -1.42 | 9.00     | 9.20     | 9.35     | 7.80      | 8.75     | 9.00     | 9.80       | 8.65     | 9.45       |
| FP8-Dynamic  |        405 |  -3.29 | 8.83     | 8.70     | 9.20     | 7.85      | 8.80     | 8.65     | 9.30       | 8.80     | 9.35       |
| IQ3_XS       |        155 |  -3.50 | 8.81     | 8.70     | 9.05     | 7.70      | 8.60     | 8.95     | 9.35       | 8.70     | 9.45       |
| IQ4_XS       |        202 |  -3.61 | 8.80     | 8.85     | **9.55** | 6.90      | 8.35     | 8.60     | **9.90**   | 8.65     | **9.60**   |
| *70B FP16*   |        140 |  -7.89 | 8.41     | 7.95     | 9.05     | 6.25      | 8.30     | 8.25     | 9.70       | 8.70     | 9.05       |
| IQ2_XXS      |        100 | -18.18 | 7.47     | 7.50     | 6.80     | 5.15      | 7.55     | 7.30     | 9.05       | 7.65     | 8.80       |
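
The % Diff column is just the relative change in each quant's Overall score versus the full FP16 Overall (9.13); a quick sanity check of that arithmetic:

```
# % Diff = (quant Overall - FP16 Overall) / FP16 Overall * 100
awk 'BEGIN { printf "%.2f\n", (9.04 - 9.13) / 9.13 * 100 }'   # IQ3_M   -> -0.99
awk 'BEGIN { printf "%.2f\n", (7.47 - 9.13) / 9.13 * 100 }'   # IQ2_XXS -> -18.18
```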
|
|
|
Given the margin of error, you could fairly say that the IQ3_M, Q4_K_M, and Q8_0 GGUFs have almost no functional loss versus the FP16.
|
|
|
Interestingly, while roleplay takes one of the biggest hits, writing actually seems to improve on the Q4 and Q8. You'd really need to test more (more samples, more runs, more evals) to see what's actually going on. Also of note: the XS quants track each other pretty consistently, with the IQ4_XS doing worse than the IQ3_M.
|
|
|
The IQ2_XXS scores extremely poorly. I included the 70B Full FP16 scores as a baseline; I'd expect you'd be better off running a decent Shisa V2 70B Q4_K_M (40GB) or IQ3_M (32GB) than the 405B IQ2_XXS.
|
|
|
In an ideal world, of course, you should test different quants on your downstream tasks, but I understand that's not always an option. Based on this testing though, if you had to pick one bang-for-the-buck quant blind, I'd start with the IQ3_M (a quick way to grab and run it is sketched below).
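
Here's a minimal sketch for downloading and serving the IQ3_M with llama.cpp (the shard names are illustrative - check the repo file listing - and `-ngl`/`-c` should be adjusted for your hardware):

```
# Download just the IQ3_M shards from this repo
huggingface-cli download shisa-ai/shisa-v2-llama3.1-405b-GGUF --include "shisa-v2-llama3.1-405b-IQ3_M-*.gguf" --local-dir .

# Point llama.cpp at the first shard; the remaining shards are loaded automatically
build/bin/llama-server -m shisa-v2-llama3.1-405b-IQ3_M-00001-of-00004.gguf -ngl 99 -c 8192
```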
|
|
|
|
|
## Making Quants |
|
``` |
|
# First you need an fp16 GGUF - set up the llama.cpp Python env and run something like:
|
python convert_hf_to_gguf.py ~/.cache/huggingface/hub/models--shisa-ai--shisa-v2-llama3.1-405b/snapshots/71b83a7cb998c3a44f59c83a9928596ac348b9b5 --outfile shisa-v2-llama3.1-405b-fp16.gguf |
|
|
|
# Create imatrix: using 4 x H200 you can load 88 layers, takes about 1h15m |
|
CUDA_VISIBLE_DEVICES=4,5,6,7 build/bin/llama-imatrix -m shisa-v2-llama3.1-405b-fp16.gguf -f /data/quantize/shisa-v2-llama-3.1-405b/gguf/calibration_chat.txt -o imatrix.dat -c 512 -b 512 --chunks 100 -ngl 88 |
|
|
|
# Create your imatrix quants
|
build/bin/llama-quantize --imatrix imatrix.dat shisa-v2-llama3.1-405b-fp16.gguf shisa-v2-llama3.1-405b-IQ3_XS.gguf IQ3_XS |
|
|
|
# Split the quants at 45GB per shard
|
build/bin/llama-gguf-split --split-max-size 45G shisa-v2-llama3.1-405b-IQ3_XS.gguf shisa-v2-llama3.1-405b-IQ3_XS |
|
|
|
# Upload the shards (bash loop)
|
for f in shisa-v2-llama3.1-405b-IQ3_XS-0000*; do huggingface-cli upload shisa-ai/shisa-v2-llama3.1-405b-GGUF "$f"; done |
|
``` |