---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: tencent/Hunyuan-A13B-Instruct
license: other
license_name: tencent-hunyuan-a13b
license_link: https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/LICENSE
base_model_relation: quantized
tags:
- imatrix
- conversational
- ik_llama.cpp
---

## `ik_llama.cpp` imatrix Quantizations of Hunyuan-A13B-Instruct

This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants. Some of ik's new quants are also supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP.

These quants provide best-in-class perplexity for the given memory footprint.

## Big Thanks

Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the [BeaverAI Club Discord](https://discord.com/channels/1238219753324281886/1238239819017097246/1238676202357784650) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks that help everyone run, test, and benchmark all the fun new models!

## Quants

#### `IQ3_KS` 34.088 GiB (3.642 BPW)

A special mix using `IQ4_KS` for `ffn_down` and the all-new `IQ3_KS` for `ffn_(up|gate)` routed experts, with `iq6_k`/`iq5_k` for attention and the shared expert as shown in the recipe below. Test out `-rtr` to run-time-repack tensors to the `_r4` variants when running layers on CPU/RAM; it is likely faster at the default ubatch size.

With under 16GB VRAM and ~24GB RAM you can fit 32k context and still offload 10 extra exps layers onto the GPU for extra TG speed! It can even run on just 4GB VRAM with lower context, no extra offloaded layers, and enough system RAM (~32GiB). With extra VRAM, run more context or offload additional layers.
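If you only want this one quant, something like the following download sketch should work. The repo id and file pattern here are assumptions based on the paths used elsewhere in this card, so double check them against the actual file listing (large quants may be split into multiple parts):

```bash
# Download only the IQ3_KS quant into a local models directory.
# Repo id, file pattern, and target directory are assumptions; verify against the repo's file list.
pip install -U "huggingface_hub[cli]"
huggingface-cli download ubergarm/Hunyuan-A13B-Instruct-GGUF \
  --include "Hunyuan-A13B-Instruct-IQ3_KS*.gguf" \
  --local-dir /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF
```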
<details>

<summary>👈 Secret Recipe</summary>

```bash
custom="
# Attention
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_q.*=iq5_k
blk\..*\.attn_o.*=iq5_k

# 1x Shared Expert
blk\..*\.ffn_(down)_shexp.*=iq6_k
blk\..*\.ffn_(gate|up)_shexp.*=iq5_k

# 64x Routed Experts
blk\..*\.ffn_(down)_exps.*=iq4_ks
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks

# Token Embedding
token_embd\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat \
    /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-BF16-00001-of-00004.gguf \
    /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    IQ3_KS \
    24
```

</details>
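If you want to sanity-check the regex-to-quant mapping before kicking off a long quantization run, you can print the assembled `--custom-q` string on its own. This is just a sketch reusing the same `grep`/`sed` pipeline from the recipe above with a trimmed-down tensor list; nothing here is a new `llama-quantize` flag.

```bash
# Preview the comma-separated --custom-q string without quantizing anything.
custom="
blk\..*\.attn_k.*=iq6_k
blk\..*\.ffn_(gate|up)_exps.*=iq3_ks
token_embd\.weight=iq6_k
"

# Same transform as the recipe: drop comment lines, join the rest with commas.
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Expected output:
# blk\..*\.attn_k.*=iq6_k,blk\..*\.ffn_(gate|up)_exps.*=iq3_ks,token_embd\.weight=iq6_k
echo "$custom"
```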
## Quick Start

#### 16GB VRAM + 24GB RAM Hybrid GPU+CPU Inference

```bash
# Basically trade off VRAM between longer context and more speed for your configuration.
./build/bin/llama-server \
    --model /mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf \
    --alias ubergarm/Hunyuan-A13B-Instruct-IQ3_KS \
    -fa -fmoe \
    -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    --temp 0.6 \
    --presence-penalty 0.7 \
    --min-p 0.1 \
    -ngl 99 \
    -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 \
    --port 8083
```

Once the server is up, you can sanity-check it over its OpenAI-compatible API; see the curl example at the bottom of this card.

## Perplexity

The perplexity on these Hunyuan-A13B-Instruct models comes out much higher than I have seen on other models. Check out mainline llama.cpp [PR14425](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3024357323) for more details.

* `IQ3_KS` 34.088 GiB (3.642 BPW) `Final estimate: PPL = 522.7473 +/- 5.68072`

## Speed

I used the built-in `llama-sweep-bench` tool to get example speeds across a variety of context-length chats (N_KV is the kv-cache depth used for generation).

![llama-sweep-bench graph](images/sweep-bench.png "Chart showing how speed slows down as kv-cache size grows, simulating longer multi-turn chats.")

## llama-sweep-bench

```bash
# Offload 15 total layers and increase ubatch from the default -ub 512 up to -ub 2048 for big PP!
export model=/mnt/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
./build/bin/llama-sweep-bench \
    --model "$model" \
    -fa -fmoe \
    -rtr \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -ngl 99 \
    -ot "blk\.([0-9])\.ffn_.*=CUDA0" \
    -ot "blk\.(1[0-4])\.ffn_.*=CUDA0" \
    -ub 2048 -b 2048 \
    -ot exps=CPU \
    --threads 16 \
    --warmup-batch
```

## *NOTE* Building Experimental PRs

This branch is based on currently unreleased PRs, so it is quite experimental. To build it before the PRs are merged, try something like this:

```bash
# get the code setup
cd projects
git clone https://github.com/ikawrakow/ik_llama.cpp.git ik_llama.cpp
cd ik_llama.cpp
git remote add ubergarm https://github.com/ubergarm/ik_llama.cpp
git fetch ubergarm
git checkout ug/hunyuan-moe-2

# build for CUDA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)

# clean up later once things get merged into main
git checkout main
git branch -D ug/hunyuan-moe-2
```

## VRAM Estimations

Context length = VRAM use:

* 8k = 3790MiB total with KV self size = 544.00 MiB, K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
* 32k = 5462MiB total with KV self size = 2176.00 MiB, K (q8_0): 1088.00 MiB, V (q8_0): 1088.00 MiB
* 64k = 7734MiB total with KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
* 256k = 21162MiB total with KV self size = 17408.00 MiB, K (q8_0): 8704.00 MiB, V (q8_0): 8704.00 MiB

## ROPE Considerations

The `rope-freq-base` defaults to about 11 million (`11158840`) but can be adjusted down, possibly to better match shorter-context applications.

```bash
# adjust to 3 million
--rope-freq-base 3000000
```

Thanks to [@kooshi for this tip](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3025974262), which you can experiment with.

## References

* [mainline llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ggml-org/llama.cpp/pull/14425)
* [ik_llama.cpp Hunyuan-A13B-Instruct PR](https://github.com/ikawrakow/ik_llama.cpp/pull/565)
* [ik_llama.cpp IQ3_KS PR](https://github.com/ikawrakow/ik_llama.cpp/pull/566)
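## Appendix: Quick Server Smoke Test

As referenced in the Quick Start section, here is a minimal sketch for confirming the server is responding via the OpenAI-compatible `/v1/chat/completions` endpoint that `llama-server` exposes. The host, port, and alias match the Quick Start command; the sampling parameters are only illustrative.

```bash
# Minimal smoke test against the llama-server started in the Quick Start section.
# Assumes the default OpenAI-compatible chat endpoint; adjust host/port if you changed them.
curl -s http://127.0.0.1:8083/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/Hunyuan-A13B-Instruct-IQ3_KS",
        "messages": [
          {"role": "user", "content": "Say hello in one short sentence."}
        ],
        "temperature": 0.6,
        "max_tokens": 64
      }'
```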