Upload SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf
With RpR V3: iMatrix across the board, with Q8 embeddings/output like the other files in this repo, but using the new tensor-type option in llama-quantize to put the attention Q and attention-output tensors at Q6_K, and the attention K and V tensors at Q5_K, instead of IQ4_XS. Overall, the goal was to keep a small file size (less than Q4_K_M, slightly larger than Q4_K_S and IQ4_XS) with Q5-Q8 quality where it matters most.
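For reference, a minimal sketch of the kind of llama-quantize invocation this describes; the file names, imatrix path, and exact tensor-name patterns here are assumptions, not the exact command used:

# Hypothetical paths; --tensor-type takes tensor-pattern=quant-type pairs.
./llama-quantize \
    --imatrix snowdrogito-rpr-v3.imatrix \
    --token-embedding-type q8_0 \
    --output-tensor-type q8_0 \
    --tensor-type attn_q=q6_k \
    --tensor-type attn_output=q6_k \
    --tensor-type attn_k=q5_k \
    --tensor-type attn_v=q5_k \
    SnowDrogito-RpRv3-32B-F16.gguf \
    SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf \
    IQ4_XS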
Still able to offload 61 of 65 layers with 40960 tokens of context on a 24 GB VRAM card using Q8 KV-cache quantization, at mostly decent speed.
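A matching llama.cpp launch would look roughly like the sketch below (flag values taken from the description above; quantizing the V cache requires flash attention in current llama.cpp builds):

# Assumed invocation; -ctk/-ctv q8_0 quantize the KV cache, -fa enables flash attention.
./llama-server \
    -m SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf \
    -ngl 61 \
    -c 40960 \
    -fa \
    -ctk q8_0 -ctv q8_0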
.gitattributes
CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 SnowDrogito-RpR-32B_IQ4-XS.gguf filter=lfs diff=lfs merge=lfs -text
 SnowDrogito-RpRv3-32B_IQ4-XS.gguf filter=lfs diff=lfs merge=lfs -text
+SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf filter=lfs diff=lfs merge=lfs -text
SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f5206207f04665cfb11417594aea8b29b9be032219ec54107548310e1c09b3a1
+size 19313339424