skatardude10 committed (verified) · Commit 40cdff1 · 1 parent: c19173e

Upload SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf


Built on RpR V3, with an importance matrix (imatrix) applied across the board and Q8 embeddings/output tensors like the other files in this repo, but using the new tensor-type option in `llama-quantize` to set the attention Q and attention-output tensors to Q6_K, and the K and V attention tensors to Q5_K, instead of IQ4_XS. Overall, the goal was to keep a small file size (smaller than Q4_K_M, slightly larger than Q4_K_S and IQ4_XS) while getting Q5-Q8 quality where it matters most.
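A sketch of the kind of `llama-quantize` invocation this describes. The input/imatrix file names are placeholders, and the exact tensor patterns accepted by `--tensor-type` depend on your llama.cpp build, so treat this as an assumption and check `llama-quantize --help`:

```shell
# Hypothetical command; paths and tensor patterns are assumptions.
./llama-quantize \
  --imatrix imatrix.dat \
  --token-embedding-type q8_0 \
  --output-tensor-type q8_0 \
  --tensor-type attn_q=q6_k \
  --tensor-type attn_output=q6_k \
  --tensor-type attn_k=q5_k \
  --tensor-type attn_v=q5_k \
  SnowDrogito-RpRv3-32B-F16.gguf \
  SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf \
  IQ4_XS
```

The final `IQ4_XS` argument sets the base quantization for every tensor not matched by an override, which is what keeps the overall file size close to a plain IQ4_XS quant.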

Still able to offload 61 of 65 layers with 40,960 tokens of context on a 24 GB VRAM card using Q8 KV-cache quantization, at mostly decent speed.
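The offload settings above map onto llama.cpp flags roughly as follows (a sketch, not a tested configuration; quantized V-cache typically also requires flash attention via `-fa`):

```shell
# Hypothetical launch command matching the description above:
# 61 of 65 layers on the GPU, 40960-token context, Q8_0 KV cache.
./llama-server \
  -m SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf \
  -ngl 61 \
  -c 40960 \
  -fa \
  -ctk q8_0 -ctv q8_0
```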

.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  SnowDrogito-RpR-32B_IQ4-XS.gguf filter=lfs diff=lfs merge=lfs -text
  SnowDrogito-RpRv3-32B_IQ4-XS.gguf filter=lfs diff=lfs merge=lfs -text
+ SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf filter=lfs diff=lfs merge=lfs -text
SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f5206207f04665cfb11417594aea8b29b9be032219ec54107548310e1c09b3a1
+ size 19313339424