Upload SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf
With RpR V3: iMatrix across the board, with Q8 embeddings/output like the other files in this repo, but using the new tensor-type option in llama-quantize to put the attention Q and attention-output tensors at Q6_K, and the attention K and V tensors at Q5_K, instead of IQ4_XS. Overall, the goal was to keep a small file size (less than Q4_K_M, slightly larger than Q4_K_S and IQ4_XS) with Q5-Q8 quality where it matters most.
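For reference, a minimal sketch of the kind of llama-quantize invocation this describes; the file names, imatrix path, and exact tensor-name patterns here are assumptions, not the exact command used:

# Hypothetical paths; --tensor-type takes tensor-pattern=quant-type pairs.
./llama-quantize \
    --imatrix snowdrogito-rpr-v3.imatrix \
    --token-embedding-type q8_0 \
    --output-tensor-type q8_0 \
    --tensor-type attn_q=q6_k \
    --tensor-type attn_output=q6_k \
    --tensor-type attn_k=q5_k \
    --tensor-type attn_v=q5_k \
    SnowDrogito-RpRv3-32B-F16.gguf \
    SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf \
    IQ4_XS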
Still able to offload 61 of 65 layers with 40960 tokens of context on a 24 GB VRAM card using Q8 KV-cache quantization, at mostly decent speed.
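A matching llama.cpp launch would look roughly like the sketch below (flag values taken from the description above; quantizing the V cache requires flash attention in current llama.cpp builds):

# Assumed invocation; -ctk/-ctv q8_0 quantize the KV cache, -fa enables flash attention.
./llama-server \
    -m SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf \
    -ngl 61 \
    -c 40960 \
    -fa \
    -ctk q8_0 -ctv q8_0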
.gitattributes
CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 SnowDrogito-RpR-32B_IQ4-XS.gguf filter=lfs diff=lfs merge=lfs -text
 SnowDrogito-RpRv3-32B_IQ4-XS.gguf filter=lfs diff=lfs merge=lfs -text
+SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf filter=lfs diff=lfs merge=lfs -text
SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f5206207f04665cfb11417594aea8b29b9be032219ec54107548310e1c09b3a1
+size 19313339424