GGUF quants of CausalLM/35b-beta-long. Here I have (file size, perplexity +/- error, notes):

IQ1_S  (8.0G, 16.8624 +/- 0.24892, fits into 10GiB VRAM, just for kicks and giggles, not really usable)
IQ1_M  (8.6G, 13.9588 +/- 0.19871, fits into 12GiB VRAM, just for kicks and giggles, not really usable)
IQ2_M  ( 12G, 10.1401 +/- 0.14062, fits into 16GiB VRAM + 6144 context with q4_1 KV cache)
IQ4_XS ( 18G,  9.4489 +/- 0.13005, fits into 24GiB VRAM + 8192 context with q4_1 KV cache, also room for 2048 ubatch)
IQ4_NL ( 19G,  9.4632 +/- 0.13056, fits into 24GiB VRAM + 8192 context with q4_1 KV cache)
Q4_K_M ( 21G,  9.3738 +/- 0.12900, fits into 24GiB VRAM + 6144 context with q4_1 KV cache, also good for CPU inference on E5-26xx v3/v4)
Q8_0   ( 35G,  9.3277 +/- 0.12781, probably isn't practical for anything unless you have a big GPU array, the imatrix was derived from it)
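
To compare the quants at a glance, here is a minimal Python sketch that computes how much each quant's perplexity rises over Q8_0, using only the numbers from the list above. The function names are made up for illustration, and the "within error" check is a rough heuristic (it just compares the gap to the quant's own +/- value), not a proper significance test.

```python
# (size as listed, perplexity, +/- error), copied from the list above
QUANTS = {
    "IQ1_S": (8.0, 16.8624, 0.24892),
    "IQ1_M": (8.6, 13.9588, 0.19871),
    "IQ2_M": (12.0, 10.1401, 0.14062),
    "IQ4_XS": (18.0, 9.4489, 0.13005),
    "IQ4_NL": (19.0, 9.4632, 0.13056),
    "Q4_K_M": (21.0, 9.3738, 0.12900),
    "Q8_0": (35.0, 9.3277, 0.12781),
}


def ppl_increase_vs_q8(name):
    """Relative perplexity increase over Q8_0, in percent (hypothetical helper)."""
    base = QUANTS["Q8_0"][1]
    return (QUANTS[name][1] / base - 1.0) * 100.0


def within_error_of_q8(name):
    """Rough heuristic: is the gap to Q8_0 no larger than this quant's +/- error?"""
    _, ppl, err = QUANTS[name]
    return abs(ppl - QUANTS["Q8_0"][1]) <= err


if __name__ == "__main__":
    for name, (size, ppl, err) in QUANTS.items():
        print(
            f"{name:7s} {size:5.1f}G  {ppl_increase_vs_q8(name):+7.2f}%  "
            f"within error of Q8_0: {within_error_of_q8(name)}"
        )
```

Lower perplexity is better, so a small percentage here means the quant stays close to Q8_0 on this particular text.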

Perplexity was measured with -fa -ctv q4_1 -ctk q4_1 -c 2048 -ub 2048 on the UTF-8 text version of "Wired Love" from Project Gutenberg.
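
For anyone who wants to repeat the measurement, the sketch below assembles a command line from the flags listed above. It assumes the tool is llama.cpp's perplexity binary (called `llama-perplexity` here, which the text above does not state), and that `-m` and `-f` select the model and the text file. The file names are placeholders, and flag spellings may differ between llama.cpp versions, so check `--help` on your build first.

```python
import shlex


def perplexity_command(model_path, text_path, binary="llama-perplexity"):
    """Build the argument list for a perplexity run.

    The binary name, -m and -f are assumptions; the remaining flags are
    the ones quoted in the model card.
    """
    return [
        binary,
        "-m", model_path,   # assumed: path to the GGUF file
        "-f", text_path,    # assumed: path to the plain-text test file
        "-fa",              # flash attention
        "-ctk", "q4_1",     # K cache type
        "-ctv", "q4_1",     # V cache type
        "-c", "2048",       # context size
        "-ub", "2048",      # ubatch size
    ]


if __name__ == "__main__":
    # Placeholder file names, not the actual names in this repo.
    cmd = perplexity_command("model.IQ4_XS.gguf", "wired_love.txt")
    print(shlex.join(cmd))
```

The function only builds and prints the command, so it is safe to run without llama.cpp installed; pass the list to `subprocess.run` if you actually want to execute it.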

NeoChen1024/CausalLM_35b-beta-long-GGUF-imatrix