Can we quantize the model to GGUF or GPTQ?
Thank you for sharing a new model architecture.
I was wondering whether we can quantize the model to popular formats like GGUF, exllamav2, or GPTQ. Also, if no quantization is available, can we use nf4 from bitsandbytes or some other quantization scheme? I wanted to test it with 64 steps, and I saw that this would be equivalent to 103B parameters.
Another question: with 64 steps, does it require 103B parameters' worth of VRAM, or can we run it on an RTX 3090 with 24GB of VRAM with 64 steps (it will probably just take more time to compute)?
Thanks!
Hi, you can certainly try. We have not yet experimented with quantizing the model. You'd have to write support for this custom architecture if you plan to use a separate inference setup like GGUF/exllama/GPTQ. Generic quantization strategies that apply to the PyTorch implementation might work out of the box, but their effect on model quality has not yet been determined.
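As a minimal sketch of what such an off-the-shelf attempt could look like with bitsandbytes nf4 through transformers (the model id below is a placeholder, and whether the quantized weights preserve quality is untested):

```python
# Hedged sketch: loading the custom-architecture model with bitsandbytes nf4.
# "org/recurrent-depth-3.5b" is a placeholder id; substitute the real repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "org/recurrent-depth-3.5b",    # placeholder model id
    quantization_config=bnb_config,
    trust_remote_code=True,        # the custom architecture code ships with the repo
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("org/recurrent-depth-3.5b", trust_remote_code=True)
```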
Importantly, while the model uses as much compute as a 103B-parameter model when you run 64 steps, it still has only 3.5B parameters, so the weights only take 3.5B parameters' worth of VRAM. A single forward pass with many steps just takes more time.
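A rough back-of-the-envelope check of the weight footprint (activations and cache not included, and the exact numbers depend on the chosen precision):

```python
# Rough estimate: memory for 3.5B parameters at different precisions.
num_params = 3.5e9

for name, bytes_per_param in [("bf16", 2), ("int8", 1), ("nf4", 0.5)]:
    gib = num_params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB of weights")

# bf16: ~6.5 GiB, int8: ~3.3 GiB, nf4: ~1.6 GiB -- well under 24 GB either way,
# independent of how many recurrent steps are run at inference time.
```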