Can we quantize the model to GGUF or GPTQ?
Thank you for sharing a new model architecture.
I was wondering whether we can quantize the model to popular formats like GGUF, exllamav2, or GPTQ. Also, if no quantization is available, can we use nf4 from bitsandbytes or some other quantization scheme? I wanted to test it with 64 steps, and I saw that this would be equivalent to 103B parameters.
Another question: with 64 steps, does it require 103B parameters' worth of VRAM, or can we run it on an RTX 3090 with 24GB of VRAM with 64 steps (it will probably just take more time to compute)?
Thanks!
Hi, you can certainly try. We have not yet experimented with quantizing the model. You'd have to write support for this custom architecture if you plan to use a separate inference setup like GGUF/exllama/GPTQ. Generic quantization strategies that apply to the PyTorch implementation might work out of the box, but their effect on model quality has not yet been determined.
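As a minimal sketch of what such an off-the-shelf attempt could look like with bitsandbytes nf4 through transformers (the model id below is a placeholder, and whether the quantized weights preserve quality is untested):

```python
# Hedged sketch: loading the custom-architecture model with bitsandbytes nf4.
# "org/recurrent-depth-3.5b" is a placeholder id; substitute the real repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "org/recurrent-depth-3.5b",    # placeholder model id
    quantization_config=bnb_config,
    trust_remote_code=True,        # the custom architecture code ships with the repo
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("org/recurrent-depth-3.5b", trust_remote_code=True)
```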
Importantly, while the model uses as much compute as a 103B-parameter model when you run 64 steps, it still has only 3.5B parameters, so the weights only take 3.5B parameters' worth of VRAM. A single forward pass with many steps just takes more time.
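A rough back-of-the-envelope check of the weight footprint (activations and cache not included, and the exact numbers depend on the chosen precision):

```python
# Rough estimate: memory for 3.5B parameters at different precisions.
num_params = 3.5e9

for name, bytes_per_param in [("bf16", 2), ("int8", 1), ("nf4", 0.5)]:
    gib = num_params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB of weights")

# bf16: ~6.5 GiB, int8: ~3.3 GiB, nf4: ~1.6 GiB -- well under 24 GB either way,
# independent of how many recurrent steps are run at inference time.
```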