VRAM requirements for the full-size model

#14 opened by tazomatalax

Does anyone know how much VRAM we need to run this? Will it run OK on an RTX 6000 Ada 48GB with ~100k context?

The full 36B model in BF16 is about 70GB of weights alone, so it won't fit, but a quant will. I'm running a 4.22bpw EXL3 quant with 150k of Q8 context on 2x 3090 Ti 24GB with tensor parallelism, and it works alright. You could try that, or GGUF quants up to Q6_K, sglang/vllm with FP8, or GPTQ 8-bit / AWQ/GPTQ 4-bit quants. With the 4.22bpw EXL3 quant and a Q4 KV cache you should be able to push up to 300-350k context. I've had normal chats with it up to about 120k context so far and it was perfectly stable.
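
If it helps anyone size other setups, here's a quick back-of-envelope estimator for weights plus KV cache. The layer / KV-head / head-dim numbers below are placeholders I made up for illustration; substitute the real values from the model's config.json. It also ignores activation memory and framework overhead, so add a few GB of headroom.

```python
# Back-of-envelope VRAM estimator: weights + KV cache.
# Architecture numbers are placeholders -- pull the real values
# from the model's config.json. Ignores activations and overhead.

def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB at a given quantization level."""
    return n_params_b * bits_per_weight / 8  # billions of params * bits -> GB

def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: float) -> float:
    """KV cache in GB: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical 36B GQA config -- replace with the real architecture.
N_PARAMS_B, N_LAYERS, N_KV_HEADS, HEAD_DIM = 36, 64, 8, 128

scenarios = [
    (16.0, 2.0, 100_000),  # BF16 weights, FP16 KV cache
    (4.22, 1.0, 150_000),  # EXL3 4.22bpw, Q8 KV cache
    (4.22, 0.5, 300_000),  # EXL3 4.22bpw, Q4 KV cache
]
for bpw, kv_bytes, ctx in scenarios:
    total = (weights_gb(N_PARAMS_B, bpw)
             + kv_cache_gb(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, kv_bytes))
    print(f"{bpw:>5} bpw, ctx {ctx:>7,}: ~{total:.0f} GB")
```

With these placeholder numbers it lands roughly where the thread does: BF16 at 100k context comes out near 100GB, while the 4.22bpw quant with a quantized KV cache stays under 48GB even at long context.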

> Does anyone know how much VRAM we need to run this? Will it run OK on an RTX 6000 Ada 48GB with ~100k context?

I can fit 100K context in 24GB with exllamav3.

In fact, you'd have room to batch calls in parallel with 48GB if you wish.
