VRAM requirements for the full-size model

#14 opened by tazomatalax

Does anyone know how much VRAM we need to run this? Will it run OK on an RTX 6000 Ada 48GB with ~100k context?

The full 36B model in BF16 is about 70GB of weights alone, so it won't fit, but a quant will. I'm running a 4.22bpw EXL3 quant with 150k of Q8 context on 2x 3090 Ti 24GB with tensor parallelism, and it works alright. You could try that, or GGUF quants up to Q6_K, sglang/vllm with FP8, or GPTQ 8-bit / AWQ/GPTQ 4-bit quants. With the 4.22bpw EXL3 quant and a Q4 KV cache you should be able to push up to 300-350k context. I've had normal chats with it up to about 120k context so far and it was perfectly stable.
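
If it helps anyone size other setups, here's a quick back-of-envelope estimator for weights plus KV cache. The layer / KV-head / head-dim numbers below are placeholders I made up for illustration; substitute the real values from the model's config.json. It also ignores activation memory and framework overhead, so add a few GB of headroom.

```python
# Back-of-envelope VRAM estimator: weights + KV cache.
# Architecture numbers are placeholders -- pull the real values
# from the model's config.json. Ignores activations and overhead.

def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB at a given quantization level."""
    return n_params_b * bits_per_weight / 8  # billions of params * bits -> GB

def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: float) -> float:
    """KV cache in GB: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical 36B GQA config -- replace with the real architecture.
N_PARAMS_B, N_LAYERS, N_KV_HEADS, HEAD_DIM = 36, 64, 8, 128

scenarios = [
    (16.0, 2.0, 100_000),  # BF16 weights, FP16 KV cache
    (4.22, 1.0, 150_000),  # EXL3 4.22bpw, Q8 KV cache
    (4.22, 0.5, 300_000),  # EXL3 4.22bpw, Q4 KV cache
]
for bpw, kv_bytes, ctx in scenarios:
    total = (weights_gb(N_PARAMS_B, bpw)
             + kv_cache_gb(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, kv_bytes))
    print(f"{bpw:>5} bpw, ctx {ctx:>7,}: ~{total:.0f} GB")
```

With these placeholder numbers it lands roughly where the thread does: BF16 at 100k context comes out near 100GB, while the 4.22bpw quant with a quantized KV cache stays under 48GB even at long context.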

> Does anyone know how much VRAM we need to run this? Will it run OK on an RTX 6000 Ada 48GB with ~100k context?

I can fit 100K context in 24GB with exllamav3.

In fact, you'd have room to batch calls in parallel with 48GB if you wish.
