Out of memory on RTX 3090

#1
by LeMoussel - opened

I’m getting an out-of-memory error on an RTX 3090 (24 GB):
torch.cuda.OutOfMemoryError: CUDA out of memory....

Here is my Python code:

from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextStreamer

# https://huggingface.co/casperhansen/mixtral-instruct-awq
MODEL_ID = "casperhansen/mixtral-instruct-awq"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load the AWQ-quantized Mixtral weights onto the first GPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
)

# Stream generated tokens to stdout as they are produced.
streamer = TextStreamer(
    tokenizer=tokenizer,
    skip_prompt=True,
    skip_special_tokens=True
)

prompt = "[INST] Donne moi un code Python optimisé pour calculer la racine carrée [/INST]"

# Tokenize the prompt and move the input ids to the GPU
# (the BatchEncoding attribute is input_ids, not inputs_id).
tokens = tokenizer(
    text=prompt,
    return_tensors='pt'
).input_ids.to("cuda:0")

generation_output = model.generate(
    input_ids=tokens,
    streamer=streamer,
    max_new_tokens=512
)

On vLLM and with a different model, I often see a short spike in memory usage when loading an AWQ model. This is on a 48 GB card, so I have the extra memory; shortly after the spike, the memory usage drops back down. Maybe offloading to disk might help.

Could you explain to me how to offload to disk?
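
For reference, disk offload in transformers goes through accelerate. Below is a minimal sketch, assuming accelerate is installed: max_memory caps what is placed on each device and offload_folder is where the leftover weights are written. The limits shown are illustrative, not tuned, and note that some transformers versions refuse CPU/disk placement for AWQ-quantized checkpoints, so this may only work with an unquantized model; offloaded layers also make generation much slower.

from transformers import AutoModelForCausalLM

MODEL_ID = "casperhansen/mixtral-instruct-awq"

# device_map="auto" lets accelerate split the model across the listed devices;
# whatever does not fit within the GPU/CPU budgets is written to offload_folder
# on disk and streamed back to the GPU during the forward pass.
# NOTE: depending on the transformers version, AWQ models may reject a
# device_map that places layers on cpu/disk.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "48GiB"},  # illustrative limits, adjust to your machine
    offload_folder="offload",
)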

How much of a spike? Will a 40 GB GPU be enough?
UPD: it takes around 30-31 GB of VRAM.
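
For a rough sense of why 24 GB is not enough: Mixtral 8x7B has about 46.7B parameters in total, so even at 4 bits the quantized weights alone take roughly 23 GB, before the KV cache, activation buffers, and the loading spike. A back-of-the-envelope estimate (the parameter count is the published figure for Mixtral 8x7B; the rest is arithmetic):

total_params = 46.7e9        # Mixtral 8x7B total parameter count (approximate)
bytes_per_param = 0.5        # AWQ stores weights at ~4 bits per parameter
weights_gb = total_params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB of quantized weights")   # ~23 GB
# KV cache, activation buffers and the temporary spike during loading come on
# top of this, which is consistent with the ~30-31 GB reported above and
# explains why a 24 GB RTX 3090 runs out of memory.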
