Why Am I Getting an Out-Of-Memory Error with My GPU Specs?

#7 opened by chunjae

Hi, I have a setup with four A100 GPUs, each with 40 GB of VRAM.
I believe this hardware should be sufficient to load the quantized Llama-4 model (listed as 57.4B on the homepage), but I'm running into CUDA out-of-memory errors.
Could someone please explain why this might be happening?
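As a rough sanity check on the memory math, here is a back-of-the-envelope sketch; it assumes the 57.4B figure is a parameter count and that the quantized checkpoint stores roughly one byte per parameter, which may not match the actual model card:

```python
# Back-of-the-envelope VRAM estimate; every number is an assumption,
# not an official figure for this model.
params_billion = 57.4        # "57.4B" taken as a parameter count
bytes_per_param = 1.0        # assuming an 8-bit (FP8/INT8) quantized checkpoint
weights_gb = params_billion * bytes_per_param   # ~57 GB of weights
total_vram_gb = 4 * 40                          # 4 x A100 40 GB = 160 GB
print(f"weights ~= {weights_gb:.0f} GB of {total_vram_gb} GB total VRAM")
```

Under those assumptions the weights alone fit easily, so an OOM at startup usually points to something else: the KV-cache reservation, a very large configured context length, or the engine not actually sharding the model across all four GPUs.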

@chunjae
Could you let me know whether you are using transformers or vLLM for inference? Please share your code as well.

Hi @Mogith and @chunjae,

I am running Llama 4 with vLLM, following the command from the official blog post: https://blog.vllm.ai/2025/04/05/llama4.html. However, it still errors out with this model.

Could you offer some advice? Thank you.

I have a setup with two V100 GPUs (80 GB each).
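In case it helps while waiting for a reply: here is a minimal vLLM Python sketch of the launch settings that usually decide whether a large model fits in memory; the model ID is a placeholder and the values are assumptions, not official recommendations.

```python
from vllm import LLM, SamplingParams

# Minimal multi-GPU load sketch; model ID and values are placeholders.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # placeholder model ID
    tensor_parallel_size=4,       # shard weights across all available GPUs
    max_model_len=8192,           # smaller context -> smaller KV-cache reservation
    gpu_memory_utilization=0.90,  # fraction of each GPU vLLM may claim up front
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Lowering max_model_len (or gpu_memory_utilization) is the usual first thing to try when the weights should fit but the engine still reports CUDA out of memory during startup.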
