Why Am I Getting an Out-Of-Memory Error with My GPU Specs?

#7 opened by chunjae

Hi, I have a setup with four A100 GPUs, each with 40 GB of VRAM.
I believe this hardware should be sufficient to load the quantized Llama-4 model (listed as 57.4B on the homepage), but I'm running into CUDA out-of-memory errors.
Could someone please explain why this might be happening?
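As a rough sanity check on the memory math, here is a back-of-the-envelope sketch; it assumes the 57.4B figure is a parameter count and that the quantized checkpoint stores roughly one byte per parameter, which may not match the actual model card:

```python
# Back-of-the-envelope VRAM estimate; every number is an assumption,
# not an official figure for this model.
params_billion = 57.4        # "57.4B" taken as a parameter count
bytes_per_param = 1.0        # assuming an 8-bit (FP8/INT8) quantized checkpoint
weights_gb = params_billion * bytes_per_param   # ~57 GB of weights
total_vram_gb = 4 * 40                          # 4 x A100 40 GB = 160 GB
print(f"weights ~= {weights_gb:.0f} GB of {total_vram_gb} GB total VRAM")
```

Under those assumptions the weights alone fit easily, so an OOM at startup usually points to something else: the KV-cache reservation, a very large configured context length, or the engine not actually sharding the model across all four GPUs.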

@chunjae
Could you let me know whether you are using transformers or vLLM for inference? Please share your code as well.

Hi @Mogith and @chunjae,

I am running Llama 4 with vLLM, following the command from the official blog post: https://blog.vllm.ai/2025/04/05/llama4.html. However, it still errors out with this model.

Could you offer some advice? Thank you.

I have a setup with two V100 GPUs (80 GB each).
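In case it helps while waiting for a reply: here is a minimal vLLM Python sketch of the launch settings that usually decide whether a large model fits in memory; the model ID is a placeholder and the values are assumptions, not official recommendations.

```python
from vllm import LLM, SamplingParams

# Minimal multi-GPU load sketch; model ID and values are placeholders.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # placeholder model ID
    tensor_parallel_size=4,       # shard weights across all available GPUs
    max_model_len=8192,           # smaller context -> smaller KV-cache reservation
    gpu_memory_utilization=0.90,  # fraction of each GPU vLLM may claim up front
)

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Lowering max_model_len (or gpu_memory_utilization) is the usual first thing to try when the weights should fit but the engine still reports CUDA out of memory during startup.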
