Multi GPU with different VRAM sizes does not work

#9
by jweb - opened

Thanks a lot.

I have 128GB of CPU memory and 24GB + 12GB GPUs.
I built with the following arguments for multiple GPUs:
-DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DLLAMA_CURL=ON

I am using IQ3_K_R4, and I have a problem.
Single GPU mode works fine.

CUDA_VISIBLE_DEVICES="0," \
LD_PRELOAD=/home/mycat7/amd-blis/lib/ILP64/libaocl-libmem.so ./llama-server \
--model /home/mycat7/LLM/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
--ctx-size 8192 \
-ctk q8_0 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080

However, multi GPU mode does not work.

# Multi GPU
LD_PRELOAD=/home/mycat7/amd-blis/lib/ILP64/libaocl-libmem.so ./llama-server \
--model /home/mycat7/LLM/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
--ctx-size 8192 \
-ctk q8_0 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--n-gpu-layers 63 \
-ts 24,12 \
-ot "blk.(3|4).ffn_.*=CUDA0" \
-ot "blk.(5|6).ffn_.*=CUDA1" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15931.61 MiB on device 1: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model

How can I solve this problem?
Please help me.

Owner

Take out the -ts 24,12, as for some reason that seems to mess up tensor overrides. Then manually place more layers on the bigger CUDA device, similar to how you're already doing it there.

Start off small and make sure it is all loading, then slowly increase layers on CUDA0 until it OOMs. Then back off by one. Repeat for CUDA1. Some experimentation like that should get you going once you remove -ts.
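
For example, a minimal starting point might look something like this (the specific layer picks are just a first guess to confirm everything loads):

-ot "blk\.(3|4)\.ffn_.*=CUDA0" \
-ot "blk\.(5)\.ffn_.*=CUDA1" \
-ot exps=CPU \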

I probably have to update my documentation to reflect this as I only realized it recently. Let me know if that works!

Thank you ubergarm.
I removed -ts and edited -ot.
It works fine with multi GPU now.
My CUDA capacity is limited:
CUDA0=24GB, CUDA1=12GB
Which tensors are most important to place with -ot in order to improve performance?

LD_PRELOAD=/home/mycat7/amd-blis/lib/ILP64/libaocl-libmem.so ./llama-server \
--model /home/mycat7/LLM/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-R1-0528-IQ3_K_R4 \
--ctx-size 32768 \
-ctk q8_0 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--n-gpu-layers 63 \
-ot "blk.(3|4).ffn_norm*=CUDA0" \
-ot "blk.(3|4).exp_probs_b*=CUDA0" \
-ot "blk.(3|4).ffn_down_shexp*=CUDA0" \
-ot "blk.(3|4).attn_.*=CUDA0" \
-ot "blk.(5).ffn_norm.*=CUDA1" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080

Owner

I have no idea why you are splitting up the tensors of a given layer across devices. Remember it takes time to transfer data across the PCIe bus if you have to send it from CUDA0 to CUDA1 or to CPU etc., so avoid that by keeping layers together as much as possible. I've never shown commands like yours before, so I'm not sure where you got this idea; perhaps you asked your AI?

Anyway, keep it simple and only put full layers on a given device until you OOM, then back off by one, e.g. assuming CUDA0=24GB, CUDA1=12GB:

-ngl 99 \
-ot "blk\.(3|4)\.ffn_.*=CUDA0" \
-ot "blk\.(5)\.ffn_.*=CUDA1" \
-ot exps=CPU \

You can probably use -amb 256 to get a little more free VRAM.

You might be able to add one more layer to each CUDA device, but probably not. If my recommendations work you might be able to increase to e.g. (3|4|5) and then (6|7), but you will probably OOM on this larger model.
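
If you do turn out to have headroom, that just means widening the regex groups, e.g. (purely illustrative; back off again if it OOMs):

-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
-ot "blk\.(6|7)\.ffn_.*=CUDA1" \
-ot exps=CPU \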

Also, if you want faster speed I'd recommend offloading one less layer and trying -ub 2048 -b 2048, which takes up more VRAM for the larger batch size but can get you 2-3x faster prompt processing. I can get over 200 tok/sec and others frequently report over 150 tok/sec.
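
Putting those suggestions together, a sketch might look like this (the layer picks and -amb 256 are just starting assumptions; drop a layer from CUDA0 or CUDA1 if the bigger batches make it OOM):

-amb 256 \
-fmoe \
-ngl 99 \
-ub 2048 -b 2048 \
-ot "blk\.(3|4)\.ffn_.*=CUDA0" \
-ot "blk\.(5)\.ffn_.*=CUDA1" \
-ot exps=CPU \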

Read this PR discussion for more detailed info and examples: https://github.com/ikawrakow/ik_llama.cpp/issues/437#issuecomment-2954709486

Thank you.
I have refreshed my thinking.
The following settings fit my CUDA0 and CUDA1, and they work fine.

-amb 512 \
-fmoe \
-ngl 99 \
-ot "blk\.(3)\.ffn_.*=CUDA0" \
-ot "blk\.(4)\.ffn_norm.*=CUDA1" \
-ot exps=CPU \

Can you help me with the -mla parameter? When should I use 1, 2, or 3?

@mtcl

Can you help me with the -mla parameter? When should I use 1, 2, or 3?

These days I just always use 3.
