TQ1_0 DeepSeek-R1-0528 could not run with Ollama
I ran docker exec ollama ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:TQ1_0 on five 4090 (48 GB) GPUs, and I can see the model actually loading onto the GPUs,
but I always get the following error:
Error: llama runner process has terminated: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA4 buffer of size 35888578560
My server has 512 GB of RAM and I set 1 TB of virtual memory, so why does it still fail? Does this have anything to do with disk size or the Docker environment? (There is 36 GB of disk space left after downloading the quantized model; I am running Ollama 0.9.0 in Docker.)
Thanks so much!
You may need to set the unified memory environment variable for Ollama. On Linux, edit the service file (sudo nano /etc/systemd/system/ollama.service) and add the setting there; on Windows I am not sure how.
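For example, add the variable under the [Service] section of the unit file, then reload and restart (a sketch of the Linux/systemd route; the unit path above is the default one the Ollama installer creates):

    [Service]
    Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"

    # apply the change and restart the service
    sudo systemctl daemon-reload
    sudo systemctl restart ollama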
To enable unified memory (where the CPU and GPU share an address space) with Ollama, Linux users need to set the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable before starting Ollama; Windows users have the equivalent behavior enabled by default. This allows Ollama to place model layers in either GPU memory or system RAM as needed, which can prevent out-of-memory errors and let larger models run.
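Note that since you are running Ollama inside Docker, the systemd file will not affect the container; instead, pass the variable when creating the container. A sketch using the standard ollama/ollama image and default port (the container name and volume here match the usual docs setup, so adjust them to yours):

    # remove the old container, then recreate it with the env var set
    docker rm -f ollama
    docker run -d --gpus=all \
      -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
      -v ollama:/root/.ollama -p 11434:11434 \
      --name ollama ollama/ollama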