Hardware requirements?
Hey guys, what hardware's required to run this model as is?
I meant what hardware the full DeepSeek-R1 needs to run, not the distilled versions. It's good to know Distill-Qwen-32B can run locally on a 3080 tho.
@shl0th
That is not DeepSeek-R1, it's Qwen-32B. OP asked about the original 671B R1.
According to the Distilled Model Evaluation, DeepSeek-R1-Distill-Qwen-32B is very close to o1-mini, so it should be a balanced option.
I am running the Q8_0 GGUF with llama.cpp on my 256 GB workstation. Because it is an MoE model, it does not actually require this much RAM as long as you keep the context window modest. The KV cache consumes more RAM than the model itself. For example, with a context length of 32092 tokens it takes around 220 GB of RAM.
So you are able to run the original R1 671B model as a Q8_0 GGUF in 256 GB of VRAM? If yes, which GPU config are you using?
Yes, it is running now. I am exploring the odd regression issues. I am not using any GPUs; this is CPU only, so the speed is not all that great (it is very slow, more like emailing than chat). But it works, and it is usable for agent work if I can trust it enough not to give me boilerplate nonsense on random challenges.
@vmajor
Ooh, on CPU. That's painful for a 600B model. Which CPU are you using?
It is an MoE, so (from memory) only about 40B parameters are active at any one time. It is far from real time, but for turn-based or agent work it is fine. Dual Xeon something. Old tech, but available cheaply. When/if this space settles on some kind of 'optimum' I will look at purchasing faster hardware.
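For anyone who wants to sanity-check the MoE point, here is a rough back-of-the-envelope sketch of how many weight bytes have to be read per generated token. It assumes Q8_0 stores 32 int8 weights plus one fp16 scale per block (8.5 bits per weight) and uses the ~40B active-parameter figure quoted from memory above (the model card lists 37B). It ignores the KV cache, activations, and the fact that real GGUF files mix tensor types, so treat the numbers as estimates only.

```python
# Rough estimate: weight bytes read per generated token, dense vs MoE.
# Assumption: Q8_0 uses 34 bytes per block of 32 weights (32 int8 + one fp16 scale).
Q8_0_BITS_PER_WEIGHT = 34 * 8 / 32  # 8.5 bits per weight

TOTAL_PARAMS = 671e9   # full DeepSeek-R1 parameter count
ACTIVE_PARAMS = 40e9   # figure quoted from memory in the thread (model card: 37B)


def weight_gb_read(params_read: float, bits_per_weight: float) -> float:
    """Decimal gigabytes of weights touched when `params_read` parameters are used."""
    return params_read * bits_per_weight / 8 / 1e9


# If every parameter were used for every token (dense model of the same size).
dense_gb = weight_gb_read(TOTAL_PARAMS, Q8_0_BITS_PER_WEIGHT)
# Only the routed experts plus shared layers are used per token in an MoE.
moe_gb = weight_gb_read(ACTIVE_PARAMS, Q8_0_BITS_PER_WEIGHT)

print(f"dense: {dense_gb:.1f} GB per token")
print(f"MoE:   {moe_gb:.1f} GB per token ({ACTIVE_PARAMS / TOTAL_PARAMS:.1%} of the weights)")
```

Under these assumptions each token touches only a few percent of the weights, which is why CPU-only inference is slow but workable here.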
Sounds good. Which repo did you source the Q8_0 GGUF from?
unsloth
@vmajor
Ooh. One last question: do you have any idea about running LLMs on an AMD MI300X using Ollama?
Lol no :) I do not. Also, I never quite got on the Ollama train. I prefer to download my own models and not have to mess around with config files, i.e. Ollama does not simplify things for me.
Why not use an API? Like using the OpenAI o1 API.
Even if you are worried about data, there are many third-party API vendors, such as OpenRouter.
With the R1 API on OpenRouter, are you able to get the "<think> ... </think>" part as well? I only get the final result when using DeepSeek as the provider.
Running it at about 5-8 t/s on a dual EPYC CPU with 24 x 16GB of DDR5 RAM (384GB). Running the IQ4_XS version with llama.cpp. No GPU.
Total system cost was a bit over $4000; the CPUs are engineering samples bought on eBay.
Also using KV cache compression (q4_0).
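As a quick check of why the IQ4_XS build fits on a 384 GB box, here is a minimal sketch of the weight footprint. It assumes the nominal 4.25 bits per weight usually quoted for IQ4_XS and applies it uniformly to all 671B parameters. Real GGUF files keep some tensors at other precisions, so the actual file size will differ somewhat, and the KV cache and runtime buffers come on top.

```python
# Rough estimate of GGUF weight size versus installed RAM.
# Assumption: IQ4_XS averages a nominal 4.25 bits per weight across the whole model.
IQ4_XS_BITS_PER_WEIGHT = 4.25
TOTAL_PARAMS = 671e9          # full DeepSeek-R1
INSTALLED_RAM_GB = 24 * 16    # 24 x 16 GB DIMMs, as described in the thread


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS, IQ4_XS_BITS_PER_WEIGHT)
headroom_gb = INSTALLED_RAM_GB - weights_gb

print(f"weights:  {weights_gb:.1f} GB")
print(f"headroom: {headroom_gb:.1f} GB left for KV cache, buffers, and the OS")
```

The headroom is tight under these assumptions, which is consistent with also compressing the KV cache.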
What motherboard are you using? The number of RAM channels makes a big difference for CPU inference. I've been looking at a motherboard that supports 12-channel DDR5-6000 memory at 576 GB/s: https://www.phoronix.com/review/8-12-channel-epyc-9005
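The 576 GB/s figure is just channels times transfer rate times bus width, and it gives a rough ceiling on decode speed if token generation is memory-bandwidth bound. A minimal sketch, assuming a 64-bit (8-byte) data bus per DDR5 channel, 37B active parameters per token (the model card figure), and a nominal 4.25 bits per weight for IQ4_XS. The function names are made up for illustration, and real throughput will land well below this theoretical peak.

```python
# Theoretical peak memory bandwidth and a bandwidth-bound decode ceiling.
# Assumptions: 8-byte-wide channels, weights streamed once per token,
# 37B active parameters, nominal 4.25 bits per weight (IQ4_XS).

def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth in decimal GB/s: channels x transfers/s x bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s when each token must stream `gb_read_per_token` of weights."""
    return bandwidth_gb_s / gb_read_per_token


bandwidth = peak_bandwidth_gb_s(channels=12, mega_transfers_per_s=6000)
gb_per_token = 37e9 * 4.25 / 8 / 1e9  # active weights read per token, in GB
ceiling = decode_ceiling_tok_s(bandwidth, gb_per_token)

print(f"peak bandwidth: {bandwidth:.0f} GB/s")
print(f"decode ceiling: {ceiling:.1f} tokens/s (theoretical upper bound)")
```

Halving the number of populated channels halves both numbers, which is why channel count matters so much more than core count for this workload.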
@JohnnieB what do you mean by this? I was able to run this locally on a 3080.
Which 3080 specifically?