How to run ollama using these new quantized weights?
The tutorial at https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally#tutorial-how-to-run-deepseek-v3-in-llama.cpp covers local serving with llama.cpp. Could you please advise how to use these multi-part GGUF files (e.g. downloaded with `hf_transfer`, as also advised in that tutorial) to serve a model in Ollama?
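For reference, I pulled the shards roughly as sketched below; the repo id (unsloth/DeepSeek-V3-0324-GGUF) and the UD-Q2_K_XL filename pattern are my assumptions, not copied from the tutorial:

```bash
# Enable the fast hf_transfer backend and pull only the UD-Q2_K_XL shards.
# Repo id and --include pattern are assumptions; adjust to the quant you want.
pip install -U "huggingface_hub[hf_transfer]"

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  unsloth/DeepSeek-V3-0324-GGUF \
  --include "*UD-Q2_K_XL*" \
  --local-dir DeepSeek-V3-0324-GGUF
```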
I found https://ollama.com/sunny-g/deepseek-v3-0324:ud-q2_k_xl (and its suggested command `ollama run sunny-g/deepseek-v3-0324:ud-q2_k_xl`), but that is only one of the many quants provided here, and I'm still missing instructions for consuming the multi-part GGUF files already downloaded onto the local filesystem with the fast `hf_transfer` backend.

Also, for deepseek-v3-0324:ud-q2_k_xl, would I need at minimum 3 H100 80 GB devices to run it without CPU offloading? Otherwise, what speed could I expect on 1 H100 or 1 H200 using Ollama?

Thank you!
Have you successfully run ollama?
Currently Ollama doesn't support multi-part (sharded) GGUF files, so you can't run these directly right now.
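A commonly suggested workaround (a sketch, not official guidance) is to merge the shards into a single GGUF with llama.cpp's gguf-split tool and then register that file with Ollama via a Modelfile; the paths, shard count, and model name below are placeholders, and you still need enough disk space and memory for the merged file:

```bash
# Merge the multi-part GGUF into one file with llama.cpp's gguf-split tool
# (shipped as llama-gguf-split in recent builds). Paths/shard count are examples.
./llama-gguf-split --merge \
  DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
  DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf

# Point Ollama at the merged file via a Modelfile, then create and run it locally.
cat > Modelfile <<'EOF'
FROM ./DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf
EOF

ollama create deepseek-v3-0324:ud-q2_k_xl -f Modelfile
ollama run deepseek-v3-0324:ud-q2_k_xl
```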