How to run ollama using these new quantized weights?

#12
by vadimkantorov - opened

In https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally#tutorial-how-to-run-deepseek-v3-in-llama.cpp there are instructions for local serving with llama.cpp.

Could you please advise how to use these multi-part GGUF files (e.g. downloaded with hf_transfer, as also advised in https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally#tutorial-how-to-run-deepseek-v3-in-llama.cpp) to serve a model in Ollama?
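For context, the download step I'm following looks roughly like this (the repo id, quant filename pattern, and target directory are my assumptions, not quoted from the tutorial):

```bash
# Enable hf_transfer for faster downloads and fetch only the UD-Q2_K_XL shards
# (repo id and filename pattern are assumptions; adjust to the quant you want)
pip install -U "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  unsloth/DeepSeek-V3-0324-GGUF \
  --include "*UD-Q2_K_XL*" \
  --local-dir DeepSeek-V3-0324-GGUF
```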

I found https://ollama.com/sunny-g/deepseek-v3-0324:ud-q2_k_xl (and its suggested command ollama run sunny-g/deepseek-v3-0324:ud-q2_k_xl), but that is only one of the many quantizations provided in this HF repo. Are there instructions for consuming the multi-part GGUF files already downloaded to the local filesystem with the fast hf_transfer?

E.g. for deepseek-v3-0324:ud-q2_k_xl, would I need at least three H100 80 GB devices to run it without CPU offloading? Otherwise, what would the speed be on one H100 or one H200 using Ollama?
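For rough sizing I'm just comparing the total on-disk size of the shards against 80 GB per H100 (directory path assumed; KV cache and runtime overhead would come on top of the weights):

```bash
# Total size of the UD-Q2_K_XL shards on disk; divide by ~80 GB per H100
# to get a lower bound on the number of GPUs needed for full offload
du -ch DeepSeek-V3-0324-GGUF/*UD-Q2_K_XL*.gguf | tail -n 1
```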

Thank you!

Have you successfully run Ollama?

Unsloth AI org

Currently Ollama doesn't support multi-part (sharded) GGUF files, so you can't right now.
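If merging the shards into a single file first is acceptable, a possible workaround sketch (assuming llama.cpp's llama-gguf-split tool is built, and using placeholder shard filenames that you'd replace with the ones actually downloaded):

```bash
# Merge the multi-part GGUF into a single file with llama.cpp's gguf-split tool
# (the shard filename below is a placeholder; point it at the first shard)
./llama-gguf-split --merge \
  DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
  DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf

# Register the merged file with Ollama via a minimal Modelfile, then run it
printf 'FROM ./DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf\n' > Modelfile
ollama create deepseek-v3-0324:ud-q2_k_xl -f Modelfile
ollama run deepseek-v3-0324:ud-q2_k_xl
```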
