How to run ollama using these new quantized weights?

#12
by vadimkantorov - opened

In https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally#tutorial-how-to-run-deepseek-v3-in-llama.cpp there are instructions for local serving with llama.cpp.

Could you please advise how to use these multi-part GGUF files (e.g. downloaded with hf_transfer, as also advised in https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally#tutorial-how-to-run-deepseek-v3-in-llama.cpp) to serve a model in Ollama?
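For context, the download step I'm following looks roughly like this (the repo id, quant filename pattern, and target directory are my assumptions, not quoted from the tutorial):

```bash
# Enable hf_transfer for faster downloads and fetch only the UD-Q2_K_XL shards
# (repo id and filename pattern are assumptions; adjust to the quant you want)
pip install -U "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  unsloth/DeepSeek-V3-0324-GGUF \
  --include "*UD-Q2_K_XL*" \
  --local-dir DeepSeek-V3-0324-GGUF
```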

I found https://ollama.com/sunny-g/deepseek-v3-0324:ud-q2_k_xl (and its suggested command ollama run sunny-g/deepseek-v3-0324:ud-q2_k_xl), but that is only one of the many quantizations provided in this HF repo. Are there instructions for consuming the multi-part GGUF files already downloaded to the local filesystem with the fast hf_transfer?

E.g. for deepseek-v3-0324:ud-q2_k_xl, would I need at least three H100 80 GB devices to run it without CPU offloading? Otherwise, what would the speed be on one H100 or one H200 using Ollama?
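For rough sizing I'm just comparing the total on-disk size of the shards against 80 GB per H100 (directory path assumed; KV cache and runtime overhead would come on top of the weights):

```bash
# Total size of the UD-Q2_K_XL shards on disk; divide by ~80 GB per H100
# to get a lower bound on the number of GPUs needed for full offload
du -ch DeepSeek-V3-0324-GGUF/*UD-Q2_K_XL*.gguf | tail -n 1
```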

Thank you!

Have you successfully run Ollama?

Unsloth AI org

Currently Ollama doesn't support multi-part (sharded) GGUF files, so you can't right now.
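If merging the shards into a single file first is acceptable, a possible workaround sketch (assuming llama.cpp's llama-gguf-split tool is built, and using placeholder shard filenames that you'd replace with the ones actually downloaded):

```bash
# Merge the multi-part GGUF into a single file with llama.cpp's gguf-split tool
# (the shard filename below is a placeholder; point it at the first shard)
./llama-gguf-split --merge \
  DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
  DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf

# Register the merged file with Ollama via a minimal Modelfile, then run it
printf 'FROM ./DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf\n' > Modelfile
ollama create deepseek-v3-0324:ud-q2_k_xl -f Modelfile
ollama run deepseek-v3-0324:ud-q2_k_xl
```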
