GGUF usage with llama.cpp
You can now deploy any llama.cpp-compatible GGUF on Hugging Face Endpoints; read more about it here.
Llama.cpp lets you download and run inference on a GGUF simply by providing the Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and caches it automatically. The cache location is defined by the LLAMA_CACHE environment variable; read more about it here.
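For example, to keep downloaded checkpoints in a custom directory, you can point LLAMA_CACHE at it before running any of the commands below (the directory name here is only an illustration):

```bash
# Store downloaded GGUF checkpoints in a custom directory (path is illustrative)
export LLAMA_CACHE="$HOME/models/llama.cpp-cache"
mkdir -p "$LLAMA_CACHE"

# Any subsequent llama-cli / llama-server run that pulls a model from the Hub
# will reuse files from this cache instead of re-downloading them.
```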
You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images, which you can find in the official documentation.
Option 1: Install with brew
```bash
brew install llama.cpp
```
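A quick way to confirm the install worked is to check that the binaries are on your PATH and print their version (assuming your build exposes the standard --version flag):

```bash
# Check that the main binaries are available after the brew install
which llama-cli llama-server

# Print version/build information
llama-cli --version
```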
Option 2: build from source
Step 1: Clone llama.cpp from GitHub.
```bash
git clone https://github.com/ggerganov/llama.cpp
```
Step 2: Move into the llama.cpp folder and build it with the LLAMA_CURL=1 flag along with any hardware-specific flags (for example, LLAMA_CUDA=1 for Nvidia GPUs on Linux).
```bash
cd llama.cpp && LLAMA_CURL=1 make
```
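As a sketch, a Makefile-based build on a Linux box with an Nvidia GPU could combine the flags mentioned above like this (flag names may differ on newer, CMake-based versions of llama.cpp):

```bash
cd llama.cpp
# Enable Hub downloads (libcurl) and CUDA offloading,
# using all available cores for the build
LLAMA_CURL=1 LLAMA_CUDA=1 make -j
```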
Once installed, you can use llama-cli or llama-server as follows:
```bash
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```
Note: llama-cli runs in conversation (chat) mode by default; you can pass -no-cnv to run it in plain completion mode instead.
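For instance, a one-shot completion (no chat template) might look like the following; the prompt and token count are arbitrary, and -no-cnv assumes a llama.cpp version where conversation mode is the default:

```bash
# Run a plain completion: -p sets the prompt, -n limits generated tokens,
# -no-cnv disables the default conversation (chat) mode
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 \
  -p "The capital of France is" \
  -n 32 -no-cnv
```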
Additionally, you can invoke an OpenAI-compatible chat completions endpoint directly using the llama.cpp server:
```bash
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```
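If you need to change the bind address, port, or context size, llama-server accepts the usual options; for example (values here are only illustrative, and flags may vary slightly between versions):

```bash
# Serve the model on all interfaces, port 8080, with a 4096-token context window
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 \
  --host 0.0.0.0 --port 8080 -c 4096
```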
After running the server, you can query the endpoint as follows:
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfilment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
```
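Since the response follows the OpenAI chat completions schema, you can also pull out just the assistant's reply, for example with jq (assuming it is installed):

```bash
# Extract only the generated message text from the JSON response
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{"messages": [{"role": "user", "content": "Write a limerick about Python exceptions"}]}' \
  | jq -r '.choices[0].message.content'
```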
Replace the -hf value with any valid Hugging Face Hub repo name, and off you go! 🦙
Note: Remember to build llama.cpp with LLAMA_CURL=1 :)