❤️ a love letter to the OpenAI inference client

Published February 28, 2025

Hands down, my favorite OpenAI library is the Python inference client, mainly because it's becoming a universal interface for inference across the whole AI field. Just to name a few, it works with vLLM, TGI, Inference Providers, llama.cpp, LlamaStudio, and various cloud providers.

It's used so widely mostly because of its simplicity and compatibility. Other inference servers, both local and remote, can expose the same API interface as OpenAI's and instantly work with the client. So if you're in the business of building an inference server, it makes a lot of sense to be OpenAI-compatible.

In this blog post, I want to go through some of the cool stuff you can do with the OpenAI library, other than just calling OpenAI models 😛.

Hugging Face Inference

The Hub hosts one and a half million models, and many of them are available for inference, served either by Hugging Face itself or by other providers. Check out this guide if you haven't heard about third-party inference providers.

Here I can call DeepSeek's R1 model served by Fireworks AI. The request is routed through the Hub, but it starts with the OpenAI client. You can copy snippets like this from any model page on the Hub.

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/fireworks-ai",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx"
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]

completion = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-r1", 
    messages=messages, 
    max_tokens=500,
)

print(completion.choices[0].message)

Then, if I make these changes, I can switch over to Hyperbolic as the inference provider.

client = OpenAI(
-   base_url="https://router.huggingface.co/fireworks-ai",
+   base_url="https://router.huggingface.co/hyperbolic",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx"
)

I also need to change the model name when switching providers.

completion = client.chat.completions.create(
-   model="accounts/fireworks/models/deepseek-r1",
+   model="deepseek-ai/DeepSeek-R1",
    messages=messages,
    max_tokens=500,
)

The requests are routed by Hugging Face to the correct inference provider.
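
Putting the two changes together, the full Hyperbolic request is just the earlier snippet with the new base URL and the Hub-style model id swapped in:

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/hyperbolic",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx"
)

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
)

print(completion.choices[0].message)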

Text Generation Inference

If you want to run a model locally, you can use the Text Generation Inference (TGI) container.

volume=$PWD/data  # share a volume with the container to avoid re-downloading weights
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 \
    --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

Then you can use the OpenAI client to query it, even though it's running locally.

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)
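
Since stream=True, the call returns an iterator of chunks rather than a single response. A minimal way to print the streamed tokens (delta.content can be None on some chunks, hence the or ""):

for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")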

vLLM

Local model folk love to use vLLM for inference, and you can use the OpenAI client with your local vLLM server too.

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --dtype auto --api-key token-abc123

And in Python, you can request the local server like this:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)
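
vLLM also supports sampling parameters that aren't part of the OpenAI schema. As far as I know, the client's extra_body argument forwards them to the server untouched, so a sketch like this should work (top_k and repetition_penalty are vLLM-side parameters, so double-check the vLLM docs for your version):

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Hello!"}],
    # extra_body is sent verbatim in the request body, which is how
    # vLLM picks up sampling parameters beyond the OpenAI schema
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)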

There are obviously alternatives to this. Hugging Face's own inference clients in Python and JS offer integrations with a load of inference providers, and most inference providers have their own libraries that expose deeper integrations with their services.
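
For example, here's a rough sketch of the same DeepSeek-R1 request with huggingface_hub's InferenceClient; the provider argument and the exact method signature are from memory, so check the huggingface_hub docs before copying:

from huggingface_hub import InferenceClient

# provider picks which inference provider the Hub routes the request to
client = InferenceClient(provider="fireworks-ai", api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx")

completion = client.chat_completion(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
)

print(completion.choices[0].message)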

If you want to see it in action, check out the OpenAI library snippets on model pages.
