@MoritzLaurer on Hugging Face: "The new NIM Serverless API by HF and Nvidia is a great option if you want a…"

MoritzLaurer

posted an update Sep 23, 2024

The new NIM Serverless API by HF and Nvidia is a great option if you want a reliable API for open-weight LLMs like Llama-3.1-405B that are too expensive to run on your own hardware.

- It's pay-as-you-go, so it doesn't have rate limits like the standard HF Serverless API, and you don't need to commit to hardware as you would for a dedicated endpoint.
- It works out of the box with the new v0.25 release of our huggingface_hub.InferenceClient.
- It's specifically tailored to a small collection of popular open-weight models. For a broader selection of open models, we recommend using the standard HF Serverless API.
- Note that you need a token from an Enterprise Hub organization to use it.

Details in this blog post: https://huggingface.co/blog/inference-dgx-cloud
Compatible models in this HF collection: https://huggingface.co/collections/nvidia/nim-serverless-inference-api-66a3c6fcdcb5bbc6e975b508
Release notes with many more features of huggingface_hub==0.25.0: https://github.com/huggingface/huggingface_hub/releases/tag/v0.25.0

Copy-pasteable code in the first comment:

MoritzLaurer

Sep 23, 2024

#!pip install "huggingface_hub>=0.25.0"
from huggingface_hub import InferenceClient

# Point the client at the NIM Serverless API rather than the default HF Serverless API.
client = InferenceClient(
    base_url="https://huggingface.co/api/integrations/dgx/v1",
    api_key="MY_FINEGRAINED_ENTERPRISE_ORG_TOKEN",  # see docs: https://huggingface.co/blog/inference-dgx-cloud#create-a-fine-grained-token
)

# OpenAI-style chat completion request; the model must be one from the compatible-models collection.
output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    max_tokens=1024,
)

# Prints the full response object (choices, usage, metadata).
print(output)
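
If you only want the reply text and the token count rather than the whole response object, something like the following should work. This is a minimal sketch, assuming the response follows the OpenAI-style chat completion layout (`choices[0].message.content` and an optional `usage` field). `summarize_completion` is a hypothetical helper name, not part of huggingface_hub.

```python
def summarize_completion(output):
    """Return (reply_text, total_tokens) from a chat completion response.

    Assumes an OpenAI-style response object. `usage` is read defensively
    because it may be absent; in that case total_tokens is None.
    """
    reply = output.choices[0].message.content
    usage = getattr(output, "usage", None)
    total_tokens = getattr(usage, "total_tokens", None)
    return reply, total_tokens
```

With the `output` from the snippet above, `reply, total_tokens = summarize_completion(output)` gives you the assistant message as a plain string, which is handy if you pay per token and want to log usage.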

davanstrien

Sep 23, 2024

Very exciting to see this! I often want to use an LLM for a short period, and setting up a whole endpoint for this can be overkill. This seems like a very neat solution!

Do you think there is a chance that any VLMs will be added soon!?

jmparejaz

Dec 2, 2024

Hi @MoritzLaurer,

The API is currently not working for Llama 405B.
The other models are working; only that one fails.
Do you know whether this is a temporary issue or a permanent one?

MoritzLaurer

Dec 3, 2024

Hey @jmparejaz, did you use a token from an Enterprise Hub organization, as explained here: https://huggingface.co/blog/inference-dgx-cloud#create-a-fine-grained-token ?

borowis

Dec 17, 2024

Hey folks, is there a plan to add support for embedding (retrieval) models in the future? I'm not sure I follow how to run them using NIM Serverless.

MoritzLaurer

Dec 18, 2024

Hey @borowis, I don't think there is a plan to add embedding models to the NIM API. Embedding models are quite small, which makes them easier to run on accessible hardware (vs. the H100 GPUs running the large LLMs on the NIM API). I'd recommend deploying embedding models on a cheap GPU (or even a CPU) via the HF dedicated endpoints: https://huggingface.co/inference-endpoints/dedicated . You can also use the autoscaling/scale-to-zero feature to avoid unnecessary costs.
(The smaller BGE models from the MTEB leaderboard are always a good place to start.)
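
For reference, querying an embedding model on a dedicated endpoint could look roughly like this. It is a sketch under assumptions, not an official recipe: the endpoint URL and token are placeholders you would supply, `make_client`, `embed` and `cosine_similarity` are hypothetical helper names, and the code assumes the endpoint returns one pooled vector per input (either a flat vector or a single-row batch).

```python
import math


def make_client(endpoint_url, token):
    """Build a client for a dedicated endpoint (both arguments are placeholders you supply)."""
    from huggingface_hub import InferenceClient  # imported here so the helpers below work without it
    return InferenceClient(model=endpoint_url, token=token)


def embed(client, text):
    """Return the embedding of `text` as a flat list of floats.

    Assumes the endpoint returns a pooled sentence embedding, shaped
    either (dim,) or (1, dim); token-level outputs are not handled.
    """
    vector = client.feature_extraction(text)
    values = vector.tolist() if hasattr(vector, "tolist") else list(vector)
    if values and isinstance(values[0], list):
        values = values[0]  # unwrap a single-row batch
    return [float(x) for x in values]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

You would call `client = make_client("<your endpoint URL>", "<your token>")` once, then compare `embed(client, query)` against `embed(client, passage)` with `cosine_similarity` for a simple retrieval score.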
