Gaudi Backend for Text Generation Inference

Overview

Text Generation Inference (TGI) has been optimized to run on Gaudi hardware via the Gaudi backend for TGI.

Supported Hardware

TGI on Gaudi targets Intel Gaudi accelerators, such as Gaudi2 and Gaudi3.

Tutorial: Getting Started with TGI on Gaudi

Basic Usage

The easiest way to run TGI on Gaudi is to use the official Docker image:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_HF_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
    --model-id $model

Once you see the Connected log line, the server is ready to accept requests:

2024-05-22T19:31:48.302239Z INFO text_generation_router: router/src/main.rs:378: Connected
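
If you prefer a scripted readiness check, the server also exposes a /health endpoint; a minimal sketch, assuming the default port mapping above:

curl 127.0.0.1:8080/health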

You can create your Hugging Face access token at https://huggingface.co/settings/tokens and use it as YOUR_HF_ACCESS_TOKEN. A token is required to access gated models such as Llama 3.1.

Making Your First Request

You can send a request from a separate terminal:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'
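
TGI also exposes an OpenAI-compatible Messages API on the same server. A minimal sketch of the same question sent as a chat request (here "tgi" is used as a placeholder for the model field, since the server serves whatever model it was launched with):

curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -d '{"model":"tgi","messages":[{"role":"user","content":"What is Deep Learning?"}],"max_tokens":32}' \
    -H 'Content-Type: application/json'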

How-to Guides

You can view the full list of supported models in the Supported Models section.

For example, to run Llama3.1-8B, you can use the following command:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
hf_token=YOUR_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
    --model-id $model \
    <text-generation-inference-launcher-arguments>

For the full list of service parameters, refer to the launcher-arguments page.
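
You can also print these arguments from the image itself; a minimal sketch, assuming the image's entrypoint is the TGI launcher:

docker run --rm ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi --help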

The validated docker commands can be found in the examples/docker_commands folder.

Note: --runtime=habana --cap-add=sys_nice --ipc=host is required to enable docker to use the Gaudi hardware (more details here).

How to Enable Multi-Card Inference (Sharding)

TGI-Gaudi supports sharding for multi-card inference, allowing you to distribute the load across multiple Gaudi cards. This is recommended to run large models and to speed up inference.

For example, on a machine with 8 Gaudi cards, you can run:

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
    --model-id $model --sharded true --num-shard 8

We recommend always using sharding when running on a multi-card machine.
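
To confirm how many Gaudi cards are visible on the host before choosing --num-shard, you can use the Habana management tool; a minimal sketch, assuming the Gaudi drivers and hl-smi are installed on the host:

hl-smi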

How to Use Different Precision Formats

BF16 Precision (Default)

By default, all models run with BF16 precision on Gaudi hardware.

FP8 Precision

TGI-Gaudi supports FP8 precision inference with Intel Neural Compressor (INC), which can significantly reduce memory usage and improve performance for large models. We support models with W8A8 FP8 compressed-tensors parameters, such as RedHatAI/Mixtral-8x7B-Instruct-v0.1-FP8, as well as AutoFP8-generated models, such as RedHatAI/Meta-Llama-3-8B-Instruct-FP8.
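
As a sketch of what launching one of these FP8 checkpoints can look like (illustrative only; it reuses the volume and token variables from the earlier examples and assumes the quantization configuration embedded in the checkpoint is picked up automatically):

model=RedHatAI/Meta-Llama-3-8B-Instruct-FP8
volume=$PWD/data
hf_token=YOUR_ACCESS_TOKEN

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
    --model-id $model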

How to Run Vision-Language Models (VLMs)

Gaudi supports VLM inference.

Example for Llava-v1.6-Mistral-7B on 1 card:

Start the TGI server via the following command:

model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data   # share a volume with the Docker container to avoid downloading weights every run

docker run -p 8080:80 \
   --runtime=habana \
   --cap-add=sys_nice \
   --ipc=host \
   -v $volume:/data \
   ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
   --model-id $model \
   --max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
   --max-total-tokens 8192 --max-batch-size 4

You can then send a request to the server via the following command:

curl -N 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'

Note: In Llava-v1.6-Mistral-7B, an image usually accounts for 2000 input tokens; for example, an image of size 512x512 is represented by 2800 tokens. max-input-tokens must therefore be larger than the number of tokens associated with the image, otherwise the image may be truncated. The value of max-batch-prefill-tokens is 16384, calculated as max-batch-prefill-tokens = prefill_batch_size * max-input-tokens, i.e. 4 * 4096 = 16384 in this example.

How to Benchmark Performance

We recommend using the inference-benchmarker tool to benchmark performance on Gaudi hardware.

This benchmark tool simulates user requests and measures the performance of the model in realistic scenarios.

To run it on the same machine, you can do the following:

MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your HF READ token>
# run a benchmark to evaluate the performance of the model for chat use case
# we mount results to the current directory
docker run \
    --rm \
    -it \
    --net host \
    -v $(pwd):/opt/inference-benchmarker/results \
    -e "HF_TOKEN=$HF_TOKEN" \
    ghcr.io/huggingface/inference-benchmarker:latest \
    inference-benchmarker \
    --tokenizer-name "$MODEL" \
    --url http://localhost:8080 \
    --profile chat

Please refer to the inference-benchmarker README for more details.

Explanation: Understanding TGI on Gaudi

The Warmup Process

Intel Gaudi accelerators perform best when operating on models with fixed tensor shapes. Intel Gaudi Graph Compiler generates optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be highly dependent on input and output tensor shapes, requiring graph recompilation when encountering tensors with different shapes within the same topology. While these binaries efficiently utilize Gaudi, the compilation process itself can introduce noticeable overhead in end-to-end execution. In dynamic inference serving scenarios, minimizing the number of graph compilations and reducing the risk of graph compilation occurring during server runtime is important.

To ensure optimal performance, warmup is performed at the beginning of each server run. This process creates queries with various input shapes based on provided parameters and runs basic TGI operations (prefill, decode).

Note: Model warmup can take several minutes, especially for FP8 inference. For faster subsequent runs, refer to Disk Caching Eviction Policy.

Understanding Parameter Tuning

Sequence Length Parameters

  • --max-input-tokens is the maximum possible input prompt length. Default value is 4095.
  • --max-total-tokens is the maximum possible total length of the sequence (input and output). Default value is 4096.

Batch Size Parameters

  • For the prefill operation, set --max-batch-prefill-tokens to bs * max-input-tokens, where bs is your expected maximum prefill batch size (see the sketch after this list).
  • For the decode operation, set --max-batch-size to bs, where bs is your expected maximum decode batch size.
  • Please note that the batch size is always padded up to the nearest shape that has been warmed up. This avoids out-of-memory issues and ensures that compiled graphs are reused efficiently.
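
Putting these rules together, a minimal sketch (values are illustrative): for an expected maximum input of 1024 tokens, a prefill batch size of 4, and a decode batch size of 16, max-batch-prefill-tokens would be 4 * 1024 = 4096:

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
    --model-id $model \
    --max-input-tokens 1024 --max-total-tokens 2048 \
    --max-batch-prefill-tokens 4096 --max-batch-size 16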

Reference

This section contains reference information about the Gaudi backend.

Supported Models

Text Generation Inference enables serving optimized models on Gaudi hardware. The following sections list which models (VLMs & LLMs) are supported on Gaudi.

Large Language Models (LLMs)

Vision-Language Models (VLMs)

If you have an issue with a model, please open an issue on the Gaudi backend repository.

Environment Variables

The following table contains the environment variables that can be used to configure the Gaudi backend:

Name | Value(s) | Default | Description | Usage
LIMIT_HPU_GRAPH | True/False | True | Skip HPU graph usage for prefill to save memory; set to True for large sequence/decode lengths (e.g. 300/212) | add -e in docker run command
SKIP_TOKENIZER_IN_TGI | True/False | False | Skip the tokenizer for input/output processing | add -e in docker run command
VLLM_SKIP_WARMUP | True/False | False | Skip graph warmup during server initialization; not recommended, but can be used for debugging | add -e in docker run command
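
For example, a minimal sketch of passing one of these variables when starting the container (here skipping warmup for a quick debugging run, which is not recommended for production):

docker run --runtime=habana --cap-add=sys_nice --ipc=host \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    -e VLLM_SKIP_WARMUP=true \
    ghcr.io/huggingface/text-generation-inference:3.3.4-gaudi \
    --model-id $model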

Contributing

Contributions to the TGI-Gaudi project are welcome. Please refer to the contributing guide.

Guidelines for contributing to Gaudi on TGI: All changes should be made within the backends/gaudi folder. In general, you should avoid modifying the router, launcher, or benchmark to accommodate Gaudi hardware, as all Gaudi-specific logic should be contained within the backends/gaudi folder.

Building the Docker Image from Source

To build the Docker image from source:

make -C backends/gaudi image

This builds the image and saves it as tgi-gaudi. You can then run TGI-Gaudi with this image:

model=meta-llama/Meta-Llama-3.1-8B-Instruct
volume=$PWD/data
hf_token=YOUR_ACCESS_TOKEN

docker run --runtime=habana --ipc=host --cap-add=sys_nice \
    -p 8080:80 -v $volume:/data -e HF_TOKEN=$hf_token \
    tgi-gaudi \
    --model-id $model

For more details, see the README of the Gaudi backend and the Makefile of the Gaudi backend.
