
MERaLiON vLLM serving

Set up Environment

MERaLiON-AudioLLM requires vLLM version 0.6.4.post1 and transformers version 4.46.3:

pip install vllm==0.6.4.post1
pip install transformers==4.46.3
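To confirm the pinned versions are actually installed, you can query the package metadata (a small helper sketch; the `check` function is illustrative, not part of MERaLiON):

```python
# Sanity-check installed package versions against the pinned ones.
from importlib.metadata import version, PackageNotFoundError


def check(pkg, expected):
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        return f"{pkg}: not installed"
    return f"{pkg}: {installed} (expected {expected})"


print(check("vllm", "0.6.4.post1"))
print(check("transformers", "4.46.3"))
```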

As the vLLM documentation recommends, we provide a way to register our model via vLLM plugins.

pip install .

Serving

Here is an example of starting the server with the vllm serve command.

export HF_TOKEN=<your-hf-token>

vllm serve MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --tokenizer MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --max-num-seqs 8 --trust-remote-code --dtype bfloat16 --port 8000
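Before sending requests, you can check that the server is up by listing the models it exposes (a sketch assuming the default port 8000 from the command above; `list_models` is a hypothetical helper that returns an empty list if the server is not reachable):

```python
# Readiness check: query the OpenAI-compatible /v1/models endpoint.
import json
import urllib.error
import urllib.request


def list_models(base_url="http://localhost:8000/v1"):
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return [m["id"] for m in json.load(resp)["data"]]
    except (urllib.error.URLError, json.JSONDecodeError, KeyError):
        # Server not reachable (or unexpected response) -- not ready yet.
        return []


print(list_models())
```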

To call the server, you can use the official OpenAI client:

import base64

from openai import OpenAI


def get_client(api_key="EMPTY", base_url="http://localhost:8000/v1"):
    client = OpenAI(
        api_key=api_key,
        base_url=base_url,
    )

    models = client.models.list()
    model_name = models.data[0].id
    return client, model_name


def get_response(text_input, base64_audio_input, **params):
    # Relies on the module-level `client` created by get_client() below.
    response_obj = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Text instruction: {text_input}"
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": f"data:audio/ogg;base64,{base64_audio_input}"
                    },
                },
            ],
        }],
        **params
    )
    return response_obj


# Specify the input and generation parameters.
possible_text_inputs = [
    "Please transcribe this speech.",
    "Please summarise the content of this speech.",
    "Please follow the instruction in this speech."
]

with open("/path/to/wav/or/mp3/file", "rb") as audio_file:
    audio_bytes = audio_file.read()
audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')

# Use the port number of the vLLM service.
client, model_name = get_client(base_url="http://localhost:8000/v1")

generation_parameters = dict(
    model=model_name,
    max_completion_tokens=1024,
    temperature=0.1,
    top_p=0.9,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 50,
        "length_penalty": 1.0
    },
    seed=42
)


response_obj = get_response(possible_text_inputs[0], audio_base64, **generation_parameters)
print(response_obj.choices[0].message.content)

Alternatively, you can call the server with a curl command.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "Text instruction: <your-command>"}, {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,<audio base64>"}}]}
        ]
    }'
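For reference, the same request body can be assembled in Python and printed as JSON, e.g. to pipe into curl -d @- (a sketch; the role is "user", matching the Python client example above, and the angle-bracket values are placeholders):

```python
# Build the chat-completions request body used in the curl example.
import json

payload = {
    "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Text instruction: <your-command>"},
            {
                "type": "audio_url",
                "audio_url": {"url": "data:audio/ogg;base64,<audio base64>"},
            },
        ],
    }],
}

print(json.dumps(payload, indent=2))
```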