## MERaLiON vLLM serving
### Set up Environment
MERaLiON-AudioLLM requires vLLM version `0.6.4.post1` and transformers version `4.46.3`:
```bash
pip install vllm==0.6.4.post1
pip install transformers==4.46.3
```
As the [vLLM documentation](https://docs.vllm.ai/en/stable/models/adding_model.html#out-of-tree-model-integration) recommends,
we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/en/stable/design/plugin_system.html#plugin-system).
```bash
pip install .
```
### Serving
Here is an example to start the server via the `vllm serve` command.
```bash
export HF_TOKEN=<your-hf-token>
vllm serve MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --tokenizer MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --max-num-seqs 8 --trust-remote-code --dtype bfloat16 --port 8000
```
To call the server, you can use the [official OpenAI client](https://github.com/openai/openai-python):
```python
import base64

from openai import OpenAI


def get_client(api_key="EMPTY", base_url="http://localhost:8000/v1"):
    client = OpenAI(
        api_key=api_key,
        base_url=base_url,
    )
    models = client.models.list()
    model_name = models.data[0].id
    return client, model_name


def get_response(text_input, base64_audio_input, **params):
    # Uses the global `client` created by get_client() below.
    response_obj = client.chat.completions.create(
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Text instruction: {text_input}"
                },
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": f"data:audio/ogg;base64,{base64_audio_input}"
                    },
                },
            ],
        }],
        **params
    )
    return response_obj


# Specify the input and generation parameters.
possible_text_inputs = [
    "Please transcribe this speech.",
    "Please summarise the content of this speech.",
    "Please follow the instruction in this speech."
]

audio_bytes = open("/path/to/wav/or/mp3/file", "rb").read()
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# Use the port number of your vLLM service.
client, model_name = get_client(base_url="http://localhost:8000/v1")

generation_parameters = dict(
    model=model_name,
    max_completion_tokens=1024,
    temperature=0.1,
    top_p=0.9,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 50,
        "length_penalty": 1.0
    },
    seed=42
)

response_obj = get_response(possible_text_inputs[0], audio_base64, **generation_parameters)
print(response_obj.choices[0].message.content)
```
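The base64 payload above can be produced from raw audio bytes with the standard library alone. A minimal, self-contained sketch, using a synthetic in-memory silent WAV as a stand-in for a real recording (only the encoding step is the same as in the client code above):

```python
import base64
import io
import wave


def encode_audio(audio_bytes: bytes) -> str:
    """Base64-encode raw audio bytes for use in a data URL."""
    return base64.b64encode(audio_bytes).decode("utf-8")


# Build a tiny 0.1-second silent 16 kHz mono WAV in memory,
# standing in for a file read with open("...", "rb").read().
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)

audio_b64 = encode_audio(buf.getvalue())
data_url = f"data:audio/wav;base64,{audio_b64}"
print(data_url[:40])
```

For a real file, replace the in-memory WAV with the bytes read from disk, and match the MIME type in the data URL to the file format.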
Alternatively, you can call the server with a `curl` command.
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Text instruction: <your-command>"}, {"type":"audio_url", "audio_url": {"url": "data:audio/ogg;base64,<audio base64>"}}]}
  ]
}'
```
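When scripting, the same request body can be assembled and serialized with the standard library before handing it to `curl -d` or an HTTP client. A small sketch that only builds and prints the JSON; the `<your-command>` and `<audio base64>` placeholders are left as-is and would be filled in by your code:

```python
import json

# Mirror of the curl payload above; the angle-bracket values are placeholders.
payload = {
    "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Text instruction: <your-command>"},
                {
                    "type": "audio_url",
                    "audio_url": {"url": "data:audio/ogg;base64,<audio base64>"},
                },
            ],
        }
    ],
}

# Serialize to the JSON string that would be passed to curl's -d flag.
body = json.dumps(payload, indent=2)
print(body)
```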