refine vllm guide
- README.md +2 -90
- vllm_plugin_meralion/README.md +72 -5
README.md
CHANGED
@@ -55,7 +55,7 @@ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**ear
|
|
55 |
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
|
56 |
- **Demo:** [MERaLiON-AudioLLM Web Demo](https://huggingface.co/spaces/MERaLiON/MERaLiON-AudioLLM)
|
57 |
|
58 |
-
We support model inference using the [Huggingface](#inference) and [vLLM](
|
59 |
|
60 |
## Acknowledgement
|
61 |
This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
|
@@ -497,95 +497,7 @@ response = processor.batch_decode(generated_ids, skip_special_tokens=True)
|
|
497 |
|
498 |
### vLLM Inference
|
499 |
|
500 |
-
|
501 |
-
|
502 |
-
```bash
|
503 |
-
pip install vllm==0.6.4.post1
|
504 |
-
```
|
505 |
-
|
506 |
-
#### Model Registration
|
507 |
-
|
508 |
-
As the [vLLM documentation](https://docs.vllm.ai/en/stable/models/adding_model.html#out-of-tree-model-integration) recommends,
|
509 |
-
we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/en/stable/design/plugin_system.html#plugin-system).
|
510 |
-
|
511 |
-
```bash
|
512 |
-
cd vllm_plugin_meralion
|
513 |
-
python install .
|
514 |
-
```
|
515 |
-
|
516 |
-
#### vLLM Offline Inference
|
517 |
-
|
518 |
-
Here is an example of offline inference using our custom vLLM class.
|
519 |
-
|
520 |
-
```python
|
521 |
-
import torch
|
522 |
-
from vllm import ModelRegistry, LLM, SamplingParams
|
523 |
-
from vllm.assets.audio import AudioAsset
|
524 |
-
|
525 |
-
def run_meralion(question: str):
|
526 |
-
model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"
|
527 |
-
|
528 |
-
llm = LLM(model=model_name,
|
529 |
-
tokenizer=model_name,
|
530 |
-
max_num_seqs=8,
|
531 |
-
limit_mm_per_prompt={"audio": 1},
|
532 |
-
trust_remote_code=True,
|
533 |
-
dtype=torch.bfloat16
|
534 |
-
)
|
535 |
-
|
536 |
-
audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"
|
537 |
-
|
538 |
-
prompt = ("<start_of_turn>user\n"
|
539 |
-
f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
|
540 |
-
"<start_of_turn>model\n")
|
541 |
-
stop_token_ids = None
|
542 |
-
return llm, prompt, stop_token_ids
|
543 |
-
|
544 |
-
audio_asset = AudioAsset("mary_had_lamb")
|
545 |
-
question= "Please trancribe this speech."
|
546 |
-
|
547 |
-
llm, prompt, stop_token_ids = run_meralion(question)
|
548 |
-
|
549 |
-
# We set temperature to 0.2 so that outputs can be different
|
550 |
-
# even when all prompts are identical when running batch inference.
|
551 |
-
sampling_params = SamplingParams(
|
552 |
-
temperature=0.1,
|
553 |
-
top_p=0.9,
|
554 |
-
top_k=50,
|
555 |
-
repetition_penalty=1.1,
|
556 |
-
seed=42,
|
557 |
-
max_tokens=1024,
|
558 |
-
stop_token_ids=None
|
559 |
-
)
|
560 |
-
|
561 |
-
mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
|
562 |
-
inputs = {"prompt": prompt, "multi_modal_data": mm_data}
|
563 |
-
|
564 |
-
# batch inference
|
565 |
-
inputs = [inputs] * 2
|
566 |
-
|
567 |
-
outputs = llm.generate(inputs, sampling_params=sampling_params)
|
568 |
-
|
569 |
-
for o in outputs:
|
570 |
-
generated_text = o.outputs[0].text
|
571 |
-
print(generated_text)
|
572 |
-
```
|
573 |
-
|
574 |
-
#### OpenAI Compatible Server
|
575 |
-
|
576 |
-
**server**
|
577 |
-
|
578 |
-
Here is an example to start the server via the `vllm serve` command.
|
579 |
-
|
580 |
-
```bash
|
581 |
-
export HF_TOKEN=your-hf-token
|
582 |
-
|
583 |
-
vllm serve MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --tokenizer MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --max-num-seqs 8 --trust-remote-code --dtype bfloat16
|
584 |
-
```
|
585 |
-
|
586 |
-
**client**
|
587 |
-
|
588 |
-
Refer to official vLLM example [code](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py#L213-L236).
|
589 |
|
590 |
## Disclaimer
|
591 |
|
|
|
55 |
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
|
56 |
- **Demo:** [MERaLiON-AudioLLM Web Demo](https://huggingface.co/spaces/MERaLiON/MERaLiON-AudioLLM)
|
57 |
|
58 |
+
We support model inference using the [Hugging Face](#inference) and [vLLM](vllm_plugin_meralion/README.md) frameworks. For more technical details, please refer to our [technical report](https://arxiv.org/abs/2412.09818).
|
59 |
|
60 |
## Acknowledgement
|
61 |
This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
|
|
|
497 |
|
498 |
### vLLM Inference
|
499 |
|
500 |
+
We support hosting the model with the vLLM framework. Refer to the guide [here](vllm_plugin_meralion/README.md).
|
501 |
|
502 |
## Disclaimer
|
503 |
|
vllm_plugin_meralion/README.md
CHANGED
@@ -1,4 +1,4 @@
|
|
1 |
-
## MERaLiON vLLM
|
2 |
|
3 |
### Set up Environment
|
4 |
|
@@ -17,6 +17,65 @@ we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/e
|
|
17 |
pip install .
|
18 |
```
|
19 |
|
20 |
### Serving
|
21 |
|
22 |
Here is an example to start the server via the `vllm serve` command.
|
@@ -100,7 +159,7 @@ response_obj = get_response(possible_text_inputs[0], audio_base64, **generation_
|
|
100 |
print(response_obj.choices[0].message.content)
|
101 |
```
|
102 |
|
103 |
-
Alternatively, can try calling the server with curl
|
104 |
|
105 |
```bash
|
106 |
curl http://localhost:8000/v1/chat/completions \
|
@@ -108,8 +167,16 @@ curl http://localhost:8000/v1/chat/completions \
|
|
108 |
-d '{
|
109 |
"model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
|
110 |
"messages": [
|
111 |
-
{"role": "
|
112 |
-
|
113 |
}'
|
114 |
-
|
115 |
```
|
|
|
1 |
+
## MERaLiON-AudioLLM vLLM Serving
|
2 |
|
3 |
### Set up Environment
|
4 |
|
|
|
17 |
pip install .
|
18 |
```
|
19 |
|
20 |
+
|
21 |
+
### Offline Inference
|
22 |
+
|
23 |
+
Here is an example of offline inference using our custom vLLM class.
|
24 |
+
|
25 |
+
```python
import torch
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset


def run_meralion(question: str):
    model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"

    # Load the model; each prompt may carry at most one audio clip.
    llm = LLM(
        model=model_name,
        tokenizer=model_name,
        max_num_seqs=8,
        limit_mm_per_prompt={"audio": 1},
        trust_remote_code=True,
        dtype=torch.bfloat16,
    )

    audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"

    prompt = ("<start_of_turn>user\n"
              f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
              "<start_of_turn>model\n")
    stop_token_ids = None
    return llm, prompt, stop_token_ids


audio_asset = AudioAsset("mary_had_lamb")
question = "Please transcribe this speech."

llm, prompt, stop_token_ids = run_meralion(question)

# We set a low, non-zero temperature (0.1) so that outputs can be different
# even when all prompts are identical when running batch inference.
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    seed=42,
    max_tokens=1024,
    stop_token_ids=stop_token_ids,
)

mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
inputs = {"prompt": prompt, "multi_modal_data": mm_data}

# batch inference: submit the same request twice
inputs = [inputs] * 2

outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```
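The offline example above sends the same prompt twice. As a minimal sketch of how a batch with different instructions might be prepared, the helper below factors the prompt template out of `run_meralion`. The names `build_prompt` and `AUDIO_PLACEHOLDER` are hypothetical and not part of the plugin; only the template text is taken from the example.

```python
# Hypothetical helper: reproduces the prompt template from the offline example.
AUDIO_PLACEHOLDER = "<SpeechHere>"


def build_prompt(question: str) -> str:
    """Wrap a text instruction in the chat template used in the example."""
    return (
        "<start_of_turn>user\n"
        f"Given the following audio context: {AUDIO_PLACEHOLDER}\n\n"
        f"Text instruction: {question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


# One prompt per instruction; each would be paired with its own audio input.
questions = [
    "Please transcribe this speech.",
    "Please summarise this speech.",
]
prompts = [build_prompt(q) for q in questions]
```

Each resulting string could then be paired with its `multi_modal_data` entry in the same way as `inputs` in the example.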
|
78 |
+
|
79 |
### Serving
|
80 |
|
81 |
Here is an example to start the server via the `vllm serve` command.
|
|
|
159 |
print(response_obj.choices[0].message.content)
|
160 |
```
|
161 |
|
162 |
+
Alternatively, you can call the server with curl, as in the example below. We recommend including the generation config in the JSON body to fully reproduce the reported performance.
|
163 |
|
164 |
```bash
|
165 |
curl http://localhost:8000/v1/chat/completions \
|
|
|
167 |
-d '{
|
168 |
"model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
|
169 |
"messages": [
|
170 |
+
{"role": "user",
|
171 |
+
"content": [
|
172 |
+
{"type": "text", "text": "Text instruction: <your-instruction>"},
|
173 |
+
{"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,<your-audio-base64-string>"}}
|
174 |
+
]
|
175 |
+
}
|
176 |
+
],
|
177 |
+
"max_completion_tokens": 1024,
|
178 |
+
"temperature": 0.1,
|
179 |
+
"top_p": 0.9,
|
180 |
+
"seed": 42
|
181 |
}'
|
|
|
182 |
```
|
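The JSON body in the curl example needs the audio as a base64 string inside a data URL. The following is a minimal sketch of how that body might be assembled in Python using only the standard library. The function name `build_chat_payload` is hypothetical, and the placeholder bytes stand in for a real audio file; the field names and generation settings mirror the curl example.

```python
import base64
import json


def build_chat_payload(audio_bytes: bytes, instruction: str) -> dict:
    """Build a chat-completions request body mirroring the curl example."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Text instruction: {instruction}"},
                    {
                        "type": "audio_url",
                        "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"},
                    },
                ],
            }
        ],
        # Generation config copied from the curl example.
        "max_completion_tokens": 1024,
        "temperature": 0.1,
        "top_p": 0.9,
        "seed": 42,
    }


# Placeholder bytes; in practice read them with open(path, "rb").read().
payload = build_chat_payload(b"fake-audio-bytes", "Please transcribe this speech.")
body = json.dumps(payload)
```

The serialized `body` could be written to a file and passed to curl with `-d @file`, which avoids pasting a long base64 string on the command line.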