YingxuHe committed on
Commit 251bc20 · 1 Parent(s): 1ee1019

refine vllm guide

Files changed (2)
  1. README.md +2 -90
  2. vllm_plugin_meralion/README.md +72 -5
README.md CHANGED
@@ -55,7 +55,7 @@ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**ear
55
  - **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
56
  - **Demo:** [MERaLiON-AudioLLM Web Demo](https://huggingface.co/spaces/MERaLiON/MERaLiON-AudioLLM)
57
 
58
- We support model inference using the [Huggingface](#inference) and [vLLM](#vllm-inference) frameworks. For more technical details, please refer to our [technical report](https://arxiv.org/abs/2412.09818).
59
 
60
  ## Acknowledgement
61
  This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
@@ -497,95 +497,7 @@ response = processor.batch_decode(generated_ids, skip_special_tokens=True)
497
 
498
  ### vLLM Inference
499
 
500
- MERaLiON-AudioLLM requires vLLM version `0.6.4.post1`.
501
-
502
- ```bash
503
- pip install vllm==0.6.4.post1
504
- ```
505
-
506
- #### Model Registration
507
-
508
- As the [vLLM documentation](https://docs.vllm.ai/en/stable/models/adding_model.html#out-of-tree-model-integration) recommends,
509
- we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/en/stable/design/plugin_system.html#plugin-system).
510
-
511
- ```bash
512
- cd vllm_plugin_meralion
513
 - pip install .
514
- ```
515
-
516
- #### vLLM Offline Inference
517
-
518
- Here is an example of offline inference using our custom vLLM class.
519
-
520
- ```python
521
- import torch
522
- from vllm import ModelRegistry, LLM, SamplingParams
523
- from vllm.assets.audio import AudioAsset
524
-
525
- def run_meralion(question: str):
526
- model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"
527
-
528
- llm = LLM(model=model_name,
529
- tokenizer=model_name,
530
- max_num_seqs=8,
531
- limit_mm_per_prompt={"audio": 1},
532
- trust_remote_code=True,
533
- dtype=torch.bfloat16
534
- )
535
-
536
- audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"
537
-
538
- prompt = ("<start_of_turn>user\n"
539
- f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
540
- "<start_of_turn>model\n")
541
- stop_token_ids = None
542
- return llm, prompt, stop_token_ids
543
-
544
- audio_asset = AudioAsset("mary_had_lamb")
545
 - question = "Please transcribe this speech."
546
-
547
- llm, prompt, stop_token_ids = run_meralion(question)
548
-
549
 - # We set temperature to 0.1 so that outputs can be different
550
- # even when all prompts are identical when running batch inference.
551
- sampling_params = SamplingParams(
552
- temperature=0.1,
553
- top_p=0.9,
554
- top_k=50,
555
- repetition_penalty=1.1,
556
- seed=42,
557
- max_tokens=1024,
558
- stop_token_ids=None
559
- )
560
-
561
- mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
562
- inputs = {"prompt": prompt, "multi_modal_data": mm_data}
563
-
564
- # batch inference
565
- inputs = [inputs] * 2
566
-
567
- outputs = llm.generate(inputs, sampling_params=sampling_params)
568
-
569
- for o in outputs:
570
- generated_text = o.outputs[0].text
571
- print(generated_text)
572
- ```
573
-
574
- #### OpenAI Compatible Server
575
-
576
- **server**
577
-
578
- Here is an example to start the server via the `vllm serve` command.
579
-
580
- ```bash
581
- export HF_TOKEN=your-hf-token
582
-
583
- vllm serve MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --tokenizer MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --max-num-seqs 8 --trust-remote-code --dtype bfloat16
584
- ```
585
-
586
- **client**
587
-
588
 - Refer to the official vLLM example [code](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py#L213-L236).
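For quick reference, here is a minimal client sketch using the `openai` Python package against a locally running server on the default port 8000. The `audio_url` content part follows the same convention as the curl example in the plugin README; the file path and instruction text are placeholders.

```python
import base64
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Encode a local audio file as a base64 data URL (path is a placeholder).
with open("/path/to/your_audio.ogg", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Text instruction: Please transcribe this speech."},
            {"type": "audio_url", "audio_url": {"url": f"data:audio/ogg;base64,{audio_base64}"}},
        ],
    }],
    temperature=0.1,
    top_p=0.9,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```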
589
 
590
  ## Disclaimer
591
 
 
55
  - **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
56
  - **Demo:** [MERaLiON-AudioLLM Web Demo](https://huggingface.co/spaces/MERaLiON/MERaLiON-AudioLLM)
57
 
58
+ We support model inference using the [Huggingface](#inference) and [vLLM](vllm_plugin_meralion/README.md) frameworks. For more technical details, please refer to our [technical report](https://arxiv.org/abs/2412.09818).
59
 
60
  ## Acknowledgement
61
  This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
 
497
 
498
  ### vLLM Inference
499
 
500
+ We support hosting the model with the vLLM framework. Refer to the guide [here](vllm_plugin_meralion/README.md).
501
 
502
  ## Disclaimer
503
 
vllm_plugin_meralion/README.md CHANGED
@@ -1,4 +1,4 @@
1
- ## MERaLiON vLLM serving
2
 
3
  ### Set up Environment
4
 
@@ -17,6 +17,65 @@ we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/e
17
  pip install .
18
  ```
19
 
20
  ### Serving
21
 
22
  Here is an example to start the server via the `vllm serve` command.
@@ -100,7 +159,7 @@ response_obj = get_response(possible_text_inputs[0], audio_base64, **generation_
100
  print(response_obj.choices[0].message.content)
101
  ```
102
 
103
- Alternatively, can try calling the server with curl command.
104
 
105
  ```bash
106
  curl http://localhost:8000/v1/chat/completions \
@@ -108,8 +167,16 @@ curl http://localhost:8000/v1/chat/completions \
108
  -d '{
109
  "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
110
  "messages": [
111
- {"role": "system", "content": [{"type": "text", "text": "Text instruction: <your-command>"}, {"type":"audio_url", "audio_url": {"url": "data:audio/ogg;base64,<audio base64>"}}]},
112
- ]
113
  }'
114
-
115
  ```
 
1
+ ## MERaLiON-AudioLLM vLLM Serving
2
 
3
  ### Set up Environment
4
 
 
17
  pip install .
18
  ```
19
 
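For context, a vLLM plugin of this kind works by exposing an entry point in the `vllm.general_plugins` group that registers the out-of-tree model class with vLLM's `ModelRegistry`. Below is a minimal sketch of what such an entry point typically looks like; the module path and architecture name are illustrative, not the actual package layout.

```python
# vllm_plugin_meralion/__init__.py (illustrative sketch)
from vllm import ModelRegistry


def register():
    """Entry point declared in setup.py under the "vllm.general_plugins" group,
    e.g. register_meralion = vllm_plugin_meralion:register (names are illustrative)."""
    # Map the architecture name found in the model's config.json to the custom
    # implementation, using a lazy "module:Class" reference so vLLM can import it
    # only when needed.
    if "MERaLiONForConditionalGeneration" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MERaLiONForConditionalGeneration",
            "vllm_plugin_meralion.modeling_meralion:MERaLiONForConditionalGeneration",
        )
```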
20
+
21
+ ### Offline Inference
22
+
23
+ Here is an example of offline inference using our custom vLLM class.
24
+
25
+ ```python
26
+ import torch
27
+ from vllm import ModelRegistry, LLM, SamplingParams
28
+ from vllm.assets.audio import AudioAsset
29
+
30
+ def run_meralion(question: str):
31
+ model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"
32
+
33
+ llm = LLM(model=model_name,
34
+ tokenizer=model_name,
35
+ max_num_seqs=8,
36
+ limit_mm_per_prompt={"audio": 1},
37
+ trust_remote_code=True,
38
+ dtype=torch.bfloat16
39
+ )
40
+
41
+ audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"
42
+
43
+ prompt = ("<start_of_turn>user\n"
44
+ f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
45
+ "<start_of_turn>model\n")
46
+ stop_token_ids = None
47
+ return llm, prompt, stop_token_ids
48
+
49
+ audio_asset = AudioAsset("mary_had_lamb")
50
+ question = "Please transcribe this speech."
51
+
52
+ llm, prompt, stop_token_ids = run_meralion(question)
53
+
54
+ # We set temperature to 0.1 so that outputs can be different
55
+ # even when all prompts are identical when running batch inference.
56
+ sampling_params = SamplingParams(
57
+ temperature=0.1,
58
+ top_p=0.9,
59
+ top_k=50,
60
+ repetition_penalty=1.1,
61
+ seed=42,
62
+ max_tokens=1024,
63
+ stop_token_ids=None
64
+ )
65
+
66
+ mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
67
+ inputs = {"prompt": prompt, "multi_modal_data": mm_data}
68
+
69
+ # batch inference
70
+ inputs = [inputs] * 2
71
+
72
+ outputs = llm.generate(inputs, sampling_params=sampling_params)
73
+
74
+ for o in outputs:
75
+ generated_text = o.outputs[0].text
76
+ print(generated_text)
77
+ ```
78
+
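To run the same offline example on your own recording instead of the bundled asset, you can pass a `(waveform, sample_rate)` tuple directly, which is what `audio_asset.audio_and_sample_rate` returns above. A minimal sketch, assuming `librosa` is installed and the file path is a placeholder:

```python
import librosa

# Load a local file as a mono float32 waveform at 16 kHz
# (Whisper-style speech encoders typically expect 16 kHz input).
audio_array, sample_rate = librosa.load("/path/to/your_audio.wav", sr=16000)

# vLLM expects the audio as a (waveform, sample_rate) tuple in multi_modal_data.
inputs = {
    "prompt": prompt,
    "multi_modal_data": {"audio": [(audio_array, sample_rate)]},
}
outputs = llm.generate([inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```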
79
  ### Serving
80
 
81
  Here is an example to start the server via the `vllm serve` command.
 
159
  print(response_obj.choices[0].message.content)
160
  ```
161
 
162
+ Alternatively, you can call the server with `curl`; see the example below. We recommend including the generation config in the JSON body to fully reproduce the reported performance.
163
 
164
  ```bash
165
  curl http://localhost:8000/v1/chat/completions \
 
167
  -d '{
168
  "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
169
  "messages": [
170
+ {"role": "user",
171
+ "content": [
172
+ {"type": "text", "text": "Text instruction: <your-instruction>"},
173
+ {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,<your-audio-base64-string>"}}
174
+ ]
175
+ }
176
+ ],
177
+ "max_completion_tokens": 1024,
178
+ "temperature": 0.1,
179
+ "top_p": 0.9,
180
+ "seed": 42
181
  }'
 
182
  ```
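
The `<your-audio-base64-string>` placeholder above can be generated from a local audio file with a few lines of Python, sketched below; the file path is a placeholder.

```python
import base64

# Read the raw audio bytes and base64-encode them for the data URL
# in the request body ("data:audio/ogg;base64,<...>").
with open("/path/to/your_audio.ogg", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("utf-8")

print(audio_base64)
```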