MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION

@@ -55,7 +55,7 @@ MERaLiON stands for **M**ultimodal **E**mpathetic **R**easoning **a**nd **L**ear
 - **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
 - **Demo:** [MERaLiON-AudioLLM Web Demo](https://huggingface.co/spaces/MERaLiON/MERaLiON-AudioLLM)
-We support model inference using the [Huggingface](#inference) and [vLLM](#vllm-inference) frameworks. For more technical details, please refer to our [technical report](https://arxiv.org/abs/2412.09818).
 ## Acknowledgement
 This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
@@ -497,95 +497,7 @@ response = processor.batch_decode(generated_ids, skip_special_tokens=True)
 ### vLLM Inference
-MERaLiON-AudioLLM requires vLLM version `0.6.4.post1`.
-```bash
-pip install vllm==0.6.4.post1
-```
-#### Model Registration
-As the [vLLM documentation](https://docs.vllm.ai/en/stable/models/adding_model.html#out-of-tree-model-integration) recommends,
-we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/en/stable/design/plugin_system.html#plugin-system).
-```bash
-cd vllm_plugin_meralion
-python install .
-```
-#### vLLM Offline Inference
-Here is an example of offline inference using our custom vLLM class.
-```python
-import torch
-from vllm import ModelRegistry, LLM, SamplingParams
-from vllm.assets.audio import AudioAsset
-def run_meralion(question: str):
-    model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"
-    llm = LLM(model=model_name,
-              tokenizer=model_name,
-              max_num_seqs=8,
-              limit_mm_per_prompt={"audio": 1},
-              trust_remote_code=True,
-              dtype=torch.bfloat16
-              )
-    audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"
-    prompt = ("<start_of_turn>user\n"
-              f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
-              "<start_of_turn>model\n")
-    stop_token_ids = None
-    return llm, prompt, stop_token_ids
-audio_asset = AudioAsset("mary_had_lamb")
-question= "Please trancribe this speech."
-llm, prompt, stop_token_ids = run_meralion(question)
-# We set temperature to 0.2 so that outputs can be different
-# even when all prompts are identical when running batch inference.
-sampling_params = SamplingParams(
-  temperature=0.1,
-  top_p=0.9,
-  top_k=50,
-  repetition_penalty=1.1,
-  seed=42,
-  max_tokens=1024,
-  stop_token_ids=None
-)
-mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
-inputs = {"prompt": prompt, "multi_modal_data": mm_data}
-# batch inference
-inputs = [inputs] * 2
-outputs = llm.generate(inputs, sampling_params=sampling_params)
-for o in outputs:
-    generated_text = o.outputs[0].text
-    print(generated_text)
-```
-#### OpenAI Compatible Server
-**server**
-Here is an example to start the server via the `vllm serve` command.
-```bash
-export HF_TOKEN=your-hf-token
-vllm serve MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --tokenizer MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION --max-num-seqs 8 --trust-remote-code --dtype bfloat16
-```
-**client**
-Refer to official vLLM example [code](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py#L213-L236).
 ## Disclaimer

 - **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf)
 - **Demo:** [MERaLiON-AudioLLM Web Demo](https://huggingface.co/spaces/MERaLiON/MERaLiON-AudioLLM)
+We support model inference using the [Hugging Face](#inference) and [vLLM](vllm_plugin_meralion/README.md) frameworks. For more technical details, please refer to our [technical report](https://arxiv.org/abs/2412.09818).
 ## Acknowledgement
 This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative.
 ### vLLM Inference
+We support hosting the model with the vLLM framework. Refer to the [vLLM serving guide](vllm_plugin_meralion/README.md).
 ## Disclaimer

vllm_plugin_meralion/README.md

@@ -1,4 +1,4 @@
-## MERaLiON vLLM serving
 ### Set up Environment
@@ -17,6 +17,65 @@ we provide a way to register our model via [vLLM plugins](https://docs.vllm.ai/e
 pip install .
 ```
 ### Serving
 Here is an example to start the server via the `vllm serve` command.
@@ -100,7 +159,7 @@ response_obj = get_response(possible_text_inputs[0], audio_base64, **generation_
 print(response_obj.choices[0].message.content)
 ```
-Alternatively, can try calling the server with curl command.
 ```bash
 curl http://localhost:8000/v1/chat/completions \
@@ -108,8 +167,16 @@ curl http://localhost:8000/v1/chat/completions \
     -d '{
         "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
         "messages": [
-            {"role": "system", "content": [{"type": "text", "text": "Text instruction: <your-command>"}, {"type":"audio_url", "audio_url": {"url": "data:audio/ogg;base64,<audio base64>"}}]},
-        ]
     }'
 ```

+## MERaLiON-AudioLLM vLLM Serving
 ### Set up Environment
 pip install .
 ```
+### Offline Inference
+Here is an example of offline inference using our custom vLLM class.
+```python
+import torch
+from vllm import ModelRegistry, LLM, SamplingParams
+from vllm.assets.audio import AudioAsset
+def run_meralion(question: str):
+    model_name = "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION"
+    llm = LLM(model=model_name,
+              tokenizer=model_name,
+              max_num_seqs=8,
+              limit_mm_per_prompt={"audio": 1},
+              trust_remote_code=True,
+              dtype=torch.bfloat16
+              )
+    audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"
+    prompt = ("<start_of_turn>user\n"
+              f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
+              "<start_of_turn>model\n")
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids
+audio_asset = AudioAsset("mary_had_lamb")
+question = "Please transcribe this speech."
+llm, prompt, stop_token_ids = run_meralion(question)
+# We set temperature to 0.1 (non-zero) so that outputs can differ
+# even when all prompts are identical during batch inference.
+sampling_params = SamplingParams(
+  temperature=0.1,
+  top_p=0.9,
+  top_k=50,
+  repetition_penalty=1.1,
+  seed=42,
+  max_tokens=1024,
+  stop_token_ids=None
+)
+mm_data = {"audio": [audio_asset.audio_and_sample_rate]}
+inputs = {"prompt": prompt, "multi_modal_data": mm_data}
+# batch inference
+inputs = [inputs] * 2
+outputs = llm.generate(inputs, sampling_params=sampling_params)
+for o in outputs:
+    generated_text = o.outputs[0].text
+    print(generated_text)
+```
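The prompt template used in the offline example can be checked on its own, without loading the model. The following is a minimal sketch: the helper name `build_prompt` is hypothetical and not part of the repository, while the template strings are copied from the example above. It shows that each prompt carries exactly one `<SpeechHere>` placeholder, which matches `limit_mm_per_prompt={"audio": 1}`.

```python
def build_prompt(question: str) -> str:
    """Expand the chat template from the offline example (hypothetical helper)."""
    # One <SpeechHere> placeholder marks where the single audio input goes.
    audio_in_prompt = "Given the following audio context: <SpeechHere>\n\n"
    return (
        "<start_of_turn>user\n"
        f"{audio_in_prompt}Text instruction: {question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = build_prompt("Please transcribe this speech.")
print(prompt)
```

Printing the prompt before running batch inference is a cheap way to catch a missing placeholder or a malformed turn marker.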
 ### Serving
 Here is an example to start the server via the `vllm serve` command.
 print(response_obj.choices[0].message.content)
 ```
+Alternatively, you can call the server with curl, as in the example below. We recommend including the generation config in the JSON body to reproduce the model's performance.
 ```bash
 curl http://localhost:8000/v1/chat/completions \
     -d '{
         "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
         "messages": [
+            {"role": "user",
+            "content": [
+                {"type": "text", "text": "Text instruction: <your-instruction>"},
+                {"type": "audio_url", "audio_url": {"url": "data:audio/ogg;base64,<your-audio-base64-string>"}}
+            ]
+            }
+        ],
+        "max_completion_tokens": 1024,
+        "temperature": 0.1,
+        "top_p": 0.9,
+        "seed": 42
     }'
 ```
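The JSON body in the curl example can also be assembled programmatically. The sketch below is illustrative only: the helper name `build_chat_payload` and the placeholder audio bytes are assumptions, while the field names and generation settings mirror the curl example. It base64-encodes the audio into a `data:audio/ogg` URL and serializes the request body; sending it to the server is left out.

```python
import base64
import json


def build_chat_payload(audio_bytes: bytes, instruction: str) -> dict:
    """Build the chat-completions request body (hypothetical helper)."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": "MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Text instruction: {instruction}"},
                    {
                        "type": "audio_url",
                        "audio_url": {"url": f"data:audio/ogg;base64,{audio_b64}"},
                    },
                ],
            }
        ],
        # Generation config copied from the curl example.
        "max_completion_tokens": 1024,
        "temperature": 0.1,
        "top_p": 0.9,
        "seed": 42,
    }


# Placeholder bytes stand in for a real audio file read with open(path, "rb").
payload = build_chat_payload(b"placeholder-audio-bytes", "Please transcribe this speech.")
body = json.dumps(payload)
print(body[:80])
```

The resulting `body` string can be passed to any HTTP client as the POST data for `http://localhost:8000/v1/chat/completions`, with a `Content-Type: application/json` header.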