Does this model support multiple audio inputs in a single turn?
Hello,
Does this model support multiple audio inputs in a single turn?
For example:
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Select the best audio and answer with its number."},
            {"type": "text", "text": "1."},
            {"type": "audio", "audio": "audio1.wav"},
            {"type": "text", "text": "2."},
            {"type": "audio", "audio": "audio1.wav"},
        ]
    }
]
Hi @Evan-Lin,
The model can process audio, but it typically expects a single audio file or stream per input. The audio encoder takes that one stream and converts it into a sequence of tokens, which are then interleaved with text tokens for the language model to process.
The standard workflow is to provide one audio file or stream at a time within the user turn. The audio can be, for example, a voice recording to transcribe or a sound to analyze.
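As a rough illustration of the one-audio-per-turn workflow, here is a minimal sketch that splits a list of clips into separate conversations, each with a single audio entry. It only builds the message dictionaries in the same shape as the question above. The helper name build_single_audio_turns and its parameters are hypothetical, and passing each conversation to the model's processor is not shown, since that depends on the library you are using.

```python
from typing import Any

Message = dict[str, Any]


def build_single_audio_turns(
    audio_paths: list[str],
    instruction: str,
    system_prompt: str = "You are a helpful assistant.",
) -> list[list[Message]]:
    """Build one conversation per audio file (hypothetical helper).

    Each conversation holds a system message and a user message with
    exactly one audio entry, matching the single-audio workflow.
    """
    conversations: list[list[Message]] = []
    for path in audio_paths:
        conversations.append([
            {
                "role": "system",
                "content": [{"type": "text", "text": system_prompt}],
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "audio", "audio": path},
                ],
            },
        ])
    return conversations


# Example: two clips become two independent single-audio conversations.
conversations = build_single_audio_turns(
    ["audio1.wav", "audio2.wav"],
    "Describe the quality of this audio.",
)
```

Each conversation can then be run on its own, and the per-clip answers compared afterwards in plain text, for instance in a final text-only turn. This is just one way to work around the single-audio expectation, not an official recommendation.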
Kindly refer to this link for more information.
We appreciate your exploration and feedback, and we'll be sure to share updates as more multimodal capabilities become available in future releases.
Thank you.
I want freedom of movement and constant fun with big smiles.