Does this model support multiple audio inputs in a single turn?

#17
by Evan-Lin - opened

Hello,

Does this model support multiple audio inputs in a single turn? For example:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Select the best audio and answer with its number."},
            {"type": "text", "text": "1."},
            {"type": "audio", "audio": "audio1.wav"},
            {"type": "text", "text": "2."},
            {"type": "audio", "audio": "audio2.wav"},
        ],
    },
]
Google org

Hi @Evan-Lin,

The model's multimodal capabilities are designed to process audio, but it typically expects a single audio file or stream per input. The model's audio encoder takes a single audio stream and converts it into a sequence of tokens that are then interleaved with text tokens for the language model to process.

The standard workflow involves providing one audio file or stream at a time within the user turn. This audio can be a voice recording to be transcribed or a sound to be analyzed.
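
For reference, the single-audio pattern described above can be expressed with the same message schema used in the question. This is only a minimal sketch: the filename and prompt text are placeholders, and the exact processor or generation calls for this specific model may differ.

# Minimal sketch of the supported single-audio pattern, reusing the message
# schema from the question above. The filename "speech.wav" and the prompt
# are placeholder values; the resulting `messages` list would then be passed
# to the model's chat template / processor as usual.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe this recording."},
            {"type": "audio", "audio": "speech.wav"},  # exactly one audio per user turn
        ],
    },
]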

Kindly refer to this link for more information.

We appreciate your exploration and feedback, and we’ll be sure to share updates as more multimodal capabilities become available in future releases.

Thank you.

