Does this model support multiple audio inputs in a single turn?

#17
by Evan-Lin - opened

Hello,

Does this model support multiple audio inputs in a single turn? For example:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Select the best audio and answer with its number."},
            {"type": "text", "text": "1."},
            {"type": "audio", "audio": "audio1.wav"},
            {"type": "text", "text": "2."},
            {"type": "audio", "audio": "audio2.wav"},
        ],
    },
]
Google org

Hi @Evan-Lin,

The model's multimodal capabilities are designed to process audio, but it typically expects a single audio file or stream per input. The model's audio encoder takes a single audio stream and converts it into a sequence of tokens that are then interleaved with text tokens for the language model to process.

The standard workflow involves providing one audio file or stream at a time within the user turn. This audio can be a voice recording to be transcribed or a sound to be analyzed.
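
For reference, the single-audio pattern described above can be expressed with the same message schema used in the question. This is only a minimal sketch: the filename and prompt text are placeholders, and the exact processor or generation calls for this specific model may differ.

# Minimal sketch of the supported single-audio pattern, reusing the message
# schema from the question above. The filename "speech.wav" and the prompt
# are placeholder values; the resulting `messages` list would then be passed
# to the model's chat template / processor as usual.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe this recording."},
            {"type": "audio", "audio": "speech.wav"},  # exactly one audio per user turn
        ],
    },
]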

Kindly refer to this link for more information.

We appreciate your exploration and feedback, and we’ll be sure to share updates as more multimodal capabilities become available in future releases.

Thank you.

