SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Abstract
SAKURA is a benchmark for evaluating the multi-hop reasoning abilities of large audio-language models, revealing that they struggle to integrate speech/audio representations even when they extract the relevant information correctly.
Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and related modalities. While their performance on speech- and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech- and audio-processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation of LALMs and offer insights and resources for future research.
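To make the single-hop versus multi-hop contrast concrete, below is a minimal sketch of how such a gap could be measured. The question text, the `EXAMPLES` data layout, the `model(audio, question)` interface, and the `dummy_model` stand-in are all hypothetical illustrations, not the actual SAKURA protocol or data.

```python
# Minimal sketch of the single-hop vs. multi-hop contrast described in the
# abstract. All names, question text, and data below are hypothetical
# placeholders for illustration, not the actual SAKURA benchmark.
from typing import Callable

# Each item pairs an audio clip with a single-hop question (extract a fact
# from the audio) and a multi-hop question (combine that fact with world
# knowledge the model already has).
EXAMPLES = [
    {
        "audio": "dog_bark.wav",
        "single_hop": "Which animal is making this sound?",
        "single_answer": "dog",
        "multi_hop": "How many legs does the animal making this sound have?",
        "multi_answer": "four",
    },
]

def accuracy(model: Callable[[str, str], str], q_key: str, a_key: str) -> float:
    """Fraction of items whose reference answer appears in the model's reply."""
    correct = sum(
        item[a_key].lower() in model(item["audio"], item[q_key]).lower()
        for item in EXAMPLES
    )
    return correct / len(EXAMPLES)

def diagnose(model: Callable[[str, str], str]) -> None:
    single = accuracy(model, "single_hop", "single_answer")
    multi = accuracy(model, "multi_hop", "multi_answer")
    # High single-hop but low multi-hop accuracy is the failure mode the
    # abstract reports: the fact is extracted but not integrated.
    print(f"single-hop: {single:.0%}  multi-hop: {multi:.0%}  gap: {single - multi:.0%}")

if __name__ == "__main__":
    # Dummy model that perceives the sound correctly but fails to integrate
    # that fact with world knowledge, mimicking the reported behavior.
    def dummy_model(audio_path: str, question: str) -> str:
        return "It has two legs." if "legs" in question else "It is a dog."

    diagnose(dummy_model)  # single-hop: 100%  multi-hop: 0%  gap: 100%
```

Substring matching is used here only as a crude scoring stand-in; the point of the sketch is the diagnostic contrast between the two question types, not the scoring rule.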
Community
The following related papers were recommended by the Semantic Scholar API (via Librarian Bot):
- MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix (2025)
- Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems (2025)
- SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation (2025)
- Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge (2025)
- Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs (2025)
- MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX (2025)
- BLAB: Brutally Long Audio Bench (2025)