CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech
Abstract
CapSpeech introduces a large benchmark dataset for various captioned text-to-speech tasks, facilitating advancements in style, accent, emotion, and chat-agent synthesis.
Recent advancements in generative artificial intelligence have significantly transformed the field of style-captioned text-to-speech synthesis (CapTTS). However, adapting CapTTS to real-world applications remains challenging due to the lack of standardized, comprehensive datasets and limited research on downstream tasks built upon CapTTS. To address these gaps, we introduce CapSpeech, a new benchmark designed for a series of CapTTS-related tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS). CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. In addition, we introduce two new datasets collected and recorded by a professional voice actor and experienced audio engineers, specifically for the AgentTTS and CapTTS-SE tasks. Alongside the datasets, we conduct comprehensive experiments using both autoregressive and non-autoregressive models on CapSpeech. Our results demonstrate high-fidelity and highly intelligible speech synthesis across a diverse range of speaking styles. To the best of our knowledge, CapSpeech is the largest available dataset offering comprehensive annotations for CapTTS-related tasks. The experiments and findings further provide valuable insights into the challenges of developing CapTTS systems.
Community
We are excited to share our recent work titled "CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech".
📄 Paper: https://arxiv.org/abs/2506.02863
🌐 Project Page: https://wanghelin1997.github.io/CapSpeech-demo/
🚀 Spaces Demo: https://huggingface.co/spaces/OpenSound/CapSpeech-TTS
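For readers who prefer to query the hosted demo programmatically rather than through the browser, here is a minimal sketch using the `gradio_client` library. Only the Space ID comes from the link above; the demo's endpoint names and parameters are not documented in this post, so the sketch inspects the exposed API rather than assuming a particular call signature.

```python
# Minimal sketch: inspect the CapSpeech-TTS demo Space with gradio_client.
# Only the Space ID (from the link above) is taken from this post; the endpoints
# printed by view_api() should be checked against the actual Space before use.
from gradio_client import Client

client = Client("OpenSound/CapSpeech-TTS")
client.view_api()  # lists the named endpoints and the inputs each one expects
```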
CapSpeech is a new benchmark designed for style-captioned TTS (CapTTS) tasks, including style-captioned text-to-speech synthesis with sound events (CapTTS-SE), accent-captioned TTS (AccCapTTS), emotion-captioned TTS (EmoCapTTS), and text-to-speech synthesis for chat agents (AgentTTS).
CapSpeech comprises over 10 million machine-annotated audio-caption pairs and nearly 0.36 million human-annotated audio-caption pairs. Two new speech datasets, collected and recorded by a professional voice actor and experienced audio engineers, were designed specifically for the CapTTS-SE and AgentTTS tasks to broaden the benchmark's coverage of real-world scenarios.
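As a rough illustration of how the audio-caption pairs might be pulled from the Hugging Face Hub, below is a minimal sketch using the `datasets` library. The repository ID, split name, and column names are assumptions made for the example; the actual identifiers should be taken from the dataset cards linked on the project page.

```python
# Minimal sketch: stream a few CapSpeech examples with the `datasets` library.
# NOTE: the repo ID "OpenSound/CapSpeech", the split name, and the column names
# below are assumptions for illustration, not confirmed identifiers.
from datasets import load_dataset

ds = load_dataset("OpenSound/CapSpeech", split="train", streaming=True)

for example in ds.take(3):
    # Assumed columns: a transcript ("text") and a free-form style caption ("caption").
    print(example.get("text"), "|", example.get("caption"))
```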
The following papers were recommended by the Semantic Scholar API:
- Benchmarking Expressive Japanese Character Text-to-Speech with VITS and Style-BERT-VITS2 (2025)
- Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget (2025)
- Kimi-Audio Technical Report (2025)
- AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation (2025)
- Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis (2025)
- GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor (2025)