---
language: en
tags:
- audio
- speech-processing
- speech-codec
- low-bitrate
- streaming
- tts
- cross-modal
license: apache-2.0
---

# SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

## Resources
- [📄 Research Paper](https://arxiv.org/abs/2508.02849)
- [💻 Source Code](https://github.com/QiangChunyu/SecoustiCodec)
- [🤗 Demo Page](https://qiangchunyu.github.io/SecoustiCodec_Page/)

## Model Overview
SecoustiCodec is a **low-bitrate streaming speech codec** built for speech reconstruction at ultra-low bitrates (0.27-1 kbps). The model introduces several innovations:

- 🧠 **Cross-modal alignment**: Aligns text and speech in a joint multimodal frame-level space
- 🔍 **Semantic-paralinguistic disentanglement**: Separates linguistic content from speaker characteristics
- ⚡ **Streaming support**: Real-time processing capabilities
- 📊 **Efficient quantization**: A VAE+FSQ approach addresses token distribution problems

## Architecture Overview
![Model Architecture](https://qiangchunyu.github.io/SecoustiCodec_Page/model.png)

## Acknowledgments
- We used [HiFiGAN](https://github.com/jik876/hifi-gan) for efficient waveform generation.
- Our implementation draws on [MIMICodec](https://huggingface.co/kyutai/mimi).

## Citation
```bibtex
@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
  author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}
```
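
## Illustrative Sketch: Single-Codebook Bitrate and FSQ

The Model Overview mentions a single-codebook design, FSQ quantization, and bitrates of 0.27-1 kbps. The sketch below is a minimal, simplified illustration of how finite scalar quantization maps one latent frame to a single token and how bitrate follows from codebook size and frame rate. It is not the SecoustiCodec implementation: the function names `fsq_quantize` and `bitrate_bps`, the level counts, and the frame rate are all hypothetical values chosen for the example, and the rounding scheme is a simplified variant of FSQ.

```python
import math


def fsq_quantize(z, levels):
    """Simplified finite scalar quantization of one latent frame.

    z:      list of floats, one value per latent dimension
    levels: list of ints, number of grid points per dimension (hypothetical)
    Returns (codes, index): per-dimension codes and one mixed-radix token id.
    """
    codes = []
    for value, n in zip(z, levels):
        bounded = math.tanh(value)                 # squash into [-1, 1]
        code = round((bounded + 1) / 2 * (n - 1))  # nearest grid point in 0..n-1
        codes.append(code)

    # Combine the per-dimension codes into a single token (single codebook).
    index = 0
    for code, n in zip(codes, levels):
        index = index * n + code
    return codes, index


def bitrate_bps(levels, frame_rate_hz):
    """Bits per second when one token is emitted per frame."""
    codebook_size = math.prod(levels)  # implicit codebook size
    return frame_rate_hz * math.log2(codebook_size)


# Assumed example values, not the model's actual configuration.
codes, token = fsq_quantize([0.0, 0.0], [3, 3])
rate = bitrate_bps([4, 4, 4, 4], 12.5)
```

With one token per frame, the bitrate is the frame rate multiplied by the base-2 logarithm of the implicit codebook size, so lowering either the frame rate or the number of levels lowers the bitrate.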