---
language: en
tags:
- audio
- speech-processing
- speech-codec
- low-bitrate
- streaming
- tts
- cross-modal
license: apache-2.0
---

# SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

## Resources

- [Research Paper](https://arxiv.org/abs/2508.02849)
- [Source Code](https://github.com/QiangChunyu/SecoustiCodec)
- [Demo Page](https://qiangchunyu.github.io/SecoustiCodec_Page/)

## Model Overview

SecoustiCodec is a **low-bitrate streaming speech codec** that achieves good speech reconstruction quality at ultra-low bitrates (0.27-1 kbps). The model introduces several innovations:

- **Cross-modal alignment**: Aligns text and speech in a joint multimodal frame-level space
- **Semantic-paralinguistic disentanglement**: Separates linguistic content from speaker characteristics
- **Streaming support**: Real-time processing capabilities
- **Efficient quantization**: A VAE+FSQ (finite scalar quantization) approach addresses token distribution problems
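
To make the quantization bullet and the bitrate range more concrete, here is a minimal sketch of generic finite scalar quantization (FSQ) together with the usual single-codebook bitrate formula (frame rate times bits per token). It is not the SecoustiCodec implementation: the function names, the odd level counts, and the frame rate are illustrative assumptions, and the straight-through gradient used when training FSQ is omitted.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Quantize each latent dimension to a fixed number of integer levels.

    z: float array of shape (..., d).
    levels: d odd integers (odd values are an assumption made here so that
    the levels are symmetric around zero without a half-step offset).
    Returns integer codes in [0, levels[i] - 1] for each dimension i.
    """
    levels = np.asarray(levels)
    half = (levels - 1) / 2            # 5 levels -> values in {-2, ..., 2}
    bounded = np.tanh(z) * half        # squash each dimension into (-half, half)
    rounded = np.round(bounded)        # snap to the nearest integer level
    return (rounded + half).astype(np.int64)


def codes_to_index(codes, levels):
    """Fold per-dimension codes into one token id (mixed-radix encoding)."""
    levels = np.asarray(levels)
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (codes * basis).sum(axis=-1)


def bitrate_bps(levels, frame_rate_hz):
    """Bitrate of a single-codebook codec: frames per second * bits per token."""
    codebook_size = np.prod(levels)
    return float(frame_rate_hz * np.log2(codebook_size))


# Hypothetical configuration, chosen only for illustration.
levels = [5, 5, 5]                     # implicit codebook of 5 * 5 * 5 = 125 tokens
latents = np.zeros((2, 3))             # two frames, three latent dimensions
tokens = codes_to_index(fsq_quantize(latents, levels), levels)
print(tokens)                          # one token id per frame
print(bitrate_bps(levels, frame_rate_hz=10.0), "bits per second")
```

Because the codebook is implicit in the rounding grid, there are no learned code vectors that can go unused, which is the general motivation for FSQ-style quantizers; how SecoustiCodec combines this with a VAE is described in the paper.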

## Architecture Overview

## Acknowledgments

- We used [HiFiGAN](https://github.com/jik876/hifi-gan) for efficient waveform generation.
- We referred to [MIMICodec](https://huggingface.co/kyutai/mimi) when implementing this model.

## Citation

```bibtex
@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
  author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}
```