---
language: en
tags:
- audio
- speech-processing
- speech-codec
- low-bitrate
- streaming
- tts
- cross-modal
license: apache-2.0
---
# SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec
## Resources
- [Research Paper](https://arxiv.org/abs/2508.02849)
- [Source Code](https://github.com/QiangChunyu/SecoustiCodec)
- [Demo Page](https://qiangchunyu.github.io/SecoustiCodec_Page/)
## Model Overview
SecoustiCodec is a **low-bitrate streaming speech codec** that achieves good speech reconstruction quality at ultra-low bitrates (0.27 to 1 kbps). The model introduces several innovations:
- **Cross-modal alignment**: Aligns text and speech in a joint multimodal frame-level space
- **Semantic-paralinguistic disentanglement**: Separates linguistic content from speaker characteristics
- **Streaming support**: Real-time processing capabilities
- **Efficient quantization**: A VAE+FSQ approach addresses token distribution problems
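
To make the quantization bullet more concrete, here is a minimal sketch of generic finite scalar quantization (FSQ) with a single codebook, together with the bitrate arithmetic for such a codec. It is not the SecoustiCodec implementation: the function names, the level counts, and the frame rate are hypothetical values chosen for illustration, only odd level counts are handled, and the VAE part is omitted.

```python
import math


def fsq_quantize(z, levels):
    """Quantize one latent vector with FSQ (illustrative, odd level counts only).

    Each dimension is squashed with tanh and rounded to the nearest integer,
    so dimension i can only take levels[i] distinct values.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only supports odd level counts")
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half      # lies in (-half, half)
        codes.append(round(bounded) + half)    # shift to 0 .. n_levels - 1
    return codes


def codes_to_token(codes, levels):
    """Pack per-dimension codes into one token index (mixed-radix encoding)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + code
    return token


def bitrate_bps(frame_rate_hz, levels):
    """Bitrate of a single-codebook codec: frames per second times bits per token."""
    codebook_size = math.prod(levels)
    return frame_rate_hz * math.log2(codebook_size)


# Hypothetical configuration, not taken from the paper.
levels = [5, 5, 5]                       # implicit codebook of 5 * 5 * 5 = 125 tokens
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = codes_to_token(codes, levels)
rate = bitrate_bps(12.5, levels)         # assumed 12.5 frames per second
```

Because the codebook is implicit in the rounding grid, there is no learned embedding table to collapse, and the bitrate follows directly from the frame rate and the product of the level counts.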
## Architecture Overview

## Acknowledgments
- We used [HiFiGAN](https://github.com/jik876/hifi-gan) for efficient waveform generation.
- We referred to [MIMICodec](https://huggingface.co/kyutai/mimi) when implementing this model.
## Citation
```bibtex
@article{qiang2025secousticodec,
title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
journal={arXiv preprint arXiv:2508.02849},
year={2025}
}
```