---
language: en
tags:
- audio
- speech-processing
- speech-codec
- low-bitrate
- streaming
- tts
- cross-modal
license: apache-2.0
---

# SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

## Resources
- [Research Paper](https://arxiv.org/abs/2508.02849)
- [Source Code](https://github.com/QiangChunyu/SecoustiCodec)
- [Demo Page](https://qiangchunyu.github.io/SecoustiCodec_Page/)

## Model Overview

SecoustiCodec is a state-of-the-art **low-bitrate streaming speech codec** that delivers strong speech reconstruction quality at ultra-low bitrates (0.27–1 kbps). The model introduces several innovations:

- **Cross-modal alignment**: Aligns text and speech in a joint multimodal frame-level space.
- **Semantic-paralinguistic disentanglement**: Separates linguistic content from paralinguistic information such as speaker characteristics.
- **Streaming support**: Supports real-time, streaming processing.
- **Efficient quantization**: Combines a VAE with FSQ (Finite Scalar Quantization) to mitigate token distribution problems.
- **Acoustic-constrained optimization**: Ensures stable training convergence.
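The efficient-quantization item pairs a VAE with FSQ, and the bitrate range above follows from how many bits each token carries. As a rough illustration, here is a forward-only sketch in plain Python of generic FSQ (Mentzer et al., 2023): each latent dimension is squashed and rounded to one of a few levels, the per-dimension codes combine into one token of an implicit single codebook, and bitrate is frame rate times bits per token. This is not SecoustiCodec's code; the function names, the `levels` values, and the frame rates are hypothetical and are not the model's actual configuration.

```python
import math
from typing import Sequence


def fsq_bound(z: float, level: int, eps: float = 1e-3) -> float:
    """Squash one latent value so that rounding yields exactly `level` integers.

    Odd levels map to a symmetric range; even levels are shifted by half a
    step, following the bounding used in the FSQ reference formulation.
    """
    if level < 2:
        raise ValueError("each FSQ dimension needs at least 2 levels")
    half_l = (level - 1) * (1 - eps) / 2
    offset = 0.0 if level % 2 == 1 else 0.5
    shift = math.tan(offset / half_l)
    return math.tanh(z + shift) * half_l - offset


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> list[int]:
    """Round each bounded dimension; dimension i lands in [-(L//2), L - 1 - L//2].

    Training would pass gradients through round() with a straight-through
    estimator; this forward-only sketch omits that.
    """
    if len(z) != len(levels):
        raise ValueError("latent and levels must have the same length")
    return [round(fsq_bound(x, lvl)) for x, lvl in zip(z, levels)]


def fsq_index(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Map per-dimension codes to one token id in [0, prod(levels))."""
    index = 0
    for q, lvl in zip(codes, levels):
        # Shift the code to 0..lvl-1, then accumulate in mixed radix.
        index = index * lvl + (q + lvl // 2)
    return index


def bitrate_bps(frame_rate_hz: float, levels: Sequence[int]) -> float:
    """One token per frame, drawn from an implicit codebook of prod(levels) entries."""
    return frame_rate_hz * math.log2(math.prod(levels))


if __name__ == "__main__":
    levels = [8, 8, 8, 5, 5, 5]  # hypothetical: 64000 implicit codes
    codes = fsq_quantize([0.3, -1.2, 2.5, 0.0, 0.7, -3.0], levels)
    print(codes, fsq_index(codes, levels))
    for rate in (12.5, 25.0, 50.0):  # hypothetical frame rates in Hz
        print(f"{rate} Hz -> {bitrate_bps(rate, levels) / 1000:.3f} kbps")
```

Because a single-codebook codec emits one token of log2(prod(levels)) bits per frame, its bitrate scales linearly with the frame rate; reducing the frame rate or the number of levels is the general lever for reaching sub-kbps rates.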

## Architecture Overview

## Acknowledgments
- We used [HiFi-GAN](https://github.com/jik876/hifi-gan) for efficient waveform generation.
- We referred to [Mimi](https://huggingface.co/kyutai/mimi) when implementing SecoustiCodec.

## Citation
```bibtex
@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
  author={Qiang, Chunyu and Wang, Haoyu and Gong, Cheng and Wang, Tianrui and Fu, Ruibo and Wang, Tao and Chen, Ruilong and Yi, Jiangyan and Wen, Zhengqi and Zhang, Chen and Wang, Longbiao and Dang, Jianwu and Tao, Jianhua},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}
```