---
language: en
tags:
- audio
- speech-processing
- speech-codec
- low-bitrate
- streaming
- tts
- cross-modal
license: apache-2.0
---

# SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

## Resources
- [📄 Research Paper](https://arxiv.org/abs/2508.02849)
- [💻 Source Code](https://github.com/QiangChunyu/SecoustiCodec)
- [🤗 Demo Page](https://qiangchunyu.github.io/SecoustiCodec_Page/)

## Model Overview

SecoustiCodec is a **low-bitrate streaming speech codec** that uses a single codebook and targets high-quality speech reconstruction at ultra-low bitrates (0.27-1 kbps). Its main design elements are:

- 🧠 **Cross-modal alignment**: Aligns text and speech in a joint multimodal frame-level space
- 🔍 **Semantic-paralinguistic disentanglement**: Separates linguistic content from speaker characteristics
- ⚡ **Streaming support**: Supports real-time processing
- 📊 **Efficient quantization**: A VAE+FSQ approach addresses token distribution problems
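
To make the quantization bullet more concrete, here is a minimal, generic sketch of finite scalar quantization (FSQ) and of how a single codebook translates into a bitrate. It is not taken from the SecoustiCodec source code: the function names, the level counts `[5, 5, 5]`, and the frame rate are hypothetical values chosen for illustration, the VAE part is omitted, and only odd level counts are handled to keep the rounding simple.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each latent dimension to one of levels[i] integer values.

    Generic FSQ sketch (odd level counts only): bound the value with tanh,
    scale it to the half-width of the grid, then round to the nearest integer.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        assert n_levels % 2 == 1, "this sketch only handles odd level counts"
        half_width = (n_levels - 1) // 2
        codes.append(round(math.tanh(value) * half_width))
    return codes

def codes_to_index(codes, levels):
    """Pack per-dimension codes into one token index (mixed-radix encoding)."""
    index = 0
    for code, n_levels in zip(codes, levels):
        half_width = (n_levels - 1) // 2
        index = index * n_levels + (code + half_width)
    return index

def bitrate_bps(frame_rate_hz, levels):
    """Bits per second for one token per frame from a single implicit codebook."""
    codebook_size = math.prod(levels)
    return frame_rate_hz * math.log2(codebook_size)

# Hypothetical configuration, NOT the model's actual settings.
levels = [5, 5, 5]                      # implicit codebook of 5 * 5 * 5 = 125 entries
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = codes_to_index(codes, levels)
print(codes, token, bitrate_bps(12.5, levels))
```

Because the codebook is implicit in the per-dimension grids, every index is reachable by construction, which is one reason FSQ is often used to avoid poorly utilized codebook entries.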



## Architecture Overview

![Model Architecture](https://qiangchunyu.github.io/SecoustiCodec_Page/model.png)


## Acknowledgments
- We used [HiFiGAN](https://github.com/jik876/hifi-gan) for efficient waveform generation.
- Our implementation draws on [MIMICodec](https://huggingface.co/kyutai/mimi).


## Citation
```bibtex
@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
  author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}
```