
	---
	language: en
	tags:
	- audio
	- speech-processing
	- speech-codec
	- low-bitrate
	- streaming
	- tts
	- cross-modal
	license: apache-2.0
	---

	# SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

	## Resources
	- [📄 Research Paper](https://arxiv.org/abs/2508.02849)
	- [💻 Source Code](https://github.com/QiangChunyu/SecoustiCodec)
	- [🤗 Demo Page](https://qiangchunyu.github.io/SecoustiCodec_Page/)

	## Model Overview

	SecoustiCodec is a streaming speech codec that achieves good reconstruction quality at ultra-low bitrates (0.27-1 kbps). The model introduces several innovations:

	- 🧠 Cross-modal alignment: Aligns text and speech in a joint multimodal frame-level space
	- 🔍 Semantic-paralinguistic disentanglement: Separates linguistic content from speaker characteristics
	- ⚡ Streaming support: Real-time processing capabilities
	- 📊 Efficient quantization: Combines a VAE (variational autoencoder) with FSQ (finite scalar quantization) to address token distribution problems
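
To make the quantization bullet more concrete, here is a minimal sketch of generic finite scalar quantization (FSQ) producing a single token per frame. It is not taken from the SecoustiCodec source: the level counts in `LEVELS`, the function names, and the frame rate passed to `bitrate_bps` are all hypothetical and chosen only for illustration. The VAE part is omitted.

```python
import math

# Hypothetical per-dimension level counts; NOT SecoustiCodec's actual configuration.
LEVELS = (5, 5, 5)


def fsq_quantize(z, levels=LEVELS):
    """Map a latent vector to one integer code per dimension."""
    codes = []
    for value, n_levels in zip(z, levels):
        bounded = math.tanh(value)                        # squash into (-1, 1)
        scaled = (bounded + 1.0) / 2.0 * (n_levels - 1)   # rescale to [0, n_levels - 1]
        codes.append(int(math.floor(scaled + 0.5)))       # round to the nearest level
    return codes


def codes_to_token(codes, levels=LEVELS):
    """Pack per-dimension codes into a single token id (mixed-radix encoding)."""
    token = 0
    for code, n_levels in zip(codes, levels):
        token = token * n_levels + code
    return token


def codebook_size(levels=LEVELS):
    """Size of the implicit single codebook: the product of the level counts."""
    return math.prod(levels)


def bitrate_bps(frame_rate_hz, levels=LEVELS):
    """Bits per second for one token per frame at the given (assumed) frame rate."""
    return frame_rate_hz * math.log2(codebook_size(levels))


if __name__ == "__main__":
    codes = fsq_quantize([-10.0, 10.0, 0.0])
    print(codes, codes_to_token(codes), codebook_size())
```

Because every dimension is bounded and rounded independently, the codebook is implicit and no nearest-neighbour search over learned code vectors is needed. The bitrate of a single-codebook codec then follows from the frame rate and the codebook size alone.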



	## Architecture Overview

	![Model Architecture](https://qiangchunyu.github.io/SecoustiCodec_Page/model.png)


	## Acknowledgments
	- We used [HiFi-GAN](https://github.com/jik876/hifi-gan) for efficient waveform generation
	- Our implementation draws on the [Mimi codec](https://huggingface.co/kyutai/mimi)


	## Citation
	```bibtex
	@article{qiang2025secousticodec,
	title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
	author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
	journal={arXiv preprint arXiv:2508.02849},
	year={2025}
	}
	```