metadata
language: en
tags:
  - audio
  - speech-processing
  - speech-codec
  - low-bitrate
  - streaming
  - tts
  - cross-modal
license: apache-2.0

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Resources

Model Overview

SecoustiCodec is a streaming speech codec that reconstructs speech from a single codebook at ultra-low bitrates (0.27-1 kbps). The model introduces several innovations:

  • 🧠 Cross-modal alignment: Aligns text and speech in a joint multimodal frame-level space
  • 🔍 Semantic-paralinguistic disentanglement: Separates linguistic content from speaker characteristics
  • ⚡ Streaming support: Real-time processing capabilities
  • 📊 Efficient quantization: VAE+FSQ approach solves token distribution problems
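
To make the FSQ (finite scalar quantization) part of the last bullet concrete, here is a minimal NumPy sketch of the general technique. It is not SecoustiCodec's implementation: the function names `fsq_quantize` and `bitrate_bps`, the level counts `[5, 5, 5, 5]`, and the 25 Hz frame rate are all illustrative assumptions, the VAE stage is omitted, and the sketch is restricted to odd level counts to keep the rounding symmetric.

```python
import math
import numpy as np


def fsq_quantize(z, levels):
    """Quantize latents z of shape (..., d) with FSQ.

    levels[i] is the number of values channel i may take. This sketch
    (an assumption, not the model's code) supports odd counts >= 3 only.
    Returns the quantized latents in [-1, 1] and one integer token per frame.
    """
    levels = np.asarray(levels, dtype=np.int64)
    if np.any(levels % 2 == 0) or np.any(levels < 3):
        raise ValueError("this sketch supports odd level counts >= 3 only")

    half = (levels - 1) // 2            # e.g. 5 levels -> integers in [-2, 2]
    bounded = np.tanh(z) * half         # squash each channel into [-half, half]
    rounded = np.round(bounded)         # snap to the nearest integer level
    quantized = rounded / half          # rescale to [-1, 1] for a decoder

    # Treat the per-channel levels as digits of a mixed-radix number so the
    # whole vector maps to a single token in one implicit codebook.
    digits = (rounded + half).astype(np.int64)
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    index = (digits * basis).sum(axis=-1)
    return quantized, index


def bitrate_bps(frame_rate_hz, levels):
    """Bits per second for one token per frame from the implicit codebook."""
    return frame_rate_hz * math.log2(math.prod(levels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(3, 4))   # 3 frames, 4 latent channels
    q, tokens = fsq_quantize(latents, [5, 5, 5, 5])
    print("tokens:", tokens)
    print("codebook size:", math.prod([5, 5, 5, 5]))
    print("bitrate at 25 Hz:", bitrate_bps(25.0, [5, 5, 5, 5]), "bps")
```

Because the codebook is the fixed grid of level combinations rather than a learned table, there is only one token stream per frame, and the bitrate follows directly from the frame rate and the product of the level counts. During training, FSQ is normally paired with a straight-through gradient estimator for the rounding step, which plain NumPy cannot express.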

Architecture Overview

[Figure: Model Architecture]

Acknowledgments

  • We use HiFiGAN for efficient waveform generation.
  • Our implementation draws on MIMICodec.

Citation

@article{qiang2025secousticodec,
  title={SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec},
  author={Chunyu Qiang and Haoyu Wang and Cheng Gong and Tianrui Wang and Ruibo Fu and Tao Wang and Ruilong Chen and Jiangyan Yi and Zhengqi Wen and Chen Zhang and Longbiao Wang and Jianwu Dang and Jianhua Tao},
  journal={arXiv preprint arXiv:2508.02849},
  year={2025}
}