DistilCodec

DistilCodec: A Single Codebook Audio Codec For Universal Audio

Paper | HuggingFace Model | Code


πŸ”₯ News

  • 2025.05.26: We released the DistilCodec-v1.0 checkpoint on HuggingFace.
  • 2025.05.26: The paper is available on arXiv.
  • 2025.05.23: We submitted the paper to arXiv.

Introduction of DistilCodec

The Joint Laboratory of the International Digital Economy Academy (IDEA) and Emdoor, in collaboration with Emdoor Information Technology Co., Ltd. and Shenzhen Yijiayiban Information Technology Co., Ltd., has launched DistilCodec, a single-codebook Neural Audio Codec (NAC) with 32768 codes trained on universal audio.

The foundational network architecture of DistilCodec adopts an Encoder-VQ-Decoder framework similar to the one proposed in SoundStream. The encoder employs a ConvNeXt-V2 structure, while the vector quantization module implements the GRFVQ scheme. The decoder employs a ConvTranspose1d-based architecture similar to HiFiGAN. The training methodology of DistilCodec also follows HiFiGAN, incorporating three types of discriminators: Multi-Period Discriminator (MPD), Multi-Scale Discriminator (MSD), and Multi-STFT Discriminator (MSFTFD).

[Figure: The Architecture of DistilCodec]

The distribution of DistilCodec's training data is shown in the table below:

Data Category Data Size (in hours)
Chinese Audiobook 38000
Chinese Common Audio 20000
English Audiobook 10000
English Speech 30000
Music 2000
Total 100000
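To illustrate the single-codebook quantization idea described above, here is a minimal, self-contained sketch of nearest-neighbor vector quantization against a 32768-entry codebook. This is not DistilCodec's actual implementation (the real model uses the GRFVQ scheme and learned codebooks); the random codebook, latent dimension, and frame count below are illustrative assumptions.

```python
# Minimal sketch of single-codebook vector quantization: each latent frame
# from the encoder is replaced by the index of its nearest codebook entry,
# and the decoder consumes the looked-up codebook vectors.
import numpy as np

rng = np.random.default_rng(0)
codebook_size, dim = 32768, 8  # DistilCodec uses 32768 codes; dim is illustrative
codebook = rng.normal(size=(codebook_size, dim)).astype(np.float32)

def quantize(latents: np.ndarray) -> np.ndarray:
    """Map each latent frame (T, dim) to the index of its nearest code."""
    # Squared L2 distance between every frame and every code: shape (T, codebook_size)
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def dequantize(codes: np.ndarray) -> np.ndarray:
    """Look the discrete codes back up to recover quantized latents."""
    return codebook[codes]

frames = rng.normal(size=(4, dim)).astype(np.float32)  # 4 latent frames from the encoder
codes = quantize(frames)                               # discrete audio tokens
recon = dequantize(codes)                              # input to the decoder
print(codes.shape, recon.shape)  # (4,) (4, 8)
```

A single large codebook like this is what lets each audio frame be represented by one token, which is convenient when the tokens are later consumed by an LLM.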

Inference of DistilCodec

The code is available in the DistilCodec GitHub repository.

Part 1: Generating discrete audio tokens with DistilCodec


from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If plus_llm_offset is True, the LLM's vocabulary size is added to each
    # audio token; DistilCodec's default offset comes from the QWen2.5-7B vocabulary.
    plus_llm_offset=True
)
print(audio_tokens)
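The plus_llm_offset parameter shifts codec codes past the LLM's text vocabulary so that text and audio tokens share one id space. A hedged sketch of that offset arithmetic, where the offset value is a made-up placeholder rather than QWen2.5-7B's real vocabulary size:

```python
# Sketch of what plus_llm_offset / minus_token_offset do. LLM_VOCAB_SIZE is
# hypothetical; DistilCodec actually derives the offset from QWen2.5-7B.
LLM_VOCAB_SIZE = 150000

def add_offset(codec_codes, offset=LLM_VOCAB_SIZE):
    """plus_llm_offset=True: codec token ids start after the text vocabulary."""
    return [c + offset for c in codec_codes]

def remove_offset(llm_tokens, offset=LLM_VOCAB_SIZE):
    """minus_token_offset=True: map shared ids back to raw codec codes."""
    return [t - offset for t in llm_tokens]

raw = [0, 123, 32767]                 # raw codes lie in [0, 32768)
shifted = add_offset(raw)
assert remove_offset(shifted) == raw  # round-trip recovers the raw codes
print(shifted)  # [150000, 150123, 182767]
```

This is why the decode step below must set minus_token_offset=True whenever the tokens were produced with plus_llm_offset=True.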

Part 2: Reconstructing audio from raw audio


from distil_codec import DistilCodec, demo_for_generate_audio_codes

codec_model_config_path = '/path/to/distilcodec/model_config.json'
codec_ckpt_path = '/path/to/distilcodec_ckpt'
step = 204000

codec = DistilCodec.from_pretrained(
    config_path=codec_model_config_path,
    model_path=codec_ckpt_path,
    load_steps=step,
    use_generator=True,
    is_debug=False).eval()

audio_path = '/path/to/audio_file'
audio_tokens = demo_for_generate_audio_codes(
    codec,
    audio_path,
    target_sr=24000,
    # If plus_llm_offset is True, the LLM's vocabulary size is added to each
    # audio token; DistilCodec's default offset comes from the QWen2.5-7B vocabulary.
    plus_llm_offset=True
)
print(audio_tokens)

# The generated audio is saved to f'{gen_audio_save_path}/{audio_name}.wav'
gen_audio_save_path = '/path/to/audio_save_path'
audio_name = 'audio_name'
y_gen = codec.decode_from_codes(
    audio_tokens,
    # If plus_llm_offset was True in demo_for_generate_audio_codes,
    # then minus_token_offset must also be True here.
    minus_token_offset=True
)
codec.save_wav(
    audio_gen_batch=y_gen,
    nhop_lengths=[y_gen.shape[-1]],
    save_path=gen_audio_save_path,
    name_tag=audio_name
)

Available DistilCodec models

Model Version Huggingface Corpus Token/s Domain
DistilCodec-v1.0 HuggingFace Universal Audio 93 Universal Audio
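The 93 tokens/s figure in the table, combined with the single 32768-entry codebook, implies the codec's token bitrate directly, since log2(32768) = 15 bits per token:

```python
# Back-of-the-envelope bitrate implied by the model table above.
import math

tokens_per_second = 93
codebook_size = 32768
bits_per_token = math.log2(codebook_size)          # 15.0
bitrate_bps = tokens_per_second * bits_per_token   # 1395.0
print(f"{bitrate_bps:.0f} bps = {bitrate_bps / 1000:.3f} kbps")
```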

Citation

If you find our work useful in your research, please cite our work:

@misc{wang2025unittsendtoendttsdecoupling,
      title={UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information}, 
      author={Rui Wang and Qianguo Sun and Tianrong Chen and Zhiyun Zeng and Junlong Wu and Jiaxing Zhang},
      year={2025},
      eprint={2505.17426},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2505.17426}, 
}

Disclaimer

DistilCodec provides universal audio discretization capabilities for academic research purposes only. We encourage the community to uphold safety and ethical principles in AI research and applications.

Important Notes:

  • Compliance with the model's open-source license is mandatory.

  • Unauthorized voice replication applications are strictly prohibited.

  • Developers bear no responsibility for any misuse of this model.

License

UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information Β© 2025 by Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang is licensed under CC BY-NC-ND 4.0
