---
library_name: transformers
license: cc-by-nc-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---
# Xcodec2 (Transformers-compatible version)
The X-Codec2 model was proposed in [Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis](https://huggingface.co/papers/2502.04128).
X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.
Its architecture is based on X-Codec with several major differences:
- **Unified Semantic-Acoustic Tokenization**: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
- **Single-Stage Vector Quantization (VQ)**: Unlike the multi-layer residual VQ used in most approaches (e.g., X-Codec, DAC, EnCodec), X-Codec2 uses a single-layer Finite Scalar Quantization (FSQ) codebook for stability and compatibility with causal, autoregressive LLMs (see the token-shape sketch after this list).
- **Semantic Supervision During Training**: It adds a semantic reconstruction loss, ensuring that the discrete tokens preserve meaningful linguistic and emotional information — crucial for TTS tasks.
- **Transformer-Friendly Design**: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility.
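As a minimal sketch of what the single-codebook design means in practice, the snippet below encodes a short waveform and inspects the token layout. It reuses the `bezzam/xcodec2` checkpoint and the Transformers API from the usage example below; the exact shape returned by `encode` may vary between releases, so the shape is simply printed rather than asserted.
```python
>>> import torch
>>> from transformers import AutoFeatureExtractor, Xcodec2Model

>>> model_id = "bezzam/xcodec2"  # same checkpoint as in the usage example below
>>> model = Xcodec2Model.from_pretrained(model_id).eval()
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

>>> # one second of silence, just to inspect the token layout
>>> dummy_audio = torch.zeros(feature_extractor.sampling_rate).numpy()
>>> inputs = feature_extractor(raw_audio=dummy_audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

>>> with torch.no_grad():
...     audio_codes = model.encode(inputs["input_values"]).audio_codes
>>> # with single-stage quantization there is only one codebook, so the codes form
>>> # a flat 1D token sequence per example (no residual-quantizer dimension to flatten)
>>> print(audio_codes.shape)
```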
## Usage example
Here is a quick example of how to encode and decode an audio sample with this model:
```python
>>> import torch
>>> from datasets import Audio, load_dataset
>>> from transformers import AutoFeatureExtractor, Xcodec2Model
>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"
>>> # load model and feature extractor
>>> model_id = "bezzam/xcodec2"
>>> model = Xcodec2Model.from_pretrained(model_id).to(torch_device).eval()
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
>>> # load data
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio = dataset[0]["audio"]["array"]
>>> # prepare data
>>> inputs = feature_extractor(raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(torch_device)
>>> # encode and decode
>>> audio_codes = model.encode(inputs["input_values"]).audio_codes
>>> audio_values = model.decode(audio_codes).audio_values
>>> # or the equivalent with a forward pass
>>> model_output = model(inputs["input_values"])
>>> audio_codes = model_output.audio_codes
>>> audio_values = model_output.audio_values
```
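To listen to the reconstruction, the decoded waveform can be written to disk, for example with `soundfile`. This is a sketch that assumes `soundfile` is installed; the tensor is moved to CPU and squeezed to a 1D array before writing.
```python
>>> import soundfile as sf
>>> # decoded waveform: detach, move to CPU, and drop batch/channel dimensions before writing
>>> waveform = audio_values.squeeze().detach().cpu().numpy()
>>> sf.write("reconstruction.wav", waveform, feature_extractor.sampling_rate)
```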
This model was contributed by [Steven Zheng](https://huggingface.co/Steveeeeeeen) and [Eric Bezzam](https://huggingface.co/bezzam).
The original code can be found [here](https://github.com/zhenye234/X-Codec-2.0), and the original checkpoints [here](https://huggingface.co/HKUSTAudio/xcodec2).