	---
	library_name: transformers
	license: cc-by-nc-4.0
	tags:
	- audio-to-audio
	pipeline_tag: audio-to-audio
	---

	# Xcodec2 (Transformers-compatible version)


	The X-Codec2 model was proposed in [Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis](https://huggingface.co/papers/2502.04128).

	X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.

	Its architecture is based on X-Codec with several major differences:

	- Unified Semantic-Acoustic Tokenization: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
	- Single-Stage Vector Quantization (VQ): Unlike the multi-layer residual VQ in most approaches (e.g., X-Codec, DAC, EnCodec), X-Codec2 uses a single-layer Finite Scalar Quantization (FSQ) for stability and compatibility with causal, autoregressive LLMs.
	- Semantic Supervision During Training: It adds a semantic reconstruction loss so that the discrete tokens preserve meaningful linguistic and emotional information, which is crucial for TTS tasks.
	- Transformer-Friendly Design: The 1D token structure of X-Codec2 naturally aligns with the autoregressive modeling in LLMs like LLaMA, improving training efficiency and downstream compatibility.
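
To make the single-layer quantization idea more concrete, here is a minimal pure-Python sketch of FSQ-style quantization: each latent dimension is bounded, rounded to a small number of levels, and the resulting digits are packed into one integer token per frame. The function names and the `levels` setting are illustrative assumptions, not the X-Codec2 implementation or its actual configuration, and the rounding scheme is simplified.

```python
import math

def fsq_quantize(latent, levels):
    """Round each latent dimension to one of `levels[i]` evenly spaced values."""
    digits = []
    for value, n_levels in zip(latent, levels):
        unit = (math.tanh(value) + 1.0) / 2.0        # bound the value to [0, 1]
        digits.append(round(unit * (n_levels - 1)))  # integer in [0, n_levels - 1]
    return digits

def digits_to_token(digits, levels):
    """Pack per-dimension digits into a single token id (mixed-radix encoding)."""
    token, base = 0, 1
    for digit, n_levels in zip(digits, levels):
        token += digit * base
        base *= n_levels
    return token

def token_to_digits(token, levels):
    """Invert `digits_to_token`."""
    digits = []
    for n_levels in levels:
        digits.append(token % n_levels)
        token //= n_levels
    return digits

# Illustrative setting (assumption): 4 dimensions with 4 levels each,
# which gives an implicit codebook of 4**4 = 256 entries.
levels = [4, 4, 4, 4]
frame = [10.0, -10.0, 0.3, -0.3]   # one made-up latent frame

digits = fsq_quantize(frame, levels)
token = digits_to_token(digits, levels)
print(digits, token)                # [3, 0, 2, 1] 99
print(math.prod(levels))            # 256
```

Because every frame maps to exactly one integer, a clip becomes a flat 1D token sequence, which is the shape an autoregressive LLM expects. A residual VQ stack would instead produce several codes per frame.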

	## Usage example

	Here is a quick example of how to encode and decode audio using this model:

	```python
	>>> import torch
	>>> from datasets import Audio, load_dataset
	>>> from transformers import AutoFeatureExtractor, Xcodec2Model

	>>> torch_device = "cuda" if torch.cuda.is_available() else "cpu"

	>>> # load model and feature extractor
	>>> model_id = "bezzam/xcodec2"
	>>> model = Xcodec2Model.from_pretrained(model_id).to(torch_device).eval()
	>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

	>>> # load data
	>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
	>>> audio = dataset[0]["audio"]["array"]

	>>> # prepare data
	>>> inputs = feature_extractor(raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(torch_device)

	>>> # encode and decode
	>>> audio_codes = model.encode(inputs["input_values"]).audio_codes
	>>> audio_values = model.decode(audio_codes).audio_values
	>>> # or the equivalent with a forward pass
	>>> model_output = model(inputs["input_values"])
	>>> audio_codes = model_output.audio_codes
	>>> audio_values = model_output.audio_values
	```
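
To listen to the reconstruction, the decoded samples can be written to a WAV file. The sketch below uses only the Python standard library. `save_wav` is a hypothetical helper, not part of Transformers, and it assumes mono float samples in the range [-1, 1]. Libraries such as `soundfile` are a common alternative.

```python
import struct
import wave

def save_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm += struct.pack("<h", round(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(pcm))

# Possible usage with the example above (assumes a batch of one mono clip):
# samples = audio_values.detach().squeeze().cpu().tolist()
# save_wav("reconstructed.wav", samples, feature_extractor.sampling_rate)
```

Taking the sampling rate from `feature_extractor.sampling_rate` keeps the file header consistent with the rate the audio was resampled to before encoding.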

	This model was contributed by [Steven Zheng](https://huggingface.co/Steveeeeeeen) and [Eric Bezzam](https://huggingface.co/bezzam).
	The original code can be found [here](https://github.com/zhenye234/X-Codec-2.0), and original checkpoints [here](https://huggingface.co/HKUSTAudio/xcodec2).