Higgs Audio V2 Tokenizer

This model was released on 2025-07-22 and added to Hugging Face Transformers on 2026-02-19.


Overview

  • Low Frame Rate: At 25 fps, our tokenizer halves the frame rate of many baselines while still maintaining high audio quality (see the quick check after this list).
  • Unified 24 kHz Training: We mix speech, music, and sound-event clips in one model, capturing both semantic and acoustic details, which greatly facilitates the training of audio language models.
  • Fast Inference: By avoiding diffusion steps, our encoder/decoder processes batches quickly, making it practical for real-time or large-scale tasks.
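
As a quick check of what these figures imply for sequence lengths, the sketch below works out the sample and frame counts for a 10-second clip. The 25 fps and 24 kHz numbers come from the bullets above; exact frame counts may differ slightly due to padding at the edges.

# Back-of-the-envelope token-count math for the 25 fps / 24 kHz figures above
sampling_rate = 24_000   # Hz, unified training sample rate
frame_rate = 25          # codec frames per second

duration_s = 10.0
num_samples = int(duration_s * sampling_rate)    # 240_000 waveform samples
num_frames = int(duration_s * frame_rate)        # 250 discrete frames
samples_per_frame = sampling_rate // frame_rate  # 960 samples per frame
print(num_samples, num_frames, samples_per_frame)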

Model Architecture:

Usage

from transformers import HiggsAudioV2TokenizerModel, AutoFeatureExtractor
from datasets import load_dataset, Audio

# load model and feature extractor
model_id = "eustlb/higgs-audio-v2-tokenizer"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HiggsAudioV2TokenizerModel.from_pretrained(model_id, device_map="auto")

# load audio sample
dummy_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dummy_dataset = dummy_dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = dummy_dataset[-1]["audio"]["array"]
inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

# encode and decode
encoder_outputs = model.encode(inputs["input_values"].to(model.device))
decoder_outputs = model.decode(encoder_outputs.audio_codes)
audio_values = decoder_outputs.audio_values

# or the equivalent with a forward pass
audio_values = model(inputs["input_values"].to(model.device)).audio_values
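
To sanity-check the round trip, you can inspect the output shapes and write the reconstruction to disk. This is a sketch, not part of the official example; it assumes the third-party soundfile package is installed.

import soundfile as sf

# audio_codes: (batch_size, num_quantizers, codes_length)
print(encoder_outputs.audio_codes.shape)
# audio_values: (batch_size, channels, num_samples)
print(audio_values.shape)

# write the first item in the batch to a wav file
sf.write("reconstruction.wav", audio_values[0, 0].detach().cpu().numpy(), feature_extractor.sampling_rate)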

HiggsAudioV2TokenizerConfig

class transformers.HiggsAudioV2TokenizerConfig


( target_bandwidths = [0.5, 1, 1.5, 2] sample_rate = 24000 kernel_size = 3 channel_ratios = [1, 1] strides = [1, 1] block_dilations = [1, 1] unit_kernel_size = 3 codebook_size = 1024 codebook_dim = 64 initializer_range = 0.02 acoustic_model_config = None semantic_model_config = None semantic_sample_rate = 16000 downsample_factor = 320 **kwargs )

Parameters

  • target_bandwidths (List[float], optional, defaults to [0.5, 1, 1.5, 2]) — The range of different bandwidths (in kbps) the model can encode audio with.
  • sample_rate (int, optional, defaults to 24000) — The sampling rate at which the audio waveform should be digitized, in hertz (Hz).
  • kernel_size (int, optional, defaults to 3) — Kernel size for the initial semantic convolution.
  • channel_ratios (List[float], optional, defaults to [1, 1]) — Expansion factors for the number of output channels in each semantic block.
  • strides (List[int], optional, defaults to [1, 1]) — Strides for each semantic encoder block.
  • block_dilations (List[int], optional, defaults to [1, 1]) — Dilation factors for the residual units in semantic blocks.
  • unit_kernel_size (int, optional, defaults to 3) — Kernel size inside each ResidualUnit in semantic blocks.
  • codebook_size (int, optional, defaults to 1024) — Number of entries in each residual quantizer’s codebook.
  • codebook_dim (int, optional, defaults to 64) — Dimensionality of each codebook vector.
  • initializer_range (float, optional, defaults to 0.02) — Standard deviation of the truncated normal initializer for all weight matrices.
  • acoustic_model_config (Union[Dict, AutoConfig], optional) — An instance of the configuration for the acoustic (DAC) model.
  • semantic_model_config (Union[Dict, AutoConfig], optional) — An instance of the configuration object for the semantic (HuBERT) model.
  • semantic_sample_rate (int, optional, defaults to 16000) — The sampling rate at which the semantic model expects audio input, in hertz (Hz).
  • downsample_factor (int, optional, defaults to 320) — Downsampling factor for the semantic features (see the quick check after this list).
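
As a quick derived-rate check from the defaults above (a sketch; how the model internally aligns the semantic stream with the 25 fps codec stream is not shown here):

semantic_sample_rate = 16_000  # Hz, default above
downsample_factor = 320        # default above

# frame rate of the semantic (HuBERT) features implied by the defaults
semantic_frame_rate = semantic_sample_rate / downsample_factor  # 50.0 Hz
print(semantic_frame_rate)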

This is the configuration class to store the configuration of a HiggsAudioV2TokenizerModel. It is used to instantiate a HiggsAudioV2Tokenizer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Higgs Audio v2 Tokenizer, e.g. bosonai/higgs-audio-v2-tokenizer.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Example:

>>> from transformers import HiggsAudioV2TokenizerModel, HiggsAudioV2TokenizerConfig

>>> # Initializing configuration
>>> configuration = HiggsAudioV2TokenizerConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = HiggsAudioV2TokenizerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
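
Defaults can also be overridden at construction time. The values below are illustrative only, not recommended settings:

>>> # Illustrative only: override selected defaults, keep the rest
>>> custom_configuration = HiggsAudioV2TokenizerConfig(codebook_size=2048, codebook_dim=128)
>>> custom_model = HiggsAudioV2TokenizerModel(custom_configuration)  # random weights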

HiggsAudioV2TokenizerModel

class transformers.HiggsAudioV2TokenizerModel


( config )

Parameters

  • config (HiggsAudioV2TokenizerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The HiggsAudioV2Tokenizer neural audio codec model.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

decode


( audio_codes: Tensor return_dict: bool | None = None )

Parameters

  • audio_codes (torch.LongTensor of shape (batch_size, num_quantizers, codes_length)) — Discrete code indices computed using model.encode.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Decodes the given audio codes back into an audio waveform, as in the sketch below.
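
A minimal sketch of the round trip, reusing the model and inputs from the Usage section above:

# codes from model.encode, shape (batch_size, num_quantizers, codes_length)
encoder_outputs = model.encode(inputs["input_values"].to(model.device))

# reconstruct the waveform from the discrete codes
decoder_outputs = model.decode(encoder_outputs.audio_codes)
reconstructed = decoder_outputs.audio_values  # (batch_size, channels, num_samples)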

encode


( input_values: Tensor bandwidth: float | None = None return_dict: bool | None = None )

Parameters

  • input_values (torch.FloatTensor of shape (batch_size, channels, num_samples)) — Float values of the input audio waveform.
  • bandwidth (float, optional) — The target bandwidth in kbps. Must be one of config.target_bandwidths. Defaults to the highest available bandwidth, as shown below.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput.

Encodes the input audio waveform into discrete audio codes.
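
A sketch of selecting an explicit bandwidth from the configured options, reusing the model and inputs from the Usage section:

# pick an explicit bandwidth from the configured options (in kbps)
lowest_bandwidth = min(model.config.target_bandwidths)

encoder_outputs = model.encode(inputs["input_values"].to(model.device), bandwidth=lowest_bandwidth)
audio_codes = encoder_outputs.audio_codes  # lower bandwidths typically use fewer quantizers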

forward


( input_values: Tensor audio_codes: torch.Tensor | None = None bandwidth: float | None = None return_dict: bool | None = None ) HiggsAudioV2TokenizerOutput or tuple (audio_codes, audio_values)

Parameters

  • input_values (torch.FloatTensor of shape (batch_size, channels, num_samples)) — The raw float values of the input audio waveform.
  • audio_codes (torch.LongTensor of shape (batch_size, num_quantizers, codes_length), optional) — Discrete code indices computed using model.encode.
  • bandwidth (float, optional) — Target bandwidth in kbps. Must be one of config.target_bandwidths. Defaults to the highest available bandwidth.
  • return_dict (bool, optional) — Whether to return a HiggsAudioV2TokenizerOutput instead of a plain tuple.

Returns

HiggsAudioV2TokenizerOutput or tuple (audio_codes, audio_values)

  • audio_codes of shape (batch_size, num_quantizers, codes_length): the quantized discrete codes.
  • audio_values of shape (batch_size, channels, num_samples): the reconstructed audio waveform given the codes.

The HiggsAudioV2TokenizerModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.

Example:

>>> from datasets import Audio, load_dataset
>>> from transformers import AutoFeatureExtractor, HiggsAudioV2TokenizerModel

>>> model_id = "hf-audio/higgs_audio_v2_tokenizer-hubert-librispeech"
>>> model = HiggsAudioV2TokenizerModel.from_pretrained(model_id)
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio_sample = dataset[0]['audio']['array']

>>> inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

>>> outputs = model(**inputs)
>>> audio_codes = outputs.audio_codes
>>> audio_values = outputs.audio_values
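
When return_dict=False, the same two fields come back as a plain tuple (a sketch continuing the example above):

>>> audio_codes, audio_values = model(**inputs, return_dict=False)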