Transformers documentation
Higgs Audio V2 Tokenizer
This model was released on 2025-07-22 and added to Hugging Face Transformers on 2026-02-19.
Overview
- Low Frame Rate: At 25 fps, our tokenizer halves the frame rate of many baselines while still maintaining high audio quality (see the quick check after this list).
- Unified 24 kHz Training: We mix speech, music, and sound-event clips in one model, capturing both semantic and acoustic details, which greatly simplifies the training of audio language models.
- Fast Inference: By avoiding diffusion steps, our encoder/decoder processes batches quickly, making it practical for real-time or large-scale tasks.
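As a quick sanity check of the frame-rate claim, here is a sketch assuming the hop length is simply sampling_rate / frame_rate (the numbers below are illustrative, not taken from the model code):

# at 24 kHz and 25 fps, each code frame covers sampling_rate / frame_rate samples
sampling_rate = 24_000  # Hz, the tokenizer's native rate
frame_rate = 25  # code frames per second, per the overview above
hop_length = sampling_rate // frame_rate  # 960 samples per frame (assumed)
duration_s = 3.2
num_samples = int(duration_s * sampling_rate)
print(num_samples // hop_length)  # 80 code frames for a 3.2 s clip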
Model Architecture:

Usage
from transformers import HiggsAudioV2TokenizerModel, AutoFeatureExtractor
from datasets import load_dataset, Audio
# load model and feature extractor
model_id = "eustlb/higgs-audio-v2-tokenizer"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HiggsAudioV2TokenizerModel.from_pretrained(model_id, device_map="auto")
# load audio sample
dummy_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dummy_dataset = dummy_dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = dummy_dataset[-1]["audio"]["array"]
inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")
# encode and decode
encoder_outputs = model.encode(inputs["input_values"])
decoder_outputs = model.decode(encoder_outputs.audio_codes)
audio_values = decoder_outputs.audio_values
# or the equivalent with a forward pass
audio_values = model(inputs["input_values"]).audio_values
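To listen to the result, the reconstructed waveform can be written to disk. A minimal sketch, assuming the soundfile package is installed (it is not a transformers dependency):

import soundfile as sf

# audio_values has shape (batch_size, channels, num_samples); take the first clip as 1-D
sf.write(
    "reconstruction.wav",
    audio_values[0, 0].detach().cpu().numpy(),
    samplerate=feature_extractor.sampling_rate,
)

HiggsAudioV2TokenizerConfig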
class transformers.HiggsAudioV2TokenizerConfig
< source >( target_bandwidths = [0.5, 1, 1.5, 2] sample_rate = 24000 kernel_size = 3 channel_ratios = [1, 1] strides = [1, 1] block_dilations = [1, 1] unit_kernel_size = 3 codebook_size = 1024 codebook_dim = 64 initializer_range = 0.02 acoustic_model_config = None semantic_model_config = None semantic_sample_rate = 16000 downsample_factor = 320 **kwargs )
Parameters
- target_bandwidths (List[float], optional, defaults to [0.5, 1, 1.5, 2]) — The range of different bandwidths (in kbps) the model can encode audio with.
- sample_rate (int, optional, defaults to 24000) — The sampling rate at which the audio waveform should be digitized, in hertz (Hz).
- kernel_size (int, optional, defaults to 3) — Kernel size for the initial semantic convolution.
- channel_ratios (List[float], optional, defaults to [1, 1]) — Expansion factors for the number of output channels in each semantic block.
- strides (List[int], optional, defaults to [1, 1]) — Strides for each semantic encoder block.
- block_dilations (List[int], optional, defaults to [1, 1]) — Dilation factors for the residual units in semantic blocks.
- unit_kernel_size (int, optional, defaults to 3) — Kernel size inside each ResidualUnit in semantic blocks.
- codebook_size (int, optional, defaults to 1024) — Number of entries in each residual quantizer's codebook.
- codebook_dim (int, optional, defaults to 64) — Dimensionality of each codebook vector.
- initializer_range (float, optional, defaults to 0.02) — Standard deviation of the truncated normal initializer for all weight matrices.
- acoustic_model_config (Union[Dict, AutoConfig], optional) — An instance of the configuration for the acoustic (DAC) model.
- semantic_model_config (Union[Dict, AutoConfig], optional) — An instance of the configuration for the semantic (HuBERT) model.
- semantic_sample_rate (int, optional, defaults to 16000) — The sampling rate at which the semantic model expects audio input, in hertz (Hz).
- downsample_factor (int, optional, defaults to 320) — Downsampling factor for the semantic features.
This is the configuration class to store the configuration of a HiggsAudioV2TokenizerModel. It is used to instantiate a
HiggsAudioV2Tokenizer model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the Higgs Audio v2 Tokenizer, e.g. bosonai/higgs-audio-v2-tokenizer.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
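Individual defaults can also be overridden at construction time; a minimal sketch with arbitrary, illustrative values (not a recommended configuration):
>>> from transformers import HiggsAudioV2TokenizerConfig
>>> custom_config = HiggsAudioV2TokenizerConfig(codebook_size=2048, codebook_dim=128)
>>> custom_config.codebook_size
2048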
Example:
>>> from transformers import HiggsAudioV2TokenizerModel, HiggsAudioV2TokenizerConfig
>>> # Initializing configuration
>>> configuration = HiggsAudioV2TokenizerConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = HiggsAudioV2TokenizerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
HiggsAudioV2TokenizerModel
class transformers.HiggsAudioV2TokenizerModel
< source >( config )
Parameters
- config (HiggsAudioV2TokenizerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The HiggsAudioV2Tokenizer neural audio codec model.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
decode
< source >( audio_codes: Tensor return_dict: bool | None = None )
Parameters
- audio_codes (torch.LongTensor of shape (batch_size, num_quantizers, codes_length)) — Discrete code indices computed using model.encode.
- return_dict (bool, optional) — Whether or not to return a ModelOutput.
encode
< source >( input_values: Tensor bandwidth: float | None = None return_dict: bool | None = None )
Parameters
- input_values (torch.FloatTensor of shape (batch_size, channels, num_samples)) — Float values of the input audio waveform.
- bandwidth (float, optional) — The target bandwidth in kbps. Only values in config.target_bandwidths are supported. Defaults to the highest available bandwidth.
- return_dict (bool, optional) — Whether or not to return a ModelOutput.
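A minimal round-trip sketch, assuming model and inputs have been prepared as in the Usage section above; the explicit bandwidth must be one of config.target_bandwidths (1.5 is among the defaults):

# encode at an explicit bandwidth, then decode the codes back to a waveform
encoder_outputs = model.encode(inputs["input_values"], bandwidth=1.5)
decoder_outputs = model.decode(encoder_outputs.audio_codes)
reconstruction = decoder_outputs.audio_values  # (batch_size, channels, num_samples)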
forward
< source >( input_values: Tensor audio_codes: torch.Tensor | None = None bandwidth: float | None = None return_dict: bool | None = None ) → HiggsAudioV2TokenizerOutput or tuple (audio_codes, audio_values)
Parameters
- input_values (torch.FloatTensor of shape (batch_size, channels, num_samples)) — The raw float values of the input audio waveform.
- audio_codes (torch.LongTensor of shape (batch_size, num_quantizers, codes_length), optional) — Discrete code indices computed using model.encode.
- bandwidth (float, optional) — Target bandwidth in kbps. Must be one of config.target_bandwidths. Defaults to the highest available bandwidth.
- return_dict (bool, optional) — Whether to return a HiggsAudioV2TokenizerOutput instead of a plain tuple.
Returns
HiggsAudioV2TokenizerOutput or tuple (audio_codes, audio_values)
- audio_codes of shape (batch_size, num_quantizers, codes_length): the quantized discrete codes.
- audio_values of shape (batch_size, channels, num_samples): the reconstructed audio waveform given the codes.
The HiggsAudioV2TokenizerModel forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from datasets import load_dataset, Audio
>>> from transformers import AutoFeatureExtractor, HiggsAudioV2TokenizerModel
>>> model_id = "eustlb/higgs-audio-v2-tokenizer"
>>> model = HiggsAudioV2TokenizerModel.from_pretrained(model_id)
>>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
>>> audio_sample = dataset[0]['audio']['array']
>>> inputs = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")
>>> outputs = model(**inputs)
>>> audio_codes = outputs.audio_codes
>>> audio_values = outputs.audio_values
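As a rough follow-on check (not part of the official example), the reconstruction can be compared against the input over their overlapping length, since padding may make the two lengths differ slightly:

>>> import torch
>>> n = min(audio_values.shape[-1], inputs["input_values"].shape[-1])
>>> torch.mean((audio_values[..., :n] - inputs["input_values"][..., :n]) ** 2)  # rough reconstruction error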