Step-Audio-Tokenizer

Step-Audio LLM is the industry’s first 130-billion-parameter humanlike unified end-to-end model that integrates multimodal speech understanding and generation capabilities, including singing voice synthesis, tool utilization, role-play, and multilingual/dialectal comprehension and synthesis.

This repository provides the speech tokenizer component of Step-Audio LLM. For linguistic tokenization, we utilize the output from the Paraformer encoder, which is quantized into discrete representations at a token rate of 16.7 Hz. For semantic tokenization, we employ CosyVoice’s tokenizer, specifically designed to efficiently encode features essential for generating natural and expressive speech outputs, operating at a token rate of 25 Hz.
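
As a rough illustration of what the two token rates imply, the sketch below estimates how many tokens each stream would yield for a clip of a given length. It is plain arithmetic on the rates stated above, not the tokenizer's own code: the function name `estimate_token_counts` and the rounding choice are assumptions made for this example, and real counts may differ slightly because of framing and padding.

```python
# Token rates as stated in this model card.
LINGUISTIC_RATE_HZ = 16.7  # Paraformer-encoder-based linguistic tokens
SEMANTIC_RATE_HZ = 25.0    # CosyVoice-based semantic tokens


def estimate_token_counts(duration_s: float) -> dict:
    """Estimate token counts per stream for `duration_s` seconds of audio.

    Hypothetical helper for illustration only: it multiplies duration by
    each rate and rounds to the nearest integer.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return {
        "linguistic": round(duration_s * LINGUISTIC_RATE_HZ),
        "semantic": round(duration_s * SEMANTIC_RATE_HZ),
    }


if __name__ == "__main__":
    # A 10-second clip yields roughly 167 linguistic and 250 semantic tokens.
    print(estimate_token_counts(10.0))
```

Under these rates, the semantic stream carries about 1.5 times as many tokens as the linguistic stream for the same audio.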

More information

For more information, please refer to our repository: Step-Audio.

