Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model
📄 Paper: Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model (arXiv:2506.08967)
Model Overview
Step-Audio-AQAA is a fully end-to-end Large Audio-Language Model (LALM) designed for Audio Query-Audio Answer (AQAA) tasks. It directly processes audio inputs and generates natural, accurate speech responses without relying on traditional ASR and TTS modules, eliminating cascading errors and simplifying the system architecture.
Key Capabilities
- Fully End-to-End Audio Interaction: Generates speech outputs directly from raw audio inputs without ASR/TTS intermediates.
- Fine-Grained Voice Control: Supports sentence-level adjustments of emotional tone, speech rate, and other vocal features.
- Multilingual & Dialect Support: Covers Chinese (including the Sichuanese and Cantonese dialects), English, Japanese, and more.
- Complex Task Handling: Excels in speech emotion control, role-playing, logical reasoning, and other complex audio interactions.
Model Architecture
Step-Audio-AQAA consists of three core modules:
Dual-Codebook Audio Tokenizer
- Linguistic Tokenizer: Based on the Paraformer encoder; extracts phonemic and linguistic attributes with a 1,024-entry codebook at 16.7 Hz.
- Semantic Tokenizer: Modeled on CosyVoice 1.0; captures coarse acoustic features with a 4,096-entry codebook at 25 Hz.
- Temporal Alignment: Interleaves the two streams at a 2:3 ratio (two linguistic tokens per three semantic tokens, matching the 16.7 Hz and 25 Hz rates) to keep them temporally consistent; see the sketch below.
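As a rough illustration of the 2:3 interleaving, the snippet below merges the two token streams chunk by chunk. The function name and chunk handling are assumptions for illustration, not the released implementation:

```python
def interleave_2_3(linguistic: list[int], semantic: list[int]) -> list[int]:
    """Merge dual-codebook streams: 2 linguistic tokens (16.7 Hz),
    then 3 semantic tokens (25 Hz), repeated until both run out."""
    merged: list[int] = []
    li = si = 0
    while li < len(linguistic) or si < len(semantic):
        merged.extend(linguistic[li:li + 2])  # next 2 linguistic tokens
        merged.extend(semantic[si:si + 3])    # next 3 semantic tokens
        li += 2
        si += 3
    return merged
```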
Backbone LLM
- Parameter Scale: 130-billion-parameter multi-modal LLM (Step-Omni).
- Architecture: Decoder-only with Transformer blocks, RMSNorm layers, and grouped query attention.
- Vocabulary Expansion: Adds 5,120 audio tokens (the 1,024 linguistic plus 4,096 semantic codebook entries) to the text vocabulary, enabling text-audio interleaved output; see the sketch below.
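A minimal sketch of the vocabulary expansion, assuming a plain PyTorch embedding table; initializing the new audio-token rows from the mean of existing embeddings is a common heuristic, not something the paper specifies:

```python
import torch
import torch.nn as nn

def expand_vocab(embedding: nn.Embedding, n_new: int = 5120) -> nn.Embedding:
    """Append rows for new audio tokens to an existing text embedding table."""
    old_vocab, dim = embedding.weight.shape
    new_emb = nn.Embedding(old_vocab + n_new, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = embedding.weight         # keep text rows
        new_emb.weight[old_vocab:] = embedding.weight.mean(0)  # init audio rows
    return new_emb
```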
Neural Vocoder
- Architecture: Flow-matching model based on CosyVoice, using U-Net and ResNet-1D layers.
- Conditional Generation: Produces high-fidelity speech waveforms conditioned solely on audio tokens (a generic sampling sketch follows).
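To make the conditional generation step concrete, here is a generic Euler-integration sketch of flow-matching sampling; `velocity_net`, its signature, and the tensor shapes are assumptions for illustration, not the CosyVoice-based vocoder's actual interface:

```python
import torch

@torch.no_grad()
def flow_matching_sample(velocity_net, audio_tokens,
                         n_steps: int = 10, n_frames: int = 200, mel_dim: int = 80):
    """Integrate a learned velocity field from noise toward mel frames,
    conditioned on audio tokens; velocity_net(x_t, t, cond) -> velocity."""
    x = torch.randn(1, n_frames, mel_dim)      # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    for i in range(n_steps):
        v = velocity_net(x, ts[i].expand(1), audio_tokens)  # predicted velocity
        x = x + (ts[i + 1] - ts[i]) * v                     # Euler step
    return x  # mel frames for the waveform head

# Usage with a stand-in network (the real model would be the trained vocoder):
# velocity_net = lambda x, t, cond: torch.zeros_like(x)
# mel = flow_matching_sample(velocity_net, torch.zeros(1, 50, dtype=torch.long))
```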
Training Approach
Multi-Stage Training Pipeline
- Pretraining: Multi-modal pretraining on text, audio, and image data.
- Supervised Fine-Tuning (SFT):
  - Stage 1: Full-parameter fine-tuning on the AQTA and AQTAA datasets.
  - Stage 2: Targeted optimization of specific capabilities with high-quality AQTAA data.
- Direct Preference Optimization (DPO): Masks audio tokens in the preference loss so alignment does not degrade speech generation (see the sketch after this list).
- Model Merging: Weighted combination of the SFT and DPO checkpoints to boost overall performance (also sketched below).
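The audio-token masking and the weighted merge can be sketched as follows. The loss follows standard DPO, with the mask zeroing audio positions so preference gradients act only on text tokens; names, shapes, beta, and the merge weight alpha are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def masked_seq_logp(token_logps: torch.Tensor, audio_mask: torch.Tensor) -> torch.Tensor:
    """Sum per-token log-probs, zeroing out audio-token positions
    (audio_mask is True where the token is an audio token)."""
    return (token_logps * ~audio_mask).sum()

def masked_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                    mask_chosen, mask_rejected, beta: float = 0.1) -> torch.Tensor:
    """DPO objective computed on masked sequence log-probs."""
    margin = beta * ((masked_seq_logp(pi_chosen, mask_chosen)
                      - masked_seq_logp(ref_chosen, mask_chosen))
                     - (masked_seq_logp(pi_rejected, mask_rejected)
                        - masked_seq_logp(ref_rejected, mask_rejected)))
    return -F.logsigmoid(margin)

def merge_checkpoints(sft_state: dict, dpo_state: dict, alpha: float = 0.5) -> dict:
    """Parameter-space weighted average of the SFT and DPO checkpoints;
    alpha is a placeholder, as the paper's merge weights are not given here."""
    return {k: alpha * sft_state[k] + (1.0 - alpha) * dpo_state[k] for k in sft_state}
```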
Training Data
- Multi-Modal Pretraining Data: 800 billion text tokens and audio-text interleaved data.
- AQTA Dataset: Audio query-text answer pairs.
- AQTAA Dataset: Audio query-text answer-audio answer triplets generated from AQTA (a possible record layout is sketched below).
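One way to picture the two dataset formats; field names and types are hypothetical, chosen only to show how AQTAA extends AQTA:

```python
from dataclasses import dataclass

@dataclass
class AQTASample:
    """Audio query-text answer pair."""
    audio_query: list[int]   # dual-codebook token ids of the spoken query
    text_answer: str

@dataclass
class AQTAASample(AQTASample):
    """AQTA extended with a speech rendering of the answer."""
    audio_answer: list[int]  # dual-codebook token ids of the spoken answer
```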
Citation
```bibtex
@misc{huang2025stepaudioaqaa,
  title={Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model},
  author={Ailin Huang and Boyong Wu and Bruce Wang and Chao Yan and Chen Hu and Chengli Feng and Fei Tian and Feiyu Shen and Jingbei Li and Mingrui Chen and et al.},
  year={2025},
  eprint={2506.08967},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```
Team & Contributions
Step-Audio-AQAA is developed by the StepFun team, with contributions from multiple researchers and engineers. For technical support or collaboration, contact the corresponding authors: Daxin Jiang ([email protected]), Shuchang Zhou ([email protected]), Chen Hu ([email protected]).
License
This model is released under the Apache 2.0 license. For more details, please refer to the license file.