---
license: apache-2.0
tags:
- music
- text2music
pipeline_tag: text-to-audio
language:
- en
- zh
- de
- fr
- es
- it
- pt
- pl
- tr
- ru
- cs
- nl
- ar
- ja
- hu
- ko
- hi
library_name: diffusers
---

# ACE-Step: A Step Towards Music Generation Foundation Model

## Model Description

ACE-Step is an open-source foundation model for music generation that overcomes key limitations of existing approaches through a holistic architectural design. It integrates diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer, achieving state-of-the-art performance in generation speed, musical coherence, and controllability.

**Key Features:**

- 15× faster than LLM-based baselines (20 s to generate 4 minutes of music on an A100)
- Superior musical coherence across melody, harmony, and rhythm
- Full-song generation, duration control, and natural-language prompts

## Uses

### Direct Use
ACE-Step can be used for:

- Generating original music from text descriptions
- Music remixing and style transfer
- Editing song lyrics

### Downstream Use
The model serves as a foundation for:

- Voice cloning applications
- Specialized music generation (rap, jazz, etc.)
- Music production tools
- Creative AI assistants

### Out-of-Scope Use
The model should not be used for:

- Generating copyrighted content without permission
- Creating harmful or offensive content
- Misrepresenting AI-generated music as human-created

## How to Get Started

See the GitHub repository: https://github.com/ace-step/ACE-Step

## Hardware Performance

| Device      | RTF (27 steps) | RTF (60 steps) |
|-------------|----------------|----------------|
| NVIDIA A100 | 27.27x         | 12.27x         |
| RTX 4090    | 34.48x         | 15.63x         |
| RTX 3090    | 12.76x         | 6.48x          |
| M2 Max      | 2.27x          | 1.03x          |

*Values are real-time factors (RTF): seconds of audio generated per second of compute. Higher values indicate faster generation.*
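
To make the table concrete, the RTF values can be converted into approximate generation times. The following is a minimal Python sketch, assuming RTF is the ratio of audio duration to generation time. The helper name `estimated_generation_seconds` and the `RTF` dictionary layout are illustrative only and are not part of ACE-Step.

```python
# RTF values copied from the table above, keyed by (device, diffusion steps).
RTF = {
    ("NVIDIA A100", 27): 27.27,
    ("NVIDIA A100", 60): 12.27,
    ("RTX 4090", 27): 34.48,
    ("RTX 4090", 60): 15.63,
    ("RTX 3090", 27): 12.76,
    ("RTX 3090", 60): 6.48,
    ("M2 Max", 27): 2.27,
    ("M2 Max", 60): 1.03,
}


def estimated_generation_seconds(audio_seconds: float, device: str, steps: int) -> float:
    """Estimate generation time, assuming RTF = audio duration / generation time."""
    return audio_seconds / RTF[(device, steps)]


if __name__ == "__main__":
    # Estimate the time for a 4-minute (240 s) track on each device at 60 steps.
    for (device, steps) in RTF:
        if steps == 60:
            seconds = estimated_generation_seconds(240, device, steps)
            print(f"{device}: ~{seconds:.1f} s at {steps} steps")
```

Under this assumption, a 4-minute track at 60 steps on an A100 takes 240 / 12.27, or roughly 20 seconds, which is consistent with the figure quoted under Key Features.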

## Limitations

- Performance varies by language (the top 10 languages perform best)
- Longer generations (>5 minutes) may lose structural coherence
- Rare instruments may not render perfectly
- Output inconsistency: results are highly sensitive to random seeds and input duration, leading to varied, "gacha-style" outputs
- Style-specific weaknesses: underperforms on certain genres (e.g. Chinese rap, `zh_rap`), with limited style adherence and a musicality ceiling
- Continuity artifacts: unnatural transitions in repainting and extend operations
- Vocal quality: coarse vocal synthesis that lacks nuance
- Control granularity: needs finer-grained control over musical parameters

## Ethical Considerations

Users should:

- Verify originality of generated works
- Disclose AI involvement
- Respect cultural elements and copyrights
- Avoid harmful content generation

## Model Details

- **Developed by:** ACE Studio and StepFun
- **Model type:** Diffusion-based music generation with transformer conditioning
- **License:** Apache 2.0
- **Resources:**
  - [Project Page](https://ace-step.github.io/)
  - [Demo Space](https://huggingface.co/spaces/ACE-Step/ACE-Step)
  - [GitHub Repository](https://github.com/ACE-Step/ACE-Step)

## Citation

```bibtex
@misc{gong2025acestep,
  title={ACE-Step: A Step Towards Music Generation Foundation Model},
  author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step}},
  year={2025},
  note={GitHub repository}
}
```

## Acknowledgements

This project is co-led by ACE Studio and StepFun.