Elastic model: MusicGen Large. Fast and flexible models for self-hosted serving.
Elastic models are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA allows you to control model size, latency and quality with a simple slider movement. For each model, ANNA produces a series of optimized models:
- XL: Mathematically equivalent neural network (the original facebook/musicgen-large), optimized with our DNN compiler.
- L: Near-lossless model, with minimal degradation on the corresponding audio quality benchmarks.
- M: Faster model, with minor and acceptable accuracy degradation.
- S: The fastest model, with slight accuracy degradation.
- Original: The original facebook/musicgen-large model from Hugging Face, without QLIP compilation.
Goals of elastic models:
- Provide flexibility in cost vs quality selection for inference
- Provide clear quality and latency benchmarks for audio generation
- Provide the interface of HF libraries (transformers and elastic_models) with a single line of code change for using optimized versions; see the import sketch after this list.
- Provide models supported on a wide range of hardware (NVIDIA GPUs), pre-compiled and requiring no JIT.
- Provide the best models and service for self-hosting.
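To illustrate the single-line change, here is a minimal sketch. The mode argument mirrors the full inference example later in this card; the exact set of keyword arguments accepted by elastic_models may differ between releases, so treat this as an assumption rather than a definitive API reference.

```python
# Baseline Hugging Face import:
# from transformers import MusicgenForConditionalGeneration

# Elastic version: only the import changes; `mode` selects the S/M/L/XL variant.
from elastic_models.transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-large",
    mode="S",  # "S", "M", "L", or "XL"
)
```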
It's important to note that specific quality degradation can vary. We aim for S models to retain high perceptual quality. The "Original" in tables refers to the non-compiled Hugging Face model, while "XL" is the compiled original. S, M, L are ANNA-quantized and compiled.
Audio Examples
Below are a few examples demonstrating the audio quality of the different Elastic MusicGen Large versions. These samples were generated on an NVIDIA H100 GPU with a duration of 20 seconds each. For a more comprehensive set of examples and interactive demos, please visit musicgen.thestage.ai.
Prompt: "Calm lofi hip hop track with a simple piano melody and soft drums" (Audio: 20 seconds, H100 GPU)
Audio samples for the S, M, L, XL (compiled original), and Original (HF non-compiled) versions are available on the demo page linked above.
Inference
To run inference with our MusicGen models, use the elastic_models.transformers.MusicgenForConditionalGeneration class. If you have compiled engines, provide the path to them; otherwise, for non-compiled or original models, you can use the standard Hugging Face transformers.MusicgenForConditionalGeneration (a baseline sketch with the stock library follows the example below).
Example using elastic_models with a compiled model:
import torch
import scipy.io.wavfile
from transformers import AutoProcessor
from elastic_models.transformers import MusicgenForConditionalGeneration

# Model, elastic mode (S/M/L/XL), prompt, and output settings.
model_name_hf = "facebook/musicgen-large"
elastic_mode = "S"
prompt = "A groovy funk bassline with a tight drum beat"
output_wav_path = "generated_audio_elastic_S.wav"
hf_token = "YOUR_TOKEN"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the text processor and the elastic (ANNA-optimized) model.
processor = AutoProcessor.from_pretrained(model_name_hf, token=hf_token)
model = MusicgenForConditionalGeneration.from_pretrained(
    model_name_hf,
    token=hf_token,
    torch_dtype=torch.float16,
    mode=elastic_mode,
    device=device,
).to(device)
model.eval()

# Tokenize the text prompt.
inputs = processor(
    text=[prompt],
    padding=True,
    return_tensors="pt",
).to(device)

print(f"Generating audio for: {prompt}...")
generate_kwargs = {
    "do_sample": True,
    "guidance_scale": 3.0,
    "max_new_tokens": 256,  # roughly 5 seconds of audio
    "cache_implementation": "paged",
}
audio_values = model.generate(**inputs, **generate_kwargs)

# Convert the generated waveform to float32 and write it as a WAV file
# at the audio encoder's sampling rate.
audio_values_np = audio_values.to(torch.float32).cpu().numpy().squeeze()
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write(output_wav_path, rate=sampling_rate, data=audio_values_np)
print(f"Audio saved to {output_wav_path}")
System requirements:
- GPUs: NVIDIA H100, L40S.
- CPU: AMD, Intel
- Python: 3.8-3.11 (check dependencies for specific versions)
To work with our elastic models and compilation tools, you'll need to install the elastic_models and qlip libraries from TheStage:
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple
pip install flash-attn==2.7.3 --no-build-isolation
pip uninstall apex
Then go to app.thestage.ai, log in, and generate an API token from your profile page. Set up the API token as follows:
thestage config set --api-token <YOUR_API_TOKEN>
Congrats, now you can use accelerated models and tools!
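Optionally, run a quick sanity check on the environment. This only verifies that a CUDA device is visible and that the elastic_models package imports; it is not an official verification step.

```python
import torch

# Confirm a supported NVIDIA GPU is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Confirm the MusicGen wrapper imports cleanly from elastic_models.
from elastic_models.transformers import MusicgenForConditionalGeneration  # noqa: F401

print("elastic_models import OK")
```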
Benchmarks
Benchmarking is one of the most important procedures during model acceleration. We aim to provide clear performance metrics for MusicGen models using our algorithms.
The Original column in the latency benchmarks refers to the non-compiled Hugging Face facebook/musicgen-large model, while XL denotes the original model compiled without ANNA quantization.
Latency benchmarks (Tokens Per Second - TPS)
Performance for generating audio (decoder stage, max_new_tokens = 256, roughly 5 seconds of audio).
GPU Type | S | M | L | XL (Compiled Original) | Original (HF, non-compiled) |
---|---|---|---|---|---|
H100 | 122.75 | 124.70 | 126.21 | 126.71 | 45.33 |
L40S | 96.74 | 90.90 | 86.51 | 83.31 | 44.69 |
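As a rough way to reproduce the decoder TPS numbers on your own hardware, the sketch below times generation and divides the requested token count by the elapsed time; it assumes the model, inputs, and generate_kwargs from the inference example above and keeps warm-up and averaging minimal.

```python
import time
import torch

# Warm-up run so one-time setup costs do not skew the measurement.
_ = model.generate(**inputs, **generate_kwargs)
torch.cuda.synchronize()

# Timed run: TPS = generated tokens / elapsed seconds.
start = time.perf_counter()
_ = model.generate(**inputs, **generate_kwargs)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

tps = generate_kwargs["max_new_tokens"] / elapsed
print(f"Decoder throughput: {tps:.2f} tokens/s")
```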
Performance by Batch Size
Batch Size 16:
GPU Type | S Mode (TPS) | XL Mode (TPS) |
---|---|---|
H100 | 94.21 | 97.96 |
L40S | 69.66 | 63.19 |
Batch Size 32:
GPU Type | S Mode (TPS) | XL Mode (TPS) |
---|---|---|
H100 | 77.15 | 76.64 |
L40S | 54.81 | 51.34 |
Note: Currently deployed models support only batch size = 1. Expect upcoming updates for larger batch size support.
As the results show, smaller batch sizes deliver higher per-token throughput, which is typical for inference workloads.
Links
- Platform: app.thestage.ai
- Subscribe for updates: TheStageAI X (Twitter)
- Contact email: [email protected]
Model tree for TheStageAI/Elastic-musicgen-large
- Base model: facebook/musicgen-large