---
title: BARK Text to Audio with Batch Inference
emoji: 🪄
colorFrom: purple
colorTo: pink
sdk: gradio
python_version: 3.10.13
sdk_version: 5.23.3
suggested_hardware: cpu-upgrade
suggested_storage: small
app_file: app.py
short_description: Generate natural sounding speech audio from text
pinned: true
startup_duration_timeout: 45m
tags:
  - text-to-audio
  - gradio
  - bark
preload_from_hub:
  - suno/bark
---
# Generate Audio from Text and Clone Voices with BARK

Generate natural-sounding speech from text and clone any voice (the clone is not perfect) with the BARK model.

The code was developed on Python 3.12 and may also work on other versions. Example generated audio files are in the `assets/audio` folder.
## Features
- Text-to-Audio Generation: Generate speech from text using the BARK model (supports the `small` and `large` variants); see the sketch after this list.
- Parameter Control: Adjust the semantic, coarse, and fine temperature settings for generation diversity, and set a generation seed for reproducibility.
- Device Selection: Run inference on any available device (CPU, CUDA, MPS).
- Standard Voice Prompts: Use the built-in BARK voice prompts (`.npz` files) located in the `bark_prompts` directory.
- Custom Voice Prompt Creation (Voice Cloning):
  - Upload your own audio file (`.wav`, `.mp3`).
  - Generate a BARK-compatible semantic prompt (`.npz` file) using a custom-trained HuBERT model.
  - The generated prompt appears in the "Select Voice Prompt" dropdown for immediate use.
- Audio Management: View, play, and delete generated audio files directly within the interface.
- Training Scripts: Includes scripts to generate the necessary dataset (`generate_audio_semantic_dataset.py`) and train the custom HuBERT model (`train_hubert.py`).
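For reference, here is a minimal sketch of text-to-audio generation using the Hugging Face `transformers` implementation of BARK. This is not the app's actual code; the model variant, voice preset, and output path are illustrative.

```python
import torch
import scipy.io.wavfile
from transformers import AutoProcessor, BarkModel

# "suno/bark-small" is the small variant; use "suno/bark" for the large one
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

torch.manual_seed(42)  # fix the seed for reproducible generations

# voice_preset selects one of BARK's built-in voice prompts (.npz files)
inputs = processor("Hello, this is a BARK test.", voice_preset="v2/en_speaker_6")

audio_array = model.generate(**inputs)
audio_array = audio_array.cpu().numpy().squeeze()

sample_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_out.wav", rate=sample_rate, data=audio_array)
```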
## Custom Voice Cloning Model
Custom voice prompt generation relies on a fine-tuned HuBERT model.

- Model: [`sleeper371/hubert-for-bark-semantic`](https://huggingface.co/sleeper371/hubert-for-bark-semantic) on Hugging Face.
- Architecture: A HuBERT base feature extractor followed by a Transformer decoder head.
- Training: Trained on over 4,700 sentence pairs, mapping audio waveforms to the semantic tokens produced by BARK's semantic model, with a cross-entropy loss objective.
- Dataset: The training dataset is available on Hugging Face at [`sleeper371/bark-wave-semantic`](https://huggingface.co/datasets/sleeper371/bark-wave-semantic).
- Comparison: The approach is inspired by projects such as gitmylo/bark-data-gen, but differs in the head architecture (that project uses an LSTM head, while this one uses a Transformer decoder head).
## Setup and Installation

Follow these steps to set up the environment and run the application.

1. Clone the repository.

2. Create a virtual environment. It is highly recommended to use a virtual environment to manage dependencies.

   ```bash
   # For Linux/macOS
   python3 -m venv venv
   source venv/bin/activate

   # For Windows
   python -m venv venv
   .\venv\Scripts\activate
   ```

3. Install the requirements. Make sure you have a `requirements.txt` file in the repository root containing all necessary packages (e.g., `gradio`, `torch`, `transformers`, `soundfile`).

   ```bash
   pip install -r requirements.txt
   ```
## Running the Application

Once setup is complete, run the Gradio application:

```bash
python app.py
```

This launches the Gradio interface, typically accessible at http://127.0.0.1:7860 in your web browser. The console output will print the exact URL.
## Training Your Own Custom HuBERT Model

If you want to train your own HuBERT model for voice cloning:

- Generate the dataset:
  - Use the `generate_audio_semantic_dataset.py` script (a sketch of what this step produces follows this list).
- Train the model:
  - Use the `train_hubert.py` script.
  - This script takes the generated dataset (audio paths and semantic token paths) and fine-tunes a HuBERT model with a Transformer decoder head.
  - Configure training parameters (batch size, learning rate, epochs, output directory) within the script or via command-line arguments (if implemented).
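As a rough illustration of the dataset-generation step (an assumption about what `generate_audio_semantic_dataset.py` does, not its actual code), each training pair consists of a waveform and the BARK semantic tokens for the same sentence. With the `bark` package this looks roughly like:

```python
import os
import numpy as np
import soundfile as sf
from bark import SAMPLE_RATE
from bark.api import semantic_to_waveform
from bark.generation import generate_text_semantic, preload_models

preload_models()
os.makedirs("dataset", exist_ok=True)

sentences = ["The quick brown fox jumps over the lazy dog."]  # your text corpus here

for i, text in enumerate(sentences):
    semantic_tokens = generate_text_semantic(text)  # BARK semantic tokens for the sentence
    audio = semantic_to_waveform(semantic_tokens)   # the matching waveform
    np.save(f"dataset/semantic_{i}.npy", semantic_tokens)
    sf.write(f"dataset/audio_{i}.wav", audio, SAMPLE_RATE)
```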
## License

MIT
## Acknowledgements

- Suno AI, for training and releasing the BARK models.
- gitmylo, whose work inspired the use of HuBERT to predict semantic tokens from audio.