metadata
title: BARK Text to Audio with Batch Inference
emoji: 🪄
colorFrom: purple
colorTo: pink
sdk: gradio
python_version: 3.10.13
sdk_version: 5.23.3
suggested_hardware: cpu-upgrade
suggested_storage: small
app_file: app.py
short_description: Generate natural sounding speech audio from text
pinned: true
startup_duration_timeout: 45m
tags:
  - text-to-audio
  - gradio
  - bark
preload_from_hub:
  - suno/bark

Generate Audio from text and clone voice with BARK

Generate natural-sounding speech from text and clone a voice from an audio sample. Cloning works with any voice, but the results are not perfect.

The code was tested on Python 3.12. It may also work on other versions.

Example generated audio is available in the /assets/audio folder.

Features

  • Text-to-Audio Generation: Generate speech from text using the BARK model (supports 'small' and 'large' variants).
  • Parameter Control: Adjust semantic, coarse, and fine temperature settings for generation diversity. Set a generation seed for reproducibility.
  • Device Selection: Run inference on available devices (CPU, CUDA, MPS).
  • Standard Voice Prompts: Utilize built-in BARK voice prompts (.npz files) located in the bark_prompts directory.
  • Custom Voice Prompt Creation (Voice Cloning):
    • Upload your own audio file (.wav, .mp3).
    • Generate a BARK-compatible semantic prompt (.npz file) using a custom-trained HuBERT model.
    • The generated prompt appears in the "Select Voice Prompt" dropdown for immediate use.
  • Audio Management: View, play, and delete generated audio files directly within the interface.
  • Training Scripts: Includes scripts to generate the necessary dataset (generate_audio_semantic_dataset.py) and train the custom HuBERT model (train_hubert.py).
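To make the temperature and seed settings concrete, here is a minimal, standard-library sketch of temperature-scaled sampling. It is not the code BARK or app.py uses; the function names and the toy logits are assumptions chosen for illustration. It only shows the general idea: a lower temperature concentrates probability on the top-scoring token, and a fixed seed makes the sampled sequence repeatable.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(logits, temperature, seed, count):
    """Draw `count` token ids from one distribution using a seeded RNG."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    ids = list(range(len(logits)))
    return rng.choices(ids, weights=probs, k=count)


# Hypothetical scores for a four-token vocabulary (illustration only).
toy_logits = [2.0, 1.0, 0.5, 0.1]

cool = softmax_with_temperature(toy_logits, 0.5)  # sharper distribution
warm = softmax_with_temperature(toy_logits, 1.5)  # flatter distribution

# The same seed yields the same sequence of sampled token ids.
same_seed_a = sample_tokens(toy_logits, 0.7, seed=42, count=8)
same_seed_b = sample_tokens(toy_logits, 0.7, seed=42, count=8)
```

In this sketch `cool[0]` is larger than `warm[0]`, which is why lower temperatures tend to give more predictable output and higher ones more varied output.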

Custom Voice Cloning Model

The core of the custom voice prompt generation relies on a fine-tuned HuBERT model.

  • Model: sleeper371/hubert-for-bark-semantic on Hugging Face
  • Architecture: This model uses a HuBERT base feature extractor followed by a Transformer decoder head.
  • Training: It was trained on over 4700 sentence pairs, mapping audio waveforms to the semantic tokens generated by BARK's semantic model. The training used a cross-entropy loss objective.
  • Dataset: The training dataset is available at sleeper371/bark-wave-semantic on Hugging Face.
  • Comparison: This approach is inspired by projects like gitmylo/bark-data-gen, but differs in the head architecture: gitmylo used an LSTM head, while I used a Transformer decoder head.
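As a rough illustration of the cross-entropy objective mentioned above, the sketch below computes the mean negative log-likelihood of target tokens given per-frame logits, using only the standard library. It is not taken from train_hubert.py; the function name, the four-token vocabulary, and the toy values are assumptions, and BARK's real semantic vocabulary is much larger.

```python
import math


def cross_entropy(logits_per_frame, target_tokens):
    """Mean negative log-likelihood of the target token at each frame."""
    if len(logits_per_frame) != len(target_tokens):
        raise ValueError("need one target token per frame")
    total = 0.0
    for logits, target in zip(logits_per_frame, target_tokens):
        peak = max(logits)  # log-sum-exp trick for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[target]
    return total / len(target_tokens)


# Toy example: 3 frames over a hypothetical 4-token vocabulary.
targets = [0, 2, 1]
confident = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
]
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in targets]

loss_confident = cross_entropy(confident, targets)  # small: predictions match targets
loss_uniform = cross_entropy(uniform, targets)      # equals log(4): no information
```

Training pushes the model from the uniform case toward the confident one, so that the predicted token at each audio frame matches the token BARK's semantic model produced.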

Setup and Installation

Follow these steps to set up the environment and run the application.

  1. Clone the Repository:

  2. Create a Virtual Environment: It's highly recommended to use a virtual environment to manage dependencies.

    # For Linux/macOS
    python3 -m venv venv
    source venv/bin/activate
    
    # For Windows
    python -m venv venv
    .\venv\Scripts\activate
    
  3. Install Requirements: Make sure you have a requirements.txt file in the repository root containing all necessary packages (e.g., gradio, torch, transformers, soundfile, etc.).

    pip install -r requirements.txt
    

Running the Application

Once the setup is complete, run the Gradio application:

python app.py

This will launch the Gradio interface, typically accessible at http://127.0.0.1:7860 in your web browser. The console output will provide the exact URL.

Training Your Own Custom HuBERT Model

If you want to train your own HuBERT model for voice cloning:

  1. Generate Dataset:
    • Use the generate_audio_semantic_dataset.py script.
  2. Train the Model:
    • Use the train_hubert.py script.
    • This script takes the generated dataset (audio paths and semantic token paths) to fine-tune a HuBERT model with a Transformer decoder head.
    • Configure training parameters (batch size, learning rate, epochs, output directory) within the script or via command-line arguments (if implemented).
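The training step consumes pairs of audio paths and semantic token paths. The sketch below shows one way such pairs could be collected by matching file stems. The directory names (`audio`, `semantic`), the file extensions, and the helper `pair_audio_with_tokens` are hypothetical; the layout actually produced by generate_audio_semantic_dataset.py may differ.

```python
from pathlib import Path
import tempfile


def pair_audio_with_tokens(audio_dir, token_dir, audio_ext=".wav", token_ext=".npy"):
    """Return sorted (audio_path, token_path) pairs that share a file stem."""
    tokens = {p.stem: p for p in Path(token_dir).glob(f"*{token_ext}")}
    pairs = []
    for audio in sorted(Path(audio_dir).glob(f"*{audio_ext}")):
        match = tokens.get(audio.stem)
        if match is not None:  # skip audio files that have no token file
            pairs.append((audio, match))
    return pairs


# Demo on a throwaway directory filled with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    audio_dir = Path(root, "audio")
    token_dir = Path(root, "semantic")
    audio_dir.mkdir()
    token_dir.mkdir()
    for stem in ("a", "b", "c"):
        (audio_dir / f"{stem}.wav").touch()
    for stem in ("a", "c"):
        (token_dir / f"{stem}.npy").touch()

    pairs = pair_audio_with_tokens(audio_dir, token_dir)
    paired_stems = [audio.stem for audio, _ in pairs]
```

Dropping unmatched files up front keeps the training loop from failing partway through an epoch on a missing token file.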

License

MIT

Acknowledgements

  • Suno AI, for training the BARK models

  • gitmylo, for inspiring me to use HuBERT to predict semantic tokens from audio