Model Card for Voicera

Voicera is an autoregressive (AR) text-to-speech model trained on roughly 1,000 hours of speech data. Speech is converted to discrete tokens using the "Multi-Scale Neural Audio Codec (SNAC)" model. NB: This is not a SOTA model, and it is not accurate enough for production use cases.

Model Details

Model Description

"Voicera" is a text-to-speech (TTS) model designed for generating speech from written text. It uses a GPT-2 type architecture, which helps in creating natural and expressive speech. The model converts audio into tokens using the "Multi-Scale Neural Audio Codec (SNAC)" model, allowing it to understand and produce speech sounds. Voicera aims to provide clear and understandable speech, focusing on natural pronunciation and intonation. It's a project to explore TTS technology and improve audio output quality.

Developed by: Lwasinam Dilli
Funded by: Lwasinam Dilli
Model type: GPT2-Transformer architecture
License: Free and Open to use I guess :)

Model Sources

Repository: Github
Paper [optional]: [More Information Needed]
Demo: Demos

How to Get Started with the Model

There are three models: the base model and two fine-tuned versions, one on the Jenny dataset and one on the Expresso dataset. The Jenny fine-tune is currently the best of the three. Here are the Colab links for all three, respectively.

Training Details

Training Data

Training data consists of clean subsets of the Hifi, LibriSpeech, LibriTTS, and Globe datasets.

Training Procedure

During training, audio tokens are generated by the SNAC model and concatenated with the text tokens. The whole sequence is trained autoregressively, but since only the audio tokens matter for synthesis, the loss on text tokens is down-weighted by 0.1.
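The weighting described above can be sketched in plain Python. This is a minimal illustration, not the actual training code: the helper names `build_sequence` and `weighted_token_loss` are hypothetical, the token ids are made up, and the sketch assumes that "0.1" means the per-token loss at text positions is multiplied by 0.1 before averaging. It also assumes the per-token losses are already aligned with the tokens they score.

```python
def build_sequence(text_ids, audio_ids):
    """Concatenate text and audio tokens; also return a mask marking text positions."""
    sequence = list(text_ids) + list(audio_ids)
    is_text = [True] * len(text_ids) + [False] * len(audio_ids)
    return sequence, is_text


def weighted_token_loss(token_losses, is_text, text_weight=0.1):
    """Average per-token losses, scaling text positions by text_weight.

    Assumption: the weight is multiplicative and the mean is taken over all tokens.
    """
    if len(token_losses) != len(is_text):
        raise ValueError("token_losses and is_text must have the same length")
    if not token_losses:
        raise ValueError("at least one token loss is required")
    weighted = [
        loss * (text_weight if text else 1.0)
        for loss, text in zip(token_losses, is_text)
    ]
    return sum(weighted) / len(weighted)


# Illustrative usage with made-up ids and losses.
sequence, is_text = build_sequence(text_ids=[5, 9], audio_ids=[101, 102, 103])
per_token_losses = [2.0, 2.0, 1.0, 1.0, 1.0]
loss = weighted_token_loss(per_token_losses, is_text)
```

In a real training loop the per-token losses would typically come from an unreduced cross-entropy over the model's next-token predictions; the weighting step itself stays the same.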

Preprocessing

Hugging Face had pretty much all the datasets I needed. I just had to filter out audio clips longer than 10 seconds due to compute constraints.
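The duration filter could look roughly like the sketch below. It assumes each example carries its decoded audio as `example["audio"]["array"]` with a matching `example["audio"]["sampling_rate"]`; the function name `keep_short_clip` is hypothetical, and the actual preprocessing script may differ.

```python
MAX_SECONDS = 10.0  # cutoff stated above


def clip_duration_seconds(example):
    """Duration of one example, computed from sample count and sampling rate."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def keep_short_clip(example, max_seconds=MAX_SECONDS):
    """Return True for clips at or under the cutoff, False for longer ones."""
    return clip_duration_seconds(example) <= max_seconds


# Illustrative usage on plain dictionaries standing in for dataset rows.
rows = [
    {"audio": {"array": [0.0] * (16000 * 5), "sampling_rate": 16000}},
    {"audio": {"array": [0.0] * (16000 * 12), "sampling_rate": 16000}},
]
short_rows = [row for row in rows if keep_short_clip(row)]
```

With the Hugging Face `datasets` library, a predicate like this can be passed to `Dataset.filter`, which keeps only the rows for which it returns True.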

Training Hyperparameters

Weight decay: 0.1
Batch size: 1, with gradient accumulation over 32 steps (effective batch size of 32)
Scheduler: CosineAnnealingWarmRestarts with a minimum learning rate of 1e-7 and a warm-restart period of 500 steps
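To make the schedule concrete, here is a small sketch of the cosine-annealing-with-warm-restarts formula using the two values given above (minimum learning rate 1e-7, restart every 500 steps). The peak learning rate is not stated in this card, so `base_lr=1e-4` below is purely a placeholder assumption, and the function name is hypothetical. Whether a "step" is a micro-batch or an optimizer update is also not specified here.

```python
import math


def cosine_warm_restart_lr(step, base_lr, eta_min=1e-7, t_0=500):
    """Learning rate at a given step for cosine annealing with fixed-period restarts.

    The rate starts at base_lr, decays along a half cosine toward eta_min,
    then jumps back to base_lr every t_0 steps.
    """
    t_cur = step % t_0  # position inside the current restart cycle
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0)) / 2


def effective_batch_size(batch_size=1, accumulation_steps=32):
    """Number of examples contributing to each optimizer update."""
    return batch_size * accumulation_steps


# Illustrative usage; base_lr is an assumed placeholder, not a value from this card.
schedule = [cosine_warm_restart_lr(step, base_lr=1e-4) for step in range(1000)]
```

In PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` with `T_0=500` and `eta_min=1e-7`.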

Evaluation

I should probably work on this. So far, the loss went down and the output got better :)

Results

Check out the demo page here -> Demo

Summary

Hardware Type: Tesla P100
Hours used: 300+hrs
Cloud Provider: Kaggle :)

Citation [optional]

BibTeX:

@software{Betker_TorToiSe_text-to-speech_2022,
author = {Betker, James},
month = apr,
title = {{TorToiSe text-to-speech}},
url = {https://github.com/neonbjb/tortoise-tts},
version = {2.0},
year = {2022}
}

@software{Siuzdak_SNAC_Multi-Scale_Neural_2024,
author = {Siuzdak, Hubert},
month = feb,
title = {{SNAC: Multi-Scale Neural Audio Codec}},
url = {https://github.com/hubertsiuzdak/snac},
year = {2024}
}

Model Card Authors [optional]

Lwasinam Dilli
