---
# library_name:
license: apache-2.0
language:
- en
base_model:
- StyleTTS
pipeline_tag: text-to-speech
---

# Nigerian Accented Text to Speech Model

## Table of Contents

1. [Model Summary](#model-summary)
2. [Model Description](#model-description)
3. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
   - [Recommendations](#recommendations)
4. [Speech Samples](#speech-samples)
5. [Training](#training)
6. [Citation](#citation)
7. [Credits & References](#credits--references)

## Model Summary

This text-to-speech (TTS) model (v1) synthesizes Nigerian-accented English, offering high-quality, natural speech for applications such as narration and voice cloning.

## Demo

## Common Issues

- High-pitched background noise: caused by numerical floating-point differences.
- Slow generation: speed depends on the compute used to run inference.

#### How to use in Colab

This model can generate realistic audio when the inference function is called with suitable arguments. The model ships with three voices:

- Andy
- Ben
- Oge

You can also use your own reference audio to clone any voice. Please refer to the [Github Repository](https://github.com/benjaminogbonna/styletts2-finetuned) for the complete code.

```python
# install system and Python dependencies
!sudo apt-get update -y
!apt-get install build-essential -y
!pip install torch tensorboard transformers accelerate SoundFile torchaudio librosa phonemizer
!pip install einops einops-exts tqdm typing typing-extensions munch pydub pyyaml nltk matplotlib
!pip install git+https://github.com/resemble-ai/monotonic_align.git
!pip install hf_transfer -qU
!sudo apt-get install -y espeak-ng

# ____________ Download model ____________
model_repo = 'benjaminogbonna/tts_demo_models'
model_path = 'Models'
target_files = ['config.yml', 'model_v1.pth']
local_dir = '.'
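# Optional (not in the original notebook): since hf_transfer was installed above,
# faster Hub downloads can be enabled by setting this environment variable before
# calling hf_hub_download.
import os
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'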
from huggingface_hub import hf_hub_download
import os

os.makedirs(local_dir, exist_ok=True)

downloaded_files = []
for file_name in target_files:
    file_path = hf_hub_download(repo_id=model_repo, filename=f'{model_path}/{file_name}', local_dir=local_dir)
    downloaded_files.append(file_path)
print('Downloaded files:', downloaded_files)

#________________________
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

model_folder = 'Models/'

# pick the checkpoint from the last trained epoch
files = [f for f in os.listdir(model_folder) if f.endswith('.pth')]
sorted_files = sorted(files, key=lambda x: int(x.split('_')[-1].split('.')[0]))
print(sorted_files[-1])

#________________________
import torch
torch.manual_seed(0)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

import random
random.seed(0)

import numpy as np
np.random.seed(0)

# load packages
import time
import yaml
from munch import Munch
from torch import nn
import torch.nn.functional as F
import torchaudio
import librosa
from nltk.tokenize import word_tokenize

from models import *
from utils import *
from text_utils import TextCleaner
textcleaner = TextCleaner()

%matplotlib inline

#________________________
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def length_to_mask(lengths):
    mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
    mask = torch.gt(mask + 1, lengths.unsqueeze(1))
    return mask

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(path):
    wave, sr = librosa.load(path, sr=24000)
    audio, _ = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(device)
    with torch.no_grad():
        ref_s = model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = model.predictor_encoder(mel_tensor.unsqueeze(1))
    return torch.cat([ref_s, ref_p], dim=1)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load phonemizer
import phonemizer
global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', preserve_punctuation=True, with_stress=True)

config = yaml.safe_load(open(f"{model_folder}config.yml"))

# load pretrained ASR model
ASR_config = config.get('ASR_config', False)
ASR_path = config.get('ASR_path', False)
text_aligner = load_ASR_models(ASR_path, ASR_config)

# load pretrained F0 model
F0_path = config.get('F0_path', False)
pitch_extractor = load_F0_models(F0_path)

# load BERT model
from Utils.PLBERT.util import load_plbert
BERT_path = config.get('PLBERT_dir', False)
plbert = load_plbert(BERT_path)

model_params = recursive_munch(config['model_params'])
model = build_model(model_params, text_aligner, pitch_extractor, plbert)
_ = [model[key].eval() for key in model]
_ = [model[key].to(device) for key in model]

#________________________
params_whole = torch.load(f"{model_folder}" + sorted_files[-1], map_location='cpu')
params = params_whole['net']

#________________________
for key in model:
    if key in params:
        print('%s loaded' % key)
        try:
            model[key].load_state_dict(params[key])
        except:
            from collections import OrderedDict
            state_dict = params[key]
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k[7:]  # remove `module.` prefix
                new_state_dict[name] = v
            # load params
            model[key].load_state_dict(new_state_dict, strict=False)
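# (Note, not in the original notebook) The fallback above strips the 'module.'
# prefix that torch.nn.DataParallel adds to parameter names, so checkpoints saved
# from multi-GPU training still load on a single device.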
_ = [model[key].eval() for key in model]

#________________________
from Modules.diffusion.sampler import DiffusionSampler, ADPM2Sampler, KarrasSchedule

sampler = DiffusionSampler(
    model.diffusion.diffusion,
    sampler=ADPM2Sampler(),
    sigma_schedule=KarrasSchedule(sigma_min=0.0001, sigma_max=3.0, rho=9.0),  # empirical parameters
    clamp=False
)

#________________________
def inference(text, ref_s, alpha=0.3, beta=0.7, diffusion_steps=5, embedding_scale=1):
    text = text.strip()
    ps = global_phonemizer.phonemize([text])
    ps = word_tokenize(ps[0])
    ps = ' '.join(ps)
    tokens = textcleaner(ps)
    tokens.insert(0, 0)
    tokens = torch.LongTensor(tokens).to(device).unsqueeze(0)

    with torch.no_grad():
        input_lengths = torch.LongTensor([tokens.shape[-1]]).to(device)
        text_mask = length_to_mask(input_lengths).to(device)

        t_en = model.text_encoder(tokens, input_lengths, text_mask)
        bert_dur = model.bert(tokens, attention_mask=(~text_mask).int())
        d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

        s_pred = sampler(noise=torch.randn((1, 256)).unsqueeze(1).to(device),
                         embedding=bert_dur,
                         embedding_scale=embedding_scale,
                         features=ref_s,  # reference from the same speaker as the embedding
                         num_steps=diffusion_steps).squeeze(1)

        s = s_pred[:, 128:]
        ref = s_pred[:, :128]

        ref = alpha * ref + (1 - alpha) * ref_s[:, :128]
        s = beta * s + (1 - beta) * ref_s[:, 128:]

        d = model.predictor.text_encoder(d_en, s, input_lengths, text_mask)

        x, _ = model.predictor.lstm(d)
        duration = model.predictor.duration_proj(x)

        duration = torch.sigmoid(duration).sum(axis=-1)
        pred_dur = torch.round(duration.squeeze()).clamp(min=1)

        pred_aln_trg = torch.zeros(input_lengths, int(pred_dur.sum().data))
        c_frame = 0
        for i in range(pred_aln_trg.size(0)):
            pred_aln_trg[i, c_frame:c_frame + int(pred_dur[i].data)] = 1
            c_frame += int(pred_dur[i].data)

        # encode prosody
        en = (d.transpose(-1, -2) @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(en)
            asr_new[:, :, 0] = en[:, :, 0]
            asr_new[:, :, 1:] = en[:, :, 0:-1]
            en = asr_new

        F0_pred, N_pred = model.predictor.F0Ntrain(en, s)

        asr = (t_en @ pred_aln_trg.unsqueeze(0).to(device))
        if model_params.decoder.type == "hifigan":
            asr_new = torch.zeros_like(asr)
            asr_new[:, :, 0] = asr[:, :, 0]
            asr_new[:, :, 1:] = asr[:, :, 0:-1]
            asr = asr_new

        out = model.decoder(asr, F0_pred, N_pred, ref.squeeze().unsqueeze(0))

    return out.squeeze().cpu().numpy()[..., :-50]  # weird pulse at the end of the model, needs to be fixed later

#________________________
# Synthesize speech
text = "We are happy to invite you to join us on a journey to the future."
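# Quick guidance on the inference arguments (based on how they are used above;
# treat these as rough rules of thumb rather than exact documentation):
#   alpha           - mix between the sampled style and the reference timbre;
#                     lower values stay closer to the reference speaker's voice
#   beta            - the same trade-off for prosody (rhythm and intonation)
#   diffusion_steps - more steps give more stable style sampling, at the cost of speed
#   embedding_scale - classifier-free guidance scale; higher values tend to sound
#                     more expressive and follow the text more strongly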
#________________________
reference_dicts = {}
reference_dicts['oge'] = "ref_audios/things_fall_apart_1.wav"  # or use your own audio samples
reference_dicts['ben'] = "ref_audios/feels_good_to_be_odd_1.wav"

#________________________
import IPython.display as ipd

noise = torch.randn(1, 1, 256).to(device)
for k, path in reference_dicts.items():
    ref_s = compute_style(path)
    start = time.time()
    wav = inference(text, ref_s, alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2)
    rtf = (time.time() - start) / (len(wav) / 24000)
    print(f"RTF = {rtf:5f}")
    print(k + ' Synthesized:')
    display(ipd.Audio(wav, rate=24000, normalize=False))
    print('Reference:')
    display(ipd.Audio(path, rate=24000, normalize=False))
```

## Model Description

- **Developed by:** [Benjamin](https://www.linkedin.com/in/benjamin-ogbonna)
- **Model type:** Text-to-Speech
- **Language(s) (NLP):** English (Nigerian-accented English)
- **Finetuned from:** [StyleTTS](https://huggingface.co/yl4579/StyleTTS)
- **Repository:** [Github Repository](https://github.com/benjaminogbonna/styletts2-finetuned)

#### Uses

Generate Nigerian-accented English speech.

#### Out-of-Scope Use

The model is not suitable for generating speech in languages other than English, or in accents other than Nigerian-accented English.

## Bias, Risks, and Limitations

The model may not capture the full diversity of Nigerian accents and could exhibit biases based on the training dataset.

#### Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Feedback and diverse training data contributions are encouraged.

## Speech Samples
| Text Input | Audio Output | Notes |
|---|---|---|
| We are happy to invite you to join us on a journey to the future. | | alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2; Speaker: Ben |
| We are happy to invite you to join us on a journey to the future. | | alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=2; Speaker: Oge |
| If the supply of fruit is greater than the family needs, it may be made a source of income by sending the fresh fruit to the market if there is one near enough, or by preserving, canning, and making jelly for sale. To make such an enterprise a success the fruit and work must be first class. There is magic in the word "Homemade," when the product appeals to the eye and the palate; but many careless and incompetent people have found to their sorrow that this word has not magic enough to float inferior goods on the market. As a rule large canning and preserving establishments are clean and have the best appliances, and they employ chemists and skilled labor. The home product must be very good to compete with the attractive goods that are sent out from such establishments. Yet for first class home made products there is a market in all large cities. All first-class grocers have customers who purchase such goods. | | alpha=0.3, beta=0.9, t=0.7, diffusion_steps=10, embedding_scale=1.5; Long narration, Speaker: Oge |
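For long passages like the last sample, a simple option is to synthesize the text sentence by sentence and concatenate the waveforms. The sketch below reuses the `inference` and `compute_style` helpers from the Colab example above; `synthesize_long` and its `pause_s` parameter are illustrative names, and this is not necessarily the exact long-form routine (with its `t` parameter) used to produce the sample.

```python
# Minimal long-form synthesis sketch (assumes the Colab setup above has been run).
import numpy as np
from nltk.tokenize import sent_tokenize  # punkt is downloaded earlier in the notebook

def synthesize_long(text, ref_path, pause_s=0.3, **kwargs):
    """Synthesize a long passage sentence by sentence and join the waveforms."""
    ref_s = compute_style(ref_path)
    silence = np.zeros(int(pause_s * 24000), dtype=np.float32)  # short pause between sentences
    pieces = []
    for sentence in sent_tokenize(text):
        pieces.append(inference(sentence, ref_s, **kwargs))
        pieces.append(silence)
    return np.concatenate(pieces)

long_text = (
    "If the supply of fruit is greater than the family needs, it may be made a "
    "source of income by sending the fresh fruit to the market if there is one "
    "near enough, or by preserving, canning, and making jelly for sale."
)
long_wav = synthesize_long(long_text, "ref_audios/things_fall_apart_1.wav",
                           alpha=0.3, beta=0.9, diffusion_steps=10, embedding_scale=1.5)
```

The resulting `long_wav` array can be played with `IPython.display.Audio(long_wav, rate=24000)` or written to disk with the `soundfile` package installed above.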
## TODO

- Get more data for the current speakers and re-train the model for better generalization
- Add more speakers for diversity

## Training

#### Data

This model was trained on a proprietary audio dataset (5 hours) of the authors reading books from different categories.

#### Preprocessing

Audio files were preprocessed and resampled to 24 kHz.

#### Training Hyperparameters

- **Number of epochs:** 6
- **batch_size:** 2
- **use_lora:** False

#### Hardware

- **GPUs:** NVIDIA L40S (3 hours)

#### Software

- **Training Framework:** PyTorch

## Citation

#### BibTeX:

```bibtex
@misc{nigerian_accented_tts_2025,
  author    = {Benjamin O. and Oge N. and Mathias E. and Daniel A.},
  title     = {Nigerian-Accented Text-to-Speech Model},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/benjaminogbonna/tts_demo_models}
}
```

#### APA:

```
Benjamin O., Oge N., Mathias E., & Daniel A. (2025). Nigerian-Accented Text-to-Speech Model. Hugging Face. https://huggingface.co/benjaminogbonna/tts_demo_models
```

## Credits & References

- [StyleTTS](https://huggingface.co/yl4579/StyleTTS)