How to Use?
I am having difficulty using your model.
First, do I need to use the processor and the vocoder from Microsoft? I tried the code below, but I only get noise.
(I assume it is not a problem that the speaker embedding comes from an English speaker.)
Could you please provide an example of how to use this model properly?
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf
device = "cuda:0" if torch.cuda.is_available() else "cpu"
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("nikolab/speecht5_tts_hr").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)
inputs = processor(text="Naravno! Danas je sunčan dan.", return_tensors="pt")
# Load an x-vector containing the speaker's voice characteristics from a dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to(device)
speech = model.generate_speech(inputs["input_ids"].to(device), speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.cpu().numpy(), samplerate=16000)
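To help narrow down where the noise comes from, here is a minimal sketch of a sanity check on the generated waveform. `waveform_stats` is a hypothetical helper, not part of transformers or soundfile, and the synthetic tone only stands in for the real model output. It cannot tell speech from noise, but it can rule out an empty, clipped, or unexpectedly short result.

```python
import math


def waveform_stats(samples, sample_rate=16000):
    """Return duration (s), peak amplitude, and RMS of a 1-D sequence of floats.

    Hypothetical debugging helper; not part of any library.
    """
    samples = [float(s) for s in samples]
    if not samples:
        raise ValueError("empty waveform")
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "duration_s": len(samples) / sample_rate,
        "peak": peak,
        "rms": rms,
    }


# Synthetic 440 Hz tone, one second long, standing in for the model output.
# With the real output, pass speech.cpu().numpy().tolist() instead.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
stats = waveform_stats(tone)
print(stats)
```

A peak well above 1.0, or a duration far from what the sentence length suggests, would point to a problem before the file is written rather than in playback.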