I am having difficulty using your model.
Do I need to use the processor and the vocoder from Microsoft? I tried the code below, but I only get noise.
(I assume it is not a problem that the embedding is from an English speaker.)

Could you please provide an example of how to use this model properly?

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import torch
import soundfile as sf
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# The processor only tokenizes text on the CPU, so it takes no device argument
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("nikolab/speecht5_tts_hr").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)
inputs = processor(text="Naravno! Danas je sunčan dan.", return_tensors="pt")

# load xvector containing speaker's voice characteristics from a dataset

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0).to(device)
speech = model.generate_speech(inputs["input_ids"].to(device), speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.cpu().numpy(), samplerate=16000)
