reach-vb HF staff committed on
Commit f91af75
1 Parent(s): 39b288d

[DX] Clearer instructions for SpeechT5

Files changed (1)
  1. README.md +14 -13
README.md CHANGED
@@ -47,14 +47,19 @@ Extensive evaluations show the superiority of the proposed SpeechT5 framework on

  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- ## How to Get Started With the Model
+ ## 🤗 Transformers Usage

- You can access the SpeechT5 model via the `Text-to-Speech` pipeline in just a couple lines of code!
+ You can run SpeechT5 TTS locally with the 🤗 Transformers library.

- ```python
- # Following pip packages need to be installed:
- # !pip install transformers sentencepiece datasets
+ 1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece and datasets(optional):
+
+ ```
+ pip install transformers sentencepiece datasets
+ ```
+
+ 2. Run inference via the `Text-to-Speech` (TTS) pipeline. You can access the SpeechT5 model via the TTS pipeline in just a few lines of code!

+ ```python
  from transformers import pipeline
  from datasets import load_dataset
  import soundfile as sf
@@ -62,21 +67,17 @@ import soundfile as sf
  synthesiser = pipeline("text-to-speech", "microsoft/speech_tt5")

  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
- speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
+ speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
  # You can replace this embedding with your own as well.

- speech = pipe("Hello what is happening", forward_params={"speaker_embeddings": speaker_embeddings})
+ speech = pipe("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})

  sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
-
  ```

- For more fine-grained control you can use the processor + generate code to convert text into a mono 16 kHz speech waveform.
+ 3. Run inference via the Transformers modelling code - You can use the processor + generate code to convert text into a mono 16 kHz speech waveform for more fine-grained control.

  ```python
- # Following pip packages need to be installed:
- # !pip install transformers sentencepiece datasets
-
  from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
  from datasets import load_dataset
  import torch
@@ -87,7 +88,7 @@ processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
  model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
  vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

- inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
+ inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

  # load xvector containing speaker's voice characteristics from a dataset
  embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
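
Note on the pipeline snippet in this diff: as committed, it calls `pipe(...)` although the pipeline object is named `synthesiser`, it uses `torch` without importing it, and it loads `microsoft/speech_tt5`, whereas the modelling-code snippet in the same README loads `microsoft/speecht5_tts`. The following is a minimal sketch of a self-consistent variant, not the committed text. It assumes the checkpoint name from the modelling-code snippet is the intended one, and the helper name `synthesise_to_wav` and the module-level constants are hypothetical, introduced only for illustration.

```python
# Sketch only: a self-consistent variant of the pipeline snippet from the diff.
# Assumptions (not part of the commit): MODEL_ID is borrowed from the
# modelling-code snippet, and `synthesise_to_wav` is a hypothetical helper name.

MODEL_ID = "microsoft/speecht5_tts"
EMBEDDINGS_DATASET = "Matthijs/cmu-arctic-xvectors"
SPEAKER_INDEX = 7306  # same x-vector index as used in the diff


def synthesise_to_wav(text, out_path="speech.wav", speaker_index=SPEAKER_INDEX):
    """Generate speech for `text` and write it to `out_path` as a WAV file."""
    # Heavy imports live inside the function so that importing this module
    # does not load any model or dataset.
    import soundfile as sf
    import torch
    from datasets import load_dataset
    from transformers import pipeline

    # Name the pipeline object once and use that same name for the call below.
    synthesiser = pipeline("text-to-speech", MODEL_ID)

    embeddings_dataset = load_dataset(EMBEDDINGS_DATASET, split="validation")
    # unsqueeze(0) adds a batch dimension in front of the speaker x-vector.
    speaker_embedding = torch.tensor(
        embeddings_dataset[speaker_index]["xvector"]
    ).unsqueeze(0)

    speech = synthesiser(
        text, forward_params={"speaker_embeddings": speaker_embedding}
    )
    sf.write(out_path, speech["audio"], samplerate=speech["sampling_rate"])
    return out_path
```

Calling `synthesise_to_wav("Hello, my dog is cooler than you!")` should write `speech.wav`, provided `soundfile` and `torch` are installed alongside the packages from step 1 and both the model and the dataset can be downloaded in your environment.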