Update the order of code and add pipeline usage! (#22)
- Update the order of code and add pipeline usage! (578a8302e3794675594bb02d971245bb7cf5690e)
Co-authored-by: Vaibhav Srivastav <[email protected]>
README.md
CHANGED
@@ -47,44 +47,35 @@ Extensive evaluations show the superiority of the proposed SpeechT5 framework on
|
|
47 |
|
48 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
49 |
|
50 |
-
## Direct Use
|
51 |
-
|
52 |
-
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
53 |
-
|
54 |
-
You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
|
55 |
-
|
56 |
-
## Downstream Use [optional]
|
57 |
-
|
58 |
-
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
|
59 |
-
|
60 |
-
[More Information Needed]
|
61 |
-
|
62 |
-
## Out-of-Scope Use
|
63 |
-
|
64 |
-
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
|
65 |
-
|
66 |
-
[More Information Needed]
|
67 |
|
68 |
-
|
69 |
|
70 |
-
|
71 |
|
72 |
-
|
73 |
|
74 |
-
|
75 |
|
76 |
-
|
77 |
|
78 |
-
|
79 |
|
|
|
80 |
|
81 |
-
|
82 |
|
83 |
-
|
84 |
|
85 |
```python
|
86 |
# Following pip packages need to be installed:
|
87 |
-
# !pip install
|
88 |
|
89 |
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
90 |
from datasets import load_dataset
|
@@ -111,6 +102,37 @@ sf.write("speech.wav", speech.numpy(), samplerate=16000)
|
|
111 |
|
112 |
Refer to [this Colab notebook](https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ) for an example of how to fine-tune SpeechT5 for TTS on a different dataset or a new language.
|
113 |
|
114 |
# Training Details
|
115 |
|
116 |
## Training Data
|
|
|
47 |
|
48 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
49 |
|
50 |
+
## How to Get Started With the Model
|
51 |
|
52 |
+
You can run the SpeechT5 model through the `text-to-speech` pipeline in just a couple of lines of code!
|
53 |
|
54 |
+
```python
|
55 |
+
# Following pip packages need to be installed:
|
56 |
+
# !pip install transformers sentencepiece datasets soundfile torch
|
57 |
|
58 |
+
from transformers import pipeline
|
59 |
+
from datasets import load_dataset
|
60 |
+
import soundfile as sf
+
import torch
|
61 |
|
62 |
+
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")
|
63 |
|
64 |
+
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
|
65 |
+
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
|
66 |
+
# You can replace this embedding with your own as well.
|
67 |
|
68 |
+
speech = synthesiser("Hello what is happening", forward_params={"speaker_embeddings": speaker_embeddings})
|
69 |
|
70 |
+
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
|
71 |
|
72 |
+
```
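
The snippet above passes `speech["audio"]` and `speech["sampling_rate"]` to `soundfile`. If `soundfile` is not available, a rough alternative is to write the waveform with Python's standard `wave` module. The sketch below is only an illustration and is not part of the model card: the helper name `write_pcm16_wav` is hypothetical, it assumes the audio is a one-dimensional sequence of floats in the range [-1.0, 1.0], and it uses a synthetic tone as a stand-in for the pipeline output so that it runs without downloading the model.

```python
import math
import os
import struct
import tempfile
import wave


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Hypothetical helper for illustration; returns the number of frames written.
    """
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16 bits
        wav_file.setframerate(int(sampling_rate))
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Stand-in for the pipeline output (assumption): one second of a 440 Hz tone
# at 16 kHz, shaped like a dict with "audio" and "sampling_rate" keys.
speech = {
    "audio": [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)],
    "sampling_rate": 16000,
}

out_path = os.path.join(tempfile.gettempdir(), "speech_pcm16.wav")
n_written = write_pcm16_wav(out_path, speech["audio"], speech["sampling_rate"])
```

With a real pipeline result, the same call would take `speech["audio"]` and `speech["sampling_rate"]` directly, provided the audio array is one-dimensional; otherwise it would need to be flattened first.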
|
73 |
|
74 |
+
For more fine-grained control, you can use the processor together with the model's `generate_speech` method to convert text into a mono 16 kHz speech waveform.
|
75 |
|
76 |
```python
|
77 |
# Following pip packages need to be installed:
|
78 |
+
# !pip install transformers sentencepiece datasets
|
79 |
|
80 |
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
|
81 |
from datasets import load_dataset
|
|
|
102 |
|
103 |
Refer to [this Colab notebook](https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ) for an example of how to fine-tune SpeechT5 for TTS on a different dataset or a new language.
|
104 |
|
105 |
+
|
106 |
+
## Direct Use
|
107 |
+
|
108 |
+
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
109 |
+
|
110 |
+
You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.
|
111 |
+
|
112 |
+
## Downstream Use [optional]
|
113 |
+
|
114 |
+
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
|
115 |
+
|
116 |
+
[More Information Needed]
|
117 |
+
|
118 |
+
## Out-of-Scope Use
|
119 |
+
|
120 |
+
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
|
121 |
+
|
122 |
+
[More Information Needed]
|
123 |
+
|
124 |
+
# Bias, Risks, and Limitations
|
125 |
+
|
126 |
+
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
127 |
+
|
128 |
+
[More Information Needed]
|
129 |
+
|
130 |
+
## Recommendations
|
131 |
+
|
132 |
+
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
|
133 |
+
|
134 |
+
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
|
135 |
+
|
136 |
# Training Details
|
137 |
|
138 |
## Training Data
|