Update README.md #1
by imedennikov

README.md CHANGED
The model is available for use in the NeMo Framework[5]:

```python
import torch
from nemo.collections.asr.models import SortformerEncLabelModel

# load the model from a local .nemo checkpoint
diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_sortformer_4spk-v1.nemo", map_location=torch.device('cuda'), strict=False)
```
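If a local `.nemo` file is not at hand, NeMo model classes also expose a `from_pretrained` entry point that downloads a published checkpoint by name. A minimal sketch, assuming this checkpoint is published as `nvidia/diar_sortformer_4spk-v1` and resolvable by that standard hook:

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# assumption: the checkpoint is published under this name and resolvable by
# NeMo's standard from_pretrained download hook
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")

# inference-only use: disable dropout and other training-mode behavior
diar_model.eval()
```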

### Input Format
Input to Sortformer can be an individual audio file:
```python
audio_input="/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a JSONL manifest file:
```python
audio_input="/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
```yaml
# Example of a line in `multispeaker_manifest.json`
{
    "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
    "offset": 0,  # offset (start) time of the input audio, in seconds
    "duration": 600,  # duration of the audio, in seconds; can be set to `null` if using NeMo main branch
}
{
    "audio_filepath": "/path/to/multispeaker_audio2.wav",
    "offset": 900,
    "duration": 580,
}
```
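Such a manifest can also be written programmatically. A minimal sketch, assuming nothing beyond the JSON-lines layout and fields shown above:

```python
import json

# the two example entries from the manifest above, one JSON object per line
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

with open("/path/to/multispeaker_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```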

### Getting Diarization Results
To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
```python
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To also obtain tensors of speaker activity probabilities, use:
```python
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
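The returned segments can then be consumed directly in Python. A minimal sketch, assuming `diarize` returns one list of segments per input file and that each segment arrives as a single string carrying the begin, end, and speaker fields described above (verify the exact return type against your NeMo version):

```python
# assumption: predicted_segments[i] holds the segments of the i-th input file,
# each formatted as a "begin_seconds end_seconds speaker_index" string
for file_idx, segments in enumerate(predicted_segments):
    print(f"--- audio file {file_idx} ---")
    for segment in segments:
        # tolerate either comma- or space-separated fields
        begin, end, speaker = str(segment).replace(",", " ").split()
        print(f"{speaker}: {float(begin):.2f}s - {float(end):.2f}s")
```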

### Input
This model accepts single-channel (mono) audio sampled at 16,000 Hz.

### Output
The output of the model is a T x S matrix, where:
- S is the maximum number of speakers (in this model, S = 4).
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.

Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
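With `include_tensor_outputs=True`, these frame-level probabilities can be binarized into speaker-activity segments. A minimal sketch, assuming one T x S matrix per input file and an illustrative 0.5 threshold (a tuned threshold, or the post-processing configs referenced below, will generally do better):

```python
FRAME_SECONDS = 0.08  # each frame covers 0.08 s of audio, as described above
THRESHOLD = 0.5       # assumption: illustrative cutoff, not a tuned value

def probs_to_segments(probs):
    """Convert a T x S speaker-activity matrix into (begin_s, end_s, speaker) tuples."""
    num_frames, num_speakers = probs.shape
    segments = []
    for spk in range(num_speakers):
        start = None
        for t in range(num_frames):
            if probs[t, spk] > THRESHOLD:
                if start is None:
                    start = t  # speaker becomes active
            elif start is not None:
                segments.append((start * FRAME_SECONDS, t * FRAME_SECONDS, spk))
                start = None
        if start is not None:  # close a segment running to the last frame
            segments.append((start * FRAME_SECONDS, num_frames * FRAME_SECONDS, spk))
    return sorted(segments)
```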

### Training

Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second-long training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).
### Evaluation

To evaluate Sortformer diarizer and save diarization results in RTTM format, use the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py):
```shell
python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
    manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
    collar=COLLAR \
    out_rttm_dir="/path/to/output_rttms"
```
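For scoring, each manifest entry must point at its reference annotation. A minimal sketch of one such entry, assuming the reference is attached through an `rttm_filepath` field (the field name is an assumption; check the manifest schema the script expects):

```python
import json

# assumption: the reference RTTM is referenced via "rttm_filepath"; verify
# against the manifest schema expected by e2e_diarize_speech.py
entry = {
    "audio_filepath": "/path/to/multispeaker_audio1.wav",
    "offset": 0,
    "duration": 600,
    "rttm_filepath": "/path/to/multispeaker_audio1.rttm",
}

with open("/path/to/multispeaker_manifest_with_reference_rttms.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```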

You can provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset:
```shell
python [NEMO_GIT_FOLDER]/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
    manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
    collar=COLLAR \
    bypass_postprocessing=False \
    postprocessing_yaml="/path/to/postprocessing_config.yaml" \
    out_rttm_dir="/path/to/output_rttms"
```

### Technical Limitations