Taejin committed on
Commit
f059506
·
1 Parent(s): 5bd87d8

README and gitattributes

Signed-off-by: taejinp <[email protected]>

Files changed (2)
  1. .gitattributes +1 -0
  2. README.md +304 -3
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+diar_sortformer_4spk-v1.nemo filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
---
license: cc-by-nc-sa-4.0
library_name: nemo
datasets:
- fisher_english
- NIST_SRE_2004-2010
- librispeech
- ami_meeting_corpus
- voxconverse_v0.3
- icsi
- aishell4
- dihard_challenge-3
- NIST_SRE_2000-Disc8_split1
thumbnail: null
tags:
- speaker-diarization
- speaker-recognition
- speech
- audio
- Transformer
- FastConformer
- Conformer
- NEST
- pytorch
- NeMo
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: diar_sortformer_4spk-v1
  results:
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: DIHARD3-eval
      type: dihard3-eval-1to4spks
      config: with_overlap_collar_0.0s
      split: eval
    metrics:
    - name: Test DER
      type: der
      value: 14.76
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-2spk
      config: with_overlap_collar_0.25s
      split: part2-2spk
    metrics:
    - name: Test DER
      type: der
      value: 5.85
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-3spk
      config: with_overlap_collar_0.25s
      split: part2-3spk
    metrics:
    - name: Test DER
      type: der
      value: 8.46
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8)
      type: CALLHOME-part2-4spk
      config: with_overlap_collar_0.25s
      split: part2-4spk
    metrics:
    - name: Test DER
      type: der
      value: 12.59
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: call_home_american_english_speech
      type: CHAES_2spk_109sessions
      config: with_overlap_collar_0.25s
      split: ch109
    metrics:
    - name: Test DER
      type: der
      value: 6.86
metrics:
- der
pipeline_tag: audio-classification
---

# Sortformer Diarizer 4spk v1

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-123M-lightgrey#model-badge)](#model-architecture)
<!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->

[Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

<div align="center">
 <img src="sortformer_intro.png" width="750" />
</div>

Sortformer resolves the permutation problem in diarization by following the arrival-time order of each speaker's speech segments.

## Model Architecture

Sortformer consists of an L-size (18-layer) [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[2], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[3] encoder. It is followed by an 18-layer Transformer[4] encoder with a hidden size of 192, and two feed-forward layers that produce four sigmoid outputs for each frame at the top layer. More information can be found in the [Sortformer paper](https://arxiv.org/abs/2409.06656)[1].

<div align="center">
 <img src="sortformer-v1-model.png" width="450" />
</div>

## NVIDIA NeMo

To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend installing it after you have installed Cython and the latest version of PyTorch.
```
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

## How to Use this Model

The model is available for use in the NeMo Framework[5] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
import torch
from nemo.collections.asr.models import SortformerEncLabelModel

# Load the model from a local .nemo checkpoint file
diar_model = SortformerEncLabelModel.restore_from(restore_path="diar_sortformer_4spk-v1.nemo", map_location=torch.device('cuda'), strict=False)
```
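
Alternatively, the checkpoint can be loaded directly from the Hugging Face Hub via NeMo's `from_pretrained` interface; treat the model ID below as an assumption based on this repository's name:

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# Download and load the checkpoint from the Hugging Face Hub
# (assumes the model is published as "nvidia/diar_sortformer_4spk-v1")
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")

# Switch to evaluation mode before running inference
diar_model.eval()
```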

### Input Format
Input to Sortformer can be either a list of paths to audio files or a JSONL manifest file.

```python
pred_outputs = diar_model.diarize(audio=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"], batch_size=1)
```

An individual audio file can be fed into the Sortformer model as follows:
```python
pred_output1 = diar_model.diarize(audio="/path/to/multispeaker_audio1.wav", batch_size=1)
```

To perform diarization on a multi-speaker audio recording with Sortformer, specify the input as a JSONL manifest file, where each line is a dictionary containing the following fields:

```yaml
# Example of two lines in `multispeaker_manifest.json`
{
 "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
 "offset": 0,  # offset (start) time of the input audio in seconds
 "duration": 600  # duration of the audio in seconds; can be set to `null` when using the NeMo main branch
}
{
 "audio_filepath": "/path/to/multispeaker_audio2.wav",
 "offset": 0,
 "duration": 580
}
```

and then use:
```python
pred_outputs = diar_model.diarize(audio="/path/to/multispeaker_manifest.json", batch_size=1)
```
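
For many recordings, it can be convenient to generate the manifest programmatically. A minimal sketch, where the paths, offsets, and durations are placeholders:

```python
import json

# Hypothetical recordings; replace with your own paths and durations
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 0, "duration": 580},
]

# NeMo manifests are JSONL: one JSON object per line
with open("multispeaker_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```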

### Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz.
- The actual input tensor is an Ns x 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal.
- For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) forms a 160,000 x 1 matrix.
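
If a recording is stereo or uses a different sample rate, it can be converted before diarization. A sketch using `librosa` and `soundfile` (these libraries are an assumption for illustration, not a NeMo requirement):

```python
import librosa
import soundfile as sf

# Load as mono and resample to 16 kHz to match the model's expected input
audio, sr = librosa.load("/path/to/multispeaker_audio1.wav", sr=16000, mono=True)

# A 10-second clip yields a (160000,) array, i.e. Ns = 160000 samples
print(audio.shape)

# Write the converted audio to a 16 kHz mono WAV file
sf.write("/path/to/multispeaker_audio1_16k.wav", audio, 16000)
```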

### Output

The output of the model is a T x S matrix, where:
- S is the maximum number of speakers (in this model, S = 4);
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.

Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability that the second speaker is active during the time range [12.00, 12.08] seconds.
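
To illustrate how the T x S matrix maps to speaker segments, below is a minimal thresholding sketch. The 0.5 threshold, the variable names, and the assumption that one T x S tensor is available per input are illustrative only; NeMo's inference script applies its own, optimized post-processing:

```python
import torch

FRAME_SEC = 0.08  # each output frame covers 0.08 seconds of audio
THRESHOLD = 0.5   # illustrative speaker-activity decision threshold

def probs_to_segments(probs: torch.Tensor):
    """Convert a T x S activity matrix into (speaker, start_sec, end_sec) segments."""
    segments = []
    active = probs > THRESHOLD  # T x S boolean matrix
    num_frames, num_speakers = active.shape
    for spk in range(num_speakers):
        start = None
        for t in range(num_frames):
            if active[t, spk] and start is None:
                start = t  # speaker turn begins
            elif not active[t, spk] and start is not None:
                segments.append((spk, start * FRAME_SEC, t * FRAME_SEC))
                start = None
        if start is not None:  # speaker active until the last frame
            segments.append((spk, start * FRAME_SEC, num_frames * FRAME_SEC))
    return segments

# Hypothetical usage, assuming `probs` is one T x S output matrix:
# for spk, start, end in probs_to_segments(probs):
#     print(f"speaker_{spk}: {start:.2f}s - {end:.2f}s")
```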

## Train and evaluate Sortformer diarizer using NeMo
### Training

Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs, with 90-second-long training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml), as sketched below.
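
A training run might be launched as in the following sketch; the Hydra-style override names mirror common NeMo configs, and the manifest paths are placeholders to be replaced with your own:

```
python sortformer_diar_train.py \
    --config-path=../conf/neural_diarizer \
    --config-name=sortformer_diarizer_hybrid_loss_4spk-v1.yaml \
    trainer.devices=8 \
    model.train_ds.manifest_filepath=/path/to/train_manifest.json \
    model.validation_ds.manifest_filepath=/path/to/validation_manifest.json
```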

### Inference

Sortformer diarization, optionally with post-processing algorithms, can be run using this inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). Provide one of the YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset.
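
A sketch of such a run is shown below; the parameter names are assumptions that follow the Hydra-style configuration of the example script and should be verified against it:

```
python e2e_diarize_speech.py \
    model_path=/path/to/diar_sortformer_4spk-v1.nemo \
    dataset_manifest=/path/to/multispeaker_manifest.json \
    postprocessing_yaml=/path/to/sortformer_diar_4spk-v1_dihard3-dev.yaml
```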

### Technical Limitations

- The model operates in a non-streaming (offline) mode.
- It can detect a maximum of 4 speakers; performance degrades on recordings with 5 or more speakers.
- The maximum duration of a test recording depends on available GPU memory. On an RTX A6000 with 48 GB of memory, the limit is around 12 minutes.
- The model was trained on publicly available speech datasets, primarily in English. As a result:
 * Performance may degrade on non-English speech.
 * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.


## Datasets

Sortformer was trained on a combination of 2030 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[6].
All the datasets listed below use the same labeling method, the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of the RTTM files was processed specifically for speaker diarization model training.
Data collection methods vary across the individual datasets, which include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or the individual dataset webpages for detailed data collection methods.

### Training Datasets (Real conversations)
- Fisher English (LDC)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech
- AMI Meeting Corpus
- VoxConverse-v0.3
- ICSI
- AISHELL-4
- Third DIHARD Challenge Development (LDC)
- 2000 NIST Speaker Recognition Evaluation, split1 (LDC)

### Training Datasets (Used to simulate audio mixtures)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech

## Performance

### Evaluation dataset specifications

| **Dataset** | **DIHARD3-Eval** | **CALLHOME-part2** | **CALLHOME-part2** | **CALLHOME-part2** | **CH109** |
|:------------------------------|:------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
| **Number of Speakers** | ≤ 4 speakers | 2 speakers | 3 speakers | 4 speakers | 2 speakers |
| **Collar (sec)** | 0.0s | 0.25s | 0.25s | 0.25s | 0.25s |
| **Mean Audio Duration (sec)** | 453.0s | 73.0s | 135.7s | 329.8s | 552.9s |

### Diarization Error Rate (DER)

* All evaluations include overlapping speech.
* Bolded and italicized numbers represent the best-performing Sortformer evaluations.
* Post-processing (PP) is optimized on two different held-out dataset splits:
 - [YAML file for DH3-dev optimized post-processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml)
 - [YAML file for CallHome-part1 optimized post-processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml)

| **Dataset** | **DIHARD3-Eval** | **CALLHOME-part2** | **CALLHOME-part2** | **CALLHOME-part2** | **CH109** |
|:----------------------------------------------------------|:------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
| DER **diar_sortformer_4spk-v1** | 16.28 | 6.49 | 10.01 | 14.14 | **_6.27_** |
| DER **diar_sortformer_4spk-v1 + DH3-dev Opt. PP** | **_14.76_** | - | - | - | - |
| DER **diar_sortformer_4spk-v1 + CallHome-part1 Opt. PP** | - | **_5.85_** | **_8.46_** | **_12.59_** | 6.86 |

### Real Time Factor (RTFx)

RTFx is the ratio of audio duration to processing time, so higher values indicate faster inference. All tests were measured on an RTX A6000 48GB with a batch size of 1. Post-processing is not included in the RTFx calculations.

| **Datasets** | **DIHARD3-Eval** | **CALLHOME-part2** | **CALLHOME-part2** | **CALLHOME-part2** | **CH109** |
|:----------------------------------|:-------------------:|:-------------------:|:-------------------:|:-------------------:|:------------------:|
| RTFx **diar_sortformer_4spk-v1** | 437 | 1053 | 915 | 545 | 415 |

## NVIDIA Riva: Deployment

[NVIDIA Riva](https://developer.nvidia.com/riva) is an accelerated speech AI SDK deployable on-prem, in all clouds, multi-cloud, hybrid, at the edge, and embedded.
Additionally, Riva provides:

* World-class out-of-the-box accuracy for the most common languages, with model checkpoints trained on proprietary data with hundreds of thousands of GPU-compute hours
* Best-in-class accuracy with run-time word boosting (e.g., brand and product names) and customization of acoustic model, language model, and inverse text normalization
* Streaming speech recognition, Kubernetes-compatible scaling, and enterprise-grade support

Although this model is not yet supported by Riva, you can browse the [list of supported models](https://huggingface.co/models?other=Riva).
Check out the [Riva live demo](https://developer.nvidia.com/riva#demos).

## References

[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

[2] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)

[3] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[4] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

[6] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)

## License

Use of this model is covered by the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). By downloading the public release version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.