Commit d301339 by Bartelds · 1 Parent(s): f7f93da

Upload checkpoint, sanitized config, and transcripts for group-dro_mms_set_4

Files changed (5)
  1. README.md +41 -0
  2. config.yaml +359 -0
  3. hyp.trn +0 -0
  4. ref.trn +0 -0
  5. valid.loss.best.pth +3 -0
README.md ADDED
@@ -0,0 +1,41 @@
+ ---
+ title: "Group-DRO MMS-based ASR model - set 4"
+ language: multilingual
+ tags:
+ - asr
+ - group-dro
+ - MMS
+ license: cc-by-nc-4.0
+ ---
+
+ # Group-DRO MMS-based ASR model - set 4
+
+ This repository contains a Group-DRO MMS-based automatic speech recognition (ASR) model trained with ESPnet.
+ The model was trained on balanced training data from set 4.
+
+ ## Intended Use
+
+ This model is intended for ASR. Users can run inference using the provided checkpoint (`valid.loss.best.pth`) and configuration file (`config.yaml`):
+ ```python
+ import soundfile as sf
+ from espnet2.bin.asr_inference import Speech2Text
+
+ asr_train_config = "group-dro_mms_set_4/config.yaml"
+ asr_model_file = "group-dro_mms_set_4/valid.loss.best.pth"
+
+ model = Speech2Text.from_pretrained(
+     asr_train_config=asr_train_config,
+     asr_model_file=asr_model_file
+ )
+
+ speech, _ = sf.read("input.wav")
+ text, *_ = model(speech)[0]
+
+ print("Recognized text:", text)
+ ```
+
+ ## How to Use
+
+ 1. Clone this repository.
+ 2. Use ESPnet’s inference scripts with the provided `config.yaml` and checkpoint file.
+ 3. Ensure any external resources referenced in `config.yaml` are available at the indicated relative paths.
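Step 3 of the README can be checked mechanically. The sketch below is an editorial illustration, not part of the commit: it walks a nested config dictionary and lists string values that look like relative paths, then reports which of them are missing on disk. The helper names `relative_path_values` and `missing_resources` are hypothetical, and the dictionary is a hand-copied excerpt of this commit's `config.yaml`; in practice one would presumably load the full file with a YAML parser such as PyYAML's `yaml.safe_load`.

```python
from pathlib import Path


def relative_path_values(cfg, prefix=""):
    """Yield (dotted_key, value) for each string value that starts with ./ or ../"""
    for key, value in cfg.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested sections such as frontend_conf.
            yield from relative_path_values(value, prefix=f"{dotted}.")
        elif isinstance(value, str) and value.startswith(("./", "../")):
            yield dotted, value


def missing_resources(cfg, root="."):
    """Return the (key, path) pairs whose path does not exist under root."""
    root = Path(root)
    return [(k, v) for k, v in relative_path_values(cfg) if not (root / v).exists()]


# Hand-copied excerpt of config.yaml from this commit (not the full file).
config_excerpt = {
    "non_linguistic_symbols": "./nlsyms.txt",
    "output_dir": "./inference_results",
    "frontend_conf": {
        "download_dir": "./hub",
        "frontend_conf": {"path_or_url": "facebook/mms-300m"},
    },
}

found = list(relative_path_values(config_excerpt))
for key, value in missing_resources(config_excerpt):
    print(f"missing: {key} -> {value}")
```

Values such as `facebook/mms-300m` are skipped on purpose, since they do not start with `./` and look like a model identifier rather than a local path.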
config.yaml ADDED
@@ -0,0 +1,359 @@
+ accum_grad: 16
+ adapter: lora
+ adapter_conf: {}
+ allow_multi_rates: false
+ allow_variable_data_keys: false
+ aux_ctc_tasks: []
+ batch_bins: 1000000
+ batch_size: 4
+ batch_type: sorted
+ best_model_criterion:
+ - - valid
+   - loss
+   - min
+ bpemodel: null
+ chunk_default_fs: null
+ chunk_excluded_key_prefixes: []
+ chunk_length: 500
+ chunk_shift_ratio: 0.5
+ cleaner: null
+ collect_stats: false
+ create_graph_in_tensorboard: false
+ ctc_conf:
+   accumulation: false
+   agg: mean
+   ctc_type: droctc
+   dro_group_count: 6
+   dro_q_epsilon: 1.0e-10
+   dro_step_size: 0.0001
+   final_step_size: 0.001
+   init_strategy: uniform
+   initial_step_size: 0.0001
+   laplace_smoothing: 0.0
+   max_epoch: 40
+   normalize_grad: false
+   num_iters_per_epoch: 1200
+   running_mean_window: -1
+   scheduling: false
+   use_running_mean: false
+   warmup_steps: 0
+ cudnn_benchmark: false
+ cudnn_deterministic: true
+ cudnn_enabled: true
+ decoder: null
+ decoder_conf: {}
+ detect_anomaly: false
+ distributed: false
+ drop_last_iter: false
+ dry_run: false
+ duration_batch_length: -1
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ encoder: transformer
+ encoder_conf:
+   attention_dropout_rate: 0.1
+   attention_heads: 8
+   dropout_rate: 0.1
+   input_layer: conv2d2
+   linear_units: 1024
+   normalize_before: true
+   num_blocks: 2
+   output_size: 256
+   positional_dropout_rate: 0.1
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ fold_length:
+ - 80000
+ - 150
+ freeze_param: []
+ frontend: s3prl
+ frontend_conf:
+   download_dir: ./hub
+   frontend_conf:
+     path_or_url: facebook/mms-300m
+     upstream: hf_wav2vec2_custom
+   fs: 16k
+   multilayer_feature: true
+ g2p: null
+ grad_clip: 5.0
+ grad_clip_type: 2.0
+ grad_noise: false
+ ignore_init_mismatch: false
+ init: xavier_uniform
+ init_param: []
+ input_size: null
+ iterator_type: sequence
+ joint_net_conf: null
+ keep_nbest_models: 3
+ log_interval: null
+ log_level: INFO
+ max_cache_fd: 32
+ max_cache_size: 0.0
+ max_epoch: 40
+ model: espnet
+ model_conf:
+   ctc_weight: 1.0
+ multiple_iterator: false
+ multiprocessing_distributed: false
+ nbest_averaging_interval: 0
+ ngpu: 1
+ no_forward_run: false
+ noise_apply_prob: 1.0
+ noise_db_range: '13_15'
+ noise_scp: null
+ non_linguistic_symbols: ./nlsyms.txt
+ normalize: utterance_mvn
+ normalize_conf: {}
+ num_att_plot: 3
+ num_cache_chunks: 1024
+ num_iters_per_epoch: 1200
+ num_workers: 4
+ optim: adam
+ optim_conf:
+   lr: 0.0001
+   weight_decay: 1.0e-06
+ output_dir: ./inference_results
+ patience: null
+ postencoder: null
+ postencoder_conf: {}
+ preencoder: linear
+ preencoder_conf:
+   input_size: 1024
+   output_size: 80
+ preprocessor: default
+ preprocessor_conf: {}
+ pretrain_path: null
+ print_config: false
+ required:
+ - output_dir
+ - token_list
+ resume: true
+ rir_apply_prob: 1.0
+ rir_scp: null
+ save_strategy: all
+ scheduler: null
+ scheduler_conf: {}
+ seed: 0
+ sharded_ddp: false
+ short_noise_thres: 0.5
+ shuffle_within_batch: false
+ sort_batch: descending
+ sort_in_batch: descending
+ specaug: specaug
+ specaug_conf:
+   apply_freq_mask: true
+   apply_time_mask: true
+   apply_time_warp: true
+   freq_mask_width_range:
+   - 0
+   - 27
+   num_freq_mask: 2
+   num_time_mask: 10
+   time_mask_width_ratio_range:
+   - 0.0
+   - 0.05
+   time_warp_mode: bicubic
+   time_warp_window: 5
+ speech_volume_normalize: null
+ token_list:
161
+ - <blank>
162
+ - <unk>
163
+ - <space>
164
+ - E
165
+ - A
166
+ - O
167
+ - N
168
+ - S
169
+ - I
170
+ - ا
171
+ - L
172
+ - T
173
+ - R
174
+ - و
175
+ - D
176
+ - ن
177
+ - ر
178
+ - ی
179
+ - ي
180
+ - M
181
+ - U
182
+ - H
183
+ - P
184
+ - ک
185
+ - م
186
+ - C
187
+ - А
188
+ - Ӹ
189
+ - Н
190
+ - B
191
+ - ت
192
+ - س
193
+ - ل
194
+ - J
195
+ - K
196
+ - ہ
197
+ - Т
198
+ - ے
199
+ - G
200
+ - Ш
201
+ - К
202
+ - Е
203
+ - Л
204
+ - Ы
205
+ - V
206
+ - М
207
+ - ج
208
+ - Ӓ
209
+ - ه
210
+ - ب
211
+ - د
212
+ - О
213
+ - Y
214
+ - '[slv]'
215
+ - Р
216
+ - ڪ
217
+ - پ
218
+ - Z
219
+ - '[mrj]'
220
+ - F
221
+ - گ
222
+ - И
223
+ - В
224
+ - ئ
225
+ - Д
226
+ - '[sot]'
227
+ - ں
228
+ - '[spa]'
229
+ - W
230
+ - Q
231
+ - П
232
+ - Г
233
+ - ف
234
+ - ق
235
+ - С
236
+ - ع
237
+ - ش
238
+ - Ж
239
+ - ز
240
+ - ھ
241
+ - آ
242
+ - Č
243
+ - Í
244
+ - У
245
+ - ح
246
+ - '[urd]'
247
+ - Š
248
+ - ٹ
249
+ - چ
250
+ - Ь
251
+ - ٽ
252
+ - '[snd]'
253
+ - ڻ
254
+ - Й
255
+ - ط
256
+ - ص
257
+ - ٿ
258
+ - Ц
259
+ - خ
260
+ - Ó
261
+ - Я
262
+ - Á
263
+ - É
264
+ - Ч
265
+ - ۾
266
+ - '0'
267
+ - Ž
268
+ - З
269
+ - '1'
270
+ - ۽
271
+ - –
272
+ - ڏ
273
+ - Э
274
+ - ڊ
275
+ - —
276
+ - ڈ
277
+ - ء
278
+ - Ñ
279
+ - ڙ
280
+ - ِ
281
+ - '2'
282
+ - ٻ
283
+ - Х
284
+ - Ӱ
285
+ - ظ
286
+ - ض
287
+ - ث
288
+ - ڳ
289
+ - ،
290
+ - X
291
+ - ¡
292
+ - غ
293
+ - ڑ
294
+ - Ӧ
295
+ - ذ
296
+ - ¿
297
+ - '5'
298
+ - ڌ
299
+ - '3'
300
+ - ڀ
301
+ - ُ
302
+ - '9'
303
+ - Ú
304
+ - '4'
305
+ - '8'
306
+ - ۔
307
+ - '6'
308
+ - ٺ
309
+ - Ю
310
+ - »
311
+ - Б
312
+ - «
313
+ - ڇ
314
+ - ً
315
+ - ڃ
316
+ - '7'
317
+ - ڄ
318
+ - ؤ
319
+ - ڍ
320
+ - Ф
321
+ - َ
322
+ - ٰ
323
+ - ّ
324
+ - ڱ
325
+ - ”
326
+ - ژ
327
+ - ڦ
328
+ - Ё
329
+ - ؛
330
+ - ٍ
331
+ - Щ
332
+ - ؟
333
+ - ’
334
+ - ‘
335
+ - °
336
+ - ۃ
337
+ - إ
338
+ - Ć
339
+ - <sos/eos>
340
+ token_type: char
341
+ train_dtype: float32
342
+ unused_parameters: true
343
+ use_adapter: false
344
+ use_amp: false
345
+ use_lang_prompt: false
346
+ use_matplotlib: true
347
+ use_nlp_prompt: false
348
+ use_preprocessor: true
349
+ use_tensorboard: true
350
+ val_scheduler_criterion:
351
+ - valid
352
+ - loss
353
+ valid_batch_bins: null
354
+ valid_batch_size: null
355
+ valid_batch_type: null
356
+ valid_iterator_type: null
357
+ valid_max_cache_size: null
358
+ version: '202402'
359
+ write_collected_feats: false
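The `ctc_conf` section above selects `ctc_type: droctc` with `dro_group_count: 6`, `dro_step_size: 0.0001`, and `init_strategy: uniform`; the six groups presumably correspond to the six bracketed language tags in `token_list`. As an editorial aid, the sketch below shows the textbook group-DRO weight update (an exponentiated-gradient step on per-group losses followed by renormalisation). It is not taken from ESPnet's `droctc` code, which may differ in details such as smoothing or gradient normalisation. The function names and the loss values are hypothetical.

```python
import math


def update_group_weights(q, group_losses, step_size):
    """One step of q_g <- q_g * exp(step_size * loss_g), renormalised to sum to 1."""
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(q, group_losses)]
    total = sum(scaled)
    return [w / total for w in scaled]


def robust_loss(q, group_losses):
    """Weighted training loss: sum over groups of q_g * loss_g."""
    return sum(w * loss for w, loss in zip(q, group_losses))


group_count = 6        # dro_group_count in config.yaml
step_size = 0.0001     # dro_step_size in config.yaml
q = [1.0 / group_count] * group_count  # init_strategy: uniform

# Made-up per-group losses for one batch, for illustration only.
losses = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0]
q = update_group_weights(q, losses, step_size)
print(robust_loss(q, losses))
```

With a step size this small the weights move very slowly away from uniform, so groups with persistently high loss gain weight only gradually over many updates.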
hyp.trn ADDED
The diff for this file is too large to render. See raw diff
 
ref.trn ADDED
The diff for this file is too large to render. See raw diff
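Since `hyp.trn` and `ref.trn` are not rendered here, the following sketch shows one way such transcript pairs could be scored. It assumes the sclite-style layout in which each line holds space-separated tokens followed by an utterance id in parentheses; that layout is an assumption about these files, not something visible in this commit page. The helper names are hypothetical, and the two example lines are invented.

```python
def parse_trn(lines):
    """Map utterance id -> token list for lines shaped like 'tok tok ... (utt-id)'."""
    parsed = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text, _, utt = line.rpartition("(")
        parsed[utt.rstrip(")")] = text.split()
    return parsed


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def error_rate(ref_lines, hyp_lines):
    """Total edit errors divided by total reference tokens."""
    refs, hyps = parse_trn(ref_lines), parse_trn(hyp_lines)
    errors = sum(edit_distance(refs[u], hyps.get(u, [])) for u in refs)
    total = sum(len(tokens) for tokens in refs.values())
    return errors / total


# Invented example lines, one token per character as with token_type: char.
ref_lines = ["H O L A (spk1-utt1)", "A D I O S (spk1-utt2)"]
hyp_lines = ["H O L (spk1-utt1)", "A D I O Z (spk1-utt2)"]
rate = error_rate(ref_lines, hyp_lines)
print(f"{rate:.3f}")
```

For real files, the two line lists would come from reading `ref.trn` and `hyp.trn` with UTF-8 decoding.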
 
valid.loss.best.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c733a98800965037129fc41de3832cbc22b154f0694a4c0dfc4403d0db653a6b
+ size 1280866892
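The three lines above are a Git LFS pointer: the repository stores the SHA-256 digest and byte size of the checkpoint rather than the checkpoint itself. A minimal sketch for checking a downloaded `valid.loss.best.pth` against those two values follows; the function name `matches_lfs_pointer` is hypothetical, and the local file path is an assumption about where the checkpoint was saved.

```python
import hashlib
import os


def matches_lfs_pointer(path, expected_sha256, expected_size, chunk_size=1 << 20):
    """Return True if the file has the byte size and SHA-256 digest from the pointer."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a checkpoint of over 1 GB is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


# Values copied from the pointer file in this commit.
EXPECTED_OID = "c733a98800965037129fc41de3832cbc22b154f0694a4c0dfc4403d0db653a6b"
EXPECTED_SIZE = 1280866892

# Assumed local path; adjust to wherever the checkpoint was downloaded.
if os.path.exists("valid.loss.best.pth"):
    print(matches_lfs_pointer("valid.loss.best.pth", EXPECTED_OID, EXPECTED_SIZE))
```

A size mismatch is checked first because it is cheap and catches the common case where only the small pointer file was cloned instead of the real checkpoint.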