ESPnet
multilingual
audio
codec
ftshijt committed
Commit 5866e6b · 1 Parent(s): 6ebd595

Update model

Files changed (37)
  1. README.md +351 -0
  2. exp/codec_train_sedac_large_v4.2_raw_fs16000/70epoch.pth +3 -0
  3. exp/codec_train_sedac_large_v4.2_raw_fs16000/config.yaml +276 -0
  4. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/adv_loss.png +0 -0
  5. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/codec_commit_loss.png +0 -0
  6. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/codec_quantization_loss.png +0 -0
  7. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_backward_time.png +0 -0
  8. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_forward_time.png +0 -0
  9. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_loss.png +0 -0
  10. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_optim_step_time.png +0 -0
  11. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_train_time.png +0 -0
  12. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_adv_loss.png +0 -0
  13. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_fake_loss.png +0 -0
  14. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_feat_match_loss.png +0 -0
  15. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_mel_loss.png +0 -0
  16. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_quantization_loss.png +0 -0
  17. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_real_loss.png +0 -0
  18. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_reconstruct_loss.png +0 -0
  19. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhancement_nan_batch_dis.png +0 -0
  20. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhancement_nan_batch_gen.png +0 -0
  21. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/fake_loss.png +0 -0
  22. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/feat_match_loss.png +0 -0
  23. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_backward_time.png +0 -0
  24. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_forward_time.png +0 -0
  25. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_optim_step_time.png +0 -0
  26. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_train_time.png +0 -0
  27. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/gpu_max_cached_mem_GB.png +0 -0
  28. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/iter_time.png +0 -0
  29. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/loss.png +0 -0
  30. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/mel_loss.png +0 -0
  31. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/mel_loss_real.png +0 -0
  32. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/optim0_lr0.png +0 -0
  33. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/optim1_lr0.png +0 -0
  34. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/real_loss.png +0 -0
  35. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/reconstruct_loss.png +0 -0
  36. exp/codec_train_sedac_large_v4.2_raw_fs16000/images/train_time.png +0 -0
  37. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,351 @@
+ ---
+ tags:
+ - espnet
+ - audio
+ - codec
+ language: multilingual
+ datasets:
+ - amuse
+ license: cc-by-4.0
+ ---
+
+ ## ESPnet2 Codec model
+
+ ### `espnet/owsm_pure_codec_v1.2_16k`
+
+ This model was trained by ftshijt using the amuse recipe in [ESPnet](https://github.com/espnet/espnet/).
+
+ ### Demo: How to use in ESPnet2
+
+ Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
+ if you haven't done so already.
+
+ ```bash
+ cd espnet
+ git checkout 280bfedf2c9a19038e79d3402472bde30397a02c
+ pip install -e .
+ cd egs2/amuse/codec1
+ ./run.sh --skip_data_prep false --skip_train true --download_model espnet/owsm_pure_codec_v1.2_16k
+ ```
+
+ ## Codec config
+
+ <details><summary>expand</summary>
+
+ ```
+ config: conf/train_sedac_large_v4.2.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: chunk
+ valid_iterator_type: null
+ output_dir: exp/codec_train_sedac_large_v4.2_raw_fs16000
+ ngpu: 1
+ seed: 777
+ num_workers: 1
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ use_deepspeed: false
+ deepspeed_config: null
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ use_tf32: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 360
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - valid
+ - mel_loss
+ - min
+ - - train
+ - mel_loss
+ - min
+ - - train
+ - total_count
+ - max
+ keep_nbest_models: 5
+ nbest_averaging_interval: 0
+ grad_clip: 1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 100
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param:
+ - /work/nvme/bbjs/shi3/codec/espnet/egs2/amuse/codec1/exp/codec_train_sedac_large_v4-0_raw_fs16000/latest.pth
+ ignore_init_mismatch: true
+ freeze_param:
+ - codec.generator.encoder
+ num_iters_per_epoch: 5000
+ batch_size: 32
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ category_sample_size: 10
+ train_shape_file:
+ - exp/codec_stats_raw/train/audio_shape
+ valid_shape_file:
+ - exp/codec_stats_raw/valid/audio_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ fold_length:
+ - 256000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 32000
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 256
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ chunk_max_abs_length: null
+ chunk_discard_short_samples: true
+ train_data_path_and_name_and_type:
+ - - dump/raw/owsm_all/wav.scp
+ - audio
+ - kaldi_ark
+ valid_data_path_and_name_and_type:
+ - - dump/raw/dev-small/wav.scp
+ - audio
+ - kaldi_ark
+ multi_task_dataset: false
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+ lr: 0.0002
+ betas:
+ - 0.5
+ - 0.9
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+ gamma: 0.999875
+ optim2: adamw
+ optim2_conf:
+ lr: 0.0002
+ betas:
+ - 0.5
+ - 0.9
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+ gamma: 0.999875
+ generator_first: true
+ skip_discriminator_prob: 0.0
+ model_conf: {}
+ use_preprocessor: true
+ codec: se_dac2
+ codec_conf:
+ sampling_rate: 16000
+ generator_params:
+ hidden_dim: 512
+ codebook_dim: 512
+ se_model_source: espnet
+ se_model_tag: wyz/vctk_dns2020_whamr_bsrnn_large_noncausal
+ enhanced_n_streams: 1
+ encdec_channels: 1
+ encdec_n_filters: 32
+ encdec_n_residual_layers: 3
+ encdec_ratios:
+ - 8
+ - 5
+ - 4
+ - 2
+ encdec_activation: Snake
+ encdec_norm: weight_norm
+ encdec_kernel_size: 7
+ encdec_residual_kernel_size: 7
+ encdec_last_kernel_size: 7
+ encdec_dilation_base: 2
+ encdec_causal: false
+ encdec_pad_mode: reflect
+ encdec_true_skip: false
+ encdec_compress: 2
+ encdec_lstm: 2
+ decoder_trim_right_ratio: 1.0
+ decoder_final_activation: null
+ decoder_final_activation_params: null
+ quantizer_n_q: 8
+ quantizer_bins: 1024
+ quantizer_decay: 0.99
+ quantizer_kmeans_init: true
+ quantizer_kmeans_iters: 50
+ quantizer_threshold_ema_dead_code: 2
+ quantizer_target_bandwidth:
+ - 1
+ - 2
+ - 4
+ quantizer_dropout: true
+ sample_rate: 16000
+ inference_only: false
+ discriminator_params:
+ msmpmb_discriminator_params:
+ rates: []
+ sample_rate: 16000
+ fft_sizes:
+ - 1024
+ - 512
+ - 256
+ - 128
+ periods:
+ - 2
+ - 3
+ - 5
+ - 7
+ - 11
+ period_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 5
+ - 3
+ channels: 32
+ downsample_scales:
+ - 3
+ - 3
+ - 3
+ - 3
+ - 1
+ max_downsample_channels: 1024
+ bias: true
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ band_discriminator_params:
+ hop_factor: 0.25
+ sample_rate: 16000
+ bands:
+ - - 0.0
+ - 0.1
+ - - 0.1
+ - 0.25
+ - - 0.25
+ - 0.5
+ - - 0.5
+ - 0.75
+ - - 0.75
+ - 1.0
+ channel: 32
+ generator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ discriminator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ use_feat_match_loss: true
+ feat_match_loss_params:
+ average_by_discriminators: false
+ average_by_layers: false
+ include_final_outputs: true
+ use_mel_loss: true
+ mel_loss_params:
+ range_start: 6
+ range_end: 11
+ window: hann
+ n_mels: 80
+ fmin: 0
+ fmax: null
+ log_base: null
+ fs: 16000
+ skip_quantizer_updates: 0
+ lambda_quantization: 0.25
+ lambda_commit: 1.0
+ lambda_reconstruct: 1.0
+ lambda_adv: 1.0
+ lambda_mel: 45.0
+ lambda_feat_match: 2.0
+ enhanced_prob: 0.5
+ cache_generator_outputs: true
+ required:
+ - output_dir
+ version: '202402'
+ distributed: false
+ ```
+
+ </details>
+
+ ### Citing ESPnet
+
+ ```bibtex
+ @inproceedings{watanabe2018espnet,
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ title={{ESPnet}: End-to-End Speech Processing Toolkit},
+ year={2018},
+ booktitle={Proceedings of Interspeech},
+ pages={2207--2211},
+ doi={10.21437/Interspeech.2018-1456},
+ url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
+ }
+ ```
+
+ or arXiv:
+
+ ```bibtex
+ @misc{watanabe2018espnet,
+ title={ESPnet: End-to-End Speech Processing Toolkit},
+ author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
+ year={2018},
+ eprint={1804.00015},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
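The codec settings in the README config imply a frame rate and a bitrate per codebook. The arithmetic below is a rough sketch, assuming the usual residual vector quantization convention: the hop size is the product of `encdec_ratios`, each codebook costs `log2(quantizer_bins)` bits per frame, and `quantizer_target_bandwidth` is given in kbps. Whether `se_dac2` follows this convention exactly is an assumption; the commit itself does not say.

```python
import math

# Values copied from the codec_conf section of the config.
sampling_rate = 16000
encdec_ratios = [8, 5, 4, 2]
quantizer_bins = 1024
quantizer_n_q = 8
quantizer_target_bandwidth = [1, 2, 4]  # assumed to be in kbps

hop_length = math.prod(encdec_ratios)          # samples per code frame
frame_rate = sampling_rate / hop_length        # code frames per second
bits_per_codebook = math.log2(quantizer_bins)  # bits per frame, per codebook
kbps_per_codebook = frame_rate * bits_per_codebook / 1000


def codebooks_for(bandwidth_kbps: float) -> int:
    """Codebooks that fit in a target bandwidth (assumed convention)."""
    return max(1, math.floor(bandwidth_kbps / kbps_per_codebook))


codebooks_per_bandwidth = {
    bw: codebooks_for(bw) for bw in quantizer_target_bandwidth
}
max_kbps = quantizer_n_q * kbps_per_codebook

print(f"hop={hop_length} samples, frame rate={frame_rate} Hz")
print(f"codebooks per target bandwidth: {codebooks_per_bandwidth}")
print(f"bitrate with all {quantizer_n_q} codebooks: {max_kbps} kbps")
```

Under these assumptions the three target bandwidths line up with 2, 4, and 8 codebooks, and the largest one uses all `quantizer_n_q` codebooks.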
exp/codec_train_sedac_large_v4.2_raw_fs16000/70epoch.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c2d6195ccfff1569df5dc6037d63c4f9ab9911706a4f089ae4809cdfa50c968e
+ size 709979942
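The three lines above are a Git LFS pointer, not the checkpoint itself: they record the spec version, the SHA-256 digest, and the size in bytes of the real file. As a minimal sketch, the pointer can be parsed and used to check a downloaded copy. The helper names `parse_lfs_pointer` and `matches_pointer` are illustrative, not part of any library.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c2d6195ccfff1569df5dc6037d63c4f9ab9911706a4f089ae4809cdfa50c968e
size 709979942
"""


def parse_lfs_pointer(text: str) -> dict:
    """Turn the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Check a local file against the pointer's size and SHA-256 digest."""
    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large checkpoints are not loaded at once.
        for block in iter(lambda: f.read(1 << 20), b""):
            sha.update(block)
            size += len(block)
    return size == pointer["size_bytes"] and sha.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20
print(f"{pointer['hash_algo']} {pointer['digest'][:12]}... {size_mib:.1f} MiB")
```

After downloading the checkpoint, `matches_pointer("70epoch.pth", pointer)` would confirm the file is intact; the local path here is a placeholder.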
exp/codec_train_sedac_large_v4.2_raw_fs16000/config.yaml ADDED
@@ -0,0 +1,276 @@
+ config: conf/train_sedac_large_v4.2.yaml
+ print_config: false
+ log_level: INFO
+ drop_last_iter: false
+ dry_run: false
+ iterator_type: chunk
+ valid_iterator_type: null
+ output_dir: exp/codec_train_sedac_large_v4.2_raw_fs16000
+ ngpu: 1
+ seed: 777
+ num_workers: 1
+ num_att_plot: 0
+ dist_backend: nccl
+ dist_init_method: env://
+ dist_world_size: null
+ dist_rank: null
+ local_rank: 0
+ dist_master_addr: null
+ dist_master_port: null
+ dist_launcher: null
+ multiprocessing_distributed: false
+ unused_parameters: true
+ sharded_ddp: false
+ use_deepspeed: false
+ deepspeed_config: null
+ cudnn_enabled: true
+ cudnn_benchmark: false
+ cudnn_deterministic: false
+ use_tf32: false
+ collect_stats: false
+ write_collected_feats: false
+ max_epoch: 360
+ patience: null
+ val_scheduler_criterion:
+ - valid
+ - loss
+ early_stopping_criterion:
+ - valid
+ - loss
+ - min
+ best_model_criterion:
+ - - valid
+ - mel_loss
+ - min
+ - - train
+ - mel_loss
+ - min
+ - - train
+ - total_count
+ - max
+ keep_nbest_models: 5
+ nbest_averaging_interval: 0
+ grad_clip: 1
+ grad_clip_type: 2.0
+ grad_noise: false
+ accum_grad: 1
+ no_forward_run: false
+ resume: true
+ train_dtype: float32
+ use_amp: false
+ log_interval: 100
+ use_matplotlib: true
+ use_tensorboard: true
+ create_graph_in_tensorboard: false
+ use_wandb: false
+ wandb_project: null
+ wandb_id: null
+ wandb_entity: null
+ wandb_name: null
+ wandb_model_log_interval: -1
+ detect_anomaly: false
+ use_adapter: false
+ adapter: lora
+ save_strategy: all
+ adapter_conf: {}
+ pretrain_path: null
+ init_param:
+ - /work/nvme/bbjs/shi3/codec/espnet/egs2/amuse/codec1/exp/codec_train_sedac_large_v4-0_raw_fs16000/latest.pth
+ ignore_init_mismatch: true
+ freeze_param:
+ - codec.generator.encoder
+ num_iters_per_epoch: 5000
+ batch_size: 32
+ valid_batch_size: null
+ batch_bins: 1000000
+ valid_batch_bins: null
+ category_sample_size: 10
+ train_shape_file:
+ - exp/codec_stats_raw/train/audio_shape
+ valid_shape_file:
+ - exp/codec_stats_raw/valid/audio_shape
+ batch_type: unsorted
+ valid_batch_type: null
+ fold_length:
+ - 256000
+ sort_in_batch: descending
+ shuffle_within_batch: false
+ sort_batch: descending
+ multiple_iterator: false
+ chunk_length: 32000
+ chunk_shift_ratio: 0.5
+ num_cache_chunks: 256
+ chunk_excluded_key_prefixes: []
+ chunk_default_fs: null
+ chunk_max_abs_length: null
+ chunk_discard_short_samples: true
+ train_data_path_and_name_and_type:
+ - - dump/raw/owsm_all/wav.scp
+ - audio
+ - kaldi_ark
+ valid_data_path_and_name_and_type:
+ - - dump/raw/dev-small/wav.scp
+ - audio
+ - kaldi_ark
+ multi_task_dataset: false
+ allow_variable_data_keys: false
+ max_cache_size: 0.0
+ max_cache_fd: 32
+ allow_multi_rates: false
+ valid_max_cache_size: null
+ exclude_weight_decay: false
+ exclude_weight_decay_conf: {}
+ optim: adamw
+ optim_conf:
+ lr: 0.0002
+ betas:
+ - 0.5
+ - 0.9
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler: exponentiallr
+ scheduler_conf:
+ gamma: 0.999875
+ optim2: adamw
+ optim2_conf:
+ lr: 0.0002
+ betas:
+ - 0.5
+ - 0.9
+ eps: 1.0e-09
+ weight_decay: 0.0
+ scheduler2: exponentiallr
+ scheduler2_conf:
+ gamma: 0.999875
+ generator_first: true
+ skip_discriminator_prob: 0.0
+ model_conf: {}
+ use_preprocessor: true
+ codec: se_dac2
+ codec_conf:
+ sampling_rate: 16000
+ generator_params:
+ hidden_dim: 512
+ codebook_dim: 512
+ se_model_source: espnet
+ se_model_tag: wyz/vctk_dns2020_whamr_bsrnn_large_noncausal
+ enhanced_n_streams: 1
+ encdec_channels: 1
+ encdec_n_filters: 32
+ encdec_n_residual_layers: 3
+ encdec_ratios:
+ - 8
+ - 5
+ - 4
+ - 2
+ encdec_activation: Snake
+ encdec_norm: weight_norm
+ encdec_kernel_size: 7
+ encdec_residual_kernel_size: 7
+ encdec_last_kernel_size: 7
+ encdec_dilation_base: 2
+ encdec_causal: false
+ encdec_pad_mode: reflect
+ encdec_true_skip: false
+ encdec_compress: 2
+ encdec_lstm: 2
+ decoder_trim_right_ratio: 1.0
+ decoder_final_activation: null
+ decoder_final_activation_params: null
+ quantizer_n_q: 8
+ quantizer_bins: 1024
+ quantizer_decay: 0.99
+ quantizer_kmeans_init: true
+ quantizer_kmeans_iters: 50
+ quantizer_threshold_ema_dead_code: 2
+ quantizer_target_bandwidth:
+ - 1
+ - 2
+ - 4
+ quantizer_dropout: true
+ sample_rate: 16000
+ inference_only: false
+ discriminator_params:
+ msmpmb_discriminator_params:
+ rates: []
+ sample_rate: 16000
+ fft_sizes:
+ - 1024
+ - 512
+ - 256
+ - 128
+ periods:
+ - 2
+ - 3
+ - 5
+ - 7
+ - 11
+ period_discriminator_params:
+ in_channels: 1
+ out_channels: 1
+ kernel_sizes:
+ - 5
+ - 3
+ channels: 32
+ downsample_scales:
+ - 3
+ - 3
+ - 3
+ - 3
+ - 1
+ max_downsample_channels: 1024
+ bias: true
+ nonlinear_activation: LeakyReLU
+ nonlinear_activation_params:
+ negative_slope: 0.1
+ use_weight_norm: true
+ use_spectral_norm: false
+ band_discriminator_params:
+ hop_factor: 0.25
+ sample_rate: 16000
+ bands:
+ - - 0.0
+ - 0.1
+ - - 0.1
+ - 0.25
+ - - 0.25
+ - 0.5
+ - - 0.5
+ - 0.75
+ - - 0.75
+ - 1.0
+ channel: 32
+ generator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ discriminator_adv_loss_params:
+ average_by_discriminators: false
+ loss_type: mse
+ use_feat_match_loss: true
+ feat_match_loss_params:
+ average_by_discriminators: false
+ average_by_layers: false
+ include_final_outputs: true
+ use_mel_loss: true
+ mel_loss_params:
+ range_start: 6
+ range_end: 11
+ window: hann
+ n_mels: 80
+ fmin: 0
+ fmax: null
+ log_base: null
+ fs: 16000
+ skip_quantizer_updates: 0
+ lambda_quantization: 0.25
+ lambda_commit: 1.0
+ lambda_reconstruct: 1.0
+ lambda_adv: 1.0
+ lambda_mel: 45.0
+ lambda_feat_match: 2.0
+ enhanced_prob: 0.5
+ cache_generator_outputs: true
+ required:
+ - output_dir
+ version: '202402'
+ distributed: false
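Both optimizers in this config use `exponentiallr` with `gamma: 0.999875`, which multiplies the learning rate by `gamma` at every scheduler step. The sketch below only evaluates the closed form `lr_k = lr_0 * gamma**k`. How often the trainer steps the scheduler is not stated in this commit, so the per-epoch and per-iteration readings are both shown as assumptions.

```python
# Values copied from optim_conf / scheduler_conf and the trainer settings.
base_lr = 0.0002
gamma = 0.999875
num_iters_per_epoch = 5000
packed_epoch = 70   # from the checkpoint name 70epoch.pth
max_epoch = 360


def lr_after(steps: int) -> float:
    """Learning rate after `steps` steps of exponential decay."""
    return base_lr * gamma ** steps


# Assumption A: the scheduler steps once per epoch.
lr_epoch_70 = lr_after(packed_epoch)
lr_epoch_360 = lr_after(max_epoch)

# Assumption B: the scheduler steps once per training iteration.
lr_iter_70 = lr_after(packed_epoch * num_iters_per_epoch)

print(f"per-epoch stepping:     epoch 70 -> {lr_epoch_70:.3e}, "
      f"epoch 360 -> {lr_epoch_360:.3e}")
print(f"per-iteration stepping: epoch 70 -> {lr_iter_70:.3e}")
```

With per-epoch stepping the rate barely moves over the run, while per-iteration stepping would drive it to effectively zero well before epoch 70, so the per-epoch reading looks more plausible for this `gamma`.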
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/adv_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/codec_commit_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/codec_quantization_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_backward_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_forward_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_optim_step_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/discriminator_train_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_adv_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_fake_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_feat_match_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_mel_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_quantization_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_real_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhanced_reconstruct_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhancement_nan_batch_dis.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/enhancement_nan_batch_gen.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/fake_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/feat_match_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_backward_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_forward_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_optim_step_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/generator_train_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/gpu_max_cached_mem_GB.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/iter_time.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/mel_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/mel_loss_real.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/optim0_lr0.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/optim1_lr0.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/real_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/reconstruct_loss.png ADDED
exp/codec_train_sedac_large_v4.2_raw_fs16000/images/train_time.png ADDED
meta.yaml ADDED
@@ -0,0 +1,8 @@
+ espnet: '202402'
+ files:
+ model_file: exp/codec_train_sedac_large_v4.2_raw_fs16000/70epoch.pth
+ python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:26:55) [GCC 12.3.0]
+ timestamp: 1748552998.170263
+ torch: 2.6.0.dev20241209+cu124
+ yaml_files:
+ train_config: exp/codec_train_sedac_large_v4.2_raw_fs16000/config.yaml
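Two fields in `meta.yaml` are easier to read once decoded: `timestamp` and the epoch number embedded in `model_file`. The snippet below is a small sketch that assumes the timestamp is Unix epoch seconds and that the checkpoint name follows an `<N>epoch.pth` pattern. Both are inferred from the values shown, not documented in this commit.

```python
import re
from datetime import datetime, timezone

# Values copied from meta.yaml.
timestamp = 1748552998.170263
model_file = "exp/codec_train_sedac_large_v4.2_raw_fs16000/70epoch.pth"

# Assumption: the timestamp is seconds since the Unix epoch.
packed_at = datetime.fromtimestamp(timestamp, tz=timezone.utc)

# Assumption: the checkpoint file is named "<N>epoch.pth".
file_name = model_file.rsplit("/", 1)[-1]
match = re.fullmatch(r"(\d+)epoch\.pth", file_name)
epoch = int(match.group(1)) if match else None

print(f"packed at {packed_at.isoformat()} (UTC), checkpoint epoch {epoch}")
```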