update model

Files changed (17)

README.md +8 -9
asr.ckpt +2 -2
hyperparams.yaml +26 -48
model_checkpoints/1ad9ee8b47c176b02563689d28740ad00aa9941ddb9d28ab323af9e5ffc9b5dc.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9 +75 -0
model_checkpoints/1ad9ee8b47c176b02563689d28740ad00aa9941ddb9d28ab323af9e5ffc9b5dc.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.json +1 -0
model_checkpoints/1ad9ee8b47c176b02563689d28740ad00aa9941ddb9d28ab323af9e5ffc9b5dc.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.lock +0 -0
model_checkpoints/47c38b5cad9a39b412be044270cd24897dcb7586ee61a0d6c0ce6ca9f4a3eff6.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9 +75 -0
model_checkpoints/47c38b5cad9a39b412be044270cd24897dcb7586ee61a0d6c0ce6ca9f4a3eff6.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.json +1 -0
model_checkpoints/47c38b5cad9a39b412be044270cd24897dcb7586ee61a0d6c0ce6ca9f4a3eff6.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.lock +0 -0
model_checkpoints/86ed03fdf2dc6dfd5e306b11948471c225fe9080a51c2b5f2f58a708e59f65fa.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23 +8 -0
model_checkpoints/86ed03fdf2dc6dfd5e306b11948471c225fe9080a51c2b5f2f58a708e59f65fa.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.json +1 -0
model_checkpoints/86ed03fdf2dc6dfd5e306b11948471c225fe9080a51c2b5f2f58a708e59f65fa.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.lock +0 -0
model_checkpoints/907639155bef046ba66b16c7a377f8cd45a6a81323bb2bb8feb817962e525368.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23 +8 -0
model_checkpoints/907639155bef046ba66b16c7a377f8cd45a6a81323bb2bb8feb817962e525368.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.json +1 -0
model_checkpoints/907639155bef046ba66b16c7a377f8cd45a6a81323bb2bb8feb817962e525368.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.lock +0 -0
tokenizer.ckpt +0 -0
wav2vec2.ckpt +1 -1

README.md CHANGED

@@ -4,7 +4,6 @@ thumbnail:
 tags:
 - automatic-speech-recognition
 - CTC
-- Attention
 - pytorch
 - speechbrain
 - Transformer
@@ -24,21 +23,21 @@ metrics:
 This repository provides all the necessary tools to perform automatic speech
 recognition from an end-to-end system pretrained on CommonVoice (French Language) within
 SpeechBrain. For a better experience, we encourage you to learn more about
-[SpeechBrain](https://speechbrain.github.io).
 The performance of the model is the following:
 | Release | Test CER | Test WER | GPUs |
 |:-------------:|:--------------:|:--------------:| :--------:|
-| 29-04-21 | 9.78 | 13.34 | 2xV100 32GB |
 ## Pipeline description
 This ASR system is composed of 2 different but linked blocks:
 - Tokenizer (unigram) that transforms words into subword units and trained with
 the train transcriptions (train.tsv) of CommonVoice (FR).
-- Acoustic model (wav2vec2.0 + CTC/Attention). A pretrained wav2vec 2.0 model ([LeBenchmark/wav2vec2-FR-M-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-M-large)) is combined with two DNN layers and finetuned on CommonVoice FR.
-The obtained final acoustic representation is given to the CTC and attention decoders.
 ## Install SpeechBrain
@@ -55,9 +54,9 @@ Please notice that we encourage you to read our tutorials and learn more about
 ### Transcribing your own audio files (in French)
 ```python
-from speechbrain.pretrained import EncoderDecoderASR
-asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-fr", savedir="pretrained_models/asr-crdnn-commonvoice-fr")
 asr_model.transcribe_file("example-fr.wav")
 ```
@@ -80,11 +79,11 @@ pip install -e .
 3. Run Training:
 ```bash
-cd recipes/CommonVoice/ASR/seq2seq
 python train_with_wav2vec.py hparams/train_fr_with_wav2vec.yaml --data_folder=your_data_folder
 ```
-You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/1tjz6IZmVRkuRE97E7h1cXFoGTer7pT73?usp=sharing).
 ### Limitations
 The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

 tags:
 - automatic-speech-recognition
 - CTC
 - pytorch
 - speechbrain
 - Transformer
 This repository provides all the necessary tools to perform automatic speech
 recognition from an end-to-end system pretrained on CommonVoice (French Language) within
 SpeechBrain. For a better experience, we encourage you to learn more about
+[SpeechBrain](https://speechbrain.github.io).
 The performance of the model is the following:
 | Release | Test CER | Test WER | GPUs |
 |:-------------:|:--------------:|:--------------:| :--------:|
+| 24-08-21 | 3.19 | 9.96 | 2xV100 32GB |
 ## Pipeline description
 This ASR system is composed of 2 different but linked blocks:
 - Tokenizer (unigram) that transforms words into subword units and trained with
 the train transcriptions (train.tsv) of CommonVoice (FR).
+- Acoustic model (wav2vec2.0 + CTC). A pretrained wav2vec 2.0 model ([LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large)) is combined with two DNN layers and finetuned on CommonVoice FR.
+The obtained final acoustic representation is given to the CTC greedy decoder.
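
For readers unfamiliar with the decoding change, here is a minimal pure-Python sketch of what greedy CTC decoding does with per-frame scores. It is not the SpeechBrain implementation: the function name `ctc_greedy_decode_sketch` and the toy score matrix are hypothetical, and the blank index of 0 is taken from the comment in hyperparams.yaml.

```python
def ctc_greedy_decode_sketch(frame_scores, blank_id=0):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    # Index of the highest-scoring token in each frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    previous = None
    for token in best_path:
        # Keep a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded


# Toy scores over 3 tokens (index 0 = blank) for 6 frames; values are made up.
scores = [
    [0.1, 0.8, 0.1],    # token 1
    [0.1, 0.7, 0.2],    # token 1 again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # token 1 (kept, because a blank separates it)
    [0.1, 0.2, 0.7],    # token 2
    [0.8, 0.1, 0.1],    # blank
]
print(ctc_greedy_decode_sketch(scores))  # [1, 1, 2]
```

Unlike the beam search it replaces, this needs no embedding, attention decoder, or `seq_lin` layer, which is consistent with those modules being dropped from hyperparams.yaml in this commit.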
 ## Install SpeechBrain
 ### Transcribing your own audio files (in French)
 ```python
+from speechbrain.pretrained import EncoderASR
+asr_model = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-commonvoice-fr", savedir="pretrained_models/asr-wav2vec2-commonvoice-fr")
 asr_model.transcribe_file("example-fr.wav")
 ```
 3. Run Training:
 ```bash
+cd recipes/CommonVoice/ASR/CTC/
 python train_with_wav2vec.py hparams/train_fr_with_wav2vec.yaml --data_folder=your_data_folder
 ```
+You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/1T9DfdZwcNI9CURxhLCi8GA5JVz8adiY8?usp=sharing).
 ### Limitations
 The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.

asr.ckpt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8f9b32cfe8a7d10fa852874b5507661d95cdb8c9c8dd9add45976e786e08c52e
-size 60570064

 version https://git-lfs.github.com/spec/v1
+oid sha256:64ba475ed7be735d4ac054c2d537f22251b80f6ecb65cb04217eb0d1ed50a143
+size 12963902
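
The diff above is over a Git LFS pointer file, not the checkpoint itself. As a rough aid for reading it, the sketch below parses the two pointer versions and compares their sizes. The helper name `parse_lfs_pointer` is hypothetical, and the parsing assumes the simple `key value` line layout shown in the diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer (one 'key value' pair per line) into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size in bytes
    }


old = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:8f9b32cfe8a7d10fa852874b5507661d95cdb8c9c8dd9add45976e786e08c52e\n"
    "size 60570064\n"
)
new = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:64ba475ed7be735d4ac054c2d537f22251b80f6ecb65cb04217eb0d1ed50a143\n"
    "size 12963902\n"
)
ratio = new["size"] / old["size"]
print(f"asr.ckpt shrank to about {ratio:.0%} of its previous size")
```

The new asr.ckpt is roughly a fifth of the old one, which fits the removal of the embedding, attentional decoder, and `seq_lin` modules from `asr_model`.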

hyperparams.yaml CHANGED

@@ -5,7 +5,7 @@
 # ################################
 sample_rate: 16000
-wav2vec2_hub: LeBenchmark/wav2vec2-FR-M-large
 # BPE parameters
 token_type: unigram  # ["unigram", "bpe", "char"]
@@ -19,7 +19,7 @@ emb_size: 128
 dec_neurons: 1024
 # Outputs
-output_neurons: 500  # BPE size, index(blank/eos/bos) = 0
 # Decoding parameters
 # Be sure that the bos and eos index match with the BPEs ones
@@ -35,11 +35,27 @@ max_attn_shift: 140
 ctc_weight_decode: 0.0
 temperature: 1.50
-enc: !new:speechbrain.lobes.models.VanillaNN.VanillaNN
     input_shape: [null, null, 1024]
-    activation: !ref <activation>
-    dnn_blocks: !ref <dnn_layers>
-    dnn_neurons: !ref <dnn_neurons>
 wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
     source: !ref <wav2vec2_hub>
@@ -48,69 +64,31 @@ wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
     pretrain: False
     save_path: model_checkpoints
-emb: !new:speechbrain.nnet.embedding.Embedding
-    num_embeddings: !ref <output_neurons>
-    embedding_dim: !ref <emb_size>
-dec: !new:speechbrain.nnet.RNN.AttentionalRNNDecoder
-    enc_dim: !ref <dnn_neurons>
-    input_size: !ref <emb_size>
-    rnn_type: gru
-    attn_type: location
-    hidden_size: 1024
-    attn_dim: 1024
-    num_layers: 1
-    scaling: 1.0
-    channels: 10
-    kernel_size: 100
-    re_init: True
-    dropout: 0.15
 ctc_lin: !new:speechbrain.nnet.linear.Linear
     input_size: !ref <dnn_neurons>
     n_neurons: !ref <output_neurons>
-seq_lin: !new:speechbrain.nnet.linear.Linear
-    input_size: !ref <dec_neurons>
-    n_neurons: !ref <output_neurons>
 log_softmax: !new:speechbrain.nnet.activations.Softmax
     apply_log: True
 ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
     blank_index: !ref <blank_index>
-seq_cost: !name:speechbrain.nnet.losses.nll_loss
-    label_smoothing: 0.1
 asr_model: !new:torch.nn.ModuleList
-    - [!ref <enc>, !ref <emb>, !ref <dec>, !ref <ctc_lin>, !ref <seq_lin>]
 tokenizer: !new:sentencepiece.SentencePieceProcessor
 encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
     wav2vec2: !ref <wav2vec2>
     enc: !ref <enc>
-decoder: !new:speechbrain.decoders.S2SRNNBeamSearcher
-    embedding: !ref <emb>
-    decoder: !ref <dec>
-    linear: !ref <seq_lin>
-    ctc_linear: !ref <ctc_lin>
-    bos_index: !ref <bos_index>
-    eos_index: !ref <eos_index>
-    blank_index: !ref <blank_index>
-    min_decode_ratio: !ref <min_decode_ratio>
-    max_decode_ratio: !ref <max_decode_ratio>
-    beam_size: !ref <beam_size>
-    eos_threshold: !ref <eos_threshold>
-    using_max_attn_shift: !ref <using_max_attn_shift>
-    max_attn_shift: !ref <max_attn_shift>
-    temperature: !ref <temperature>
 modules:
     encoder: !ref <encoder>
-    decoder: !ref <decoder>
 pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
     loadables:

 # ################################
 sample_rate: 16000
+wav2vec2_hub: LeBenchmark/wav2vec2-FR-7K-large
 # BPE parameters
 token_type: unigram  # ["unigram", "bpe", "char"]
 dec_neurons: 1024
 # Outputs
+output_neurons: 76  # BPE size, index(blank/eos/bos) = 0
 # Decoding parameters
 # Be sure that the bos and eos index match with the BPEs ones
 ctc_weight_decode: 0.0
 temperature: 1.50
+enc: !new:speechbrain.nnet.containers.Sequential
     input_shape: [null, null, 1024]
+    linear1: !name:speechbrain.nnet.linear.Linear
+        n_neurons: 1024
+        bias: True
+    bn1: !name:speechbrain.nnet.normalization.BatchNorm1d
+    activation: !new:torch.nn.LeakyReLU
+    drop: !new:torch.nn.Dropout
+        p: 0.15
+    linear2: !name:speechbrain.nnet.linear.Linear
+        n_neurons: 1024
+        bias: True
+    bn2: !name:speechbrain.nnet.normalization.BatchNorm1d
+    activation2: !new:torch.nn.LeakyReLU
+    drop2: !new:torch.nn.Dropout
+        p: 0.15
+    linear3: !name:speechbrain.nnet.linear.Linear
+        n_neurons: 1024
+        bias: True
+    bn3: !name:speechbrain.nnet.normalization.BatchNorm1d
+    activation3: !new:torch.nn.LeakyReLU
 wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
     source: !ref <wav2vec2_hub>
     pretrain: False
     save_path: model_checkpoints
 ctc_lin: !new:speechbrain.nnet.linear.Linear
     input_size: !ref <dnn_neurons>
     n_neurons: !ref <output_neurons>
 log_softmax: !new:speechbrain.nnet.activations.Softmax
     apply_log: True
 ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
     blank_index: !ref <blank_index>
 asr_model: !new:torch.nn.ModuleList
+    - [!ref <enc>, !ref <ctc_lin>]
 tokenizer: !new:sentencepiece.SentencePieceProcessor
 encoder: !new:speechbrain.nnet.containers.LengthsCapableSequential
     wav2vec2: !ref <wav2vec2>
     enc: !ref <enc>
+    ctc_lin: !ref <ctc_lin>
+decoding_function: !name:speechbrain.decoders.ctc_greedy_decode
+    blank_id: !ref <blank_index>
 modules:
     encoder: !ref <encoder>
 pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
     loadables:

model_checkpoints/1ad9ee8b47c176b02563689d28740ad00aa9941ddb9d28ab323af9e5ffc9b5dc.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9 ADDED

	@@ -0,0 +1,75 @@

+{
+  "activation_dropout": 0.0,
+  "apply_spec_augment": true,
+  "architectures": [
+    "Wav2Vec2Model"
+  ],
+  "attention_dropout": 0.1,
+  "bos_token_id": 1,
+  "conv_bias": true,
+  "conv_dim": [
+    512,
+    512,
+    512,
+    512,
+    512,
+    512,
+    512
+  ],
+  "conv_kernel": [
+    10,
+    3,
+    3,
+    3,
+    3,
+    2,
+    2
+  ],
+  "conv_stride": [
+    5,
+    2,
+    2,
+    2,
+    2,
+    2,
+    2
+  ],
+  "ctc_loss_reduction": "sum",
+  "ctc_zero_infinity": false,
+  "do_stable_layer_norm": true,
+  "eos_token_id": 2,
+  "feat_extract_activation": "gelu",
+  "feat_extract_dropout": 0.0,
+  "feat_extract_norm": "layer",
+  "feat_proj_dropout": 0.1,
+  "final_dropout": 0.0,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "layerdrop": 0.1,
+  "mask_channel_length": 10,
+  "mask_channel_min_space": 1,
+  "mask_channel_other": 0.0,
+  "mask_channel_prob": 0.0,
+  "mask_channel_selection": "static",
+  "mask_feature_length": 10,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_space": 1,
+  "mask_time_other": 0.0,
+  "mask_time_prob": 0.075,
+  "mask_time_selection": "static",
+  "model_type": "wav2vec2",
+  "num_attention_heads": 16,
+  "num_conv_pos_embedding_groups": 16,
+  "num_conv_pos_embeddings": 128,
+  "num_feat_extract_layers": 7,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "transformers_version": "4.5.1",
+  "vocab_size": 32
+}
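
The `conv_kernel` and `conv_stride` lists in this config determine how many encoder frames the CTC layer sees per second of audio. The sketch below works that out as plain arithmetic. It assumes unpadded 1-D convolutions and the 16 kHz `sample_rate` from hyperparams.yaml; the variable names are illustrative only.

```python
conv_kernel = [10, 3, 3, 3, 3, 2, 2]
conv_stride = [5, 2, 2, 2, 2, 2, 2]
sample_rate = 16000  # from hyperparams.yaml

receptive_field = 1  # input samples seen by one output frame
hop = 1              # input samples between consecutive output frames
for kernel, stride in zip(conv_kernel, conv_stride):
    # Each layer widens the receptive field by (kernel - 1) steps of the
    # current hop, then multiplies the hop by its stride.
    receptive_field += (kernel - 1) * hop
    hop *= stride

frames_per_second = sample_rate / hop
print(hop, receptive_field, frames_per_second)  # 320 400 50.0
```

Under these assumptions each frame covers 400 samples (25 ms) and frames are 320 samples (20 ms) apart, so the greedy CTC decoder emits at most 50 tokens per second of audio.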

model_checkpoints/1ad9ee8b47c176b02563689d28740ad00aa9941ddb9d28ab323af9e5ffc9b5dc.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.json ADDED

	@@ -0,0 +1 @@


+{"url": "https://huggingface.co/LeBenchmark/wav2vec2-FR-3K-large/resolve/main/config.json", "etag": "\"5565ad893213f0e049dcfd8a397c20224e7e26b9\""}

model_checkpoints/1ad9ee8b47c176b02563689d28740ad00aa9941ddb9d28ab323af9e5ffc9b5dc.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.lock ADDED

File without changes

model_checkpoints/47c38b5cad9a39b412be044270cd24897dcb7586ee61a0d6c0ce6ca9f4a3eff6.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9 ADDED

	@@ -0,0 +1,75 @@

+{
+  "activation_dropout": 0.0,
+  "apply_spec_augment": true,
+  "architectures": [
+    "Wav2Vec2Model"
+  ],
+  "attention_dropout": 0.1,
+  "bos_token_id": 1,
+  "conv_bias": true,
+  "conv_dim": [
+    512,
+    512,
+    512,
+    512,
+    512,
+    512,
+    512
+  ],
+  "conv_kernel": [
+    10,
+    3,
+    3,
+    3,
+    3,
+    2,
+    2
+  ],
+  "conv_stride": [
+    5,
+    2,
+    2,
+    2,
+    2,
+    2,
+    2
+  ],
+  "ctc_loss_reduction": "sum",
+  "ctc_zero_infinity": false,
+  "do_stable_layer_norm": true,
+  "eos_token_id": 2,
+  "feat_extract_activation": "gelu",
+  "feat_extract_dropout": 0.0,
+  "feat_extract_norm": "layer",
+  "feat_proj_dropout": 0.1,
+  "final_dropout": 0.0,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "layerdrop": 0.1,
+  "mask_channel_length": 10,
+  "mask_channel_min_space": 1,
+  "mask_channel_other": 0.0,
+  "mask_channel_prob": 0.0,
+  "mask_channel_selection": "static",
+  "mask_feature_length": 10,
+  "mask_feature_prob": 0.0,
+  "mask_time_length": 10,
+  "mask_time_min_space": 1,
+  "mask_time_other": 0.0,
+  "mask_time_prob": 0.075,
+  "mask_time_selection": "static",
+  "model_type": "wav2vec2",
+  "num_attention_heads": 16,
+  "num_conv_pos_embedding_groups": 16,
+  "num_conv_pos_embeddings": 128,
+  "num_feat_extract_layers": 7,
+  "num_hidden_layers": 24,
+  "pad_token_id": 0,
+  "transformers_version": "4.5.1",
+  "vocab_size": 32
+}

model_checkpoints/47c38b5cad9a39b412be044270cd24897dcb7586ee61a0d6c0ce6ca9f4a3eff6.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.json ADDED

	@@ -0,0 +1 @@


+{"url": "https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large/resolve/main/config.json", "etag": "\"5565ad893213f0e049dcfd8a397c20224e7e26b9\""}

model_checkpoints/47c38b5cad9a39b412be044270cd24897dcb7586ee61a0d6c0ce6ca9f4a3eff6.3aa7c2002067dfc71f74e269e463f76f247952b1abebe4841d03c98c534483b9.lock ADDED

File without changes

model_checkpoints/86ed03fdf2dc6dfd5e306b11948471c225fe9080a51c2b5f2f58a708e59f65fa.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23 ADDED

	@@ -0,0 +1,8 @@

+{
+  "do_normalize": true,
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}
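
To make the preprocessor settings above concrete, the following pure-Python sketch shows one plausible reading of them: per-utterance zero-mean, unit-variance normalization (`do_normalize`), right padding with `padding_value`, and an attention mask marking real samples. It is an illustration under stated assumptions, not the Hugging Face feature extractor; the function name `normalize_and_pad` and the epsilon value are assumptions.

```python
import math


def normalize_and_pad(batch, padding_value=0.0, eps=1e-7):
    """Normalize each waveform, then right-pad to the longest one."""
    longest = max(len(wav) for wav in batch)
    values, masks = [], []
    for wav in batch:
        mean = sum(wav) / len(wav)
        var = sum((x - mean) ** 2 for x in wav) / len(wav)
        scale = math.sqrt(var + eps)  # eps (assumed value) avoids division by zero
        normed = [(x - mean) / scale for x in wav]
        pad = longest - len(wav)
        values.append(normed + [padding_value] * pad)  # padding_side: right
        masks.append([1] * len(wav) + [0] * pad)       # 1 = real sample
    return values, masks


# Two toy waveforms of different lengths; the samples are made up.
values, masks = normalize_and_pad([[0.1, 0.3, -0.2, 0.0], [0.5, -0.5]])
print(masks)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The mask lets downstream layers ignore the padded tail, which matters when utterances of different lengths are batched together.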

model_checkpoints/86ed03fdf2dc6dfd5e306b11948471c225fe9080a51c2b5f2f58a708e59f65fa.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.json ADDED

	@@ -0,0 +1 @@


+{"url": "https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large/resolve/main/preprocessor_config.json", "etag": "\"0886a48276922a77013d8aa4681192138ae90d90\""}

model_checkpoints/86ed03fdf2dc6dfd5e306b11948471c225fe9080a51c2b5f2f58a708e59f65fa.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.lock ADDED

File without changes

model_checkpoints/907639155bef046ba66b16c7a377f8cd45a6a81323bb2bb8feb817962e525368.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23 ADDED

	@@ -0,0 +1,8 @@

+{
+  "do_normalize": true,
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}

model_checkpoints/907639155bef046ba66b16c7a377f8cd45a6a81323bb2bb8feb817962e525368.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.json ADDED

	@@ -0,0 +1 @@


+{"url": "https://huggingface.co/LeBenchmark/wav2vec2-FR-3K-large/resolve/main/preprocessor_config.json", "etag": "\"0886a48276922a77013d8aa4681192138ae90d90\""}

model_checkpoints/907639155bef046ba66b16c7a377f8cd45a6a81323bb2bb8feb817962e525368.fcd266b775b7f33ba9b607a0fee7cc615aeb2eb281586f046280492ea380ae23.lock ADDED

File without changes

tokenizer.ckpt CHANGED

Binary files a/tokenizer.ckpt and b/tokenizer.ckpt differ

wav2vec2.ckpt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ae1869d41bd746312a183ce45f3119696d6a275680b0a01a7e5d2ebeba7e8a42
 size 1261930757

 version https://git-lfs.github.com/spec/v1
+oid sha256:1b2d9f900fd7a57a30bdc6220606f1ccf37f582f07aab7a5b75213ac46c38204
 size 1261930757