Nguyen Thai Khanh committed on
Commit 7788774 · 1 Parent(s): dc70350

Upload fine-tuned Vietnamese wav2vec2 ASR model

Files changed (3)
  1. README.md +125 -0
  2. preprocessor_config.json +9 -0
  3. vocab.json +1 -0
README.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ language: vi
+ license: apache-2.0
+ base_model: nguyenvulebinh/wav2vec2-base-vi
+ tags:
+ - wav2vec2
+ - automatic-speech-recognition
+ - speech
+ - audio
+ - vietnamese
+ - pytorch
+ - CTC
+ datasets:
+ - custom-vietnamese-speech
+ metrics:
+ - wer
+ model-index:
+ - name: khanusa/nd_asr_wav2vec2
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Custom Vietnamese Speech Dataset
+       type: custom
+     metrics:
+     - name: WER
+       type: wer
+       value: 0.2123
+ ---
+
+ # khanusa/nd_asr_wav2vec2
+
+ This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on `nguyenvulebinh/wav2vec2-base-vi`.
+
+ ## Model Description
+
+ - **Language:** Vietnamese
+ - **Task:** Automatic Speech Recognition
+ - **Base Model:** nguyenvulebinh/wav2vec2-base-vi
+ - **Architecture:** Wav2Vec2 + CTC head
+ - **Training Framework:** PyTorch
+ - **Fine-tuning Data:** Custom Vietnamese speech dataset
+
+ ## Usage
+
+ ```python
+ import torch
+ import librosa
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ # Load model and processor
+ processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
+ model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
+ model.eval()
+
+ # Load audio and resample to the 16 kHz rate the model expects
+ audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)
+
+ # Extract input features and run inference
+ inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     logits = model(inputs.input_values).logits
+
+ # Greedy decoding: most likely token per frame, then CTC collapse
+ predicted_ids = torch.argmax(logits, dim=-1)
+ transcription = processor.batch_decode(predicted_ids)[0]
+ print(transcription)
+ ```
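+
+ For quick experiments the same checkpoint can also be driven through the `transformers` `pipeline` API; a minimal sketch (the `chunk_length_s` value is an illustrative choice, not a documented setting):
+
+ ```python
+ from transformers import pipeline
+
+ # Build an ASR pipeline directly from the checkpoint on the Hub
+ asr = pipeline("automatic-speech-recognition", model="khanusa/nd_asr_wav2vec2")
+
+ # chunk_length_s splits long recordings into windows, which suits a model
+ # fine-tuned on 7-11 second clips
+ result = asr("path_to_your_audio.wav", chunk_length_s=10)
+ print(result["text"])
+ ```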
+
+ ## Training Details
+
+ ### Training Data
+ Custom Vietnamese speech dataset.
+
+ ### Training Procedure
+ - **Optimizer:** AdamW
+ - **Learning Rate:** 5e-6
+ - **Batch Size:** 8 (with 4 gradient accumulation steps, for an effective batch size of 32)
+ - **Epochs:** 50
+ - **Audio Duration:** 7-11 second clips
+ - **Sampling Rate:** 16 kHz
+ - **Audio Format:** 16-bit PCM
+ - **Label Smoothing:** 0.1
+
+ ### Training Configuration
+ - Mixed-precision training (AMP)
+ - Gradient clipping: 1.0
+ - Warmup steps: 2000
+ - Early stopping patience: 8 epochs
+
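+ These hyperparameters map directly onto `transformers.TrainingArguments`. The sketch below is a reconstruction from the lists above, not the actual training script (which is not part of this upload); note the 0.1 label smoothing would require a custom CTC loss rather than a built-in flag:
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Hypothetical configuration mirroring the hyperparameters listed above
+ training_args = TrainingArguments(
+     output_dir="nd_asr_wav2vec2",
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=4,   # effective batch size 8 x 4 = 32
+     learning_rate=5e-6,              # AdamW is the Trainer's default optimizer
+     num_train_epochs=50,
+     warmup_steps=2000,
+     max_grad_norm=1.0,               # gradient clipping
+     fp16=True,                       # mixed-precision (AMP) training
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,     # prerequisite for early stopping
+     metric_for_best_model="wer",
+     greater_is_better=False,         # lower WER is better
+ )
+ # Early stopping would then be attached to the Trainer via
+ # EarlyStoppingCallback(early_stopping_patience=8).
+ ```
+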
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | WER    | 0.2123 |
+
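+ The WER can be recomputed with any standard edit-distance tool. A minimal sketch using the `jiwer` package (an assumption; the card does not state which tool produced the score, and the strings below are placeholders):
+
+ ```python
+ import jiwer
+
+ # Placeholder reference transcripts and model hypotheses
+ references = ["xin chào việt nam", "hôm nay trời đẹp"]
+ hypotheses = ["xin chào viet nam", "hôm nay trời đẹp"]
+
+ # WER = (substitutions + deletions + insertions) / reference word count
+ print(f"WER: {jiwer.wer(references, hypotheses):.4f}")
+ ```
+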
+ ## Limitations and Bias
+
+ This model was fine-tuned on a specific Vietnamese speech dataset, starting from the Vietnamese pre-trained base model `nguyenvulebinh/wav2vec2-base-vi`, and may not generalize well to:
+ - Vietnamese dialects that differ from the training data
+ - Noisy environments not represented in the training data
+ - Domain-specific vocabulary outside the training scope
+ - Audio quality or recording conditions different from the training data
+
+ ## Citation
+
+ ```bibtex
+ @article{baevski2020wav2vec,
+   title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
+   author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
+   journal={Advances in Neural Information Processing Systems},
+   volume={33},
+   pages={12449--12460},
+   year={2020}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Apache 2.0 License.
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "normalizer": {
+     "do_lower_case": true,
+     "strip_accents": null,
+     "keep_accents": true
+   },
+   "tokenizer_type": "Wav2Vec2CTCTokenizer"
+ }
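The `normalizer` block implies transcripts are lowercased while Vietnamese diacritics are kept (`strip_accents` is null, `keep_accents` is true). A minimal sketch of equivalent transcript preprocessing, assuming it runs before tokenization (the function name is illustrative):

```python
import unicodedata

def normalize_transcript(text: str) -> str:
    # NFC-compose so each accented letter is a single code point,
    # matching the precomposed Vietnamese characters in vocab.json
    text = unicodedata.normalize("NFC", text)
    # do_lower_case: true -- lowercase, but keep diacritics intact
    return text.lower()

print(normalize_transcript("Xin Chào Việt Nam"))  # -> "xin chào việt nam"
```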
vocab.json ADDED
@@ -0,0 +1 @@
+ {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, "i": 9, "j": 10, "k": 11, "l": 12, "m": 13, "n": 14, "o": 15, "p": 16, "q": 17, "r": 18, "s": 19, "t": 20, "u": 21, "v": 22, "w": 23, "x": 24, "y": 25, "z": 26, "\u00e0": 27, "\u00e1": 28, "\u00e2": 29, "\u00e3": 30, "\u00e8": 31, "\u00e9": 32, "\u00ea": 33, "\u00ec": 34, "\u00ed": 35, "\u00f2": 36, "\u00f3": 37, "\u00f4": 38, "\u00f5": 39, "\u00f9": 40, "\u00fa": 41, "\u00fd": 42, "\u0103": 43, "\u0111": 44, "\u0129": 45, "\u0169": 46, "\u01a1": 47, "\u01b0": 48, "\u1ea1": 49, "\u1ea3": 50, "\u1ea5": 51, "\u1ea7": 52, "\u1ea9": 53, "\u1eab": 54, "\u1ead": 55, "\u1eaf": 56, "\u1eb1": 57, "\u1eb3": 58, "\u1eb5": 59, "\u1eb7": 60, "\u1eb9": 61, "\u1ebb": 62, "\u1ebd": 63, "\u1ebf": 64, "\u1ec1": 65, "\u1ec3": 66, "\u1ec5": 67, "\u1ec7": 68, "\u1ec9": 69, "\u1ecb": 70, "\u1ecd": 71, "\u1ecf": 72, "\u1ed1": 73, "\u1ed3": 74, "\u1ed5": 75, "\u1ed7": 76, "\u1ed9": 77, "\u1edb": 78, "\u1edd": 79, "\u1edf": 80, "\u1ee1": 81, "\u1ee3": 82, "\u1ee5": 83, "\u1ee7": 84, "\u1ee9": 85, "\u1eeb": 86, "\u1eed": 87, "\u1eef": 88, "\u1ef1": 89, "\u1ef3": 90, "\u1ef5": 91, "\u1ef7": 92, "\u1ef9": 93, "|": 0, "<bos>": 94, "<eos>": 95, "<unk>": 96, "<pad>": 97}
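The vocabulary maps lowercase Latin letters and precomposed Vietnamese characters to ids, with `|` (id 0) as the word delimiter and `<pad>` (id 97) in the slot wav2vec2 conventionally uses as the CTC blank. A sketch of the greedy decoding that `processor.batch_decode` performs over this vocabulary (the blank-token choice follows wav2vec2 convention rather than anything stated in this upload, and the id sequence is a made-up example):

```python
import json

with open("vocab.json") as f:
    vocab = json.load(f)
id2char = {i: c for c, i in vocab.items()}

BLANK_ID = vocab["<pad>"]  # assumed CTC blank, per wav2vec2 convention

def greedy_ctc_decode(ids):
    chars, prev = [], None
    for i in ids:
        # collapse consecutive repeats, then drop blank frames
        if i != prev and i != BLANK_ID:
            chars.append(id2char[i])
        prev = i
    # "|" marks word boundaries in this vocabulary
    return "".join(chars).replace("|", " ").strip()

# ids for c, h, h (repeat), <pad> (blank), à, o  ->  "chào"
print(greedy_ctc_decode([3, 8, 8, 97, 27, 15]))
```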