Nguyen Thai Khanh committed · commit 7788774
Parent(s): dc70350

Upload fine-tuned Vietnamese wav2vec2 ASR model

Files changed:
- README.md (+125, -0)
- preprocessor_config.json (+9, -0)
- vocab.json (+1, -0)

README.md (ADDED, @@ -0,0 +1,125 @@)
---
language: vi
license: apache-2.0
base_model: nguyenvulebinh/wav2vec2-base-vi
tags:
- wav2vec2
- automatic-speech-recognition
- speech
- audio
- vietnamese
- pytorch
- CTC
datasets:
- custom-vietnamese-speech
metrics:
- wer
model-index:
- name: khanusa/nd_asr_wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Custom Vietnamese Speech Dataset
      type: custom
    metrics:
    - name: WER
      type: wer
      value: 0.2123
---

# khanusa/nd_asr_wav2vec2

This is a fine-tuned wav2vec2 model for Vietnamese Automatic Speech Recognition (ASR), based on `nguyenvulebinh/wav2vec2-base-vi`.

## Model Description

- **Language:** Vietnamese
- **Task:** Automatic Speech Recognition
- **Base Model:** nguyenvulebinh/wav2vec2-base-vi
- **Architecture:** Wav2Vec2 + CTC head
- **Training Framework:** PyTorch
- **Fine-tuning Data:** Custom Vietnamese speech dataset

## Usage

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
model.eval()

# Load audio and resample to the 16 kHz rate the model expects
audio, sr = librosa.load("path_to_your_audio.wav", sr=16000)

# Extract features and run inference
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: take the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```

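The processor can also pad several clips into one batch. A minimal sketch, assuming the file names below are placeholders for your own recordings:

```python
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("khanusa/nd_asr_wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("khanusa/nd_asr_wav2vec2")
model.eval()

paths = ["clip1.wav", "clip2.wav"]  # placeholder file names
audios = [librosa.load(p, sr=16000)[0] for p in paths]

# Pad the batch to a common length and keep the attention mask
inputs = processor(audios, sampling_rate=16000, return_tensors="pt",
                   padding=True, return_attention_mask=True)
with torch.no_grad():
    logits = model(inputs.input_values,
                   attention_mask=inputs.attention_mask).logits

transcriptions = processor.batch_decode(torch.argmax(logits, dim=-1))
print(transcriptions)
```
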
## Training Details

### Training Data

Custom Vietnamese speech dataset.

### Training Procedure

- **Optimizer:** AdamW
- **Learning Rate:** 5e-6
- **Batch Size:** 8 (with gradient accumulation over 4 steps, i.e. an effective batch size of 32)
- **Epochs:** 50
- **Audio Duration:** 7-11 second clips
- **Sampling Rate:** 16 kHz
- **Audio Format:** 16-bit PCM
- **Label Smoothing:** 0.1

### Training Configuration

- Mixed precision training (AMP)
- Gradient clipping: 1.0
- Warmup steps: 2000
- Early stopping patience: 8 epochs (see the sketch after this list for how these settings fit together)

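As a rough illustration of the configuration above, here is a minimal PyTorch sketch. The `train` function and its `train_loader` argument are assumptions standing in for the unpublished training script; a linear warmup schedule is assumed, and label smoothing and early stopping are omitted for brevity.

```python
import torch
from transformers import Wav2Vec2ForCTC, get_linear_schedule_with_warmup

def train(model, train_loader, epochs=50, accum_steps=4):
    """Sketch of the setup above; train_loader is an assumed DataLoader
    yielding padded batches with "input_values" and "labels" tensors."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)
    total_steps = len(train_loader) * epochs // accum_steps
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=2000, num_training_steps=total_steps
    )
    scaler = torch.cuda.amp.GradScaler()  # mixed precision (AMP)
    for _ in range(epochs):
        for step, batch in enumerate(train_loader):
            with torch.cuda.amp.autocast():
                out = model(batch["input_values"].cuda(),
                            labels=batch["labels"].cuda())
                loss = out.loss / accum_steps  # rescale for accumulation
            scaler.scale(loss).backward()
            if (step + 1) % accum_steps == 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()

model = Wav2Vec2ForCTC.from_pretrained("nguyenvulebinh/wav2vec2-base-vi").cuda()
# train(model, train_loader)  # train_loader: your own DataLoader
```
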
## Performance

| Metric | Value  |
|--------|--------|
| WER    | 0.2123 |

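A WER like the one above can be reproduced with any standard implementation; a minimal sketch using the `jiwer` package (the strings are illustrative, not drawn from the evaluation set):

```python
import jiwer

references = ["xin chào việt nam"]   # ground-truth transcripts
hypotheses = ["xin chao việt nam"]   # model outputs (one substitution)
print(jiwer.wer(references, hypotheses))  # 1 error in 4 words -> 0.25
```
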
## Limitations and Bias

This model was fine-tuned from a Vietnamese base model (`nguyenvulebinh/wav2vec2-base-vi`) on a specific Vietnamese speech dataset and may not generalize well to:
- Different Vietnamese dialects
- Noisy environments not represented in the training data
- Domain-specific vocabulary outside the training scope
- Audio quality that differs from the training conditions

## Citation

```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A framework for self-supervised learning of speech representations},
  author={Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  pages={12449--12460},
  year={2020}
}
```

## License

This model is released under the Apache 2.0 License.

preprocessor_config.json (ADDED, @@ -0,0 +1,9 @@)

```json
{
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "normalizer": {
    "do_lower_case": true,
    "strip_accents": null,
    "keep_accents": true
  },
  "tokenizer_type": "Wav2Vec2CTCTokenizer"
}
```

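The `normalizer` block implies that transcripts are lower-cased while Vietnamese diacritics are preserved (`strip_accents` is null and `keep_accents` is true). A minimal sketch of that behavior; the function name is illustrative:

```python
def normalize_transcript(text: str) -> str:
    # Lower-case only; accents/diacritics are kept, per the config above
    return text.lower()

print(normalize_transcript("Xin Chào Việt Nam"))  # -> "xin chào việt nam"
```
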
vocab.json (ADDED, @@ -0,0 +1 @@)

```json
{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6, "g": 7, "h": 8, "i": 9, "j": 10, "k": 11, "l": 12, "m": 13, "n": 14, "o": 15, "p": 16, "q": 17, "r": 18, "s": 19, "t": 20, "u": 21, "v": 22, "w": 23, "x": 24, "y": 25, "z": 26, "\u00e0": 27, "\u00e1": 28, "\u00e2": 29, "\u00e3": 30, "\u00e8": 31, "\u00e9": 32, "\u00ea": 33, "\u00ec": 34, "\u00ed": 35, "\u00f2": 36, "\u00f3": 37, "\u00f4": 38, "\u00f5": 39, "\u00f9": 40, "\u00fa": 41, "\u00fd": 42, "\u0103": 43, "\u0111": 44, "\u0129": 45, "\u0169": 46, "\u01a1": 47, "\u01b0": 48, "\u1ea1": 49, "\u1ea3": 50, "\u1ea5": 51, "\u1ea7": 52, "\u1ea9": 53, "\u1eab": 54, "\u1ead": 55, "\u1eaf": 56, "\u1eb1": 57, "\u1eb3": 58, "\u1eb5": 59, "\u1eb7": 60, "\u1eb9": 61, "\u1ebb": 62, "\u1ebd": 63, "\u1ebf": 64, "\u1ec1": 65, "\u1ec3": 66, "\u1ec5": 67, "\u1ec7": 68, "\u1ec9": 69, "\u1ecb": 70, "\u1ecd": 71, "\u1ecf": 72, "\u1ed1": 73, "\u1ed3": 74, "\u1ed5": 75, "\u1ed7": 76, "\u1ed9": 77, "\u1edb": 78, "\u1edd": 79, "\u1edf": 80, "\u1ee1": 81, "\u1ee3": 82, "\u1ee5": 83, "\u1ee7": 84, "\u1ee9": 85, "\u1eeb": 86, "\u1eed": 87, "\u1eef": 88, "\u1ef1": 89, "\u1ef3": 90, "\u1ef5": 91, "\u1ef7": 92, "\u1ef9": 93, "|": 0, "<bos>": 94, "<eos>": 95, "<unk>": 96, "<pad>": 97}
```

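The vocabulary maps lower-case Vietnamese characters to ids, with `|` (id 0) standing in for spaces and `<bos>`, `<eos>`, `<unk>`, `<pad>` as special tokens. A minimal sketch of loading it as a CTC tokenizer, assuming `vocab.json` has been saved locally:

```python
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    bos_token="<bos>",
    eos_token="<eos>",
    unk_token="<unk>",
    pad_token="<pad>",
    word_delimiter_token="|",
)

ids = tokenizer("xin chào").input_ids  # spaces map to "|" (id 0)
print(ids)
print(tokenizer.decode(ids))  # -> "xin chào"
```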