Warit committed on
Commit
48f7b43
·
verified ·
1 Parent(s): f6ba1f9

Push model using huggingface_hub.

Files changed (2)
  1. README.md +145 -59
  2. typhoon-asr-realtime.nemo +1 -1
README.md CHANGED
@@ -4,8 +4,6 @@ license: cc-by-4.0
  tags:
  - pytorch
  - NeMo
- base_model:
- - nvidia/stt_en_fastconformer_transducer_large
  ---

  # Typhoon-asr-realtime
@@ -16,95 +14,183 @@ img {
  }
  </style>

- | [![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transducer-lightgrey#model-badge)](#model-architecture)
- | [![Model size](https://img.shields.io/badge/Params-114M-lightgrey#model-badge)](#model-architecture)
- | [![Language](https://img.shields.io/badge/Language-th-lightgrey#model-badge)](#datasets)

- Typhoon ASR Realtime is a next-generation, open-source Automatic Speech Recognition (ASR) model built specifically for real-world streaming applications in the Thai language. It is designed to deliver fast and accurate transcriptions while running efficiently on standard CPUs. This enables users to host their own ASR service without requiring expensive, specialized hardware or relying on third-party cloud services for sensitive data.

- The model is based on [NVIDIA's FastConformer Transducer model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer), which is optimized for low-latency, real-time performance.

- **Try our demo available on [Demo]()**

- **Code / Examples available on [Github](https://github.com/scb-10x/typhoon-asr)**

- **Release Blog available on [OpenTyphoon Blog](https://opentyphoon.ai/blog/en/typhoon-asr-realtime-release)**

- ***

- ### Performance

- ### Summary of Findings

- ***

- ### Usage and Implementation

- **(Recommended): Quick Start with Google Colab**

- For a hands-on demonstration without any local setup, you can run this project directly in Google Colab. The notebook provides a complete environment to transcribe audio files and experiment with the model.

- [![Alt text](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1t4tlRTJToYRolTmiN5ZWDR67ymdRnpAz?usp=sharing)

- **(Recommended): Using the `typhoon-asr` Package**

- This is the easiest way to get started. You can install the package via pip and use it directly from the command line or within your Python code.

- **1. Install the package:**
- ```bash
- pip install typhoon-asr
- ```

- **2. Command-Line Usage:**
- ```bash
- # Basic transcription (auto-detects device)
- typhoon-asr path/to/your_audio.wav

- # Transcription with timestamps on a specific device
- typhoon-asr path/to/your_audio.mp3 --with-timestamps --device cuda
- ```

- **3. Python API Usage:**
- ```python
- from typhoon_asr import transcribe

- # Basic transcription
- result = transcribe("path/to/your_audio.wav")
- print(result['text'])

- # Transcription with timestamps
- result_with_timestamps = transcribe("path/to/your_audio.wav", with_timestamps=True)
- print(result_with_timestamps)
- ```
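For batch processing, the same Python API can be called in a loop. A minimal sketch, assuming the `typhoon-asr` package is installed; the `audio/` directory is a placeholder, and only `transcribe()` and `result['text']` come from the usage shown above:

```python
from pathlib import Path

from typhoon_asr import transcribe

# Hypothetical batch run: transcribe every WAV file under a placeholder audio/ directory.
for audio_path in sorted(Path("audio").glob("*.wav")):
    result = transcribe(str(audio_path))
    print(f"{audio_path.name}\t{result['text']}")
```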

- **(Alternative): Running from the Repository Script**

- You can also run the model by cloning the repository and using the inference script directly. This method is useful for development or if you need to modify the underlying code.

- **1. Clone the repository and install dependencies:**
- ```bash
- git clone https://github.com/scb10x/typhoon-asr.git
- cd typhoon-asr
- pip install -r requirements.txt
- ```

- **2. Run the inference script:**
- The `typhoon_asr_inference.py` script handles audio resampling and processing automatically.

- ```bash
- # Basic Transcription (CPU):
- python typhoon_asr_inference.py path/to/your_audio.m4a

- # Transcription with Estimated Timestamps:
- python typhoon_asr_inference.py path/to/your_audio.wav --with-timestamps

- # Transcription on a GPU:
- python typhoon_asr_inference.py path/to/your_audio.mp3 --device cuda
- ```

+ [![Model architecture](https://img.shields.io/badge/Model_Arch-PUT-YOUR-ARCHITECTURE-HERE-lightgrey#model-badge)](#model-architecture)
+ | [![Model size](https://img.shields.io/badge/Params-PUT-YOUR-MODEL-SIZE-HERE-lightgrey#model-badge)](#model-architecture)
+ | [![Language](https://img.shields.io/badge/Language-PUT-YOUR-LANGUAGE-HERE-lightgrey#model-badge)](#datasets)

+ **Put a short model description here.**

+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.

+ ## NVIDIA NeMo: Training

+ To train, fine-tune, or play with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
+ ```
+ pip install nemo_toolkit['all']
+ ```

+ ## How to Use this Model

+ The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

+ ### Automatically instantiate the model

+ **NOTE**: Please update the model class below to match the class of the model being uploaded.

+ ```python
+ from nemo.core import ModelPT
+ asr_model = ModelPT.from_pretrained("scb10x/typhoon-asr-realtime")
+ ```
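The checkpoint is also distributed in this repository as `typhoon-asr-realtime.nemo`, so it can be loaded from a local file as well. A minimal sketch, assuming `nemo_toolkit` is installed and the file has been downloaded (the audio path is a placeholder):

```python
# Minimal sketch: load the released .nemo checkpoint from a local file with NVIDIA NeMo.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from("typhoon-asr-realtime.nemo")

# Transcribe an audio file (placeholder path).
print(asr_model.transcribe(["path/to/your_audio.wav"]))
```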

+ ### NOTE

+ Add some information about how to use the model here. An example is provided for ASR inference below.

+ ### Transcribing using Python
+ First, let's get a sample:
+ ```
+ wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
+ ```
+ Then simply do:
+ ```
+ asr_model.transcribe(['2086-149220-0033.wav'])
+ ```

+ ### Transcribing many audio files

+ ```shell
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="scb10x/typhoon-asr-realtime" audio_dir=""
+ ```

+ ### Input

+ **Add some information about what the inputs to this model are**

+ ### Output

+ **Add some information about what the outputs of this model are**

+ ## Model Architecture

+ **Add information here discussing architectural details of the model or any comments to users about the model.**

+ ## Training

+ **Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**

+ ### NOTE

+ An example is provided below for ASR

+ The NeMo toolkit [1] was used for training the models for several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).
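For orientation, a hypothetical invocation of that example script is sketched below; the Hydra config path, manifest paths, tokenizer directory, and trainer settings are placeholders, not the exact command used for this model:

```shell
# Hypothetical sketch only: all paths and override values below are placeholders.
python examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py \
  --config-path=../conf/fastconformer \
  --config-name=fast-conformer_transducer_bpe \
  model.train_ds.manifest_filepath=/data/train_manifest.json \
  model.validation_ds.manifest_filepath=/data/dev_manifest.json \
  model.tokenizer.dir=/data/tokenizer \
  trainer.devices=-1 \
  trainer.max_epochs=100
```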

+ The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
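A hypothetical call to that tokenizer-building script could look like the following; the manifest path, output directory, and vocabulary size are assumptions for illustration, not the values used for this model:

```shell
# Hypothetical sketch: build a SentencePiece BPE tokenizer from a training manifest (placeholder paths and sizes).
python scripts/tokenizers/process_asr_text_tokenizer.py \
  --manifest=/data/train_manifest.json \
  --data_root=/data/tokenizer \
  --vocab_size=1024 \
  --tokenizer=spe \
  --spe_type=bpe \
  --log
```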

+ ### Datasets

+ **Try to provide as detailed a list of datasets as possible. If possible, provide links to the datasets on HF by adding it to the manifest section at the top of the README (marked by ---).**
+
+ ### NOTE
+
+ An example for the manifest section is provided below for ASR datasets
+
+ datasets:
+ - librispeech_asr
+ - fisher_corpus
+ - Switchboard-1
+ - WSJ-0
+ - WSJ-1
+ - National-Singapore-Corpus-Part-1
+ - National-Singapore-Corpus-Part-6
+ - vctk
+ - voxpopuli
+ - europarl
+ - multilingual_librispeech
+ - mozilla-foundation/common_voice_8_0
+ - MLCommons/peoples_speech
+
+ The corresponding text in this section for those datasets is stated below:
+
+ The model was trained on 64K hours of English speech collected and prepared by the NVIDIA NeMo and Suno teams.
+
+ The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:
+
+ - Librispeech 960 hours of English speech
+ - Fisher Corpus
+ - Switchboard-1 Dataset
+ - WSJ-0 and WSJ-1
+ - National Speech Corpus (Part 1, Part 6)
+ - VCTK
+ - VoxPopuli (EN)
+ - Europarl-ASR (EN)
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
+ - Mozilla Common Voice (v7.0)
+ - People's Speech - 12,000 hour subset
+
+ ## Performance
+
+ **Add information here about the performance of the model. Discuss the metric used to evaluate the model, and if there are external links explaining the custom metric, please link to them.
+
+ ### NOTE
+
+ An example is provided below for an ASR metrics list that can be added to the top of the README
+
+ model-index:
+ - name: PUT_MODEL_NAME
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: AMI (Meetings test)
+       type: edinburghcstr/ami
+       config: ihm
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 17.10
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Earnings-22
+       type: revdotcom/earnings22
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 14.11
+
+ Provide any caveats about the results presented at the top of the discussion so that nuance is not lost.
+
+ It should ideally be in a tabular format (you can use the following website to make your tables in markdown format - https://www.tablesgenerator.com/markdown_tables)**
+
+ ## Limitations
+
+ **Discuss any practical limitations to the model when being used in real-world cases. They can also be legal disclaimers, or discussion regarding the safety of the model (particularly in the case of LLMs).**
+
+ ### NOTE
+
+ An example is provided below
+
+ Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.


  ## License

  License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). By downloading the public release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
+
+ ## References
+
+ **Provide appropriate references in the markdown link format below. Please order them numerically.**
+
+ [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
typhoon-asr-realtime.nemo CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:473a382a3776003554e57364cb41dd4a456ed6aa771149046fb7596870c59b14
+ oid sha256:d3b731545a0a5cc9841e6b0fcb54cc87a38feae30cb987bb279a57b6734cbce2
  size 462469120