piotrzelasko committed e378166 (verified) · Parent(s): eff877a

Update README.md

Files changed (1): README.md (+53 / -30)

README.md CHANGED
@@ -55,7 +55,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 10.18
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -68,7 +68,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 10.42
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -81,7 +81,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 9.41
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -95,7 +95,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 1.6
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -137,7 +137,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 2.72
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -173,7 +173,7 @@ img {
 # Model Overview

 ## Description:
- NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the art performance on multiple English speech benchmarks. With 2.5 billion parameters and running at 458 RTFx, Canary-Qwen-2.5B supports automatic speech-to-text recognition (ASR) in English with punctuation and capitalization (PnC). The model is intended as a transcription tool only, and not expected to extend the LLM capabilities into speech modality. This model is ready for commercial use.

 ### License/Terms of Use:
 Canary-Qwen-2.5B is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
@@ -197,8 +197,20 @@ Canary-Qwen-2.5B is released under the CC-BY-4.0 license. By using this model, y

 [9] [SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation](https://arxiv.org/abs/2310.09424)

 ## Model Architecture:
- Canary-Qwen is a Speech-Augmented Language Model (SALM) [9] model with FastConformer [2] Encoder and Transformer Decoder [3]. It is built using two base models: `nvidia/canary-1b-flash` [1,5] and `Qwen/Qwen3-1.7B` [4], a linear projection, and LoRA applied to the LLM. The audio encoder computes audio representation that is mapped to the LLM embedding space via a linear projection, and concatenated with the embeddings of text tokens. The model is prompted with "Transcribe the following: <audio>", using Qwen's chat template.

 ### Limitations

@@ -225,9 +237,10 @@ model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
 ```

 ## Input:
 **Input Type(s):** Audio, text prompt <br>
- **Input Format(s):** .wav or .flac files<br>
- **Input Parameters(s):** 1D <br>
 **Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>

 Input to Canary-Qwen-2.5B is a batch of prompts that include audio.
@@ -266,13 +279,14 @@ python examples/speechlm2/salm_generate.py \
 ## Output:
 **Output Type(s):** Text <br>
 **Output Format:** Text transcript as a sequence of token IDs or a string <br>
- **Output Parameters:** 1-Dimensional text string <br>
 **Other Properties Related to Output:** May Need Inverse Text Normalization <br>

 ## Software Integration:
 **Runtime Engine(s):**
- * NeMo - 2.4.0 or higher <br>

 **Supported Hardware Microarchitecture Compatibility:** <br>
 * [NVIDIA Ampere] <br>
@@ -292,11 +306,25 @@ python examples/speechlm2/salm_generate.py \
 ## Model Version(s):
 Canary-Qwen-2.5B <br>

 # Training and Evaluation Datasets:

 ## Training Dataset:

 The Canary-Qwen-2.5B model is trained on a total of 234K hrs of publicly available speech data.
 The datasets below include conversations, videos from the web and audiobook recordings.

@@ -306,9 +334,11 @@ The datasets below include conversations, videos from the web and audiobook reco
 **Labeling Method:**
 * Hybrid: Human, Automated <br>

 #### English (234.5k hours)

- The majority of the training data come from English portion of the Granary dataset [7]:

 - YouTube-Commons (YTC) (109.5k hours)
 - YODAS2 (77k hours)
@@ -355,20 +385,6 @@ Noise Robustness:
 Model Fairness:
 * [Casual Conversations Dataset](https://arxiv.org/pdf/2104.02821)

- ## Training
-
- Canary-Qwen-2.5B was trained using the NVIDIA NeMo toolkit [6] for a total of 90k steps on 32 NVIDIA A100 80GB GPUs. LLM parameters were kept frozen. Speech encoder, projection, and LoRA parameters were trainable. The encoder's output frame rate is 80ms, or 12.5 tokens per second. The model was trained on approximately 1.3B tokens in total (this number inlcudes the speech encoder output frames, text response tokens, prompt tokens, and chat template tokens). The model was trained in bfloat16 precision (not using AMP) and bucketing.
-
- The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/salm_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/conf/salm.yaml).
-
- The tokenizer was inherited from `Qwen/Qwen3-1.7B`.
-
- ## Inference:
- **Engine:** NVIDIA NeMo <br>
- **Test Hardware :** <br>
- * A6000 <br>
- * A100 <br>
-
 ## Performance

 The ASR predictions were generated using greedy decoding.
@@ -381,7 +397,7 @@ WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/

 | **Version** | **Model** | **RTFx** | **Mean** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
- | 2.4.0 | Canary-Qwen-2.5B | 458.5 | 5.62 | 10.18 | 9.41 | 1.60 | 3.10 | 10.42 | 1.90 | 2.72 | 5.66 |

 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

@@ -389,14 +405,14 @@ More details on evaluation can be found at [HuggingFace ASR Leaderboard](https:/
 Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs eval set (`max_new_tokens=50` following `nvidia/canary-1b-flash` evaluation)
 | **Version** | **Model** | **# of characters per minute** |
 |:-----------:|:---------:|:----------:|
- | 2.4.0 | Canary-Qwen-2.5B | 138.1 |

 ### Noise Robustness
 WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise

 | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
- | 2.4.0 | Canary-Qwen-2.5B | 2.41% | 4.08% | 9.83% | 30.60% |

 ## Model Fairness Evaluation

@@ -418,6 +434,13 @@ As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversat

 (Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)

 ## Ethical Considerations:
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
- For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
 
     metrics:
     - name: Test WER
       type: wer
+      value: 10.19
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
 
     metrics:
     - name: Test WER
       type: wer
+      value: 10.45
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
 
     metrics:
     - name: Test WER
       type: wer
+      value: 9.43
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
 
     metrics:
     - name: Test WER
       type: wer
+      value: 1.61
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
 
     metrics:
     - name: Test WER
       type: wer
+      value: 2.71
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
 
 # Model Overview

 ## Description:
+ NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. With 2.5 billion parameters and running at 458 RTFx, Canary-Qwen-2.5B supports automatic speech-to-text recognition (ASR) in English with punctuation and capitalization (PnC). The model works in two modes: as a transcription tool (ASR mode) and as an LLM (LLM mode). In ASR mode, the model only transcribes speech into text and does not retain LLM-specific skills such as reasoning. In LLM mode, the model retains all of the original LLM capabilities, which can be used to post-process the transcript, e.g. summarize it or answer questions about it. In LLM mode, the model no longer "understands" the raw audio, only its transcript. This model is ready for commercial use.

 ### License/Terms of Use:
 Canary-Qwen-2.5B is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>
 

 [9] [SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation](https://arxiv.org/abs/2310.09424)

+ ### Deployment Geography:
+
+ Global
+
+ ### Use Case:
+
+ The model is intended for users requiring speech-to-text transcription capabilities for English speech, and/or transcript post-processing capabilities enabled by prompting the underlying LLM. Typical use cases: transcription, summarization, and answering user questions about the transcript.
+
+ ### Release Date:
+
+ Hugging Face 07/15/2025 via https://huggingface.co/nvidia/canary-qwen-2.5b
+
 ## Model Architecture:
+ Canary-Qwen is a Speech-Augmented Language Model (SALM) [9] with a FastConformer [2] Encoder and Transformer Decoder [3]. It is built using two base models: `nvidia/canary-1b-flash` [1,5] and `Qwen/Qwen3-1.7B` [4], a linear projection, and low-rank adaptation (LoRA) applied to the LLM. The audio encoder computes an audio representation that is mapped to the LLM embedding space via a linear projection and concatenated with the embeddings of text tokens. The model is prompted with "Transcribe the following: <audio>", using Qwen's chat template.
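
For orientation, below is a minimal PyTorch sketch of that wiring: projected encoder frames are spliced into the token-embedding sequence in place of the audio placeholder. The dimensions, module names, and stand-in tensors are illustrative assumptions, not the actual NeMo implementation.

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not the real Canary/Qwen dimensions.
ENC_DIM, LLM_DIM, VOCAB = 1024, 2048, 151936

class AudioToLLMProjection(nn.Module):
    """Maps speech-encoder frames into the LLM embedding space (the 'linear projection')."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(frames)

# Stand-ins for the real components: random encoder output and a random embedding table.
enc_frames = torch.randn(1, 37, ENC_DIM)        # ~3 s of audio at 12.5 frames/s
embed = nn.Embedding(VOCAB, LLM_DIM)            # LLM token-embedding table
ids_before = torch.randint(0, VOCAB, (1, 8))    # tokens of "Transcribe the following:"
ids_after = torch.randint(0, VOCAB, (1, 2))     # chat-template suffix tokens

# Replace the <audio> placeholder with projected audio frames, then feed the LLM.
audio_embs = AudioToLLMProjection(ENC_DIM, LLM_DIM)(enc_frames)
llm_inputs = torch.cat([embed(ids_before), audio_embs, embed(ids_after)], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 47, 2048])
```

The concatenated sequence is then consumed by the LoRA-adapted LLM exactly as if it were ordinary text embeddings.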

 ### Limitations

 
 ```

 ## Input:
+
 **Input Type(s):** Audio, text prompt <br>
+ **Input Format(s):** Audio: .wav or .flac files. Text prompt string for ASR mode: `Transcribe the following: <|audioplaceholder|>` <br>
+ **Input Parameter(s):** Audio: Two-Dimensional (batch, audio-samples); Text: One-Dimensional (string) <br>
 **Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>

 Input to Canary-Qwen-2.5B is a batch of prompts that include audio.
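
As a concrete illustration, here is a sketch of invoking ASR mode with such a prompt. It reuses the `SALM` class loaded earlier in this card; the `generate` call with `prompts`/`audio` fields, `max_new_tokens`, and `tokenizer.ids_to_text` are assumptions based on the NeMo `speechlm2` examples, so verify them against `examples/speechlm2/salm_generate.py` for your NeMo release.

```python
from nemo.collections.speechlm2.models import SALM

model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')

# One prompt = one chat turn; the placeholder tag marks where the audio goes,
# and the "audio" field lists the 16 kHz mono recording(s) to attach.
prompts = [
    [{"role": "user",
      "content": "Transcribe the following: <|audioplaceholder|>",
      "audio": ["speech.wav"]}]
]
answer_ids = model.generate(prompts=prompts, max_new_tokens=128)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```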
 
 ## Output:
 **Output Type(s):** Text <br>
 **Output Format:** Text transcript as a sequence of token IDs or a string <br>
+ **Output Parameters:** One-Dimensional text string <br>
 **Other Properties Related to Output:** May Need Inverse Text Normalization <br>
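
If written-form output (digits, currency, dates) is needed, inverse text normalization can be applied as a post-processing step. A hedged sketch using NeMo's separate `nemo_text_processing` package follows; the import path and API names are assumptions from its documentation, so check them against the installed version.

```python
# Assumes `pip install nemo_text_processing`; exact package layout may differ by release.
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

itn = InverseNormalizer(lang="en")
spoken = "it starts at nine thirty a m and costs ten dollars"
print(itn.inverse_normalize(spoken, verbose=False))
# Written form along the lines of "it starts at 9:30 a.m. and costs $10"
# (exact output depends on the grammar version).
```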

+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

 ## Software Integration:
 **Runtime Engine(s):**
+ * NeMo - 2.5.0 or higher <br>

 **Supported Hardware Microarchitecture Compatibility:** <br>
 * [NVIDIA Ampere] <br>
 
 ## Model Version(s):
 Canary-Qwen-2.5B <br>

+ ## Training
+
+ Canary-Qwen-2.5B was trained using the NVIDIA NeMo toolkit [6] for a total of 90k steps on 32 NVIDIA A100 80GB GPUs. LLM parameters were kept frozen. Speech encoder, projection, and LoRA parameters were trainable. The encoder's output frame rate is 80ms, or 12.5 tokens per second. The model was trained on approximately 1.3B tokens in total (this number includes the speech encoder output frames, text response tokens, prompt tokens, and chat template tokens).
+
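Conceptually, the frozen/trainable split described above corresponds to the following PyTorch sketch. It is illustrative only: matching LoRA parameters by the substring "lora" is an assumption, and the authoritative recipe is the NeMo script and config linked below.

```python
import torch.nn as nn

def configure_trainable_params(speech_encoder: nn.Module, projection: nn.Module, llm: nn.Module) -> None:
    """Freeze the LLM; leave the speech encoder, projection, and LoRA adapters trainable."""
    for p in llm.parameters():
        p.requires_grad = False
    for module in (speech_encoder, projection):
        for p in module.parameters():
            p.requires_grad = True
    # LoRA adapter weights live inside the otherwise-frozen LLM; re-enable them by name.
    for name, p in llm.named_parameters():
        if "lora" in name.lower():
            p.requires_grad = True
```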
+ The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/salm_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/conf/salm.yaml).
+
+ The tokenizer was inherited from `Qwen/Qwen3-1.7B`.

 # Training and Evaluation Datasets:

 ## Training Dataset:

+ * The total size (in number of data points): approx. 40 million (speech, text) pairs
+ * Total number of datasets: 26, with 18 for training and 8 for test
+ * Dataset partition: Training 99.6%, testing 0.04%, validation 0%
+ * Time period for training data collection: 1990-2025
+ * Time period for testing data collection: 2005-2022
+ * Time period for validation data collection: N/A (unused)
+
 The Canary-Qwen-2.5B model is trained on a total of 234K hrs of publicly available speech data.
 The datasets below include conversations, videos from the web and audiobook recordings.
 
 **Labeling Method:**
 * Hybrid: Human, Automated <br>

+ ### Properties
+
 #### English (234.5k hours)

+ The majority of the training data comes from the English portion of the Granary dataset [7]:

 - YouTube-Commons (YTC) (109.5k hours)
 - YODAS2 (77k hours)
 
 Model Fairness:
 * [Casual Conversations Dataset](https://arxiv.org/pdf/2104.02821)

 ## Performance

 The ASR predictions were generated using greedy decoding.
 

 | **Version** | **Model** | **RTFx** | **Mean** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
 |:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
+ | 2.5.0 | Canary-Qwen-2.5B | 458.5 | 5.62 | 10.18 | 9.41 | 1.60 | 3.10 | 10.42 | 1.90 | 2.72 | 5.66 |

 More details on evaluation can be found at [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)
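
For readers unfamiliar with the RTFx column: it is the inverse real-time factor, i.e. how many seconds of audio are transcribed per second of wall-clock time (higher is faster). A small sketch of the computation; the timing harness here is hypothetical and is not the leaderboard's code.

```python
import time

def rtfx(total_audio_seconds: float, transcribe_all) -> float:
    """Seconds of audio transcribed per second of wall-clock time (higher = faster)."""
    start = time.perf_counter()
    transcribe_all()  # hypothetical callable that runs inference over the whole eval set
    return total_audio_seconds / (time.perf_counter() - start)

# Sanity check of the arithmetic: 1 hour of audio transcribed in ~7.85 s gives RTFx ~ 458.
print(3600 / 7.85)  # ~458.6
```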

 Number of characters per minute on [MUSAN](https://www.openslr.org/17) 48 hrs eval set (`max_new_tokens=50` following `nvidia/canary-1b-flash` evaluation)
 | **Version** | **Model** | **# of characters per minute** |
 |:-----------:|:---------:|:----------:|
+ | 2.5.0 | Canary-Qwen-2.5B | 138.1 |

 ### Noise Robustness
 WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise

 | **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
 |:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
+ | 2.5.0 | Canary-Qwen-2.5B | 2.41% | 4.08% | 9.83% | 30.60% |
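
For context on what these SNR conditions mean, below is a standard formulation of mixing additive white noise at a target SNR; it is not the evaluation's exact noise-mixing code, just a sketch of the idea.

```python
import numpy as np

def add_white_noise(speech: np.ndarray, snr_db: float, rng=np.random.default_rng(0)) -> np.ndarray:
    """Mix white Gaussian noise into `speech` at the requested signal-to-noise ratio (dB)."""
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: a 1 s, 440 Hz tone sampled at 16 kHz, degraded to 0 dB SNR.
t = np.arange(16000) / 16000.0
noisy = add_white_noise(0.1 * np.sin(2 * np.pi * 440 * t), snr_db=0.0)
```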

 ## Model Fairness Evaluation

  (Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)
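
To make the normalization step concrete, here is a simplified sketch. The leaderboard itself uses the Whisper English text normalizer, so the regex-based `normalize` below is only a stand-in for the idea, and the example sentences are invented.

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Simplified stand-in for the leaderboard's Whisper English normalizer:
    # lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

ref = "Well, the answer is forty-two!"
hyp = "well the answer is forty-two"
print(jiwer.wer(ref, hyp))                        # punctuation/case differences count as errors
print(jiwer.wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization
```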

+ ## Inference:
+ **Engine:** NVIDIA NeMo <br>
+ **Test Hardware:** <br>
+ * A6000 <br>
+ * A100 <br>
+ * RTX 5090 <br>
+
 ## Ethical Considerations:
 NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+ For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).