---
license: cc-by-4.0
language:
- en
library_name: nemo
datasets:
- Granary
- YTC
- Yodas2
- LibriLight
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- vctk
- voxpopuli
- europarl
- multilingual_librispeech
- fleurs
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
- Qwen
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: canary-qwen-2.5b
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.45
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.43
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.61
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.1
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.71
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 5.66
metrics:
- wer
base_model:
- nvidia/canary-1b-flash
- Qwen/Qwen3-1.7B
---

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-SALM-blue#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-2.5B-green#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-en-orange#model-badge)](#datasets)


# Model Overview

## Description:
NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. With 2.5 billion parameters and an inference speed of 418 RTFx, Canary-Qwen-2.5B supports automatic speech-to-text recognition (ASR) in English with punctuation and capitalization (PnC). The model works in two modes: as a transcription tool (ASR mode) and as an LLM (LLM mode). In ASR mode, the model only transcribes speech into text and does not retain LLM-specific skills such as reasoning. In LLM mode, the model retains all of the original LLM capabilities, which can be used to post-process the transcript, e.g. summarize it or answer questions about it. In LLM mode, the model no longer "understands" the raw audio, only its transcript. This model is ready for commercial use.

### License/Terms of Use: 
Canary-Qwen-2.5B is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>

## References:
[1] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)

[2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [Qwen/Qwen3-1.7B Model Card](https://huggingface.co/Qwen/Qwen3-1.7B)

[5] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)

[6] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[7] [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/abs/2505.13404)

[8] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)

[9] [SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation](https://arxiv.org/abs/2310.09424) 

### Deployment Geography:

Global

### Use Case:

The model is intended for users requiring speech-to-text transcription capabilities for English speech, and/or transcript post-processing capabilities enabled by prompting the underlying LLMs. Typical use-cases: transcription, summarization, answering user questions about the transcript.

### Release Date:

Hugging Face 07/17/2025 via https://huggingface.co/nvidia/canary-qwen-2.5b

## Model Architecture:
Canary-Qwen-2.5B is a Speech-Augmented Language Model (SALM) [9] with a FastConformer [2] encoder and a Transformer decoder [3]. It is built from two base models, `nvidia/canary-1b-flash` [1,5] and `Qwen/Qwen3-1.7B` [4], plus a linear projection and low-rank adaptation (LoRA) applied to the LLM. The audio encoder computes an audio representation that is mapped into the LLM embedding space via the linear projection and concatenated with the embeddings of the text tokens. The model is prompted with "Transcribe the following: <audio>", using Qwen's chat template.
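
Conceptually, the forward pass can be sketched as follows. This is illustrative pseudocode only; the module names, dimensions, and the simple concatenation are assumptions for clarity, not the actual NeMo implementation.

```python
# Minimal, illustrative sketch of the SALM input construction (not the actual NeMo code).
import torch
import torch.nn as nn

class SalmSketch(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=2048, vocab_size=151_936):
        super().__init__()
        self.audio_encoder = nn.Identity()                    # stands in for the FastConformer encoder
        self.projection = nn.Linear(enc_dim, llm_dim)         # maps audio features into the LLM embedding space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)   # stands in for the Qwen token embeddings
        # The LLM itself (with LoRA adapters) would consume the concatenated embeddings.

    def build_inputs(self, audio_features, prompt_ids):
        # audio_features: (batch, frames, enc_dim) from the speech encoder (one frame per 80 ms)
        # prompt_ids: (batch, prompt_len) token IDs of the "Transcribe the following:" prompt in chat format
        audio_embeds = self.projection(self.audio_encoder(audio_features))
        text_embeds = self.text_embed(prompt_ids)
        # In the real model the audio embeddings replace the audio placeholder inside the prompt;
        # shown here as a plain concatenation for simplicity.
        return torch.cat([text_embeds, audio_embeds], dim=1)

m = SalmSketch()
inputs = m.build_inputs(torch.randn(1, 500, 1024), torch.randint(0, 1000, (1, 16)))
print(inputs.shape)  # torch.Size([1, 516, 2048])
```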

### Limitations

**Input length.** The maximum audio duration in training was 40s, and the maximum token sequence length was 1024 tokens (including prompt, audio, and response). The model may technically be able to process longer sequences, but its accuracy may be degraded.

**Exclusively ASR-oriented capabilities.** The model is not expected to preserve any of the underlying LLM's capabilities in the speech modality.

**English-only language support.** The model was trained on English data only. It may spuriously transcribe other languages, since the underlying encoder was pretrained on German, French, and Spanish speech in addition to English, but it is unlikely to be reliable as a multilingual model.

## NVIDIA NeMo

To train, fine-tune or transcribe with Canary-Qwen-2.5B, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).

```bash
# Currently requires installing the latest trunk version of NeMo, and PyTorch 2.6+ for FSDP2 support.
python -m pip install "nemo_toolkit[asr,tts] @ git+https://github.com/NVIDIA/NeMo.git"
```

## How to Use this Model

The model is available for use in the NVIDIA NeMo toolkit [6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
from nemo.collections.speechlm2.models import SALM

model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
```
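
The returned object behaves like a standard PyTorch module; if a GPU is available you would typically move the model there and switch to eval mode before generation (a usage assumption, not a documented requirement of this checkpoint):

```python
import torch

# Continues from the loading snippet above.
if torch.cuda.is_available():
    model = model.to("cuda")
model = model.eval()
```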

## Input: 

**Input Type(s):** Audio, text prompt <br>
**Input Format(s):** Audio: .wav or .flac files. Text prompt string for ASR mode: `Transcribe the following: <|audioplaceholder|>` <br>
**Input Parameter(s):** Audio: Two-Dimensional (batch, audio-samples); Text: One-Dimensional (string) <br>
**Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>

Input to Canary-Qwen-2.5B is a batch of prompts that include audio.

Example usage in ASR mode (speech-to-text):

```python
answer_ids = model.generate(
    prompts=[
        [{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}", "audio": ["speech.wav"]}]
    ],
    max_new_tokens=128,
)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```

Example usage in LLM mode (text-only):

```python
prompt = "..."
transcript = "..."
with model.llm.disable_adapter():
    answer_ids = model.generate(
        prompts=[[{"role": "user", "content": f"{prompt}\n\n{transcript}"}]],
        max_new_tokens=2048,
    )
```
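
The two modes can be chained, e.g. to transcribe a recording and then summarize the transcript with the underlying LLM. The sketch below reuses only the APIs shown above; the file name `meeting.wav` and the summarization prompt are placeholders.

```python
# Step 1: ASR mode - transcribe the audio.
asr_ids = model.generate(
    prompts=[
        [{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}", "audio": ["meeting.wav"]}]
    ],
    max_new_tokens=512,
)
transcript = model.tokenizer.ids_to_text(asr_ids[0].cpu())

# Step 2: LLM mode - post-process the transcript with the base Qwen LLM (LoRA disabled).
with model.llm.disable_adapter():
    summary_ids = model.generate(
        prompts=[[{"role": "user", "content": f"Summarize the following transcript:\n\n{transcript}"}]],
        max_new_tokens=256,
    )
print(model.tokenizer.ids_to_text(summary_ids[0].cpu()))
```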

To transcribe a dataset of recordings, specify the input as a JSONL manifest file, where each line is a dictionary containing the following fields:

```yaml
# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 30.0,  # duration of the audio
}
```

and then use:
```bash
cd NeMo
python examples/speechlm2/salm_generate.py \
  pretrained_name=nvidia/canary-qwen-2.5b \
  inputs=input_manifest.json \
  output_manifest=generations.jsonl \
  batch_size=128 \
  user_prompt="Transcribe the following:"  # audio locator is added automatically at the end if not present
```
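
If you need to generate the manifest programmatically, a short script along the following lines works (a sketch; it assumes the `soundfile` package is available for reading audio durations, and the directory path is a placeholder):

```python
import json
from pathlib import Path

import soundfile as sf  # assumption: used only to read audio durations

audio_dir = Path("/path/to/audio")  # placeholder directory of .wav files
with open("input_manifest.json", "w") as f:
    for wav in sorted(audio_dir.glob("*.wav")):
        entry = {"audio_filepath": str(wav), "duration": sf.info(str(wav)).duration}
        f.write(json.dumps(entry) + "\n")
```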

## Output:
**Output Type(s):** Text <br>
**Output Format:** Text transcript as a sequence of token IDs or a string <br> 
**Output Parameters:** One-Dimensional text string <br>
**Other Properties Related to Output:** May Need Inverse Text Normalization <br>

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration:
**Runtime Engine(s):** 
* NeMo - 2.5.0 or higher <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* [NVIDIA Ampere] <br>
* [NVIDIA Blackwell] <br>
* [NVIDIA Jetson]  <br>
* [NVIDIA Hopper] <br>
* [NVIDIA Lovelace] <br>
* [NVIDIA Pascal] <br>
* [NVIDIA Turing] <br>
* [NVIDIA Volta] <br>

**[Preferred/Supported] Operating System(s):** <br>
* [Linux] <br>
* [Linux 4 Tegra] <br>
* [Windows] <br>

## Model Version(s): 
Canary-Qwen-2.5B <br>

## Training

Canary-Qwen-2.5B was trained using the NVIDIA NeMo toolkit [6] for a total of 90k steps on 32 NVIDIA A100 80GB GPUs. LLM parameters were kept frozen. Speech encoder, projection, and LoRA parameters were trainable. The encoder's output frame rate is 80ms, or 12.5 tokens per second. The model was trained on approximately 1.3B tokens in total (this number includes the speech encoder output frames, text response tokens, prompt tokens, and chat template tokens).
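
At this frame rate, the 40 s maximum training duration corresponds to about 500 encoder output frames, which together with the prompt and response must fit into the 1024-token sequence length mentioned under Limitations. A quick back-of-the-envelope check (numbers taken from this card):

```python
# Token budget for a single maximal-length training example.
frame_rate_hz = 12.5      # encoder output frames per second (one frame per 80 ms)
max_audio_s = 40.0        # maximum audio duration seen in training
max_seq_tokens = 1024     # prompt + audio frames + response

audio_frames = int(max_audio_s * frame_rate_hz)      # 500
remaining_for_text = max_seq_tokens - audio_frames   # 524 tokens left for prompt + response
print(audio_frames, remaining_for_text)
```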

The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/salm_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/conf/salm.yaml).

The tokenizer was inherited from `Qwen/Qwen3-1.7B`.

# Training and Evaluation Datasets: 

## Training Dataset:

* The total size (in number of data points): approx. 40 million (speech, text) pairs
* Total number of datasets: 26, with 18 for training and 8 for test
* Dataset partition: Training 99.6%, testing 0.04%, validation 0%
* Time period for training data collection: 1990-2025
* Time period for testing data collection: 2005-2022
* Time period for validation data collection: N/A (unused)

The Canary-Qwen-2.5B model was trained on a total of 234K hours of publicly available speech data.
The datasets below include conversations, web videos, and audiobook recordings.

**Data Collection Method:**
* Human <br>

**Labeling Method:**
* Hybrid: Human, Automated <br>

### Properties

#### English (234.5k hours)

The majority of the training data comes from the English portion of the Granary dataset [7]:

- YouTube-Commons (YTC) (109.5k hours)
- YODAS2 (77k hours)
- LibriLight (13.6k hours)

In addition, the following datasets were used:
- Librispeech 960 hours
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN)
- Mozilla Common Voice (v11.0)
- Mozilla Common Voice (v7.0)
- Mozilla Common Voice (v4.0)
- AMI
- FLEURS

AMI was oversampled during model training to constitute about 15% of the total data observed. 
This skewed the model towards predicting verbatim transcripts that include conversational speech disfluencies such as repetitions.

The training transcripts contained punctuation and capitalization.

## Evaluation Dataset:

**Data Collection Method:** <br>
* Human <br>

**Labeling Method:** <br>
* Human <br>

Automatic Speech Recognition: 
* [HuggingFace OpenASR Leaderboard evaluation sets](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

Hallucination Robustness:
* [MUSAN](https://www.openslr.org/17/) 48 hrs eval set

Noise Robustness:
* [Librispeech](https://www.openslr.org/12)

Model Fairness:
* [Casual Conversations Dataset](https://arxiv.org/pdf/2104.02821)

## Performance

The ASR predictions were generated using greedy decoding.

### ASR Performance (w/o PnC) 

The ASR performance is measured with word error rate (WER), and we process the ground-truth and predicted text with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/) version 0.1.12.
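
The normalization and WER computation can be reproduced approximately as follows (a sketch assuming the `whisper-normalizer` and `jiwer` packages; the exact leaderboard pipeline may differ):

```python
from whisper_normalizer.english import EnglishTextNormalizer  # assumption: English normalizer from whisper-normalizer
import jiwer

normalizer = EnglishTextNormalizer()

refs = ["Mister Smith arrived at ten a.m."]   # ground-truth transcripts (illustrative)
hyps = ["mr smith arrived at 10 am"]          # model predictions (illustrative)

# Normalize both sides before scoring, then compute the word error rate.
wer = jiwer.wer([normalizer(r) for r in refs], [normalizer(h) for h in hyps])
print(f"WER: {wer:.2%}")
```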

WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard):

| **Version** | **Model**     | **RTFx**   | **Mean**   | **AMI**   | **GigaSpeech**   | **LS Clean**   | **LS Other**   | **Earnings22**   | **SPGISpeech**   | **Tedlium**   | **Voxpopuli**   |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| 2.5.0  | Canary-Qwen-2.5B | 418 | 5.63 | 10.18 | 9.41 | 1.60 | 3.10 | 10.42 | 1.90 | 2.72 | 5.66 |

More details on the evaluation can be found on the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

### Hallucination Robustness
Number of characters per minute on the [MUSAN](https://www.openslr.org/17) 48-hour eval set (`max_new_tokens=50`, following the `nvidia/canary-1b-flash` evaluation):

| **Version** | **Model** | **# of characters per minute** |
|:-----------:|:---------:|:----------:|
| 2.5.0       | Canary-Qwen-2.5B |   138.1   |

### Noise Robustness
WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise:

| **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
| 2.5.0       | Canary-Qwen-2.5B |    2.41%   |   4.08%   |   9.83%   |    30.60%  |
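
The additive white-noise condition can be reproduced approximately with a few lines of NumPy (an illustrative sketch, not the exact evaluation code):

```python
import numpy as np

def add_white_noise(speech: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Mix white Gaussian noise into a mono waveform at the requested SNR (in dB)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: corrupt one second of a dummy 16 kHz waveform at 0 dB SNR.
clean = np.random.default_rng(1).standard_normal(16000)
noisy = add_white_noise(clean, snr_db=0.0)
```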

## Model Fairness Evaluation

As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [8], we assessed the Canary-Qwen-2.5B model for fairness. The model was evaluated on the CasualConversations-v1 dataset with inference done on non-overlapping 40s chunks, and the results are reported as follows:

### Gender Bias:

| Gender | Male | Female | N/A | Other |
| :--- | :--- | :--- | :--- | :--- |
| Num utterances | 18471 | 23378 | 880 | 18 |
| % WER | 16.71 | 13.85 | 17.71 | 29.46 |

### Age Bias:

| Age Group | (18-30) | (31-45) | (46-85) | (1-100) |
| :--- | :--- | :--- | :--- | :--- |
| Num utterances | 15058 | 13984 | 12810 | 41852 |
| % WER | 15.73 | 15.3 | 14.14 | 15.11 |

(Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)

## Inference:
**Engine:** NVIDIA NeMo <br>
**Test Hardware :** <br>
* A6000 <br>
* A100 <br>
* RTX 5090 <br>

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.  
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).