---
license: cc-by-4.0
language:
- en
library_name: nemo
datasets:
- Granary
- YTC
- Yodas2
- LibriLight
- librispeech_asr
- fisher_corpus
- Switchboard-1
- WSJ-0
- WSJ-1
- National-Singapore-Corpus-Part-1
- National-Singapore-Corpus-Part-6
- vctk
- voxpopuli
- europarl
- multilingual_librispeech
- fleurs
- mozilla-foundation/common_voice_8_0
- MLCommons/peoples_speech
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Transformer
- FastConformer
- Conformer
- pytorch
- NeMo
- Qwen
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: canary-qwen-2.5b
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.19
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 10.45
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.43
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.61
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.1
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.71
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 5.66
metrics:
- wer
base_model:
- nvidia/canary-1b-flash
- Qwen/Qwen3-1.7B
---

[![Model architecture](https://img.shields.io/badge/Model_Arch-SALM-blue#model-badge)](#model-architecture) | [![Model size](https://img.shields.io/badge/Params-2.5B-green#model-badge)](#model-architecture) | [![Language](https://img.shields.io/badge/Language-en-orange#model-badge)](#datasets)

# Model Overview

## Description:

NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. With 2.5 billion parameters and running at 418 RTFx, Canary-Qwen-2.5B supports automatic speech-to-text recognition (ASR) in English with punctuation and capitalization (PnC). The model works in two modes: as a transcription tool (ASR mode) and as an LLM (LLM mode). In ASR mode, the model is only capable of transcribing speech into text and does not retain any LLM-specific skills such as reasoning.
In LLM mode, the model retains all of the original LLM capabilities, which can be used to post-process the transcript, e.g., to summarize it or answer questions about it. In LLM mode, the model no longer "understands" the raw audio; it only has access to the transcript.

This model is ready for commercial use.

### License/Terms of Use:

Canary-Qwen-2.5B is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license.
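For illustration, below is a minimal usage sketch of the two modes. It assumes NeMo's `speechlm2` SALM interface (the `SALM.from_pretrained` entry point, the `audio_locator_tag` attribute, and the `generate` signature come from the NeMo toolkit [6]; verify them against your installed NeMo version, and note that `speech.wav` is a hypothetical input file):

```python
from nemo.collections.speechlm2.models import SALM

model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')

# ASR mode: the prompt carries the audio locator tag and an audio file is attached.
answer_ids = model.generate(
    prompts=[[{
        "role": "user",
        "content": f"Transcribe the following: {model.audio_locator_tag}",
        "audio": ["speech.wav"],  # hypothetical example file
    }]],
    max_new_tokens=128,
)
transcript = model.tokenizer.ids_to_text(answer_ids[0].cpu())

# LLM mode: text-only prompting over the transcript; bypassing the LoRA
# adapter restores the original Qwen LLM behavior.
with model.llm.disable_adapter():
    answer_ids = model.generate(
        prompts=[[{
            "role": "user",
            "content": f"Summarize the following transcript:\n\n{transcript}",
        }]],
        max_new_tokens=512,
    )
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```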
## References:

[1] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)

[2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [Qwen/Qwen3-1.7B Model Card](https://huggingface.co/Qwen/Qwen3-1.7B)

[5] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)

[6] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[7] [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/abs/2505.13404)

[8] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)

[9] [SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation](https://arxiv.org/abs/2310.09424)

### Deployment Geography:

Global

### Use Case:

The model is intended for users requiring speech-to-text transcription capabilities for English speech, and/or transcript post-processing capabilities enabled by prompting the underlying LLM. Typical use cases: transcription, summarization, answering user questions about the transcript.

### Release Date:

Hugging Face 07/17/2025 via https://huggingface.co/nvidia/canary-qwen-2.5b

## Model Architecture:

Canary-Qwen is a Speech-Augmented Language Model (SALM) [9] with a FastConformer encoder [2] and a Transformer decoder [3]. It is built from two base models, `nvidia/canary-1b-flash` [1,5] and `Qwen/Qwen3-1.7B` [4], together with a linear projection and a low-rank adaptation (LoRA) applied to the LLM. The audio encoder computes audio representations that are mapped to the LLM embedding space via the linear projection and concatenated with the embeddings of the text tokens. The model is prompted with "Transcribe the following: " followed by an audio placeholder token, which is replaced with the projected audio representations before being passed to the LLM.
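To make the input construction concrete, here is a conceptual sketch of how the projected audio representations replace the placeholder token inside the text-embedding sequence. This is not the NeMo implementation; all function and variable names are illustrative:

```python
import torch

def build_salm_inputs(token_ids, audio_features, placeholder_id,
                      embed_tokens, audio_encoder, projection):
    """Conceptual sketch of SALM input construction; names are illustrative.

    token_ids:      (T,) prompt token ids containing exactly one audio placeholder.
    audio_features: input features for the FastConformer encoder.
    """
    # Encode the audio and map it into the LLM embedding space.
    audio_embeds = projection(audio_encoder(audio_features))  # (A, d_llm)
    text_embeds = embed_tokens(token_ids)                     # (T, d_llm)
    # Splice the audio frames in place of the placeholder token.
    pos = (token_ids == placeholder_id).nonzero(as_tuple=True)[0].item()
    return torch.cat(
        [text_embeds[:pos], audio_embeds, text_embeds[pos + 1:]], dim=0
    )  # (T - 1 + A, d_llm), fed to the LLM as inputs_embeds
```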