Commit · 770a734
Parent(s): 370f6f5
Revert "Upload folder using huggingface_hub"
This reverts commit 370f6f5b9c9e7921649309b38dab0c563bfd4dd5.
- .gitattributes +1 -0
- README.md +252 -13
- app.py +0 -120
- packages.txt +0 -2
- pre-requirements.txt +0 -1
- requirements.txt +0 -3
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+results_hf/open-asr-leaderboarddatasets-test-only-spgispeech-test.jsonl filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,13 +1,252 @@
# Model Overview

## Description:
NVIDIA NeMo Canary-Qwen-2.5B is an English speech recognition model that achieves state-of-the-art performance on multiple English speech benchmarks. With 2.5 billion parameters and running at 458 RTFx, Canary-Qwen-2.5B supports automatic speech-to-text recognition (ASR) in English with punctuation and capitalization (PnC). The model is intended as a transcription tool only and is not expected to extend the LLM's capabilities to the speech modality. This model is ready for commercial use.

### License/Terms of Use:
Canary-Qwen-2.5B is released under the CC-BY-4.0 license. By using this model, you are agreeing to the [terms and conditions](https://choosealicense.com/licenses/cc-by-4.0/) of the license. <br>

## References:
[1] [Less is More: Accurate Speech Recognition & Translation without Web-Scale Data](https://www.isca-archive.org/interspeech_2024/puvvada24_interspeech.pdf)

[2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10389701)

[3] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[4] [Qwen/Qwen3-1.7B Model Card](https://huggingface.co/Qwen/Qwen3-1.7B)

[5] [Training and Inference Efficiency of Encoder-Decoder Speech Models](https://arxiv.org/abs/2503.05931)

[6] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

[7] [Granary: Speech Recognition and Translation Dataset in 25 European Languages](https://arxiv.org/abs/2505.13404)

[8] [Towards Measuring Fairness in AI: the Casual Conversations Dataset](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9634168)

[9] [SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation](https://arxiv.org/abs/2310.09424)

## Model Architecture:
Canary-Qwen is a Speech-Augmented Language Model (SALM) [9] with a FastConformer encoder [2] and a Transformer decoder [3]. It is built from two base models, `nvidia/canary-1b-flash` [1,5] and `Qwen/Qwen3-1.7B` [4], plus a linear projection and LoRA adapters applied to the LLM. The audio encoder computes an audio representation that is mapped into the LLM embedding space via the linear projection and concatenated with the embeddings of the text tokens. The model is prompted with "Transcribe the following: <audio>", using Qwen's chat template.
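
The composition can be pictured with the minimal sketch below. It is illustrative only: the dimensions and module names are assumptions, not the actual NeMo implementation.

```python
# Illustrative sketch of the SALM composition described above.
# Dimensions and module names are assumptions for demonstration only.
import torch
import torch.nn as nn

B, T_audio, D_enc, D_llm = 2, 500, 1024, 2048  # batch, encoder frames, encoder dim, LLM dim (assumed)

audio_features = torch.randn(B, T_audio, D_enc)   # stand-in for FastConformer encoder output
projection = nn.Linear(D_enc, D_llm)              # linear projection into the LLM embedding space
audio_embeds = projection(audio_features)

prompt_embeds = torch.randn(B, 12, D_llm)         # stand-in for embedded "Transcribe the following:" tokens
# The projected audio frames take the place of the audio locator tag in the prompt;
# the concatenated sequence is then decoded by the LoRA-adapted Qwen LLM.
llm_inputs = torch.cat([prompt_embeds, audio_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 512, 2048])
```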

### Limitations

**Input length.** The maximum audio duration seen in training was 40 s, and the maximum token sequence length was 1024 tokens (including prompt, audio, and response). The model may technically be able to process longer sequences, but its accuracy may degrade.
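
For longer recordings, a practical workaround is to window the audio into chunks of at most 40 s and transcribe each window separately, which is what the Gradio demo app removed in this commit does with lhotse. A condensed sketch (the path is a placeholder, and `model` is assumed to be loaded as shown later under "Loading the Model"):

```python
# Condensed from the demo app removed in this commit: split long audio into <= 40 s
# windows with lhotse, transcribe each window, then join the partial transcripts.
import torch
from lhotse import Recording
from lhotse.dataset import DynamicCutSampler

CHUNK_SECONDS = 40.0  # max audio length seen by the model during training
cut = Recording.from_file("long_audio.wav").resample(16000).to_cut()  # placeholder path
sampler = DynamicCutSampler(cut.cut_into_windows(CHUNK_SECONDS), max_cuts=4)

pred_text = []
for batch in sampler:
    audio, audio_lens = batch.load_audio(collate=True)
    with torch.inference_mode():
        output_ids = model.generate(
            prompts=[[{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}"}]] * len(batch),
            audios=torch.as_tensor(audio),
            audio_lens=torch.as_tensor(audio_lens),
            max_new_tokens=256,
        )
    pred_text.extend(model.tokenizer.ids_to_text(ids) for ids in output_ids.cpu())
print(" ".join(pred_text))
```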

**Exclusively ASR-oriented capabilities.** The model is not expected to carry over any of the underlying LLM's capabilities to the speech modality.

**English-only language support.** The model was trained on English data only. It may spuriously transcribe other languages, since the underlying encoder was pretrained on German, French, and Spanish speech in addition to English, but it is unlikely to be reliable as a multilingual model.

## NVIDIA NeMo
To train, fine-tune or transcribe with Canary-Qwen-2.5B, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).

## How to Use this Model
The model is available for use in the NVIDIA NeMo toolkit [6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python
from nemo.collections.speechlm2.models import SALM

model = SALM.from_pretrained('nvidia/canary-qwen-2.5b')
```
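
For GPU inference, the demo app removed in this commit additionally cast the model to bfloat16 and moved it to the device; a minimal sketch mirroring that setup:

```python
# Optional: run in bfloat16 on a GPU, mirroring the demo app removed in this commit.
import torch
from nemo.collections.speechlm2.models import SALM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SALM.from_pretrained("nvidia/canary-qwen-2.5b").bfloat16().eval().to(device)
```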

## Input:
**Input Type(s):** Audio, text prompt <br>
**Input Format(s):** .wav or .flac files <br>
**Input Parameter(s):** 1D <br>
**Other Properties Related to Input:** 16000 Hz Mono-channel Audio, Pre-Processing Not Needed <br>

Input to Canary-Qwen-2.5B is a batch of prompts that include audio.

```python
answer_ids = model.generate(
    prompts=[
        [{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}", "audio": ["speech.wav"]}]
    ],
    max_new_tokens=128,
)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```

To transcribe a dataset of recordings, specify the input as a JSONL manifest file, where each line in the file is a dictionary containing the following fields:

```yaml
# Example of a line in input_manifest.json
{
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 30.0  # duration of the audio, in seconds
}
```

and then use:
```bash
cd NeMo
python examples/speechlm2/salm_generate.py \
    pretrained_name=nvidia/canary-qwen-2.5b \
    inputs=input_manifest.json \
    output_manifest=generations.jsonl \
    batch_size=128 \
    user_prompt="Transcribe the following:"  # audio locator is added automatically at the end if not present
```
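
The input manifest itself is plain JSON Lines, so it can be generated from a folder of recordings with a small helper like the sketch below (an illustration, not part of NeMo; the directory path and glob pattern are placeholders):

```python
# Sketch: build input_manifest.json from a directory of .wav files.
# Illustrative helper, not part of NeMo; adjust the path and glob pattern as needed.
import json
from pathlib import Path

import soundfile as sf

with open("input_manifest.json", "w") as manifest:
    for path in sorted(Path("recordings/").glob("*.wav")):  # placeholder directory
        info = sf.info(str(path))
        entry = {
            "audio_filepath": str(path.resolve()),
            "duration": info.frames / info.samplerate,  # duration in seconds
        }
        manifest.write(json.dumps(entry) + "\n")
```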

## Output:
**Output Type(s):** Text <br>
**Output Format:** Text transcript as a sequence of token IDs or a string <br>
**Output Parameters:** 1-Dimensional text string <br>
**Other Properties Related to Output:** May Need Inverse Text Normalization <br>

## Software Integration:
**Runtime Engine(s):**
* NeMo - 2.4.0 or higher <br>

**Supported Hardware Microarchitecture Compatibility:** <br>
* [NVIDIA Ampere] <br>
* [NVIDIA Blackwell] <br>
* [NVIDIA Jetson] <br>
* [NVIDIA Hopper] <br>
* [NVIDIA Lovelace] <br>
* [NVIDIA Pascal] <br>
* [NVIDIA Turing] <br>
* [NVIDIA Volta] <br>

**[Preferred/Supported] Operating System(s):** <br>
* [Linux] <br>
* [Linux 4 Tegra] <br>
* [Windows] <br>

## Model Version(s):
Canary-Qwen-2.5B <br>

# Training and Evaluation Datasets:

## Training Dataset:

The Canary-Qwen-2.5B model is trained on a total of 234K hours of publicly available speech data.
The datasets below include conversations, videos from the web, and audiobook recordings.

**Data Collection Method:**
* Human <br>

**Labeling Method:**
* Hybrid: Human, Automated <br>

#### English (234.5k hours)

The majority of the training data comes from the English portion of the Granary dataset [7]:

- YouTube-Commons (YTC) (109.5k hours)
- YODAS2 (77k hours)
- LibriLight (13.6k hours)

In addition, the following datasets were used:
- Librispeech 960 hours
- Fisher Corpus
- Switchboard-1 Dataset
- WSJ-0 and WSJ-1
- National Speech Corpus (Part 1, Part 6)
- VCTK
- VoxPopuli (EN)
- Europarl-ASR (EN)
- Multilingual Librispeech (MLS EN)
- Mozilla Common Voice (v11.0)
- Mozilla Common Voice (v7.0)
- Mozilla Common Voice (v4.0)
- AMI
- FLEURS

AMI was oversampled during training to constitute about 15% of the total data observed.
This skewed the model towards predicting verbatim transcripts that include conversational speech disfluencies such as repetitions.

The training transcripts contained punctuation and capitalization.

## Evaluation Dataset:

**Data Collection Method:** <br>
* Human <br>

**Labeling Method:** <br>
* Human <br>

Automatic Speech Recognition:
* [HuggingFace OpenASR Leaderboard evaluation sets](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)

Hallucination Robustness:
* [MUSAN](https://www.openslr.org/17/) 48 hrs eval set

Noise Robustness:
* [Librispeech](https://www.openslr.org/12)

Model Fairness:
* [Casual Conversations Dataset](https://arxiv.org/pdf/2104.02821)

## Training

Canary-Qwen-2.5B was trained using the NVIDIA NeMo toolkit [6] for a total of 90k steps on 32 NVIDIA A100 80GB GPUs. The LLM parameters were kept frozen; the speech encoder, projection, and LoRA parameters were trainable. The encoder's output frame rate is 80 ms, i.e. 12.5 tokens per second. The model was trained on approximately 1.3B tokens in total (this number includes the speech encoder output frames, text response tokens, prompt tokens, and chat template tokens).
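
To make the frame-rate figure concrete, the arithmetic below (using only the numbers quoted in this card) shows how a maximum-length 40 s input fits into the 1024-token training sequence budget:

```python
# Back-of-the-envelope token budget using the figures quoted above.
frame_rate_s = 0.080                       # encoder output frame rate: 80 ms
tokens_per_second = 1.0 / frame_rate_s     # = 12.5
audio_tokens_40s = 40 * tokens_per_second  # = 500 encoder frames for a 40 s clip
remaining = 1024 - audio_tokens_40s        # budget left for prompt, chat template, and response
print(tokens_per_second, audio_tokens_40s, remaining)  # 12.5 500.0 524.0
```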

The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/salm_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speechlm2/conf/salm.yaml).

The tokenizer was inherited from `Qwen/Qwen3-1.7B`.

## Inference:
**Engine:** NVIDIA NeMo <br>
**Test Hardware:** <br>
* A6000 <br>
* A100 <br>

## Performance

The ASR predictions were generated using greedy decoding.

### ASR Performance (w/o PnC)

ASR performance is measured with word error rate (WER); both the ground-truth and predicted text are processed with [whisper-normalizer](https://pypi.org/project/whisper-normalizer/) version 0.1.12 before scoring.
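
A scoring sketch consistent with this protocol is shown below. It assumes the `whisper-normalizer` package; `jiwer` is an assumption here, used only as a convenient WER implementation, and the transcripts are placeholders.

```python
# Sketch of the scoring protocol: normalize both sides with whisper-normalizer,
# then compute WER. jiwer is used as a convenient WER implementation (an assumption).
from whisper_normalizer.english import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

references = ["Mister Smith paid $5 on July 4th."]              # ground-truth transcripts (placeholder)
hypotheses = ["mister smith paid five dollars on july fourth"]  # model predictions (placeholder)

wer = jiwer.wer(
    [normalizer(r) for r in references],
    [normalizer(h) for h in hypotheses],
)
print(f"WER: {wer:.2%}")
```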

WER on [HuggingFace OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard):

| **Version** | **Model** | **RTFx** | **Mean** | **AMI** | **GigaSpeech** | **LS Clean** | **LS Other** | **Earnings22** | **SPGISpeech** | **Tedlium** | **Voxpopuli** |
|:---------:|:-----------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|:------:|
| 2.4.0 | Canary-Qwen-2.5B | 458.5 | 5.62 | 10.18 | 9.41 | 1.60 | 3.10 | 10.42 | 1.90 | 2.72 | 5.66 |

More details on evaluation can be found at the [HuggingFace ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

### Hallucination Robustness
Number of characters generated per minute on the [MUSAN](https://www.openslr.org/17) 48 hrs eval set (`max_new_tokens=50`, following the `nvidia/canary-1b-flash` evaluation):

| **Version** | **Model** | **# of characters per minute** |
|:-----------:|:---------:|:----------:|
| 2.4.0 | Canary-Qwen-2.5B | 138.1 |
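
The metric itself is straightforward to reproduce: characters emitted divided by minutes of noise-only audio. A sketch with placeholder values:

```python
# Sketch of the hallucination metric: characters generated per minute of noise-only audio.
def chars_per_minute(transcripts, durations_sec):
    total_chars = sum(len(t) for t in transcripts)
    total_minutes = sum(durations_sec) / 60.0
    return total_chars / total_minutes

# Placeholder values for illustration; lower is better on noise-only input.
print(chars_per_minute(["", "uh huh", ""], [30.0, 30.0, 60.0]))  # 3.0
```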

### Noise Robustness
WER on [Librispeech Test Clean](https://www.openslr.org/12) at different SNR (signal-to-noise ratio) levels of additive white noise:

| **Version** | **Model** | **SNR 10** | **SNR 5** | **SNR 0** | **SNR -5** |
|:-----------:|:---------:|:----------:|:----------:|:----------:|:----------:|
| 2.4.0 | Canary-Qwen-2.5B | 2.41% | 4.08% | 9.83% | 30.60% |
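
The noisy test conditions can be approximated by mixing white noise into the clean waveform at a target SNR; a small numpy sketch of the general recipe (an illustration, not the exact evaluation code):

```python
# Sketch: add white Gaussian noise to a waveform at a target SNR (in dB).
import numpy as np

def add_white_noise(waveform: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(waveform.shape)
    signal_power = np.mean(waveform ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(signal_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * noise

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone as a stand-in
noisy = add_white_noise(clean, snr_db=0.0)
```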
228 |
+
|
229 |
+
## Model Fairness Evaluation
|
230 |
+
|
231 |
+
As outlined in the paper "Towards Measuring Fairness in AI: the Casual Conversations Dataset" [8], we assessed the Canary-Qwen-2.5B model for fairness. The model was evaluated on the CasualConversations-v1 dataset with inference done on non-overlapping 40s chunks, and the results are reported as follows:
|
232 |
+
|
233 |
+
### Gender Bias:
|
234 |
+
|
235 |
+
| Gender | Male | Female | N/A | Other |
|
236 |
+
| :--- | :--- | :--- | :--- | :--- |
|
237 |
+
| Num utterances | 18471 | 23378 | 880 | 18 |
|
238 |
+
| % WER | 16.71 | 13.85 | 17.71 | 29.46 |
|
239 |
+
|
240 |
+
### Age Bias:
|
241 |
+
|
242 |
+
| Age Group | (18-30) | (31-45) | (46-85) | (1-100) |
|
243 |
+
| :--- | :--- | :--- | :--- | :--- |
|
244 |
+
| Num utterances | 15058 | 13984 | 12810 | 41852 |
|
245 |
+
| % WER | 15.73 | 15.3 | 14.14 | 15.11 |
|
246 |
+
|
247 |
+
(Error rates for fairness evaluation are determined by normalizing both the reference and predicted text, similar to the methods used in the evaluations found at https://github.com/huggingface/open_asr_leaderboard.)
|
248 |
+
|
249 |
+
## Ethical Considerations:
|
250 |
+
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
|
251 |
+
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
|
252 |
+
|
app.py
DELETED
@@ -1,120 +0,0 @@
import gradio as gr
import json
import os
import soundfile as sf
import tempfile
import uuid
import torch
from lhotse import Recording
from lhotse.dataset import DynamicCutSampler

from nemo.collections.speechlm2 import SALM
from nemo.collections.asr.parts.utils.streaming_utils import FrameBatchMultiTaskAED
from nemo.collections.asr.parts.utils.transcribe_utils import get_buffered_pred_feat_multitaskAED

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

SAMPLE_RATE = 16000  # Hz
MAX_AUDIO_MINUTES = 10  # won't try to transcribe if longer than this
CHUNK_SECONDS = 40.0  # max audio length seen by the model
BATCH_SIZE = 4  # for parallel transcription of audio longer than CHUNK_SECONDS

with device:
    torch.set_default_dtype(torch.bfloat16)  # speed up start-up time
    model = SALM.from_pretrained("nvidia/canary-qwen-2.5b").bfloat16().eval().to(device)
    torch.set_default_dtype(torch.float32)


def as_batches(audio_filepath, utt_id):
    rec = Recording.from_file(audio_filepath, recording_id=utt_id)
    if rec.duration / 60.0 > MAX_AUDIO_MINUTES:
        raise gr.Error(
            f"This demo can transcribe up to {MAX_AUDIO_MINUTES} minutes of audio. "
            "If you wish, you may trim the audio using the Audio viewer in Step 1 "
            "(click on the scissors icon to start trimming audio)."
        )
    cut = rec.resample(SAMPLE_RATE).to_cut()
    if cut.num_channels > 1:
        cut = cut.to_mono(mono_downmix=True)
    return DynamicCutSampler(cut.cut_into_windows(CHUNK_SECONDS), max_cuts=BATCH_SIZE)


def transcribe(audio_filepath):
    if audio_filepath is None:
        raise gr.Error("Please provide some input audio: either upload an audio file or use the microphone")
    utt_id = uuid.uuid4()
    pred_text = []
    for batch in as_batches(audio_filepath, str(utt_id)):
        audio, audio_lens = batch.load_audio(collate=True)
        with torch.inference_mode():
            output_ids = model.generate(
                prompts=[[{"role": "user", "content": f"Transcribe the following: {model.audio_locator_tag}"}]] * len(batch),
                audios=torch.as_tensor(audio).to(device, non_blocking=True),
                audio_lens=torch.as_tensor(audio_lens).to(device, non_blocking=True),
                max_new_tokens=256,
            )
        pred_text.extend(model.tokenizer.ids_to_text(oids) for oids in output_ids.cpu())
    return ' '.join(pred_text)


with gr.Blocks(
    title="NeMo Canary-Qwen-2.5B Model",
    css="""
    textarea { font-size: 18px;}
    #model_output_text_box span {
        font-size: 18px;
        font-weight: bold;
    }
    """,
    theme=gr.themes.Default(text_size=gr.themes.sizes.text_lg)  # make text slightly bigger (default is text_md)
) as demo:

    gr.HTML("<h1 style='text-align: center'>NeMo Canary-Qwen-2.5B model: Transcribe audio</h1>")

    with gr.Row():
        with gr.Column():
            gr.HTML(
                "<p><b>Step 1:</b> Upload an audio file or record with your microphone.</p>"

                "<p style='color: #A0A0A0;'>This demo supports audio files up to 10 mins long. "
                "You can transcribe longer files locally with NeMo. "
                #"<a href='https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py'>script</a>.</p>"
            )

            audio_file = gr.Audio(sources=["microphone", "upload"], type="filepath")

        with gr.Column():

            gr.HTML("<p><b>Step 2:</b> Run the model.</p>")

            go_button = gr.Button(
                value="Run model",
                variant="primary",  # make "primary" so it stands out (default is "secondary")
            )

            model_output_text_box = gr.Textbox(
                label="Model Output",
                elem_id="model_output_text_box",
            )

    with gr.Row():

        gr.HTML(
            "<p style='text-align: center'>"
            "🐤 <a href='https://huggingface.co/nvidia/canary-qwen-2.5b' target='_blank'>Canary model</a> | "
            "🧑‍💻 <a href='https://github.com/NVIDIA/NeMo' target='_blank'>NeMo Repository</a>"
            "</p>"
        )

    go_button.click(
        fn=transcribe,
        inputs=[audio_file],
        outputs=[model_output_text_box]
    )


demo.queue()
demo.launch()
packages.txt
DELETED
@@ -1,2 +0,0 @@
ffmpeg
libsndfile1
pre-requirements.txt
DELETED
@@ -1 +0,0 @@
Cython
requirements.txt
DELETED
@@ -1,3 +0,0 @@
nemo_toolkit[asr] @ git+https://github.com/NVIDIA/NeMo.git
sacrebleu
seaborn