Seeking a Clear Guide for Fine-Tuning NVIDIA NeMo Models on New English Audio Domains

#18
by jacktol

I'm struggling to find a clear, beginner-friendly example of fine-tuning an NVIDIA NeMo model, such as parakeet-tdt-0.6b-v2 or canary-1b-flash, on a new English audio domain, like a medical or legal ASR dataset. Most resources focus on transfer learning for new spoken languages, but I'm looking for guidance on continued pre-training with English data in a specific domain.

Ideally, I need an up-to-date, simple, robust, and extensible tutorial (e.g., one that supports additions like data augmentation) for fine-tuning these pre-trained models. I've found little in the way of documentation, forum posts, or YouTube tutorials addressing this specific use case.

Could you share links to relevant docs, notebooks, blogs, tutorials, or repositories that demonstrate this process clearly?

Thanks!

A recent tutorial and related docs:
- https://developer.nvidia.com/blog/developing-robust-georgian-automatic-speech-recognition-with-fastconformer-hybrid-transducer-ctc-bpe/
- https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/configs.html#fine-tuning-configurations
- https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/examples/kinyarwanda_asr.html

All these links are from existing sources, and while I'm not the original poster, I have the same use case. I want to fine-tune NeMo parakeet-tdt-0.6b-v2 on my own English voice data so it better suits the specific domain of my daily ASR workflow. Most of the fine-tuning documentation relies heavily on command-line usage and requires manually editing the YAML config to work with Hugging Face datasets, which I already have prepared from Whisper v3 fine-tuning. A Python script that trains and can resume from a checkpoint would be incredibly helpful; I've sketched roughly what I mean below. The Colab notebook focuses on Japanese data and goes too deep into data preprocessing, leaving me unsure which parts to omit or modify for my use case.
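
For concreteness, here is a minimal, untested sketch of what I'm after. It assumes NeMo's Python API (`ASRModel.from_pretrained`, `setup_training_data`) plus PyTorch Lightning; the dataset name, text column, manifest paths, and hyperparameters are all placeholders for illustration:

```python
import json

import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from datasets import load_dataset
from omegaconf import OmegaConf

# 1) Write a Hugging Face audio dataset out as the JSON-lines manifest
#    that NeMo's dataloaders expect (audio_filepath / duration / text).
def write_manifest(hf_split, manifest_path):
    with open(manifest_path, "w") as f:
        for example in hf_split:
            audio = example["audio"]
            f.write(json.dumps({
                "audio_filepath": audio["path"],
                "duration": len(audio["array"]) / audio["sampling_rate"],
                "text": example["text"],  # placeholder column name
            }) + "\n")

ds = load_dataset("my-org/my-domain-asr")  # hypothetical dataset
write_manifest(ds["train"], "train_manifest.json")
write_manifest(ds["validation"], "val_manifest.json")

# 2) Load the pretrained checkpoint and point it at the new manifests.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,  # placeholder; tune for your GPU
    "shuffle": True,
}))
model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": False,
}))

# 3) Fine-tune with Lightning; checkpoints land in the default log dir.
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=10)
model.set_trainer(trainer)
trainer.fit(model)

# To resume a previous run instead, pass the saved checkpoint:
# trainer.fit(model, ckpt_path="path/to/last.ckpt")
```

If I understand correctly, resuming is just Lightning's standard `ckpt_path` argument to `trainer.fit`, so no NeMo-specific machinery should be needed; what I can't find is an official end-to-end example like this for English domain adaptation.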
