
Meta ASR English

This model is a fine-tuned version of an ASR-CTC model, enhanced with entity tagging, speaker attributes, and multi-language support for European languages.

Model Details

  • Fine-tuned on: Mix of CommonVoice (6 European languages), People's Speech, Indian accented English, and LibriSpeech
  • Languages: English, Spanish, French, Italian, German, Portuguese
  • Additional Features: Entity tagging, speaker attributes (age, gender, emotion), and intent detection

Output Format

The model provides rich transcriptions including:

  • Entity tags (PERSON_NAME, ORGANIZATION, etc.)
  • Speaker attributes (AGE, GENDER, EMOTION)
  • Intent classification
  • Language-specific transcription

Example output:

ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
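
The tagged output above can be post-processed into a plain transcript plus structured fields. The following is a minimal sketch that assumes the tag layout matches the example exactly (ENTITY_<TYPE> ... END spans, followed by AGE_, GER_, EMOTION_, and INTENT_ tokens); the function name parse_tagged_transcript and the regular expressions are illustrative and are not part of the model or of NeMo.

```python
import re

# Assumed layout, inferred from the example output only:
#   ENTITY_<TYPE> <words> END   marks an entity span
#   AGE_*, GER_*, EMOTION_*, INTENT_*   are utterance-level tags
ENTITY_RE = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
ATTR_RE = re.compile(r"\b(AGE|GER|EMOTION|INTENT)_([A-Z0-9_]+)\b")


def parse_tagged_transcript(raw):
    """Split a tagged transcript into plain text, entities, and attributes."""
    # Collect (entity_type, entity_text) pairs in order of appearance.
    entities = [(m.group(1), m.group(2)) for m in ENTITY_RE.finditer(raw)]
    # Collect utterance-level tags, keyed by their prefix.
    attributes = {m.group(1): m.group(2) for m in ATTR_RE.finditer(raw)}
    # Drop the entity markers but keep the entity words.
    text = ENTITY_RE.sub(lambda m: m.group(2), raw)
    # Drop the utterance-level tags and normalize whitespace.
    text = " ".join(ATTR_RE.sub("", text).split())
    return {"text": text, "entities": entities, "attributes": attributes}


example = (
    "ENTITY_PERSON_NAME Robert Hoke END was educated at the "
    "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
    "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
parsed = parse_tagged_transcript(example)
print(parsed["text"])
print(parsed["entities"])
print(parsed["attributes"])
```

If the model emits tags in a different layout for other languages or inputs, the two patterns would need to be adjusted accordingly.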

Usage

import nemo.collections.asr as nemo_asr

# Load model
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')

# Transcribe audio
transcription = asr_model.transcribe(['path/to/audio.wav'])
print(transcription[0])
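
Depending on the installed NeMo version, the items returned by transcribe may be plain strings or hypothesis objects that carry the transcript in a text attribute. The helper below is a small sketch for handling both cases; the name to_text is hypothetical, and the return-type behavior is an assumption that should be checked against the NeMo release in use.

```python
def to_text(result):
    """Return the transcript string from one transcribe() result item.

    Assumption: the item is either a str or an object with a `.text`
    attribute (as hypothesis objects in some NeMo versions are).
    """
    if isinstance(result, str):
        return result
    # Fall back to str() if the object has no `.text` attribute.
    return getattr(result, "text", str(result))


# Intended use with the snippet above (not executed here, needs NeMo):
#   transcription = asr_model.transcribe(['path/to/audio.wav'])
#   print(to_text(transcription[0]))
```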

Training Data

The model was fine-tuned on:

  • CommonVoice dataset (6 European languages)
  • People's Speech English corpus
  • Indian accented English
  • LibriSpeech corpus (en, es, fr, it, pt)

Model Architecture

Based on the FastConformer [1] architecture with 8x depthwise-separable convolutional downsampling, trained with CTC loss.
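
To make the effect of 8x downsampling concrete, the sketch below estimates how many encoder frames (and therefore CTC output steps) a clip produces. It assumes a 10 ms feature hop, which is a common default for NeMo ASR preprocessors but is not stated in this model card; the exact count also depends on padding, so treat the result as an approximation. The function names are illustrative.

```python
import math


def frame_duration_ms(hop_ms=10.0, downsampling=8):
    """Audio time covered by one encoder frame, in milliseconds."""
    return hop_ms * downsampling


def encoder_frames(duration_s, hop_ms=10.0, downsampling=8):
    """Approximate number of encoder output frames for a clip.

    Assumption: features are computed every `hop_ms` milliseconds
    (10 ms here), and the encoder reduces the frame rate by
    `downsampling` (8x, per the architecture description).
    """
    feature_frames = int(duration_s * 1000 / hop_ms)
    return math.ceil(feature_frames / downsampling)


print(frame_duration_ms())   # 80.0 ms per encoder frame
print(encoder_frames(10.0))  # 125 frames for a 10-second clip
```

Under these assumptions, each CTC output step covers about 80 ms of audio, which is why the encoder sequence is much shorter than the input feature sequence.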

License

This model is licensed under the CC-BY-4.0 license.

References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] NVIDIA NeMo Toolkit
