Meta ASR English

This model is a fine-tuned version of an ASR-CTC model, enhanced with entity tagging, speaker attributes, and multi-language support for European languages.

Model Details

Fine-tuned on: A mix of CommonVoice (6 European languages), People's Speech, Indian-accented English, and LibriSpeech
Languages: English, Spanish, French, Italian, German, Portuguese
Additional Features: Entity tagging, speaker attributes (age, gender, emotion), and intent detection

Output Format

The model provides rich transcriptions including:

Entity tags (PERSON_NAME, ORGANIZATION, etc.)
Speaker attributes (AGE, GENDER, EMOTION)
Intent classification
Language-specific transcription

Example output:

ENTITY_PERSON_NAME Robert Hoke END was educated at the ENTITY_ORGANIZATION Pleasant Retreat Academy END. AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM
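
Because the tags are emitted inline with the text, downstream code usually has to separate them from the plain transcript. The following is a minimal sketch of one way to do that. It assumes the tag grammar can be inferred from the single example above: entities are wrapped as `ENTITY_<TYPE> ... END`, and speaker or intent tags use the prefixes `AGE_`, `GER_`, `EMOTION_`, and `INTENT_`. The function name `parse_tagged_transcript` and the returned dictionary layout are hypothetical and not part of the model or of NeMo.

```python
import re

# Assumed tag grammar, inferred only from the example output in this card.
ENTITY_RE = re.compile(r"ENTITY_([A-Z_]+)\s+(.*?)\s+END\b")
ATTR_RE = re.compile(r"\b(AGE|GER|EMOTION|INTENT)_([A-Z0-9_]+)\b")


def parse_tagged_transcript(text):
    """Split a tagged transcript into plain text, entities, and attributes."""
    # Collect each entity span together with its type.
    entities = [
        {"type": m.group(1), "text": m.group(2)}
        for m in ENTITY_RE.finditer(text)
    ]
    # Collect trailing speaker/intent tags keyed by their prefix.
    attributes = {m.group(1): m.group(2) for m in ATTR_RE.finditer(text)}
    # Strip the markup: keep the entity text, drop the attribute tags.
    plain = ENTITY_RE.sub(lambda m: m.group(2), text)
    plain = ATTR_RE.sub("", plain)
    plain = re.sub(r"\s+", " ", plain).strip()
    return {"text": plain, "entities": entities, "attributes": attributes}


example = (
    "ENTITY_PERSON_NAME Robert Hoke END was educated at the "
    "ENTITY_ORGANIZATION Pleasant Retreat Academy END. "
    "AGE_45_60 GER_MALE EMOTION_NEUTRAL INTENT_INFORM"
)
result = parse_tagged_transcript(example)
print(result["text"])
print(result["entities"])
print(result["attributes"])
```

If the model emits tag shapes that differ from the example (for instance nested entities or other attribute prefixes), the two regular expressions would need to be adjusted accordingly.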

Usage

import nemo.collections.asr as nemo_asr

# Load model
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('WhissleAI/meta_stt_euro_v1')

# Transcribe audio
transcription = asr_model.transcribe(['path/to/audio.wav'])
print(transcription[0])

Training Data

The model was fine-tuned on:

CommonVoice dataset (6 European languages)
People's Speech English corpus
Indian-accented English
LibriSpeech corpus (en, es, fr, it, pt)

Model Architecture

Based on the FastConformer [1] architecture, which uses 8x depthwise-separable convolutional downsampling, and trained with CTC loss.
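
To give a feel for what 8x downsampling means in practice, the sketch below estimates how many encoder frames the CTC head sees for a given clip. It assumes a 10 ms feature hop, which is a common front-end setting but is not stated in this card. The function name `approx_encoder_frames` is hypothetical, and the result is an approximation that ignores padding and edge effects.

```python
import math


def approx_encoder_frames(duration_ms, hop_ms=10, downsampling=8):
    """Rough count of encoder output frames for a clip of duration_ms.

    hop_ms=10 is an assumed feature hop; downsampling=8 comes from the card.
    """
    step_ms = hop_ms * downsampling  # time covered by one encoder frame
    return math.ceil(duration_ms / step_ms)


step_ms = 10 * 8
frames_10s = approx_encoder_frames(10_000)
print(step_ms, frames_10s)
```

Under these assumptions each encoder frame covers about 80 ms of audio, so a 10-second clip yields roughly 125 frames for the CTC decoder.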

License

This model is licensed under the CC-BY-4.0 license.

References

[1] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
[2] NVIDIA NeMo Toolkit
