Voice AI - Automatic Speech Recognition

Introduction

Welcome to the Voice AI project, where our focus is on implementing a Speech-to-Text system for the Marathi language using the cutting-edge Hugging Face Whisper ASR models. This project is designed to deliver a robust Marathi speech-to-text transcription model with the ultimate goal of accurately transcribing spoken Marathi content.

Table of Contents

  1. Key Technologies and Skills
  2. Installation
  3. Usage
  4. Features
  5. Contributing
  6. License
  7. Contact

Key Technologies and Skills

  • Python
  • PyTorch
  • Hugging Face Transformers
  • pandas

Installation

To run this project, you need to install the following packages:

pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets transformers soundfile librosa evaluate tensorboard
pip install pandas
pip install torch
pip install python-Levenshtein
pip install jiwer

Usage

from transformers import pipeline

def get_transcription(filename: str):

    if not isinstance(filename, str):
        raise TypeError("Argument 'filename' must be a string.")
    
    transcription = pipeline("automatic-speech-recognition", model="gopiashokan/whisper-small-mr")(filename)

    return transcription

print(get_transcription('audio path location'))

Features

Data Collection

This project relies on the Common Voice Marathi dataset from Mozilla Foundation for training and evaluating the Marathi Speech Recognition model. The dataset includes essential attributes such as a path column for easy access to audio files, an array column for numerical audio representations, sampling rate information, and a sentence column providing transcriptions for training the model on spoken Marathi sentences.

Data Preprocessing

  • Down-sampling Rate: The original audio, recorded at 48000Hz, is down-sampled to 16000Hz. This step is essential for standardizing the audio data and reducing computational complexity.

  • Feature Extraction: The Whisper models' feature extractor is employed to extract relevant features from the audio signals. These features play a crucial role in training the speech recognition model.

  • Tokenization: It involves breaking down the audio data into smaller units (tokens). This process is fundamental for converting the continuous audio stream into a format suitable for machine learning algorithms.

Speech Recognition Models

  • The primary architecture for this project is based on the Whisper ASR models, with a focus on both the Whisper Small and Large versions. These models are selected for their effectiveness in speech recognition tasks and will be used for comparison during the training and evaluation phases.

  • Additionally, to enhance training efficiency and adapt the models to specific tasks, we apply PEFT (Probabilistic Exposure Fine-Tuning) and LORA (Layer-wise Adaptive Rate) adaptations. These techniques contribute to improving the overall performance and robustness of the Marathi Speech Recognition system.

Fine Tuning

To optimize the performance of the models, fine-tuning strategies are applied. These strategies involve carefully adjusting model parameters and hyperparameters during the training process, ensuring the best possible outcomes for our specific speech recognition tasks.

Model Training

For model training, we leverage the powerful Seq2SeqTrainingArguments and Seq2SeqTrainer from the Hugging Face Transformers library. These components provide a robust framework for training our Marathi Speech Recognition models.

Model Evaluation

  • The primary metrics for evaluating the performance are Word Error Rate (WER) and text similarity scores. The objective is to minimize WER, ensuring accurate transcription of Marathi speech.

  • Before fine-tuning, we assessed the models on the provided test dataset. For the Whisper Large-v3 model, the calculated average WER was 0.74, with an average similarity score of 78.66. In comparison, the Whisper Small model showed an average WER of 3.38 and an average similarity score of 30.48. These metrics serve as benchmarks for gauging improvements during the fine-tuning process.

Challenges Faced

  • GPU and Storage Constraints: Limited storage capacity in Google Colab posed a challenge, preventing additional fine-tuning steps due to insufficient space for model checkpoints and intermediate results. The free version of Google Colab provided inadequate GPU capacity, hindering the fine-tuning of larger and more complex models. This limitation impacted training efficiency and overall model performance.

  • Model Complexity vs Steps: Balancing increased model complexity with a lower number of fine-tuning steps presented a challenge. The compromise led to a higher Word Error Rate (WER), indicating the impact of insufficient training steps on the model's language understanding and transcription accuracy.

  • Training Issues: During the training process, the project encountered the errors of "IndexError - Invalid key: 3403 is out of bounds for size 0" occurred during model training", "CUDA Setup failed - Despite the availability of GPU, the setup failed" and "NotImplementedError Cannot copy out of meta tensor; no data!" hindered the fine-tuning process.

  • Fine-Tuning: To address these challenges, certain methods such as PEFT and LORA were skipped in the fine-tuning process. Further investigation and adjustments were made to overcome errors related to CUDA setup and meta tensor copying.

Conclusion

  • In conclusion, the Whisper-large-v3 model, despite showcasing superior accuracy, encountered challenges during training, leading to automatic crashes in Google Colab. As a strategic adjustment, the Whisper-small model was employed for training and fine-tuning with a custom dataset. It's important to note that the performance of the Whisper-small model is comparatively poorer than the Whisper-large-v3 model.

  • The challenges faced during the project underscore the need for a more robust infrastructure, which we plan to address in future iterations by transitioning to cloud platforms such as AWS or Azure. This move aims to provide the necessary resources and scalability to overcome training limitations and further enhance the predictive capabilities of the Marathi Speech Recognition system.

References

Contributing

Contributions to this project are welcome! If you encounter any issues or have suggestions for improvements, please feel free to submit a pull request.

License

This project is licensed under the MIT License. Please review the LICENSE file for more details.

Contact

๐Ÿ“ง Email: [email protected]

๐ŸŒ LinkedIn: linkedin.com/in/gopiashokan

For any further questions or inquiries, feel free to reach out. We are happy to assist you with any queries.

Downloads last month
14
Safetensors
Model size
242M params
Tensor type
F32
ยท
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.