Multilingual Audio Intelligence System
Overview
The Multilingual Audio Intelligence System is an AI-powered platform that combines state-of-the-art speaker diarization, automatic speech recognition, and neural machine translation for comprehensive audio analysis. It processes multilingual audio content, identifies individual speakers, transcribes speech with high accuracy, and translates across multiple languages, transforming raw audio into structured, actionable insights.
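Conceptually, the pipeline chains four stages: preprocessing, diarization, recognition, and translation. The sketch below illustrates that flow; the class and method names are illustrative assumptions, not the actual API exposed by the modules in src/.

```python
# Illustrative flow only -- the real entry point is src/main.py, and these
# class/method names are assumptions, not the repository's actual API.
from src.audio_processor import AudioProcessor
from src.speaker_diarizer import SpeakerDiarizer
from src.speech_recognizer import SpeechRecognizer
from src.translator import Translator
from src.output_formatter import OutputFormatter

def analyze(path: str, target_lang: str = "en") -> dict:
    audio = AudioProcessor().load(path)                       # resample/normalize input
    turns = SpeakerDiarizer().diarize(audio)                  # who spoke when
    segments = SpeechRecognizer().transcribe(audio, turns)    # per-turn text + detected language
    translated = Translator(target_lang).translate(segments)  # NMT into the target language
    return OutputFormatter().format(translated)               # structured, exportable output
```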
Features
Demo Mode with Professional Audio Files
- Yuri Kizaki - Japanese Audio: Professional voice message about website communication (23 seconds)
- French Film Podcast: Discussion about movies including Social Network and Paranormal Activity (25 seconds)
- Smart demo file management with automatic download and preprocessing
- Cached processing for near-instant demo results
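A download-once, cache-by-content-hash scheme is enough to support this. The sketch below uses only the standard library; the directory names and the `process` callable are assumptions, not the project's actual implementation.

```python
import json, hashlib, urllib.request
from pathlib import Path

DEMO_DIR = Path("uploads/demos")   # assumption: demo files live under uploads/
CACHE_DIR = Path("outputs/cache")  # assumption: cached results live under outputs/

def get_demo(url: str) -> Path:
    """Download a demo file once; reuse the local copy on later runs."""
    DEMO_DIR.mkdir(parents=True, exist_ok=True)
    local = DEMO_DIR / Path(url).name
    if not local.exists():
        urllib.request.urlretrieve(url, local)
    return local

def cached_result(audio: Path, process) -> dict:
    """Return a cached pipeline result keyed by the file's content hash."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(audio.read_bytes()).hexdigest()
    hit = CACHE_DIR / f"{key}.json"
    if hit.exists():
        return json.loads(hit.read_text())
    result = process(audio)             # run the full pipeline on a cache miss
    hit.write_text(json.dumps(result))
    return result
```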
Enhanced User Interface
- Audio Waveform Visualization: Real-time waveform display with HTML5 Canvas
- Interactive Demo Selection: Beautiful cards for selecting demo audio files
- Improved Transcript Display: Color-coded confidence levels and clear translation sections
- Professional Audio Preview: Audio player with waveform visualization
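One way to feed the canvas waveform is to precompute normalized peak buckets server-side. The sketch below assumes numpy and soundfile; the app's actual waveform code (which may run client-side) could differ.

```python
# Sketch: per-bucket peak amplitudes for a canvas waveform.
# Assumes numpy and soundfile are available; not the project's actual code.
import numpy as np
import soundfile as sf

def waveform_peaks(path: str, buckets: int = 800) -> list[float]:
    samples, _sr = sf.read(path, always_2d=True)
    mono = np.abs(samples).mean(axis=1)             # collapse channels to one track
    buckets = min(buckets, len(mono))               # guard against very short clips
    trimmed = mono[: len(mono) // buckets * buckets]
    peaks = trimmed.reshape(buckets, -1).max(axis=1)
    top = peaks.max()
    return (peaks / top if top > 0 else peaks).tolist()  # normalized 0..1 heights
```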
Screenshots
- Demo Banner
- Transcript with Translation
- Visual Representation
- Summary Output
Installation and Quick Start
Clone the Repository:

```bash
git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
cd Multilingual-Audio-Intelligence-System
```

Create and Activate Conda Environment:

```bash
conda create --name audio_challenge python=3.9
conda activate audio_challenge
```

Install Dependencies:

```bash
pip install -r requirements.txt
```

Configure Environment Variables:

```bash
cp config.example.env .env
# Edit the .env file with your HUGGINGFACE_TOKEN for accessing gated models
```

Preload AI Models (Recommended):

```bash
python model_preloader.py
```

Initialize Application:

```bash
python run_fastapi.py
```
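After installation, a quick sanity check can confirm the token is visible and whether a GPU will be used (this snippet assumes torch and python-dotenv are pulled in by requirements.txt):

```python
# Post-install sanity check (assumes torch and python-dotenv are installed).
import os
import torch
from dotenv import load_dotenv

load_dotenv()  # read .env from the current directory
print("HF token set:", bool(os.getenv("HUGGINGFACE_TOKEN")))
print("CUDA available:", torch.cuda.is_available())  # the pipeline can fall back to CPU
```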
File Structure
```
audio_challenge/
├── web_app.py               # FastAPI application
├── run_fastapi.py            # Startup script
├── requirements.txt          # Dependencies
├── templates/
│   └── index.html            # Main interface
├── src/                      # Core modules
│   ├── main.py               # Pipeline orchestrator
│   ├── audio_processor.py    # Audio preprocessing
│   ├── speaker_diarizer.py   # Speaker identification
│   ├── speech_recognizer.py  # ASR with language detection
│   ├── translator.py         # Neural machine translation
│   ├── output_formatter.py   # Output generation
│   └── utils.py              # Utility functions
├── static/                   # Static assets
├── uploads/                  # Uploaded files
├── outputs/                  # Generated outputs
└── README.md
```
Configuration
Environment Variables
Create a .env file:
```
HUGGINGFACE_TOKEN=hf_your_token_here  # Optional, for gated models
```
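The token matters because gated diarization checkpoints on Hugging Face require authentication. As an illustration (assuming a pyannote.audio pipeline, which the repository may or may not use):

```python
# Sketch: how a Hugging Face token unlocks gated models. Assumes pyannote.audio;
# the project's actual model-loading code may differ.
import os
from dotenv import load_dotenv
from pyannote.audio import Pipeline

load_dotenv()  # reads HUGGINGFACE_TOKEN from .env
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.getenv("HUGGINGFACE_TOKEN"),
)
```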
Model Configuration
- Whisper Model: tiny/small/medium/large
- Target Language: en/es/fr/de/it/pt/zh/ja/ko/ar
- Device: auto/cpu/cuda
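These three options could be grouped into a single config object; the dataclass below is a hypothetical illustration of the option space, not the project's actual settings code.

```python
# Hypothetical grouping of the configuration options listed above.
from dataclasses import dataclass
from typing import Literal

@dataclass
class PipelineConfig:
    whisper_model: Literal["tiny", "small", "medium", "large"] = "small"
    target_language: str = "en"   # en/es/fr/de/it/pt/zh/ja/ko/ar
    device: Literal["auto", "cpu", "cuda"] = "auto"

config = PipelineConfig(whisper_model="tiny", target_language="fr", device="cpu")
```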
Supported Audio Formats
- WAV (recommended)
- MP3
- OGG
- FLAC
- M4A
- Maximum file size: 100 MB
- Recommended duration: under 30 minutes
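If an input in another container fails, converting to mono 16 kHz WAV first is a safe fallback. A sketch using pydub, which requires ffmpeg (neither is necessarily what the project's audio_processor.py uses):

```python
# Sketch: normalize any supported input to WAV (assumes pydub + ffmpeg installed).
from pydub import AudioSegment

def to_wav(src: str, dst: str = "converted.wav") -> str:
    audio = AudioSegment.from_file(src)                  # ffmpeg handles MP3/OGG/FLAC/M4A
    audio = audio.set_channels(1).set_frame_rate(16000)  # mono 16 kHz suits ASR models
    audio.export(dst, format="wav")
    return dst
```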
Development
Local Development
```bash
python run_fastapi.py
```
Production Deployment
```bash
uvicorn web_app:app --host 0.0.0.0 --port 8000
```
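In production, uvicorn's `--workers N` flag can add process-level parallelism. For reference, a startup script like run_fastapi.py typically just wraps uvicorn programmatically; the snippet below is a plausible sketch, not the file's actual contents.

```python
# Plausible sketch of run_fastapi.py -- not necessarily the actual file contents.
import uvicorn

if __name__ == "__main__":
    # reload=True is handy for local development; omit it in production
    uvicorn.run("web_app:app", host="0.0.0.0", port=8000, reload=True)
```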
Performance
- Processing Speed: 2-14x real-time (depending on model size)
- Memory Usage: Optimized with INT8 quantization
- CPU Optimized: Works without GPU
- Concurrent Processing: Async/await support
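The README does not state which Whisper backend provides the INT8 path; faster-whisper is one common choice for quantized CPU inference, shown here purely as an illustration:

```python
# Sketch: CPU-friendly INT8 inference with faster-whisper (an assumption;
# the repository may use a different Whisper backend).
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")  # INT8 weights cut memory use
segments, info = model.transcribe("audio.wav")
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```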
Troubleshooting
Common Issues
- Dependencies: Use requirements.txt for a clean installation
- Memory: Use smaller models (tiny/small) for limited hardware
- Audio Format: Convert to WAV if other formats fail
- Port Conflicts: Change the port in run_fastapi.py if 8000 is occupied
Error Resolution
- Check logs in terminal output
- Verify audio file format and size
- Ensure all dependencies are installed
- Check available system memory
Support
- Documentation: Check the /api/docs endpoint
- System Info: Use the info button in the web interface
- Logs: Monitor terminal output for detailed information
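Since the backend is FastAPI, the interactive docs at /api/docs are generated from an OpenAPI schema, which can also be fetched programmatically (assuming the server is running locally and the default /openapi.json path is unchanged):

```python
# List the API's routes from its OpenAPI schema.
import requests

schema = requests.get("http://localhost:8000/openapi.json").json()
for path, ops in schema["paths"].items():
    print(path, "->", ", ".join(ops))
```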