Prathamesh Sarjerao Vaidya committed · Commit 321254f · Parent(s): da625ea

fix docker write error

Files changed:
- DOCUMENTATION.md +4 -4
- Dockerfile +20 -3
- README.md +35 -52
- TECHNICAL_UNDERSTANDING.md +2 -2
- spaces.yaml +1 -1
- static/imgs/banner.png +2 -2

DOCUMENTATION.md CHANGED

@@ -1,12 +1,12 @@
-# 
+# Multilingual Audio Intelligence System - Technical Documentation
 
 ## 1. Project Overview
 
-The 
+The Multilingual Audio Intelligence System is an AI-powered platform that combines speaker diarization, automatic speech recognition, and neural machine translation to deliver comprehensive audio analysis capabilities. This system processes multilingual audio content with support for Indian languages, identifies individual speakers, transcribes speech with high accuracy, and provides translations across 100+ languages through a multi-tier fallback system, transforming raw audio into structured, actionable insights.
 
 ## 2. Objective
 
-The primary objective of the 
+The primary objective of the Multilingual Audio Intelligence System is to provide comprehensive audio content analysis capabilities by:
 
 - **Language Support**: Support for Tamil, Hindi, Telugu, Gujarati, Kannada, and other regional languages
 - **Multi-Tier Translation**: Fallback system ensuring broad translation coverage across language pairs
@@ -180,7 +180,7 @@ The application includes a demo mode for testing without waiting for full model
 - Available demos:
 - [Yuri_Kizaki.mp3](https://www.mitsue.co.jp/service/audio_and_video/audio_production/media/narrators_sample/yuri_kizaki/03.mp3) – Japanese narration about website communication
 - [Film_Podcast.mp3](https://www.lightbulblanguages.co.uk/resources/audio/film-podcast.mp3) – French podcast discussing films like The Social Network
-- [Tamil_Wikipedia_Interview.ogg](https://commons.wikimedia.org/wiki/File:
+- [Tamil_Wikipedia_Interview.ogg](https://commons.wikimedia.org/wiki/File:Parvathisri-Wikipedia-Interview-Vanavil-fm.ogg) – Tamil language interview (36+ minutes)
 - [Car_Trouble.mp3](https://www.tuttlepublishing.com/content/docs/9780804844383/06-18%20Part2%20Car%20Trouble.mp3) – Conversation about waiting for a mechanic and basic assistance (2:45)
 - Static serving: demo audio is exposed at `/demo_audio/<filename>` for local preview.
 - The UI provides enhanced selectable cards under Demo Mode; once selected, the system loads a preview and renders a waveform using HTML5 Canvas (Web Audio API) before processing.

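The demo-mode notes above describe demo audio being exposed at `/demo_audio/<filename>` so the UI can preview a file and draw its waveform. As a rough illustration only (this commit does not show `web_app.py`, and the real app may be wired differently), a FastAPI static mount along these lines is one common way to provide that route:

```python
# Hypothetical sketch of serving demo_audio/ at /demo_audio/<filename>;
# illustrative only, not taken from the project's web_app.py.
from pathlib import Path

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()

demo_dir = Path("demo_audio")
demo_dir.mkdir(exist_ok=True)  # same directory the Dockerfile below creates with mkdir -p

# Every file placed in demo_audio/ becomes reachable at /demo_audio/<filename>,
# which the demo cards can fetch for preview and HTML5 Canvas waveform rendering.
app.mount("/demo_audio", StaticFiles(directory=str(demo_dir)), name="demo_audio")
```

Keeping the demo files on the same FastAPI app means they are served from the single port (7860) that the Space already exposes.
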
Dockerfile CHANGED

@@ -24,8 +24,12 @@ RUN pip install --no-cache-dir --upgrade pip && \
 COPY . .
 
 # Create necessary directories with proper permissions
+# Fixed: Use 777 permissions for directories that need write access
 RUN mkdir -p templates static uploads outputs model_cache temp_files demo_results demo_audio \
-
+    /tmp/matplotlib /tmp/fontconfig \
+    && chmod -R 777 templates static \
+    && chmod -R 777 uploads outputs model_cache temp_files demo_results demo_audio \
+    && chmod -R 777 /tmp/matplotlib /tmp/fontconfig
 
 # Set environment variables for Hugging Face Spaces
 ENV PYTHONPATH=/app \
@@ -44,7 +48,8 @@ ENV PYTHONPATH=/app \
     PYANNOTE_CACHE=/app/model_cache \
     MPLCONFIGDIR=/tmp/matplotlib \
     HUGGINGFACE_HUB_CACHE=/app/model_cache \
-    HF_HUB_CACHE=/app/model_cache
+    HF_HUB_CACHE=/app/model_cache \
+    FONTCONFIG_PATH=/tmp/fontconfig
 
 # Expose port for Hugging Face Spaces
 EXPOSE 7860
@@ -54,4 +59,16 @@ HEALTHCHECK --interval=30s --timeout=30s --start-period=60s --retries=3 \
     CMD curl -f http://localhost:7860/api/system-info || exit 1
 
 # Preload models and start the application
-
+# Fixed: Ensure directories exist with proper permissions at runtime
+CMD ["python", "-c", "\
+import os; \
+import subprocess; \
+import time; \
+print('🚀 Starting Multilingual Audio Intelligence System...'); \
+for dir in ['uploads', 'outputs', 'model_cache', 'temp_files', 'demo_results', '/tmp/matplotlib', '/tmp/fontconfig']: \
+    os.makedirs(dir, mode=0o777, exist_ok=True); \
+subprocess.run(['python', 'model_preloader.py']); \
+print('✅ Models loaded successfully'); \
+import uvicorn; \
+uvicorn.run('web_app:app', host='0.0.0.0', port=7860, workers=1, log_level='info')\
+"]

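The new CMD above packs directory creation, model preloading, and the uvicorn launch into one `python -c` string held together by Dockerfile line continuations. An alternative worth noting (purely a sketch; this commit does not add such a file, and `start.py` is a hypothetical name) is to keep the same startup logic in a small script copied into the image, which is easier to read and debug than a single collapsed command string:

```python
# start.py - hypothetical equivalent of the inline CMD above (sketch only).
import os
import subprocess

import uvicorn

# Recreate writable directories at runtime, mirroring the chmod 777 fix above.
for d in ["uploads", "outputs", "model_cache", "temp_files",
          "demo_results", "/tmp/matplotlib", "/tmp/fontconfig"]:
    os.makedirs(d, mode=0o777, exist_ok=True)

print("Starting Multilingual Audio Intelligence System...")
subprocess.run(["python", "model_preloader.py"], check=False)  # preload models
print("Models loaded successfully")

# Serve the FastAPI app on the port exposed for Hugging Face Spaces.
uvicorn.run("web_app:app", host="0.0.0.0", port=7860, workers=1, log_level="info")
```

With a file like that in place, the Dockerfile's last instruction would shrink to `CMD ["python", "start.py"]`.
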
README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: 
+title: Multilingual Audio Intelligence System
 emoji: 🎵
 colorFrom: blue
 colorTo: purple
@@ -8,7 +8,7 @@ pinned: false
 short_description: AI for multilingual transcription & Indian language support
 ---
 
-# 🎵 
+# 🎵 Multilingual Audio Intelligence System
 
 <p align="center">
   <img src="static/imgs/banner.png" alt="Multilingual Audio Intelligence System Banner" style="border: 1px solid black"/>
@@ -48,10 +48,10 @@ This AI-powered platform combines speaker diarization, automatic speech recognit
 
 The system includes sample audio files for testing and demonstration:
 
-- 
-- 
-- 
-- 
+- [Japanese Business Audio](https://www.mitsue.co.jp/service/audio_and_video/audio_production/media/narrators_sample/yuri_kizaki/03.mp3): Professional voice message about website communication
+- [French Film Podcast](https://www.lightbulblanguages.co.uk/resources/audio/film-podcast.mp3): Discussion about movies including Social Network and Paranormal Activity
+- [Tamil Wikipedia Interview](https://commons.wikimedia.org/wiki/File:Parvathisri-Wikipedia-Interview-Vanavil-fm.ogg): Tamil language interview on collaborative knowledge sharing (36+ minutes)
+- [Hindi Car Trouble](https://www.tuttlepublishing.com/content/docs/9780804844383/06-18%20Part2%20Car%20Trouble.mp3): Hindi conversation about daily life scenarios (2:45)
 
 ### Demo Features
 - **Pre-processed Results**: Cached processing for quick demonstration
@@ -111,7 +111,7 @@ The system includes sample audio files for testing and demonstration
 ### **1. Environment Setup**
 ```bash
 # Clone the enhanced repository
-git clone https://github.com/
+git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
 cd Enhanced-Multilingual-Audio-Intelligence-System
 
 # Create conda environment (recommended)
@@ -153,34 +153,34 @@ python run_app.py --mode test # System testing
 
 ```
 Enhanced-Multilingual-Audio-Intelligence-System/
-├── run_app.py
-├── web_app.py
-├── src/
-│   ├── main.py
-│   ├── audio_processor.py
-│   ├── speaker_diarizer.py
-│   ├── speech_recognizer.py
-│   ├── translator.py
-│   ├── output_formatter.py
-│   ├── demo_manager.py
-│   ├── ui_components.py
-│   └── utils.py
-├── demo_audio/
-│   ├── Yuri_Kizaki.mp3
-│   ├── Film_Podcast.mp3
-│   ├── Tamil_Wikipedia_Interview.ogg #
-│   └── Car_Trouble.mp3
+├── run_app.py                        # Single entry point for all modes
+├── web_app.py                        # Enhanced FastAPI application
+├── src/                              # Organized source modules
+│   ├── main.py                       # Enhanced pipeline orchestrator
+│   ├── audio_processor.py            # Enhanced with smart file management
+│   ├── speaker_diarizer.py           # pyannote.audio integration
+│   ├── speech_recognizer.py          # faster-whisper integration
+│   ├── translator.py                 # 3-tier hybrid translation system
+│   ├── output_formatter.py           # Multi-format output generation
+│   ├── demo_manager.py               # Enhanced demo file management
+│   ├── ui_components.py              # Interactive UI components
+│   └── utils.py                      # Enhanced utility functions
+├── demo_audio/                       # Enhanced demo files
+│   ├── Yuri_Kizaki.mp3               # Japanese business communication
+│   ├── Film_Podcast.mp3              # French cinema discussion
+│   ├── Tamil_Wikipedia_Interview.ogg # Tamil language interview
+│   └── Car_Trouble.mp3               # Hindi daily conversation
 ├── templates/
-│   └── index.html
+│   └── index.html                    # Enhanced UI with Indian language support
 ├── static/
-│   └── imgs/
-├── model_cache/
-├── outputs/
-├── requirements.txt
-├── README.md
-├── DOCUMENTATION.md
-├── TECHNICAL_UNDERSTANDING.md
-└── files_which_are_not_needed/
+│   └── imgs/                         # Enhanced screenshots and assets
+├── model_cache/                      # Intelligent model caching
+├── outputs/                          # Processing results
+├── requirements.txt                  # Enhanced dependencies
+├── README.md                         # This enhanced documentation
+├── DOCUMENTATION.md                  # Comprehensive technical docs
+├── TECHNICAL_UNDERSTANDING.md        # System architecture guide
+└── files_which_are_not_needed/       # Archived legacy files
 ```
 
 ## 🚀 Enhanced Usage Examples
@@ -246,23 +246,6 @@ MAX_FILE_SIZE_MB=200 # Smart file size limit
 - **Device Selection**: CPU (recommended), CUDA (if available)
 - **Cache Management**: Automatic model caching and cleanup
 
-## Problem Statement 6 Alignment
-
-This system addresses **PS-6: "Language-Agnostic Speaker Identification/Verification & Diarization; and subsequent Transcription & Translation System"** with the following capabilities:
-
-### **Current Implementation (70% Coverage)**
-- ✅ **Speaker Diarization**: pyannote.audio for "who spoke when" analysis
-- ✅ **Multilingual ASR**: faster-whisper with automatic language detection
-- ✅ **Neural Translation**: Multi-tier system for 100+ languages
-- ✅ **Audio Format Support**: WAV, MP3, OGG, FLAC, M4A
-- ✅ **User Interface**: Transcripts, visualizations, and translations
-
-### **Enhanced Features (95% Complete)**
-- ✅ **Advanced Speaker Verification**: Multi-model speaker identification with SpeechBrain, Wav2Vec2, and enhanced feature extraction
-- ✅ **Advanced Noise Reduction**: ML-based enhancement with Sepformer, Demucs, and advanced signal processing
-- ✅ **Enhanced Code-switching**: Improved support for mixed language audio with context awareness
-- ✅ **Performance Optimization**: Real-time processing with advanced caching and optimization
-
 ## System Advantages
 
 ### **Reliability**
@@ -335,7 +318,7 @@ docker run -p 8000:7860 audio-intelligence
 ### **Hugging Face Spaces**
 ```yaml
 # spaces.yaml
-title: 
+title: Multilingual Audio Intelligence System
 emoji: 🎵
 colorFrom: blue
 colorTo: purple
@@ -368,4 +351,4 @@ This enhanced system is released under MIT License - see the [LICENSE](LICENSE)
 
 ---
 
-**A comprehensive solution for multilingual audio analysis and translation, designed to handle diverse language requirements and processing scenarios.**
+**A comprehensive solution for multilingual audio analysis and translation, designed to handle diverse language requirements and processing scenarios.**

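The README tree above labels `src/translator.py` as a "3-tier hybrid translation system", and DOCUMENTATION.md describes the translation layer as a multi-tier fallback. The tiers themselves are not shown in this diff, so the following is only an illustrative sketch of the fallback pattern (every function name here is a placeholder, not one of the project's actual backends): each tier is tried in order, and any failure falls through to the next, ending in an untranslated pass-through rather than an error.

```python
# Illustrative multi-tier translation fallback; placeholder tiers only,
# not the actual backends implemented in src/translator.py.
from typing import Callable, List, Optional

TierFn = Callable[[str, str, str], str]


def tier_primary(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("e.g. a dedicated neural MT model for this language pair")


def tier_secondary(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("e.g. a broad multilingual fallback model")


def tier_pass_through(text: str, src: str, tgt: str) -> str:
    return text  # last resort: return the original text instead of failing


def translate(text: str, src: str, tgt: str,
              tiers: Optional[List[TierFn]] = None) -> str:
    """Try each tier in order; the first one that succeeds wins."""
    for tier in tiers or [tier_primary, tier_secondary, tier_pass_through]:
        try:
            return tier(text, src, tgt)
        except Exception:
            continue  # fall through to the next tier
    return text
```

In this stub, `translate("bonjour", "fr", "en")` falls all the way through to the pass-through tier, since the first two tiers are deliberately unimplemented.
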
TECHNICAL_UNDERSTANDING.md CHANGED

@@ -1,8 +1,8 @@
-# Technical Understanding - 
+# Technical Understanding - Multilingual Audio Intelligence System
 
 ## Architecture Overview
 
-This document provides technical insights into the 
+This document provides technical insights into the multilingual audio intelligence system, designed to address comprehensive audio analysis requirements. The system incorporates **Indian language support**, **multi-tier translation**, **waveform visualization**, and **optimized performance** for various deployment scenarios.
 
 ## System Architecture
 

spaces.yaml CHANGED

@@ -1,4 +1,4 @@
-title: 
+title: Multilingual Audio Intelligence System
 emoji: 🎵
 colorFrom: blue
 colorTo: purple

static/imgs/banner.png CHANGED

Binary image updated (stored via Git LFS).