---
title: Modal Transcriber MCP
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
tag: mcp-server-track
---

# 🎙️ Modal Transcriber MCP

A powerful audio transcription system that integrates a Gradio UI, FastMCP tools, and Modal cloud computing with intelligent speaker identification.

## ✨ Key Features

- **🎵 Multi-platform Audio Download**: Download episodes from Apple Podcasts, XiaoYuZhou, and other podcast platforms
- **🚀 High-performance Transcription**: Built on OpenAI Whisper with support for multiple models (turbo, large-v3, etc.)
- **🎤 Intelligent Speaker Identification**: Speaker separation and embedding clustering with pyannote.audio
- **⚡ Distributed Processing**: Large files are split into chunks and processed concurrently, significantly reducing processing time
- **🔧 FastMCP Tools**: Complete MCP (Model Context Protocol) tool integration
- **☁️ Modal Deployment**: Supports both local and cloud deployment modes

## 🎯 Core Advantages

### 🧠 Intelligent Audio Segmentation

- **Silence-based Segmentation**: Silent passages are detected automatically and used as intelligent chunk boundaries
- **Fallback Mechanism**: Long audio falls back to time-based segmentation when needed, keeping processing efficient
- **Concurrent Processing**: Chunks are transcribed in parallel, dramatically improving transcription speed

### 🎤 Advanced Speaker Identification

- **Embedding Clustering**: Deep-learning speaker embeddings are clustered to keep speaker identities consistent
- **Cross-chunk Unification**: Speaker labels are unified across chunks, resolving the inconsistencies introduced by distributed processing
- **Quality Filtering**: Low-quality segments are filtered out automatically to improve output accuracy

### 🔧 Developer Friendly

- **MCP Protocol Support**: Complete tool-invocation interface (see the example client sketch at the end of this README)
- **REST API**: Standardized API endpoints for programmatic access
- **Gradio UI**: Intuitive web interface
- **Test Coverage**: 29 unit and integration tests

## 🚀 Quick Start

### Local Setup

1. **Clone the repository**
   ```bash
   git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/ModalTranscriberMCP
   cd ModalTranscriberMCP
   ```

2. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

3. **Configure a Hugging Face token** (optional, required for speaker identification)
   ```bash
   # Create .env file
   echo "HF_TOKEN=your_huggingface_token_here" > .env
   ```

4. **Start the application**
   ```bash
   python app.py
   ```

### Usage Instructions

1. **Upload an audio file** or **enter a podcast URL**
2. **Select transcription options**:
   - Model size: turbo (recommended) / large-v3
   - Output format: SRT / TXT
   - Enable speaker identification
3. **Start transcription**; the system processes the audio and generates results automatically

## 🛠️ Technical Architecture

- **Frontend**: Gradio 4.44.0
- **Backend**: FastAPI + FastMCP
- **Transcription Engine**: OpenAI Whisper
- **Speaker Identification**: pyannote.audio
- **Cloud Computing**: Modal.com
- **Audio Processing**: FFmpeg

## 📊 Performance Metrics

- **Processing Speed**: Up to 30x real-time transcription
- **Concurrency**: Up to 10 chunks processed simultaneously
- **Accuracy**: >95% for Chinese audio
- **Supported Formats**: MP3, WAV, M4A, FLAC, etc.

## 🤝 Contributing

Issues and Pull Requests are welcome!

## 📜 License

MIT License

## 🔗 Related Links

- **Project Documentation**: See the `docs/` directory in the repository
- **Test Coverage**: 29 test cases ensuring functional stability
- **Modal Deployment**: Supports high-performance processing in the cloud

---

*Last updated: 2025-06-11*
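
## 🔌 Example: Calling the MCP Tools

The MCP tools can be invoked from any MCP-compatible client. The snippet below is a minimal sketch using the official `mcp` Python SDK; the endpoint path (`/sse`) and the tool name (`transcribe_audio_url`) are illustrative assumptions, not taken from this repository — run `list_tools()` first to discover the names and schemas the server actually exposes.

```python
"""Minimal MCP client sketch (endpoint path and tool name are assumptions)."""
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER_URL = "http://localhost:7860/sse"  # hypothetical endpoint path


async def main() -> None:
    # Open an SSE connection to the MCP server and start a client session
    async with sse_client(SERVER_URL) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()

            # Discover which tools this server actually exposes
            tools = await session.list_tools()
            print("Available tools:", [tool.name for tool in tools.tools])

            # Hypothetical tool name and arguments -- adjust to the real schema
            result = await session.call_tool(
                "transcribe_audio_url",
                arguments={"url": "https://example.com/episode.mp3"},
            )
            print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

The transport and path depend on how the FastMCP server is mounted in `app.py`, so check the server configuration before wiring up a client.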