Spaces:

Agents-MCP-Hackathon
/

pdf_explainer

Sleeping

File size: 8,260 Bytes

---
title: Pdf Explainer
emoji: 🦀
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags: [agent-demo-track]
---

# 🔍 PDF Explainer

An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.

## 🎥 Video Overview

[Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)

This video explains the usage and purpose of the Pdf Explainer application.

## ✨ Features

- **📄 PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
- **🤖 Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
- **🔊 Audio Generation**: Convert explanations to high-quality audio narrations
- **⚡ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
- **🎯 Context-Aware**: Maintains context across document sections for coherent explanations
- **📱 User-Friendly Interface**: Clean, responsive Gradio-based web interface

## 🏗️ Architecture & Technology Stack

### Core Technologies

#### 1. **Mistral OCR** - Text Extraction

- **Model**: `mistral-ocr-latest`
- **Purpose**: Extract text and images from PDF documents
- **Features**:
  - Advanced OCR capabilities with markdown formatting
  - Image extraction with coordinate mapping
  - Multi-page document support
  - Base64 encoding for secure document processing

#### 2. **Mistral AI Models** - Content Generation

- **Topic Extraction**: `ministral-8b-2410` for document topic identification
- **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations
- **Features**:
  - Structured JSON output for topic extraction
  - Chat history maintenance for contextual explanations
  - Temperature-controlled generation for consistent results
  - Section-by-section processing with heading analysis

#### 3. **Chatterbox TTS** - Audio Generation

- **Platform**: Modal-deployed APIs
- **Endpoints**:
  - `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
  - `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
- **Features**:
  - High-quality audio synthesis
  - Voice cloning capabilities
  - Streaming audio responses
  - Progress tracking for long generations

### Processing Pipeline

```mermaid
graph TD
    A[PDF Upload] --> B[Mistral OCR Processing]
    B --> C[Text Extraction & Image Detection]
    C --> D[Section Analysis & Heading Detection]
    D --> E[Topic Identification - Ministral-8B]
    E --> F[Explanation Generation - Mistral-Small]
    F --> G[Text Chunking for Audio]
    G --> H[Parallel Audio Processing]
    H --> I[Chatterbox TTS Generation]
    I --> J[Audio Concatenation]
    J --> K[Final Output]
```

## 🔧 Installation & Setup

### Prerequisites

- Python 3.8+
- Virtual environment (recommended)

### Environment Variables

Create a `.env` file based on `.env.example`:

```bash
# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here

# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
```

### Installation

1. **Clone the repository**:

   ```bash
   git clone <repository-url>
   cd pdf_explainer
   ```

2. **Create virtual environment**:

   ```bash
   python -m venv .venv
   source .venv/Scripts/activate  # Windows
   # or
   source .venv/bin/activate      # Linux/Mac
   ```

3. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

4. **Run the application**:
   ```bash
   python app.py
   ```

## 🚀 Usage

1. **Upload PDF**: Use the file upload interface to select your PDF document
2. **Automatic Processing**: The application will:
   - Extract text using Mistral OCR
   - Generate explanations using Mistral AI
   - Create audio narration using Chatterbox TTS
3. **View Results**: Access extracted text, explanations, and audio in separate tabs
4. **Download**: Copy text or download audio files as needed

## 📁 Project Structure

```
pdf_explainer/
├── app.py                      # Main application entry point
├── requirements.txt            # Python dependencies
├── .env.example               # Environment variables template
├── src/
│   ├── processors/            # Core processing modules
│   │   ├── pdf_processor.py          # Main PDF processing orchestrator
│   │   ├── pdf_text_extractor.py     # Mistral OCR integration
│   │   ├── audio_processor.py        # Audio generation coordinator
│   │   ├── generate_tts_audio.py     # Chatterbox TTS integration
│   │   ├── text_chunker.py           # Text splitting for audio processing
│   │   ├── parallel_processor.py     # Parallel audio generation
│   │   └── audio_concatenator.py     # Audio chunk merging
│   ├── ui_components/         # User interface components
│   │   ├── interface.py              # Gradio interface builder
│   │   └── styles.py                 # CSS styling
│   └── utils/                 # Utility modules
│       └── text_explainer.py         # Mistral AI explanation generation
```

## 🔧 Key Components

### PDF Processing (`PDFTextExtractor`)

- **OCR Integration**: Processes PDFs using Mistral's latest OCR model
- **Multi-strategy Extraction**: Multiple fallback methods for text extraction
- **Image Support**: Extracts and maps images with coordinates
- **Error Handling**: Robust error recovery and debugging

### Explanation Generation (`TextExplainer`)

- **Section Analysis**: Automatic detection of markdown headings
- **Context Maintenance**: Chat history for coherent multi-section explanations
- **Topic Extraction**: Automatic identification of document themes
- **Adaptive Processing**: Skips minimal content sections to optimize API usage

### Audio Processing (`AudioProcessor`)

- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
- **Parallel Generation**: Concurrent audio generation for faster processing
- **Audio Concatenation**: Seamless merging with silence padding and fade effects
- **Progress Tracking**: Real-time updates during long operations

## 🎛️ Configuration Options

### Text Chunking

- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
- `overlap_sentences`: Sentence overlap between chunks for continuity

### Audio Processing

- `max_workers`: Parallel processing threads (default: 4)
- `silence_duration`: Pause between audio chunks (default: 0.5s)
- `fade_duration`: Fade in/out effects (default: 0.1s)

### AI Models

- Mistral OCR: Latest OCR model for text extraction
- Ministral-8B: Topic extraction with structured output
- Mistral-Small: Explanation generation with chat context

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and test thoroughly
4. Commit with descriptive messages: `git commit -m "Add feature description"`
5. Push to your fork: `git push origin feature-name`
6. Create a pull request

## 📄 License

This project is open source and available under the [MIT License](LICENSE).

## 🆘 Support

For questions, issues, or contributions:

- Create an issue in the repository
- Check the video overview for usage guidance
- Review the code documentation for technical details

---

**Built with ❤️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**