metadata

title: Pdf Explainer
emoji: 🦀
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags:
  - agent-demo-track

🔍 PDF Explainer

An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.

🎥 Video Overview

Watch a video overview of Pdf Explainer

This video explains the usage and purpose of the Pdf Explainer application.

✨ Features

📄 PDF Text Extraction: Extract text content from PDF documents using advanced OCR technology
🤖 Intelligent Explanations: Generate simple, easy-to-understand explanations of complex content
🔊 Audio Generation: Convert explanations to high-quality audio narrations
⚡ Parallel Processing: Efficient processing of large documents with chunking and parallel audio generation
🎯 Context-Aware: Maintains context across document sections for coherent explanations
📱 User-Friendly Interface: Clean, responsive Gradio-based web interface

🏗️ Architecture & Technology Stack

Core Technologies

1. Mistral OCR - Text Extraction

Model: mistral-ocr-latest
Purpose: Extract text and images from PDF documents
Features:
- Advanced OCR capabilities with markdown formatting
- Image extraction with coordinate mapping
- Multi-page document support
- Base64 encoding for secure document processing

2. Mistral AI Models - Content Generation

Topic Extraction: ministral-8b-2410 for document topic identification
Explanation Generation: mistral-medium-2505 for creating simplified explanations
Features:
- Structured JSON output for topic extraction
- Chat history maintenance for contextual explanations
- Temperature-controlled generation for consistent results
- Section-by-section processing with heading analysis

3. Chatterbox TTS - Audio Generation

Platform: Modal-deployed APIs
Endpoints:
- GENERATE_AUDIO_ENDPOINT: Standard text-to-speech conversion
- GENERATE_WITH_FILE_ENDPOINT: Voice cloning with custom audio prompts
Features:
- High-quality audio synthesis
- Voice cloning capabilities
- Streaming audio responses
- Progress tracking for long generations

Processing Pipeline

graph TD
    A[PDF Upload] --> B[Mistral OCR Processing]
    B --> C[Text Extraction & Image Detection]
    C --> D[Section Analysis & Heading Detection]
    D --> E[Topic Identification - Ministral-8B]
    E --> F[Explanation Generation - Mistral-Small]
    F --> G[Text Chunking for Audio]
    G --> H[Parallel Audio Processing]
    H --> I[Chatterbox TTS Generation]
    I --> J[Audio Concatenation]
    J --> K[Final Output]

🔧 Installation & Setup

Prerequisites

Python 3.8+
Virtual environment (recommended)

Environment Variables

Create a .env file based on .env.example:

# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here

# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate

Installation

Clone the repository:

git clone <repository-url>
cd pdf_explainer

Create virtual environment:

python -m venv .venv
source .venv/Scripts/activate  # Windows
# or
source .venv/bin/activate      # Linux/Mac

Install dependencies:
```
pip install -r requirements.txt
```
Run the application:
```
python app.py
```

🚀 Usage

Upload PDF: Use the file upload interface to select your PDF document
Automatic Processing: The application will:
- Extract text using Mistral OCR
- Generate explanations using Mistral AI
- Create audio narration using Chatterbox TTS
View Results: Access extracted text, explanations, and audio in separate tabs
Download: Copy text or download audio files as needed

📁 Project Structure

pdf_explainer/
├── app.py                      # Main application entry point
├── requirements.txt            # Python dependencies
├── .env.example               # Environment variables template
├── src/
│   ├── processors/            # Core processing modules
│   │   ├── pdf_processor.py          # Main PDF processing orchestrator
│   │   ├── pdf_text_extractor.py     # Mistral OCR integration
│   │   ├── audio_processor.py        # Audio generation coordinator
│   │   ├── generate_tts_audio.py     # Chatterbox TTS integration
│   │   ├── text_chunker.py           # Text splitting for audio processing
│   │   ├── parallel_processor.py     # Parallel audio generation
│   │   └── audio_concatenator.py     # Audio chunk merging
│   ├── ui_components/         # User interface components
│   │   ├── interface.py              # Gradio interface builder
│   │   └── styles.py                 # CSS styling
│   └── utils/                 # Utility modules
│       └── text_explainer.py         # Mistral AI explanation generation

🔧 Key Components

PDF Processing (`PDFTextExtractor`)

OCR Integration: Processes PDFs using Mistral's latest OCR model
Multi-strategy Extraction: Multiple fallback methods for text extraction
Image Support: Extracts and maps images with coordinates
Error Handling: Robust error recovery and debugging

Explanation Generation (`TextExplainer`)

Section Analysis: Automatic detection of markdown headings
Context Maintenance: Chat history for coherent multi-section explanations
Topic Extraction: Automatic identification of document themes
Adaptive Processing: Skips minimal content sections to optimize API usage

Audio Processing (`AudioProcessor`)

Intelligent Chunking: Splits text at natural boundaries (paragraphs, sentences)
Parallel Generation: Concurrent audio generation for faster processing
Audio Concatenation: Seamless merging with silence padding and fade effects
Progress Tracking: Real-time updates during long operations

🎛️ Configuration Options

Text Chunking

max_chunk_size: Maximum characters per audio chunk (default: 800)
overlap_sentences: Sentence overlap between chunks for continuity

Audio Processing

max_workers: Parallel processing threads (default: 4)
silence_duration: Pause between audio chunks (default: 0.5s)
fade_duration: Fade in/out effects (default: 0.1s)

AI Models

Mistral OCR: Latest OCR model for text extraction
Ministral-8B: Topic extraction with structured output
Mistral-Small: Explanation generation with chat context

🤝 Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and test thoroughly
Commit with descriptive messages: git commit -m "Add feature description"
Push to your fork: git push origin feature-name
Create a pull request

📄 License

This project is open source and available under the MIT License.

🆘 Support

For questions, issues, or contributions:

Create an issue in the repository
Check the video overview for usage guidance
Review the code documentation for technical details

Built with ❤️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS