pdf_explainer / README.md
spagestic's picture
fix: update explanation generation model to mistral-medium-2505
dd41680

A newer version of the Gradio SDK is available: 5.38.0

Upgrade
metadata
title: Pdf Explainer
emoji: πŸ¦€
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags:
  - agent-demo-track

πŸ” PDF Explainer

An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.

πŸŽ₯ Video Overview

Watch a video overview of Pdf Explainer

This video explains the usage and purpose of the Pdf Explainer application.

✨ Features

  • πŸ“„ PDF Text Extraction: Extract text content from PDF documents using advanced OCR technology
  • πŸ€– Intelligent Explanations: Generate simple, easy-to-understand explanations of complex content
  • πŸ”Š Audio Generation: Convert explanations to high-quality audio narrations
  • ⚑ Parallel Processing: Efficient processing of large documents with chunking and parallel audio generation
  • 🎯 Context-Aware: Maintains context across document sections for coherent explanations
  • πŸ“± User-Friendly Interface: Clean, responsive Gradio-based web interface

πŸ—οΈ Architecture & Technology Stack

Core Technologies

1. Mistral OCR - Text Extraction

  • Model: mistral-ocr-latest
  • Purpose: Extract text and images from PDF documents
  • Features:
    • Advanced OCR capabilities with markdown formatting
    • Image extraction with coordinate mapping
    • Multi-page document support
    • Base64 encoding for secure document processing

2. Mistral AI Models - Content Generation

  • Topic Extraction: ministral-8b-2410 for document topic identification
  • Explanation Generation: mistral-medium-2505 for creating simplified explanations
  • Features:
    • Structured JSON output for topic extraction
    • Chat history maintenance for contextual explanations
    • Temperature-controlled generation for consistent results
    • Section-by-section processing with heading analysis

3. Chatterbox TTS - Audio Generation

  • Platform: Modal-deployed APIs
  • Endpoints:
    • GENERATE_AUDIO_ENDPOINT: Standard text-to-speech conversion
    • GENERATE_WITH_FILE_ENDPOINT: Voice cloning with custom audio prompts
  • Features:
    • High-quality audio synthesis
    • Voice cloning capabilities
    • Streaming audio responses
    • Progress tracking for long generations

Processing Pipeline

graph TD
    A[PDF Upload] --> B[Mistral OCR Processing]
    B --> C[Text Extraction & Image Detection]
    C --> D[Section Analysis & Heading Detection]
    D --> E[Topic Identification - Ministral-8B]
    E --> F[Explanation Generation - Mistral-Small]
    F --> G[Text Chunking for Audio]
    G --> H[Parallel Audio Processing]
    H --> I[Chatterbox TTS Generation]
    I --> J[Audio Concatenation]
    J --> K[Final Output]

πŸ”§ Installation & Setup

Prerequisites

  • Python 3.8+
  • Virtual environment (recommended)

Environment Variables

Create a .env file based on .env.example:

# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here

# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd pdf_explainer
    
  2. Create virtual environment:

    python -m venv .venv
    source .venv/Scripts/activate  # Windows
    # or
    source .venv/bin/activate      # Linux/Mac
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Run the application:

    python app.py
    

πŸš€ Usage

  1. Upload PDF: Use the file upload interface to select your PDF document
  2. Automatic Processing: The application will:
    • Extract text using Mistral OCR
    • Generate explanations using Mistral AI
    • Create audio narration using Chatterbox TTS
  3. View Results: Access extracted text, explanations, and audio in separate tabs
  4. Download: Copy text or download audio files as needed

πŸ“ Project Structure

pdf_explainer/
β”œβ”€β”€ app.py                      # Main application entry point
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ .env.example               # Environment variables template
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ processors/            # Core processing modules
β”‚   β”‚   β”œβ”€β”€ pdf_processor.py          # Main PDF processing orchestrator
β”‚   β”‚   β”œβ”€β”€ pdf_text_extractor.py     # Mistral OCR integration
β”‚   β”‚   β”œβ”€β”€ audio_processor.py        # Audio generation coordinator
β”‚   β”‚   β”œβ”€β”€ generate_tts_audio.py     # Chatterbox TTS integration
β”‚   β”‚   β”œβ”€β”€ text_chunker.py           # Text splitting for audio processing
β”‚   β”‚   β”œβ”€β”€ parallel_processor.py     # Parallel audio generation
β”‚   β”‚   └── audio_concatenator.py     # Audio chunk merging
β”‚   β”œβ”€β”€ ui_components/         # User interface components
β”‚   β”‚   β”œβ”€β”€ interface.py              # Gradio interface builder
β”‚   β”‚   └── styles.py                 # CSS styling
β”‚   └── utils/                 # Utility modules
β”‚       └── text_explainer.py         # Mistral AI explanation generation

πŸ”§ Key Components

PDF Processing (PDFTextExtractor)

  • OCR Integration: Processes PDFs using Mistral's latest OCR model
  • Multi-strategy Extraction: Multiple fallback methods for text extraction
  • Image Support: Extracts and maps images with coordinates
  • Error Handling: Robust error recovery and debugging

Explanation Generation (TextExplainer)

  • Section Analysis: Automatic detection of markdown headings
  • Context Maintenance: Chat history for coherent multi-section explanations
  • Topic Extraction: Automatic identification of document themes
  • Adaptive Processing: Skips minimal content sections to optimize API usage

Audio Processing (AudioProcessor)

  • Intelligent Chunking: Splits text at natural boundaries (paragraphs, sentences)
  • Parallel Generation: Concurrent audio generation for faster processing
  • Audio Concatenation: Seamless merging with silence padding and fade effects
  • Progress Tracking: Real-time updates during long operations

πŸŽ›οΈ Configuration Options

Text Chunking

  • max_chunk_size: Maximum characters per audio chunk (default: 800)
  • overlap_sentences: Sentence overlap between chunks for continuity

Audio Processing

  • max_workers: Parallel processing threads (default: 4)
  • silence_duration: Pause between audio chunks (default: 0.5s)
  • fade_duration: Fade in/out effects (default: 0.1s)

AI Models

  • Mistral OCR: Latest OCR model for text extraction
  • Ministral-8B: Topic extraction with structured output
  • Mistral-Small: Explanation generation with chat context

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and test thoroughly
  4. Commit with descriptive messages: git commit -m "Add feature description"
  5. Push to your fork: git push origin feature-name
  6. Create a pull request

πŸ“„ License

This project is open source and available under the MIT License.

πŸ†˜ Support

For questions, issues, or contributions:

  • Create an issue in the repository
  • Check the video overview for usage guidance
  • Review the code documentation for technical details

Built with ❀️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS