Spaces:
Sleeping
Sleeping
A newer version of the Gradio SDK is available:
5.38.0
metadata
title: Pdf Explainer
emoji: π¦
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags:
- agent-demo-track
π PDF Explainer
An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.
π₯ Video Overview
Watch a video overview of Pdf Explainer
This video explains the usage and purpose of the Pdf Explainer application.
β¨ Features
- π PDF Text Extraction: Extract text content from PDF documents using advanced OCR technology
- π€ Intelligent Explanations: Generate simple, easy-to-understand explanations of complex content
- π Audio Generation: Convert explanations to high-quality audio narrations
- β‘ Parallel Processing: Efficient processing of large documents with chunking and parallel audio generation
- π― Context-Aware: Maintains context across document sections for coherent explanations
- π± User-Friendly Interface: Clean, responsive Gradio-based web interface
ποΈ Architecture & Technology Stack
Core Technologies
1. Mistral OCR - Text Extraction
- Model:
mistral-ocr-latest
- Purpose: Extract text and images from PDF documents
- Features:
- Advanced OCR capabilities with markdown formatting
- Image extraction with coordinate mapping
- Multi-page document support
- Base64 encoding for secure document processing
2. Mistral AI Models - Content Generation
- Topic Extraction:
ministral-8b-2410
for document topic identification - Explanation Generation:
mistral-medium-2505
for creating simplified explanations - Features:
- Structured JSON output for topic extraction
- Chat history maintenance for contextual explanations
- Temperature-controlled generation for consistent results
- Section-by-section processing with heading analysis
3. Chatterbox TTS - Audio Generation
- Platform: Modal-deployed APIs
- Endpoints:
GENERATE_AUDIO_ENDPOINT
: Standard text-to-speech conversionGENERATE_WITH_FILE_ENDPOINT
: Voice cloning with custom audio prompts
- Features:
- High-quality audio synthesis
- Voice cloning capabilities
- Streaming audio responses
- Progress tracking for long generations
Processing Pipeline
graph TD
A[PDF Upload] --> B[Mistral OCR Processing]
B --> C[Text Extraction & Image Detection]
C --> D[Section Analysis & Heading Detection]
D --> E[Topic Identification - Ministral-8B]
E --> F[Explanation Generation - Mistral-Small]
F --> G[Text Chunking for Audio]
G --> H[Parallel Audio Processing]
H --> I[Chatterbox TTS Generation]
I --> J[Audio Concatenation]
J --> K[Final Output]
π§ Installation & Setup
Prerequisites
- Python 3.8+
- Virtual environment (recommended)
Environment Variables
Create a .env
file based on .env.example
:
# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here
# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
Installation
Clone the repository:
git clone <repository-url> cd pdf_explainer
Create virtual environment:
python -m venv .venv source .venv/Scripts/activate # Windows # or source .venv/bin/activate # Linux/Mac
Install dependencies:
pip install -r requirements.txt
Run the application:
python app.py
π Usage
- Upload PDF: Use the file upload interface to select your PDF document
- Automatic Processing: The application will:
- Extract text using Mistral OCR
- Generate explanations using Mistral AI
- Create audio narration using Chatterbox TTS
- View Results: Access extracted text, explanations, and audio in separate tabs
- Download: Copy text or download audio files as needed
π Project Structure
pdf_explainer/
βββ app.py # Main application entry point
βββ requirements.txt # Python dependencies
βββ .env.example # Environment variables template
βββ src/
β βββ processors/ # Core processing modules
β β βββ pdf_processor.py # Main PDF processing orchestrator
β β βββ pdf_text_extractor.py # Mistral OCR integration
β β βββ audio_processor.py # Audio generation coordinator
β β βββ generate_tts_audio.py # Chatterbox TTS integration
β β βββ text_chunker.py # Text splitting for audio processing
β β βββ parallel_processor.py # Parallel audio generation
β β βββ audio_concatenator.py # Audio chunk merging
β βββ ui_components/ # User interface components
β β βββ interface.py # Gradio interface builder
β β βββ styles.py # CSS styling
β βββ utils/ # Utility modules
β βββ text_explainer.py # Mistral AI explanation generation
π§ Key Components
PDF Processing (PDFTextExtractor
)
- OCR Integration: Processes PDFs using Mistral's latest OCR model
- Multi-strategy Extraction: Multiple fallback methods for text extraction
- Image Support: Extracts and maps images with coordinates
- Error Handling: Robust error recovery and debugging
Explanation Generation (TextExplainer
)
- Section Analysis: Automatic detection of markdown headings
- Context Maintenance: Chat history for coherent multi-section explanations
- Topic Extraction: Automatic identification of document themes
- Adaptive Processing: Skips minimal content sections to optimize API usage
Audio Processing (AudioProcessor
)
- Intelligent Chunking: Splits text at natural boundaries (paragraphs, sentences)
- Parallel Generation: Concurrent audio generation for faster processing
- Audio Concatenation: Seamless merging with silence padding and fade effects
- Progress Tracking: Real-time updates during long operations
ποΈ Configuration Options
Text Chunking
max_chunk_size
: Maximum characters per audio chunk (default: 800)overlap_sentences
: Sentence overlap between chunks for continuity
Audio Processing
max_workers
: Parallel processing threads (default: 4)silence_duration
: Pause between audio chunks (default: 0.5s)fade_duration
: Fade in/out effects (default: 0.1s)
AI Models
- Mistral OCR: Latest OCR model for text extraction
- Ministral-8B: Topic extraction with structured output
- Mistral-Small: Explanation generation with chat context
π€ Contributing
- Fork the repository
- Create a feature branch:
git checkout -b feature-name
- Make your changes and test thoroughly
- Commit with descriptive messages:
git commit -m "Add feature description"
- Push to your fork:
git push origin feature-name
- Create a pull request
π License
This project is open source and available under the MIT License.
π Support
For questions, issues, or contributions:
- Create an issue in the repository
- Check the video overview for usage guidance
- Review the code documentation for technical details
Built with β€οΈ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS