---
title: Pdf Explainer
emoji: πŸ¦€
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags: [agent-demo-track]
---
# πŸ” PDF Explainer
An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. It transforms complex PDF content into accessible formats using Mistral OCR, Mistral chat models, and Modal-deployed Chatterbox TTS.
## πŸŽ₯ Video Overview
[Watch a video overview of PDF Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)
The video walks through the usage and purpose of the PDF Explainer application.
## ✨ Features
- **πŸ“„ PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
- **πŸ€– Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
- **πŸ”Š Audio Generation**: Convert explanations to high-quality audio narrations
- **⚑ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
- **🎯 Context-Aware**: Maintains context across document sections for coherent explanations
- **πŸ“± User-Friendly Interface**: Clean, responsive Gradio-based web interface
## πŸ—οΈ Architecture & Technology Stack
### Core Technologies
#### 1. **Mistral OCR** - Text Extraction
- **Model**: `mistral-ocr-latest`
- **Purpose**: Extract text and images from PDF documents
- **Features**:
- Advanced OCR capabilities with markdown formatting
- Image extraction with coordinate mapping
- Multi-page document support
- Base64 encoding for secure document processing
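The snippet below is a minimal sketch of this step using the `mistralai` Python SDK: it base64-encodes the uploaded PDF as a data URL and requests per-page markdown. The real integration lives in `src/processors/pdf_text_extractor.py`, so treat the exact arguments and response handling here as illustrative.
```python
import base64
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Base64-encode the PDF so it can be sent inline as a data URL.
with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

# Ask the OCR model for markdown (and image metadata) for every page.
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{pdf_b64}",
    },
    include_image_base64=True,
)

# Each page object carries markdown text plus any extracted images.
full_text = "\n\n".join(page.markdown for page in ocr_response.pages)
```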
#### 2. **Mistral AI Models** - Content Generation
- **Topic Extraction**: `ministral-8b-2410` for document topic identification
- **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations
- **Features**:
- Structured JSON output for topic extraction
- Chat history maintenance for contextual explanations
- Temperature-controlled generation for consistent results
- Section-by-section processing with heading analysis
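A condensed sketch of both calls, assuming the `mistralai` SDK's `client.chat.complete` interface. The prompts, the `document_text` and `sections` placeholders, and the message handling are illustrative; the shipped logic lives in `src/utils/text_explainer.py`.
```python
import json
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

document_text = "..."                                     # markdown from the OCR step
sections = [("Introduction", "..."), ("Results", "...")]  # (heading, body) placeholders

# Topic extraction: ask ministral-8b-2410 for structured JSON.
topics_raw = client.chat.complete(
    model="ministral-8b-2410",
    messages=[
        {"role": "system", "content": 'List the document topics as JSON: {"topics": [...]}'},
        {"role": "user", "content": document_text[:4000]},
    ],
    response_format={"type": "json_object"},
    temperature=0.2,
)
topics = json.loads(topics_raw.choices[0].message.content)

# Explanation generation: keep earlier turns in the message list so each
# section is explained with the context of the sections before it.
history = [{"role": "system", "content": f"Explain each section simply. Topics: {topics}"}]
for heading, body in sections:
    history.append({"role": "user", "content": f"## {heading}\n{body}"})
    reply = client.chat.complete(model="mistral-medium-2505", messages=history, temperature=0.3)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
```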
#### 3. **Chatterbox TTS** - Audio Generation
- **Platform**: Modal-deployed APIs
- **Endpoints**:
- `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
- `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
- **Features**:
- High-quality audio synthesis
- Voice cloning capabilities
- Streaming audio responses
- Progress tracking for long generations
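Because the endpoints are plain HTTPS APIs, calling them only needs `requests`. The JSON and form field names below are assumptions; the actual contract is defined by the Modal deployment and `src/processors/generate_tts_audio.py`.
```python
import os

import requests

# Standard text-to-speech: stream the synthesized audio straight to disk.
resp = requests.post(
    os.environ["GENERATE_AUDIO_ENDPOINT"],
    json={"text": "Hello from PDF Explainer."},   # assumed payload shape
    stream=True,
    timeout=300,
)
resp.raise_for_status()
with open("narration.wav", "wb") as out:
    for chunk in resp.iter_content(chunk_size=8192):
        out.write(chunk)

# Voice cloning: send a reference voice sample as multipart form data.
with open("voice_prompt.wav", "rb") as prompt:
    cloned = requests.post(
        os.environ["GENERATE_WITH_FILE_ENDPOINT"],
        data={"text": "Cloned-voice narration."},  # assumed field names
        files={"audio_file": prompt},
        timeout=300,
    )
```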
### Processing Pipeline
```mermaid
graph TD
A[PDF Upload] --> B[Mistral OCR Processing]
B --> C[Text Extraction & Image Detection]
C --> D[Section Analysis & Heading Detection]
D --> E[Topic Identification - Ministral-8B]
E --> F[Explanation Generation - Mistral-Small]
F --> G[Text Chunking for Audio]
G --> H[Parallel Audio Processing]
H --> I[Chatterbox TTS Generation]
I --> J[Audio Concatenation]
J --> K[Final Output]
```
## πŸ”§ Installation & Setup
### Prerequisites
- Python 3.8+
- Virtual environment (recommended)
### Environment Variables
Create a `.env` file based on `.env.example`:
```bash
# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here
# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
```
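These values need to be in the process environment before the Mistral client or the TTS helpers are constructed. A minimal way to load them at startup, assuming `python-dotenv`:
```python
import os

from dotenv import load_dotenv

# Read .env into the environment before any client is built.
load_dotenv()

mistral_api_key = os.environ["MISTRAL_API_KEY"]          # required
audio_endpoint = os.getenv("GENERATE_AUDIO_ENDPOINT")    # needed once TTS is used
```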
### Installation
1. **Clone the repository**:
```bash
git clone <repository-url>
cd pdf_explainer
```
2. **Create virtual environment**:
```bash
python -m venv .venv
source .venv/Scripts/activate  # Windows (Git Bash); use .venv\Scripts\activate in cmd/PowerShell
# or
source .venv/bin/activate      # Linux/macOS
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Run the application**:
```bash
python app.py
```
## πŸš€ Usage
1. **Upload PDF**: Use the file upload interface to select your PDF document
2. **Automatic Processing**: The application will:
- Extract text using Mistral OCR
- Generate explanations using Mistral AI
- Create audio narration using Chatterbox TTS
3. **View Results**: Access extracted text, explanations, and audio in separate tabs
4. **Download**: Copy text or download audio files as needed
## πŸ“ Project Structure
```
pdf_explainer/
β”œβ”€β”€ app.py                         # Main application entry point
β”œβ”€β”€ requirements.txt               # Python dependencies
β”œβ”€β”€ .env.example                   # Environment variables template
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ processors/                # Core processing modules
β”‚   β”‚   β”œβ”€β”€ pdf_processor.py       # Main PDF processing orchestrator
β”‚   β”‚   β”œβ”€β”€ pdf_text_extractor.py  # Mistral OCR integration
β”‚   β”‚   β”œβ”€β”€ audio_processor.py     # Audio generation coordinator
β”‚   β”‚   β”œβ”€β”€ generate_tts_audio.py  # Chatterbox TTS integration
β”‚   β”‚   β”œβ”€β”€ text_chunker.py        # Text splitting for audio processing
β”‚   β”‚   β”œβ”€β”€ parallel_processor.py  # Parallel audio generation
β”‚   β”‚   └── audio_concatenator.py  # Audio chunk merging
β”‚   β”œβ”€β”€ ui_components/             # User interface components
β”‚   β”‚   β”œβ”€β”€ interface.py           # Gradio interface builder
β”‚   β”‚   └── styles.py              # CSS styling
β”‚   └── utils/                     # Utility modules
β”‚       └── text_explainer.py      # Mistral AI explanation generation
```
## πŸ”§ Key Components
### PDF Processing (`PDFTextExtractor`)
- **OCR Integration**: Processes PDFs using Mistral's latest OCR model
- **Multi-strategy Extraction**: Multiple fallback methods for text extraction
- **Image Support**: Extracts and maps images with coordinates
- **Error Handling**: Robust error recovery and debugging
### Explanation Generation (`TextExplainer`)
- **Section Analysis**: Automatic detection of markdown headings
- **Context Maintenance**: Chat history for coherent multi-section explanations
- **Topic Extraction**: Automatic identification of document themes
- **Adaptive Processing**: Skips minimal content sections to optimize API usage
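A rough sketch of the section-analysis idea: split the OCR markdown on headings and drop sections too short to be worth an API call. Function and threshold names are illustrative; the real implementation is `src/utils/text_explainer.py`.
```python
import re

def split_into_sections(markdown_text: str, min_chars: int = 200) -> list[tuple[str, str]]:
    """Split markdown on headings and skip near-empty sections (sketch only)."""
    parts = re.split(r"^(#{1,6}\s+.+)$", markdown_text, flags=re.MULTILINE)
    sections, heading = [], "Introduction"
    for part in parts:
        if re.match(r"^#{1,6}\s+", part):
            heading = part.lstrip("#").strip()        # a new heading starts a section
        elif len(part.strip()) >= min_chars:          # skip minimal content
            sections.append((heading, part.strip()))
    return sections
```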
### Audio Processing (`AudioProcessor`)
- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
- **Parallel Generation**: Concurrent audio generation for faster processing
- **Audio Concatenation**: Seamless merging with silence padding and fade effects
- **Progress Tracking**: Real-time updates during long operations
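Parallel generation can be as simple as a thread pool over the text chunks; `executor.map` returns results in input order, which lets the later concatenation step line up with the document. The TTS call itself is a placeholder here (the real one is in `src/processors/generate_tts_audio.py`).
```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_chunk(chunk: str) -> bytes:
    """Placeholder for the Chatterbox TTS request for a single chunk."""
    raise NotImplementedError

def generate_audio_parallel(chunks: list[str], max_workers: int = 4) -> list[bytes]:
    # map() preserves input order, so audio comes back in document order
    # even though chunks are synthesized concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(synthesize_chunk, chunks))
```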
## πŸŽ›οΈ Configuration Options
### Text Chunking
- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
- `overlap_sentences`: Sentence overlap between chunks for continuity
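A sketch of how a splitter can respect both settings, cutting at sentence boundaries and repeating the trailing sentences of each chunk for continuity. The shipped splitter is `src/processors/text_chunker.py`; this is illustrative only.
```python
import re

def chunk_text(text: str, max_chunk_size: int = 800, overlap_sentences: int = 1) -> list[str]:
    """Split text at sentence boundaries into chunks targeting max_chunk_size
    characters, repeating the last overlap_sentences sentences between chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chunk_size:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```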
### Audio Processing
- `max_workers`: Parallel processing threads (default: 4)
- `silence_duration`: Pause between audio chunks (default: 0.5s)
- `fade_duration`: Fade in/out effects (default: 0.1s)
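One way to get the silence padding and fades described above, assuming `pydub`; the project's concatenator is `src/processors/audio_concatenator.py` and may use a different audio library.
```python
from pydub import AudioSegment

def concatenate_chunks(paths: list[str],
                       silence_duration: float = 0.5,
                       fade_duration: float = 0.1) -> AudioSegment:
    """Join audio chunk files with silence padding and fade in/out (sketch)."""
    silence = AudioSegment.silent(duration=int(silence_duration * 1000))  # ms
    fade_ms = int(fade_duration * 1000)
    combined = AudioSegment.empty()
    for path in paths:
        segment = AudioSegment.from_file(path).fade_in(fade_ms).fade_out(fade_ms)
        combined += segment + silence
    return combined

# concatenate_chunks(["chunk_0.wav", "chunk_1.wav"]).export("explanation.wav", format="wav")
```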
### AI Models
- Mistral OCR: Latest OCR model for text extraction
- Ministral-8B: Topic extraction with structured output
- Mistral Medium (`mistral-medium-2505`): Explanation generation with chat context
## 🀝 Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and test thoroughly
4. Commit with descriptive messages: `git commit -m "Add feature description"`
5. Push to your fork: `git push origin feature-name`
6. Create a pull request
## πŸ“„ License
This project is open source and available under the [MIT License](LICENSE).
## πŸ†˜ Support
For questions, issues, or contributions:
- Create an issue in the repository
- Check the video overview for usage guidance
- Review the code documentation for technical details
---
**Built with ❀️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**