Spaces:
Sleeping
Sleeping
File size: 8,260 Bytes
41aca0e 868664a 41aca0e 065887d 41aca0e e37b0d2 065887d e37b0d2 065887d 1027486 065887d e37b0d2 dd41680 e37b0d2 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 |
---
title: Pdf Explainer
emoji: π¦
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags: [agent-demo-track]
---
# π PDF Explainer
An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.
## π₯ Video Overview
[Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)
This video explains the usage and purpose of the Pdf Explainer application.
## β¨ Features
- **π PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
- **π€ Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
- **π Audio Generation**: Convert explanations to high-quality audio narrations
- **β‘ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
- **π― Context-Aware**: Maintains context across document sections for coherent explanations
- **π± User-Friendly Interface**: Clean, responsive Gradio-based web interface
## ποΈ Architecture & Technology Stack
### Core Technologies
#### 1. **Mistral OCR** - Text Extraction
- **Model**: `mistral-ocr-latest`
- **Purpose**: Extract text and images from PDF documents
- **Features**:
- Advanced OCR capabilities with markdown formatting
- Image extraction with coordinate mapping
- Multi-page document support
- Base64 encoding for secure document processing
#### 2. **Mistral AI Models** - Content Generation
- **Topic Extraction**: `ministral-8b-2410` for document topic identification
- **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations
- **Features**:
- Structured JSON output for topic extraction
- Chat history maintenance for contextual explanations
- Temperature-controlled generation for consistent results
- Section-by-section processing with heading analysis
#### 3. **Chatterbox TTS** - Audio Generation
- **Platform**: Modal-deployed APIs
- **Endpoints**:
- `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
- `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
- **Features**:
- High-quality audio synthesis
- Voice cloning capabilities
- Streaming audio responses
- Progress tracking for long generations
### Processing Pipeline
```mermaid
graph TD
A[PDF Upload] --> B[Mistral OCR Processing]
B --> C[Text Extraction & Image Detection]
C --> D[Section Analysis & Heading Detection]
D --> E[Topic Identification - Ministral-8B]
E --> F[Explanation Generation - Mistral-Small]
F --> G[Text Chunking for Audio]
G --> H[Parallel Audio Processing]
H --> I[Chatterbox TTS Generation]
I --> J[Audio Concatenation]
J --> K[Final Output]
```
## π§ Installation & Setup
### Prerequisites
- Python 3.8+
- Virtual environment (recommended)
### Environment Variables
Create a `.env` file based on `.env.example`:
```bash
# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here
# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
```
### Installation
1. **Clone the repository**:
```bash
git clone <repository-url>
cd pdf_explainer
```
2. **Create virtual environment**:
```bash
python -m venv .venv
source .venv/Scripts/activate # Windows
# or
source .venv/bin/activate # Linux/Mac
```
3. **Install dependencies**:
```bash
pip install -r requirements.txt
```
4. **Run the application**:
```bash
python app.py
```
## π Usage
1. **Upload PDF**: Use the file upload interface to select your PDF document
2. **Automatic Processing**: The application will:
- Extract text using Mistral OCR
- Generate explanations using Mistral AI
- Create audio narration using Chatterbox TTS
3. **View Results**: Access extracted text, explanations, and audio in separate tabs
4. **Download**: Copy text or download audio files as needed
## π Project Structure
```
pdf_explainer/
βββ app.py # Main application entry point
βββ requirements.txt # Python dependencies
βββ .env.example # Environment variables template
βββ src/
β βββ processors/ # Core processing modules
β β βββ pdf_processor.py # Main PDF processing orchestrator
β β βββ pdf_text_extractor.py # Mistral OCR integration
β β βββ audio_processor.py # Audio generation coordinator
β β βββ generate_tts_audio.py # Chatterbox TTS integration
β β βββ text_chunker.py # Text splitting for audio processing
β β βββ parallel_processor.py # Parallel audio generation
β β βββ audio_concatenator.py # Audio chunk merging
β βββ ui_components/ # User interface components
β β βββ interface.py # Gradio interface builder
β β βββ styles.py # CSS styling
β βββ utils/ # Utility modules
β βββ text_explainer.py # Mistral AI explanation generation
```
## π§ Key Components
### PDF Processing (`PDFTextExtractor`)
- **OCR Integration**: Processes PDFs using Mistral's latest OCR model
- **Multi-strategy Extraction**: Multiple fallback methods for text extraction
- **Image Support**: Extracts and maps images with coordinates
- **Error Handling**: Robust error recovery and debugging
### Explanation Generation (`TextExplainer`)
- **Section Analysis**: Automatic detection of markdown headings
- **Context Maintenance**: Chat history for coherent multi-section explanations
- **Topic Extraction**: Automatic identification of document themes
- **Adaptive Processing**: Skips minimal content sections to optimize API usage
### Audio Processing (`AudioProcessor`)
- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
- **Parallel Generation**: Concurrent audio generation for faster processing
- **Audio Concatenation**: Seamless merging with silence padding and fade effects
- **Progress Tracking**: Real-time updates during long operations
## ποΈ Configuration Options
### Text Chunking
- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
- `overlap_sentences`: Sentence overlap between chunks for continuity
### Audio Processing
- `max_workers`: Parallel processing threads (default: 4)
- `silence_duration`: Pause between audio chunks (default: 0.5s)
- `fade_duration`: Fade in/out effects (default: 0.1s)
### AI Models
- Mistral OCR: Latest OCR model for text extraction
- Ministral-8B: Topic extraction with structured output
- Mistral-Small: Explanation generation with chat context
## π€ Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and test thoroughly
4. Commit with descriptive messages: `git commit -m "Add feature description"`
5. Push to your fork: `git push origin feature-name`
6. Create a pull request
## π License
This project is open source and available under the [MIT License](LICENSE).
## π Support
For questions, issues, or contributions:
- Create an issue in the repository
- Check the video overview for usage guidance
- Review the code documentation for technical details
---
**Built with β€οΈ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**
|