File size: 8,260 Bytes
41aca0e
 
 
 
 
 
 
868664a
41aca0e
065887d
41aca0e
 
e37b0d2
065887d
e37b0d2
 
 
065887d
1027486
065887d
 
e37b0d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
dd41680
e37b0d2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
---
title: Pdf Explainer
emoji: πŸ¦€
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags: [agent-demo-track]
---

# πŸ” PDF Explainer

An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.

## πŸŽ₯ Video Overview

[Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)

This video explains the usage and purpose of the Pdf Explainer application.

## ✨ Features

- **πŸ“„ PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
- **πŸ€– Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
- **πŸ”Š Audio Generation**: Convert explanations to high-quality audio narrations
- **⚑ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
- **🎯 Context-Aware**: Maintains context across document sections for coherent explanations
- **πŸ“± User-Friendly Interface**: Clean, responsive Gradio-based web interface

## πŸ—οΈ Architecture & Technology Stack

### Core Technologies

#### 1. **Mistral OCR** - Text Extraction

- **Model**: `mistral-ocr-latest`
- **Purpose**: Extract text and images from PDF documents
- **Features**:
  - Advanced OCR capabilities with markdown formatting
  - Image extraction with coordinate mapping
  - Multi-page document support
  - Base64 encoding for secure document processing

#### 2. **Mistral AI Models** - Content Generation

- **Topic Extraction**: `ministral-8b-2410` for document topic identification
- **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations
- **Features**:
  - Structured JSON output for topic extraction
  - Chat history maintenance for contextual explanations
  - Temperature-controlled generation for consistent results
  - Section-by-section processing with heading analysis

#### 3. **Chatterbox TTS** - Audio Generation

- **Platform**: Modal-deployed APIs
- **Endpoints**:
  - `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
  - `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
- **Features**:
  - High-quality audio synthesis
  - Voice cloning capabilities
  - Streaming audio responses
  - Progress tracking for long generations

### Processing Pipeline

```mermaid
graph TD
    A[PDF Upload] --> B[Mistral OCR Processing]
    B --> C[Text Extraction & Image Detection]
    C --> D[Section Analysis & Heading Detection]
    D --> E[Topic Identification - Ministral-8B]
    E --> F[Explanation Generation - Mistral-Small]
    F --> G[Text Chunking for Audio]
    G --> H[Parallel Audio Processing]
    H --> I[Chatterbox TTS Generation]
    I --> J[Audio Concatenation]
    J --> K[Final Output]
```

## πŸ”§ Installation & Setup

### Prerequisites

- Python 3.8+
- Virtual environment (recommended)

### Environment Variables

Create a `.env` file based on `.env.example`:

```bash
# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here

# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
```

### Installation

1. **Clone the repository**:

   ```bash
   git clone <repository-url>
   cd pdf_explainer
   ```

2. **Create virtual environment**:

   ```bash
   python -m venv .venv
   source .venv/Scripts/activate  # Windows
   # or
   source .venv/bin/activate      # Linux/Mac
   ```

3. **Install dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

4. **Run the application**:
   ```bash
   python app.py
   ```

## πŸš€ Usage

1. **Upload PDF**: Use the file upload interface to select your PDF document
2. **Automatic Processing**: The application will:
   - Extract text using Mistral OCR
   - Generate explanations using Mistral AI
   - Create audio narration using Chatterbox TTS
3. **View Results**: Access extracted text, explanations, and audio in separate tabs
4. **Download**: Copy text or download audio files as needed

## πŸ“ Project Structure

```
pdf_explainer/
β”œβ”€β”€ app.py                      # Main application entry point
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ .env.example               # Environment variables template
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ processors/            # Core processing modules
β”‚   β”‚   β”œβ”€β”€ pdf_processor.py          # Main PDF processing orchestrator
β”‚   β”‚   β”œβ”€β”€ pdf_text_extractor.py     # Mistral OCR integration
β”‚   β”‚   β”œβ”€β”€ audio_processor.py        # Audio generation coordinator
β”‚   β”‚   β”œβ”€β”€ generate_tts_audio.py     # Chatterbox TTS integration
β”‚   β”‚   β”œβ”€β”€ text_chunker.py           # Text splitting for audio processing
β”‚   β”‚   β”œβ”€β”€ parallel_processor.py     # Parallel audio generation
β”‚   β”‚   └── audio_concatenator.py     # Audio chunk merging
β”‚   β”œβ”€β”€ ui_components/         # User interface components
β”‚   β”‚   β”œβ”€β”€ interface.py              # Gradio interface builder
β”‚   β”‚   └── styles.py                 # CSS styling
β”‚   └── utils/                 # Utility modules
β”‚       └── text_explainer.py         # Mistral AI explanation generation
```

## πŸ”§ Key Components

### PDF Processing (`PDFTextExtractor`)

- **OCR Integration**: Processes PDFs using Mistral's latest OCR model
- **Multi-strategy Extraction**: Multiple fallback methods for text extraction
- **Image Support**: Extracts and maps images with coordinates
- **Error Handling**: Robust error recovery and debugging

### Explanation Generation (`TextExplainer`)

- **Section Analysis**: Automatic detection of markdown headings
- **Context Maintenance**: Chat history for coherent multi-section explanations
- **Topic Extraction**: Automatic identification of document themes
- **Adaptive Processing**: Skips minimal content sections to optimize API usage

### Audio Processing (`AudioProcessor`)

- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
- **Parallel Generation**: Concurrent audio generation for faster processing
- **Audio Concatenation**: Seamless merging with silence padding and fade effects
- **Progress Tracking**: Real-time updates during long operations

## πŸŽ›οΈ Configuration Options

### Text Chunking

- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
- `overlap_sentences`: Sentence overlap between chunks for continuity

### Audio Processing

- `max_workers`: Parallel processing threads (default: 4)
- `silence_duration`: Pause between audio chunks (default: 0.5s)
- `fade_duration`: Fade in/out effects (default: 0.1s)

### AI Models

- Mistral OCR: Latest OCR model for text extraction
- Ministral-8B: Topic extraction with structured output
- Mistral-Small: Explanation generation with chat context

## 🀝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and test thoroughly
4. Commit with descriptive messages: `git commit -m "Add feature description"`
5. Push to your fork: `git push origin feature-name`
6. Create a pull request

## πŸ“„ License

This project is open source and available under the [MIT License](LICENSE).

## πŸ†˜ Support

For questions, issues, or contributions:

- Create an issue in the repository
- Check the video overview for usage guidance
- Review the code documentation for technical details

---

**Built with ❀️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**