Spaces:
Sleeping
Sleeping
title: Pdf Explainer | |
emoji: π¦ | |
colorFrom: indigo | |
colorTo: yellow | |
sdk: gradio | |
sdk_version: 5.33.0 | |
app_file: app.py | |
pinned: false | |
tags: [agent-demo-track] | |
# π PDF Explainer | |
An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies. | |
## π₯ Video Overview | |
[Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg) | |
This video explains the usage and purpose of the Pdf Explainer application. | |
## β¨ Features | |
- **π PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology | |
- **π€ Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content | |
- **π Audio Generation**: Convert explanations to high-quality audio narrations | |
- **β‘ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation | |
- **π― Context-Aware**: Maintains context across document sections for coherent explanations | |
- **π± User-Friendly Interface**: Clean, responsive Gradio-based web interface | |
## ποΈ Architecture & Technology Stack | |
### Core Technologies | |
#### 1. **Mistral OCR** - Text Extraction | |
- **Model**: `mistral-ocr-latest` | |
- **Purpose**: Extract text and images from PDF documents | |
- **Features**: | |
- Advanced OCR capabilities with markdown formatting | |
- Image extraction with coordinate mapping | |
- Multi-page document support | |
- Base64 encoding for secure document processing | |
#### 2. **Mistral AI Models** - Content Generation | |
- **Topic Extraction**: `ministral-8b-2410` for document topic identification | |
- **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations | |
- **Features**: | |
- Structured JSON output for topic extraction | |
- Chat history maintenance for contextual explanations | |
- Temperature-controlled generation for consistent results | |
- Section-by-section processing with heading analysis | |
#### 3. **Chatterbox TTS** - Audio Generation | |
- **Platform**: Modal-deployed APIs | |
- **Endpoints**: | |
- `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion | |
- `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts | |
- **Features**: | |
- High-quality audio synthesis | |
- Voice cloning capabilities | |
- Streaming audio responses | |
- Progress tracking for long generations | |
### Processing Pipeline | |
```mermaid | |
graph TD | |
A[PDF Upload] --> B[Mistral OCR Processing] | |
B --> C[Text Extraction & Image Detection] | |
C --> D[Section Analysis & Heading Detection] | |
D --> E[Topic Identification - Ministral-8B] | |
E --> F[Explanation Generation - Mistral-Small] | |
F --> G[Text Chunking for Audio] | |
G --> H[Parallel Audio Processing] | |
H --> I[Chatterbox TTS Generation] | |
I --> J[Audio Concatenation] | |
J --> K[Final Output] | |
``` | |
## π§ Installation & Setup | |
### Prerequisites | |
- Python 3.8+ | |
- Virtual environment (recommended) | |
### Environment Variables | |
Create a `.env` file based on `.env.example`: | |
```bash | |
# Mistral AI API Key | |
MISTRAL_API_KEY=your_mistral_api_key_here | |
# Chatterbox TTS API Endpoints (Modal) | |
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health | |
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio | |
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json | |
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file | |
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate | |
``` | |
### Installation | |
1. **Clone the repository**: | |
```bash | |
git clone <repository-url> | |
cd pdf_explainer | |
``` | |
2. **Create virtual environment**: | |
```bash | |
python -m venv .venv | |
source .venv/Scripts/activate # Windows | |
# or | |
source .venv/bin/activate # Linux/Mac | |
``` | |
3. **Install dependencies**: | |
```bash | |
pip install -r requirements.txt | |
``` | |
4. **Run the application**: | |
```bash | |
python app.py | |
``` | |
## π Usage | |
1. **Upload PDF**: Use the file upload interface to select your PDF document | |
2. **Automatic Processing**: The application will: | |
- Extract text using Mistral OCR | |
- Generate explanations using Mistral AI | |
- Create audio narration using Chatterbox TTS | |
3. **View Results**: Access extracted text, explanations, and audio in separate tabs | |
4. **Download**: Copy text or download audio files as needed | |
## π Project Structure | |
``` | |
pdf_explainer/ | |
βββ app.py # Main application entry point | |
βββ requirements.txt # Python dependencies | |
βββ .env.example # Environment variables template | |
βββ src/ | |
β βββ processors/ # Core processing modules | |
β β βββ pdf_processor.py # Main PDF processing orchestrator | |
β β βββ pdf_text_extractor.py # Mistral OCR integration | |
β β βββ audio_processor.py # Audio generation coordinator | |
β β βββ generate_tts_audio.py # Chatterbox TTS integration | |
β β βββ text_chunker.py # Text splitting for audio processing | |
β β βββ parallel_processor.py # Parallel audio generation | |
β β βββ audio_concatenator.py # Audio chunk merging | |
β βββ ui_components/ # User interface components | |
β β βββ interface.py # Gradio interface builder | |
β β βββ styles.py # CSS styling | |
β βββ utils/ # Utility modules | |
β βββ text_explainer.py # Mistral AI explanation generation | |
``` | |
## π§ Key Components | |
### PDF Processing (`PDFTextExtractor`) | |
- **OCR Integration**: Processes PDFs using Mistral's latest OCR model | |
- **Multi-strategy Extraction**: Multiple fallback methods for text extraction | |
- **Image Support**: Extracts and maps images with coordinates | |
- **Error Handling**: Robust error recovery and debugging | |
### Explanation Generation (`TextExplainer`) | |
- **Section Analysis**: Automatic detection of markdown headings | |
- **Context Maintenance**: Chat history for coherent multi-section explanations | |
- **Topic Extraction**: Automatic identification of document themes | |
- **Adaptive Processing**: Skips minimal content sections to optimize API usage | |
### Audio Processing (`AudioProcessor`) | |
- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences) | |
- **Parallel Generation**: Concurrent audio generation for faster processing | |
- **Audio Concatenation**: Seamless merging with silence padding and fade effects | |
- **Progress Tracking**: Real-time updates during long operations | |
## ποΈ Configuration Options | |
### Text Chunking | |
- `max_chunk_size`: Maximum characters per audio chunk (default: 800) | |
- `overlap_sentences`: Sentence overlap between chunks for continuity | |
### Audio Processing | |
- `max_workers`: Parallel processing threads (default: 4) | |
- `silence_duration`: Pause between audio chunks (default: 0.5s) | |
- `fade_duration`: Fade in/out effects (default: 0.1s) | |
### AI Models | |
- Mistral OCR: Latest OCR model for text extraction | |
- Ministral-8B: Topic extraction with structured output | |
- Mistral-Small: Explanation generation with chat context | |
## π€ Contributing | |
1. Fork the repository | |
2. Create a feature branch: `git checkout -b feature-name` | |
3. Make your changes and test thoroughly | |
4. Commit with descriptive messages: `git commit -m "Add feature description"` | |
5. Push to your fork: `git push origin feature-name` | |
6. Create a pull request | |
## π License | |
This project is open source and available under the [MIT License](LICENSE). | |
## π Support | |
For questions, issues, or contributions: | |
- Create an issue in the repository | |
- Check the video overview for usage guidance | |
- Review the code documentation for technical details | |
--- | |
**Built with β€οΈ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS** | |