---
title: Pdf Explainer
emoji: 🦀
colorFrom: indigo
colorTo: yellow
sdk: gradio
sdk_version: 5.33.0
app_file: app.py
pinned: false
tags: [agent-demo-track]
---

# 🔍 PDF Explainer

An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. It turns complex PDF content into plain-language text and audio using Mistral OCR, Mistral chat models, and Chatterbox TTS.

## 🎥 Video Overview

[Watch a video overview of PDF Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)

This video explains the usage and purpose of the PDF Explainer application.

## ✨ Features

- **📄 PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
- **🤖 Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
- **🔊 Audio Generation**: Convert explanations to high-quality audio narrations
- **⚡ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
- **🎯 Context-Aware**: Maintains context across document sections for coherent explanations
- **📱 User-Friendly Interface**: Clean, responsive Gradio-based web interface

## 🏗️ Architecture & Technology Stack

### Core Technologies

#### 1. **Mistral OCR** - Text Extraction

- **Model**: `mistral-ocr-latest`
- **Purpose**: Extract text and images from PDF documents
- **Features**:
  - Advanced OCR capabilities with markdown formatting
  - Image extraction with coordinate mapping
  - Multi-page document support
  - Base64 encoding for secure document processing

#### 2. **Mistral AI Models** - Content Generation

- **Topic Extraction**: `ministral-8b-2410` for document topic identification
- **Explanation Generation**: `mistral-medium-2505` for creating simplified explanations
- **Features**:
  - Structured JSON output for topic extraction
  - Chat history maintenance for contextual explanations
  - Temperature-controlled generation for consistent results
  - Section-by-section processing with heading analysis

#### 3. **Chatterbox TTS** - Audio Generation

- **Platform**: Modal-deployed APIs
- **Endpoints**:
  - `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
  - `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
- **Features**:
  - High-quality audio synthesis
  - Voice cloning capabilities
  - Streaming audio responses
  - Progress tracking for long generations

### Processing Pipeline

```mermaid
graph TD
    A[PDF Upload] --> B[Mistral OCR Processing]
    B --> C[Text Extraction & Image Detection]
    C --> D[Section Analysis & Heading Detection]
    D --> E[Topic Identification - Ministral-8B]
    E --> F[Explanation Generation - Mistral-Small]
    F --> G[Text Chunking for Audio]
    G --> H[Parallel Audio Processing]
    H --> I[Chatterbox TTS Generation]
    I --> J[Audio Concatenation]
    J --> K[Final Output]
```

## 🔧 Installation & Setup

### Prerequisites

- Python 3.10+ (required by Gradio 5)
- Virtual environment (recommended)

### Environment Variables

Create a `.env` file based on `.env.example`:

```bash
# Mistral AI API Key
MISTRAL_API_KEY=your_mistral_api_key_here

# Chatterbox TTS API Endpoints (Modal)
HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
```

### Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd pdf_explainer
   ```

2. **Create virtual environment**:
   ```bash
   python -m venv .venv
   source .venv/Scripts/activate  # Windows (Git Bash)
   # or
   source .venv/bin/activate      # Linux/Mac
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **Run the application**:
   ```bash
   python app.py
   ```

## 🚀 Usage

1. **Upload PDF**: Use the file upload interface to select your PDF document
2. **Automatic Processing**: The application will:
   - Extract text using Mistral OCR
   - Generate explanations using Mistral AI
   - Create audio narration using Chatterbox TTS
3. **View Results**: Access extracted text, explanations, and audio in separate tabs
4. **Download**: Copy text or download audio files as needed

## 📁 Project Structure

```
pdf_explainer/
├── app.py                          # Main application entry point
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variables template
├── src/
│   ├── processors/                 # Core processing modules
│   │   ├── pdf_processor.py        # Main PDF processing orchestrator
│   │   ├── pdf_text_extractor.py   # Mistral OCR integration
│   │   ├── audio_processor.py      # Audio generation coordinator
│   │   ├── generate_tts_audio.py   # Chatterbox TTS integration
│   │   ├── text_chunker.py         # Text splitting for audio processing
│   │   ├── parallel_processor.py   # Parallel audio generation
│   │   └── audio_concatenator.py   # Audio chunk merging
│   ├── ui_components/              # User interface components
│   │   ├── interface.py            # Gradio interface builder
│   │   └── styles.py               # CSS styling
│   └── utils/                      # Utility modules
│       └── text_explainer.py       # Mistral AI explanation generation
```

## 🔧 Key Components

### PDF Processing (`PDFTextExtractor`)

- **OCR Integration**: Processes PDFs using Mistral's latest OCR model
- **Multi-strategy Extraction**: Multiple fallback methods for text extraction
- **Image Support**: Extracts and maps images with coordinates
- **Error Handling**: Robust error recovery and debugging

### Explanation Generation (`TextExplainer`)

- **Section Analysis**: Automatic detection of markdown headings
- **Context Maintenance**: Chat history for coherent multi-section explanations
- **Topic Extraction**: Automatic identification of document themes
- **Adaptive Processing**: Skips minimal content sections to optimize API usage

### Audio Processing (`AudioProcessor`)

- **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
- **Parallel Generation**: Concurrent audio generation for faster processing
- **Audio Concatenation**: Seamless merging with silence padding and fade effects
- **Progress Tracking**: Real-time updates during long operations

## 🎛️ Configuration Options

### Text Chunking

- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
- `overlap_sentences`: Sentence overlap between chunks for continuity

### Audio Processing

- `max_workers`: Parallel processing threads (default: 4)
- `silence_duration`: Pause between audio chunks (default: 0.5s)
- `fade_duration`: Fade in/out effects (default: 0.1s)

### AI Models

- Mistral OCR: Latest OCR model for text extraction
- Ministral-8B: Topic extraction with structured output
- Mistral-Small: Explanation generation with chat context

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and test thoroughly
4. Commit with descriptive messages: `git commit -m "Add feature description"`
5. Push to your fork: `git push origin feature-name`
6. Create a pull request

## 📄 License

This project is open source and available under the [MIT License](LICENSE).

## 🆘 Support

For questions, issues, or contributions:

- Create an issue in the repository
- Check the video overview for usage guidance
- Review the code documentation for technical details

---

**Built with ❤️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**
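
The chunking behaviour described under Text Chunking (split at paragraph and sentence boundaries, at most `max_chunk_size` characters per chunk) can be illustrated with a minimal sketch. This is not the code in `text_chunker.py`: the function names `split_sentences` and `chunk_text`, the regex-based sentence splitter, and the rule that a paragraph break always closes the current chunk are assumptions made for illustration. Only the `max_chunk_size` name and its default of 800 come from this README.

```python
import re
from typing import List


def split_sentences(paragraph: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def chunk_text(text: str, max_chunk_size: int = 800) -> List[str]:
    """Greedily pack sentences into chunks of at most max_chunk_size characters.

    Hypothetical helper: a paragraph break always ends the current chunk, and a
    single sentence longer than the limit is hard-split as a last resort.
    """
    chunks: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text):
        current = ""
        for sentence in split_sentences(paragraph):
            # A sentence that cannot fit in any chunk is cut at the size limit.
            while len(sentence) > max_chunk_size:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(sentence[:max_chunk_size])
                sentence = sentence[max_chunk_size:]
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= max_chunk_size:
                current = candidate  # sentence still fits in the open chunk
            else:
                chunks.append(current)  # close the chunk, start a new one
                current = sentence
        if current:
            chunks.append(current)
    return chunks
```

Calling `chunk_text(explanation)` with the default limit yields a list of strings that could each be sent to a TTS endpoint. The sketch leaves out `overlap_sentences`, since this README does not state its default or exact semantics.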