Spaces:

Agents-MCP-Hackathon
/

pdf_explainer

Sleeping

App Files Files Community

pdf_explainer / README.md

spagestic

fix: update explanation generation model to mistral-medium-2505

dd41680 about 1 month ago

preview code

raw

history blame contribute delete

8.26 kB

	---
	title: Pdf Explainer
	emoji: 🦀
	colorFrom: indigo
	colorTo: yellow
	sdk: gradio
	sdk_version: 5.33.0
	app_file: app.py
	pinned: false
	tags: [agent-demo-track]
	---

	# 🔍 PDF Explainer

	An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.

	## 🎥 Video Overview

	[Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)

	This video explains the usage and purpose of the Pdf Explainer application.

	## ✨ Features

	- 📄 PDF Text Extraction: Extract text content from PDF documents using advanced OCR technology
	- 🤖 Intelligent Explanations: Generate simple, easy-to-understand explanations of complex content
	- 🔊 Audio Generation: Convert explanations to high-quality audio narrations
	- ⚡ Parallel Processing: Efficient processing of large documents with chunking and parallel audio generation
	- 🎯 Context-Aware: Maintains context across document sections for coherent explanations
	- 📱 User-Friendly Interface: Clean, responsive Gradio-based web interface

	## 🏗️ Architecture & Technology Stack

	### Core Technologies

	#### 1. Mistral OCR - Text Extraction

	- Model: `mistral-ocr-latest`
	- Purpose: Extract text and images from PDF documents
	- Features:
	- Advanced OCR capabilities with markdown formatting
	- Image extraction with coordinate mapping
	- Multi-page document support
	- Base64 encoding for secure document processing

	#### 2. Mistral AI Models - Content Generation

	- Topic Extraction: `ministral-8b-2410` for document topic identification
	- Explanation Generation: `mistral-medium-2505` for creating simplified explanations
	- Features:
	- Structured JSON output for topic extraction
	- Chat history maintenance for contextual explanations
	- Temperature-controlled generation for consistent results
	- Section-by-section processing with heading analysis

	#### 3. Chatterbox TTS - Audio Generation

	- Platform: Modal-deployed APIs
	- Endpoints:
	- `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
	- `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
	- Features:
	- High-quality audio synthesis
	- Voice cloning capabilities
	- Streaming audio responses
	- Progress tracking for long generations

	### Processing Pipeline

	```mermaid
	graph TD
	A[PDF Upload] --> B[Mistral OCR Processing]
	B --> C[Text Extraction & Image Detection]
	C --> D[Section Analysis & Heading Detection]
	D --> E[Topic Identification - Ministral-8B]
	E --> F[Explanation Generation - Mistral-Small]
	F --> G[Text Chunking for Audio]
	G --> H[Parallel Audio Processing]
	H --> I[Chatterbox TTS Generation]
	I --> J[Audio Concatenation]
	J --> K[Final Output]
	```

	## 🔧 Installation & Setup

	### Prerequisites

	- Python 3.8+
	- Virtual environment (recommended)

	### Environment Variables

	Create a `.env` file based on `.env.example`:

	```bash
	# Mistral AI API Key
	MISTRAL_API_KEY=your_mistral_api_key_here

	# Chatterbox TTS API Endpoints (Modal)
	HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
	GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
	GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
	GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
	GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
	```

	### Installation

	1. Clone the repository:

	```bash
	git clone <repository-url>
	cd pdf_explainer
	```

	2. Create virtual environment:

	```bash
	python -m venv .venv
	source .venv/Scripts/activate # Windows
	# or
	source .venv/bin/activate # Linux/Mac
	```

	3. Install dependencies:

	```bash
	pip install -r requirements.txt
	```

	4. Run the application:
	```bash
	python app.py
	```

	## 🚀 Usage

	1. Upload PDF: Use the file upload interface to select your PDF document
	2. Automatic Processing: The application will:
	- Extract text using Mistral OCR
	- Generate explanations using Mistral AI
	- Create audio narration using Chatterbox TTS
	3. View Results: Access extracted text, explanations, and audio in separate tabs
	4. Download: Copy text or download audio files as needed

	## 📁 Project Structure

	```
	pdf_explainer/
	├── app.py # Main application entry point
	├── requirements.txt # Python dependencies
	├── .env.example # Environment variables template
	├── src/
	│ ├── processors/ # Core processing modules
	│ │ ├── pdf_processor.py # Main PDF processing orchestrator
	│ │ ├── pdf_text_extractor.py # Mistral OCR integration
	│ │ ├── audio_processor.py # Audio generation coordinator
	│ │ ├── generate_tts_audio.py # Chatterbox TTS integration
	│ │ ├── text_chunker.py # Text splitting for audio processing
	│ │ ├── parallel_processor.py # Parallel audio generation
	│ │ └── audio_concatenator.py # Audio chunk merging
	│ ├── ui_components/ # User interface components
	│ │ ├── interface.py # Gradio interface builder
	│ │ └── styles.py # CSS styling
	│ └── utils/ # Utility modules
	│ └── text_explainer.py # Mistral AI explanation generation
	```

	## 🔧 Key Components

	### PDF Processing (`PDFTextExtractor`)

	- OCR Integration: Processes PDFs using Mistral's latest OCR model
	- Multi-strategy Extraction: Multiple fallback methods for text extraction
	- Image Support: Extracts and maps images with coordinates
	- Error Handling: Robust error recovery and debugging

	### Explanation Generation (`TextExplainer`)

	- Section Analysis: Automatic detection of markdown headings
	- Context Maintenance: Chat history for coherent multi-section explanations
	- Topic Extraction: Automatic identification of document themes
	- Adaptive Processing: Skips minimal content sections to optimize API usage

	### Audio Processing (`AudioProcessor`)

	- Intelligent Chunking: Splits text at natural boundaries (paragraphs, sentences)
	- Parallel Generation: Concurrent audio generation for faster processing
	- Audio Concatenation: Seamless merging with silence padding and fade effects
	- Progress Tracking: Real-time updates during long operations

	## 🎛️ Configuration Options

	### Text Chunking

	- `max_chunk_size`: Maximum characters per audio chunk (default: 800)
	- `overlap_sentences`: Sentence overlap between chunks for continuity

	### Audio Processing

	- `max_workers`: Parallel processing threads (default: 4)
	- `silence_duration`: Pause between audio chunks (default: 0.5s)
	- `fade_duration`: Fade in/out effects (default: 0.1s)

	### AI Models

	- Mistral OCR: Latest OCR model for text extraction
	- Ministral-8B: Topic extraction with structured output
	- Mistral-Small: Explanation generation with chat context

	## 🤝 Contributing

	1. Fork the repository
	2. Create a feature branch: `git checkout -b feature-name`
	3. Make your changes and test thoroughly
	4. Commit with descriptive messages: `git commit -m "Add feature description"`
	5. Push to your fork: `git push origin feature-name`
	6. Create a pull request

	## 📄 License

	This project is open source and available under the [MIT License](LICENSE).

	## 🆘 Support

	For questions, issues, or contributions:

	- Create an issue in the repository
	- Check the video overview for usage guidance
	- Review the code documentation for technical details

	---

	Built with ❤️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS