spagestic commited on
Commit
e37b0d2
Β·
1 Parent(s): 1027486

docs: enhance README with detailed application overview, features, and installation instructions

Browse files
Files changed (1) hide show
  1. README.md +217 -2
README.md CHANGED
@@ -10,10 +10,225 @@ pinned: false
10
  tags: [agent-demo-track]
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
 
15
- ## Video Overview
 
 
16
 
17
  [Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)
18
 
19
  This video explains the usage and purpose of the Pdf Explainer application.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
10
  tags: [agent-demo-track]
11
  ---
12
 
13
+ # πŸ” PDF Explainer
14
 
15
+ An intelligent PDF processing application that extracts text from PDF documents, generates easy-to-understand explanations, and creates audio narrations. This tool transforms complex PDF content into accessible formats using cutting-edge AI technologies.
16
+
17
+ ## πŸŽ₯ Video Overview
18
 
19
  [Watch a video overview of Pdf Explainer](https://lifehkbueduhk-my.sharepoint.com/:v:/g/personal/22203133_life_hkbu_edu_hk/ESvvzCNfRJBGg0_mMwGMLGoBwBhEQLtoKc-JzOjWWQ_ZDw?nav=eyJyZWZlcnJhbEluZm8iOnsicmVmZXJyYWxBcHAiOiJPbmVEcml2ZUZvckJ1c2luZXNzIiwicmVmZXJyYWxBcHBQbGF0Zm9ybSI6IldlYiIsInJlZmVycmFsTW9kZSI6InZpZXciLCJyZWZlcnJhbFZpZXciOiJNeUZpbGVzTGlua0NvcHkifX0&e=iuKAGg)
20
 
21
  This video explains the usage and purpose of the Pdf Explainer application.
22
+
23
+ ## ✨ Features
24
+
25
+ - **πŸ“„ PDF Text Extraction**: Extract text content from PDF documents using advanced OCR technology
26
+ - **πŸ€– Intelligent Explanations**: Generate simple, easy-to-understand explanations of complex content
27
+ - **πŸ”Š Audio Generation**: Convert explanations to high-quality audio narrations
28
+ - **⚑ Parallel Processing**: Efficient processing of large documents with chunking and parallel audio generation
29
+ - **🎯 Context-Aware**: Maintains context across document sections for coherent explanations
30
+ - **πŸ“± User-Friendly Interface**: Clean, responsive Gradio-based web interface
31
+
32
+ ## πŸ—οΈ Architecture & Technology Stack
33
+
34
+ ### Core Technologies
35
+
36
+ #### 1. **Mistral OCR** - Text Extraction
37
+
38
+ - **Model**: `mistral-ocr-latest`
39
+ - **Purpose**: Extract text and images from PDF documents
40
+ - **Features**:
41
+ - Advanced OCR capabilities with markdown formatting
42
+ - Image extraction with coordinate mapping
43
+ - Multi-page document support
44
+ - Base64 encoding for secure document processing
45
+
46
+ #### 2. **Mistral AI Models** - Content Generation
47
+
48
+ - **Topic Extraction**: `ministral-8b-2410` for document topic identification
49
+ - **Explanation Generation**: `mistral-small-2503` for creating simplified explanations
50
+ - **Features**:
51
+ - Structured JSON output for topic extraction
52
+ - Chat history maintenance for contextual explanations
53
+ - Temperature-controlled generation for consistent results
54
+ - Section-by-section processing with heading analysis
55
+
56
+ #### 3. **Chatterbox TTS** - Audio Generation
57
+
58
+ - **Platform**: Modal-deployed APIs
59
+ - **Endpoints**:
60
+ - `GENERATE_AUDIO_ENDPOINT`: Standard text-to-speech conversion
61
+ - `GENERATE_WITH_FILE_ENDPOINT`: Voice cloning with custom audio prompts
62
+ - **Features**:
63
+ - High-quality audio synthesis
64
+ - Voice cloning capabilities
65
+ - Streaming audio responses
66
+ - Progress tracking for long generations
67
+
68
+ ### Processing Pipeline
69
+
70
+ ```mermaid
71
+ graph TD
72
+ A[PDF Upload] --> B[Mistral OCR Processing]
73
+ B --> C[Text Extraction & Image Detection]
74
+ C --> D[Section Analysis & Heading Detection]
75
+ D --> E[Topic Identification - Ministral-8B]
76
+ E --> F[Explanation Generation - Mistral-Small]
77
+ F --> G[Text Chunking for Audio]
78
+ G --> H[Parallel Audio Processing]
79
+ H --> I[Chatterbox TTS Generation]
80
+ I --> J[Audio Concatenation]
81
+ J --> K[Final Output]
82
+ ```
83
+
84
+ ## πŸ”§ Installation & Setup
85
+
86
+ ### Prerequisites
87
+
88
+ - Python 3.8+
89
+ - Virtual environment (recommended)
90
+
91
+ ### Environment Variables
92
+
93
+ Create a `.env` file based on `.env.example`:
94
+
95
+ ```bash
96
+ # Mistral AI API Key
97
+ MISTRAL_API_KEY=your_mistral_api_key_here
98
+
99
+ # Chatterbox TTS API Endpoints (Modal)
100
+ HEALTH_ENDPOINT=https://your-modal-endpoint/chatterbox-health
101
+ GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-audio
102
+ GENERATE_JSON_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-json
103
+ GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate-with-file
104
+ GENERATE_ENDPOINT=https://your-modal-endpoint/chatterbox-generate
105
+ ```
106
+
107
+ ### Installation
108
+
109
+ 1. **Clone the repository**:
110
+
111
+ ```bash
112
+ git clone <repository-url>
113
+ cd pdf_explainer
114
+ ```
115
+
116
+ 2. **Create virtual environment**:
117
+
118
+ ```bash
119
+ python -m venv .venv
120
+ source .venv/Scripts/activate # Windows
121
+ # or
122
+ source .venv/bin/activate # Linux/Mac
123
+ ```
124
+
125
+ 3. **Install dependencies**:
126
+
127
+ ```bash
128
+ pip install -r requirements.txt
129
+ ```
130
+
131
+ 4. **Run the application**:
132
+ ```bash
133
+ python app.py
134
+ ```
135
+
136
+ ## πŸš€ Usage
137
+
138
+ 1. **Upload PDF**: Use the file upload interface to select your PDF document
139
+ 2. **Automatic Processing**: The application will:
140
+ - Extract text using Mistral OCR
141
+ - Generate explanations using Mistral AI
142
+ - Create audio narration using Chatterbox TTS
143
+ 3. **View Results**: Access extracted text, explanations, and audio in separate tabs
144
+ 4. **Download**: Copy text or download audio files as needed
145
+
146
+ ## πŸ“ Project Structure
147
+
148
+ ```
149
+ pdf_explainer/
150
+ β”œβ”€β”€ app.py # Main application entry point
151
+ β”œβ”€β”€ requirements.txt # Python dependencies
152
+ β”œβ”€β”€ .env.example # Environment variables template
153
+ β”œβ”€β”€ src/
154
+ β”‚ β”œβ”€β”€ processors/ # Core processing modules
155
+ β”‚ β”‚ β”œβ”€β”€ pdf_processor.py # Main PDF processing orchestrator
156
+ β”‚ β”‚ β”œβ”€β”€ pdf_text_extractor.py # Mistral OCR integration
157
+ β”‚ β”‚ β”œβ”€β”€ audio_processor.py # Audio generation coordinator
158
+ β”‚ β”‚ β”œβ”€β”€ generate_tts_audio.py # Chatterbox TTS integration
159
+ β”‚ β”‚ β”œβ”€β”€ text_chunker.py # Text splitting for audio processing
160
+ β”‚ β”‚ β”œβ”€β”€ parallel_processor.py # Parallel audio generation
161
+ β”‚ β”‚ └── audio_concatenator.py # Audio chunk merging
162
+ β”‚ β”œβ”€β”€ ui_components/ # User interface components
163
+ β”‚ β”‚ β”œβ”€β”€ interface.py # Gradio interface builder
164
+ β”‚ β”‚ └── styles.py # CSS styling
165
+ β”‚ └── utils/ # Utility modules
166
+ β”‚ └── text_explainer.py # Mistral AI explanation generation
167
+ ```
168
+
169
+ ## πŸ”§ Key Components
170
+
171
+ ### PDF Processing (`PDFTextExtractor`)
172
+
173
+ - **OCR Integration**: Processes PDFs using Mistral's latest OCR model
174
+ - **Multi-strategy Extraction**: Multiple fallback methods for text extraction
175
+ - **Image Support**: Extracts and maps images with coordinates
176
+ - **Error Handling**: Robust error recovery and debugging
177
+
178
+ ### Explanation Generation (`TextExplainer`)
179
+
180
+ - **Section Analysis**: Automatic detection of markdown headings
181
+ - **Context Maintenance**: Chat history for coherent multi-section explanations
182
+ - **Topic Extraction**: Automatic identification of document themes
183
+ - **Adaptive Processing**: Skips minimal content sections to optimize API usage
184
+
185
+ ### Audio Processing (`AudioProcessor`)
186
+
187
+ - **Intelligent Chunking**: Splits text at natural boundaries (paragraphs, sentences)
188
+ - **Parallel Generation**: Concurrent audio generation for faster processing
189
+ - **Audio Concatenation**: Seamless merging with silence padding and fade effects
190
+ - **Progress Tracking**: Real-time updates during long operations
191
+
192
+ ## πŸŽ›οΈ Configuration Options
193
+
194
+ ### Text Chunking
195
+
196
+ - `max_chunk_size`: Maximum characters per audio chunk (default: 800)
197
+ - `overlap_sentences`: Sentence overlap between chunks for continuity
198
+
199
+ ### Audio Processing
200
+
201
+ - `max_workers`: Parallel processing threads (default: 4)
202
+ - `silence_duration`: Pause between audio chunks (default: 0.5s)
203
+ - `fade_duration`: Fade in/out effects (default: 0.1s)
204
+
205
+ ### AI Models
206
+
207
+ - Mistral OCR: Latest OCR model for text extraction
208
+ - Ministral-8B: Topic extraction with structured output
209
+ - Mistral-Small: Explanation generation with chat context
210
+
211
+ ## 🀝 Contributing
212
+
213
+ 1. Fork the repository
214
+ 2. Create a feature branch: `git checkout -b feature-name`
215
+ 3. Make your changes and test thoroughly
216
+ 4. Commit with descriptive messages: `git commit -m "Add feature description"`
217
+ 5. Push to your fork: `git push origin feature-name`
218
+ 6. Create a pull request
219
+
220
+ ## πŸ“„ License
221
+
222
+ This project is open source and available under the [MIT License](LICENSE).
223
+
224
+ ## πŸ†˜ Support
225
+
226
+ For questions, issues, or contributions:
227
+
228
+ - Create an issue in the repository
229
+ - Check the video overview for usage guidance
230
+ - Review the code documentation for technical details
231
+
232
+ ---
233
+
234
+ **Built with ❀️ using Mistral AI, Gradio, and Modal-deployed Chatterbox TTS**