raksama19 committed
Commit d9b3b6c · verified · 1 Parent(s): b8b71e1

Update README.md

Files changed (1)
  1. README.md +236 -1
README.md CHANGED
@@ -9,4 +9,239 @@ app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Scholar Express
## AI-Powered Accessible Academic Research Platform

Scholar Express is an AI-powered platform that transforms inaccessible scientific research papers into interactive, screen-reader compatible documents. The system addresses critical accessibility barriers faced by students with disabilities in academic research, leveraging specialized AI models to make scientific literature truly inclusive.

## 🎯 Problem Statement
According to the U.S. National Center for Education Statistics, a significant portion of undergraduate students have disabilities:
- 18% of male undergraduate students
- 22% of female undergraduate students
- 54% of nonbinary undergraduate students

These students face major barriers when conducting research, as scientific PDFs are fundamentally inaccessible to screen readers: complex mathematical equations, figures, and diagrams typically lack alt text descriptions.

## 🚀 Key Features

### Document Processing
- **OCR and layout analysis** optimized for scientific papers
- **Table and figure extraction** with proper formatting for research content
- **AI-generated alt text** specifically for scientific diagrams, charts, and equations
- **Structured markdown output** that preserves document hierarchy

### Interactive Features
- **RAG-powered chatbot** for scientific document Q&A
- **Real-time voice conversations** about research content
- **Multi-tab interface** optimized for research workflows

### Accessibility Focus
- **Screen reader compatible** output
- **Descriptive alt text** for all figures following WCAG guidelines
- **Privacy-first design** with local processing

## 🏗️ System Architecture

### Core AI Models
The platform utilizes a specialized ensemble of AI models, each optimized for specific tasks:

- **Gemma 3n 4B**: Primary engine for alt text generation and document chatbot functionality
- **Gemma 3n 2B**: Specialized for real-time voice chat interactions
- **DOLPHIN**: Handles PDF layout analysis and text extraction
- **SentenceTransformer**: Enables semantic search for Retrieval-Augmented Generation (RAG)

### Processing Pipeline

#### PDF Processing
```
PDF Upload → Image Conversion → Layout Analysis → Element Extraction → Alt Text Generation → Markdown Output
```

#### Chat System
```
User Question → Document Search → Context Retrieval → AI Response (Gemma 3n 4B)
```

#### Voice System
```
Audio Input → Speech Detection → Voice Processing → Text Response → Speech Output
```

## 📁 Project Structure

```
Scholar-Express/
├── 📄 Core Application Files
│   ├── app.py                          # Main Gradio application with multi-tab interface
│   ├── chat.py                         # Document chat functionality
│   ├── gradio_final_app.py             # Final integrated Gradio application
│   └── gradio_local_gemma.py           # Local Gemma model integration
│
├── 🔧 Configuration & Dependencies
│   ├── requirements.txt                # Main project dependencies
│   ├── requirements_gemma.txt          # Gemma-specific dependencies
│   ├── requirements_voice_gemma.txt    # Voice chat dependencies
│   ├── requirements_hf_spaces.txt      # HuggingFace Spaces deployment
│   ├── pyproject.toml                  # Project configuration (Black formatting)
│   └── config/
│       └── Dolphin.yaml                # DOLPHIN model configuration
│
├── 🛠️ Utility Modules
│   └── utils/
│       ├── markdown_utils.py           # Markdown processing utilities
│       ├── model.py                    # AI model management
│       ├── processor.py                # Document processing utilities
│       └── utils.py                    # General utility functions
│
├── 🎤 Voice Chat System
│   └── voice_chat/
│       ├── app.py                      # Voice chat Gradio interface
│       ├── gemma3n_inference.py        # Gemma 3n voice inference
│       ├── inference.py                # General inference utilities
│       ├── server.py                   # Voice chat server
│       ├── requirements.txt            # Voice-specific dependencies
│       ├── litgpt/                     # LitGPT integration
│       │   ├── config.py               # Model configuration
│       │   ├── model.py                # Model architecture
│       │   ├── tokenizer.py            # Tokenization utilities
│       │   └── generate/               # Text generation utilities
│       ├── utils/
│       │   ├── vad.py                  # Voice Activity Detection
│       │   ├── snac_utils.py           # Audio processing utilities
│       │   └── assets/
│       │       └── silero_vad.onnx     # Silero VAD model
│       └── data/samples/               # Audio sample outputs
│
├── 🤖 Pre-trained Models
│   └── hf_model/                       # HuggingFace model files
│       ├── config.json                 # Model configuration
│       ├── model.safetensors           # Model weights
│       ├── tokenizer.json              # Tokenizer configuration
│       └── generation_config.json      # Generation parameters
│
├── 🧪 Development & Demo Files
│   ├── demo_element_hf.py              # Element extraction demo
│   ├── demo_page_hf.py                 # Page processing demo
│   ├── gradio_pdf_app.py               # PDF processing demo
│   ├── gradio_image_app.py             # Image processing demo
│   ├── gradio_gemma.py                 # Gemma integration demo
│   └── gradio_gemma_api.py             # Gemma API demo
│
└── 📚 Documentation
    ├── README.md                       # This comprehensive guide
    └── Scholar_Express_Technical_Write_Up.pdf   # Detailed technical documentation
```

### 🔑 Essential Files Explained

#### Core Application
- **`app.py`**: Main entry point with complete Gradio interface featuring PDF processing, document chat, and voice interaction tabs

#### Configuration & Dependencies
- **`requirements.txt`**: Complete dependency list including PyTorch, Transformers, Gradio, PDF processing, and voice libraries
- **`requirements_voice_gemma.txt`**: Specialized dependencies for voice chat (LitGPT, SNAC, Whisper)
- **`config/Dolphin.yaml`**: Configuration file for DOLPHIN model parameters and settings

#### Utility Modules (`utils/`)
- **`model.py`**: AI model loading, initialization, and management functions
- **`processor.py`**: PDF processing, image extraction, and document parsing utilities
- **`markdown_utils.py`**: Markdown generation and formatting for accessible output
- **`utils.py`**: General helper functions for file handling and data processing

#### Voice Chat System (`voice_chat/`)
- **`gemma3n_inference.py`**: Core Gemma 3n 2B inference engine for voice processing
- **`utils/vad.py`**: Voice Activity Detection using the Silero VAD model
- **`utils/snac_utils.py`**: Audio preprocessing and formatting utilities
- **`litgpt/`**: Lightweight GPT implementation for efficient voice processing

#### Model Files (`hf_model/`)
- **`model.safetensors`**: Pre-trained model weights in SafeTensors format
- **`config.json`**: Model architecture and parameter configuration
- **`tokenizer.json`**: Tokenization rules and vocabulary

### 📋 Dependency Categories

The project uses multiple requirement files for different deployment scenarios:

| File | Purpose | Key Dependencies |
|------|---------|------------------|
| `requirements.txt` | Main application | PyTorch, Transformers, Gradio, PyMuPDF |
| `requirements_voice_gemma.txt` | Voice features | LitGPT, SNAC, Whisper, Librosa |
| `requirements_hf_spaces.txt` | HuggingFace deployment | Streamlined for cloud deployment |
| `requirements_gemma.txt` | Gemma-specific | Optimized for Gemma model usage |

### Key Components

#### PDF Processing (`app.py:convert_pdf_to_images_gradio`)
- Converts PDFs to high-quality images (2x scaling)
- Uses PyMuPDF for reliable extraction (see the sketch below)

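A rough illustration of this step, assuming PyMuPDF's standard rendering API; the actual `convert_pdf_to_images_gradio` may differ in details such as return types and Gradio integration:

```python
import fitz  # PyMuPDF
from PIL import Image

def convert_pdf_to_images(pdf_path: str, scale: float = 2.0) -> list[Image.Image]:
    """Render each PDF page as a PIL image at the given scale (2x by default)."""
    images = []
    with fitz.open(pdf_path) as doc:
        matrix = fitz.Matrix(scale, scale)  # 2x zoom for higher-fidelity layout analysis
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)  # render the page to an RGB pixmap
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images
```
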
#### Layout Analysis (`app.py:process_elements_optimized`)
- DOLPHIN identifies text blocks, tables, figures, and headers
- Maintains proper reading order for accessibility

#### Alt Text Generation
- Gemma 3n 4B processes images with accessibility-focused prompts
- Generates 1-2 sentence descriptions following WCAG guidelines
- Low temperature (0.1) for consistent, reliable output (see the sketch below)

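A minimal sketch of this step, assuming the Hugging Face Transformers `image-text-to-text` pipeline and the `google/gemma-3n-E4B-it` checkpoint; the prompt text and loading code here are illustrative, not the exact code in this repo:

```python
from transformers import pipeline

# Assumed checkpoint name for Gemma 3n 4B; the app may load and prompt the model differently.
captioner = pipeline("image-text-to-text", model="google/gemma-3n-E4B-it")

ALT_TEXT_PROMPT = (
    "Describe this scientific figure for a screen-reader user in 1-2 sentences, "
    "following WCAG alt-text guidance. Focus on the key finding, not visual styling."
)

def generate_alt_text(image_path: str) -> str:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_path},
            {"type": "text", "text": ALT_TEXT_PROMPT},
        ],
    }]
    # Low temperature (0.1) keeps descriptions consistent across runs.
    result = captioner(text=messages, max_new_tokens=80, do_sample=True,
                       temperature=0.1, return_full_text=False)
    return result[0]["generated_text"].strip()
```
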
#### RAG System
- **Document chunking**: Smart overlap-based chunking (1024 tokens, 100-token overlap)
- **Semantic retrieval**: SentenceTransformer embeddings with cosine similarity
- **Context integration**: Top-3 relevant chunks for accurate responses (see the sketch below)

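A compact sketch of the retrieval side under the parameters above; the embedding model name is an assumption, and the app's chunker may count model tokens rather than the whitespace-separated words used here:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def chunk_document(text: str, chunk_size: int = 1024, overlap: int = 100) -> list[str]:
    """Overlap-based chunking; 'tokens' are approximated by whitespace-separated words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def retrieve_context(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top-k chunks ranked by cosine similarity to the question."""
    chunk_embs = embedder.encode(chunks, convert_to_tensor=True)
    query_emb = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_embs)[0]
    best = scores.topk(k=min(top_k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]
```

The retrieved chunks would then be supplied as context alongside the user question for the Gemma 3n 4B response, as in the Chat System pipeline above.
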
#### Voice Chat System
- **Gemma 3n 2B**: Optimized for real-time voice processing
- **Silero VAD**: Voice Activity Detection to separate speech from silence
- **gTTS**: Google Text-to-Speech for audio responses
- **Audio preprocessing**: 16 kHz mono, normalized amplitude (see the sketch below)

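Two small pieces of this loop, sketched in isolation and assuming librosa for resampling and gTTS for the spoken reply; the VAD gating and Gemma 3n 2B inference that sit between them (see `voice_chat/`) are omitted:

```python
import librosa
import numpy as np
from gtts import gTTS

def preprocess_audio(path: str) -> np.ndarray:
    """Load audio as 16 kHz mono and normalize its amplitude to [-1, 1]."""
    audio, _ = librosa.load(path, sr=16000, mono=True)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def speak(text: str, out_path: str = "reply.mp3") -> str:
    """Synthesize the text response as speech with Google Text-to-Speech."""
    gTTS(text=text, lang="en").save(out_path)
    return out_path
```
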
## 🛠️ Technology Stack

| Component | Technology |
|-----------|------------|
| Frontend | Gradio web interface with streaming capabilities |
| AI Models | Gemma 3n, DOLPHIN, SentenceTransformer |
| Document Processing | PyMuPDF, OpenCV, PIL |
| Voice Processing | Librosa, Silero VAD, gTTS |
| Search | SentenceTransformers for semantic retrieval |

## 🎨 Architecture Philosophy

### Right Tool for the Right Job
- **DOLPHIN** for PDF extraction and layout analysis
- **Gemma 3n 4B** for alt text generation and document chat
- **Gemma 3n 2B** for real-time voice interaction
- Each task is matched to the model best suited for it

### Privacy-First Design
- All processing happens locally to protect sensitive academic content
- Meets institutional privacy requirements for research documents

### Accessibility Focus
- AI-generated alt text makes academic papers inclusive for visually impaired researchers
- Addresses a real gap in academic publishing accessibility

## 🚀 Getting Started

1. **Install dependencies**: `pip install -r requirements.txt` (the app relies on Gradio, PyMuPDF, and various AI model libraries)
2. **Run the application**: `python app.py`
3. **Access the interface**: Open the Gradio web interface in your browser
4. **Upload a PDF**: Use the document processing tab to convert research papers
5. **Interact**: Chat with documents or use voice features for hands-free research

## 💡 Design Challenges Solved

### Challenge 1: Narrowing Down Big Ideas
- Focused on three core applications: alt text, document chat, and voice interaction
- Chose accessibility as the primary value proposition
- Specialized each model variant (4B vs. 2B) for optimal performance

### Challenge 2: Storage Limitations
- Developed a code-first approach with thorough review before testing
- Built comprehensive error handling upfront, since debugging was expensive
- Improved documentation and commenting discipline

## 📈 Impact

Scholar Express bridges the accessibility gap in scientific research, ensuring that students with disabilities (18-54% of undergraduates, depending on gender identity) can access the same research literature as their peers, while providing enhanced interaction capabilities for all users working with complex scientific content.