Update README.md
README.md
CHANGED
@@ -9,4 +9,239 @@ app_file: app.py
pinned: false
---

# Scholar Express
## AI-Powered Accessible Academic Research Platform

Scholar Express is an AI-powered platform that converts inaccessible scientific research papers into interactive, screen-reader-compatible documents. It addresses accessibility barriers that students with disabilities face in academic research, using specialized AI models to make scientific literature more inclusive.

## 🎯 Problem Statement
According to the U.S. National Center for Education Statistics, a significant portion of undergraduate students report having a disability:
- 18% of male undergraduate students
- 22% of female undergraduate students
- 54% of nonbinary undergraduate students

These students face major barriers when conducting research: scientific PDFs are largely inaccessible to screen readers because complex mathematical equations, figures, and diagrams lack alt text descriptions.

## 🚀 Key Features

### Document Processing
- **OCR and layout analysis** optimized for scientific papers
- **Table and figure extraction** with proper formatting for research content
- **AI-generated alt text** specifically for scientific diagrams, charts, and equations
- **Structured markdown output** that preserves document hierarchy

### Interactive Features
- **RAG-powered chatbot** for scientific document Q&A
- **Real-time voice conversations** about research content
- **Multi-tab interface** optimized for research workflows

### Accessibility Focus
- **Screen reader compatible** output
- **Descriptive alt text** for all figures following WCAG guidelines
- **Privacy-first design** with local processing

## 🏗️ System Architecture

### Core AI Models
The platform uses a set of specialized AI models, each assigned to a specific task:

- **Gemma 3n 4B**: Primary engine for alt text generation and document chatbot functionality
- **Gemma 3n 2B**: Specialized for real-time voice chat interactions
- **DOLPHIN**: Handles PDF layout analysis and text extraction
- **SentenceTransformer**: Enables semantic search for Retrieval-Augmented Generation (RAG)

### Processing Pipeline

#### PDF Processing
```
PDF Upload → Image Conversion → Layout Analysis → Element Extraction → Alt Text Generation → Markdown Output
```
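
As a rough illustration of the last two stages of this pipeline, the sketch below turns a list of extracted elements (assumed to be in reading order already) into Markdown and attaches alt text to figures. The `Element` class, its field names, and `elements_to_markdown` are hypothetical; they are not taken from `utils/markdown_utils.py`.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One layout element in reading order (hypothetical structure)."""
    kind: str            # "heading", "text", or "figure"
    content: str         # text for headings/paragraphs, image path for figures
    alt_text: str = ""   # generated description, used only for figures
    level: int = 2       # heading depth, used only for headings


def elements_to_markdown(elements: list[Element]) -> str:
    """Render elements as Markdown, keeping heading hierarchy and alt text."""
    blocks = []
    for el in elements:
        if el.kind == "heading":
            blocks.append("#" * el.level + " " + el.content)
        elif el.kind == "figure":
            # Replace square brackets so the alt text cannot break the image syntax.
            alt = el.alt_text.replace("[", "(").replace("]", ")")
            blocks.append(f"![{alt}]({el.content})")
        else:
            blocks.append(el.content)
    return "\n\n".join(blocks) + "\n"


page = [
    Element("heading", "Results", level=2),
    Element("text", "Accuracy improves with model size."),
    Element("figure", "figures/fig1.png",
            alt_text="Line chart of accuracy rising with model size."),
]
markdown = elements_to_markdown(page)
print(markdown)
```

Keeping the alt text inside standard Markdown image syntax means a screen reader announces the description wherever the figure appears in the reading order.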

#### Chat System
```
User Question → Document Search → Context Retrieval → AI Response (Gemma 3n 4B)
```

#### Voice System
```
Audio Input → Speech Detection → Voice Processing → Text Response → Speech Output
```

## 📁 Project Structure

```
Scholar-Express/
├── 📄 Core Application Files
│   ├── app.py  # Main Gradio application with multi-tab interface
│   ├── chat.py  # Document chat functionality
│   ├── gradio_final_app.py  # Final integrated Gradio application
│   └── gradio_local_gemma.py  # Local Gemma model integration
│
├── 🔧 Configuration & Dependencies
│   ├── requirements.txt  # Main project dependencies
│   ├── requirements_gemma.txt  # Gemma-specific dependencies
│   ├── requirements_voice_gemma.txt  # Voice chat dependencies
│   ├── requirements_hf_spaces.txt  # HuggingFace Spaces deployment
│   ├── pyproject.toml  # Project configuration (Black formatting)
│   └── config/
│       └── Dolphin.yaml  # DOLPHIN model configuration
│
├── 🛠️ Utility Modules
│   └── utils/
│       ├── markdown_utils.py  # Markdown processing utilities
│       ├── model.py  # AI model management
│       ├── processor.py  # Document processing utilities
│       └── utils.py  # General utility functions
│
├── 🎤 Voice Chat System
│   └── voice_chat/
│       ├── app.py  # Voice chat Gradio interface
│       ├── gemma3n_inference.py  # Gemma 3n voice inference
│       ├── inference.py  # General inference utilities
│       ├── server.py  # Voice chat server
│       ├── requirements.txt  # Voice-specific dependencies
│       ├── litgpt/  # LitGPT integration
│       │   ├── config.py  # Model configuration
│       │   ├── model.py  # Model architecture
│       │   ├── tokenizer.py  # Tokenization utilities
│       │   └── generate/  # Text generation utilities
│       ├── utils/
│       │   ├── vad.py  # Voice Activity Detection
│       │   ├── snac_utils.py  # Audio processing utilities
│       │   └── assets/
│       │       └── silero_vad.onnx  # Silero VAD model
│       └── data/samples/  # Audio sample outputs
│
├── 🤖 Pre-trained Models
│   └── hf_model/  # HuggingFace model files
│       ├── config.json  # Model configuration
│       ├── model.safetensors  # Model weights
│       ├── tokenizer.json  # Tokenizer configuration
│       └── generation_config.json  # Generation parameters
│
├── 🧪 Development & Demo Files
│   ├── demo_element_hf.py  # Element extraction demo
│   ├── demo_page_hf.py  # Page processing demo
│   ├── gradio_pdf_app.py  # PDF processing demo
│   ├── gradio_image_app.py  # Image processing demo
│   ├── gradio_gemma.py  # Gemma integration demo
│   └── gradio_gemma_api.py  # Gemma API demo
│
└── 📚 Documentation
    ├── README.md  # This comprehensive guide
    └── Scholar_Express_Technical_Write_Up.pdf  # Detailed technical documentation
```

### 🔑 Essential Files Explained

#### Core Application
- **`app.py`**: Main entry point with complete Gradio interface featuring PDF processing, document chat, and voice interaction tabs

#### Configuration & Dependencies
- **`requirements.txt`**: Complete dependency list including PyTorch, Transformers, Gradio, PDF processing, and voice libraries
- **`requirements_voice_gemma.txt`**: Specialized dependencies for voice chat (LitGPT, SNAC, Whisper)
- **`config/Dolphin.yaml`**: Configuration file for DOLPHIN model parameters and settings

#### Utility Modules (`utils/`)
- **`model.py`**: AI model loading, initialization, and management functions
- **`processor.py`**: PDF processing, image extraction, and document parsing utilities
- **`markdown_utils.py`**: Markdown generation and formatting for accessible output
- **`utils.py`**: General helper functions for file handling and data processing

#### Voice Chat System (`voice_chat/`)
- **`gemma3n_inference.py`**: Core Gemma 3n 2B inference engine for voice processing
- **`utils/vad.py`**: Voice Activity Detection using Silero VAD model
- **`utils/snac_utils.py`**: Audio preprocessing and formatting utilities
- **`litgpt/`**: Lightweight GPT implementation for efficient voice processing

#### Model Files (`hf_model/`)
- **`model.safetensors`**: Pre-trained model weights in SafeTensors format
- **`config.json`**: Model architecture and parameter configuration
- **`tokenizer.json`**: Tokenization rules and vocabulary

### 📋 Dependency Categories

The project uses multiple requirement files for different deployment scenarios:

| File | Purpose | Key Dependencies |
|------|---------|------------------|
| `requirements.txt` | Main application | PyTorch, Transformers, Gradio, PyMuPDF |
| `requirements_voice_gemma.txt` | Voice features | LitGPT, SNAC, Whisper, Librosa |
| `requirements_hf_spaces.txt` | HuggingFace deployment | Streamlined for cloud deployment |
| `requirements_gemma.txt` | Gemma-specific | Optimized for Gemma model usage |

### Key Components

#### PDF Processing (`app.py:convert_pdf_to_images_gradio`)
- Converts PDFs to high-quality images (2x scaling)
- Uses PyMuPDF for reliable extraction
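
The 2x rendering step can be sketched with PyMuPDF as follows. This is a minimal example under stated assumptions, not the body of `convert_pdf_to_images_gradio`: the names `scaled_size` and `render_pdf_pages` and the output file naming are hypothetical.

```python
from pathlib import Path


def scaled_size(width_pt: float, height_pt: float, zoom: float = 2.0) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at `zoom` (PDF points are 1/72 inch)."""
    return round(width_pt * zoom), round(height_pt * zoom)


def render_pdf_pages(pdf_path: str, out_dir: str, zoom: float = 2.0) -> list[Path]:
    """Render every page of a PDF to a PNG file at the given zoom factor."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        matrix = fitz.Matrix(zoom, zoom)  # zoom 2.0 renders at 144 dpi instead of 72
        for index, page in enumerate(doc):
            pixmap = page.get_pixmap(matrix=matrix)
            target = out / f"page_{index + 1:03d}.png"
            pixmap.save(str(target))
            paths.append(target)
    return paths


# A US Letter page is 612 x 792 points, so 2x scaling gives about 1224 x 1584 pixels.
print(scaled_size(612, 792))
```

Rendering above the default 72 dpi gives the layout and OCR models more pixels per glyph, which matters for small subscripts and axis labels in scientific papers.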

#### Layout Analysis (`app.py:process_elements_optimized`)
- DOLPHIN identifies text blocks, tables, figures, headers
- Maintains proper reading order for accessibility

#### Alt Text Generation
- Gemma 3n 4B processes images with accessibility-focused prompts
- Generates 1-2 sentence descriptions following WCAG guidelines
- Low temperature (0.1) for consistent, reliable output
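
One way these settings might be packaged for a model call is sketched below. Only the 0.1 temperature and the one-to-two-sentence target come from the description above; the prompt wording, the function names, and the `max_new_tokens` value are assumptions made for illustration.

```python
import re

ALT_TEXT_TEMPERATURE = 0.1   # from the description above
MAX_SENTENCES = 2            # "1-2 sentence descriptions"


def build_alt_text_request(element_type: str) -> dict:
    """Assemble a prompt and sampling settings for one figure, table, or equation."""
    prompt = (
        f"Describe this {element_type} for a screen reader user in one or two "
        "sentences. State what it shows and its key takeaway; do not begin with "
        "'image of' or 'picture of'."
    )
    return {
        "prompt": prompt,
        "temperature": ALT_TEXT_TEMPERATURE,
        "max_new_tokens": 96,  # assumed cap, not taken from the project
    }


def trim_to_sentences(text: str, limit: int = MAX_SENTENCES) -> str:
    """Keep at most `limit` sentences of the model output."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:limit])


request = build_alt_text_request("bar chart")
raw_output = "Bar chart of error rates. Model B is lowest. It was trained longer."
print(request["temperature"], trim_to_sentences(raw_output))
```

Trimming after generation is a cheap safeguard: even at low temperature a model can run past the requested length, and long alt text is tiring to listen to.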

#### RAG System
- **Document chunking**: Overlap-based chunking (1024-token chunks with a 100-token overlap)
- **Semantic retrieval**: SentenceTransformer embeddings with cosine similarity
- **Context integration**: Top-3 relevant chunks for accurate responses
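
The chunking and retrieval steps can be sketched in plain Python. To stay dependency-free, this example treats whitespace-separated words as tokens and uses a toy bag-of-words vector in place of a SentenceTransformer embedding; the function names are hypothetical, and only the 1024/100/top-3 parameters come from the list above.

```python
import math
from collections import Counter


def chunk_tokens(tokens: list[str], size: int = 1024, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a SentenceTransformer embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question."""
    query = embed(question)
    return sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)[:k]


windows = chunk_tokens([str(i) for i in range(2000)])
print(len(windows), windows[1][0])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing roughly 10% more text.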

#### Voice Chat System
- **Gemma 3n 2B**: Optimized for real-time voice processing
- **Silero VAD**: Voice Activity Detection to separate speech from silence
- **gTTS**: Google Text-to-Speech for audio responses
- **Audio preprocessing**: 16 kHz mono, normalized amplitude
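
The audio preprocessing step (downmix to mono, resample to 16 kHz, normalize amplitude) can be illustrated without any audio library. The project lists Librosa, which would normally handle this; the dependency-free version below, including all function names and the choice of peak normalization, is an assumption meant only to show the three steps.

```python
def to_mono(frames: list[tuple[float, ...]]) -> list[float]:
    """Average the channels of each frame into a single sample."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int = 16_000) -> list[float]:
    """Resample by linear interpolation (no anti-aliasing filter; illustration only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate            # position in the source signal
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - int(pos)
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def normalize_peak(samples: list[float], peak: float = 1.0) -> list[float]:
    """Scale so the largest absolute sample equals `peak`; silence is returned unchanged."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    return [s * peak / loudest for s in samples]


def preprocess(frames: list[tuple[float, ...]], src_rate: int) -> list[float]:
    """Downmix, resample to 16 kHz, then peak-normalize."""
    return normalize_peak(resample_linear(to_mono(frames), src_rate))


stereo = [(0.0, 0.0), (0.2, 0.4), (0.5, 0.5), (-0.25, -0.25)] * 12  # 48 frames at 48 kHz
audio = preprocess(stereo, src_rate=48_000)
print(len(audio), max(abs(s) for s in audio))
```

A production pipeline would apply a low-pass filter before downsampling to avoid aliasing, which a dedicated resampler takes care of.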

## 🛠️ Technology Stack

| Component | Technology |
|-----------|------------|
| Frontend | Gradio web interface with streaming capabilities |
| AI Models | Gemma 3n, DOLPHIN, SentenceTransformer |
| Document Processing | PyMuPDF, OpenCV, PIL |
| Voice Processing | Librosa, VAD, gTTS |
| Search | SentenceTransformers for semantic retrieval |

## 🎨 Architecture Philosophy

### Right Tool for the Right Job
- **DOLPHIN** for PDF extraction and layout analysis
- **Gemma 3n 4B** for alt text generation and document chat
- **Gemma 3n 2B** for real-time voice interaction
- Each task is assigned to the model best suited to it

### Privacy-First Design
- All processing happens locally to protect sensitive academic content
- Designed to meet institutional privacy requirements for research documents

### Accessibility Focus
- AI-generated alt text makes academic papers inclusive for visually impaired researchers
- Addresses a real gap in academic publishing accessibility

## 🚀 Getting Started

1. **Install dependencies**: The app uses Gradio, PyMuPDF, and various AI model libraries, listed in `requirements.txt`
2. **Run the application**: `python app.py`
3. **Access the interface**: Open the Gradio web interface
4. **Upload a PDF**: Use the document processing tab to convert research papers
5. **Interact**: Chat with documents or use voice features for hands-free research

## 💡 Design Challenges Solved

### Challenge 1: Narrowing Down Big Ideas
- Focused on three core applications: alt text, document chat, and voice interaction
- Chose accessibility as the primary value proposition
- Specialized each model variant (4B vs 2B) for optimal performance

### Challenge 2: Storage Limitations
- Developed code-first approach with thorough review before testing
- Built comprehensive error handling upfront since debugging was expensive
- Improved documentation and commenting discipline

## 📈 Impact

Scholar Express helps close the accessibility gap in scientific research so that undergraduates who report a disability (18% to 54%, depending on gender, per the NCES figures above) can access the same research literature as their peers, while also giving all users richer ways to interact with complex scientific content.