---
title: Scholar Express
emoji: ✈️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.38.2
app_file: app.py
pinned: false
---
# Scholar Express

**AI-Powered Accessible Academic Research Platform**
Scholar Express is an innovative AI-powered platform that transforms inaccessible scientific research papers into interactive, screen-reader compatible documents. The system addresses critical accessibility barriers faced by students with disabilities in academic research, leveraging specialized AI models to make scientific literature truly inclusive.
## 🎯 Problem Statement
According to the U.S. National Center for Education Statistics, a significant portion of undergraduate students have disabilities:
- 18% of male undergraduate students
- 22% of female undergraduate students
- 54% of nonbinary undergraduate students
These students face major barriers when conducting research, as scientific PDFs are fundamentally inaccessible to screen readers due to complex mathematical equations, figures, and diagrams lacking alt text descriptions.
## 🚀 Key Features

### Document Processing
- OCR and layout analysis optimized for scientific papers
- Table and figure extraction with proper formatting for research content
- AI-generated alt text specifically for scientific diagrams, charts, and equations
- Structured markdown output that preserves document hierarchy
### Interactive Features
- RAG-powered chatbot for scientific document Q&A
- Real-time voice conversations about research content
- Multi-tab interface optimized for research workflows
### Accessibility Focus
- Screen reader compatible output
- Descriptive alt text for all figures following WCAG guidelines
- Privacy-first design with local processing
## 🏗️ System Architecture

### Core AI Models
The platform utilizes a specialized ensemble of AI models, each optimized for specific tasks:
- **Gemma 3n 4B**: Primary engine for alt text generation and document chatbot functionality
- **Gemma 3n 2B**: Specialized for real-time voice chat interactions
- **DOLPHIN**: Handles PDF layout analysis and text extraction
- **SentenceTransformer**: Enables semantic search for Retrieval-Augmented Generation (RAG)
### Processing Pipeline

**PDF Processing**

```
PDF Upload → Image Conversion → Layout Analysis → Element Extraction → Alt Text Generation → Markdown Output
```

**Chat System**

```
User Question → Document Search → Context Retrieval → AI Response (Gemma 3n 4B)
```

**Voice System**

```
Audio Input → Speech Detection → Voice Processing → Text Response → Speech Output
```
## 📁 Project Structure

```
Scholar-Express/
├── 📄 Core Application Files
│   ├── app.py                       # Main Gradio application with multi-tab interface
│   ├── chat.py                      # Document chat functionality
│   ├── gradio_final_app.py          # Final integrated Gradio application
│   └── gradio_local_gemma.py        # Local Gemma model integration
│
├── 🔧 Configuration & Dependencies
│   ├── requirements.txt             # Main project dependencies
│   ├── requirements_gemma.txt       # Gemma-specific dependencies
│   ├── requirements_voice_gemma.txt # Voice chat dependencies
│   ├── requirements_hf_spaces.txt   # HuggingFace Spaces deployment
│   ├── pyproject.toml               # Project configuration (Black formatting)
│   └── config/
│       └── Dolphin.yaml             # DOLPHIN model configuration
│
├── 🛠️ Utility Modules
│   └── utils/
│       ├── markdown_utils.py        # Markdown processing utilities
│       ├── model.py                 # AI model management
│       ├── processor.py             # Document processing utilities
│       └── utils.py                 # General utility functions
│
├── 🎤 Voice Chat System
│   └── voice_chat/
│       ├── app.py                   # Voice chat Gradio interface
│       ├── gemma3n_inference.py     # Gemma 3n voice inference
│       ├── inference.py             # General inference utilities
│       ├── server.py                # Voice chat server
│       ├── requirements.txt         # Voice-specific dependencies
│       ├── litgpt/                  # LitGPT integration
│       │   ├── config.py            # Model configuration
│       │   ├── model.py             # Model architecture
│       │   ├── tokenizer.py         # Tokenization utilities
│       │   └── generate/            # Text generation utilities
│       ├── utils/
│       │   ├── vad.py               # Voice Activity Detection
│       │   ├── snac_utils.py        # Audio processing utilities
│       │   └── assets/
│       │       └── silero_vad.onnx  # Silero VAD model
│       └── data/samples/            # Audio sample outputs
│
├── 🤖 Pre-trained Models
│   └── hf_model/                    # HuggingFace model files
│       ├── config.json              # Model configuration
│       ├── model.safetensors        # Model weights
│       ├── tokenizer.json           # Tokenizer configuration
│       └── generation_config.json   # Generation parameters
│
├── 🧪 Development & Demo Files
│   ├── demo_element_hf.py           # Element extraction demo
│   ├── demo_page_hf.py              # Page processing demo
│   ├── gradio_pdf_app.py            # PDF processing demo
│   ├── gradio_image_app.py          # Image processing demo
│   ├── gradio_gemma.py              # Gemma integration demo
│   └── gradio_gemma_api.py          # Gemma API demo
│
└── 📚 Documentation
    ├── README.md                    # This comprehensive guide
    └── Scholar_Express_Technical_Write_Up.pdf  # Detailed technical documentation
```
## 🔑 Essential Files Explained

### Core Application
- **`app.py`**: Main entry point with the complete Gradio interface, featuring PDF processing, document chat, and voice interaction tabs

### Configuration & Dependencies
- **`requirements.txt`**: Complete dependency list including PyTorch, Transformers, Gradio, PDF processing, and voice libraries
- **`requirements_voice_gemma.txt`**: Specialized dependencies for voice chat (LitGPT, SNAC, Whisper)
- **`config/Dolphin.yaml`**: Configuration file for DOLPHIN model parameters and settings

### Utility Modules (`utils/`)
- **`model.py`**: AI model loading, initialization, and management functions
- **`processor.py`**: PDF processing, image extraction, and document parsing utilities
- **`markdown_utils.py`**: Markdown generation and formatting for accessible output
- **`utils.py`**: General helper functions for file handling and data processing

### Voice Chat System (`voice_chat/`)
- **`gemma3n_inference.py`**: Core Gemma 3n 2B inference engine for voice processing
- **`utils/vad.py`**: Voice Activity Detection using the Silero VAD model
- **`utils/snac_utils.py`**: Audio preprocessing and formatting utilities
- **`litgpt/`**: Lightweight GPT implementation for efficient voice processing

### Model Files (`hf_model/`)
- **`model.safetensors`**: Pre-trained model weights in SafeTensors format
- **`config.json`**: Model architecture and parameter configuration
- **`tokenizer.json`**: Tokenization rules and vocabulary
## 📋 Dependency Categories

The project uses multiple requirement files for different deployment scenarios:

| File | Purpose | Key Dependencies |
|---|---|---|
| `requirements.txt` | Main application | PyTorch, Transformers, Gradio, PyMuPDF |
| `requirements_voice_gemma.txt` | Voice features | LitGPT, SNAC, Whisper, Librosa |
| `requirements_hf_spaces.txt` | HuggingFace deployment | Streamlined for cloud deployment |
| `requirements_gemma.txt` | Gemma-specific | Optimized for Gemma model usage |
## Key Components

### PDF Processing (`app.py:convert_pdf_to_images_gradio`)
- Converts PDFs to high-quality images (2x scaling)
- Uses PyMuPDF for reliable extraction
### Layout Analysis (`app.py:process_elements_optimized`)
- DOLPHIN identifies text blocks, tables, figures, and headers
- Maintains proper reading order for accessibility
### Alt Text Generation
- Gemma 3n 4B processes images with accessibility-focused prompts
- Generates 1-2 sentence descriptions following WCAG guidelines
- Low temperature (0.1) for consistent, reliable output
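A minimal sketch of how such a request might be assembled. The prompt wording and parameter names here are illustrative assumptions, not the app's actual prompt:

```python
def build_alt_text_request(max_sentences=2):
    """Assemble an accessibility-focused prompt and conservative
    generation settings for a figure-description model call.
    Prompt text and settings are illustrative, not the app's own."""
    prompt = (
        "Describe this scientific figure for a blind researcher in "
        f"{max_sentences} sentences or fewer. State the figure type, "
        "the variables shown, and the main trend. Do not speculate."
    )
    generation_config = {
        "temperature": 0.1,    # low temperature for consistent, reliable output
        "do_sample": True,
        "max_new_tokens": 96,  # enough headroom for 1-2 sentences
    }
    return prompt, generation_config
```

Keeping the temperature near 0.1 trades creative phrasing for reproducible descriptions, which matters when the same figure should read the same way on every pass.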
### RAG System
- **Document chunking**: Smart overlap-based chunking (1024 tokens, 100-token overlap)
- **Semantic retrieval**: SentenceTransformer embeddings with cosine similarity
- **Context integration**: Top-3 relevant chunks for accurate responses
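The chunking and retrieval steps can be sketched in plain Python. The real system embeds chunks with SentenceTransformer; the small vectors below stand in for those embeddings:

```python
import math

def chunk_tokens(tokens, size=1024, overlap=100):
    """Split a token list into overlapping chunks so no sentence is
    stranded at a chunk boundary without context."""
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query embedding."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

At answer time, the top-3 chunk texts are concatenated into the model's context window alongside the user's question, grounding the response in the document itself.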
### Voice Chat System
- **Gemma 3n 2B**: Optimized for real-time voice processing
- **Silero VAD**: Voice Activity Detection for distinguishing speech from silence
- **gTTS**: Google Text-to-Speech for audio responses
- **Audio preprocessing**: 16 kHz mono, normalized amplitude
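A toy sketch of the preprocessing shape: peak normalization plus a simple energy gate. The actual system detects speech with the Silero VAD neural model, not a threshold like this:

```python
def peak_normalize(samples, peak=0.95):
    """Scale samples so the largest absolute amplitude equals `peak`."""
    m = max((abs(s) for s in samples), default=0.0)
    if m == 0.0:
        return list(samples)
    return [s * peak / m for s in samples]

def frame_has_speech(frame, energy_threshold=1e-3):
    """Toy energy gate on a frame of 16 kHz mono samples; the real
    speech/silence decision comes from Silero VAD."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > energy_threshold
```

Normalizing amplitude before inference keeps quiet recordings from being misread as silence and loud ones from clipping downstream audio features.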
## 🛠️ Technology Stack

| Component | Technology |
|---|---|
| Frontend | Gradio web interface with streaming capabilities |
| AI Models | Gemma 3n, DOLPHIN, SentenceTransformer |
| Document Processing | PyMuPDF, OpenCV, PIL |
| Voice Processing | Librosa, VAD, gTTS |
| Search | SentenceTransformers for semantic retrieval |
## 🎨 Architecture Philosophy

### Right Tool for the Right Job
- DOLPHIN for PDF extraction and layout analysis
- Gemma 3n 4B for alt text generation and document chat
- Gemma 3n 2B for real-time voice interaction
- Each component matched to its optimal model and specialization
### Privacy-First Design
- All processing happens locally to protect sensitive academic content
- Meets institutional privacy requirements for research documents
### Accessibility Focus
- AI-generated alt text makes academic papers inclusive for visually impaired researchers
- Addresses a real gap in academic publishing accessibility
## 🚀 Getting Started

1. **Install dependencies**: `pip install -r requirements.txt` (the app uses Gradio, PyMuPDF, and various AI model libraries)
2. **Run the application**: `python app.py`
3. **Access the interface**: Open the Gradio web interface in your browser
4. **Upload a PDF**: Use the document processing tab to convert research papers
5. **Interact**: Chat with documents or use voice features for hands-free research
## 💡 Design Challenges Solved

### Challenge 1: Narrowing Down Big Ideas
- Focused on three core applications: alt text, document chat, and voice interaction
- Chose accessibility as the primary value proposition
- Specialized each model variant (4B vs 2B) for optimal performance
### Challenge 2: Storage Limitations
- Developed code-first approach with thorough review before testing
- Built comprehensive error handling upfront since debugging was expensive
- Improved documentation and commenting discipline
## 📈 Impact

Scholar Express bridges the accessibility gap in scientific research, ensuring that the 18-54% of undergraduate students with disabilities can access the same research literature as their peers, while providing enhanced interaction capabilities for all users working with complex scientific content.