# DOLPHIN PDF Document AI - HuggingFace Spaces App

A Gradio-based web application for processing PDF documents using the DOLPHIN vision-language model. This app converts PDF files to images and processes them page by page to extract text, tables, and figures.

## Features

- **PDF Upload**: Upload PDF documents directly through the web interface
- **Page-by-Page Processing**: Converts PDF pages to high-quality images and processes each individually
- **Document Parsing**: Extracts text, tables, and figures using the DOLPHIN model
- **Markdown Output**: Generates clean markdown with embedded images and tables
- **Memory Optimized**: Designed for NVIDIA T4 GPU deployment on HuggingFace Spaces
- **Progress Tracking**: Real-time progress updates during processing

## Files

- `gradio_pdf_app.py` - Main Gradio application with PDF processing functionality
- `app.py` - HuggingFace Spaces entry point
- `requirements_hf_spaces.txt` - Dependencies optimized for HF Spaces deployment

## Usage

### Local Development

```bash
# Install dependencies
pip install -r requirements_hf_spaces.txt

# Run the app
python gradio_pdf_app.py
```

### HuggingFace Spaces Deployment

1. Create a new HuggingFace Space with Gradio SDK
2. Upload the following files:
   - `app.py`
   - `gradio_pdf_app.py`
   - `utils/` (directory with utility functions)
   - `requirements_hf_spaces.txt` (rename to `requirements.txt`)
3. Configure the Space:
   - **SDK**: Gradio
   - **Hardware**: NVIDIA T4 Small (recommended)
   - **Python Version**: 3.9+

## Technical Details

### Memory Optimizations

- Uses `torch.float16` for GPU inference
- Smaller batch sizes (4) for element processing
- Memory cleanup with `torch.cuda.empty_cache()`
- Reduced max sequence length (2048) for generation

### PDF Processing Pipeline

1. **PDF to Images**: Uses PyMuPDF with 2x zoom for quality
2. **Layout Analysis**: DOLPHIN model parses document structure
3. **Element Extraction**: Processes text, tables, and figures separately
4. **Markdown Generation**: Converts results to formatted markdown
5. **Gallery View**: Creates overview of all processed pages

### Model Integration

- Uses HuggingFace transformers implementation
- Loads model with `device_map="auto"` for GPU optimization
- Batch processing for improved efficiency
- Graceful fallback to CPU if GPU unavailable

## Configuration

The app automatically detects and uses the DOLPHIN model:

- Local path: `./hf_model`
- HuggingFace Hub: `ByteDance/DOLPHIN`

## Dependencies

Core requirements:

- `torch>=2.1.0` - PyTorch for model inference
- `transformers>=4.47.0` - HuggingFace model loading
- `gradio>=5.36.0` - Web interface
- `pymupdf>=1.26.0` - PDF processing
- `pillow>=9.3.0` - Image processing
- `opencv-python-headless>=4.8.0` - Computer vision operations

## Error Handling

- Graceful handling of PDF conversion failures
- Memory management for large documents
- Progress reporting for long-running operations
- Fallback markdown generation if converter fails

## Performance Notes

- Optimized for NVIDIA T4 with 16GB VRAM
- Processing time: ~30-60 seconds per page (depends on complexity)
- Memory usage: ~8-12GB VRAM for typical documents
- CPU fallback available but significantly slower

## Example Output

The app generates:

1. **Markdown Preview**: Rendered document with LaTeX support
2. **Raw Markdown**: Source text for copying/editing
3. **Page Gallery**: Visual overview of all processed pages
4. **JSON Details**: Technical processing information

## Troubleshooting

- **Out of Memory**: Reduce batch size or use CPU
- **PDF Conversion Failed**: Check PDF format compatibility
- **Model Loading Error**: Verify model path and permissions
- **Slow Processing**: Ensure GPU is available and configured

## Credits

Built on the DOLPHIN model by ByteDance. Optimized for HuggingFace Spaces deployment.
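
## Appendix: Configuration Sketch

The Configuration and Memory Optimizations sections mention a local model path with a Hub fallback and a batch size of 4 for element processing. The snippet below is a minimal sketch of how those two pieces could look, not the code in `gradio_pdf_app.py`. The function names `resolve_model_source` and `iter_batches` are hypothetical, and the rule that the local directory takes precedence over the Hub ID is an assumption; only the values `./hf_model`, `ByteDance/DOLPHIN`, and `4` come from this README.

```python
from pathlib import Path
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Values taken from this README; everything else here is illustrative.
LOCAL_MODEL_DIR = "./hf_model"
HUB_MODEL_ID = "ByteDance/DOLPHIN"
ELEMENT_BATCH_SIZE = 4


def resolve_model_source(local_dir: str = LOCAL_MODEL_DIR,
                         hub_id: str = HUB_MODEL_ID) -> str:
    """Return the local model directory if it exists, otherwise the Hub ID.

    Assumption: a local copy is preferred over downloading from the Hub.
    """
    path = Path(local_dir)
    if path.is_dir():
        return str(path)
    return hub_id


def iter_batches(items: Sequence[T],
                 batch_size: int = ELEMENT_BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


if __name__ == "__main__":
    print("Model source:", resolve_model_source())
    # Ten placeholder elements split into chunks of at most four.
    for batch in iter_batches(list(range(10))):
        print(batch)
```

The returned string could then be passed to whatever loader the app uses. Lowering `ELEMENT_BATCH_SIZE` is one way to apply the "Reduce batch size" advice from the Troubleshooting section.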