DOLPHIN PDF Document AI - HuggingFace Spaces App

A Gradio-based web application for processing PDF documents using the DOLPHIN vision-language model. This app converts PDF files to images and processes them page by page to extract text, tables, and figures.

Features

PDF Upload: Upload PDF documents directly through the web interface
Page-by-Page Processing: Converts PDF pages to high-quality images and processes each individually
Document Parsing: Extracts text, tables, and figures using the DOLPHIN model
Markdown Output: Generates clean markdown with embedded images and tables
Memory Optimized: Designed for NVIDIA T4 GPU deployment on HuggingFace Spaces
Progress Tracking: Real-time progress updates during processing

Files

gradio_pdf_app.py - Main Gradio application with PDF processing functionality
app.py - HuggingFace Spaces entry point
requirements_hf_spaces.txt - Dependencies optimized for HF Spaces deployment

Usage

Local Development

# Install dependencies
pip install -r requirements_hf_spaces.txt

# Run the app
python gradio_pdf_app.py

HuggingFace Spaces Deployment

1. Create a new HuggingFace Space with the Gradio SDK.
2. Upload the following files:
- app.py
- gradio_pdf_app.py
- utils/ (directory with utility functions)
- requirements_hf_spaces.txt (rename to requirements.txt)
3. Configure the Space:
- SDK: Gradio
- Hardware: NVIDIA T4 Small (recommended)
- Python Version: 3.10+ (Gradio 5 requires Python 3.10 or later)

Technical Details

Memory Optimizations

Uses torch.float16 for GPU inference
Small batch size (4) for element processing
Memory cleanup with torch.cuda.empty_cache()
Reduced max sequence length (2048) for generation
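
The batching and cleanup strategy listed above could look roughly like the following sketch. The names `iter_batches`, `process_elements`, and the `run_batch` callback are hypothetical and are not taken from the app; only the batch size of 4 and the call to `torch.cuda.empty_cache()` come from the list above.

```python
from typing import Callable, Iterator, List, Sequence

try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None

BATCH_SIZE = 4  # value stated in the list above


def iter_batches(items: Sequence, batch_size: int = BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_elements(elements: Sequence, run_batch: Callable[[Sequence], List]) -> List:
    """Run run_batch (a hypothetical model call) over small batches."""
    results: List = []
    for batch in iter_batches(elements):
        results.extend(run_batch(batch))
        # Release cached GPU blocks between batches when CUDA is in use.
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()
    return results
```

Keeping batches small bounds peak activation memory, and clearing the cache between batches returns unused cached blocks to the driver, which may help on a 16GB card.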

PDF Processing Pipeline

PDF to Images: Renders each page with PyMuPDF at 2x zoom (about 144 DPI) for quality
Layout Analysis: DOLPHIN model parses document structure
Element Extraction: Processes text, tables, and figures separately
Markdown Generation: Converts results to formatted markdown
Gallery View: Creates overview of all processed pages
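
As an illustration of the first pipeline step, here is a minimal sketch of rendering PDF pages with PyMuPDF at 2x zoom. The function names `zoom_to_dpi` and `pdf_to_images` are hypothetical, and the app's actual conversion code may differ.

```python
def zoom_to_dpi(zoom: float) -> float:
    """PDF user space is 72 units per inch, so zoom 2.0 renders at 144 DPI."""
    return 72.0 * zoom


def pdf_to_images(pdf_path: str, zoom: float = 2.0) -> list:
    """Render every page of a PDF to a PIL image at the given zoom."""
    # Imported lazily so the DPI helper works without these packages installed.
    import fitz  # PyMuPDF
    from PIL import Image

    matrix = fitz.Matrix(zoom, zoom)  # scale both axes by the zoom factor
    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # alpha=False yields plain RGB samples without a transparency channel.
            pix = page.get_pixmap(matrix=matrix, alpha=False)
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images
```

A higher zoom gives the model sharper text at the cost of larger images, so the zoom factor is a trade-off between recognition quality and memory use.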

Model Integration

Uses HuggingFace transformers implementation
Loads model with device_map="auto" for GPU optimization
Batch processing for improved efficiency
Graceful fallback to CPU if GPU unavailable
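
The GPU-or-CPU decision described above can be sketched as a small pure function. The name `choose_runtime` is hypothetical, and using float32 on CPU is an assumption; the notes above only state float16 for GPU inference.

```python
from typing import Tuple


def choose_runtime(cuda_available: bool) -> Tuple[str, str]:
    """Return (device, dtype name) for the current hardware."""
    if cuda_available:
        return "cuda", "float16"  # half precision on GPU, as listed above
    return "cpu", "float32"  # assumption: full precision on CPU
```

In the app, the flag would typically come from `torch.cuda.is_available()`, and the dtype name can be resolved with `getattr(torch, name)` before it is passed to the model loader.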

Configuration

The app automatically locates the DOLPHIN model from one of these sources:

Local path: ./hf_model
HuggingFace Hub: ByteDance/DOLPHIN
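
One plausible way to choose between the two sources is sketched below. The function name `resolve_model_source` is hypothetical, and the lookup order (local directory first, then the Hub) is an assumption rather than something stated here.

```python
from pathlib import Path

LOCAL_MODEL_DIR = "./hf_model"  # local path named above
HUB_MODEL_ID = "ByteDance/DOLPHIN"  # Hub ID named above


def resolve_model_source(local_dir: str = LOCAL_MODEL_DIR, hub_id: str = HUB_MODEL_ID) -> str:
    """Prefer a local model directory if it exists, else fall back to the Hub ID."""
    # Assumption: the local copy wins when present, avoiding a download.
    if Path(local_dir).is_dir():
        return local_dir
    return hub_id
```

The returned string can be passed to a `from_pretrained` call, which accepts either a local directory or a Hub repository ID.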

Dependencies

Core requirements:

torch>=2.1.0 - PyTorch for model inference
transformers>=4.47.0 - HuggingFace model loading
gradio>=5.36.0 - Web interface
pymupdf>=1.26.0 - PDF processing
pillow>=9.3.0 - Image processing
opencv-python-headless>=4.8.0 - Computer vision operations

Error Handling

Graceful handling of PDF conversion failures
Memory management for large documents
Progress reporting for long-running operations
Fallback markdown generation if converter fails
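
The fallback behavior in the last item could be structured as in this sketch. The function names and the element shape (dictionaries with a "text" key) are hypothetical; the app's own data structures and converter interface may differ.

```python
from typing import Callable, Dict, List


def fallback_markdown(elements: List[Dict]) -> str:
    """Join the raw text of each element, skipping empty entries."""
    parts = [str(e.get("text", "")).strip() for e in elements]
    return "\n\n".join(p for p in parts if p)


def to_markdown(elements: List[Dict], converter: Callable[[List[Dict]], str]) -> str:
    """Try the main converter; fall back to plain text if it raises."""
    try:
        return converter(elements)
    except Exception as exc:  # broad on purpose: any converter failure triggers the fallback
        print(f"Markdown converter failed ({exc}); using plain-text fallback")
        return fallback_markdown(elements)
```

With this shape, a converter bug degrades the output to unformatted text instead of failing the whole document.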

Performance Notes

Optimized for NVIDIA T4 with 16GB VRAM
Processing time: ~30-60 seconds per page (depends on complexity)
Memory usage: ~8-12GB VRAM for typical documents
CPU fallback available but significantly slower

Example Output

The app generates:

Markdown Preview: Rendered document with LaTeX support
Raw Markdown: Source text for copying/editing
Page Gallery: Visual overview of all processed pages
JSON Details: Technical processing information

Troubleshooting

Out of Memory: Reduce batch size or use CPU
PDF Conversion Failed: Check PDF format compatibility
Model Loading Error: Verify model path and permissions
Slow Processing: Ensure GPU is available and configured

Credits

Built on the DOLPHIN model by ByteDance. Optimized for HuggingFace Spaces deployment.