DOLPHIN PDF Document AI - HuggingFace Spaces App

A Gradio-based web application for processing PDF documents using the DOLPHIN vision-language model. This app converts PDF files to images and processes them page by page to extract text, tables, and figures.

Features

  • PDF Upload: Upload PDF documents directly through the web interface
  • Page-by-Page Processing: Converts PDF pages to high-quality images and processes each individually
  • Document Parsing: Extracts text, tables, and figures using the DOLPHIN model
  • Markdown Output: Generates clean markdown with embedded images and tables
  • Memory Optimized: Designed for NVIDIA T4 GPU deployment on HuggingFace Spaces
  • Progress Tracking: Real-time progress updates during processing

Files

  • gradio_pdf_app.py - Main Gradio application with PDF processing functionality
  • app.py - HuggingFace Spaces entry point
  • requirements_hf_spaces.txt - Dependencies optimized for HF Spaces deployment

Usage

Local Development

# Install dependencies
pip install -r requirements_hf_spaces.txt

# Run the app
python gradio_pdf_app.py

HuggingFace Spaces Deployment

  1. Create a new HuggingFace Space with Gradio SDK

  2. Upload the following files:

    • app.py
    • gradio_pdf_app.py
    • utils/ (directory with utility functions)
    • requirements_hf_spaces.txt (rename to requirements.txt)

  3. Configure the Space:

    • SDK: Gradio
    • Hardware: NVIDIA T4 Small (recommended)
    • Python Version: 3.10+ (required by Gradio 5)

Technical Details

Memory Optimizations

  • Uses torch.float16 for GPU inference
  • Small batch size (4) for element processing
  • Memory cleanup with torch.cuda.empty_cache()
  • Reduced max sequence length (2048) for generation
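
The batching and cache cleanup listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the app's actual code: `iter_batches`, `free_gpu_cache`, `process_elements`, and the `run_batch` callback are hypothetical names, and the `torch` import is guarded so the sketch also runs where PyTorch is not installed.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 4  # small batches keep peak VRAM low (value taken from this README)


def iter_batches(items: Sequence, batch_size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def free_gpu_cache() -> None:
    """Release cached CUDA memory if PyTorch and a GPU are present."""
    try:
        import torch
    except ImportError:
        return  # nothing to clean up without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def process_elements(elements: Sequence, run_batch: Callable[[list], list]) -> List:
    """Run run_batch (a hypothetical model call) batch by batch, cleaning up after each."""
    results: List = []
    for batch in iter_batches(elements):
        results.extend(run_batch(batch))
        free_gpu_cache()
    return results
```

Calling `torch.cuda.empty_cache()` between batches returns unused cached blocks to the driver; it does not free tensors that are still referenced, so results should be moved off the GPU before the cleanup step.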

PDF Processing Pipeline

  1. PDF to Images: Uses PyMuPDF with 2x zoom for quality
  2. Layout Analysis: DOLPHIN model parses document structure
  3. Element Extraction: Processes text, tables, and figures separately
  4. Markdown Generation: Converts results to formatted markdown
  5. Gallery View: Creates overview of all processed pages
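
Step 1 of the pipeline above can be sketched as follows. PDF coordinates are measured in points (72 per inch), so a 2x zoom corresponds to rendering at 144 DPI. The function names here are hypothetical, and `render_pages` is only one plausible way to call PyMuPDF, not necessarily what `gradio_pdf_app.py` does; it imports `pymupdf` and Pillow lazily so the arithmetic helpers work without them.

```python
from typing import Tuple

PDF_POINTS_PER_INCH = 72  # PDF user space is defined at 72 points per inch


def zoom_to_dpi(zoom: float) -> float:
    """Effective rendering resolution for a given zoom factor."""
    return PDF_POINTS_PER_INCH * zoom


def rendered_size(width_pt: float, height_pt: float, zoom: float = 2.0) -> Tuple[int, int]:
    """Approximate pixel size of a page rendered at the given zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


def render_pages(pdf_path: str, zoom: float = 2.0) -> list:
    """Render every page of a PDF to a PIL image at the given zoom."""
    import pymupdf  # imported lazily so the helpers above work without it
    from PIL import Image

    images = []
    with pymupdf.open(pdf_path) as doc:
        matrix = pymupdf.Matrix(zoom, zoom)  # scale both axes by the zoom factor
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)  # RGB without alpha by default
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images
```

For a US Letter page (612 x 792 points), `rendered_size(612, 792)` gives 1224 x 1584 pixels, four times the pixel count of a 1x render, which is worth keeping in mind when memory is tight.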

Model Integration

  • Uses HuggingFace transformers implementation
  • Loads model with device_map="auto" for automatic device placement
  • Batch processing for improved efficiency
  • Graceful fallback to CPU if GPU unavailable
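
The GPU/CPU fallback described above might look like the sketch below. `cuda_available` and `select_runtime` are hypothetical helpers, and the pairing of float16 with GPU and float32 with CPU is an assumption based on the memory notes in this README.

```python
def cuda_available() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def select_runtime(use_gpu: bool) -> dict:
    """Pick half precision on GPU and full precision on CPU."""
    if use_gpu:
        return {"device": "cuda", "dtype_name": "float16"}
    return {"device": "cpu", "dtype_name": "float32"}


runtime = select_runtime(cuda_available())
print(f"Running on {runtime['device']} with {runtime['dtype_name']}")
```

The selected values would then be passed to the model's `from_pretrained` call, for example `torch_dtype=getattr(torch, runtime["dtype_name"])`, together with `device_map="auto"` on GPU. Note that `device_map` requires the `accelerate` package to be installed.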

Configuration

The app looks for the DOLPHIN model in two locations and uses whichever is available:

  • Local path: ./hf_model
  • HuggingFace Hub: ByteDance/DOLPHIN
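
One way such detection could work is sketched below. The lookup order (local directory first, Hub ID as fallback) is an assumption, and `resolve_model_source` is a hypothetical helper rather than a function from the app.

```python
from pathlib import Path

LOCAL_MODEL_DIR = "./hf_model"      # local path named in this README
HUB_MODEL_ID = "ByteDance/DOLPHIN"  # Hub ID named in this README


def resolve_model_source(local_dir: str = LOCAL_MODEL_DIR, hub_id: str = HUB_MODEL_ID) -> str:
    """Prefer a local model directory; otherwise return the Hub ID (assumed order)."""
    path = Path(local_dir)
    if path.is_dir():
        return str(path)
    return hub_id


print(f"Loading model from: {resolve_model_source()}")
```

Either return value can be handed to a transformers `from_pretrained` call, which accepts both local directories and Hub repository IDs.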

Dependencies

Core requirements:

  • torch>=2.1.0 - PyTorch for model inference
  • transformers>=4.47.0 - HuggingFace model loading
  • gradio>=5.36.0 - Web interface
  • pymupdf>=1.26.0 - PDF processing
  • pillow>=9.3.0 - Image processing
  • opencv-python-headless>=4.8.0 - Computer vision operations

Error Handling

  • Graceful handling of PDF conversion failures
  • Memory management for large documents
  • Progress reporting for long-running operations
  • Fallback markdown generation if converter fails
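
The fallback in the last item could be structured as in this sketch. The element format (a list of dicts with a `text` key), the `converter` callable, and both function names are assumptions made for illustration; the app's real data structures may differ.

```python
from typing import Callable, Dict, List


def fallback_markdown(elements: List[Dict]) -> str:
    """Join whatever text was recognized, one paragraph per non-empty element."""
    return "\n\n".join(str(el["text"]).strip() for el in elements if el.get("text"))


def to_markdown(elements: List[Dict], converter: Callable[[List[Dict]], str]) -> str:
    """Try the full converter first; fall back to plain text if it raises."""
    try:
        return converter(elements)
    except Exception as exc:
        print(f"Markdown converter failed ({exc}); using plain-text fallback")
        return fallback_markdown(elements)
```

With this shape, a failure in table or figure formatting still leaves the user with the recognized text instead of an empty result.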

Performance Notes

  • Optimized for NVIDIA T4 with 16GB VRAM
  • Processing time: ~30-60 seconds per page (depends on complexity)
  • Memory usage: ~8-12GB VRAM for typical documents
  • CPU fallback available but significantly slower
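
As a rough worked example of the per-page figure above, the following sketch turns a page count into an expected time range. `estimate_minutes` is a hypothetical helper, and the estimate assumes GPU processing and ignores model loading time.

```python
SECONDS_PER_PAGE = (30, 60)  # range quoted in this README for a T4


def estimate_minutes(num_pages: int) -> tuple:
    """Return the (low, high) processing time estimate in minutes."""
    low, high = SECONDS_PER_PAGE
    return num_pages * low / 60, num_pages * high / 60


low, high = estimate_minutes(20)
print(f"20 pages: roughly {low:.0f} to {high:.0f} minutes on a T4")
```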

Example Output

The app generates:

  1. Markdown Preview: Rendered document with LaTeX support
  2. Raw Markdown: Source text for copying/editing
  3. Page Gallery: Visual overview of all processed pages
  4. JSON Details: Technical processing information

Troubleshooting

  • Out of Memory: Reduce batch size or use CPU
  • PDF Conversion Failed: Check PDF format compatibility
  • Model Loading Error: Verify model path and permissions
  • Slow Processing: Ensure GPU is available and configured

Credits

Built on the DOLPHIN model by ByteDance. Optimized for HuggingFace Spaces deployment.