# DOLPHIN PDF Document AI - HuggingFace Spaces App
A Gradio-based web application for processing PDF documents using the DOLPHIN vision-language model. This app converts PDF files to images and processes them page by page to extract text, tables, and figures.
## Features
- **PDF Upload**: Upload PDF documents directly through the web interface
- **Page-by-Page Processing**: Converts PDF pages to high-quality images and processes each individually
- **Document Parsing**: Extracts text, tables, and figures using the DOLPHIN model
- **Markdown Output**: Generates clean markdown with embedded images and tables
- **Memory Optimized**: Designed for NVIDIA T4 GPU deployment on HuggingFace Spaces
- **Progress Tracking**: Real-time progress updates during processing
## Files
- `gradio_pdf_app.py` - Main Gradio application with PDF processing functionality
- `app.py` - HuggingFace Spaces entry point
- `requirements_hf_spaces.txt` - Dependencies optimized for HF Spaces deployment
## Usage
### Local Development
```bash
# Install dependencies
pip install -r requirements_hf_spaces.txt
# Run the app
python gradio_pdf_app.py
```
### HuggingFace Spaces Deployment
1. Create a new HuggingFace Space with the Gradio SDK
2. Upload the following files:
- `app.py` (a minimal entry-point sketch appears after this list)
- `gradio_pdf_app.py`
- `utils/` (directory with utility functions)
- `requirements_hf_spaces.txt` (rename to `requirements.txt`)
3. Configure the Space:
- **SDK**: Gradio
- **Hardware**: NVIDIA T4 Small (recommended)
- **Python Version**: 3.9+
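
For reference, a minimal `app.py` entry point might look like the sketch below. It assumes `gradio_pdf_app.py` exposes its interface as a module-level `demo` object; the actual attribute name in this repository may differ.

```python
# app.py - HuggingFace Spaces entry point (sketch)
# Assumes gradio_pdf_app exposes a module-level Gradio `demo` object.
from gradio_pdf_app import demo

if __name__ == "__main__":
    demo.launch()  # on Spaces, host/port come from the environment; no arguments needed
```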
## Technical Details
### Memory Optimizations
- Uses `torch.float16` for GPU inference
- A smaller batch size (4) for element processing (see the sketch after this list)
- Memory cleanup with `torch.cuda.empty_cache()`
- Reduced max sequence length (2048) for generation
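
Roughly how these settings might fit together in the element-processing loop; this is a sketch, and `model`, `processor`, and `crops` are illustrative names rather than the app's actual identifiers:

```python
import torch

BATCH_SIZE = 4      # small batches keep peak VRAM low on a T4
MAX_LENGTH = 2048   # cap generation length to bound memory and latency

def run_batches(model, processor, crops):
    """Run inference over image crops in small batches, releasing GPU cache in between."""
    outputs = []
    for start in range(0, len(crops), BATCH_SIZE):
        batch = processor(images=crops[start:start + BATCH_SIZE], return_tensors="pt")
        pixel_values = batch.pixel_values.to(model.device, dtype=model.dtype)
        with torch.no_grad():
            ids = model.generate(pixel_values=pixel_values, max_length=MAX_LENGTH)
        outputs.extend(processor.batch_decode(ids, skip_special_tokens=True))
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU memory between batches
    return outputs
```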
### PDF Processing Pipeline
1. **PDF to Images**: Uses PyMuPDF with 2x zoom for higher rendering quality (see the sketch after this list)
2. **Layout Analysis**: DOLPHIN model parses document structure
3. **Element Extraction**: Processes text, tables, and figures separately
4. **Markdown Generation**: Converts results to formatted markdown
5. **Gallery View**: Creates overview of all processed pages
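
Step 1, rendering PDF pages to images with PyMuPDF at 2x zoom, might look roughly like this (a sketch; the function name is illustrative):

```python
import fitz  # PyMuPDF
from PIL import Image

def pdf_to_images(pdf_path, zoom=2.0):
    """Render each page at `zoom`x resolution and return PIL images."""
    images = []
    with fitz.open(pdf_path) as doc:
        matrix = fitz.Matrix(zoom, zoom)  # 2x zoom for sharper input to the model
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images
```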
### Model Integration
- Uses the HuggingFace Transformers implementation of the model
- Loads the model with `device_map="auto"` for automatic GPU placement (loading sketch below)
- Batch processing for improved efficiency
- Graceful fallback to CPU if no GPU is available
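
A sketch of the loading logic; `VisionEncoderDecoderModel` is an assumption about the model class, and the actual code may use a different one:

```python
import torch
from transformers import AutoProcessor, VisionEncoderDecoderModel

def load_model(model_path):
    """Load DOLPHIN on GPU in half precision if possible; otherwise fall back to CPU (sketch)."""
    processor = AutoProcessor.from_pretrained(model_path)
    if torch.cuda.is_available():
        # device_map="auto" (requires `accelerate`) places weights on the GPU automatically
        model = VisionEncoderDecoderModel.from_pretrained(
            model_path, torch_dtype=torch.float16, device_map="auto"
        )
    else:
        model = VisionEncoderDecoderModel.from_pretrained(model_path)  # float32 on CPU
    return processor, model.eval()
```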
## Configuration
The app automatically detects and uses the DOLPHIN model, checking these locations in order (a short resolution sketch follows the list):
- Local path: `./hf_model`
- HuggingFace Hub: `ByteDance/DOLPHIN`
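
Resolution can be as simple as checking for the local directory first (illustrative sketch):

```python
import os

def resolve_model_path():
    # Prefer a locally bundled checkpoint; otherwise download from the Hugging Face Hub.
    return "./hf_model" if os.path.isdir("./hf_model") else "ByteDance/DOLPHIN"
```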
## Dependencies
Core requirements:
- `torch>=2.1.0` - PyTorch for model inference
- `transformers>=4.47.0` - HuggingFace model loading
- `gradio>=5.36.0` - Web interface
- `pymupdf>=1.26.0` - PDF processing
- `pillow>=9.3.0` - Image processing
- `opencv-python-headless>=4.8.0` - Computer vision operations
## Error Handling
- Graceful handling of PDF conversion failures
- Memory management for large documents
- Progress reporting for long-running operations
- Fallback markdown generation if the converter fails (see the sketch below)
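
In practice this amounts to wrapping the conversion step in a try/except and reporting progress through Gradio, along these lines (a sketch; `pdf_to_images` and `parse_page` are hypothetical helpers):

```python
import gradio as gr

def process_pdf(pdf_file, progress=gr.Progress()):
    """Convert a PDF and parse each page, surfacing failures in the UI (sketch)."""
    try:
        pages = pdf_to_images(pdf_file.name)  # pdf_to_images: hypothetical helper (see pipeline sketch)
    except Exception as exc:
        # Report conversion failures to the user instead of crashing the Space
        raise gr.Error(f"PDF conversion failed: {exc}")
    results = []
    for i, page in enumerate(pages):
        progress((i + 1) / len(pages), desc=f"Processing page {i + 1} of {len(pages)}")
        results.append(parse_page(page))  # parse_page: hypothetical per-page parser
    return results
```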
## Performance Notes
- Optimized for NVIDIA T4 with 16GB VRAM
- Processing time: ~30-60 seconds per page (depends on complexity)
- Memory usage: ~8-12GB VRAM for typical documents
- CPU fallback available but significantly slower
## Example Output
The app generates:
1. **Markdown Preview**: Rendered document with LaTeX support
2. **Raw Markdown**: Source text for copying/editing
3. **Page Gallery**: Visual overview of all processed pages
4. **JSON Details**: Technical processing information
## Troubleshooting
- **Out of Memory**: Reduce batch size or use CPU
- **PDF Conversion Failed**: Check PDF format compatibility
- **Model Loading Error**: Verify model path and permissions
- **Slow Processing**: Ensure GPU is available and configured
## Credits
Built on the DOLPHIN model by ByteDance. Optimized for HuggingFace Spaces deployment.