# DOLPHIN PDF Document AI - HuggingFace Spaces App
A Gradio-based web application for processing PDF documents using the DOLPHIN vision-language model. This app converts PDF files to images and processes them page by page to extract text, tables, and figures.
## Features
- **PDF Upload**: Upload PDF documents directly through the web interface
- **Page-by-Page Processing**: Converts PDF pages to high-quality images and processes each individually
- **Document Parsing**: Extracts text, tables, and figures using the DOLPHIN model
- **Markdown Output**: Generates clean markdown with embedded images and tables
- **Memory Optimized**: Designed for NVIDIA T4 GPU deployment on HuggingFace Spaces
- **Progress Tracking**: Real-time progress updates during processing
## Files
- `gradio_pdf_app.py` - Main Gradio application with PDF processing functionality
- `app.py` - HuggingFace Spaces entry point
- `requirements_hf_spaces.txt` - Dependencies optimized for HF Spaces deployment
## Usage
### Local Development
```bash
# Install dependencies
pip install -r requirements_hf_spaces.txt
# Run the app
python gradio_pdf_app.py
```
### HuggingFace Spaces Deployment
1. Create a new HuggingFace Space with the Gradio SDK
2. Upload the following files:
   - `app.py`
   - `gradio_pdf_app.py`
   - `utils/` (directory with utility functions)
   - `requirements_hf_spaces.txt` (rename to `requirements.txt`)
3. Configure the Space:
   - **SDK**: Gradio
   - **Hardware**: NVIDIA T4 Small (recommended)
   - **Python Version**: 3.10+ (Gradio 5.x, required by `gradio>=5.36.0`, does not support Python 3.9)
## Technical Details
### Memory Optimizations
- Uses `torch.float16` for GPU inference
- Smaller batch sizes (4) for element processing
- Memory cleanup with `torch.cuda.empty_cache()`
- Reduced max sequence length (2048) for generation
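
The batching and cleanup pattern listed above might look roughly like the following sketch. The helper names `iter_batches`, `free_gpu_cache`, and `process_in_batches`, and the `run_batch` callback, are hypothetical and not taken from the app; only the batch size of 4 and the `torch.cuda.empty_cache()` call come from the list.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(items: Sequence[T], batch_size: int = 4) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def free_gpu_cache() -> None:
    """Release cached CUDA memory if PyTorch and a GPU are present."""
    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed, so there is nothing to free
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def process_in_batches(
    items: Sequence[T],
    run_batch: Callable[[Sequence[T]], List[R]],  # hypothetical model call
    batch_size: int = 4,
) -> List[R]:
    """Run `run_batch` on small batches, clearing the CUDA cache after each."""
    results: List[R] = []
    for batch in iter_batches(items, batch_size):
        results.extend(run_batch(batch))
        free_gpu_cache()
    return results
```

Keeping batches small bounds peak activation memory, and clearing the cache between batches returns unused blocks to the driver, at some cost in throughput.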
### PDF Processing Pipeline
1. **PDF to Images**: Uses PyMuPDF with 2x zoom for quality
2. **Layout Analysis**: DOLPHIN model parses document structure
3. **Element Extraction**: Processes text, tables, and figures separately
4. **Markdown Generation**: Converts results to formatted markdown
5. **Gallery View**: Creates overview of all processed pages
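
As an illustration of step 1, the sketch below renders each page with PyMuPDF at 2x zoom. PDF coordinates are in points (1/72 inch), so a zoom of 2 corresponds to roughly 144 DPI. The function names `rendered_size` and `pdf_to_images` are hypothetical, and the app's actual conversion code may differ.

```python
from typing import List, Tuple


def rendered_size(width_pt: float, height_pt: float, zoom: float = 2.0) -> Tuple[int, int]:
    """Approximate pixel size of a page rendered at `zoom` (1 pt = 1/72 inch)."""
    return round(width_pt * zoom), round(height_pt * zoom)


def pdf_to_images(pdf_path: str, zoom: float = 2.0) -> List:
    """Render every page of a PDF to a PIL image at the given zoom factor."""
    # Imported lazily so the size helper above works without these packages.
    import pymupdf
    from PIL import Image

    matrix = pymupdf.Matrix(zoom, zoom)  # scale both axes by `zoom`
    images = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix)  # RGB without alpha by default
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images


# A US Letter page (612 x 792 pt) at 2x zoom:
print(rendered_size(612, 792))  # (1224, 1584)
```

Doubling the zoom quadruples the pixel count per page, which helps the model read small text but also increases memory use.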
### Model Integration
- Uses HuggingFace transformers implementation
- Loads model with `device_map="auto"` for GPU optimization
- Batch processing for improved efficiency
- Graceful fallback to CPU if GPU unavailable
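
One possible way to express the GPU/CPU choice is to build the keyword arguments for `from_pretrained` up front, as in this sketch. The function names are hypothetical, and using `float32` on CPU is an assumption, since the list above only specifies `float16` and `device_map="auto"` for the GPU path.

```python
from typing import Any, Dict, Optional


def cuda_is_available() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def model_load_kwargs(use_gpu: Optional[bool] = None) -> Dict[str, Any]:
    """Build keyword arguments for a transformers `from_pretrained` call."""
    if use_gpu is None:
        use_gpu = cuda_is_available()
    if use_gpu:
        # Half precision with automatic device placement, as described above.
        return {"torch_dtype": "float16", "device_map": "auto"}
    # Assumption: full precision on CPU, where float16 support is limited.
    return {"torch_dtype": "float32"}
```

The dtype is given by name here to keep the sketch free of a hard PyTorch dependency. It can be converted with `getattr(torch, name)` before the result is passed to the model class's `from_pretrained`.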
## Configuration
The app automatically detects and uses the DOLPHIN model:
- Local path: `./hf_model`
- HuggingFace Hub: `ByteDance/DOLPHIN`
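
A minimal sketch of this detection, assuming the local directory takes precedence over the Hub when it exists (the lookup order is not stated above). The function name `resolve_model_source` is hypothetical.

```python
from pathlib import Path

LOCAL_MODEL_DIR = "./hf_model"
HUB_MODEL_ID = "ByteDance/DOLPHIN"


def resolve_model_source(local_dir: str = LOCAL_MODEL_DIR, hub_id: str = HUB_MODEL_ID) -> str:
    """Prefer a local model directory when it exists, otherwise use the Hub ID."""
    if Path(local_dir).is_dir():
        return local_dir
    return hub_id
```

The returned string can be passed directly to `from_pretrained`, which accepts both local directories and Hub repository IDs.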
## Dependencies
Core requirements:
- `torch>=2.1.0` - PyTorch for model inference
- `transformers>=4.47.0` - HuggingFace model loading
- `gradio>=5.36.0` - Web interface
- `pymupdf>=1.26.0` - PDF processing
- `pillow>=9.3.0` - Image processing
- `opencv-python-headless>=4.8.0` - Computer vision operations
## Error Handling
- Graceful handling of PDF conversion failures
- Memory management for large documents
- Progress reporting for long-running operations
- Fallback markdown generation if converter fails
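
The fallback behavior in the last item could be sketched as follows. The element shape (a dict with a `"text"` key) and the names `fallback_markdown` and `to_markdown` are assumptions made for illustration; the app's converter and data structures are not shown in this README.

```python
from typing import Callable, Dict, List


def fallback_markdown(elements: List[Dict[str, str]]) -> str:
    """Join element texts as plain paragraphs, skipping empty ones."""
    texts = (el.get("text", "").strip() for el in elements)
    return "\n\n".join(t for t in texts if t)


def to_markdown(
    elements: List[Dict[str, str]],
    converter: Callable[[List[Dict[str, str]]], str],  # hypothetical converter
) -> str:
    """Use the full converter, but fall back to plain text if it raises."""
    try:
        return converter(elements)
    except Exception:
        # Degrade gracefully: losing formatting is better than losing the page.
        return fallback_markdown(elements)
```

Catching the broad `Exception` is deliberate in this sketch, because one malformed element should not discard the whole page's output.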
## Performance Notes
- Optimized for NVIDIA T4 with 16 GB VRAM
- Processing time: ~30-60 seconds per page (depends on complexity)
- Memory usage: ~8-12 GB VRAM for typical documents
- CPU fallback available but significantly slower
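
For a rough sense of total runtime, the per-page figures above can be scaled linearly by page count. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it ignores model loading time and variation between pages.

```python
from typing import Tuple


def estimate_processing_seconds(
    num_pages: int, min_per_page: float = 30.0, max_per_page: float = 60.0
) -> Tuple[float, float]:
    """Return a (low, high) estimate in seconds, assuming linear scaling."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * min_per_page, num_pages * max_per_page


low, high = estimate_processing_seconds(10)
print(f"10 pages: about {low / 60:.0f} to {high / 60:.0f} minutes")
# prints "10 pages: about 5 to 10 minutes"
```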
## Example Output
The app generates:
1. **Markdown Preview**: Rendered document with LaTeX support
2. **Raw Markdown**: Source text for copying/editing
3. **Page Gallery**: Visual overview of all processed pages
4. **JSON Details**: Technical processing information
## Troubleshooting
- **Out of Memory**: Reduce batch size or use CPU
- **PDF Conversion Failed**: Check PDF format compatibility
- **Model Loading Error**: Verify model path and permissions
- **Slow Processing**: Ensure GPU is available and configured
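
The "reduce batch size" advice can also be automated. The sketch below halves the batch size whenever a batch runs out of memory and retries from the same position. It is not part of the app: `run_with_backoff` and `run_batch` are hypothetical names, and `MemoryError` stands in for the GPU error type (with PyTorch, `torch.cuda.OutOfMemoryError` could be passed as `oom_errors`).

```python
from typing import Callable, List, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_backoff(
    run_batch: Callable[[Sequence[T]], List[R]],  # hypothetical model call
    items: Sequence[T],
    batch_size: int = 4,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> List[R]:
    """Process `items` in batches, halving the batch size after each OOM error."""
    results: List[R] = []
    start = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.extend(run_batch(batch))
        except oom_errors:
            if batch_size == 1:
                raise  # cannot shrink further, so let the caller fall back to CPU
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results
```

Once the batch size has been reduced it stays reduced for the rest of the run, which avoids hitting the same memory limit repeatedly.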
## Credits
Built on the DOLPHIN model by ByteDance. Optimized for HuggingFace Spaces deployment.