# DOLPHIN PDF Document AI - HuggingFace Spaces App

A Gradio-based web application for processing PDF documents using the DOLPHIN vision-language model. This app converts PDF files to images and processes them page by page to extract text, tables, and figures.

## Features

- **PDF Upload**: Upload PDF documents directly through the web interface
- **Page-by-Page Processing**: Converts PDF pages to high-quality images and processes each individually
- **Document Parsing**: Extracts text, tables, and figures using the DOLPHIN model
- **Markdown Output**: Generates clean markdown with embedded images and tables
- **Memory-Optimized**: Tuned to fit within the memory of an NVIDIA T4 GPU on HuggingFace Spaces
- **Progress Tracking**: Real-time progress updates during processing

## Files

- `gradio_pdf_app.py` - Main Gradio application with PDF processing functionality
- `app.py` - HuggingFace Spaces entry point
- `requirements_hf_spaces.txt` - Dependencies optimized for HF Spaces deployment

## Usage

### Local Development

```bash
# Install dependencies
pip install -r requirements_hf_spaces.txt

# Run the app
python gradio_pdf_app.py
```

### HuggingFace Spaces Deployment

1. Create a new HuggingFace Space with the Gradio SDK
2. Upload the following files:
   - `app.py`
   - `gradio_pdf_app.py`
   - `utils/` (directory with utility functions)
   - `requirements_hf_spaces.txt` (rename to `requirements.txt`)

3. Configure the Space:
   - **SDK**: Gradio
   - **Hardware**: NVIDIA T4 Small (recommended)
   - **Python Version**: 3.10+ (required by Gradio 5)

## Technical Details

### Memory Optimizations

- Uses `torch.float16` for GPU inference
- Smaller batch sizes (4) for element processing
- Memory cleanup with `torch.cuda.empty_cache()`
- Reduced max sequence length (2048) for generation
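
The batching and cleanup steps above can be sketched as follows. This is a minimal illustration, not the app's actual code: `iter_batches`, `process_in_batches`, and the `cleanup` callback are hypothetical names, and the only detail taken from this README is the batch size of 4. On a GPU host the callback would typically be `torch.cuda.empty_cache`.

```python
from typing import Callable, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_batches(items: Sequence[T], batch_size: int = 4) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(
    items: Sequence[T],
    run_batch: Callable[[Sequence[T]], List[R]],
    batch_size: int = 4,
    cleanup: Optional[Callable[[], None]] = None,
) -> List[R]:
    """Run `run_batch` on each batch and call `cleanup` after every batch.

    `cleanup` is a hypothetical hook; pass torch.cuda.empty_cache to
    release cached GPU memory between batches.
    """
    results: List[R] = []
    for batch in iter_batches(items, batch_size):
        results.extend(run_batch(batch))
        if cleanup is not None:
            cleanup()
    return results
```

Keeping the cleanup hook as a parameter lets the same loop run unchanged on CPU-only machines, where no cache needs to be cleared.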

### PDF Processing Pipeline

1. **PDF to Images**: Uses PyMuPDF with 2x zoom for quality
2. **Layout Analysis**: DOLPHIN model parses document structure
3. **Element Extraction**: Processes text, tables, and figures separately
4. **Markdown Generation**: Converts results to formatted markdown
5. **Gallery View**: Creates overview of all processed pages
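
The first pipeline step can be sketched with PyMuPDF as below. The function names are illustrative assumptions, not the app's own; only the 2x zoom comes from this README. PyMuPDF renders at 72 dpi by default, so a zoom factor of 2 corresponds to roughly 144 dpi.

```python
from typing import List, Tuple


def zoomed_pixel_size(width_pt: float, height_pt: float, zoom: float = 2.0) -> Tuple[int, int]:
    """Approximate pixel size of a page (given in PDF points) after zooming."""
    return round(width_pt * zoom), round(height_pt * zoom)


def pdf_to_images(pdf_path: str, zoom: float = 2.0) -> List["Image.Image"]:
    """Render every page of a PDF to a PIL image (hypothetical helper)."""
    # Imported lazily so zoomed_pixel_size works without these packages.
    import pymupdf
    from PIL import Image

    images = []
    with pymupdf.open(pdf_path) as doc:
        matrix = pymupdf.Matrix(zoom, zoom)  # scale both axes by `zoom`
        for page in doc:
            pix = page.get_pixmap(matrix=matrix, alpha=False)
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images
```

For an A4 page of 595 x 842 points, `zoomed_pixel_size(595, 842)` gives 1190 x 1684 pixels, which indicates the image size the model receives per page.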

### Model Integration

- Uses the HuggingFace `transformers` implementation of the model
- Loads the model with `device_map="auto"`, which places the weights on the GPU when one is available
- Batches element-level inference for better throughput
- Falls back to CPU gracefully if no GPU is available
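
The CPU fallback can be sketched as a small device-selection step. The helper names here are hypothetical, and using `float32` on CPU is an assumption; this README only states that `torch.float16` is used on the GPU.

```python
from typing import Tuple


def choose_device_and_dtype(cuda_available: bool) -> Tuple[str, str]:
    """Pick a device and dtype name: half precision on GPU, full precision on CPU."""
    if cuda_available:
        return "cuda", "float16"
    # Assumption: float32 on CPU, where half precision is often slow or unsupported.
    return "cpu", "float32"


def detect_cuda() -> bool:
    """Return True only if torch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    device, dtype_name = choose_device_and_dtype(detect_cuda())
    print(f"Running on {device} with {dtype_name}")
```

The returned dtype name can be mapped to the real dtype with `getattr(torch, dtype_name)` before it is passed to the model loader.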

## Configuration

The app locates the DOLPHIN model automatically from one of two sources:
- Local path: `./hf_model`
- HuggingFace Hub: `ByteDance/DOLPHIN`
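
One plausible way to implement this lookup is shown below. It assumes the local directory takes precedence over the Hub, which this README does not state, and `resolve_model_source` is a hypothetical name.

```python
from pathlib import Path

LOCAL_MODEL_DIR = "./hf_model"
HUB_MODEL_ID = "ByteDance/DOLPHIN"


def resolve_model_source(local_dir: str = LOCAL_MODEL_DIR, hub_id: str = HUB_MODEL_ID) -> str:
    """Return the local model directory if it exists, otherwise the Hub ID.

    Assumption: a local copy is preferred so that no download is needed.
    """
    if Path(local_dir).is_dir():
        return local_dir
    return hub_id
```

The returned string can be passed directly to a `from_pretrained` call, which accepts either a local directory or a Hub repository ID.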

## Dependencies

Core requirements:
- `torch>=2.1.0` - PyTorch for model inference
- `transformers>=4.47.0` - HuggingFace model loading
- `gradio>=5.36.0` - Web interface
- `pymupdf>=1.26.0` - PDF processing
- `pillow>=9.3.0` - Image processing
- `opencv-python-headless>=4.8.0` - Computer vision operations

## Error Handling

- Graceful handling of PDF conversion failures
- Memory management for large documents
- Progress reporting for long-running operations
- Fallback markdown generation if converter fails
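
The converter fallback can be sketched as a try/except wrapper. The element format (a list of dicts with a `text` key) and both function names are assumptions made for illustration; the app's real data structures may differ.

```python
from typing import Callable, Dict, List

Element = Dict[str, object]


def fallback_markdown(elements: List[Element]) -> str:
    """Join the raw text of each element, separated by blank lines."""
    parts = [str(element.get("text", "")).strip() for element in elements]
    return "\n\n".join(part for part in parts if part)


def to_markdown(elements: List[Element], converter: Callable[[List[Element]], str]) -> str:
    """Use `converter` when it works; otherwise emit plain-text markdown."""
    try:
        return converter(elements)
    except Exception as exc:
        # Report the failure but still return something usable.
        print(f"Markdown converter failed ({exc}); using plain-text fallback")
        return fallback_markdown(elements)
```

With this shape, a converter bug degrades the output to unformatted text instead of aborting the whole document.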

## Performance Notes

- Optimized for an NVIDIA T4 with 16 GB VRAM
- Processing time: roughly 30-60 seconds per page, depending on page complexity
- Memory usage: roughly 8-12 GB VRAM for typical documents
- CPU fallback available but significantly slower
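
The per-page figure above gives a rough way to estimate total GPU processing time. The helper below is a back-of-the-envelope sketch with a hypothetical name; it simply scales the 30-60 second range by the page count.

```python
from typing import Tuple


def estimate_minutes(pages: int, sec_low: float = 30, sec_high: float = 60) -> Tuple[float, float]:
    """Return the (low, high) estimated processing time in minutes."""
    return pages * sec_low / 60, pages * sec_high / 60


low, high = estimate_minutes(20)
print(f"20 pages: about {low:.0f}-{high:.0f} minutes")  # 20 pages: about 10-20 minutes
```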

## Example Output

The app generates:
1. **Markdown Preview**: Rendered document with LaTeX support
2. **Raw Markdown**: Source text for copying/editing
3. **Page Gallery**: Visual overview of all processed pages
4. **JSON Details**: Technical processing information

## Troubleshooting

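- **Out of Memory**: Reduce the element batch size (4 by default) or run on CPU
- **PDF Conversion Failed**: Check that the file is a valid PDF that PyMuPDF can open
- **Model Loading Error**: Verify the model path (`./hf_model` or `ByteDance/DOLPHIN`) and file permissions
- **Slow Processing**: Confirm that a GPU is available and that the app is not running on the CPU fallback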

## Credits

Built on the DOLPHIN model by ByteDance. Optimized for HuggingFace Spaces deployment.