---
tags:
- ocr
- image-to-text
license: mit
library_name: transformers
---
# Model Card: PHOCR
This card describes [PHOCR](https://github.com/puhuilab/phocr), an open, high-performance Optical Character Recognition (OCR) toolkit.
# PHOCR: High-Performance OCR Toolkit
[English](README.md) | [简体中文](README_CN.md)
PHOCR is an open high-performance Optical Character Recognition (OCR) toolkit designed for efficient text recognition across multiple languages including Chinese, Japanese, Korean, Russian, Vietnamese, and Thai. **PHOCR features a completely custom-developed recognition model (PH-OCRv1) that significantly outperforms existing solutions.**
## Motivation
Current token-prediction-based model architectures are highly sensitive to the accuracy of contextual tokens: repetitive patterns, even as few as a thousand instances, can lead to persistent memorization by the model. While most open-source text recognition models achieve character error rates (CER) in the percent range, our goal is to push this into the per-mille range. At that level, a system processing 100 million characters would make roughly 100,000 recognition errors instead of over 1 million, an order-of-magnitude improvement.
## Features
- **Custom Recognition Model**: **PH-OCRv1** achieves a sub-0.x% character error rate in document-style settings, building on open-source models, and even reaches a 0.0x% character error rate on English.
- **Multi-language Support**: Chinese, English, Japanese, Korean, Russian, and more
- **Rich Vocabulary**: Comprehensive vocabulary for each language. Chinese: 15,316, Korean: 17,388, Japanese: 11,186, Russian: 292.
- **High Performance**: Optimized inference engine with ONNX Runtime support
- **Easy Integration**: Simple Python API for quick deployment
- **Cross-platform**: Support for CPU and CUDA
## Installation
```bash
# Choose ONE of the installation methods below.

# Method 1: ONNX Runtime CPU build
pip install phocr[cpu]

# Method 2: ONNX Runtime GPU build
pip install phocr[cuda]
# Requires a working CUDA toolkit and cuDNN installation.
# You can install the CUDA runtime and cuDNN via conda:
conda install -c nvidia cuda-runtime=12.1 cudnn=9
# or install the corresponding CUDA toolkit and cuDNN libraries manually.

# Method 3: Manage ONNX Runtime yourself
# Install `onnxruntime` or `onnxruntime-gpu` first, then:
pip install phocr
```
## Quick Start
```python
from phocr import PHOCR

# Initialize the OCR engine
engine = PHOCR()

# Run OCR on an image
result = engine("path/to/image.jpg")
print(result)

# Visualize the results
result.vis("output.jpg")
print(result.to_markdown())
```
## Benchmarks
We conducted comprehensive benchmarks comparing PHOCR with leading OCR solutions across multiple languages and scenarios. **Our custom-developed PH-OCRv1 model demonstrates significant improvements over existing solutions.**
### Overall Performance Comparison
<table style="width: 90%; margin: auto; border-collapse: collapse; font-size: small;">
<thead>
<tr>
<th rowspan="2">Model</th>
<th colspan="4">ZH & EN<br><span style="font-weight: normal; font-size: x-small;">CER ↓</span></th>
<th colspan="2">JP<br><span style="font-weight: normal; font-size: x-small;">CER ↓</span></th>
<th colspan="2">KO<br><span style="font-weight: normal; font-size: x-small;">CER ↓</span></th>
<th colspan="1">RU<br><span style="font-weight: normal; font-size: x-small;">CER ↓</span></th>
</tr>
<tr>
<th><i>English</i></th>
<th><i>Simplified Chinese</i></th>
<th><i>EN CH Mixed</i></th>
<th><i>Traditional Chinese</i></th>
<th><i>Document</i></th>
<th><i>Scene</i></th>
<th><i>Document</i></th>
<th><i>Scene</i></th>
<th><i>Document</i></th>
</tr>
</thead>
<tbody>
<tr>
<td>PHOCR</td>
<td><strong>0.0008</strong></td>
<td><strong>0.0057</strong></td>
<td><strong>0.0171</strong></td>
<td><strong>0.0145</strong></td>
<td><strong>0.0039</strong></td>
<td><strong>0.0197</strong></td>
<td><strong>0.0050</strong></td>
<td><strong>0.0255</strong></td>
<td><strong>0.0046</strong></td>
</tr>
<tr>
<td>Baidu</td>
<td>0.0014</td>
<td>0.0069</td>
<td>0.0354</td>
<td>0.0431</td>
<td>0.0222</td>
<td>0.0607</td>
<td>0.0238</td>
<td>0.2120</td>
<td>0.0786</td>
</tr>
<tr>
<td>Ali</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>0.0272</td>
<td>0.0564</td>
<td>0.0159</td>
<td>0.1020</td>
<td>0.0616</td>
</tr>
<tr>
<td>PP-OCRv5</td>
<td>0.0149</td>
<td>0.0226</td>
<td>0.0722</td>
<td>0.0625</td>
<td>0.0490</td>
<td>0.1140</td>
<td>0.0113</td>
<td>0.0519</td>
<td>0.0348</td>
</tr>
</tbody>
</table>
Notes:
- Baidu: [Baidu Accurate API](https://ai.baidu.com/tech/ocr/general)
- Ali: [Aliyun API](https://help.aliyun.com/zh/ocr/product-overview/recognition-of-characters-in-languages-except-for-chinese-and-english-1)
- CER: the total edit distance divided by the total number of characters in the ground truth.
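The CER defined above can be computed with a standard Levenshtein (edit) distance. A minimal sketch in plain Python (the `edit_distance` and `cer` helpers below are illustrative, not part of the PHOCR API):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(refs: list[str], hyps: list[str]) -> float:
    """Total edit distance divided by total ground-truth characters."""
    total_dist = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_dist / total_chars

print(cer(["hello world"], ["helo world"]))  # 1 error over 11 chars, ~0.0909
```

This is the same metric reported in the benchmark table: errors are counted at the character level and normalized by the length of the ground truth.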
## Advanced Usage
A simple PyTorch (CUDA) implementation with a global KV cache is also provided. When running with the torch (CUDA) backend, enable caching by passing `use_cache=True` to `ORTSeq2Seq(...)`; this also allows larger batch sizes.
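To illustrate why a KV cache speeds up autoregressive decoding (a toy sketch in plain Python, independent of PHOCR's actual implementation): without a cache, every decoding step recomputes the key/value projections for the whole prefix; with a cache, each step only projects the newest token.

```python
# Toy illustration of KV caching. `project` stands in for the per-token
# key/value computation; the counter shows how much work each strategy does.

calls = {"count": 0}

def project(token):
    calls["count"] += 1
    return token * 2  # stand-in for a real K/V projection

def decode_no_cache(tokens):
    # Every step reprojects the entire prefix seen so far.
    kv = []
    for step in range(1, len(tokens) + 1):
        kv = [project(t) for t in tokens[:step]]
    return kv

def decode_with_cache(tokens):
    # Each step projects only the newest token and appends to the cache.
    cache = []
    for t in tokens:
        cache.append(project(t))
    return cache

tokens = list(range(10))

calls["count"] = 0
decode_no_cache(tokens)
no_cache_calls = calls["count"]   # 1 + 2 + ... + 10 = 55 projections

calls["count"] = 0
decode_with_cache(tokens)
cached_calls = calls["count"]     # 10 projections, one per token

print(no_cache_calls, cached_calls)
```

The quadratic-versus-linear gap is why enabling the cache frees enough compute and memory bandwidth to run larger batches.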
### Language-specific Configuration
See [demo.py](./demo.py) for more examples.
## Evaluation & Benchmarking
PHOCR provides comprehensive benchmarking tools to evaluate model performance across different languages and scenarios.
### Quick Benchmark
Run the complete benchmark pipeline:
```bash
sh benchmark/run_recognition.sh
```
Calculate Character Error Rate (CER) for model predictions:
```bash
sh benchmark/run_score.sh
```
### Benchmark Datasets
PHOCR uses standardized benchmark datasets for fair comparison:
- **zh_en_rec_bench**: [Chinese & English mixed text recognition](https://huggingface.co/datasets/puhuilab/zh_en_rec_bench)
- **jp_rec_bench**: [Japanese text recognition](https://huggingface.co/datasets/puhuilab/jp_rec_bench)
- **ko_rec_bench**: [Korean text recognition](https://huggingface.co/datasets/puhuilab/ko_rec_bench)
- **ru_rec_bench**: [Russian text recognition](https://huggingface.co/datasets/puhuilab/ru_rec_bench)
## Further Improvements
- Character error rate (CER), including punctuation, can be further reduced through additional normalization of the training corpus.
- Text detection accuracy can be further enhanced by employing a more advanced detection framework.
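As an example of the kind of corpus normalization the first point refers to (a hypothetical sketch, not PHOCR's actual pipeline), Unicode NFKC normalization already removes many spurious "errors" caused by full-width versus half-width punctuation and letterform variants:

```python
import unicodedata

def normalize_label(text: str) -> str:
    """Normalize a ground-truth label before training or scoring.

    NFKC folds full-width forms (e.g. 'Ａ', '１', '：') to their half-width
    ASCII equivalents, so visually identical strings are not counted as
    recognition errors.
    """
    text = unicodedata.normalize("NFKC", text)
    # Collapse whitespace runs introduced by the folding.
    return " ".join(text.split())

print(normalize_label("ＰＨＯＣＲ：ｖ１"))  # -> "PHOCR:v1"
```

Applying the same normalization to both predictions and references before computing CER keeps the metric focused on genuine recognition mistakes.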
## Contributing
We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.
## Support
For questions and support, please open an issue on GitHub or contact the maintainers.
## Acknowledgements
Many thanks to [RapidOCR](https://github.com/RapidAI/RapidOCR) for the detection model and the main framework.
## License
- This project is released under the Apache 2.0 license
- The copyright of the OCR detection and classification model is held by Baidu
- The PHOCR recognition models are under the modified MIT License - see the [LICENSE](./LICENSE) file for details
## Citation
If you use PHOCR in your research, please cite:
```bibtex
@misc{phocr2025,
  title={PHOCR: High-Performance OCR Toolkit},
  author={PuHui Lab},
  year={2025},
  url={https://github.com/puhuilab/phocr}
}
```