---
library_name: transformers
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---
# RolmOCR by [Reducto AI](https://reducto.ai/)
Earlier this year, the [Allen Institute for AI](https://allenai.org/) released olmOCR, an open-source tool that performs document OCR using the Qwen2-VL-7B vision language model (VLM). We were excited to see a high-quality, openly available approach to parsing PDFs and other complex documents — and curious to explore what else might be possible using newer foundation models and some lightweight optimizations.
The result is **RolmOCR**, a drop-in alternative to olmOCR that’s faster, uses less memory, and still performs well on a variety of document types. We're releasing it under **Apache 2.0** for anyone to try out, explore, or build on.
This model is a fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) on the full [allenai/olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225) dataset.
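Because RolmOCR is a standard Qwen2.5-VL fine-tune, it should also load with the usual Qwen2.5-VL classes in recent `transformers` releases. The snippet below is a minimal sketch of that setup, not an official example; generation follows the standard Qwen2.5-VL recipe.
```python
# Minimal sketch (not an official snippet): load RolmOCR with Hugging Face
# transformers. Assumes a recent transformers release with Qwen2.5-VL support
# and accelerate installed for device_map="auto".
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "reducto/RolmOCR", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("reducto/RolmOCR")
```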
## Key changes
We made three notable changes:
1. **New base model**: We swapped in the more recent Qwen2.5-VL-7B as the foundation.
2. **No metadata inputs**: Unlike the original, we don’t use metadata extracted from PDFs. This significantly reduces prompt length, which in turn lowers both processing time and VRAM usage, without hurting accuracy in most cases.
3. **Rotation of training data**: About 15% of the training data was rotated to enhance robustness to off-angle documents (see the sketch after this list). We otherwise use the same training set.
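For illustration, here is a minimal sketch of the kind of rotation augmentation described in point 3. The 15% fraction comes from the text above; the angle range and PIL-based implementation are our assumptions, not Reducto's actual training code.
```python
# Hypothetical rotation augmentation sketch. The 15% probability is from the
# model card; the angle range and use of PIL are assumptions for illustration.
import random
from PIL import Image

def maybe_rotate(page: Image.Image, p: float = 0.15) -> Image.Image:
    """With probability p, rotate a training page to simulate off-angle scans."""
    if random.random() < p:
        angle = random.uniform(-15.0, 15.0)  # assumed angle range
        return page.rotate(angle, expand=True, fillcolor="white")
    return page
```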
## Usage
Host your model with vLLM:
```bash
export VLLM_USE_V1=1
vllm serve reducto/RolmOCR
```
Call the model via the OpenAI-compatible server:
```python
# Requires the vLLM server started above:
#   export VLLM_USE_V1=1
#   vllm serve reducto/RolmOCR
from openai import OpenAI
import base64

client = OpenAI(api_key="123", base_url="http://localhost:8000/v1")
model = "reducto/RolmOCR"  # must match the name passed to `vllm serve`


def encode_image(image_path):
    """Read an image file and return its base64-encoded contents."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def ocr_page_with_rolm(img_base64):
    """OCR a single page image and return its plain-text transcription."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{img_base64}"},
                    },
                    {
                        "type": "text",
                        "text": "Return the plain text representation of this document as if you were reading it naturally.\n",
                    },
                ],
            }
        ],
        temperature=0.2,
        max_tokens=4096,
    )
    return response.choices[0].message.content


test_img_path = "path/to/image.png"
img_base64 = encode_image(test_img_path)
print(ocr_page_with_rolm(img_base64))
```
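Since the server consumes page images, PDFs need to be rasterized first. One way to do this (our choice for illustration; the model card does not prescribe a tool) is `pdf2image`, which wraps Poppler:
```python
# Hypothetical helper: rasterize a PDF with pdf2image (requires Poppler) and
# OCR each page with the ocr_page_with_rolm() function defined above.
import base64
import io

from pdf2image import convert_from_path

def ocr_pdf(pdf_path: str) -> str:
    pages = convert_from_path(pdf_path, dpi=200)  # one PIL image per page
    texts = []
    for page in pages:
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
        texts.append(ocr_page_with_rolm(b64))
    return "\n\n".join(texts)
```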
## Limitations
- RolmOCR, like other VLM-based OCR solutions, can still hallucinate text or drop content.
- Unlike the [Reducto Parsing API](https://app.reducto.ai/), RolmOCR cannot output layout bounding boxes.
- We have not evaluated the performance of any quantized versions.
## BibTeX and citation info
```bibtex
@misc{RolmOCR,
author = {Reducto AI},
title = {RolmOCR: A Faster, Lighter Open Source OCR Model},
year = {2025},
}
```