OCRFlux-3B

This is a preview release of the OCRFlux-3B model, fine-tuned from Qwen2.5-VL-3B-Instruct on our private document datasets and some data from the olmOCR-mix-0225 dataset.

Quick links:

  • πŸ› οΈ Code

OCRFlux is a toolkit, built on a multimodal large language model, for converting PDFs and images into clean, readable, plain Markdown text. It aims to push the current state of the art to a significantly higher level.

Try the online demo: https://ocrflux.pdfparser.io/

Key features:

Superior parsing quality on each page

On our released benchmark OCRFlux-bench-single, it achieves an Edit Distance Similarity (EDS) of 0.967, which is 0.095 higher than olmOCR-7B-0225-preview (0.872), 0.109 higher than Nanonets-OCR-s (0.858), and 0.187 higher than MonkeyOCR (0.780).

Native support for cross-page table/paragraph merging (to the best of our knowledge, this is the first open-source project to support this feature).

Based on a 3B-parameter VLM, so it can run even on a single consumer GPU such as an RTX 3090.

Usage

The best way to use this model is via the OCRFlux toolkit. The toolkit comes with an efficient vLLM-based inference setup that can handle millions of documents at scale.

API for directly calling OCRFlux (New)

You can use the inference API to call OCRFlux directly from your own code, without launching a standalone vLLM server, as follows:

from vllm import LLM
from ocrflux.inference import parse

file_path = 'test.pdf'
# file_path = 'test.png'  # images are supported as well

# Load the model once; tune gpu_memory_utilization and max_model_len
# to fit your GPU.
llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)

result = parse(llm, file_path)
document_markdown = result['document_text']
with open('test.md', 'w') as f:
    f.write(document_markdown)
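
To convert many files, you can reuse the same LLM instance, since loading the model is the expensive step. Below is a minimal sketch built only on the parse API shown above; the input folder, output folder, and extension filter are hypothetical placeholders.

import os

from vllm import LLM
from ocrflux.inference import parse

llm = LLM(model="model_dir/OCRFlux-3B", gpu_memory_utilization=0.8, max_model_len=8192)

input_dir = 'pdfs'        # hypothetical folder of input documents
output_dir = 'markdowns'  # hypothetical output folder
os.makedirs(output_dir, exist_ok=True)

for name in sorted(os.listdir(input_dir)):
    if not name.lower().endswith(('.pdf', '.png', '.jpg')):
        continue
    result = parse(llm, os.path.join(input_dir, name))
    out_path = os.path.join(output_dir, os.path.splitext(name)[0] + '.md')
    with open(out_path, 'w') as f:
        f.write(result['document_text'])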

Docker Usage

To use OCRFlux in a Docker container, first start the container with the following example command:

docker run -it --gpus all \
  -v /path/to/localworkspace:/localworkspace \
  -v /path/to/test_pdf_dir:/test_pdf_dir \
  -v /path/to/OCRFlux-3B:/OCRFlux-3B \
  --entrypoint bash \
  chatdoc/ocrflux:latest

and then run the following command inside the container to parse document files:

python3.12 -m ocrflux.pipeline /localworkspace/ocrflux_results --data /test_pdf_dir/* --model /OCRFlux-3B/

The parsing results will be stored in /localworkspace/ocrflux_results directory.

Viewing Results

Generate the final Markdown files by running the following command. The generated Markdown files will be in the ./localworkspace/markdowns/DOCUMENT_NAME directory.

python -m ocrflux.jsonl_to_markdown ./localworkspace
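
If you want to post-process the intermediate results yourself, the pipeline stores them as JSONL files under the workspace. The sketch below is a hypothetical reader: the exact record schema is not documented here, so the 'document_text' field name is an assumption carried over from the inference API above.

import glob
import json

# Hypothetical sketch: walk the pipeline's JSONL result files and pull
# out the document text. The field name is an assumption based on the
# inference API shown earlier, not a documented schema.
for path in glob.glob('./localworkspace/ocrflux_results/*.jsonl'):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            text = record.get('document_text', '')
            print(path, len(text))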

Full documentation for the pipeline

python -m ocrflux.pipeline --help
usage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}] [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]
                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]
                   [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--skip_cross_page_merge] [--port PORT]
                   workspace

Manager for running millions of PDFs through a batch inference pipeline

positional arguments:
  workspace             The filesystem path where work will be stored, can be a local folder

options:
  -h, --help            show this help message and exit
  --data [DATA ...]     List of paths to files to process
  --pages_per_group PAGES_PER_GROUP
                        Aiming for this many pdf pages per work item group
  --max_page_retries MAX_PAGE_RETRIES
                        Max number of times we will retry rendering a page
  --max_page_error_rate MAX_PAGE_ERROR_RATE
                        Rate of allowable failed pages in a document, 1/250 by default
  --workers WORKERS     Number of workers to run at a time
  --model MODEL         The path to the model
  --model_max_context MODEL_MAX_CONTEXT
                        Maximum context length that the model was fine tuned under
  --model_chat_template MODEL_CHAT_TEMPLATE
                        Chat template to pass to vllm server
  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
                        Dimension on longest side to use for rendering the pdf pages
  --skip_cross_page_merge
                        Whether to skip cross-page merging
  --port PORT           Port to use for the VLLM server

Code overview

There are some reusable pieces of the code that may be useful for your own projects, such as the inference API (ocrflux.inference) and the batch pipeline (ocrflux.pipeline).

Benchmark for single-page parsing

We ship two comprehensive benchmarks to help measure the performance of our OCR system in single-page parsing:

  • OCRFlux-bench-single: Contains 2000 PDF pages (1000 English pages and 1000 Chinese pages) and their ground-truth Markdowns (manually labeled with multiple rounds of review).

  • OCRFlux-pubtabnet-single: Derived from the public PubTabNet benchmark with some format transformation. It contains 9064 HTML table samples, which are split into simple and complex tables according to whether they contain rowspan or colspan cells.

We emphasize that the released benchmarks are NOT included in our training and evaluation data. The main results are as follows:

  1. In OCRFlux-bench-single, we calculated the Edit Distance Similarity (EDS) between the generated Markdowns and the ground-truth Markdowns as the metric (a minimal sketch of this metric follows the results below).

    | Language | Model                  | Avg EDS ↑ |
    |----------|------------------------|-----------|
    | English  | olmOCR-7B-0225-preview | 0.885     |
    | English  | Nanonets-OCR-s         | 0.870     |
    | English  | MonkeyOCR              | 0.828     |
    | English  | OCRFlux-3B             | 0.971     |
    | Chinese  | olmOCR-7B-0225-preview | 0.859     |
    | Chinese  | Nanonets-OCR-s         | 0.846     |
    | Chinese  | MonkeyOCR              | 0.731     |
    | Chinese  | OCRFlux-3B             | 0.962     |
    | Total    | olmOCR-7B-0225-preview | 0.872     |
    | Total    | Nanonets-OCR-s         | 0.858     |
    | Total    | MonkeyOCR              | 0.780     |
    | Total    | OCRFlux-3B             | 0.967     |
  2. In OCRFlux-pubtabnet-single, we calculated the Tree Edit Distance-based Similarity (TEDS) between the generated HTML tables and the ground-truth HTML tables as the metric.

    | Type    | Model                  | Avg TEDS ↑ |
    |---------|------------------------|------------|
    | Simple  | olmOCR-7B-0225-preview | 0.810      |
    | Simple  | Nanonets-OCR-s         | 0.882      |
    | Simple  | MonkeyOCR              | 0.880      |
    | Simple  | OCRFlux-3B             | 0.912      |
    | Complex | olmOCR-7B-0225-preview | 0.676      |
    | Complex | Nanonets-OCR-s         | 0.772      |
    | Complex | MonkeyOCR              | 0.826      |
    | Complex | OCRFlux-3B             | 0.807      |
    | Total   | olmOCR-7B-0225-preview | 0.744      |
    | Total   | Nanonets-OCR-s         | 0.828      |
    | Total   | MonkeyOCR              | 0.853      |
    | Total   | OCRFlux-3B             | 0.861      |
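
The exact EDS implementation used in result (1) is not reproduced here; a common formulation, assumed in this minimal pure-Python sketch, is one minus the Levenshtein distance normalized by the longer string's length.

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def eds(pred: str, gold: str) -> float:
    # Edit Distance Similarity in [0, 1]; 1.0 means identical strings.
    if not pred and not gold:
        return 1.0
    return 1.0 - edit_distance(pred, gold) / max(len(pred), len(gold))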

We also present case studies in our blog article that illustrate the model's strengths.

Benchmark for cross-page table/paragraph merging

PDF documents are typically paginated, which often results in tables or paragraphs being split across consecutive pages. Accurately detecting and merging such cross-page structures is crucial to avoid generating incomplete or fragmented content.

The detection task can be formulated as follows: given the Markdowns of two consecutive pagesβ€”each structured as a list of Markdown elements (e.g., paragraphs and tables)β€”the goal is to identify the indexes of elements that should be merged across the pages.

For the merging task, if the elements to be merged are paragraphs, we can simply concatenate them. Merging two table fragments, however, is much more challenging. For example, a table spanning multiple pages may repeat the header of the first page on the second page. Another difficult scenario is a table cell with long content that spans multiple lines within the cell, where the first few lines appear on the previous page and the remaining lines continue on the next page. We also observe cases where tables with a large number of columns are split vertically and placed on two consecutive pages. More examples of cross-page tables can be found in our blog article. To address these issues, we developed an LLM-based model for cross-page table merging. Specifically, this model takes two split table fragments as input and generates a complete, well-structured table as output.
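
Schematically, the detect-then-merge flow looks like the sketch below. The function names are hypothetical placeholders; in the toolkit itself these stages are exposed through the merge_pages and merge_tables tasks of ocrflux.pipeline.

def merge_consecutive_pages(page1_elements, page2_elements,
                            detect_merge_indexes, merge_tables_with_llm):
    # Hypothetical sketch of cross-page merging.
    #   page1_elements / page2_elements: lists of Markdown elements
    #     (paragraphs or tables) for two consecutive pages.
    #   detect_merge_indexes: model call returning pairs (i, j), meaning
    #     page1_elements[i] should be merged with page2_elements[j].
    #   merge_tables_with_llm: model call that fuses two table fragments
    #     into one complete, well-structured table.
    merged = list(page1_elements)
    consumed = set()
    for i, j in detect_merge_indexes(page1_elements, page2_elements):
        first, second = page1_elements[i], page2_elements[j]
        if is_table(first) and is_table(second):
            merged[i] = merge_tables_with_llm(first, second)
        else:
            merged[i] = first + ' ' + second  # paragraphs: simple concatenation
        consumed.add(j)
    merged.extend(e for j, e in enumerate(page2_elements) if j not in consumed)
    return merged

def is_table(element: str) -> bool:
    # Crude heuristic for illustration only; the real toolkit knows each
    # element's type from its page-parsing output.
    s = element.lstrip()
    return s.startswith('<table') or s.startswith('|')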

We ship two comprehensive benchmarks to help measure the performance of our OCR system in cross-page table/paragraph detection and merging tasks respectively:

  • OCRFlux-bench-cross: Contains 1000 samples (500 English and 500 Chinese); each sample comprises the Markdown element lists of two consecutive pages, along with the indexes of the elements that need to be merged (manually labeled through multiple rounds of review). If no tables or paragraphs require merging, the index list in the annotation is left empty.

  • OCRFlux-pubtabnet-cross: Containing 9064 pairs of split table fragments, along with their corresponding ground-truth merged versions.

These released benchmarks are likewise NOT included in our training and evaluation data. The main results are as follows:

  1. In OCRFlux-bench-cross, we calculated Accuracy, Precision, Recall, and F1 score as the metrics. Note that a detection result counts as correct only if the model correctly judges whether any elements need to be merged across the two pages and outputs the correct indexes of those elements (an assumed scoring sketch follows the results below).

    | Language | Precision ↑ | Recall ↑ | F1 ↑  | Accuracy ↑ |
    |----------|-------------|----------|-------|------------|
    | English  | 0.992       | 0.964    | 0.978 | 0.978      |
    | Chinese  | 1.000       | 0.988    | 0.994 | 0.994      |
    | Total    | 0.996       | 0.976    | 0.986 | 0.986      |
  2. In OCRFlux-pubtabnet-cross, we calculated the Tree Edit Distance-based Similarity (TEDS) between the generated merged table and the ground-truth merged table as the metric.

    | Table type | Avg TEDS ↑ |
    |------------|------------|
    | Simple     | 0.965      |
    | Complex    | 0.935      |
    | Total      | 0.950      |
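
How Precision and Recall are aggregated in result (1) is not spelled out above; one plausible scheme, assumed in the sketch below, treats a page pair as a positive prediction when the model outputs a non-empty index set, and counts a positive as correct only on an exact match with the annotation.

def score_detection(preds, golds):
    # preds / golds: one set of merge-index pairs per page-pair sample;
    # an empty set means "nothing to merge". This aggregation scheme is
    # an assumption, not the benchmark's documented definition.
    accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)

    pred_pos = [(p, g) for p, g in zip(preds, golds) if p]
    gold_pos = [(p, g) for p, g in zip(preds, golds) if g]
    precision = sum(p == g for p, g in pred_pos) / len(pred_pos)
    recall = sum(p == g for p, g in gold_pos) / len(gold_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, accuracy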