Update app.py
app.py
CHANGED
@@ -83,7 +83,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 <a href="https://impresso-project.ch" target="_blank">
 <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
 alt="Impresso Project Logo"
-style="height:
+style="height: 42px; display: block; margin: 0 auto;">
 </a>
 """
 )
@@ -91,11 +91,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 """
 # 🔍 OCR Quality Assessment Demo
 
-This demo showcases the **OCR Quality Assessment (OCRQA)**
+This demo showcases the **OCR Quality Assessment (OCRQA)** of the [Impresso Project](https://impresso-project.ch).
+The pipeline evaluates the quality of text extracted via **Optical Character Recognition (OCR)** by estimating the proportion of recognizable words.
 
 It returns:
 - a **quality score** between **0.0 (poor)** and **1.0 (excellent)**, and
-- a list of **potential OCR errors** (unrecognized tokens).
+- a list of **potential OCR errors** (unrecognized tokens) as well as the known tokens.
 
 You can try the example below (a German text containing typical OCR errors), or paste your own OCR-processed text to assess its quality.
 """
@@ -141,6 +142,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 - **Diagnostics output**:
 - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
 - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
+- Note: Non-alphabetic characters will be removed. For efficiency reasons, all digits are replaced by the digit 0.
 
 #### ⚠️ Limitations:
 - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
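
The description text in this diff says the score estimates the proportion of recognizable words, after non-alphabetic characters are removed and digits are mapped to 0. The sketch below illustrates that idea only; it is not the Impresso implementation. The names `normalize` and `ocrqa_score`, the lowercasing, the whitespace tokenization, the per-token ratio, and the tiny in-memory `wordlist` are all assumptions made for the example.

```python
from __future__ import annotations

import re


def normalize(text: str) -> str:
    """Lowercase, map every digit to 0, and turn punctuation into spaces.

    Assumption: "non-alphabetic characters are removed" is read here as
    "everything except letters, the placeholder digit 0, and whitespace".
    """
    text = re.sub(r"\d", "0", text.lower())
    # \w covers letters and digits; underscores are dropped explicitly.
    return re.sub(r"[^\w\s]|_", " ", text)


def ocrqa_score(text: str, wordlist: set[str]) -> tuple[float, list[str], list[str]]:
    """Return (score, known tokens, unrecognized tokens) for one text.

    Assumption: the score is the share of tokens found in the wordlist,
    counted per token occurrence, and an empty text scores 0.0.
    """
    tokens = normalize(text).split()
    known = [t for t in tokens if t in wordlist]
    unknown = [t for t in tokens if t not in wordlist]
    score = len(known) / len(tokens) if tokens else 0.0
    return score, known, unknown


# Hypothetical toy wordlist; a real reference list would be far larger.
wordlist = {"die", "zeitung", "erschien", "im", "jahr", "0000"}
score, known, unknown = ocrqa_score("Die Zeitnng erschien im Jahr 1897.", wordlist)
print(f"{score:.2f}", unknown)  # 0.83 ['zeitnng']
```

Mapping all digits to 0 means the wordlist only needs one entry per number shape (for example `0000` for any four-digit year), which is one plausible reading of the "efficiency reasons" mentioned in the note.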