simon-clmtd committed on
Commit
6acad0f
·
verified ·
1 Parent(s): d2137e3

Update app.py

Files changed (1)
  1. app.py +5 -3
app.py CHANGED
@@ -83,7 +83,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  <a href="https://impresso-project.ch" target="_blank">
  <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
  alt="Impresso Project Logo"
- style="height: 84px; display: block; margin: 0 auto;">
  </a>
  """
  )
@@ -91,11 +91,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  """
  # 🔍 OCR Quality Assessment Demo
 
- This demo showcases the **OCR Quality Assessment (OCRQA)** pipeline developed as part of the [Impresso Project](https://impresso-project.ch). The pipeline evaluates the quality of text extracted via **Optical Character Recognition (OCR)** by estimating the proportion of recognizable words.
 
  It returns:
  - a **quality score** between **0.0 (poor)** and **1.0 (excellent)**, and
- - a list of **potential OCR errors** (unrecognized tokens).
 
  You can try the example below (a German text containing typical OCR errors), or paste your own OCR-processed text to assess its quality.
  """
@@ -141,6 +142,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  - **Diagnostics output**:
  - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
  - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
 
  #### ⚠️ Limitations:
  - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
 
  <a href="https://impresso-project.ch" target="_blank">
  <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
  alt="Impresso Project Logo"
+ style="height: 42px; display: block; margin: 0 auto;">
  </a>
  """
  )
 
  """
  # 🔍 OCR Quality Assessment Demo
 
+ This demo showcases the **OCR Quality Assessment (OCRQA)** of the [Impresso Project](https://impresso-project.ch).
+ The pipeline evaluates the quality of text extracted via **Optical Character Recognition (OCR)** by estimating the proportion of recognizable words.
 
  It returns:
  - a **quality score** between **0.0 (poor)** and **1.0 (excellent)**, and
+ - a list of **potential OCR errors** (unrecognized tokens) as well as the known tokens.
 
  You can try the example below (a German text containing typical OCR errors), or paste your own OCR-processed text to assess its quality.
  """
 
  - **Diagnostics output**:
  - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
  - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
+ - Note: Non-alphabetic characters will be removed. For efficiency reasons, all digits are replaced by the digit 0.
 
  #### ⚠️ Limitations:
  - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
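
The behaviour described in the updated help text (a score equal to the proportion of recognizable words, plus lists of known and unrecognized tokens, computed after normalization) can be illustrated with a minimal sketch. This is not the Impresso pipeline's code: the function names `normalize` and `assess`, the whitespace tokenization, the lowercasing, and the plain Python set standing in for the reference wordlist are all assumptions made for illustration. It also assumes that digits survive normalization as `0` while every other non-letter character is dropped, which is one possible reading of the note added in this commit.

```python
import re


def normalize(token):
    """Hypothetical normalization: lowercase, map every digit to 0,
    then drop any character that is neither a letter nor a digit."""
    token = re.sub(r"\d", "0", token.lower())
    return "".join(ch for ch in token if ch.isalnum())


def assess(text, wordlist):
    """Return a quality score in [0.0, 1.0] plus known and unrecognized tokens.

    `wordlist` is a plain set of normalized words (an assumption; the real
    demo may use a different lookup structure).
    """
    # Whitespace tokenization is an assumption; empty tokens are discarded.
    tokens = [t for t in (normalize(raw) for raw in text.split()) if t]
    known = [t for t in tokens if t in wordlist]
    unknown = [t for t in tokens if t not in wordlist]
    # Score = proportion of tokens found in the wordlist (0.0 for empty input).
    score = len(known) / len(tokens) if tokens else 0.0
    return {"score": score, "known_tokens": known, "unknown_tokens": unknown}


# Illustrative wordlist and input; "Hnus" stands in for an OCR error.
sample_wordlist = {"das", "haus", "wurde", "0000", "gebaut"}
result = assess("Das Haus wurde 1893 gebaut, Hnus!", sample_wordlist)
print(result)
```

Because every digit is mapped to `0`, a single wordlist entry such as `0000` covers all four-digit numbers, which is presumably the efficiency gain the note refers to.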