Spaces: impresso-project / ocrqa-demo

simon-clmtd committed 12 days ago

Commit 6acad0f (verified) · 1 Parent(s): d2137e3

Update app.py

Files changed (1)

app.py +5 -3

@@ -83,7 +83,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
     <a href="https://impresso-project.ch" target="_blank">
         <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
              alt="Impresso Project Logo"
-             style="height: 84px; display: block; margin: 0 auto;">
+             style="height: 42px; display: block; margin: 0 auto;">
     </a>
     """
 )
@@ -91,11 +91,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
         """
     # 🔍 OCR Quality Assessment Demo
-    This demo showcases the **OCR Quality Assessment (OCRQA)** pipeline developed as part of the [Impresso Project](https://impresso-project.ch). The pipeline evaluates the quality of text extracted via **Optical Character Recognition (OCR)** by estimating the proportion of recognizable words.
+    This demo showcases the **OCR Quality Assessment (OCRQA)** of the [Impresso Project](https://impresso-project.ch).
+    The pipeline evaluates the quality of text extracted via **Optical Character Recognition (OCR)** by estimating the proportion of recognizable words.
     It returns:
     - a **quality score** between **0.0 (poor)** and **1.0 (excellent)**, and
-    - a list of **potential OCR errors** (unrecognized tokens).
+    - a list of **potential OCR errors** (unrecognized tokens) as well as the known tokens.
     You can try the example below (a German text containing typical OCR errors), or paste your own OCR-processed text to assess its quality.
     """
@@ -141,6 +142,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
     - **Diagnostics output**:
         - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
         - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
+        - Note: Non-alphabetic characters will be removed. For efficiency reasons, all digits are replaced by the digit 0.
     #### ⚠️ Limitations:
     - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
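
The scoring described in the demo text (proportion of recognizable words, plus lists of known and unrecognized tokens) can be illustrated with a minimal sketch. This is not the Impresso pipeline's code: the function names `normalize` and `assess`, the plain Python `set` used as a wordlist, the lowercasing step, and the whitespace tokenization are all assumptions made for illustration. The normalization follows one reading of the note in the diff: every digit becomes `0`, and other non-letter characters are dropped.

```python
import re


def normalize(text: str) -> str:
    """Lowercase (an assumption), map every digit to 0, turn punctuation into spaces."""
    text = text.lower()
    text = re.sub(r"\d", "0", text)          # all digits collapse to the digit 0
    text = re.sub(r"[^\w\s]|_", " ", text)   # punctuation and underscores become separators
    return text


def assess(text: str, wordlist: set) -> dict:
    """Return a score in [0.0, 1.0] plus the known and unrecognized tokens."""
    tokens = normalize(text).split()
    known = [t for t in tokens if t in wordlist]
    unknown = [t for t in tokens if t not in wordlist]
    # Score = share of tokens found in the wordlist; empty input scores 0.0 here.
    score = len(known) / len(tokens) if tokens else 0.0
    return {"score": round(score, 2), "known_tokens": known, "unknown_tokens": unknown}


# Hypothetical toy wordlist; a real reference list would be far larger.
WORDLIST = {"die", "zeitung", "erschien", "im", "jahr", "0000"}

result = assess("Die Zeitnng erschien im Jahr 1898.", WORDLIST)
print(result["score"])           # 0.83 (5 of 6 tokens are known)
print(result["unknown_tokens"])  # ['zeitnng']
```

Because digits are collapsed before lookup, the wordlist only needs one entry per digit pattern (`0000` covers every four-digit year), which is one plausible reading of the "efficiency reasons" mentioned in the note.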