Update README.md
README.md
CHANGED
@@ -6,15 +6,15 @@ language:
 - de
 ---
 
-**OCRonos** is a series of specialized language models for the correction of badly digitized texts.
+**OCRonos** is a series of specialized language models trained by PleIAs for the correction of badly digitized texts.
 
-OCROnos models are versatile tools supporting the correction of OCR errors, wrong word cut/merge and overall broken text structures.
+OCROnos models are versatile tools supporting the correction of OCR errors, wrong word cut/merge and overall broken text structures. The training data includes a highly diverse set of ocrized texts in multiple languages from PleIAs open pre-training corpus, drawn from cultural heritage sources (Common Corpus) and financial and administrative documents in open data (Finance Commons).
 
 This release currently features a model based on llama-3-8b that has been the most tested to date. Future release will focus on smaller internal models that provides a better ratio of generation cost/quality.
 
 OCRonos is generally faithful to what the original material, provides sensible restitution of deteriorated text and will rarely rewrite correct words. On highly deteriorated content, OCRonos can act as a synthetic rewriting tool rather than a strict correction tool.
 
-Along with the other models of PleIAs
+Along with the other models of PleIAs Bad Data Toolbox, OCRonos contributes to make challenging resources usable for LLM applications and, more broadly, search retrieval. It is especially fitting in situation where the original PDF sources is too damaged for correct OCRization or even non-existent/complex to retrieve.
 
 OCRonos can be tested on a free demo along with [Segmentext](https://huggingface.co/PleIAs/Segmentext), another model trained by PleIAs for the text segmentation of broken PDFs.
 