Pclanglais committed on
Commit 33afe46 · verified · 1 Parent(s): c82c8b1

Update README.md

Files changed (1)
  1. README.md +11 -18
README.md CHANGED
@@ -6,34 +6,25 @@ language:
6
  - de
7
  ---
8
 
9
- **OCRonos** is a series of specialized language model for OCR correction.
10
 
11
  OCRonos models are trained by PleIAs on a highly diverse set of ocrized texts in multiple languages from PleIAs open pre-training corpus, drawn from cultural heritage sources (Common Corpus) and financial and administrative documents in open data (Finance Commons).
12
 
13
  This release currently features a model based on llama-3-8b that has been the most tested to date. Future releases will focus on smaller internal models that provide a better ratio of generation cost to quality.
14
 
15
- OCRonos is generally faithful to what the original material, provides sensible restitution of deteriorated text and will rarely rewrite correct words.
16
 
17
- On highly deteriorated content, OCRonos can act as a synthetic rewriting tool rather than a strict correction tool.
 
 
18
 
19
  ## Examples
20
 
21
- Original input with a high rate of OCR errors from *Chronicling America*:
22
 
23
- > The Douglas Island News.
24
- >
25
- > ??????II ????I ????mmmm? n i i n i ??mmmmmm?i????????????????????????????
26
- >
27
- > VOL. G. DOUGLAS CITY AND TREADWELL, ALASKA, WEDNESDAY, AUGUST 3, L90L NO. 37
28
- >
29
- > I Wall Paper $ Carpets f
30
- >
31
- > ? 5?!? *i i
32
- > g Spring House Cleaning is almost here, and g; ?J such a lot of trouble it brings with it. The 3 1 ff house has to be papered from top to bottom, ? ?J the floor Carpeted, Lace Curtains put up, J + Linoleum on kitchen floor, and various other 3 ? things, when carried out are the making of ? our Alaska homes. g
33
- >
34
- > Now we have prepared a way out of this trou- ?* ble for you. We have had a special depart- Jment made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll ? J up; 50 different styles of carpets, from 50 c J 4 up, such as genuine home made rag carpets 4} J /washable), Brussels, Velvets, Axminstersand ? ? Ingraius; all grades and styles of Lace Cur- g 4 tains, inported and domestic portiers and *c J couch covers; Rugs, small as 12x20 inches ? J and large as 15x20 feet; Linoleum in grades J 4 A, B, C, D, E; all widths floor oil cloth, roller <* J shados, carpet sweepers, curtain poles, ? g mouldings, brass fixtures and trimmings of g 4 all kinds, in fact everything carried in an up g J to date department can be obtained from ?
35
 
36
- Rewrite with a reasonably high temperature (0.7):
37
 
38
  > The Douglas Island News.
39
  >
@@ -45,6 +36,8 @@ Rewrite with a reasonably high temperature (0.7):
45
  >
46
  > Now we have prepared a way out of this trouble for you. We have had a special department made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll up; 50 different styles of carpets, from 50 c up, such as genuine home made rag carpets (washable), Brussels, Velvets, Axminsters, and Ingrains; all grades and styles of Lace Curtains, imported and domestic portiers and couch covers; Rugs, small as 12x20 inches and large as 15x20 feet; Linoleum in grades A, B, C, D, E; all widths, floor oil cloth, roller shades, carpet sweepers, curtain poles, mouldings, brass fixtures and trimmings of all kinds, in fact everything carried in an up to date department can be obtained from B. M. BEHRENDS CO., INC.
47
 
 
 
48
 
49
  ## Usage
50
  OCRonos uses a custom instruction structure: "### Text ###\n[text]\n\n### Correction ###\n" and a custom end-of-sequence marker, #END#.
@@ -62,4 +55,4 @@ LLMs are theoretically well suited for the task of OCR correction as they are tr
62
 
63
  In past experiments, a common issue with OCR correction has been *language switching*: due to the inherent noise in the input text, an LLM will transcribe in a different language or even in a different script (like Cyrillic). The issue has been especially observed in smaller generalist models like GPT-3.5 or Claude-Haiku.
64
 
65
- OCRonos largely mitigates this issue. A few instability have been noticed at scale, with the inclusion of repeated words but could be generally filtered out.
 
6
  - de
7
  ---
8
 
9
+ **OCRonos** is a series of specialized language models for the correction of badly digitized texts, part of PleIAs' "Bad Data Toolbox". It supports the correction of OCR errors, wrongly split or merged words, and broken text structures overall.
10
 
11
  OCRonos models are trained by PleIAs on a highly diverse set of ocrized texts in multiple languages from PleIAs open pre-training corpus, drawn from cultural heritage sources (Common Corpus) and financial and administrative documents in open data (Finance Commons).
12
 
13
  This release currently features a model based on llama-3-8b that has been the most tested to date. Future releases will focus on smaller internal models that provide a better ratio of generation cost to quality.
14
 
15
+ OCRonos is generally faithful to the original material, provides a sensible restitution of deteriorated text and will rarely rewrite correct words. On highly deteriorated content, OCRonos can act as a synthetic rewriting tool rather than a strict correction tool.
16
 
17
+ Along with the other models of PleIAs' "Bad Data Toolbox", OCRonos helps make challenging resources usable for LLM applications and, more broadly, search and retrieval. It is especially fitting in situations where the original PDF source is too damaged for correct OCRization, or is non-existent or complex to retrieve.
18
+
19
+ OCRonos can be tested on a free demo along with [Segmentext](https://huggingface.co/PleIAs/Segmentext), another model trained by PleIAs for the text segmentation of broken PDFs.
20
 
21
  ## Examples
22
 
23
+ Original input with a high rate of digitization artifacts, from a financial report submitted to the AMF:
24
 
25
+ > Inthisrespect,the in surancebusiness inve stmen t portfolio can be considered conservativel y mana ged as itislargely com posed of cor porate, sovere ignandsupranational bonds, term l oansaswell asdemanddeposits. Followin gtheprevious year, thegroup continue dto diversi fyits holdingsinto investmen tgrade cor porate bonds.(see Note 4 ) Itshouldbenoted that bondsandterm loansareheldto matur ityinacco rdancewiththegroup’s businessmodelpolicy of “inflows”. Technical liabilities on insur ance contractsTheguaran tees offered cover death,disability,r e dundancy andunem ployment aspartof aloanprotect ion insurance policy. These types o f risk are controlled throu ghthe use o f app ropriate morta litytables,statistica lchecksonloss rat ios for thepopulation groups insure dandthrough ar e insurance program. Liabilit yade quacytest:a goodnes sટofટfittestaimed a t ensurin gthat insuranceliabilitiesare adequate wi threspect to curren testimates of future cash flowsgenerate d by the insurance contracts isperformed at eachstate mentof accou nt.Futu recashflowsresultin g fromthecontracts take into accoun t the guaran tees and o ptions inherent therein. In the even t of inade quacy, the potential l ossesare fully reco gnize dinnetincome. Them o delingof future cash flowsintheinsuranceliability adequacytest are basedonthefollowin g assum ptions: Atthee n dof 2022,thisliabilityadequacytestdidnot reve al any anomalies. Income state ment: Theincome andexpenses reco gnize d for theinsurance contracts issued bythegroupappear inthei ncome stat emen t in“Ne tincome of other activities”and“Ne t expense of other acti vities ”. Risk mana gement: The group ado ptsa“prudent a pproach” to itsmana gemen t of therisks towhichitcouldbee x posed throughits insurance activities. Risk o f contre partie As state dabove, insurance companyonlyinvests inasse ts (bankdeposit, soverei gnbonds,supra oragencies or
26
 
27
+ Rewrite suggested by OCRonos:
28
 
29
  > The Douglas Island News.
30
  >
 
36
  >
37
  > Now we have prepared a way out of this trouble for you. We have had a special department made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll up; 50 different styles of carpets, from 50 c up, such as genuine home made rag carpets (washable), Brussels, Velvets, Axminsters, and Ingrains; all grades and styles of Lace Curtains, imported and domestic portiers and couch covers; Rugs, small as 12x20 inches and large as 15x20 feet; Linoleum in grades A, B, C, D, E; all widths, floor oil cloth, roller shades, carpet sweepers, curtain poles, mouldings, brass fixtures and trimmings of all kinds, in fact everything carried in an up to date department can be obtained from B. M. BEHRENDS CO., INC.
38
 
39
+ Editorial structuring proposed by Segmentext:
40
+
41
 
42
  ## Usage
43
  OCRonos uses a custom instruction structure: "### Text ###\n[text]\n\n### Correction ###\n" and a custom end-of-sequence marker, #END#.
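
As an illustration of the instruction structure described above, here is a minimal Python sketch that assembles the prompt and trims a generation at the custom end marker. The helper names `build_prompt` and `extract_correction` are hypothetical and not part of any PleIAs release. The generation call itself is left out, and the sketch assumes the raw output may or may not echo the prompt.

```python
# Prompt layout and end marker as given in the README.
PROMPT_PREFIX = "### Text ###\n"
PROMPT_SUFFIX = "\n\n### Correction ###\n"
EOS_MARKER = "#END#"


def build_prompt(text: str) -> str:
    """Wrap a noisy input text in the OCRonos instruction structure."""
    return PROMPT_PREFIX + text + PROMPT_SUFFIX


def extract_correction(generated: str, prompt: str) -> str:
    """Return only the corrected text from a raw generation.

    Assumption: some inference setups return the prompt followed by the
    completion, others return the completion alone. Both are handled.
    """
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    # Keep everything before the first end marker, if there is one.
    return generated.split(EOS_MARKER, 1)[0].strip()


if __name__ == "__main__":
    prompt = build_prompt("Inthisrespect,the in surancebusiness")
    # Stand-in for a model output; no model is called in this sketch.
    fake_output = prompt + "In this respect, the insurance business#END#"
    print(extract_correction(fake_output, prompt))
```

With an inference library, the same `#END#` string would typically be passed as a stop sequence so that generation halts there. The exact parameter depends on the library used.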
 
55
 
56
  In past experiments, a common issue with OCR correction has been *language switching*: due to the inherent noise in the input text, an LLM will transcribe in a different language or even in a different script (like Cyrillic). The issue has been especially observed in smaller generalist models like GPT-3.5 or Claude-Haiku.
57
 
58
+ OCRonos largely mitigates this issue. A few instabilities have been noticed at scale, with the inclusion of repeated words, but these can generally be filtered out.
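
The README does not say how repeated words are filtered. One possible approach, sketched here under the assumption that the instability shows up as the same word repeated several times in a row, is a regular-expression pass over the output. The function name `collapse_repeats` and the default threshold of three are illustrative choices, not the authors' method.

```python
import re


def collapse_repeats(text: str, min_repeats: int = 3) -> str:
    """Collapse runs of one word repeated at least `min_repeats` times.

    The default of 3 leaves legitimate doubles such as "had had" alone.
    This is a heuristic: a genuine triple repetition would also be collapsed.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # (\w+) captures a word; the group then requires the same word,
    # separated by whitespace, at least (min_repeats - 1) more times.
    pattern = re.compile(
        r"\b(\w+)(?:\s+\1\b){%d,}" % (min_repeats - 1),
        flags=re.IGNORECASE,
    )
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    print(collapse_repeats("the group group group continued to diversify"))
```

A stricter or looser threshold can be set through `min_repeats`, for example `collapse_repeats(text, min_repeats=2)` to collapse any immediate duplicate.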