Pclanglais committed on
Commit 952b9ea · verified · 1 Parent(s): fb63ff8

Create README.md

  1. README.md +65 -0
README.md ADDED
@@ -0,0 +1,65 @@
+ ---
+ license: apache-2.0
+ language:
+ - fr
+ - en
+ - de
+ ---
+
+ **OCRonos** is a series of specialized language models for OCR correction, trained by PleIAs on top of existing open-weight and internal LLMs.
+
+ OCRonos models are trained on a highly diverse set of OCRized texts in multiple languages from the PleIAs open pre-training corpus, drawn from cultural heritage sources (Common Corpus) and from financial and administrative documents in open data (Finance Commons).
+
+ This release currently includes one model, based on llama-3-8b, which has been the most extensively tested to date. Future releases will focus on internal models that offer a better ratio of quality to generation cost.
+
+ OCRonos is generally faithful to the original material, provides sensible restitution of deteriorated text, and will rarely rewrite correct words.
+
+ On highly deteriorated content, OCRonos can act as a synthetic rewriting tool rather than a strict correction tool.
+
+ ## Examples
+
+ Original input with a high rate of OCR errors from *Chronicling America*:
+
+ > The Douglas Island News.
+ >
+ > ??????II ????I ????mmmm? n i i n i ??mmmmmm?i????????????????????????????
+ >
+ > VOL. G. DOUGLAS CITY AND TREADWELL, ALASKA, WEDNESDAY, AUGUST 3, L90L NO. 37
+ >
+ > I Wall Paper $ Carpets f
+ >
+ > ? 5?!? *i i
+ > g Spring House Cleaning is almost here, and g; ?J such a lot of trouble it brings with it. The 3 1 ff house has to be papered from top to bottom, ? ?J the floor Carpeted, Lace Curtains put up, J + Linoleum on kitchen floor, and various other 3 ? things, when carried out are the making of ? our Alaska homes. g
+ >
+ > Now we have prepared a way out of this trou- ?* ble for you. We have had a special depart- Jment made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll ? J up; 50 different styles of carpets, from 50 c J 4 up, such as genuine home made rag carpets 4} J /washable), Brussels, Velvets, Axminstersand ? ? Ingraius; all grades and styles of Lace Cur- g 4 tains, inported and domestic portiers and *c J couch covers; Rugs, small as 12x20 inches ? J and large as 15x20 feet; Linoleum in grades J 4 A, B, C, D, E; all widths floor oil cloth, roller <* J shados, carpet sweepers, curtain poles, ? g mouldings, brass fixtures and trimmings of g 4 all kinds, in fact everything carried in an up g J to date department can be obtained from ?
+
+ Rewrite with a reasonably high temperature (0.7):
+
+ > The Douglas Island News.
+ >
+ > VOL. G. DOUGLAS CITY AND TREADWELL, ALASKA, WEDNESDAY, AUGUST 3, 1904 NO. 37
+ >
+ > Wall Paper & Carpets
+ >
+ > Spring House Cleaning is almost here, and it brings with it a lot of trouble. The house has to be papered from top to bottom, the floor carpeted, lace curtains put up, linoleum on kitchen floor, and various other things, when carried out are the making of our Alaska homes.
+ >
+ > Now we have prepared a way out of this trouble for you. We have had a special department made, where we are showing 40,000 rolls of Paper ranging in price from 10 c per roll up; 50 different styles of carpets, from 50 c up, such as genuine home made rag carpets (washable), Brussels, Velvets, Axminsters, and Ingrains; all grades and styles of Lace Curtains, imported and domestic portiers and couch covers; Rugs, small as 12x20 inches and large as 15x20 feet; Linoleum in grades A, B, C, D, E; all widths, floor oil cloth, roller shades, carpet sweepers, curtain poles, mouldings, brass fixtures and trimmings of all kinds, in fact everything carried in an up to date department can be obtained from B. M. BEHRENDS CO., INC.
+
+
+ ## Usage
+ OCRonos uses a custom instruction structure, `### Text ###\n[text]\n\n### Correction ###\n`, and a custom end-of-sequence marker, `#END#`.
+
+ Typical usage with vLLM:
+
+ ```python
+ from vllm import SamplingParams
+ # `llm` is assumed to be a vllm.LLM instance already loaded with the OCRonos weights
+ sampling_params = SamplingParams(temperature=0.9, top_p=0.95, max_tokens=4000, presence_penalty=0, stop=["#END#"])
+ prompt = "### Text ###\n" + user_input + "\n\n### Correction ###\n"
+ outputs = llm.generate([prompt], sampling_params, use_tqdm=False)  # one result per prompt
+ corrected = outputs[0].outputs[0].text
+ ```
+
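The instruction structure described above can be wrapped in two small helpers, which keeps the prompt format and the end marker in one place. This is only a sketch: `build_prompt` and `clean_completion` are hypothetical names, not part of any OCRonos or vLLM API, and trimming surrounding whitespace from the completion is an assumption about what callers want.

```python
TEXT_HEADER = "### Text ###\n"
CORRECTION_HEADER = "\n\n### Correction ###\n"
END_MARKER = "#END#"


def build_prompt(ocr_text: str) -> str:
    """Wrap raw OCR text in the '### Text ### / ### Correction ###' structure."""
    return TEXT_HEADER + ocr_text + CORRECTION_HEADER


def clean_completion(completion: str) -> str:
    """Keep only what precedes the end marker, if the marker is present.

    When the marker is passed as a stop string the engine normally cuts the
    output already; this is a safety net for other setups (assumption).
    """
    return completion.split(END_MARKER, 1)[0].strip()


if __name__ == "__main__":
    print(repr(build_prompt("I Wall Paper $ Carpets f")))
    print(clean_completion("Wall Paper & Carpets #END# trailing text"))
```

For a batch of documents, a list such as `[build_prompt(t) for t in texts]` could be built once and submitted in a single generation call.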
+ ## Issues
+ LLMs are theoretically well suited to OCR correction, since they are trained to predict the most probable word. However, they are usually trained on native web content rather than on sources with digitization artifacts.
+
+ In past experiments, a common issue with LLM-based OCR correction has been *language switching*: because of the inherent noise in the input text, an LLM may transcribe in a different language or even in a different script (such as Cyrillic). The issue has been observed especially in smaller generalist models like GPT-3.5 or Claude-Haiku.
+
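A switch to a different script, as described above, can be flagged with a rough heuristic that compares the Unicode scripts of the input and the output. The sketch below is an illustration under stated assumptions, not something the OCRonos authors describe: `script_counts`, `looks_script_switched` and the 0.2 threshold are all hypothetical, and the script is simply read from the first word of each character's Unicode name. It does not detect a language switch within the same script (for example French to English).

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script, e.g. 'LATIN' or 'CYRILLIC'."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names begin with the script: "LATIN SMALL LETTER A"
            name = unicodedata.name(ch, "")
            counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts


def looks_script_switched(source: str, corrected: str, threshold: float = 0.2) -> bool:
    """True if too many output letters use a script absent from the input."""
    src = script_counts(source)
    out = script_counts(corrected)
    total = sum(out.values())
    if total == 0:
        return False
    foreign = sum(n for script, n in out.items() if script not in src)
    return foreign / total > threshold


if __name__ == "__main__":
    print(looks_script_switched("The Douglas Island News.", "Новости острова Дуглас."))
    print(looks_script_switched("I Wall Paper $ Carpets f", "Wall Paper & Carpets"))
```

Flagged outputs could be regenerated or routed to manual review; the right threshold would depend on the corpus.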
+ OCRonos largely mitigates this issue. A few instabilities have been noticed at scale, mainly the insertion of repeated words, but these can generally be filtered out.
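One possible way to filter out repeated words is a regular expression that collapses runs of the same word. This is a minimal sketch and not the filter used by PleIAs: `collapse_repeats` and its `min_run` parameter are hypothetical, and the default of three consecutive occurrences is an assumption chosen so that legitimate doubles such as "had had" are left alone.

```python
import re


def collapse_repeats(text: str, min_run: int = 3) -> str:
    """Collapse a word repeated at least `min_run` times in a row to one copy."""
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # A captured word followed by (min_run - 1) or more copies of itself
    pattern = re.compile(
        r"\b(\w+)(?:\s+\1\b){%d,}" % (min_run - 1),
        flags=re.IGNORECASE,
    )
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    print(collapse_repeats("the the the house has to be papered"))
    print(collapse_repeats("he had had enough"))
```

Since the substitution alters the text, logging which outputs were changed may be preferable to applying it silently over a large corpus.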