simondg committed on
Commit
c86f565
1 Parent(s): af74633

update README

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -1,4 +1,4 @@
- # ByT5 Dutch OCR correction
+ # ByT5 Dutch OCR Correction

  This model is a finetuned byT5 model that corrects OCR mistakes found in dutch sentences. The [google/byt5-base](https://huggingface.co/google/byt5-base) model is finetuned on the dutch section of the [OSCAR](https://huggingface.co/datasets/oscar) dataset.

@@ -8,13 +8,13 @@ This model is a finetuned byT5 model that corrects OCR mistakes found in dutch s
  ```python
  from transformers import AutoTokenizer, T5ForConditionalGeneration

- example_sentence = "Een algoritme dat op basis van kunstmatige inte11i9entie vkijwe1 geautomatiseerd een Nederlandstalige tekst samenstelt."
+ example_sentence = "Ben algoritme dat op ba8i8 van kunstmatige inte11i9entie vkijwel geautomatiseerd een tekst herstelt met OCR fuuten."

- tokenizer = AutoTokenizer.from_pretrained('ml6team/byt5-small-dutch-ocr-correction')
+ tokenizer = AutoTokenizer.from_pretrained('ml6team/byt5-base-dutch-ocr-correction')

  model_inputs = tokenizer(example_sentence, max_length=128, truncation=True, return_tensors="pt")

- model = T5ForConditionalGeneration.from_pretrained('ml6team/byt5-small-dutch-ocr-correction')
+ model = T5ForConditionalGeneration.from_pretrained('ml6team/byt5-base-dutch-ocr-correction')
  outputs = model.generate(**model_inputs, max_length=128)

  tokenizer.decode(outputs[0])
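
Review note: ByT5 works on raw bytes, so the `max_length=128` used in the README snippet most likely counts UTF-8 bytes plus an end-of-sequence token, not words. The sketch below mimics that mapping in plain Python without downloading the model. The helper name `byt5_like_ids` is hypothetical, and the offset of 3 (ids 0, 1, 2 reserved for pad, EOS, unk) is an assumption based on the standard ByT5 tokenizer, not something this commit states.

```python
# Rough sketch of ByT5-style byte-level tokenization (no model download needed).
# Assumption: ids 0, 1, 2 are reserved for pad, EOS and unk, so each UTF-8
# byte b maps to id b + 3, and EOS (id 1) is appended at the end.

SPECIAL_OFFSET = 3
EOS_ID = 1


def byt5_like_ids(text, max_length=128):
    """Hypothetical helper: encode text to byte ids, truncated to max_length."""
    ids = [b + SPECIAL_OFFSET for b in text.encode("utf-8")]
    # Leave room for the EOS token when truncating.
    ids = ids[: max_length - 1]
    return ids + [EOS_ID]


# The updated example sentence from this commit.
example_sentence = (
    "Ben algoritme dat op ba8i8 van kunstmatige inte11i9entie vkijwel "
    "geautomatiseerd een tekst herstelt met OCR fuuten."
)

ids = byt5_like_ids(example_sentence)
# True when the whole sentence plus EOS fits in the 128-token budget.
fits = len(example_sentence.encode("utf-8")) + 1 <= 128
print(len(ids), fits)
```

If this assumption holds, inputs longer than roughly 127 bytes would be cut off by `truncation=True`, so longer OCR output would probably need to be split into sentences before correction. Accented Dutch characters such as `ë` take two bytes each and use up the budget faster.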