simondg committed c86f565
1 parent: af74633

update README

README.md CHANGED
@@ -1,4 +1,4 @@
-# ByT5 Dutch OCR
+# ByT5 Dutch OCR Correction
 
 This model is a finetuned byT5 model that corrects OCR mistakes found in dutch sentences. The [google/byt5-base](https://huggingface.co/google/byt5-base) model is finetuned on the dutch section of the [OSCAR](https://huggingface.co/datasets/oscar) dataset.
 
@@ -8,13 +8,13 @@ This model is a finetuned byT5 model that corrects OCR mistakes found in dutch s
 ```python
 from transformers import AutoTokenizer, T5ForConditionalGeneration
 
-example_sentence = "
+example_sentence = "Ben algoritme dat op ba8i8 van kunstmatige inte11i9entie vkijwel geautomatiseerd een tekst herstelt met OCR fuuten."
 
-tokenizer = AutoTokenizer.from_pretrained('ml6team/byt5-
+tokenizer = AutoTokenizer.from_pretrained('ml6team/byt5-base-dutch-ocr-correction')
 
 model_inputs = tokenizer(example_sentence, max_length=128, truncation=True, return_tensors="pt")
 
-model = T5ForConditionalGeneration.from_pretrained('ml6team/byt5-
+model = T5ForConditionalGeneration.from_pretrained('ml6team/byt5-base-dutch-ocr-correction')
 outputs = model.generate(**model_inputs, max_length=128)
 
 tokenizer.decode(outputs[0])
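A note on `max_length=128` in the snippet above: ByT5 operates on UTF-8 bytes rather than words, so the limit is roughly 127 bytes of text plus one end-of-sequence token, and longer input is silently truncated. The sketch below approximates that budget in plain Python, without downloading the model. It is not part of the README or of the `transformers` API: `byt5_style_ids`, `fits` and `split_into_chunks` are hypothetical helper names, and the id layout (ids 0 to 2 reserved for pad, eos and unk, each byte shifted by 3) is an assumption about the ByT5 tokenizer that should be checked against the real tokenizer before relying on it.

```python
from typing import List

MAX_LENGTH = 128  # the max_length value used in the README snippet
BYTE_OFFSET = 3   # assumption: ids 0, 1, 2 are pad, eos, unk
EOS_ID = 1        # assumption: end-of-sequence id


def byt5_style_ids(text: str) -> List[int]:
    """Approximate ByT5 input ids: one id per UTF-8 byte, then EOS."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def fits(text: str, max_length: int = MAX_LENGTH) -> bool:
    """True if the text would not be truncated at max_length."""
    return len(byt5_style_ids(text)) <= max_length


def split_into_chunks(text: str, max_length: int = MAX_LENGTH) -> List[str]:
    """Greedily pack whitespace-separated words into chunks that fit.

    A single word longer than the budget is kept whole, so it would
    still be truncated by the tokenizer.
    """
    budget = max_length - 1  # reserve one position for EOS
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if not current or len(candidate.encode("utf-8")) <= budget:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


example_sentence = (
    "Ben algoritme dat op ba8i8 van kunstmatige inte11i9entie vkijwel "
    "geautomatiseerd een tekst herstelt met OCR fuuten."
)
print(fits(example_sentence))
print(split_into_chunks(" ".join([example_sentence] * 3)))
```

Each chunk returned by `split_into_chunks` could then be passed through the tokenizer and `model.generate` separately. Splitting on whitespace is only one possible choice; splitting on sentence boundaries may give the model more useful context.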