	---
	license: cc-by-nc-4.0
	inference: false
	base_model: naver-clova-ix/donut-base
	tags:
	- donut
	- image-to-text
	- vision
	model-index:
	- name: donut-receipts-extract
	  results:
	  - task:
	      type: image-to-text
	      name: Image to text
	    metrics:
	    - type: loss
	      value: 0.326069
	    - type: accuracy
	      value: 0.895219
	      name: Accuracy
	    - type: cer
	      value: 0.158358
	      name: CER
	    - type: wer
	      value: 1.673989
	      name: WER
	    - type: edit distance
	      value: 0.145293
	      name: Edit_distance
	metrics:
	- cer
	- wer
	- accuracy
	datasets:
	- AdamCodd/donut-receipts
	pipeline_tag: image-to-text
	extra_gated_prompt: "To get access to this model, send an email to [email protected] and provide a brief description of your project or application. Requests without this information will not be considered, and access will not be granted under any circumstances."
	extra_gated_fields:
	  Company/University: text
	  Country: country
	---

	# Donut-receipts-extract

	The Donut model was introduced in the paper [OCR-free Document Understanding Transformer](https://arxiv.org/abs/2111.15664) by Geewook Kim et al. and first released in [this repository](https://github.com/clovaai/donut).

	## === V2 ===

	This model has been retrained on an improved version of the [AdamCodd/donut-receipts](https://huggingface.co/datasets/AdamCodd/donut-receipts) dataset (deduplicated, manually corrected). The new license for the V2 model is cc-by-nc-4.0. For commercial use rights, please contact me ([email protected]). Meanwhile, the V1 model remains available under the MIT license (on the `v1` branch).

	It achieves the following results on the evaluation set:
	* Loss: 0.326069
	* Edit distance: 0.145293
	* CER: 0.158358
	* WER: 1.673989
	* Mean accuracy: 0.895219
	* F1: 0.977897
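
As a rough illustration of what the CER and WER figures measure, the sketch below computes both from a Levenshtein edit distance. It is not the evaluation code used for this model (that code is not shown in this card); the helper names and sample strings are made up. It also shows why WER can exceed 1, as it does in the results above: when the prediction contains many extra words, the number of edits can be larger than the number of reference words.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, prediction) / len(reference)


def wer(reference: str, prediction: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


# Made-up strings, not taken from the dataset.
print(cer("kitten", "sitting"))                 # 0.5
print(wer("total 12.00", "total 12 . 00 usd"))  # 2.0
```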

	The task_prompt has been changed to ``<s_receipt>`` for V2 (previously ``<s_cord-v2>`` for V1). Two new keys, ``<s_svc>`` and ``<s_discount>``, have been added, and ``<s_telephone>`` has been renamed to ``<s_phone>``.
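
If you need to support both versions in one codebase, the differences described above can be kept in a small lookup table. This is only a sketch: `VERSION_SETTINGS` and `rename_v1_keys` are hypothetical helpers, the sample output is made up, and it assumes that V2 lives on the default `main` branch and that parsed key names match the tag names without the `<s_...>` wrapper.

```python
from typing import Any, Dict

# Per-version settings taken from the notes above.
# Assumption: V2 is on the default "main" branch; V1 is on the "v1" branch.
VERSION_SETTINGS: Dict[str, Dict[str, str]] = {
    "v1": {"revision": "v1", "task_prompt": "<s_cord-v2>"},
    "v2": {"revision": "main", "task_prompt": "<s_receipt>"},
}

# Keys that were renamed between V1 and V2.
V1_TO_V2_KEYS = {"telephone": "phone"}


def rename_v1_keys(node: Any) -> Any:
    """Recursively rename V1 keys to their V2 names in parsed output."""
    if isinstance(node, dict):
        return {V1_TO_V2_KEYS.get(key, key): rename_v1_keys(value)
                for key, value in node.items()}
    if isinstance(node, list):
        return [rename_v1_keys(item) for item in node]
    return node


# Made-up V1-style output, for illustration only.
v1_output = {"telephone": "555-0100", "items": [{"telephone": "555-0101"}]}
print(rename_v1_keys(v1_output))
print(VERSION_SETTINGS["v1"]["task_prompt"])
```

The `revision` value can then be passed to `from_pretrained(..., revision=...)` when loading the processor and model.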

	V2 performs substantially better than V1 because it was trained on receipts at twice the resolution, using a better dataset. It is still not perfect, because the training data lacks diversity (the dataset is still only ~1100 receipts); increasing that diversity will be the main focus of a future version.

	## === V1 ===

	This model is a finetune of the [donut base model](https://huggingface.co/naver-clova-ix/donut-base/) on the [AdamCodd/donut-receipts](https://huggingface.co/datasets/AdamCodd/donut-receipts) dataset. Its purpose is to efficiently extract text from receipts.

	It achieves the following results on the evaluation set:
	* Loss: 0.498843
	* Edit distance: 0.198315
	* CER: 0.213929
	* WER: 7.634032
	* Mean accuracy: 0.843472

	## Model description

	Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape `(batch_size, seq_len, hidden_size)`. The decoder then generates text autoregressively, conditioned on the encoder output.

	![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/donut_architecture.jpg)
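
To make "autoregressively" concrete, here is a toy sketch of a greedy decoding loop. It is not Donut's implementation: `greedy_decode` and `toy_next_token` are hypothetical stand-ins, and the "encoder states" are just a small list of numbers. The point is only the control flow: each new token is chosen from the encoder output plus all tokens generated so far, until an end-of-sequence token appears or a length limit is reached.

```python
from typing import Callable, List, Sequence

EncoderStates = Sequence[Sequence[float]]  # stand-in for (seq_len, hidden_size)


def greedy_decode(
    encoder_states: EncoderStates,
    next_token: Callable[[EncoderStates, List[int]], int],
    start_id: int,
    eos_id: int,
    max_length: int,
) -> List[int]:
    """Generate token ids one at a time until EOS or max_length is reached."""
    tokens = [start_id]  # plays the role of the task prompt
    while len(tokens) < max_length:
        token = next_token(encoder_states, tokens)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def toy_next_token(encoder_states: EncoderStates, tokens: List[int]) -> int:
    """Toy 'decoder': emits the number of encoder positions, then EOS (0)."""
    return len(encoder_states) if len(tokens) == 1 else 0


states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up values
print(greedy_decode(states, toy_next_token, start_id=1, eos_id=0, max_length=8))
# [1, 3, 0]
```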


	### How to use

	```python
	import torch
	import re
	from PIL import Image
	from transformers import DonutProcessor, VisionEncoderDecoderModel

	device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
	processor = DonutProcessor.from_pretrained("AdamCodd/donut-receipts-extract")
	model = VisionEncoderDecoderModel.from_pretrained("AdamCodd/donut-receipts-extract")
	model.to(device)

	def load_and_preprocess_image(image_path: str, processor):
	    """
	    Load an image and preprocess it for the model.
	    """
	    image = Image.open(image_path).convert("RGB")
	    pixel_values = processor(image, return_tensors="pt").pixel_values
	    return pixel_values

	def generate_text_from_image(model, image_path: str, processor, device):
	    """
	    Generate text from an image using the trained model.
	    """
	    # Load and preprocess the image
	    pixel_values = load_and_preprocess_image(image_path, processor)
	    pixel_values = pixel_values.to(device)

	    # Generate output using model
	    model.eval()
	    with torch.no_grad():
	        task_prompt = "<s_receipt>" # <s_cord-v2> for v1
	        decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
	        decoder_input_ids = decoder_input_ids.to(device)
	        generated_outputs = model.generate(
	            pixel_values,
	            decoder_input_ids=decoder_input_ids,
	            max_length=model.decoder.config.max_position_embeddings,
	            pad_token_id=processor.tokenizer.pad_token_id,
	            eos_token_id=processor.tokenizer.eos_token_id,
	            early_stopping=True,
	            bad_words_ids=[[processor.tokenizer.unk_token_id]],
	            return_dict_in_generate=True
	        )

	    # Decode generated output
	    decoded_text = processor.batch_decode(generated_outputs.sequences)[0]
	    decoded_text = decoded_text.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
	    decoded_text = re.sub(r"<.*?>", "", decoded_text, count=1).strip() # remove first task start token
	    decoded_text = processor.token2json(decoded_text)
	    return decoded_text

	# Example usage
	image_path = "path_to_your_image" # Replace with your image path
	extracted_text = generate_text_from_image(model, image_path, processor, device)
	print("Extracted Text:", extracted_text)
	```
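
The clean-up step near the end of the function can be tried in isolation, without loading the model. The sketch below applies the same two operations to a made-up decoded string. The literal values `"</s>"` and `"<pad>"` are assumptions about the tokenizer's special tokens; in real use, read them from `processor.tokenizer.eos_token` and `processor.tokenizer.pad_token` as the code above does.

```python
import re

# Assumed special-token strings (see the note above).
EOS_TOKEN = "</s>"
PAD_TOKEN = "<pad>"


def strip_special_tokens(decoded: str) -> str:
    """Remove EOS/PAD tokens and the leading task-start token."""
    cleaned = decoded.replace(EOS_TOKEN, "").replace(PAD_TOKEN, "")
    # count=1 removes only the first tag, which is the task prompt.
    return re.sub(r"<.*?>", "", cleaned, count=1).strip()


# Made-up model output, for illustration only.
sample = "<s_receipt><s_phone>555-0100</s_phone></s><pad><pad>"
print(strip_special_tokens(sample))
# <s_phone>555-0100</s_phone>
```

The remaining tagged string is what `processor.token2json` then turns into a nested dictionary.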

	Refer to the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/donut) for more code examples.

	## Intended uses & limitations

	This fine-tuned model is designed specifically for extracting text from receipts and may not perform well on other types of documents. The dataset is still suboptimal (it still contains numerous errors), so the model will need to be retrained later to improve its performance.

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 3e-05
	- train_batch_size: 2
	- eval_batch_size: 4
	- seed: 42
	- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 300
	- num_epochs: 35
	- weight_decay: 0.01
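
To show what a linear schedule with 300 warmup steps does to the learning rate, here is a minimal sketch of the usual linear-warmup, linear-decay rule. It is written from the general definition of that schedule, not from this model's training script, and `total_steps` is a hypothetical value because the total number of optimizer steps is not stated in this card.

```python
def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 3e-5,
    warmup_steps: int = 300,
) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical, for illustration only
for step in (0, 150, 300, 5_150, 10_000):
    print(step, linear_schedule_lr(step, TOTAL_STEPS))
```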

	### Framework versions

	- Transformers 4.36.2
	- Datasets 2.16.1
	- Tokenizers 0.15.0
	- Evaluate 0.4.1

	If you want to support me, you can do so [here](https://ko-fi.com/adamcodd).

	### BibTeX entry and citation info

	```bibtex
	@article{DBLP:journals/corr/abs-2111-15664,
	  author     = {Geewook Kim and
	                Teakgyu Hong and
	                Moonbin Yim and
	                Jinyoung Park and
	                Jinyeong Yim and
	                Wonseok Hwang and
	                Sangdoo Yun and
	                Dongyoon Han and
	                Seunghyun Park},
	  title      = {Donut: Document Understanding Transformer without {OCR}},
	  journal    = {CoRR},
	  volume     = {abs/2111.15664},
	  year       = {2021},
	  url        = {https://arxiv.org/abs/2111.15664},
	  eprinttype = {arXiv},
	  eprint     = {2111.15664},
	  timestamp  = {Thu, 02 Dec 2021 10:50:44 +0100},
	  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-15664.bib},
	  bibsource  = {dblp computer science bibliography, https://dblp.org}
	}
	```