FinTabQA: Financial Table Question-Answering
A model for financial table question-answering using the LayoutLM architecture.
Quick start
To get started with FinTabQA, load the model and a fast tokenizer as you would any other Hugging Face Transformers model and tokenizer. Below is a minimal working example using the SynFinTabs dataset.
>>> from typing import List, Tuple
>>> from datasets import load_dataset
>>> from transformers import LayoutLMForQuestionAnswering, LayoutLMTokenizerFast
>>> import torch
>>>
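>>> # FinTabQA was fine-tuned from microsoft/layoutlm-base-uncased, so the
>>> # base model's fast tokenizer is loaded alongside the fine-tuned weights.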
>>> synfintabs_dataset = load_dataset("ethanbradley/synfintabs")
>>> model = LayoutLMForQuestionAnswering.from_pretrained("ethanbradley/fintabqa")
>>> tokenizer = LayoutLMTokenizerFast.from_pretrained(
...     "microsoft/layoutlm-base-uncased")
>>>
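>>> # LayoutLM expects bounding boxes normalised to a 0-1000 coordinate
>>> # space; this helper rescales the OCR boxes from the table image's
>>> # pixel size and clamps them to that range.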
>>> def normalise_boxes(
...         boxes: List[List[int]],
...         old_image_size: Tuple[int, int],
...         new_image_size: Tuple[int, int]) -> List[List[int]]:
...     old_im_w, old_im_h = old_image_size
...     new_im_w, new_im_h = new_image_size
...
...     return [[
...         max(min(int(x1 / old_im_w * new_im_w), new_im_w), 0),
...         max(min(int(y1 / old_im_h * new_im_h), new_im_h), 0),
...         max(min(int(x2 / old_im_w * new_im_w), new_im_w), 0),
...         max(min(int(y2 / old_im_h * new_im_h), new_im_h), 0)
...     ] for (x1, y1, x2, y2) in boxes]
>>>
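>>> # Take a test item, look up the question it is annotated with, and
>>> # encode the question text together with the table's OCR words.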
>>> item = synfintabs_dataset['test'][0]
>>> question_dict = next(question for question in item['questions']
...     if question['id'] == item['question_id'])
>>> encoding = tokenizer(
...     question_dict['question'].split(),
...     item['ocr_results']['words'],
...     max_length=512,
...     padding="max_length",
...     truncation="only_second",
...     is_split_into_words=True,
...     return_token_type_ids=True,
...     return_tensors="pt")
>>>
>>> word_boxes = normalise_boxes(
...     item['ocr_results']['bboxes'],
...     item['image'].crop(item['bbox']).size,
...     (1000, 1000))
>>> token_boxes = []
>>>
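>>> # Expand the word-level boxes to token level: tokens from the table
>>> # (sequence 1) take their word's box, the [SEP] token takes
>>> # [1000, 1000, 1000, 1000], and all other tokens ([CLS], question
>>> # tokens, padding) take [0, 0, 0, 0].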
>>> for i, s, w in zip(
...         encoding['input_ids'][0],
...         encoding.sequence_ids(0),
...         encoding.word_ids(0)):
...     if s == 1:
...         token_boxes.append(word_boxes[w])
...     elif i == tokenizer.sep_token_id:
...         token_boxes.append([1000] * 4)
...     else:
...         token_boxes.append([0] * 4)
>>>
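>>> # Attach the box tensor, run the model, and map the most likely start
>>> # and end token positions back to word indices via word_ids.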
>>> encoding['bbox'] = torch.tensor([token_boxes])
>>> outputs = model(**encoding)
>>> start = encoding.word_ids(0)[outputs['start_logits'].argmax(-1)]
>>> end = encoding.word_ids(0)[outputs['end_logits'].argmax(-1)]
>>>
>>> print(f"Target: {question_dict['answer']}")
Target: 6,980
>>>
>>> print(f"Prediction: {' '.join(item['ocr_results']['words'][start : end])}")
Prediction: 6,980
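For repeated queries, the steps above can be wrapped in a single function. The sketch below is a minimal consolidation of the quick start, assuming the model, tokenizer, and normalise_boxes defined above are in scope; answer_question is a hypothetical helper name, not part of the FinTabQA API. Inference is wrapped in torch.no_grad() since no gradients are needed at prediction time.
>>> def answer_question(model, tokenizer, item, question: str) -> str:
...     # Hypothetical helper consolidating the quick-start steps above.
...     words = item['ocr_results']['words']
...     encoding = tokenizer(
...         question.split(),
...         words,
...         max_length=512,
...         padding="max_length",
...         truncation="only_second",
...         is_split_into_words=True,
...         return_token_type_ids=True,
...         return_tensors="pt")
...     word_boxes = normalise_boxes(
...         item['ocr_results']['bboxes'],
...         item['image'].crop(item['bbox']).size,
...         (1000, 1000))
...     # Same token-level box expansion as the loop above, as a comprehension.
...     token_boxes = [
...         word_boxes[w] if s == 1
...         else [1000] * 4 if i == tokenizer.sep_token_id
...         else [0] * 4
...         for i, s, w in zip(
...             encoding['input_ids'][0],
...             encoding.sequence_ids(0),
...             encoding.word_ids(0))]
...     encoding['bbox'] = torch.tensor([token_boxes])
...     with torch.no_grad():
...         outputs = model(**encoding)
...     start = encoding.word_ids(0)[outputs['start_logits'].argmax(-1)]
...     end = encoding.word_ids(0)[outputs['end_logits'].argmax(-1)]
...     return ' '.join(words[start : end + 1])
Calling answer_question(model, tokenizer, item, question_dict['question']) should then reproduce the prediction above.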
Citation
If you use this model, please cite both the SynFinTabs article, using the BibTeX entry below, and the model itself.
@misc{bradley2024synfintabs,
    title = {Syn{F}in{T}abs: A Dataset of Synthetic Financial Tables for Information and Table Extraction},
    author = {Bradley, Ethan and Roman, Muhammad and Rafferty, Karen and Devereux, Barry},
    year = {2024},
    eprint = {2412.04262},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG},
    url = {https://arxiv.org/abs/2412.04262}
}