MGP-STR (base-sized model)

The MGP-STR base-sized model was trained on MJSynth and SynthText. It was introduced in the paper Multi-Granularity Prediction for Scene Text Recognition and first released in this repository.

Model description

MGP-STR is a pure vision scene text recognition (STR) model consisting of a ViT and specially designed A^3 modules. The ViT module was initialized from the weights of DeiT-base, except for the patch embedding layer, because the input size differs.

Images (32x128) are presented to the model as a sequence of fixed-size patches (resolution 4x4), which are linearly embedded. Absolute position embeddings are added before the sequence is fed to the layers of the ViT module. Next, the A^3 module selects a meaningful combination of tokens from the ViT output and integrates them into one output token corresponding to a specific character. In addition, subword classification heads based on a BPE A^3 module and a WordPiece A^3 module produce subword predictions, so that language information is modeled implicitly. Finally, these multi-granularity predictions (character, subword and even word) are merged via a simple and effective fusion strategy.
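
To make the patch arithmetic and the token aggregation idea concrete, here is a minimal pure-Python sketch. It is not the model's implementation. The helper names `patch_grid` and `aggregate` are hypothetical, and the softmax-weighted sum is only an assumed illustration of how several ViT tokens could be combined into a single output token. The patch count excludes any extra special token the ViT may prepend.

```python
import math


def patch_grid(height=32, width=128, patch=4):
    """Return (rows, cols, num_patches) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


def aggregate(tokens, scores):
    """Combine tokens into one vector using softmax-normalized scores.

    tokens: list of equal-length float lists, one per patch token.
    scores: one relevance score per token (hypothetical input; in a real
            model these would be learned, here they are supplied by hand).
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


rows, cols, num_patches = patch_grid()
print(rows, cols, num_patches)  # 8 32 256
```

With the default 32x128 input and 4x4 patches, the grid is 8 rows by 32 columns, so the ViT receives 256 patch tokens.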

Intended uses & limitations

You can use the raw model for optical character recognition (OCR) on text images. See the model hub to look for fine-tuned versions on a task that interests you.

How to use

Here is how to use this model in PyTorch:

from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
import requests
from PIL import Image

processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')

# load image from the IIIT-5k dataset
url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
outputs = model(pixel_values)

generated_text = processor.batch_decode(outputs.logits)['generated_text']
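
The model description mentions that character, BPE and WordPiece predictions are merged by a fusion strategy. The sketch below shows one way such a merge could work, assuming the rule is "keep the prediction with the highest confidence". The function `fuse`, the head names and the example scores are hypothetical and made up for illustration. They are not the library's API or real model output.

```python
def fuse(candidates):
    """Pick the candidate text with the highest confidence.

    candidates: dict mapping a head name to a (text, confidence) pair.
    Returns (head_name, text, confidence) of the winning candidate.
    """
    head = max(candidates, key=lambda name: candidates[name][1])
    text, confidence = candidates[head]
    return head, text, confidence


# Made-up values for illustration only.
example = {
    "char": ("ticket", 0.91),
    "bpe": ("ticket", 0.97),
    "wp": ("ticked", 0.62),
}
print(fuse(example))  # ('bpe', 'ticket', 0.97)
```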

BibTeX entry and citation info

@inproceedings{ECCV2022mgp_str,
  title={Multi-Granularity Prediction for Scene Text Recognition},
  author={Peng Wang and Cheng Da and Cong Yao},
  booktitle = {ECCV},
  year={2022}
}