---
library_name: transformers
license: mit
datasets:
  - c3rl/IIIT-INDIC-HW-WORDS-Hindi
language:
  - ne
metrics:
  - cer
base_model:
  - google/vit-base-patch16-224-in21k
  - amitness/roberta-base-ne
pipeline_tag: image-to-text
---

TrOCR Devanagari - Handwritten Text Recognition

Overview

TrOCR Devanagari is an end-to-end Vision Encoder-Decoder model built to recognize handwritten Devanagari script (specifically Nepali) and convert it into machine-readable text. It uses a Vision Transformer (ViT) as the encoder and a transformer-based Nepali language model (NepBERT) as the decoder. The project aims to assist in digitizing handwritten Nepali documents.
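
This card does not include the code that assembles the checkpoint, but the base models listed in the metadata (google/vit-base-patch16-224-in21k and amitness/roberta-base-ne) suggest the usual 🤗 Transformers pairing. A minimal sketch, assuming that pairing and not the exact training script:

from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Sketch only: pair the ViT encoder with the Nepali RoBERTa (NepBERT) decoder.
tokenizer = AutoTokenizer.from_pretrained("amitness/roberta-base-ne")
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # vision encoder
    "amitness/roberta-base-ne",            # text decoder (cross-attention added automatically)
)

# Tell the decoder how to start, pad, and stop generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.vocab_size = model.config.decoder.vocab_size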

Model Architecture

The model pipeline includes the following steps:

  1. Text Detection: Extracts word-level regions of interest from scanned handwritten documents (see the sketch after this list).
  2. Image Preprocessing: Resizes and pads input images to feed into the model.
  3. Text Recognition: Uses the TrOCR-based Vision Encoder Decoder model to predict handwritten text.
  4. UI Interface (Optional): Displays the results and enables user interaction with the system.
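
The detection step is not included in this card. The sketch below is only an illustration of one possible approach (OpenCV thresholding, dilation, and contour extraction over a page scan), not the project's actual detector; it assumes opencv-python is installed and the page path is a placeholder.

import cv2

def detect_word_regions(page_path):
    # Illustrative word detector, not the project's detection code.
    gray = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
    # Binarize (ink becomes white) and dilate horizontally so letters of a word merge.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
    dilated = cv2.dilate(binary, kernel, iterations=1)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Return bounding boxes (x, y, w, h) for each candidate word region.
    return [cv2.boundingRect(c) for c in contours]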

Model Information

  • Model Name: TrOCR Devanagari
  • Developed by: Anil Paudel, Aayush Puri, Yubaraj Sigdel
  • Language: Nepali
  • License: MIT (tentative)
  • Model Type: Vision Encoder Decoder
  • Repository: paudelanil/trocr-devanagari-2
  • Training Data: IIIT-HW Dataset
  • Evaluation Metric: CER (Character Error Rate)
  • Hardware Used: NVIDIA RTX A4500

Getting Started

Installation

To use the model, ensure you have the following Python packages installed:

pip install torch transformers pillow

Preprocessing Function

The preprocessing function resizes an input image to the 224x224 target size while preserving its aspect ratio, then pads the remaining space with white.

from PIL import Image

def preprocess_image(image):
    """Resize to 224x224 while keeping the aspect ratio, padding with white."""
    target_size = (224, 224)
    original_size = image.size

    # Scale the longer side to 224 and derive the other side from the aspect ratio.
    aspect_ratio = original_size[0] / original_size[1]
    if aspect_ratio > 1:
        new_width = target_size[0]
        new_height = int(target_size[0] / aspect_ratio)
    else:
        new_height = target_size[1]
        new_width = int(target_size[1] * aspect_ratio)

    resized_img = image.resize((new_width, new_height))

    # Center the resized image on a white 224x224 canvas.
    padding_width = target_size[0] - new_width
    padding_height = target_size[1] - new_height
    pad_left = padding_width // 2
    pad_top = padding_height // 2

    pad_image = Image.new('RGB', target_size, (255, 255, 255))
    pad_image.paste(resized_img, (pad_left, pad_top))
    return pad_image
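
For example (the file name is just a placeholder), a wide 600x200 word crop is resized to 224x74 and then centered on a white 224x224 canvas:

img = Image.open("word_crop.jpg").convert("RGB")  # e.g. a 600x200 crop
padded = preprocess_image(img)
print(padded.size)  # (224, 224)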

Prediction Code

Here’s how you can use the model for text recognition:

import torch
from PIL import Image
from transformers import AutoTokenizer, VisionEncoderDecoderModel, ViTFeatureExtractor, TrOCRProcessor

# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("aayushpuri01/TrOCR-Devanagari")
model1 = VisionEncoderDecoderModel.from_pretrained("aayushpuri01/TrOCR-Devanagari")
feature_extractor1 = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
processor1 = TrOCRProcessor(feature_extractor=feature_extractor1, tokenizer=tokenizer)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model1.to(device)

# Prediction function: takes a path to a word image and returns the decoded text
def predict(image_path):
    # Load and preprocess the image
    image = Image.open(image_path).convert("RGB")
    image = preprocess_image(image)
    pixel_values = processor1(image, return_tensors="pt").pixel_values.to(device)
    
    # Generate text from the image
    generated_ids = model1.generate(pixel_values)
    generated_text = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    return generated_text

Usage Example

# Load and predict
image_path = "path_to_your_image.jpg"
predicted_text = predict(image_path)
print("Predicted Text:", predicted_text)

Training Hyperparameters

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir='/workspace/checkpoint-save/',
    save_total_limit=2,
    logging_steps=2,
    save_steps=1000,
    eval_steps=1000,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,
    num_train_epochs=15
)
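
Because metric_for_best_model is "cer", the trainer needs a compute_metrics function that returns a "cer" key. A minimal sketch using the 🤗 evaluate library (an assumption; the exact training code is not part of this card), reusing the processor1 defined earlier:

import evaluate

cer_metric = evaluate.load("cer")  # requires: pip install evaluate jiwer

def compute_metrics(pred):
    label_ids = pred.label_ids
    # Undo the -100 label masking before decoding the references.
    label_ids[label_ids == -100] = processor1.tokenizer.pad_token_id
    pred_str = processor1.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor1.batch_decode(label_ids, skip_special_tokens=True)
    return {"cer": cer_metric.compute(predictions=pred_str, references=label_str)}

This function would then be passed to Seq2SeqTrainer via compute_metrics=compute_metrics together with training_args.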

License

The model is shared under the MIT license. For details, see the LICENSE file.

Acknowledgments

This model is built with the 🤗 Transformers library and combines a ViT encoder with a NepBERT decoder. Special thanks to the IIIT-HW dataset contributors.


Feel free to explore the project and contribute to the repository!