library_name: transformers
license: mit
datasets:
- c3rl/IIIT-INDIC-HW-WORDS-Hindi
language:
- ne
metrics:
- cer
base_model:
- google/vit-base-patch16-224-in21k
- amitness/roberta-base-ne
pipeline_tag: image-to-text
Model Details
TrOCR Devanagari - Handwritten Text Recognition
Overview
TrOCR Devanagari is an end-to-end Vision Encoder-Decoder model that recognizes handwritten Devanagari script, specifically Nepali, and converts it into machine-readable text. It uses a Vision Transformer (ViT) as the encoder and a transformer-based decoder (NepBERT) to produce the text output. The project aims to assist in digitizing handwritten Nepali documents.
Model Architecture
The model pipeline includes the following steps:
- Text Detection: Extracts regions of interest from scanned handwritten documents.
- Image Preprocessing: Resizes and pads input images to feed into the model.
- Text Recognition: Uses the TrOCR-based Vision Encoder Decoder model to predict handwritten text.
- UI Interface (Optional): Displays the results and enables user interaction with the system.
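The way these stages compose can be sketched as a small driver function. This is only an illustrative outline: `recognize_page` and its three callable parameters are hypothetical names, not part of this repository, and the text detection stage in particular is not provided by the model itself.

```python
from typing import Any, Callable, Iterable, List


def recognize_page(
    page: Any,
    detect_regions: Callable[[Any], Iterable[Any]],
    preprocess: Callable[[Any], Any],
    recognize: Callable[[Any], str],
) -> List[str]:
    """Run detection, preprocessing, and recognition in order.

    Returns one recognized string per detected region, in detection order.
    All three callables are supplied by the caller (hypothetical interface).
    """
    texts = []
    for region in detect_regions(page):
        prepared = preprocess(region)      # e.g. resize and pad to the model input size
        texts.append(recognize(prepared))  # e.g. the encoder-decoder prediction step
    return texts


# Toy stand-ins so the control flow can be exercised without a model:
# "detection" splits on spaces, "preprocessing" upper-cases, "recognition" reverses.
demo = recognize_page("ab cd", str.split, str.upper, lambda s: s[::-1])
```

Keeping each stage behind a plain callable means a different detector or a batched recognizer can be swapped in without touching the driver.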
Model Information
- Model Name: TrOCR Devanagari
- Developed by: Anil Paudel, Aayush Puri, Yubaraj Sigdel
- Language: Nepali
- License: MIT (tentative)
- Model Type: Vision Encoder Decoder
- Repository: paudelanil/trocr-devanagari-2
- Training Data: IIIT-HW Dataset
- Evaluation Metric: CER (Character Error Rate)
- Hardware Used: NVIDIA RTX A4500
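For readers unfamiliar with the evaluation metric, CER is the character-level edit distance between the predicted and reference text, divided by the number of reference characters. The sketch below is a minimal pure-Python version, not the evaluation code used for this model. It assumes a "character" is a single Unicode code point, so in Devanagari a dependent vowel sign (matra) counts separately from its consonant.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("CER is undefined when the references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# One missing vowel sign out of five code points gives a CER of 1/5.
example = cer(["नेपाल"], ["नेपल"])
```

Lower is better, and a value can exceed 1.0 when the prediction is much longer than the reference.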
Getting Started
Installation
To use the model, ensure you have the following Python packages installed:
pip install torch transformers pillow
Preprocessing Function
The preprocessing function scales each image to fit inside the 224×224 target while preserving its aspect ratio, then centers it on a white canvas so the remaining space is padded.
from PIL import Image

def preprocess_image(image):
    """Resize to fit 224x224 while keeping the aspect ratio, then pad with white."""
    target_size = (224, 224)
    original_size = image.size  # (width, height)
    aspect_ratio = original_size[0] / original_size[1]
    if aspect_ratio > 1:
        # Wider than tall: fill the width and scale the height down
        new_width = target_size[0]
        new_height = max(1, int(target_size[0] / aspect_ratio))
    else:
        # Taller than wide (or square): fill the height and scale the width down
        new_height = target_size[1]
        new_width = max(1, int(target_size[1] * aspect_ratio))
    resized_img = image.resize((new_width, new_height))
    # Split the leftover space evenly so the content is centered
    padding_width = target_size[0] - new_width
    padding_height = target_size[1] - new_height
    pad_left = padding_width // 2
    pad_top = padding_height // 2
    pad_image = Image.new('RGB', target_size, (255, 255, 255))
    pad_image.paste(resized_img, (pad_left, pad_top))
    return pad_image
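To make the resize-and-pad arithmetic concrete, the sketch below reproduces only the size and offset calculation, with no image library involved. `fit_and_pad_geometry` is a hypothetical helper written for illustration; it follows the same branch logic as `preprocess_image` and clamps the scaled side to at least one pixel as a guard for extreme aspect ratios.

```python
def fit_and_pad_geometry(width: int, height: int, target: int = 224):
    """Return (new_width, new_height, pad_left, pad_top) for a square target."""
    aspect_ratio = width / height
    if aspect_ratio > 1:
        # Wider than tall: the width fills the target, the height shrinks
        new_width = target
        new_height = max(1, int(target / aspect_ratio))
    else:
        # Taller than wide (or square): the height fills the target
        new_height = target
        new_width = max(1, int(target * aspect_ratio))
    pad_left = (target - new_width) // 2
    pad_top = (target - new_height) // 2
    return new_width, new_height, pad_left, pad_top


# A 448x112 word image (4:1) becomes 224x56, centered with 84 px above it.
wide = fit_and_pad_geometry(448, 112)   # (224, 56, 0, 84)
# A 100x200 image (1:2) becomes 112x224, centered with 56 px to its left.
tall = fit_and_pad_geometry(100, 200)   # (112, 224, 56, 0)
```

Handwritten word crops are usually much wider than tall, so most of the 224×224 canvas ends up as white padding above and below the text.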
Prediction Code
Here’s how you can use the model for text recognition:
import torch
from PIL import Image
from transformers import AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel, ViTImageProcessor

# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("aayushpuri01/TrOCR-Devanagari")
model1 = VisionEncoderDecoderModel.from_pretrained("aayushpuri01/TrOCR-Devanagari")
# ViTImageProcessor replaces the deprecated ViTFeatureExtractor in recent transformers releases
image_processor1 = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224-in21k')
processor1 = TrOCRProcessor(image_processor=image_processor1, tokenizer=tokenizer)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model1.to(device)

# Prediction function
def predict(image_path):
    # Load the image and apply the resize-and-pad preprocessing
    image = Image.open(image_path).convert("RGB")
    image = preprocess_image(image)
    pixel_values = processor1(image, return_tensors="pt").pixel_values.to(device)
    # Generate token IDs from the image and decode them to text
    generated_ids = model1.generate(pixel_values)
    generated_text = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_text
Usage Example
# Load and predict
image_path = "path_to_your_image.jpg"
predicted_text = predict(image_path)
print("Predicted Text:", predicted_text)
Training Hyperparameters
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir='/workspace/checkpoint-save/',
    save_total_limit=2,
    logging_steps=2,
    save_steps=1000,
    eval_steps=1000,
    save_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="cer",
    greater_is_better=False,  # lower CER is better
    num_train_epochs=15
)
License
The model is shared under the MIT license. For details, see the LICENSE file.
Acknowledgments
This model is built with the 🤗 Transformers library and uses a ViT encoder with a NepBERT decoder. Special thanks to the IIIT-HW dataset contributors.
Feel free to explore the project and contribute to the repository!