Dua-Vision-Base

A Vision Encoder-Decoder model that doesn’t just caption images but generates questions and possible answers based on what it “sees.” Using ViT as the encoder and BART as the decoder, it’s built for image-based QA without the fluff.

Translation: feed it an image, and get back a useful question-answer pair. Perfect for creating and synthesizing data in image QA tasks. It’s one model, two tasks, and a lot of potential!

#LLMs #VisionTransformer #ImageQA #AI

Dua-Vision-Base is a Vision Encoder-Decoder model that pairs a Vision Transformer (ViT) encoder with a BART decoder. The encoder turns the image into a sequence of embeddings, and the decoder generates natural-language text conditioned on them.

Model Architecture

Encoder: ViT (Vision Transformer), initialized from Google's pre-trained google/vit-base-patch16-224-in21k checkpoint.
Decoder: BART (Bidirectional and Auto-Regressive Transformers), initialized from the pre-trained facebook/bart-base checkpoint.
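
To make the encoder side concrete, here is a minimal sketch of how many tokens a ViT with 16x16 patches produces for a 224x224 image, which is the sequence the decoder attends to. The function name vit_sequence_length is made up for this illustration, and the sketch assumes the standard ViT layout with one extra [CLS] token.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16,
                        add_cls_token: bool = True) -> int:
    """Number of tokens a ViT encoder emits for a square image.

    Hypothetical helper for illustration; the defaults mirror the
    "patch16-224" part of the encoder checkpoint name.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size       # 224 // 16 = 14
    num_patches = patches_per_side * patches_per_side  # 14 * 14 = 196
    # Standard ViT prepends one [CLS] token to the patch sequence.
    return num_patches + (1 if add_cls_token else 0)


print(vit_sequence_length())  # 197
```

So, under these assumptions, each image reaches the BART decoder as a sequence of 197 encoder states.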

Usage

To use this model with images, you’ll need two components: the ViTImageProcessor, which prepares visual inputs, and the BartTokenizer, which handles text. The model is optimized for generating a question and an answer for a given image, with the following inputs and outputs:

Input:
- Images in RGB format (processed via ViTImageProcessor).
- Textual prompts using BartTokenizer for contextual initialization.
Output:
- A question and an answer, as text, generated from the visual content of the image.

Installation

!pip install transformers datasets torch torchvision pillow requests

How to Load the Model

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, BartTokenizer

# Load model, processor, and tokenizer
model = VisionEncoderDecoderModel.from_pretrained("HV-Khurdula/Dua-Vision-Base")
image_processor = ViTImageProcessor.from_pretrained("HV-Khurdula/Dua-Vision-Base")
tokenizer = BartTokenizer.from_pretrained("HV-Khurdula/Dua-Vision-Base")

Inference Example

Here's a sample usage for generating a question-answer pair for an image. It reuses the model, image_processor, and tokenizer loaded above:

import requests
from PIL import Image

# Download the image, force RGB, and convert it to model-ready pixel values
image_url = "https://example.com/image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Generate token IDs with beam search, then decode them back to text
generated_ids = model.generate(pixel_values, max_length=128, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Generated:", generated_text)
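
If you want to store the result as a data record, the decoded string has to be split into its question and answer parts. The output format is not documented here, so the sketch below assumes, purely for illustration, that the question comes first and ends at the first question mark. split_qa is a hypothetical helper, not part of the model's API; check it against real outputs before relying on it.

```python
def split_qa(generated_text: str) -> tuple[str, str]:
    """Split decoded text into (question, answer).

    ASSUMPTION: the question comes first and ends at the first '?'.
    This format is not confirmed by the model card.
    """
    text = generated_text.strip()
    idx = text.find("?")
    if idx == -1:
        # No question mark found: return everything as the question.
        return text, ""
    question = text[: idx + 1].strip()
    answer = text[idx + 1 :].strip()
    return question, answer


# Made-up example string, not real model output.
question, answer = split_qa("What color is the car? Red.")
print(question)
print(answer)
```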

Training

The model was trained on a dataset that pairs images with conversational prompts. During training, captions were generated from both the image content and the specific prompt, which improves the contextual relevance of the generated text. Fine-tuning the model on your own task is strongly recommended.

Hyperparameters

Batch Size: 16
Learning Rate: 5e-5
Epochs: 5
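
When planning a fine-tuning run with these settings, it helps to know how many optimizer steps they imply. The training set size is not stated in this card, so num_examples below is a hypothetical value. The sketch also assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int = 16,
                          epochs: int = 5) -> int:
    """Optimizer steps for a run with the listed batch size and epochs.

    ASSUMPTIONS: single device, no gradient accumulation, and the last
    partial batch of each epoch is kept rather than dropped.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset of 10,000 image-prompt pairs.
print(total_optimizer_steps(10_000))  # 3125
```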

License

This model and its code are released under the terms of the Apache 2.0 license.
