Visual Question Answering (VQA) Model

This is a multimodal Visual Question Answering system built for my Bachelor's final project. It combines a Vision Transformer (ViT) image encoder and a SmolLM2 language model using a cross-attention mechanism.

Model Architecture

  • Vision Encoder: Pretrained ViT
  • Language Model: SmolLM2-135M
  • Fusion: Cross-attention layer aligning vision and language features (see the sketch below)
  • Training data: VQA v2 and LLaVA datasets
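
The cross-attention fusion is the core of the design: the language model's token states query the ViT patch embeddings so that textual reasoning is grounded in the image. The block below is a minimal sketch of such a layer, not the released implementation; the module name, the 576/768 hidden sizes, and the head count are assumptions.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text hidden states attend to ViT patch features.

    Dimensions are assumptions (576 for SmolLM2-135M hidden states, 768 for
    ViT-Base patch embeddings), not the exact configuration of this checkpoint.
    """

    def __init__(self, text_dim=576, vision_dim=768, num_heads=8):
        super().__init__()
        # Project ViT patch embeddings into the language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, vision_feats):
        # text_hidden:  (batch, text_len, text_dim)      token states from SmolLM2
        # vision_feats: (batch, num_patches, vision_dim) patch embeddings from ViT
        v = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=v, value=v)
        # Residual connection keeps the original token representation intact.
        return self.norm(text_hidden + attended)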

How to Use

from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("mehmetkuzucu/Waffle-v1.0")
model = AutoModelForVisualQuestionAnswering.from_pretrained("mehmetkuzucu/Waffle-v1.0")

image = Image.open("example.jpg")
question = "What is the person doing?"

# Preprocess the image-question pair into tensors the model expects.
inputs = processor(images=image, text=question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Greedy decode: pick the highest-scoring token at each position.
answer = processor.tokenizer.decode(outputs.logits.argmax(-1)[0], skip_special_tokens=True)
print(answer)
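
Since SmolLM2 is a decoder-only language model, the checkpoint may also produce free-form answers through generation rather than a per-position argmax. The snippet below continues from the code above and is only a sketch under that assumption; it presumes the loaded model exposes the standard generate() API, which this card does not confirm.

# Sketch: generation-based answering (assumes the model supports generate()).
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))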