Visual Question Answering (VQA) Model

This is a multimodal Visual Question Answering system built for my Bachelor's final project. It combines a Vision Transformer (ViT) image encoder and a SmolLM2 language model using a cross-attention mechanism.

Model Architecture

  • Vision Encoder: Pretrained ViT
  • Language Model: SmolLM2-135M
  • Fusion: Cross-attention layer aligning vision and language features (see the sketch below)
  • Training data: VQA v2 and LLaVA datasets
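
The cross-attention fusion is the core of the design: the language model's token states query the ViT patch embeddings so that textual reasoning is grounded in the image. The block below is a minimal sketch of such a layer, not the released implementation; the module name, the 576/768 hidden sizes, and the head count are assumptions.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text hidden states attend to ViT patch features.

    Dimensions are assumptions (576 for SmolLM2-135M hidden states, 768 for
    ViT-Base patch embeddings), not the exact configuration of this checkpoint.
    """

    def __init__(self, text_dim=576, vision_dim=768, num_heads=8):
        super().__init__()
        # Project ViT patch embeddings into the language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_hidden, vision_feats):
        # text_hidden:  (batch, text_len, text_dim)      token states from SmolLM2
        # vision_feats: (batch, num_patches, vision_dim) patch embeddings from ViT
        v = self.vision_proj(vision_feats)
        attended, _ = self.cross_attn(query=text_hidden, key=v, value=v)
        # Residual connection keeps the original token representation intact.
        return self.norm(text_hidden + attended)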

How to Use

from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("mehmetkuzucu/Waffle-v1.0")
model = AutoModelForVisualQuestionAnswering.from_pretrained("mehmetkuzucu/Waffle-v1.0")

image = Image.open("example.jpg")
question = "What is the person doing?"

# Preprocess the image-question pair into tensors the model expects.
inputs = processor(images=image, text=question, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Greedy decode: pick the highest-scoring token at each position.
answer = processor.tokenizer.decode(outputs.logits.argmax(-1)[0], skip_special_tokens=True)
print(answer)
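
Since SmolLM2 is a decoder-only language model, the checkpoint may also produce free-form answers through generation rather than a per-position argmax. The snippet below continues from the code above and is only a sketch under that assumption; it presumes the loaded model exposes the standard generate() API, which this card does not confirm.

# Sketch: generation-based answering (assumes the model supports generate()).
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))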