Visual Question Answering (VQA) Model
This is a multimodal Visual Question Answering system built for my Bachelor's final project. It combines a Vision Transformer (ViT) image encoder with a SmolLM2 language model through a cross-attention mechanism.
Model Architecture
- Vision Encoder: Pretrained ViT
- Language Model: SmolLM2-135M
- Fusion: Cross-attention layer aligning vision and language features (a sketch of one possible layout follows this list)
- Training data: VQA v2 and LLaVA datasets
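The repository does not document the fusion layer in detail, so the snippet below is only a minimal sketch of how such a cross-attention bridge could be wired up, assuming ViT patch embeddings of size 768, the SmolLM2-135M hidden size of 576, and a single multi-head attention block with a residual connection. The class name `CrossAttentionFusion` and all dimensions are illustrative, not the model's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: language hidden states attend over ViT patch embeddings."""

    def __init__(self, lm_dim=576, vit_dim=768, num_heads=8):
        super().__init__()
        # Project ViT patch features into the language model's hidden size.
        self.vision_proj = nn.Linear(vit_dim, lm_dim)
        self.cross_attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(lm_dim)

    def forward(self, text_hidden, image_patches):
        # text_hidden: (batch, seq_len, lm_dim) from SmolLM2
        # image_patches: (batch, num_patches, vit_dim) from the ViT encoder
        vis = self.vision_proj(image_patches)
        attended, _ = self.cross_attn(query=text_hidden, key=vis, value=vis)
        # Residual connection preserves the original language representation.
        return self.norm(text_hidden + attended)
```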
How to Use
```python
from transformers import AutoProcessor, AutoModelForVisualQuestionAnswering
from PIL import Image
import torch

# Load the processor and model (add trust_remote_code=True if the repo ships custom modeling code).
processor = AutoProcessor.from_pretrained("mehmetkuzucu/Waffle-v1.0")
model = AutoModelForVisualQuestionAnswering.from_pretrained("mehmetkuzucu/Waffle-v1.0")

# Prepare an image-question pair.
image = Image.open("example.jpg")
question = "What is the person doing?"
inputs = processor(images=image, text=question, return_tensors="pt")

# Forward pass and greedy decoding of the predicted answer tokens.
with torch.no_grad():
    outputs = model(**inputs)
answer = processor.tokenizer.decode(outputs.logits.argmax(-1)[0], skip_special_tokens=True)
```
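The argmax decoding above takes one token per position from a single forward pass. If the model exposes the standard transformers `generate()` interface, a generative decode is usually better for multi-token answers; the following is a hedged sketch under that assumption.

```python
# Only valid if the model supports transformers' generate() API.
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(answer)
```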