FaVQA - Fashion-related Visual Question Answering

Summary

FaVQA is a Vision-and-Language Pre-training (VLP) model fine-tuned for a fashion-related downstream task, Visual Question Answering (VQA). It is based on ViLT, which was proposed in "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision". ViLT incorporates text embeddings into a Vision Transformer (ViT), which gives it a minimal design for VLP.

Model Description

Model type: Visual Question Answering, ViLT
License: MIT
Train/test dataset: yanka9/deepfashion-for-VQA, derived from DeepFashion

Model Sources

Demo: 🤗 Space

How to Get Started with the Model

Use the code below to get started with the model. Usage is similar to that of the original ViLT model.

from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image

# prepare image + question
image_path = "path/to/your_image.jpg"  # placeholder: replace with your own image file
image = Image.open(image_path).convert("RGB")
text = "how long is the sleeve?"

processor = ViltProcessor.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")
model = ViltForQuestionAnswering.from_pretrained("yanka9/vilt_finetuned_deepfashionVQA_v2")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Answer:", model.config.id2label[idx])
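
The snippet above prints only the single highest-scoring answer. When a question is ambiguous, it can help to look at a few candidates. The following is a minimal sketch of that idea, not part of the original card: `top_k_answers` is a hypothetical helper, the logits and label map are toy values, and the softmax normalization is an assumption, since the card does not state how the classifier head was trained. The ranking itself does not depend on that choice.

```python
import math


def top_k_answers(logits, id2label, k=3):
    """Return the k highest-scoring (label, probability) pairs."""
    # Numerically stable softmax over the answer logits (assumed normalization).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank answer indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy logits and a hypothetical label map, for illustration only.
toy_logits = [0.2, 2.5, -1.0, 1.1]
toy_labels = {0: "short", 1: "long", 2: "sleeveless", 3: "medium"}

for label, prob in top_k_answers(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper could be called as `top_k_answers(outputs.logits[0].tolist(), model.config.id2label)`.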

Training Details

Training Data

A custom dataset was developed for training the ViLT classifier. It was derived from DeepFashion-MultiModal, a large-scale, high-quality human dataset with rich multi-modal annotations. It contains 44,096 high-resolution human images, including 12,701 full-body images. For each full-body image, the authors manually annotated human parsing labels for 24 classes.

DeepFashion-MultiModal has several other properties, but for this project only the full-body images and labels were used to generate the training dataset. The labels cover at least one of the following categories: fabric, color, and shape. In total, 209,481 questions were generated for 44,096 images. The categories used for training are listed below.

'Color.LOWER_CLOTH',
'Color.OUTER_CLOTH',
'Color.UPPER_CLOTH',
'Fabric.OUTER_CLOTH',
'Fabric.UPPER_CLOTH',
'Gender',
'Shape.CARDIGAN',
'Shape.COVERED_NAVEL',
'Shape.HAT',
'Shape.LOWER_CLOTHING_LENGTH',
'Shape.NECKWEAR',
'Shape.RING',
'Shape.SLEEVE',
'Shape.WRISTWEAR'
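
The category keys above follow an `Attribute.TARGET` naming pattern, with `Gender` as the one key that has no target part. As a rough sketch (the helper name `group_by_attribute` is hypothetical and not part of the dataset tooling), the keys can be grouped by attribute to see how many question targets each attribute covers:

```python
from collections import defaultdict

# Category keys copied from the list above.
CATEGORIES = [
    "Color.LOWER_CLOTH", "Color.OUTER_CLOTH", "Color.UPPER_CLOTH",
    "Fabric.OUTER_CLOTH", "Fabric.UPPER_CLOTH",
    "Gender",
    "Shape.CARDIGAN", "Shape.COVERED_NAVEL", "Shape.HAT",
    "Shape.LOWER_CLOTHING_LENGTH", "Shape.NECKWEAR", "Shape.RING",
    "Shape.SLEEVE", "Shape.WRISTWEAR",
]


def group_by_attribute(categories):
    """Group 'Attribute.TARGET' keys by their attribute prefix."""
    groups = defaultdict(list)
    for name in categories:
        # "Color.LOWER_CLOTH" -> ("Color", "LOWER_CLOTH"); "Gender" has no target.
        attribute, _, target = name.partition(".")
        groups[attribute].append(target or None)
    return dict(groups)


groups = group_by_attribute(CATEGORIES)
for attribute, targets in groups.items():
    print(f"{attribute}: {len(targets)} target(s)")
```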

Question Types

The model supports both open-ended and closed-ended (yes/no) questions. Below are examples of the questions generated for training.

    'how long is the sleeve?',
    'what is the length of the lower clothing?',
    'how would you describe the color of the upper cloth?',
    'whats is the color of the lower cloth?',
    'what fabric is the upper cloth made of?',
    'who is the target audience for this garment',
    'is there a hat worn?',
    'is the navel covered?',
    'does the lower clothing cover the navel?'
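
The card does not describe how questions were produced from the labels. One plausible approach is template-based generation, sketched below under stated assumptions: the pairing of category keys with question strings, the helper `generate_qa_pairs`, and the example answer values are all hypothetical and are not taken from the actual dataset pipeline.

```python
# Hypothetical pairing of category keys with question templates (assumption).
TEMPLATES = {
    "Shape.SLEEVE": ["how long is the sleeve?"],
    "Shape.LOWER_CLOTHING_LENGTH": ["what is the length of the lower clothing?"],
    "Color.UPPER_CLOTH": ["how would you describe the color of the upper cloth?"],
    "Fabric.UPPER_CLOTH": ["what fabric is the upper cloth made of?"],
    "Shape.HAT": ["is there a hat worn?"],
    "Shape.COVERED_NAVEL": [
        "is the navel covered?",
        "does the lower clothing cover the navel?",
    ],
}


def generate_qa_pairs(labels):
    """Expand one image's labels into (question, answer) pairs."""
    pairs = []
    for category, answer in labels.items():
        # Categories without a template are skipped.
        for question in TEMPLATES.get(category, []):
            pairs.append((question, answer))
    return pairs


# Made-up labels for a single image, for illustration only.
example_labels = {
    "Shape.SLEEVE": "long",
    "Shape.HAT": "no",
    "Shape.COVERED_NAVEL": "yes",
}

for question, answer in generate_qa_pairs(example_labels):
    print(f"{question} -> {answer}")
```

Several templates per category give paraphrased questions that share one answer, which would be consistent with the two navel questions in the examples above.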

