Papers
arxiv:2409.13592

YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models

Published on Sep 20 · Submitted by abhi1nandy2 on Sep 23
#2 Paper of the day

Abstract

Understanding satire and humor is a challenging task even for current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason an image is satirical), and Completion (given one half of an image, selecting the other half from two given options so that the complete image is satirical), and we release YesBut, a high-quality dataset of 2,547 images (1,084 satirical and 1,463 non-satirical) spanning different artistic styles, to evaluate these tasks. Each satirical image in the dataset depicts a normal scenario alongside a conflicting scenario that is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that these models perform poorly on the proposed tasks on the YesBut dataset in zero-shot settings, under both automated and human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.
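
As a rough illustration of what the Satirical Image Detection task asks of a model, here is a minimal zero-shot sketch using LLaVA-1.5 through Hugging Face transformers. The model choice and the image path are stand-ins for illustration, not the paper's exact evaluation setup:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# LLaVA-1.5 stands in here for the VL models benchmarked in the paper,
# and the image path is a placeholder, not the dataset's actual layout.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("yesbut_example.png")
prompt = "USER: <image>\nIs this image satirical? Answer Yes or No. ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(out[0], skip_special_tokens=True))  # ends with the Yes/No verdict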

Community

Paper author · Paper submitter (edited Oct 12)

🎉 Exciting News! 🎉
Our paper "YesBut: A High-Quality Annotated Multimodal Dataset for Evaluating Satire Comprehension Capability of Vision-Language Models" has been accepted as a long paper at #EMNLP2024! 🚀

Authors: Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, Ankit Raj, Pawan Goyal, Niloy Ganguly

🔑 Key Highlights:

What’s special?

  • 🗂️ YesBut Dataset: A one-of-a-kind multimodal dataset with 2,547 images, combining satirical and non-satirical content, enriched with diverse artistic styles!
  • 🖼️ Challenges: We introduce three novel tasks: Satirical Image Detection, Satirical Image Understanding, and Satirical Image Completion, each pushing the boundaries of current Vision-Language (VL) models (a toy sketch of the Completion setup follows this list).
  • 🤖 Benchmarking Results: Even cutting-edge VL models struggle with our tasks, showing the complexity of understanding irony, humor, and societal satire!
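
Satirical Image Completion is a two-way multiple choice: given one half of an image, pick the candidate half that makes the whole satirical. One plausible way to score it, sketched below with the same LLaVA-1.5 stand-in and hypothetical file names (the paper's actual harness may differ), is to stitch the given half with each option and prefer the option whose stitched image the model rates as more satirical:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative stand-in model
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def stitch(left: Image.Image, right: Image.Image) -> Image.Image:
    # Paste the two halves side by side on a white canvas.
    h = max(left.height, right.height)
    canvas = Image.new("RGB", (left.width + right.width, h), "white")
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    return canvas

def yes_logit(image: Image.Image) -> float:
    # Score "is this satirical?" by the logit of "Yes" as the next token.
    prompt = "USER: <image>\nIs this image satirical? Answer Yes or No. ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
    return next_token_logits[yes_id].item()

left = Image.open("left_half.png")                      # hypothetical paths
options = ["option_a.png", "option_b.png"]
scores = [yes_logit(stitch(left, Image.open(p))) for p in options]
print("Predicted completion:", options[scores.index(max(scores))])
```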

Future Directions:

  • Developing models that truly grasp complex human emotions like irony and humor.
  • Exploring cross-lingual satire comprehension and real-world applications in digital media.

Stay tuned for more! 🙌

·
Paper author

Check out this high-level, fun video explanation of the paper - https://www.youtube.com/watch?v=S7zJis8rEbw

NOW THIS IS RESEARCH


In Table 3, the Kosmos-2 models show relatively large gaps between test accuracy and F1, even though the class prior is not that skewed. Additionally, the two values are almost swapped after CoT is applied. How should I interpret this? Apologies in advance if I missed something in the paper.

·
Paper author

In the zero-shot CoT setting, Kosmos-2 performs well on only one of the two classes (satirical vs. non-satirical); that is why the accuracy is high while the F1 score is low.
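
As a quick numeric illustration (the counts below are made up, not the actual YesBut test split), a classifier that collapses onto one class can keep accuracy looking passable while the positive-class F1 craters:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy split: 43 satirical (1) vs. 57 non-satirical (0) images.
# Counts are illustrative only, not the actual YesBut test set.
y_true = np.array([1] * 43 + [0] * 57)

# A classifier that has collapsed onto the "non-satirical" class,
# recovering only 3 of the 43 satirical images.
y_pred = np.array([1] * 3 + [0] * 97)

print(accuracy_score(y_true, y_pred))  # 0.60  -> looks passable
print(f1_score(y_true, y_pred))        # ~0.13 -> exposes the collapse
```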


Models citing this paper: 0

Datasets citing this paper: 1

Spaces citing this paper: 0

Collections including this paper: 4