IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
Abstract
The IVY-FAKE dataset and the Ivy Explainable Detector (IVY-XDETECTOR) architecture address the limitations of current AIGC detection by providing a unified, explainable framework for both images and videos.
The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no existing approach detects both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE, a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose the Ivy Explainable Detector (IVY-XDETECTOR), a unified architecture that jointly performs detection and explanation for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at https://huggingface.co/datasets/AI-Safeguard/Ivy-Fake.
Community
Project page: https://pi3ai.github.io/IvyFake
🚀 This paper introduces IVY-FAKE, a groundbreaking framework to tackle the rapidly growing challenge of detecting sophisticated AI-generated images and videos. Current detection methods often act as black boxes and struggle to handle both images and videos seamlessly. IVY-FAKE offers a unified and explainable benchmark!
😆 Takeaways:
- Unified Multimodal Dataset (IVY-FAKE): The first large-scale benchmark designed for explainable AIGC detection across both images and videos. It boasts over 150,000 richly annotated training samples and 18,700 evaluation examples, going beyond simple "real/fake" labels to include detailed natural-language reasoning. This addresses the fragmented modality coverage and sparse annotations of previous datasets.
- Explainable Detector (IVY-XDETECTOR): A novel vision-language architecture that performs joint detection and explanation for both image and video content. Unlike models that output only coordinates or heatmaps, IVY-XDETECTOR provides human-readable natural-language descriptions of visual artifacts.
- Addressing "Black Box" Limitations: Many existing AIGC detectors are binary classifiers with limited interpretability, hindering transparency and trust. IVY-FAKE and IVY-XDETECTOR are designed to overcome this.
- Rich Annotations and Progressive Training: The dataset includes detailed reasoning for every sample, enabling a more nuanced evaluation of models' interpretability and explanatory capabilities. Annotations were generated using Gemini 2.5 Pro with a structured approach that articulates reasoning before stating a conclusion. IVY-XDETECTOR is trained with a three-stage pipeline: 1) general video understanding, 2) AIGC detection fine-tuning for binary classification, and 3) joint optimization for detection and explainability.
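The reasoning-before-conclusion annotation format described above can be sketched as a simple parsing step. Note that the tag names (`<think>`, `<conclusion>`) and the parsing logic below are illustrative assumptions for exposition, not the paper's exact annotation schema:

```python
import re

def parse_annotation(response: str) -> dict:
    """Split an annotator/model response into its reasoning trace and final
    verdict. Assumes a hypothetical reasoning-before-conclusion format:
    <think>...</think><conclusion>real|fake</conclusion>."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    concl = re.search(r"<conclusion>(.*?)</conclusion>", response, re.DOTALL)
    if not think or not concl:
        raise ValueError("response does not follow the expected format")
    verdict = concl.group(1).strip().lower()
    if verdict not in {"real", "fake"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"reasoning": think.group(1).strip(), "label": verdict}

example = (
    "<think>Edge artifacts around the subject's hair and unnaturally "
    "smooth skin texture suggest a diffusion-based generator.</think>"
    "<conclusion>fake</conclusion>"
)
print(parse_annotation(example)["label"])  # fake
```

Requiring the reasoning span before the verdict makes the explanation a first-class training target rather than a post-hoc add-on, which is the key idea behind the dataset's rich annotations.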
This work provides a significant step towards more transparent and trustworthy AI content analysis, offering a robust foundation for future research in multimodal AIGC detection.