IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
Abstract
The IVY-FAKE dataset and the Ivy Explainable Detector (IVY-XDETECTOR) architecture address the limitations of current AIGC detection by providing a unified, explainable framework for both images and videos.
The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no existing approach detects both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE, a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose the Ivy Explainable Detector (IVY-XDETECTOR), a unified architecture that jointly performs detection and explanation for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at https://huggingface.co/datasets/AI-Safeguard/Ivy-Fake.
Community
Project page: https://pi3ai.github.io/IvyFake
🚀 This paper introduces IVY-FAKE, a groundbreaking framework to tackle the rapidly growing challenge of detecting sophisticated AI-generated images and videos. Current detection methods often act as black boxes and struggle to handle both images and videos seamlessly. IVY-FAKE offers a unified and explainable benchmark!
😆 Takeaways:
- Unified Multimodal Dataset (IVY-FAKE): The first large-scale benchmark designed for explainable AIGC detection across both images and videos. It boasts over 150,000 richly annotated training samples and 18,700 evaluation examples, going beyond simple "real/fake" labels to include detailed natural-language reasoning. This addresses the fragmented modality coverage and sparse annotations of previous datasets.
- Explainable Detector (IVY-XDETECTOR): A novel vision-language architecture that performs joint detection and explanation for both image and video content. Unlike models that output only coordinates or heatmaps, IVY-XDETECTOR provides human-readable natural-language descriptions of visual artifacts.
- Addressing "Black Box" Limitations: Many existing AIGC detectors are binary classifiers with limited interpretability, hindering transparency and trust. IVY-FAKE and IVY-XDETECTOR are designed to overcome this.
- Rich Annotations and Progressive Training: The dataset includes detailed reasoning for every sample, enabling a more nuanced evaluation of models' interpretability and explanatory capabilities. Annotations were generated using Gemini 2.5 Pro with a structured approach that articulates reasoning before stating a conclusion. IVY-XDETECTOR is trained with a three-stage pipeline: 1) general video understanding, 2) AIGC detection fine-tuning for binary classification, and 3) joint optimization for detection and explainability.
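The reasoning-before-conclusion annotation format described above can be sketched as a simple parsing step. Note that the tag names (`<think>`, `<conclusion>`) and the parsing logic below are illustrative assumptions for exposition, not the paper's exact annotation schema:

```python
import re

def parse_annotation(response: str) -> dict:
    """Split an annotator/model response into its reasoning trace and final
    verdict. Assumes a hypothetical reasoning-before-conclusion format:
    <think>...</think><conclusion>real|fake</conclusion>."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    concl = re.search(r"<conclusion>(.*?)</conclusion>", response, re.DOTALL)
    if not think or not concl:
        raise ValueError("response does not follow the expected format")
    verdict = concl.group(1).strip().lower()
    if verdict not in {"real", "fake"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"reasoning": think.group(1).strip(), "label": verdict}

example = (
    "<think>Edge artifacts around the subject's hair and unnaturally "
    "smooth skin texture suggest a diffusion-based generator.</think>"
    "<conclusion>fake</conclusion>"
)
print(parse_annotation(example)["label"])  # fake
```

Requiring the reasoning span before the verdict makes the explanation a first-class training target rather than a post-hoc add-on, which is the key idea behind the dataset's rich annotations.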
This work provides a significant step towards more transparent and trustworthy AI content analysis, offering a robust foundation for future research in multimodal AIGC detection.