arxiv:2507.12841

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Published on Jul 17
· Submitted by Ruihang on Jul 18

Abstract

AI-generated summary

The AnyCap Project introduces a framework, dataset, and evaluation protocol to enhance controllability and reliability in multimodal captioning.

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
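For intuition, here is a minimal Python sketch of the plug-and-play flow described in the abstract, where ACM conditions on the base model's caption, the user instruction, and modality features while the base model stays frozen. The `acm_model` object and its `encode_media`/`generate` methods are hypothetical names used for illustration, not the released API.

```python
from dataclasses import dataclass

@dataclass
class CaptionRequest:
    media_path: str    # path to an image, audio clip, or video
    instruction: str   # user control instruction (content and/or style)
    base_caption: str  # caption produced by the frozen base model (e.g. GPT-4o)

def refine_caption(acm_model, request: CaptionRequest) -> str:
    """Refine a base caption without retraining the base model: ACM conditions on
    the original caption, the user instruction, and modality features."""
    media_features = acm_model.encode_media(request.media_path)  # hypothetical encoder call
    return acm_model.generate(                                   # hypothetical generation call
        base_caption=request.base_caption,
        instruction=request.instruction,
        media_features=media_features,
    )
```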

Community


🎯 AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning


AnyCap Project is a unified captioning framework, dataset, and benchmark that supports image, audio, and video captioning with controllable content and style. It's fully open-sourced, covering training, evaluation, and benchmarking!


✨ Highlights

πŸ† Unified Multi-modal Captioning

A single framework for:

  • Image Captioning
  • Audio Captioning
  • Video Captioning

All under one roof, with support for modality-specific components.


πŸ“ Customizable Captioning

Control the content and style of captions via single user text prompts:

  • Content: Background, Event, Instance, Action, Instance Appearance, Region, and more
  • Style: Brief, Detail, Genre, Length, Theme

Produces captions tailored to user needs; example instructions are sketched below.
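For illustration, a control instruction pairs a content focus with a style constraint in a single text prompt. The strings below are hypothetical phrasings, not the dataset's actual templates:

```python
# Hypothetical control instructions pairing a content focus with a style constraint.
example_instructions = [
    "Describe only the main action in the video, in one brief sentence.",   # content: Action, style: Brief
    "Write a detailed caption focusing on the background of the image.",    # content: Background, style: Detail
    "Summarize the audio clip as a short, news-style headline.",            # content: Event, style: Genre + Length
]
```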


📊 Open Benchmark & Evaluation: AnyCapEval

An industry-level benchmark with:

  • Modality-specific test sets (image/audio/video)
  • Content-related metrics
  • Style-related metrics

Scoring content and style separately yields more accurate, lower-variance assessment; a minimal scoring sketch follows below.
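A minimal sketch of the decoupled scoring idea, assuming each caption is judged independently for content accuracy and stylistic fidelity and the two averages are reported side by side. The judging functions and aggregation are placeholders, not AnyCapEval's exact metrics:

```python
from statistics import mean

def evaluate(predictions, judge_content, judge_style):
    """Report content and style scores separately instead of one blended number.
    `judge_content` and `judge_style` are placeholder scoring functions."""
    content_scores = [judge_content(p.caption, p.reference, p.instruction) for p in predictions]
    style_scores = [judge_style(p.caption, p.instruction) for p in predictions]
    return {"content": mean(content_scores), "style": mean(style_scores)}
```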


πŸ› οΈ End-to-End Open Source

Everything you need is included:

  • ✅ Full training data
  • ✅ Model inference pipeline
  • ✅ Evaluation benchmark

All available under a permissive open-source license.


🔗 Get Started

Check out the paper and code:

📄 Paper: arXiv:2507.12841
📦 Code & Models: GitHub


📬 Contact

For questions, collaborations, or benchmark submissions, please reach out via the paper's contact email.

