arxiv:2502.15393

LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models

Published on Feb 21

Authors:

Abstract

LongCaption-Agent synthesizes long captions for training large multimodal models, enabling them to generate detailed descriptions of long videos without significant manual effort.

AI-generated summary

Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks, particularly for short videos. However, as the length of the video increases, generating long, detailed captions becomes a significant challenge. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. Our analysis reveals that open-source LMMs struggle to consistently produce outputs exceeding 300 words, leading to incomplete or overly concise descriptions of the visual content. This limitation hinders the ability of LMMs to provide comprehensive and detailed captions for long videos, ultimately missing important visual information. Through controlled experiments, we find that the scarcity of paired examples with long-captions during training is the primary factor limiting the model's output length. However, manually annotating long-caption examples for long-form videos is time-consuming and expensive. To overcome the annotation bottleneck, we propose the LongCaption-Agent, a framework that synthesizes long caption data by hierarchical semantic aggregation. % aggregating multi-level descriptions. Using LongCaption-Agent, we curated a new long-caption dataset, LongCaption-10K. We also develop LongCaption-Bench, a benchmark designed to comprehensively evaluate the quality of long captions generated by LMMs. By incorporating LongCaption-10K into training, we enable LMMs to generate captions exceeding 1,000 words for long-form videos, while maintaining high output quality. In LongCaption-Bench, our model achieved State-of-The-Art performance, even surpassing larger proprietary models like GPT4o.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2502.15393 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2502.15393 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2502.15393 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.