TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. It is trained in two stages: first, masked event prediction teaches the model to reconstruct missing events and generate causal explanations for them; second, video segmentation and dense captioning teach it to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. TEMPURA is trained on VER, a large-scale dataset of 1M training instances drawn from 500K videos with temporally aligned event descriptions and structured reasoning steps, and it outperforms strong baseline models on temporal grounding and highlight detection benchmarks.
Project Page | arXiv Preprint | VER Dataset | Github Repo
Model Weights
Citing TEMPURA
If you find our paper or dataset useful, please consider citing our work!
@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}