TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. It is trained in two stages: first, masked event prediction teaches the model to reconstruct missing events and generate causal explanations for them; second, video segmentation and dense captioning teach it to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. TEMPURA is trained on VER, a large-scale dataset of 1M training instances drawn from 500K videos with temporally aligned event descriptions and structured reasoning steps, and it outperforms strong baseline models on temporal grounding and highlight detection benchmarks.
Project Page | arXiv Preprint | VER Dataset | Github Repo
Model Weights
Citing TEMPURA
If you find our paper or dataset useful, please consider citing our work!
@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}