Model Card for Model ID
This model card describes TEMPURA, a vision-language models to reason about causal event relationships and generate fine-grained, timestamped descriptions of untrimmed videos.
Model Details
Model Description
TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. More details can be found on the project page.
- Developed by: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
- Model type: Video-Language Model
- Language(s) (NLP): English
- License: cc-by-4.0
- Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct
Model Sources
- Repository: https://github.com/andy-cheng/TEMPURA
- Paper: TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
- Project Page: https://andy-cheng.github.io/TEMPURA/
Uses
Direct Use
The model can be used directly for temporal grounding and highlight detection in videos.
Downstream Use [optional]
The model can be fine-tuned for various applications requiring temporal video understanding, such as video summarization, event extraction, and question answering.
Out-of-Scope Use
The model may not perform well on videos with significantly different visual styles or languages compared to the training data.
Bias, Risks, and Limitations
The model's performance is influenced by biases present in the VER dataset. Further analysis is needed to fully characterize these biases.
Recommendations
Users should be aware of potential biases in the model's outputs.
How to Get Started with the Model
Inference: Please check the inference example.
Training: Please check the model training script.
Training Details
Training Data
The model was trained on the VER dataset (https://huggingface.co/datasets/andaba/TEMPURA-VER).
Training Procedure
The training procedure involves masked event prediction and video event segmentation with temporal dense captioning. See the training scripts in the repository for details.
Training Hyperparameters
- Training regime: [More Information Needed]
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation
BibTeX:
@article{tempura,
title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu
and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang
and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
journal={arXiv preprint arXiv:2505.01583},
year={2025}
}
APA:
Cheng, J.-H., Wang, V., Wang, H., Zhou, H., Peng, Y.-H., Liu, H.-I., Huang, H.-W., Chen, K.-M., Yang, C.-Y., Chai, W., Chen, Y.-L., Vineet, V., Cai, Q., & Hwang, J.-N. (2025). TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action. arXiv preprint arXiv:2505.01583.
Model Card Contact
Jen-Hao Cheng, [email protected]
- Downloads last month
- 68