Model Card for TEMPURA-Qwen2.5-VL-3B-s2

This model card describes TEMPURA, a vision-language model that reasons about causal event relationships and generates fine-grained, timestamped descriptions of untrimmed videos.

Model Details

Model Description

TEMPURA enhances video temporal understanding by integrating causal reasoning with fine-grained temporal segmentation. More details can be found on the project page.

  • Developed by: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
  • Model type: Video-Language Model
  • Language(s) (NLP): English
  • License: cc-by-4.0
  • Finetuned from model: Qwen/Qwen2.5-VL-3B-Instruct

Model Sources

  • Paper: https://arxiv.org/abs/2505.01583
  • Dataset: https://huggingface.co/datasets/andaba/TEMPURA-VER

Uses

Direct Use

The model can be used directly for temporal grounding and highlight detection in videos.
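For illustration only, prompts for these two tasks might look like the sketch below; the exact prompt format TEMPURA expects is not documented in this card, so these strings are hypothetical.

```python
# Hypothetical prompts for the two direct-use tasks; the wording TEMPURA
# expects may differ, so consult the project's inference example for the
# canonical format.
temporal_grounding_prompt = (
    "When does the person open the refrigerator? "
    "Answer with start and end timestamps in seconds."
)
highlight_detection_prompt = (
    "List the key moments of this video as timestamped highlights, "
    "one per line."
)
```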

Downstream Use

The model can be fine-tuned for various applications requiring temporal video understanding, such as video summarization, event extraction, and question answering.

Out-of-Scope Use

The model may not perform well on videos whose visual style or language differs significantly from the training data.

Bias, Risks, and Limitations

The model's performance is influenced by biases present in the VER dataset. Further analysis is needed to fully characterize these biases.

Recommendations

Users should be aware of potential biases in the model's outputs and verify generated timestamps and event descriptions before relying on them in downstream applications.

How to Get Started with the Model

Inference: Please check the inference example in the project repository; a minimal usage sketch is shown below.
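
The following is a minimal inference sketch, assuming the checkpoint follows the standard Qwen2.5-VL usage in transformers together with qwen-vl-utils for video preprocessing. The video path, prompt text, frame rate, and generation settings are illustrative placeholders, not the project's official example.

```python
# Requires: a recent transformers release with Qwen2.5-VL support,
# plus torch and qwen-vl-utils.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "andaba/TEMPURA-Qwen2.5-VL-3B-s2"

# Load the finetuned checkpoint and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single-turn request with one video; the prompt wording is illustrative.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {
                "type": "text",
                "text": "Describe the events in this video with start and end timestamps.",
            },
        ],
    }
]

# Build the chat-formatted prompt and extract the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Depending on available GPU memory, device_map="auto" can be replaced with an explicit device, and max_new_tokens increased for longer dense captions.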

Training: Please check the model training script.

Training Details

Training Data

The model was trained on the VER dataset (https://huggingface.co/datasets/andaba/TEMPURA-VER).

Training Procedure

The training procedure involves masked event prediction and video event segmentation with temporal dense captioning. See the training scripts in the repository for details.

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

TEMPURA is a 3.75B-parameter video-language model (BF16 weights) finetuned from Qwen/Qwen2.5-VL-3B-Instruct. Its objective is to reason about causal event relationships and produce fine-grained, timestamped descriptions of untrimmed videos.

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation

BibTeX:

@article{tempura,
  title={TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action},
  author={Jen-Hao Cheng and Vivian Wang and Huayu Wang and Huapeng Zhou and Yi-Hao Peng and Hou-I Liu and Hsiang-Wei Huang and Kuang-Ming Chen and Cheng-Yen Yang and Wenhao Chai and Yi-Ling Chen and Vibhav Vineet and Qin Cai and Jenq-Neng Hwang},
  journal={arXiv preprint arXiv:2505.01583},
  year={2025}
}

APA:

Cheng, J.-H., Wang, V., Wang, H., Zhou, H., Peng, Y.-H., Liu, H.-I., Huang, H.-W., Chen, K.-M., Yang, C.-Y., Chai, W., Chen, Y.-L., Vineet, V., Cai, Q., & Hwang, J.-N. (2025). TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action. arXiv preprint arXiv:2505.01583.

Model Card Contact

Jen-Hao Cheng, [email protected]
