StreamFormer (base-sized model)
StreamFormer is a backbone model pre-trained at global, temporal, and spatial granularities. It was introduced in the paper Learning Streaming Video Representation via Multitask Training and first released in this repository.
Intended uses & limitations
StreamFormer is a streaming video representation backbone that encodes a stream of video input. It is designed for multiple downstream applications, such as Online Action Detection, Online Video Instance Segmentation, and Video Question Answering.
Installation
git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
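After installation, a quick sanity check can confirm that the pinned versions from the conda command above are the ones actually active in the environment. The sketch below is not part of the StreamFormer repository; the helper names `installed_versions` and `report` are hypothetical, and it assumes the packages register under the distribution names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Versions pinned by the conda install command in this section.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def report(expected=EXPECTED):
    """Build one status line per package comparing expected and found versions."""
    found = installed_versions(expected)
    lines = []
    for name, want in expected.items():
        have = found[name]
        # Some builds append a local tag such as "+cu124"; compare the base version only.
        ok = have is not None and have.split("+")[0] == want
        status = "ok" if ok else "MISMATCH"
        lines.append(f"{name}: expected {want}, found {have} [{status}]")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Run it inside the activated `streamformer` environment; any line marked MISMATCH points to a package worth reinstalling before loading the model.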
How to use
To extract the multi-granularity features:
import torch
from models import TimesformerMultiTaskingModelSigLIP

# Load the pre-trained backbone and switch to inference mode
model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()
with torch.no_grad():
    # Dummy input: 1 clip of 16 RGB frames at 224x224
    fake_frames = torch.randn(1, 16, 3, 224, 224)
    fake_frames = fake_frames.to(model.device)
    output = model(fake_frames)
    # global representation [B, D] (pooled feature at the last time step)
    print(output.pooler_output[:,-1].shape, output.pooler_output[:,-1])
    # temporal representation [B, T, D]
    print(output.pooler_output.shape, output.pooler_output)
    # spatial representation [B, T, HxW, D]
    print(output.last_hidden_state.shape, output.last_hidden_state)
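For downstream heads it is often convenient to unpack the three granularities once and restore the 2D layout of the spatial tokens. The following is a minimal sketch on stand-in tensors rather than real model outputs. It assumes the shapes given in the comments above, a square token grid in row-major order, and illustrative sizes (196 tokens, feature width 768) that are not confirmed by this card; `split_granularities` is a hypothetical helper, not part of the repository.

```python
import math

import torch


def split_granularities(pooler_output, last_hidden_state):
    """Return (global [B, D], temporal [B, T, D], spatial grid [B, T, H, W, D])."""
    # Global feature: pooled feature at the last time step, as in the snippet above.
    global_feat = pooler_output[:, -1]
    temporal_feat = pooler_output
    b, t, n, d = last_hidden_state.shape
    # Assumption: the HxW tokens form a square grid stored in row-major order.
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError(f"{n} spatial tokens do not form a square grid")
    spatial_grid = last_hidden_state.reshape(b, t, side, side, d)
    return global_feat, temporal_feat, spatial_grid


# Stand-in tensors with assumed sizes; replace with output.pooler_output and
# output.last_hidden_state from the real model.
B, T, N, D = 1, 16, 196, 768
pooler_output = torch.randn(B, T, D)
last_hidden_state = torch.randn(B, T, N, D)

global_feat, temporal_feat, spatial_grid = split_granularities(
    pooler_output, last_hidden_state
)
print(global_feat.shape, temporal_feat.shape, spatial_grid.shape)
```

The reshape does not copy or reorder features, so `spatial_grid[:, :, i, j]` is the token at flat index `i * side + j`. If the real token layout differs from the assumed one, adjust the reshape accordingly.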
BibTeX entry and citation info
@misc{yan2025learning,
title={Learning Streaming Video Representation via Multitask Training},
author={Yibin Yan and Jilan Xu and Shangzhe Di and Yikun Liu and Yudi Shi and Qirui Chen and Zeqian Li and Yifei Huang and Weidi Xie},
year={2025},
eprint={2504.20041},
archivePrefix={arXiv},
primaryClass={cs.CV}
}