StreamFormer (base-sized model)

StreamFormer is a backbone model pre-trained at three granularities: global, temporal, and spatial. It was introduced in the paper Learning Streaming Video Representation via Multitask Training and first released in this repository.

Intended uses & limitations

StreamFormer is a streaming video representation backbone that encodes a stream of video input. It is designed for multiple downstream applications, such as Online Action Detection, Online Video Instance Segmentation, and Video Question Answering.

Installation

git clone https://github.com/Go2Heart/StreamFormer.git
cd StreamFormer
conda create -n streamformer python=3.10
conda activate streamformer
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
pip install -r requirements.txt
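
After installation, it can help to confirm that the pinned PyTorch packages were actually picked up by the active environment. The following is a minimal sketch rather than part of the StreamFormer repository: `check_versions` and `EXPECTED` are hypothetical names, and it assumes the packages are registered under the distribution names `torch`, `torchvision`, and `torchaudio`. A local build suffix such as `+cu124` is treated as a match.

```python
from importlib import metadata

# Versions pinned by the conda install command above.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_versions(expected, get_version=metadata.version):
    """Return {name: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        # Accept an exact match or a local build suffix such as "2.5.1+cu124".
        ok = installed is not None and (
            installed == wanted or installed.startswith(wanted + "+")
        )
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} [{status}]")
```

Run it inside the activated `streamformer` environment; any line marked `MISMATCH` suggests the install step should be repeated.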

How to use

To extract the multi-granularity features:

import torch
from models import TimesformerMultiTaskingModelSigLIP  # run from the StreamFormer repository root

model = TimesformerMultiTaskingModelSigLIP.from_pretrained("StreamFormer/streamformer-timesformer").eval()
with torch.no_grad():
    # Random stand-in for a video clip, shaped [B, T, C, H, W]
    fake_frames = torch.randn(1, 16, 3, 224, 224).to(model.device)
    output = model(fake_frames)

    # global representation [B, D]: pooled feature at the last time step
    print(output.pooler_output[:, -1].shape, output.pooler_output[:, -1])

    # temporal representation [B, T, D]
    print(output.pooler_output.shape, output.pooler_output)

    # spatial representation [B, T, HxW, D]
    print(output.last_hidden_state.shape, output.last_hidden_state)
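
The example above encodes a fixed 16-frame clip. For a live stream, one simple option is to keep the most recent frames in a bounded buffer and encode that buffer each time a frame arrives. The sketch below illustrates only this buffering pattern: `stream_windows`, `encode_stream`, and `encode_clip` are hypothetical names, not part of the StreamFormer API, and re-encoding the full window on every frame is an assumption made for simplicity. The repository may offer a more efficient incremental path.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Feature = TypeVar("Feature")


def stream_windows(frames: Iterable[Frame], window: int = 16) -> Iterator[List[Frame]]:
    """Yield the most recent `window` frames each time a new frame arrives."""
    buffer: Deque[Frame] = deque(maxlen=window)  # oldest frame is dropped automatically
    for frame in frames:
        buffer.append(frame)
        yield list(buffer)  # ordered oldest first, newest last


def encode_stream(
    frames: Iterable[Frame],
    encode_clip: Callable[[List[Frame]], Feature],
    window: int = 16,
) -> Iterator[Feature]:
    """Apply `encode_clip` to the sliding window after every incoming frame."""
    for clip in stream_windows(frames, window):
        yield encode_clip(clip)


# Hypothetical `encode_clip` for StreamFormer (assumes each frame is a
# [3, 224, 224] tensor and `model` is loaded as in the example above):
#
#     def encode_clip(clip):
#         batch = torch.stack(clip).unsqueeze(0).to(model.device)  # [1, T, 3, 224, 224]
#         with torch.no_grad():
#             return model(batch).pooler_output[:, -1]  # global feature at the newest frame
```

Until the buffer fills, the windows are shorter than `window`; whether the model accepts clips shorter than 16 frames is not stated here, so a caller may prefer to skip those early windows.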

BibTeX entry and citation info

@misc{yan2025learning,
        title={Learning Streaming Video Representation via Multitask Training},
        author={Yibin Yan and Jilan Xu and Shangzhe Di and Yikun Liu and Yudi Shi and Qirui Chen and Zeqian Li and Yifei Huang and Weidi Xie},
        year={2025},
        eprint={2504.20041},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
}