Collection shoaib6174/video_swin_transformer/1

Collection of Video Swin Transformer feature extractor models.

Overview

This collection contains different Video Swin Transformer [1] models. The original model weights are provided by [2]. They were ported to Keras models (tf.keras.Model) and then serialized as TensorFlow SavedModels. The porting steps are available in [3].

About the models

These models can be used directly to extract features from videos. They are accompanied by Colab notebooks with fine-tuning steps for action recognition and video classification.

The table below provides a performance summary:

| model_name | pre-train dataset | fine-tune dataset | acc@1 (%) | acc@5 (%) |
| --- | --- | --- | --- | --- |
| swin_tiny_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400(1k) | 78.8 | 93.6 |
| swin_small_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400(1k) | 80.6 | 94.5 |
| swin_base_patch244_window877_kinetics400_1k | ImageNet-1K | Kinetics 400(1k) | 80.6 | 96.6 |
| swin_base_patch244_window877_kinetics400_22k | ImageNet-22K | Kinetics 400(1k) | 82.7 | 95.5 |
| swin_base_patch244_window877_kinetics600_22k | ImageNet-22K | Kinetics 600(1k) | 84.0 | 96.5 |
| swin_base_patch244_window1677_sthv2 | Kinetics 400 | Something-Something V2 | 69.6 | 92.7 |

The scores for all models are taken from [2].

Video Swin Transformer Feature Extractor Models

Notes

The input shape for these models is [None, 3, 32, 224, 224], representing [batch_size, channels, frames, height, width]. To create models with a different input shape, use this notebook.
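As an illustration of the channels-first layout described above, the following is a minimal NumPy sketch that rearranges a decoded video into a [1, 3, 32, 224, 224] batch. The helper name `to_model_input` and the even frame-sampling strategy are assumptions made for this example and are not taken from the notebooks. The sketch also assumes the frames are already 224x224 RGB; resizing and any pixel normalization the models expect are not covered here.

```python
import numpy as np

NUM_FRAMES = 32   # frames expected by the models
FRAME_SIZE = 224  # height and width expected by the models


def to_model_input(video: np.ndarray) -> np.ndarray:
    """Convert a [frames, height, width, channels] video to [1, 3, 32, 224, 224].

    Hypothetical helper: assumes 224x224 RGB frames and performs no
    resizing or pixel normalization.
    """
    if (
        video.ndim != 4
        or video.shape[0] == 0
        or video.shape[1:] != (FRAME_SIZE, FRAME_SIZE, 3)
    ):
        raise ValueError(f"expected [frames, 224, 224, 3], got {video.shape}")
    # Pick 32 evenly spaced frame indices (indices repeat if the clip is shorter).
    idx = np.linspace(0, video.shape[0] - 1, NUM_FRAMES).round().astype(int)
    clip = video[idx].astype(np.float32)
    # Move channels first: [frames, H, W, C] -> [C, frames, H, W].
    clip = np.transpose(clip, (3, 0, 1, 2))
    # Add the batch dimension: [C, frames, H, W] -> [1, C, frames, H, W].
    return clip[np.newaxis, ...]


video = np.zeros((64, FRAME_SIZE, FRAME_SIZE, 3), dtype=np.uint8)
batch = to_model_input(video)
print(batch.shape)  # (1, 3, 32, 224, 224)
```

Assuming one of the SavedModels has been loaded as a callable `model` (for example through `tensorflow_hub.KerasLayer`), an array of this shape could then be converted to a tensor and passed to it to obtain features.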

References

[1] Video Swin Transformer, Liu et al.
[2] Video Swin Transformers GitHub
[3] GSOC-22-Video-Swin-Transformers GitHub

Acknowledgements
