arxiv:2506.13458
Leveraging Vision-Language Pre-training for Human Activity Recognition in Still Images
Published on Jun 16
Abstract
AI-generated summary
Contrastive vision-language pre-training substantially improves the accuracy of action recognition in single photos compared with CNNs trained from scratch.
Recognising human activity from a single photo enables indexing, safety, and assistive applications, yet the task lacks the motion cues available in video. On 285 MSCOCO images labelled as walking, running, sitting, and standing, CNNs trained from scratch scored 41% accuracy, while fine-tuning the multimodal CLIP model raised accuracy to 76%, demonstrating that contrastive vision-language pre-training decisively improves still-image action recognition for real-world deployments.
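The paper fine-tunes CLIP on the labelled MSCOCO subset; as a rough illustration of the contrastive vision-language approach it builds on, the sketch below scores a still image against one text prompt per action class using CLIP in zero-shot mode. The checkpoint name, prompt wording, and image path are assumptions for illustration, not the authors' exact setup.

```python
# Minimal sketch: CLIP-based still-image action classification over the four
# classes mentioned in the paper. Zero-shot scoring only; the paper's actual
# method fine-tunes CLIP on 285 labelled MSCOCO images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

LABELS = ["walking", "running", "sitting", "standing"]
PROMPTS = [f"a photo of a person {label}" for label in LABELS]  # assumed prompt template

# Assumed checkpoint; any CLIP variant with image-text heads would work similarly.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)
model.eval()

def classify(image_path: str) -> str:
    """Predict the action by comparing image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)  # shape (1, num_labels)
    return LABELS[probs.argmax().item()]

print(classify("example.jpg"))  # hypothetical image file
```

Fine-tuning, as reported in the abstract, would typically extend this by optimising CLIP (or a lightweight classification head on its image features) on the labelled images rather than relying on fixed text prompts alone.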