# CLIP Model based on DistilBERT and ViT
This repository contains a CLIP (Contrastive Language-Image Pretraining) model that combines the power of two state-of-the-art architectures:
- DistilBERT (based on `distilbert-base-uncased`): A smaller, faster, and lighter version of BERT.
- Vision Transformer (ViT) (based on `google/vit-base-patch16-224`): A powerful vision transformer architecture for image processing.
The model is trained to learn joint representations of images and text, enabling a variety of multimodal tasks such as image-text matching, zero-shot classification, and cross-modal retrieval.
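As a rough illustration of how the shared embedding space supports these tasks, the sketch below performs zero-shot classification as a nearest-neighbour search with cosine similarity. The label set and the random placeholder embeddings are purely illustrative; in practice the vectors would come from the text and image encoders described in the next section.

```python
# Illustrative only: zero-shot classification in the shared embedding space.
# `text_embs` and `image_emb` are placeholders standing in for L2-normalised
# outputs of the text and image encoders.
import torch
import torch.nn.functional as F

labels = ["dress", "jeans", "sneakers"]                           # assumed label set
text_embs = F.normalize(torch.randn(len(labels), 256), dim=-1)    # placeholder text embeddings
image_emb = F.normalize(torch.randn(1, 256), dim=-1)              # placeholder image embedding

similarities = image_emb @ text_embs.t()   # cosine similarity (vectors are unit-norm)
probs = similarities.softmax(dim=-1)       # turn scores into a distribution over labels
predicted = labels[probs.argmax(dim=-1).item()]
print(predicted, probs.tolist())
```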
## Model Overview
CLIP combines a text encoder and an image encoder to map both images and texts into a shared embedding space. Because the model is trained on a large number of image-text pairs, it can perform various downstream tasks without task-specific fine-tuning.
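The exact implementation in this repository is not shown here; the following is a minimal sketch of such a dual-encoder setup, assuming linear projection heads into a 256-dimensional shared space and the standard symmetric contrastive (InfoNCE) loss used by CLIP.

```python
# Minimal sketch of the dual-encoder architecture described above.
# The projection dimension, use of [CLS] pooling, and loss details are assumptions,
# not necessarily the exact choices made in this repository.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, ViTModel

class CLIPDualEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Pretrained backbones named in this card
        self.text_encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")
        # Linear heads projecting both modalities into the shared space (assumed)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        # Learnable temperature, initialised to log(1/0.07) as in the original CLIP
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, input_ids, attention_mask, pixel_values):
        # Use the [CLS] token of each encoder as the sequence-level representation
        text_feat = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        image_feat = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        text_emb = F.normalize(self.text_proj(text_feat), dim=-1)
        image_emb = F.normalize(self.image_proj(image_feat), dim=-1)
        return text_emb, image_emb

def clip_loss(text_emb, image_emb, logit_scale):
    # Symmetric cross-entropy over the batch: matching image-text pairs lie on the diagonal
    logits = logit_scale.exp() * text_emb @ image_emb.t()
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```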
### Components
- Text Encoder: `distilbert-base-uncased` is used to encode the textual input into a dense vector.
- Image Encoder: `google/vit-base-patch16-224` processes image data by dividing images into patches and learning their contextual relationships.
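Both encoders can be fed with the standard 🤗 Transformers preprocessors, as sketched below. The example texts and image paths are placeholders, not files from this repository.

```python
# Preparing inputs for the two encoders with the matching Hugging Face preprocessors.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

text_inputs = tokenizer(
    ["a red summer dress", "blue denim jeans"],   # placeholder captions
    padding=True, truncation=True, return_tensors="pt",
)
image_inputs = image_processor(
    [Image.open("dress.jpg").convert("RGB"),      # placeholder image paths
     Image.open("jeans.jpg").convert("RGB")],
    return_tensors="pt",
)
# text_inputs["input_ids"] and text_inputs["attention_mask"] feed the text encoder;
# image_inputs["pixel_values"] (one 3x224x224 tensor per image) feeds the ViT.
```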
## Future work
- Train on larger datasets and with more computational resources.