# EuroSAT Landcover Classification using CLIP

CLIP (Contrastive Language-Image Pretraining) is a neural network model developed by OpenAI that learns to connect images with natural-language text: given an image, it can score how well a set of text descriptions matches it, and vice versa.

According to the <a href="https://arxiv.org/abs/2103.00020">paper</a>, *CLIP builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning*.

CLIP uses an abundantly available source of supervision: the text paired with images found across the internet. Given an image, CLIP's training task is to predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.

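As a rough sketch (not the original training code), this contrastive objective embeds a batch of images and their paired texts and trains the model so that matching pairs get the highest similarity; all tensor names below are illustrative stand-ins for the encoder outputs:

```python
import torch
import torch.nn.functional as F

# Simplified sketch of CLIP's contrastive objective for one batch.
# image_features / text_features stand in for the image- and text-encoder outputs.
batch_size, dim = 8, 512
image_features = F.normalize(torch.randn(batch_size, dim), dim=-1)
text_features = F.normalize(torch.randn(batch_size, dim), dim=-1)

logit_scale = torch.tensor(100.0)  # stands in for the learned temperature
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T

# The i-th image is paired with the i-th text, so the targets are the diagonal.
targets = torch.arange(batch_size)
loss = (F.cross_entropy(logits_per_image, targets) +
        F.cross_entropy(logits_per_text, targets)) / 2
print(loss.item())
```
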
## ViT-B/32
In this project, we fine-tune CLIP with a ViT-B/32 image encoder. ViT-B/32 is a specific variant of the Vision Transformer (ViT), an architecture for computer vision tasks that leverages the transformer model originally designed for natural language processing. Here are some details about ViT-B/32:

1. Architecture:
   - Transformer backbone: ViT uses a transformer architecture, which relies on self-attention mechanisms to process the input.
   - Patch embeddings: The input image is divided into fixed-size patches, which are then linearly embedded into a sequence of vectors (see the sketch after this list).
   - Position embeddings: Since the transformer does not inherently understand the order of the patches, position embeddings are added to the patch embeddings to retain spatial information.
2. Model information:
   - B: The "B" in ViT-B/32 stands for "Base", which indicates the model's scale. ViT models come in several sizes, with Base being moderate compared to the larger variants (Large and Huge).
   - 32: This number denotes the size of the patches into which the input image is divided. For ViT-B/32, the image is split into 32x32 pixel patches, so a 224x224 input yields 7x7 = 49 patches.

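As an illustration of the patch-embedding step described above (a sketch only; the actual encoder ships with the `clip` package), a 224x224 input becomes a sequence of 49 patch tokens:

```python
import torch
import torch.nn as nn

# Illustrative sketch of a ViT-B/32-style patch embedding (not the actual CLIP code).
# A 224x224 RGB image is cut into 32x32 patches, each projected to a 768-dim vector.
patch_size, embed_dim = 32, 768
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)            # dummy input batch
patches = patch_embed(image)                   # (1, 768, 7, 7)
tokens = patches.flatten(2).transpose(1, 2)    # (1, 49, 768): a sequence of 49 patch tokens

# Position embeddings (plus a class token in the real model) are added to the
# patch tokens before they are fed to the transformer encoder.
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], embed_dim))
tokens = tokens + pos_embed
print(tokens.shape)  # torch.Size([1, 49, 768])
```
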
## <a href="https://github.com/MuhammedM294/EuroSat/tree/main/dataset_rgb">Dataset</a>

The training dataset contains 22,011 images divided into 10 categories. These categories are:
- annual crop land
- brushland or shrubland
- forest
- highway or road
- industrial buildings or commercial buildings
- lake or sea
- pasture land
- permanent crop land
- residential buildings or homes or apartments
- river

The test dataset consists of 5,000 images divided into the same 10 categories.

## Data Preparation (as suggested by OpenAI)

As mentioned in the paper, when the bare name of a class is the only information provided to CLIP's text encoder, it lacks the context of a full sentence and struggles to differentiate between classes. Hence, a good default template is "a satellite photo of a {label}".

After rewriting our ground-truth text descriptions according to this template (OpenAI provides templates for different datasets <a href="https://github.com/openai/CLIP/blob/main/data/prompts.md">here</a>), our prompts look like this:
```python
classes = [
    'a centered satellite photo of forest',
    'a centered satellite photo of permanent crop land',
    'a centered satellite photo of residential buildings or homes or apartments',
    'a centered satellite photo of river',
    'a centered satellite photo of pasture land',
    'a centered satellite photo of lake or sea',
    'a centered satellite photo of brushland or shrubland',
    'a centered satellite photo of annual crop land',
    'a centered satellite photo of industrial buildings or commercial buildings',
    'a centered satellite photo of highway or road',
]
```
Note that 'a centered satellite photo of {label}' is one of many template prompts provided by OpenAI.

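The rewriting itself is simple string templating; a minimal sketch (label list abbreviated, names are illustrative):

```python
# Hypothetical helper: apply the OpenAI prompt template to the raw label names.
template = "a centered satellite photo of {}"
labels = ["forest", "permanent crop land", "river"]  # ... plus the remaining classes
classes = [template.format(label) for label in labels]
print(classes[0])  # a centered satellite photo of forest
```
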
## Installing Dependencies
```python
!pip install transformers
!pip install torch==1.7.1 torchvision
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git
```

## Pre-Trained Model

To run inference with the pre-trained (zero-shot) model, run the following script:
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]

# source: https://github.com/openai/CLIP
```

Running this zero-shot approach (with the satellite-photo prompts shown above) on the EuroSAT dataset gave an accuracy of `42.18%`.

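For reference, a zero-shot evaluation over a test folder can be sketched as follows (the directory layout, batch size, and the use of folder names as labels are assumptions; in this project the descriptive class names above are used):

```python
import torch
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed layout: eurosat/test/<class_folder>/<image>.jpg (path is illustrative).
dataset = ImageFolder("eurosat/test", transform=preprocess)
prompts = [f"a centered satellite photo of {name}" for name in dataset.classes]
text_tokens = clip.tokenize(prompts).to(device)

correct = total = 0
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    for images, labels in DataLoader(dataset, batch_size=64):
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        preds = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"Zero-shot accuracy: {100 * correct / total:.2f}%")
```
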
## Fine Tuning

Further, I fine-tuned the model on this dataset; the training code can be found in the accompanying Jupyter notebook.

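The notebook contains the actual training code; purely as a sketch, one common way to fine-tune CLIP for a fixed label set is to score each image against all class prompts and train with cross-entropy against the class labels (the paths, batch size, and optimizer settings below are assumptions, not the notebook's values):

```python
import torch
import torch.nn.functional as F
import clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # train in fp32; the CUDA checkpoint is fp16 by default

# Assumed layout: eurosat/train/<class_folder>/<image>.jpg (path is illustrative).
dataset = ImageFolder("eurosat/train", transform=preprocess)
prompts = [f"a centered satellite photo of {name}" for name in dataset.classes]
text_tokens = clip.tokenize(prompts).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-6, weight_decay=0.2)

for epoch in range(16):  # 16 epochs, as reported below
    for images, labels in DataLoader(dataset, batch_size=64, shuffle=True):
        images, labels = images.to(device), labels.to(device)

        image_features = F.normalize(model.encode_image(images), dim=-1)
        text_features = F.normalize(model.encode_text(text_tokens), dim=-1)

        # Score every image against every class prompt and train as a classifier.
        logits = model.logit_scale.exp() * image_features @ text_features.T
        loss = F.cross_entropy(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "euroSATclip.pt")
```
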
After fine-tuning, the accuracy of the model came out to be `73.76%`.

The model was trained for 16 epochs (half the number mentioned in the paper) on an L4 GPU.

The model is saved and can be used to run inference.

To run inference using the fine-tuned model, use the following script:
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.load_state_dict(torch.load("euroSATclip.pt", map_location=device))

classes = ["a centered satellite photo of annual crop land",
           "a centered satellite photo of forest",
           "a centered satellite photo of lake or sea",
           "a centered satellite photo of pasture land",
           "a centered satellite photo of permanent crop land",
           "a centered satellite photo of river",
           "a centered satellite photo of residential buildings or homes or apartments",
           "a centered satellite photo of industrial buildings or commercial buildings",
           "a centered satellite photo of highway or road",
           "a centered satellite photo of brushland or shrubland"]

# load the image to classify
image = Image.open('<image-path>')

image_encoded = preprocess(image).unsqueeze(0).to(device)
text_encoded = clip.tokenize(classes).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_encoded)
    text_features = model.encode_text(text_encoded)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# cosine similarity between the image and every class prompt
similarity = (image_features @ text_features.T).squeeze()
best_match_idx = similarity.argmax().item()
best_description = classes[best_match_idx]
print(best_description)
```

The fine-tuned model (`euroSATclip.pt`) can be found in the files section of this repo.