---
library_name: transformers
tags:
- vision
- image-segmentation
- ecology
datasets:
- coralscapes
metrics:
- mean_iou
---

# Model Card for Model ID

During training, images are randomly scaled by a factor between 1 and 2, flipped horizontally with a probability of 0.5, and randomly cropped to 1024×1024 pixels.
Input images are normalized using the ImageNet mean and standard deviation. For evaluation, a non-overlapping sliding-window strategy is employed,
using a window size of 1024×1024.

- **Model type:** SegFormer
- **License:** [More Information Needed]
- **Finetuned from model:** [SegFormer (b2-sized) encoder pre-trained-only (`nvidia/mit-b2`)](https://huggingface.co/nvidia/mit-b2)

### Model Sources

- **Repository:** [coralscapesScripts](https://github.com/eceo-epfl/coralscapesScripts/)
- **Paper:** [More Information Needed]
- **Demo:** [Hugging Face Spaces](https://huggingface.co/spaces/EPFL-ECEO/coralscapes_demo)

## How to Get Started with the Model

The simplest way to use this model to segment an image of the Coralscapes dataset is as follows:

```python
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
from datasets import load_dataset

# Load an image from the coralscapes dataset or load your own image
dataset = load_dataset("EPFL-ECEO/coralscapes")
image = dataset["test"][42]["image"]

preprocessor = SegformerImageProcessor.from_pretrained("EPFL-ECEO/segformer-b2-finetuned-coralscapes-1024-1024")
model = SegformerForSemanticSegmentation.from_pretrained("EPFL-ECEO/segformer-b2-finetuned-coralscapes-1024-1024")

inputs = preprocessor(image, return_tensors="pt")
outputs = model(**inputs)
outputs = preprocessor.post_process_semantic_segmentation(outputs, target_sizes=[(image.size[1], image.size[0])])
label_pred = outputs[0].cpu().numpy()
```
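
For a quick visual check of the prediction, the class-index map can be displayed directly. This is a minimal sketch, assuming `matplotlib` is installed; it shows raw class indices rather than the dataset's color palette:

```python
import matplotlib.pyplot as plt

# Show the input image and the predicted class-index map side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
axes[0].imshow(image)
axes[0].set_title("Input image")
axes[1].imshow(label_pred)  # each pixel holds a predicted class index
axes[1].set_title("Predicted segmentation")
for ax in axes:
    ax.axis("off")
plt.show()
```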

While the approach above should still work for images of different sizes and scales, for images that are not close to the training size of the model (1024×1024)
we recommend the following sliding-window approach to achieve better results:

```python
import torch
import torch.nn.functional as F
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
from PIL import Image
import numpy as np
from datasets import load_dataset

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def resize_image(image, target_size=1024):
    """
    Used to resize the image such that the smaller side equals 1024.
    """
    h_img, w_img = image.size
    if h_img < w_img:
        new_h, new_w = target_size, int(w_img * (target_size / h_img))
    else:
        new_h, new_w = int(h_img * (target_size / w_img)), target_size
    resized_img = image.resize((new_h, new_w))
    return resized_img

def segment_image(image, preprocessor, model, crop_size=(1024, 1024), num_classes=40, transform=None):
    """
    Finds an optimal stride based on the image size and aspect ratio to create
    overlapping sliding windows of size 1024x1024, which are then fed into the model.
    """
    h_crop, w_crop = crop_size

    img = torch.Tensor(np.array(resize_image(image, target_size=1024)).transpose(2, 0, 1)).unsqueeze(0)
    batch_size, _, h_img, w_img = img.size()

    if transform:
        img = torch.Tensor(transform(image=img.numpy())["image"]).to(device)

    h_grids = int(np.round(3 / 2 * h_img / h_crop)) if h_img > h_crop else 1
    w_grids = int(np.round(3 / 2 * w_img / w_crop)) if w_img > w_crop else 1

    h_stride = int((h_img - h_crop + h_grids - 1) / (h_grids - 1)) if h_grids > 1 else h_crop
    w_stride = int((w_img - w_crop + w_grids - 1) / (w_grids - 1)) if w_grids > 1 else w_crop

    preds = img.new_zeros((batch_size, num_classes, h_img, w_img))
    count_mat = img.new_zeros((batch_size, 1, h_img, w_img))

    for h_idx in range(h_grids):
        for w_idx in range(w_grids):
            y1 = h_idx * h_stride
            x1 = w_idx * w_stride
            y2 = min(y1 + h_crop, h_img)
            x2 = min(x1 + w_crop, w_img)
            y1 = max(y2 - h_crop, 0)
            x1 = max(x2 - w_crop, 0)
            crop_img = img[:, :, y1:y2, x1:x2]
            with torch.no_grad():
                if preprocessor:
                    inputs = preprocessor(crop_img, return_tensors="pt")
                    inputs["pixel_values"] = inputs["pixel_values"].to(device)
                else:
                    inputs = crop_img.to(device)
                outputs = model(**inputs)

            # Upsample the logits back to the crop size and accumulate them at the
            # crop's position; count_mat tracks how often each pixel was covered.
            resized_logits = F.interpolate(
                outputs.logits[0].unsqueeze(dim=0), size=crop_img.shape[-2:], mode="bilinear", align_corners=False
            )
            preds += F.pad(resized_logits,
                           (int(x1), int(preds.shape[3] - x2), int(y1),
                            int(preds.shape[2] - y2))).cpu()
            count_mat[:, :, y1:y2, x1:x2] += 1

    assert (count_mat == 0).sum() == 0
    preds = preds / count_mat
    preds = preds.argmax(dim=1)
    preds = F.interpolate(preds.unsqueeze(0).type(torch.uint8), size=image.size[::-1], mode='nearest')
    label_pred = preds.squeeze().cpu().numpy()
    return label_pred

# Load an image from the coralscapes dataset or load your own image
dataset = load_dataset("EPFL-ECEO/coralscapes")
image = dataset["test"][42]["image"]

preprocessor = SegformerImageProcessor.from_pretrained("EPFL-ECEO/segformer-b2-finetuned-coralscapes-1024-1024")
model = SegformerForSemanticSegmentation.from_pretrained("EPFL-ECEO/segformer-b2-finetuned-coralscapes-1024-1024").to(device)

label_pred = segment_image(image, preprocessor, model)
```

## Training & Evaluation Details

### Data

The model is trained and evaluated on the [Coralscapes dataset](https://huggingface.co/datasets/EPFL-ECEO/coralscapes), a general-purpose dense semantic segmentation dataset for coral reefs.
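
To get oriented, the dataset can be pulled from the Hub and its splits and columns inspected before running training or evaluation. This is a minimal sketch using the `datasets` library; the exact column names should be confirmed against the dataset card:

```python
from datasets import load_dataset

# Download the Coralscapes dataset and list its splits, columns and sizes.
coralscapes = load_dataset("EPFL-ECEO/coralscapes")
print(coralscapes)

# Peek at one test sample (column names beyond "image" should be verified).
sample = coralscapes["test"][0]
print(sample.keys())
```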

### Procedure

Training is conducted following the original SegFormer [implementation](https://proceedings.neurips.cc/paper_files/paper/2021/file/64f1f27bf1b4ec22924fd0acb550c235-Paper.pdf), using a batch size of 8 for 265 epochs,
with the AdamW optimizer, an initial learning rate of 6e-5, a weight decay of 1e-2, and a polynomial learning rate scheduler with a power of 1.
During training, images are randomly scaled by a factor between 1 and 2, flipped horizontally with a probability of 0.5, and randomly cropped to 1024×1024 pixels.
Input images are normalized using the ImageNet mean and standard deviation. For evaluation, a non-overlapping sliding-window strategy is employed,
using a window size of 1024×1024.
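
This recipe maps onto a fairly standard PyTorch setup. The sketch below only illustrates the stated hyperparameters and augmentations and is not the authors' training code; the starting checkpoint, the `TOTAL_STEPS` value, and the joint image/mask handling are assumptions:

```python
import random
import torch
import torchvision.transforms.functional as TF
from transformers import SegformerForSemanticSegmentation

# Start from the ImageNet-pre-trained MiT-B2 encoder, as described above.
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b2", num_labels=40)

# AdamW with lr 6e-5 and weight decay 1e-2, polynomial decay (power 1) over training.
TOTAL_STEPS = 100_000  # placeholder: optimizer steps for 265 epochs at batch size 8
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=TOTAL_STEPS, power=1.0)

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def augment(image, mask, crop=1024):
    """Random scale in [1, 2], horizontal flip with p=0.5, random 1024x1024 crop,
    then ImageNet normalization. `image` is a float CxHxW tensor in [0, 1] and
    `mask` is an integer HxW tensor, both at least `crop` pixels per side."""
    scale = random.uniform(1.0, 2.0)
    h, w = image.shape[-2:]
    new_size = [int(h * scale), int(w * scale)]
    image = TF.resize(image, new_size)
    mask = TF.resize(mask.unsqueeze(0).float(), new_size,
                     interpolation=TF.InterpolationMode.NEAREST).squeeze(0).long()
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    top = random.randint(0, image.shape[-2] - crop)
    left = random.randint(0, image.shape[-1] - crop)
    image = TF.crop(image, top, left, crop, crop)
    mask = TF.crop(mask, top, left, crop, crop)
    image = TF.normalize(image, IMAGENET_MEAN, IMAGENET_STD)
    return image, mask
```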

### Results

- mIoU: 54.682
- Accuracy: 80.904
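
These figures can be reproduced approximately by combining the sliding-window inference above with the `mean_iou` metric from the `evaluate` library. The snippet below is a sketch rather than the exact evaluation script: it assumes the ground-truth mask lives in a `label` column of the test split, and the `ignore_index` convention should be checked against the dataset card:

```python
import evaluate
import numpy as np

metric = evaluate.load("mean_iou")

for sample in dataset["test"]:
    pred = segment_image(sample["image"], preprocessor, model)        # from the snippet above
    metric.add(prediction=pred, reference=np.array(sample["label"]))  # "label" column is an assumption

scores = metric.compute(num_labels=40, ignore_index=0, reduce_labels=False)  # ignore_index may need adjusting
print(scores["mean_iou"], scores["overall_accuracy"])
```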

## Citation

**BibTeX:**

[More Information Needed]