Update README.md

README.md CHANGED

@@ -5,7 +5,7 @@ tags:
 - vision
 ---
 
-# Vision Transformer (base-sized model
+# Vision Transformer (base-sized model) trained using DINOv2
 
 Vision Transformer (ViT) model trained using the DINOv2 method. It was introduced in the paper [DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193) by Oquab et al. and first released in [this repository](https://github.com/facebookresearch/dinov2).
 
@@ -15,7 +15,7 @@ Disclaimer: The team releasing DINOv2 did not write a model card for this model
 
 The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a self-supervised fashion at a resolution of 224x224 pixels.
 
-Images are presented to the model as a sequence of fixed-size patches
+Images are presented to the model as a sequence of fixed-size patches, which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
 
 Note that this model does not include any fine-tuned heads.
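
The added line describes the standard ViT input pipeline: fixed-size patches are linearly embedded, a [CLS] token is prepended, and absolute position embeddings are added before the Transformer encoder. The sketch below is a minimal PyTorch illustration of that pipeline only; the patch size and hidden size are assumptions chosen for illustration, not values stated in this card.

```python
import torch
import torch.nn as nn

# Illustrative sizes: the card states a 224x224 resolution, but the patch size
# and hidden size below are assumptions, not taken from this README.
image_size, patch_size, hidden_size = 224, 14, 768
num_patches = (image_size // patch_size) ** 2  # 16 x 16 = 256 patches

# Linear embedding of fixed-size patches, implemented as a strided convolution.
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))                # learnable [CLS] token
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_size))  # absolute position embeddings

pixel_values = torch.randn(2, 3, image_size, image_size)            # dummy batch of 2 images
x = patch_embed(pixel_values).flatten(2).transpose(1, 2)            # (2, 256, 768) patch embeddings
x = torch.cat([cls_token.expand(x.size(0), -1, -1), x], dim=1)      # prepend [CLS] -> (2, 257, 768)
x = x + pos_embed                                                    # add absolute position embeddings
# x would now be fed to the layers of the Transformer encoder.
```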
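Since the checkpoint ships without any fine-tuned head, a typical use is extracting image features with the backbone alone. The following is a minimal sketch using the Hugging Face transformers API; the checkpoint id is an assumption and should be replaced with the id of this model card's repository.

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, Dinov2Model

# Assumed checkpoint id for illustration; substitute this repository's actual id.
checkpoint = "facebook/dinov2-base"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = Dinov2Model.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, 1 + num_patches, hidden_size);
# index 0 along the sequence dimension is the [CLS] token embedding.
cls_embedding = outputs.last_hidden_state[:, 0]
patch_embeddings = outputs.last_hidden_state[:, 1:]
```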