
	---
	library_name: transformers
	license: cc-by-nc-4.0
	inference: false
	---
	# Web-SSL DINO ViT-7B: 8B MetaCLIP data, 518 Resolution

	A 7 billion parameter Vision Transformer (ViT) trained with DINOv2 self-supervised learning on web-scale image data without language supervision. Introduced in ["Scaling Language-Free Visual Representation Learning"](https://arxiv.org/abs/2504.01017) (Fan et al., 2025).

	## Model Details
	- Architecture: ViT (4096 width, 32 depth, 32 heads)
	- Parameters: 7B
	- Resolution: 518×518 pixels
	- Training: Self-supervised Web-DINO on 8B image samples from MetaCLIP web data
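A rough sizing sketch may help when planning hardware for a model of this size. It estimates the memory needed for the weights alone from the nominal 7B parameter count listed above. The helper name `weight_memory_gb` and the dtype table are illustrative assumptions, not part of this repository. Activations and framework overhead are not included.

```python
# Back-of-the-envelope estimate of weight memory for a given parameter count.
# Assumption: 7e9 is the nominal "7B" figure from the model card, not an exact count.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9  # nominal parameter count

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB for weights")
```

Under these assumptions, half precision halves the weight footprint relative to `float32`, which is often the difference between fitting on a single accelerator or not.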

	## Model Description
	Web-SSL DINO 7B is a 7 billion parameter Vision Transformer trained with self-supervised learning on 8 billion web images, without language supervision. The model demonstrates that purely visual learning, when scaled appropriately, can match or exceed the performance of language-supervised models such as CLIP across a range of vision tasks. It performs well on both traditional vision benchmarks and multimodal tasks, including visual question answering and OCR & chart understanding.

	<img src="webssl_teaser.png" alt="WebSSL Model Overview" width="600">

	## Usage

	```python
	from transformers import AutoImageProcessor, Dinov2Model
	import torch
	from PIL import Image

	processor = AutoImageProcessor.from_pretrained('facebook/webssl-dino7b-full8b-518')
	model = Dinov2Model.from_pretrained('facebook/webssl-dino7b-full8b-518')

	# Process an image
	image = Image.open('path/to/image.jpg')
	inputs = processor(images=image, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)

	cls_features = outputs.last_hidden_state[:, 0]  # CLS token features
	patch_features = outputs.last_hidden_state[:, 1:]  # patch-wise token features
	```
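For dense tasks it is often convenient to arrange the patch tokens as a spatial grid. The sketch below shows one way to do that. It assumes a patch size of 14 (the usual DINOv2 setting, not stated in this card) and that the processor produces 518×518 inputs, which would give a 37×37 grid. The helper `patches_to_grid` is hypothetical, and a small random tensor stands in for `outputs.last_hidden_state` so the snippet runs without loading the model.

```python
import torch


def patches_to_grid(last_hidden_state: torch.Tensor,
                    image_size: int = 518,
                    patch_size: int = 14):
    """Split the CLS token from the patch tokens and reshape patches to (B, H, W, D).

    Assumptions: token 0 is the CLS token, there are no extra register tokens,
    and patch_size=14 matches the model configuration.
    """
    side = image_size // patch_size  # 518 // 14 = 37 patches per side
    cls_token = last_hidden_state[:, 0]
    patches = last_hidden_state[:, 1:]
    batch, num_patches, dim = patches.shape
    if num_patches != side * side:
        raise ValueError(
            f"expected {side * side} patch tokens, got {num_patches}"
        )
    return cls_token, patches.reshape(batch, side, side, dim)


# Dummy stand-in for outputs.last_hidden_state. The real feature width is 4096;
# a width of 8 keeps this example lightweight.
dummy_hidden = torch.randn(1, 1 + 37 * 37, 8)
cls_token, patch_grid = patches_to_grid(dummy_hidden)
print(cls_token.shape, patch_grid.shape)
```

With real model outputs, pass `outputs.last_hidden_state` in place of `dummy_hidden`. If the shape check fails, confirm the patch size and input resolution in the model and processor configuration.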

	## Citation

	```bibtex
	@article{fan2025scaling,
	  title={Scaling Language-Free Visual Representation Learning},
	  author={David Fan and Shengbang Tong and Jiachen Zhu and Koustuv Sinha and Zhuang Liu and Xinlei Chen and Michael Rabbat and Nicolas Ballas and Yann LeCun and Amir Bar and Saining Xie},
	  year={2025},
	  eprint={2504.01017},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV}
	}
	```