DPT 3.1 (Swinv2 backbone)

The DPT (Dense Prediction Transformer) model was trained on 1.4 million images for monocular depth estimation. It was introduced in the paper "Vision Transformers for Dense Prediction" by Ranftl et al. (2021) and first released in this repository.

Disclaimer: The team releasing DPT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

This DPT model uses the Swinv2 model as its backbone and adds a neck and a head on top for monocular depth estimation.

How to use

Here is how to use this model for zero-shot depth estimation on an image:

from transformers import DPTImageProcessor, DPTForDepthEstimation
import torch
import numpy as np
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DPTImageProcessor.from_pretrained("Intel/dpt-swinv2-base-384")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-swinv2-base-384")

# prepare image for the model
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_depth = outputs.predicted_depth

# interpolate to original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
)

# visualize the prediction: scale to 0..255 and convert to an 8-bit grayscale image
output = prediction.squeeze().cpu().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
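
The visualization step above divides by the maximum only, so the darkest pixel is not necessarily mapped to 0. If full contrast is wanted, a min-max variant may help. The following is a minimal sketch, not part of the original release: `depth_to_uint8` is a hypothetical helper name, and the small array stands in for `output` so the snippet runs without downloading the model.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max scale a 2D depth map to 0..255 (hypothetical helper)."""
    depth = depth.astype(np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        # Constant map: avoid dividing by zero and return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)  # values now lie in [0, 1]
    return (scaled * 255.0).round().astype(np.uint8)


# Synthetic stand-in for `output` from the snippet above (assumption).
demo = np.array([[2.0, 4.0], [6.0, 10.0]])
demo_u8 = depth_to_uint8(demo)
```

With the real prediction, `Image.fromarray(depth_to_uint8(output))` would replace the last two lines of the snippet above.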

Alternatively, use the pipeline API:

from transformers import pipeline

pipe = pipeline(task="depth-estimation", model="Intel/dpt-swinv2-base-384")
result = pipe("http://images.cocodataset.org/val2017/000000039769.jpg")
result["depth"]  # PIL image holding the predicted depth map
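
To inspect or save the pipeline output, something like the following may be useful. It is a sketch that assumes `result["depth"]` is a PIL image, as the depth-estimation pipeline returns in the library versions we are aware of. `describe_depth_image` is a hypothetical helper, and the solid gray image is a stand-in so the snippet runs offline.

```python
import numpy as np
from PIL import Image


def describe_depth_image(depth_img: Image.Image) -> dict:
    """Summarize a depth image: size, mode and pixel range (hypothetical helper)."""
    arr = np.asarray(depth_img)
    return {
        "size": depth_img.size,  # (width, height)
        "mode": depth_img.mode,
        "min": int(arr.min()),
        "max": int(arr.max()),
    }


# Stand-in for result["depth"] (assumption: an 8-bit grayscale PIL image).
fake_depth = Image.new("L", (4, 3), color=128)
info = describe_depth_image(fake_depth)
```

With a real pipeline result, `describe_depth_image(result["depth"])` gives the same summary, and `result["depth"].save("depth.png")` writes the map to disk.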