NoctOWL: Fine-Grained Open-Vocabulary Object Detector

Model Description

NoctOWL (Not only coarse-text OWL) is an adaptation of OWL-ViT (NoctOWL) and OWLv2 (NoctOWLv2), designed for Fine-Grained Open-Vocabulary Detection (FG-OVD). Unlike standard open-vocabulary object detectors, which focus primarily on class-level recognition, NoctOWL enhances the ability to detect and distinguish fine-grained object attributes such as color, material, transparency, and pattern.

It maintains a balanced trade-off between fine- and coarse-grained detection, making it particularly effective in scenarios requiring detailed object descriptions.

You can find the original code to train and evaluate the model here.

Model Variants

  • NoctOWL Base (lorebianchi98/NoctOWL-base-patch16)
  • NoctOWLv2 Base (lorebianchi98/NoctOWLv2-base-patch16)
  • NoctOWL Large (lorebianchi98/NoctOWL-large-patch14)
  • NoctOWLv2 Large (lorebianchi98/NoctOWLv2-large-patch14)
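Each checkpoint pairs with a specific `transformers` model class and a base OWL processor checkpoint (the Usage section below shows the base pairs; the large-variant processor checkpoints in this sketch are our assumption by analogy with the base ones). A small lookup helper, not part of the release:

```python
# Maps each NoctOWL checkpoint to (transformers model class name, processor checkpoint).
# The large-variant processor checkpoints are assumed by analogy with the base ones.
NOCTOWL_VARIANTS = {
    "lorebianchi98/NoctOWL-base-patch16":    ("OwlViTForObjectDetection", "google/owlvit-base-patch16"),
    "lorebianchi98/NoctOWLv2-base-patch16":  ("Owlv2ForObjectDetection",  "google/owlv2-base-patch16"),
    "lorebianchi98/NoctOWL-large-patch14":   ("OwlViTForObjectDetection", "google/owlvit-large-patch14"),
    "lorebianchi98/NoctOWLv2-large-patch14": ("Owlv2ForObjectDetection",  "google/owlv2-large-patch14"),
}

def classes_for(checkpoint: str) -> tuple:
    """Return (model class name, processor checkpoint) for a NoctOWL checkpoint."""
    return NOCTOWL_VARIANTS[checkpoint]
```

With this in place, a variant can be loaded generically, e.g. `getattr(transformers, classes_for(name)[0]).from_pretrained(name)`.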

Usage

Loading the Model

```python
from transformers import OwlViTForObjectDetection, Owlv2ForObjectDetection, OwlViTProcessor, Owlv2Processor

# Load NoctOWL model
model = OwlViTForObjectDetection.from_pretrained("lorebianchi98/NoctOWL-base-patch16")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")

# Load NoctOWLv2 model
model_v2 = Owlv2ForObjectDetection.from_pretrained("lorebianchi98/NoctOWLv2-base-patch16")
processor_v2 = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
```

Inference Example

```python
from PIL import Image
import torch

# Load image
image = Image.open("example.jpg")

# Define text prompts (fine-grained descriptions)
text_queries = ["a red patterned dress", "a dark brown wooden chair"]

# Process inputs
inputs = processor(images=image, text=text_queries, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Raw per-query logits and predicted boxes
logits = outputs.logits
boxes = outputs.pred_boxes

# Post-process into thresholded boxes in image coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.3, target_sizes=target_sizes
)[0]  # dict with "scores", "labels", "boxes"
```
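`processor.post_process_object_detection` returns, per image, a dict with `scores`, `labels`, and `boxes` (in image coordinates). A small helper of ours to filter detections by score and map label indices back to the text queries:

```python
def format_detections(results, text_queries, threshold=0.3):
    """Filter post-processed detections by score and map label indices back
    to the original text queries. `results` follows the layout returned by
    `processor.post_process_object_detection`: a dict with "scores",
    "labels", and "boxes"."""
    detections = []
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        if float(score) < threshold:
            continue  # drop low-confidence detections
        x0, y0, x1, y1 = (round(float(v), 1) for v in box)
        detections.append(f"{text_queries[int(label)]}: {float(score):.2f} at [{x0}, {y0}, {x1}, {y1}]")
    return detections
```

The helper accepts either tensors or plain lists, since it converts each value with `float()` before formatting.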

Results

We report the mean Average Precision (mAP) on the Fine-Grained Open-Vocabulary Detection (FG-OVD) benchmarks across different difficulty levels, as well as performance on rare classes from the LVIS dataset.

| Model | LVIS (Rare) | Trivial | Easy | Medium | Hard | Color | Material | Pattern | Transparency |
|---|---|---|---|---|---|---|---|---|---|
| OWL (B/16) | 20.6 | 53.9 | 38.4 | 39.8 | 26.2 | 45.3 | 37.3 | 26.6 | 34.1 |
| OWL (L/14) | 31.2 | 65.1 | 44.0 | 39.3 | 26.5 | 43.8 | 44.9 | 36.0 | 29.2 |
| OWLv2 (B/16) | 29.6 | 52.9 | 40.0 | 38.5 | 25.3 | 45.1 | 33.5 | 19.2 | 28.5 |
| OWLv2 (L/14) | 34.9 | 63.2 | 42.8 | 41.2 | 25.4 | 53.3 | 36.9 | 23.3 | 12.2 |
| NoctOWL (B/16) | 11.6 | 46.6 | 44.4 | 45.6 | 40.0 | 44.7 | 46.0 | 46.1 | 53.6 |
| NoctOWL (L/14) | 26.0 | 57.4 | 54.2 | 54.8 | 48.6 | 53.1 | 56.9 | 49.8 | 57.2 |
| NoctOWLv2 (B/16) | 17.5 | 48.3 | 49.1 | 47.1 | 42.1 | 46.8 | 48.2 | 42.2 | 50.2 |
| NoctOWLv2 (L/14) | 27.2 | 57.5 | 55.5 | 57.2 | 50.2 | 55.6 | 57.0 | 49.2 | 55.9 |
