# NoctOWL: Fine-Grained Open-Vocabulary Object Detector

## Model Description
NoctOWL (Not only coarse-text OWL) is an adaptation of OWL-ViT (NoctOWL) and OWLv2 (NoctOWLv2), designed for Fine-Grained Open-Vocabulary Detection (FG-OVD). Unlike standard open-vocabulary object detectors, which focus primarily on class-level recognition, NoctOWL enhances the ability to detect and distinguish fine-grained object attributes such as color, material, transparency, and pattern.
It maintains a balanced trade-off between fine- and coarse-grained detection, making it particularly effective in scenarios requiring detailed object descriptions.
You can find the original code to train and evaluate the model here.
## Model Variants

- NoctOWL Base (`lorebianchi98/NoctOWL-base-patch16`)
- NoctOWLv2 Base (`lorebianchi98/NoctOWLv2-base-patch16`)
- NoctOWL Large (`lorebianchi98/NoctOWL-large-patch14`)
- NoctOWLv2 Large (`lorebianchi98/NoctOWLv2-large-patch14`)
## Usage

### Loading the Model
```python
from transformers import (
    OwlViTForObjectDetection,
    OwlViTProcessor,
    Owlv2ForObjectDetection,
    Owlv2Processor,
)

# Load NoctOWL model (OWL-ViT backbone)
model = OwlViTForObjectDetection.from_pretrained("lorebianchi98/NoctOWL-base-patch16")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")

# Load NoctOWLv2 model (OWLv2 backbone)
model_v2 = Owlv2ForObjectDetection.from_pretrained("lorebianchi98/NoctOWLv2-base-patch16")
processor_v2 = Owlv2Processor.from_pretrained("google/owlv2-base-patch16")
```
### Inference Example
```python
from PIL import Image
import torch

# Load image
image = Image.open("example.jpg")

# Define text prompts (fine-grained descriptions)
text_queries = ["a red patterned dress", "a dark brown wooden chair"]

# Process inputs
inputs = processor(images=image, text=text_queries, return_tensors="pt")

# Run inference (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)

# Extract raw predictions
logits = outputs.logits      # per-query similarity scores against the text prompts
boxes = outputs.pred_boxes   # boxes in normalized (cx, cy, w, h) format

# Post-processing can be applied to threshold scores, rescale boxes, and visualize results
```
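To make the post-processing step concrete, the sketch below decodes the raw outputs manually for a single image: a sigmoid turns the logits into per-prompt probabilities, and the normalized center-format boxes are converted to absolute corner coordinates. The function name, tensor shapes, and dummy values are illustrative assumptions, not part of the NoctOWL API.

```python
import torch

def decode_detections(logits, pred_boxes, image_size, threshold=0.1):
    """Decode raw OWL-style outputs for one image (illustrative sketch).

    logits:     (num_queries, num_texts) raw similarity scores
    pred_boxes: (num_queries, 4) boxes as normalized (cx, cy, w, h)
    image_size: (width, height) of the original image in pixels
    """
    probs = torch.sigmoid(logits)          # per-prompt detection probabilities
    scores, labels = probs.max(dim=-1)     # best-matching text prompt per box

    # Convert (cx, cy, w, h) -> (x0, y0, x1, y1) in absolute pixels
    cx, cy, w, h = pred_boxes.unbind(-1)
    img_w, img_h = image_size
    boxes = torch.stack(
        [(cx - w / 2) * img_w, (cy - h / 2) * img_h,
         (cx + w / 2) * img_w, (cy + h / 2) * img_h],
        dim=-1,
    )

    keep = scores > threshold              # drop low-confidence queries
    return boxes[keep], scores[keep], labels[keep]

# Dummy tensors standing in for one image's outputs (batch dimension removed)
dummy_logits = torch.tensor([[2.0, -3.0], [-4.0, -5.0]])
dummy_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.4], [0.1, 0.1, 0.05, 0.05]])
boxes, scores, labels = decode_detections(dummy_logits, dummy_boxes, (640, 480))
```

In practice, the `transformers` helper `processor.post_process_object_detection(outputs, threshold=..., target_sizes=...)` performs this decoding for you, including batch handling.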
## Results
We report the mean Average Precision (mAP) on the Fine-Grained Open-Vocabulary Detection (FG-OVD) benchmarks across different difficulty levels, as well as performance on rare classes from the LVIS dataset.
| Model | LVIS (Rare) | Trivial | Easy | Medium | Hard | Color | Material | Pattern | Transparency |
|---|---|---|---|---|---|---|---|---|---|
| OWL (B/16) | 20.6 | 53.9 | 38.4 | 39.8 | 26.2 | 45.3 | 37.3 | 26.6 | 34.1 |
| OWL (L/14) | 31.2 | 65.1 | 44.0 | 39.3 | 26.5 | 43.8 | 44.9 | 36.0 | 29.2 |
| OWLv2 (B/16) | 29.6 | 52.9 | 40.0 | 38.5 | 25.3 | 45.1 | 33.5 | 19.2 | 28.5 |
| OWLv2 (L/14) | 34.9 | 63.2 | 42.8 | 41.2 | 25.4 | 53.3 | 36.9 | 23.3 | 12.2 |
| NoctOWL (B/16) | 11.6 | 46.6 | 44.4 | 45.6 | 40.0 | 44.7 | 46.0 | 46.1 | 53.6 |
| NoctOWL (L/14) | 26.0 | 57.4 | 54.2 | 54.8 | 48.6 | 53.1 | 56.9 | 49.8 | 57.2 |
| NoctOWLv2 (B/16) | 17.5 | 48.3 | 49.1 | 47.1 | 42.1 | 46.8 | 48.2 | 42.2 | 50.2 |
| NoctOWLv2 (L/14) | 27.2 | 57.5 | 55.5 | 57.2 | 50.2 | 55.6 | 57.0 | 49.2 | 55.9 |
Base model for `lorebianchi98/NoctOWLv2-large-patch14`: `google/owlv2-large-patch14`