An Overview of Common Computer Vision Tasks

Community Article Published August 4, 2025
| CV Task | Definition/Purpose | Key Baseline Models/Architectures | Common Approaches/Paradigms |
|---|---|---|---|
| Image Classification | Assigns a single label to an entire image. | ViT, DeiT, ConvNeXt | Fine-tuning pre-trained models; Zero-shot classification with multimodal models. |
| Object Detection | Identifies and localizes multiple objects with bounding boxes and labels. | DETR, YOLOv8, FathomNet | End-to-end detection; Tracking-by-detection (for tracking). |
| Image Segmentation | Divides an image into meaningful parts (pixel-level labeling). | SegFormer, Mask2Former-Panoptic | Pixel-level classification; Mask classification paradigm; Fine-tuning. |
| Pose Estimation | Approximates spatial position and orientation of objects/bodies via keypoints. | ViTPose | Top-down keypoint detection (requires object detector). |
| Visual Question Answering (VQA) | Answers natural language questions based on an image. | ViLT, BLIP, BLIP-2, InstructBLIP, VisualBERT | Classification-based (multi-label); Generative (free-form answers); Zero-shot VQA. |
| Anomaly Detection | Identifies patterns not conforming to expected behavior in images/videos. | Time series anomaly using Autoencoders, AnomalyCLIP | Unsupervised learning (reconstruction-based); Zero-shot learning; Outlier exposure. |
| Scene Understanding | Interprets 3D geometry, semantics, and relationships within a scene. | Semantic Scene understanding papers, SceneDINO, Phi-4-multimodal-instruct | Neural implicit representations; Self-supervised learning; Multimodal LLMs. |
| 3D Reconstruction | Captures shape and appearance of real objects/scenes to create 3D models. | MeshFormer, Common Objects in 3D papers, DUSt3R papers | Neural implicit surfaces; Diffusion models for 3D; Multi-view geometry. |
| Video Understanding | Classifies entire videos or recognizes specific actions within them. | VideoMAE, TimeSformer, VideoMamba, Human-Action-Recognition | Fine-tuning; Spatio-temporal modeling; Online vs. Offline processing. |
| Image Feature Extraction | Extracts semantically meaningful numerical representations from images. | ViT-base-patch16-224, ViT-base-patch16-384 | Removing task-specific heads from pre-trained CV models. |
| Feature Matching | Finds corresponding points/regions between images for alignment. | LoFTR papers | Distance-based comparison; Approximate nearest neighbors; Transformer-based matching. |
| Optical Character Recognition (OCR) | Converts documents/images into editable, searchable text. | Text Detection models, CRNN, PARSeq | Modular pipeline (detection + recognition); Fine-tuning. |
| Image Tagging & Attribute Prediction | Assigns descriptive keywords or infers specific characteristics of objects/subjects. | wd-swinv2-tagger, Facial-Attribute-Detection | Multi-label classification; Fine-tuning on annotated datasets. |
| Point Cloud Processing | Works with 3D data as collections of points (generation, completion, analysis). | General Point Model (GPM), Point-JEPA, Diffusion models for point clouds | Diffusion models; Self-supervised learning; Prompt tuning; Multimodal alignment. |
| Image Generation | Creates new images from text (text-to-image) or transforms existing images (image-to-image). | Stable Diffusion, FLUX models, Kandinsky 2.2 | Diffusion models; Latent space manipulation. |

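
The overview above distinguishes image classification, which assigns a single label, from image tagging and classification-based VQA, which are framed as multi-label problems. The sketch below illustrates one common way that difference shows up when decoding a model's raw scores: softmax with argmax for the single-label case, and an independent sigmoid with a threshold for the multi-label case. The label names, the logit values, the 0.5 threshold, and the helper names are all hypothetical and chosen only for illustration; they are not taken from any of the models listed.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (labels compete)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map one raw score to an independent probability in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_single_label(logits, labels):
    """Single-label decoding: return the one label with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


def predict_multi_label(logits, labels, threshold=0.5):
    """Multi-label decoding: keep every label whose own probability passes the threshold."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    return [(label, p) for label, p in scored if p >= threshold]


# Hypothetical labels and raw scores, standing in for a model's output head.
labels = ["cat", "dog", "outdoor", "night"]
logits = [2.0, -1.0, 1.5, -3.0]

single = predict_single_label(logits, labels)
multi = predict_multi_label(logits, labels)

print("single-label:", single[0])
print("multi-label:", [label for label, _ in multi])
```

With these toy scores, the single-label path returns only "cat", while the multi-label path keeps both "cat" and "outdoor" because each label is scored independently. In practice the threshold is usually tuned per dataset rather than fixed at 0.5.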
