An Overview of Common Computer Vision Tasks
Community Article, published August 4, 2025
CV Task | Definition/Purpose | Key Baseline Models/Architectures | Common Approaches/Paradigms |
---|---|---|---|
Image Classification | Assigns a single label to an entire image. | ViT, DeiT, ConvNeXt | Fine-tuning pre-trained models; Zero-shot classification with multimodal models. |
Object Detection | Identifies and localizes multiple objects with bounding boxes and labels. | DETR, YOLOv8, FathomNet | End-to-end detection; Tracking-by-detection (when extended to multi-object tracking). |
Image Segmentation | Divides an image into meaningful parts (pixel-level labeling). | SegFormer, Mask2Former-Panoptic | Pixel-level classification; Mask classification paradigm; Fine-tuning. |
Pose Estimation | Approximates spatial position and orientation of objects/bodies via keypoints. | ViTPose | Top-down keypoint detection (requires object detector). |
Visual Question Answering (VQA) | Answers natural language questions based on an image. | ViLT, BLIP, BLIP-2, InstructBLIP, VisualBERT | Classification-based (multi-label); Generative (free-form answers); Zero-shot VQA. |
Anomaly Detection | Identifies patterns not conforming to expected behavior in images/videos. | Time series anomaly using Autoencoders, AnomalyCLIP | Unsupervised learning (reconstruction-based); Zero-shot learning; Outlier exposure. |
Scene Understanding | Interprets 3D geometry, semantics, and relationships within a scene. | Semantic Scene understanding papers, SceneDINO, Phi-4-multimodal-instruct | Neural implicit representations; Self-supervised learning; Multimodal LLMs. |
3D Reconstruction | Captures shape and appearance of real objects/scenes to create 3D models. | MeshFormer, Common Objects in 3D papers, DUSt3R papers | Neural implicit surfaces; Diffusion models for 3D; Multi-view geometry. |
Video Understanding | Classifies entire videos or recognizes specific actions within them. | VideoMAE, TimeSformer, VideoMamba, Human-Action-Recognition | Fine-tuning; Spatio-temporal modeling; Online vs. Offline processing. |
Image Feature Extraction | Extracts semantically meaningful numerical representations from images. | ViT-base-patch16-224, ViT-base-patch16-384 | Removing task-specific heads from pre-trained CV models. |
Feature Matching | Finds corresponding points/regions between images for alignment. | LoFTR papers | Distance-based comparison; Approximate nearest neighbors; Transformer-based matching. |
Optical Character Recognition (OCR) | Converts documents/images into editable, searchable text. | Text Detection models, CRNN, PARSeq | Modular pipeline (detection + recognition); Fine-tuning. |
Image Tagging & Attribute Prediction | Assigns descriptive keywords or infers specific characteristics of objects/subjects. | wd-swinv2-tagger, Facial-Attribute-Detection | Multi-label classification; Fine-tuning on annotated datasets. |
Point Cloud Processing | Works with 3D data as collections of points (generation, completion, analysis). | General Point Model (GPM), Point-JEPA, Diffusion models for point clouds | Diffusion models; Self-supervised learning; Prompt tuning; Multimodal alignment. |
Image Generation | Creates new images from text (text-to-image) or transforms existing images (image-to-image). | Stable Diffusion, FLUX models, Kandinsky 2.2 | Diffusion models; Latent space manipulation. |
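
To make one of the table's entries concrete, here is a minimal sketch of the "distance-based comparison" approach listed under Feature Matching. It is an illustration under stated assumptions, not how LoFTR or any particular library works: the descriptors are toy 2-D vectors rather than real image features, and the helper names (`nearest`, `mutual_nearest_matches`) and the `max_distance` threshold are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def nearest(query: Descriptor, candidates: List[Descriptor]) -> Tuple[int, float]:
    """Return (index, Euclidean distance) of the candidate closest to query.

    Assumes candidates is non-empty.
    """
    distances = [math.dist(query, c) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]


def mutual_nearest_matches(
    desc_a: List[Descriptor],
    desc_b: List[Descriptor],
    max_distance: float = 0.5,  # hypothetical threshold, tune per descriptor type
) -> List[Tuple[int, int, float]]:
    """Match descriptors from image A to image B.

    A pair (i, j) is kept only if j is the nearest neighbour of i in B,
    i is the nearest neighbour of j in A (mutual check), and the distance
    between them does not exceed max_distance.
    """
    matches = []
    for i, descriptor in enumerate(desc_a):
        j, dist = nearest(descriptor, desc_b)
        back, _ = nearest(desc_b[j], desc_a)
        if back == i and dist <= max_distance:
            matches.append((i, j, dist))
    return matches


# Toy descriptors standing in for features extracted from two images.
desc_a = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
desc_b = [[1.1, 0.9], [0.1, 0.0], [8.0, 8.0]]

matches = mutual_nearest_matches(desc_a, desc_b)
for i, j, dist in matches:
    print(f"A[{i}] <-> B[{j}] (distance {dist:.3f})")
```

In this toy data the third descriptor in each list is a mutual nearest neighbour of the other but is rejected by the distance threshold, which is the kind of filtering that keeps weak correspondences out of a later alignment step. The brute-force search here compares every pair; for large descriptor sets, the approximate nearest neighbour approach named in the table is the usual replacement.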