An Overview of Common Computer Vision Tasks

Community Article Published August 4, 2025

CV Task	Definition/Purpose	Key Baseline Models/Architectures	Common Approaches/Paradigms
Image Classification	Assigns a single label to an entire image.	ViT, DeiT, ConvNeXt	Fine-tuning pre-trained models; Zero-shot classification with multimodal models.
Object Detection	Identifies and localizes multiple objects with bounding boxes and labels.	DETR, YOLOv8, FathomNet	End-to-end detection; Tracking-by-detection (for tracking).
Image Segmentation	Divides an image into meaningful parts (pixel-level labeling).	SegFormer, Mask2Former-Panoptic	Pixel-level classification; Mask classification paradigm; Fine-tuning.
Pose Estimation	Approximates spatial position and orientation of objects/bodies via keypoints.	ViTPose	Top-down keypoint detection (requires object detector).
Visual Question Answering (VQA)	Answers natural language questions based on an image.	ViLT, BLIP, BLIP-2, InstructBLIP, VisualBERT	Classification-based (multi-label); Generative (free-form answers); Zero-shot VQA.
Anomaly Detection	Identifies patterns that do not conform to expected behavior in images/videos.	Autoencoder-based time-series anomaly detection, AnomalyCLIP	Unsupervised learning (reconstruction-based); Zero-shot learning; Outlier exposure.
Scene Understanding	Interprets 3D geometry, semantics, and relationships within a scene.	Semantic Scene understanding papers, SceneDINO, Phi-4-multimodal-instruct	Neural implicit representations; Self-supervised learning; Multimodal LLMs.
3D Reconstruction	Captures shape and appearance of real objects/scenes to create 3D models.	MeshFormer, Common Objects in 3D papers, DUSt3R papers	Neural implicit surfaces; Diffusion models for 3D; Multi-view geometry.
Video Understanding	Classifies entire videos or recognizes specific actions within them.	VideoMAE, TimeSformer, VideoMamba, Human-Action-Recognition	Fine-tuning; Spatio-temporal modeling; Online vs. Offline processing.
Image Feature Extraction	Extracts semantically meaningful numerical representations from images.	ViT-base-patch16-224, ViT-base-patch16-384	Removing task-specific heads from pre-trained CV models.
Feature Matching	Finds corresponding points/regions between images for alignment.	LoFTR papers	Distance-based comparison; Approximate nearest neighbors; Transformer-based matching.
Optical Character Recognition (OCR)	Converts documents/images into editable, searchable text.	Text Detection models, CRNN, PARSeq	Modular pipeline (detection + recognition); Fine-tuning.
Image Tagging & Attribute Prediction	Assigns descriptive keywords or infers specific characteristics of objects/subjects.	wd-swinv2-tagger, Facial-Attribute-Detection	Multi-label classification; Fine-tuning on annotated datasets.
Point Cloud Processing	Works with 3D data as collections of points (generation, completion, analysis).	General Point Model (GPM), Point-JEPA, Diffusion models for point clouds	Diffusion models; Self-supervised learning; Prompt tuning; Multimodal alignment.
Image Generation	Creates new images from text (text-to-image) or transforms existing images (image-to-image).	Stable Diffusion, FLUX models, Kandinsky 2.2	Diffusion models; Latent space manipulation.
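
Two rows in the table differ mainly in how model outputs are turned into labels: Image Classification assigns a single label per image, while Image Tagging & Attribute Prediction uses multi-label classification. The sketch below illustrates that difference on raw scores (logits) only. It is a minimal illustration, not the method of any model listed above: the function names, the label set, the example logits, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (labels compete)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Turn one logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def classify_single_label(logits, labels):
    """Single-label classification: return the one most probable label."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best]


def predict_tags(logits, labels, threshold=0.5):
    """Multi-label tagging: keep every label whose own probability
    reaches the threshold (0.5 is an assumed, tunable value)."""
    return [label for label, x in zip(labels, logits) if sigmoid(x) >= threshold]


# Hypothetical labels and logits, standing in for a model's output head.
labels = ["cat", "dog", "outdoor", "night"]
logits = [2.0, -1.0, 1.5, -3.0]

print(classify_single_label(logits, labels))  # cat
print(predict_tags(logits, labels))           # ['cat', 'outdoor']
```

With the same scores, the single-label path returns only the top label, while the multi-label path keeps every label that independently clears the threshold. In practice the threshold is usually tuned per dataset rather than fixed.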
