Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision
Abstract
A multi-stage, multi-modal knowledge transfer framework using fine-tuned latent diffusion models improves vehicle detection in aerial imagery across different domains.
Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent AP50 improvements of 4-23% over supervised learning on source-domain data, 6-10% over weakly supervised adaptation methods, 7-40% over unsupervised domain adaptation methods, and more than 50% over open-set object detectors. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. The project page is available at: https://humansensinglab.github.io/AGenDA
Community
Motivation:
How can we use synthetic data to improve cross-domain object detection performance in aerial imagery?
Takeaway:
⭐ We highlight challenges for large models like Gemini, Qwen2.5-VL, Deepseek-VL2, and Stable Diffusion in understanding and generating real-world aerial imagery.
⭐ We fine-tune Stable Diffusion to generate synthetic aerial view images and automatically annotate them using cross-attention maps.
⭐ We introduce two large-scale datasets to advance overhead vehicle detection in aerial imagery.
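To make the second takeaway concrete: once a fine-tuned latent diffusion model attends to a vehicle-related text token, the spatial cross-attention map for that token can be thresholded into a rough bounding box, yielding a free annotation for the synthesized image. The sketch below is a minimal, hypothetical illustration of that thresholding step on a single 2D attention map (the function name `attention_to_box` and the fixed threshold are our assumptions, not the paper's actual pipeline, which would operate on real U-Net attention tensors).

```python
import numpy as np

def attention_to_box(attn_map, threshold=0.5):
    """Derive a bounding box from one text token's cross-attention map.

    attn_map: 2D array of attention weights (e.g. for a "car" token),
    as could be extracted from a latent diffusion U-Net layer.
    Returns (x_min, y_min, x_max, y_max) in map coordinates, or None
    if no pixel attends strongly enough.
    """
    # Normalize the map to [0, 1] so the threshold is scale-invariant.
    a = attn_map - attn_map.min()
    if a.max() > 0:
        a = a / a.max()
    # Keep only strongly attended pixels and take their extent.
    mask = a >= threshold
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Toy 8x8 attention map with a hot spot where a vehicle token attends.
attn = np.zeros((8, 8))
attn[2:5, 3:6] = 1.0
print(attention_to_box(attn))  # (3, 2, 5, 4)
```

In practice the boxes would be refined (e.g. upsampled to image resolution and filtered by confidence) before being used as pseudo-labels for detector training.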