RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
Abstract
This study presents a detailed comparison of the RF-DETR base object detection model and YOLOv12 object detection model configurations for detecting greenfruits in a complex orchard environment marked by label ambiguity, occlusions, and background blending. A custom dataset was developed featuring both single-class (greenfruit) and multi-class (occluded and non-occluded greenfruit) annotations to assess model performance under dynamic real-world conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and deformable attention, excelled in global context modeling, effectively identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12 leveraged CNN-based attention for enhanced local feature extraction, optimizing it for computational efficiency and edge deployment. RF-DETR achieved the highest mean Average Precision (mAP@50) of 0.9464 in single-class detection, demonstrating its superior ability to localize greenfruits in cluttered scenes. Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR consistently outperformed it in complex spatial scenarios. For multi-class detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to differentiate between occluded and non-occluded fruits, while YOLOv12L scored highest in mAP@50:95 with 0.6622, indicating better classification in detailed occlusion contexts. Analysis of the training dynamics highlighted RF-DETR's swift convergence, particularly in single-class settings where it plateaued within 10 epochs, demonstrating the efficiency of transformer-based architectures in adapting to dynamic visual data. These findings validate RF-DETR's effectiveness for precision agricultural applications and position YOLOv12 for fast-response scenarios.
Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
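The abstract reports that one model can lead on mAP@50 while another leads on mAP@50:95. As a minimal illustrative sketch (not code from the paper; the boxes and the simplified single-image, 101-point matching below are hypothetical), the following Python snippet shows why the two metrics can diverge: mAP@50 credits any detection with IoU above 0.5, whereas mAP@50:95 averages over stricter IoU thresholds and therefore penalizes loosely localized boxes.

```python
# Illustrative sketch only: simplified single-image, single-class AP,
# not the COCO evaluator used in the paper.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(preds, gts, iou_thr):
    """preds: list of (box, score); gts: list of boxes."""
    preds = sorted(preds, key=lambda p: -p[1])  # rank by confidence
    matched, tp, fp = set(), [], []
    for box, _ in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue
            o = iou(box, g)
            if o > best:
                best, best_j = o, j
        if best >= iou_thr:
            matched.add(best_j)
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 101-point interpolation of precision over recall (COCO-style, simplified).
    ap = 0.0
    for r in np.linspace(0, 1, 101):
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 101
    return ap

# Hypothetical greenfruit example: two ground-truth fruits, two detections.
gts = [[10, 10, 50, 50], [60, 60, 100, 100]]
preds = [([12, 12, 52, 52], 0.9),    # tightly localized detection
         ([58, 55, 108, 95], 0.8)]   # loosely localized detection

ap50 = average_precision(preds, gts, 0.5)
ap50_95 = np.mean([average_precision(preds, gts, t) for t in np.arange(0.5, 1.0, 0.05)])
print(f"AP@50 = {ap50:.3f}, AP@50:95 = {ap50_95:.3f}")
# AP@50 credits both detections; AP@50:95 penalizes the looser second box,
# which is how a model can lead on mAP@50 yet trail on mAP@50:95.
```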