LocalMamba: Visual State Space Model with Windowed Selective Scan Paper • 2403.09338 • Published Mar 14 • 7
GiT: Towards Generalist Vision Transformer through Universal Language Interface Paper • 2403.09394 • Published Mar 14 • 25
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers Paper • 2402.19479 • Published Feb 29 • 32
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection Paper • 2405.10300 • Published May 16 • 26
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model Paper • 2406.20076 • Published Jun 28 • 8
SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout Paper • 2404.00412 • Published Mar 30 • 2
LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels Paper • 2407.18054 • Published Jul 25 • 10
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices Paper • 2408.10161 • Published Aug 19 • 13
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos Paper • 2409.02095 • Published Sep 3 • 35
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model Paper • 2409.01704 • Published Sep 3 • 83
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection Paper • 2409.08513 • Published Sep 13 • 11
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think Paper • 2409.11355 • Published Sep 17 • 28
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors Paper • 2409.17058 • Published Sep 25 • 11
Self-Supervised Any-Point Tracking by Contrastive Random Walks Paper • 2409.16288 • Published Sep 24 • 5
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction Paper • 2409.18124 • Published Sep 26 • 32
MinerU: An Open-Source Solution for Precise Document Content Extraction Paper • 2409.18839 • Published Sep 27 • 26
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second Paper • 2410.02073 • Published Oct 2 • 41
Towards Natural Image Matting in the Wild via Real-Scenario Prior Paper • 2410.06593 • Published Oct 9 • 2
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree Paper • 2410.16268 • Published Oct 21 • 65
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation Paper • 2410.20474 • Published Oct 27 • 14
DELTA: Dense Efficient Long-range 3D Tracking for any video Paper • 2410.24211 • Published Oct 31 • 8
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements Paper • 2411.12044 • Published Nov 18 • 13
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Paper • 2411.10161 • Published Nov 15 • 8
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory Paper • 2411.11922 • Published Nov 18 • 18
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding Paper • 2411.14347 • Published Nov 21 • 13
Knowledge Transfer Across Modalities with Natural Language Supervision Paper • 2411.15611 • Published Nov 23 • 15
Edge Weight Prediction For Category-Agnostic Pose Estimation Paper • 2411.16665 • Published about 1 month ago • 4
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality Paper • 2411.15241 • Published Nov 22 • 5
Scaling Image Tokenizers with Grouped Spherical Quantization Paper • 2412.02632 • Published 22 days ago • 10