cv - a zzfive Collection

cv

updated 13 days ago

LocalMamba: Visual State Space Model with Windowed Selective Scan

Paper • 2403.09338 • Published Mar 14, 2024 • 9
GiT: Towards Generalist Vision Transformer through Universal Language Interface

Paper • 2403.09394 • Published Mar 14, 2024 • 28
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Paper • 2402.19479 • Published Feb 29, 2024 • 35
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Paper • 2405.10300 • Published May 16, 2024 • 31
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Paper • 2406.20076 • Published Jun 28, 2024 • 10
SVGCraft: Beyond Single Object Text-to-SVG Synthesis with Comprehensive Canvas Layout

Paper • 2404.00412 • Published Mar 30, 2024 • 2
LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels

Paper • 2407.18054 • Published Jul 25, 2024 • 12
Matting by Generation

Paper • 2407.21017 • Published Jul 30, 2024 • 24
SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 116
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices

Paper • 2408.10161 • Published Aug 19, 2024 • 15
Sapiens: Foundation for Human Vision Models

Paper • 2408.12569 • Published Aug 22, 2024 • 92
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

Paper • 2409.02095 • Published Sep 3, 2024 • 37
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Paper • 2409.01704 • Published Sep 3, 2024 • 84
Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection

Paper • 2409.08513 • Published Sep 13, 2024 • 15
OmniGen: Unified Image Generation

Paper • 2409.11340 • Published Sep 17, 2024 • 116
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Paper • 2409.11355 • Published Sep 17, 2024 • 32
Degradation-Guided One-Step Image Super-Resolution with Diffusion Priors

Paper • 2409.17058 • Published Sep 25, 2024 • 13
Self-Supervised Any-Point Tracking by Contrastive Random Walks

Paper • 2409.16288 • Published Sep 24, 2024 • 7
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction

Paper • 2409.18124 • Published Sep 26, 2024 • 34
MinerU: An Open-Source Solution for Precise Document Content Extraction

Paper • 2409.18839 • Published Sep 27, 2024 • 28
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Paper • 2410.02073 • Published Oct 2, 2024 • 42
Towards Natural Image Matting in the Wild via Real-Scenario Prior

Paper • 2410.06593 • Published Oct 9, 2024 • 3
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Paper • 2410.16268 • Published Oct 21, 2024 • 70
SMITE: Segment Me In TimE

Paper • 2410.18538 • Published Oct 24, 2024 • 16
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation

Paper • 2410.20474 • Published Oct 27, 2024 • 14
DELTA: Dense Efficient Long-range 3D Tracking for any video

Paper • 2410.24211 • Published Oct 31, 2024 • 9
Face Anonymization Made Simple

Paper • 2411.00762 • Published Nov 1, 2024 • 7
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Paper • 2411.12044 • Published Nov 18, 2024 • 14
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

Paper • 2411.10161 • Published Nov 15, 2024 • 9
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Paper • 2411.11922 • Published Nov 18, 2024 • 19
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Paper • 2411.14347 • Published Nov 21, 2024 • 15
Knowledge Transfer Across Modalities with Natural Language Supervision

Paper • 2411.15611 • Published Nov 23, 2024 • 17
Edge Weight Prediction For Category-Agnostic Pose Estimation

Paper • 2411.16665 • Published Nov 25, 2024 • 6
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality

Paper • 2411.15241 • Published Nov 22, 2024 • 7
Scaling Image Tokenizers with Grouped Spherical Quantization

Paper • 2412.02632 • Published Dec 3, 2024 • 10
EMOv2: Pushing 5M Vision Model Frontier

Paper • 2412.06674 • Published Dec 9, 2024 • 13
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Paper • 2501.08326 • Published Jan 14 • 35
iFormer: Integrating ConvNet and Transformer for Mobile Application

Paper • 2501.15369 • Published Jan 26 • 13
MatAnyone: Stable Video Matting with Consistent Memory Propagation

Paper • 2501.14677 • Published Jan 24 • 36
PixelWorld: Towards Perceiving Everything as Pixels

Paper • 2501.19339 • Published Jan 31 • 17
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Paper • 2501.18052 • Published Jan 29 • 8
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Paper • 2503.10596 • Published Mar 13 • 18
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Paper • 2503.11576 • Published Mar 14 • 109
Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation

Paper • 2503.21780 • Published Mar 27 • 9
TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Paper • 2504.05579 • Published Apr 8 • 5
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency

Paper • 2504.12080 • Published Apr 16 • 7
Group Downsampling with Equivariant Anti-aliasing

Paper • 2504.17258 • Published Apr 24 • 8
Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis

Paper • 2505.09358 • Published May 14 • 25
PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Paper • 2506.14842 • Published 17 days ago • 7