MultiModal_Paper
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published
• 56
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Paper
• 2411.07975
• Published
• 31
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper
• 2411.10442
• Published
• 87
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published
• 47
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
Paper
• 2411.14347
• Published
• 16
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Paper
• 2411.14982
• Published
• 19
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
Paper
• 2411.14762
• Published
• 11
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Paper
• 2411.02545
• Published
• 1
Hymba: A Hybrid-head Architecture for Small Language Models
Paper
• 2411.13676
• Published
• 47
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
Paper
• 2411.11922
• Published
• 19
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 89
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
Paper
• 2411.17686
• Published
• 19
DreamMix: Decoupling Object Attributes for Enhanced Editability in Customized Image Inpainting
Paper
• 2411.17223
• Published
• 7
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Paper
• 2411.15411
• Published
• 8
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
Paper
• 2411.14522
• Published
• 38
Knowledge Transfer Across Modalities with Natural Language Supervision
Paper
• 2411.15611
• Published
• 16
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Paper
• 2411.18363
• Published
• 10
EfficientViM: Efficient Vision Mamba with Hidden State Mixer based State Space Duality
Paper
• 2411.15241
• Published
• 7
Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
Paper
• 2411.17787
• Published
• 12
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published
• 31
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
Paper
• 2409.19603
• Published
• 19
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper
• 2406.19389
• Published
• 54
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Paper
• 2412.03248
• Published
• 26
CompCap: Improving Multimodal Large Language Models with Composite Captions
Paper
• 2412.05243
• Published
• 20
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published
• 62
POINTS1.5: Building a Vision-Language Model towards Real World Applications
Paper
• 2412.08443
• Published
• 38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Paper
• 2412.08737
• Published
• 54
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Paper
• 2412.09604
• Published
• 38
Learned Compression for Compressed Learning
Paper
• 2412.09405
• Published
• 13
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
Paper
• 2412.13871
• Published
• 18
AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
Paper
• 2412.14123
• Published
• 11
FastVLM: Efficient Vision Encoding for Vision Language Models
Paper
• 2412.13303
• Published
• 75
Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
Paper
• 2412.05939
• Published
• 15
Grounding Descriptions in Images informs Zero-Shot Visual Recognition
Paper
• 2412.04429
• Published
• 2
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Paper
• 2501.05767
• Published
• 29
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Paper
• 2502.05178
• Published
• 10
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published
• 64
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Paper
• 2502.03738
• Published
• 11
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Paper
• 2501.12368
• Published
• 45
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published
• 20
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published
• 8
Seedream 3.0 Technical Report
Paper
• 2504.11346
• Published
• 70
RL makes MLLMs see better than SFT
Paper
• 2510.16333
• Published
• 49