CCMat's Collections
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (arXiv:2311.17049)
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (arXiv:2405.04434)
- A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision (arXiv:2303.17376)
- Sigmoid Loss for Language Image Pre-Training (arXiv:2303.15343)
- Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
- InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation (arXiv:2404.19427)
- CogVLM: Visual Expert for Pretrained Language Models (arXiv:2311.03079)
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (arXiv:2404.06512)
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (arXiv:2401.16420)
- InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation (arXiv:2404.02733)
- Demonstration-Regularized RL (arXiv:2310.17303)
- Vision Transformers Need Registers (arXiv:2309.16588)
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation (arXiv:2405.01434)
- Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation (arXiv:2404.19752)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (arXiv:2405.01535)
- LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report (arXiv:2405.00732)
- RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
- What matters when building vision-language models? (arXiv:2405.02246)
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding (arXiv:2405.08748)
- Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300)
- Many-Shot In-Context Learning in Multimodal Foundation Models (arXiv:2405.09798)
- CAT3D: Create Anything in 3D with Multi-View Diffusion Models (arXiv:2405.10314)
- LoRA Learns Less and Forgets Less (arXiv:2405.09673)
- Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818)
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models (arXiv:2405.10637)
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143)
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning (arXiv:2405.12130)
- FIFO-Diffusion: Generating Infinite Videos from Text without Training (arXiv:2405.11473)
- Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control (arXiv:2405.12970)
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981)
- Diffusion for World Modeling: Visual Details Matter in Atari (arXiv:2405.12399)
- Your Transformer is Secretly Linear (arXiv:2405.12250)
- ReVideo: Remake a Video with Motion and Content Control (arXiv:2405.13865)
- Matryoshka Multimodal Models (arXiv:2405.17430)
- An Introduction to Vision-Language Modeling (arXiv:2405.17247)
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738)
- Improving the Training of Rectified Flows (arXiv:2405.20320)
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206)
- BitsFusion: 1.99 bits Weight Quantization of Diffusion Model (arXiv:2406.04333)
- ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325)
- Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step (arXiv:2406.04314)
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (arXiv:2406.02657)
- Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution (arXiv:2307.06304)
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework (arXiv:2404.14619)
- Multi-Head Mixture-of-Experts (arXiv:2404.15045)
- Pegasus-v1 Technical Report (arXiv:2404.14687)
- Towards Modular LLMs by Building and Reusing a Library of LoRAs (arXiv:2405.11157)
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574)
- arXiv:2405.18407
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (arXiv:2405.21060)
- CRAG -- Comprehensive RAG Benchmark (arXiv:2406.04744)
- DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552)
- arXiv:2406.09414
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415)
- The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601)
- EvTexture: Event-driven Texture Enhancement for Video Super-Resolution (arXiv:2406.13457)
- Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation (arXiv:2406.12849)
- Adam-mini: Use Fewer Learning Rates To Gain More (arXiv:2406.16793)
- DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation (arXiv:2406.16855)
- Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion (arXiv:2407.01392)
- No Training, No Problem: Rethinking Classifier-Free Guidance for Diffusion Models (arXiv:2407.02687)
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320)
- Video Diffusion Alignment via Reward Gradients (arXiv:2407.08737)
- arXiv:2407.10671
- Theia: Distilling Diverse Vision Foundation Models for Robot Learning (arXiv:2407.20179)
- Gemma 2: Improving Open Language Models at a Practical Size (arXiv:2408.00118)
- The Llama 3 Herd of Models (arXiv:2407.21783)
- SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement (arXiv:2408.00653)
- SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714)
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (arXiv:2408.01800)
- Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (arXiv:2408.02657)
- IPAdapter-Instruct: Resolving Ambiguity in Image-based Conditioning using Instruct Prompts (arXiv:2408.03209)
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (arXiv:2408.02718)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI (arXiv:2408.03361)
- An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion (arXiv:2408.03178)
- LLaVA-OneVision: Easy Visual Task Transfer (arXiv:2408.03326)
- Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches (arXiv:2408.04567)
- Transformer Explainer: Interactive Learning of Text-Generative Models (arXiv:2408.04619)
- ControlNeXt: Powerful and Efficient Control for Image and Video Generation (arXiv:2408.06070)
- Qwen2-Audio Technical Report (arXiv:2407.10759)
- GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression (arXiv:2407.12077)
- Compact Language Models via Pruning and Knowledge Distillation (arXiv:2407.14679)
- SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841)
- KAN or MLP: A Fairer Comparison (arXiv:2407.16674)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence (arXiv:2407.16655)
- OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person (arXiv:2407.16224)
- MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization (arXiv:2408.02555)
- Mixture of Nested Experts: Adaptive Processing of Visual Tokens (arXiv:2407.19985)
- Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model (arXiv:2407.16982)
- VILA^2: VILA Augmented VILA (arXiv:2407.17453)
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (arXiv:2408.06072)
- arXiv:2408.07009
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872)
- MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model (arXiv:2408.10198)
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv:2408.11039)
- MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning (arXiv:2408.11001)
- Sapiens: Foundation for Human Vision Models (arXiv:2408.12569)
- DreamCinema: Cinematic Transfer with Free Camera and 3D Character (arXiv:2408.12601)
- Scalable Autoregressive Image Generation with Mamba (arXiv:2408.12245)
- Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637)
- LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation (arXiv:2408.13252)
- SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher (arXiv:2408.14176)
- Foundation Models for Music: A Survey (arXiv:2408.14340)
- Diffusion Models Are Real-Time Game Engines (arXiv:2408.14837)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv:2408.15998)
- CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling (arXiv:2408.16532)
- ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model (arXiv:2408.16767)
- CSGO: Content-Style Composition in Text-to-Image Generation (arXiv:2408.16766)
- CoRe: Context-Regularized Text Embedding Learning for Text-to-Image Personalization (arXiv:2408.15914)
- LinFusion: 1 GPU, 1 Minute, 16K Image (arXiv:2409.02097)
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency (arXiv:2409.02634)
- Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing (arXiv:2409.01322)
- Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation (arXiv:2409.03718)
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387)
- Dynamic Typography: Bringing Words to Life (arXiv:2404.11614)
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv:2404.14219)
- LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710)
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (arXiv:2404.16821)
- Iterative Reasoning Preference Optimization (arXiv:2404.19733)
- KAN: Kolmogorov-Arnold Networks (arXiv:2404.19756)
- OmniGen: Unified Image Generation (arXiv:2409.11340)
- IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation (arXiv:2409.08240)
- Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models (arXiv:2409.07452)
- Towards a Unified View of Preference Learning for Large Language Models: A Survey (arXiv:2409.02795)
- Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think (arXiv:2409.11355)
- Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion (arXiv:2409.11406)
- Qwen2.5-Coder Technical Report (arXiv:2409.12186)
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation (arXiv:2312.14125)
- Training Language Models to Self-Correct via Reinforcement Learning (arXiv:2409.12917)
- Imagine yourself: Tuning-Free Personalized Image Generation (arXiv:2409.13346)
- Colorful Diffuse Intrinsic Image Decomposition in the Wild (arXiv:2409.13690)
- Emu3: Next-Token Prediction is All You Need (arXiv:2409.18869)
- MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (arXiv:2409.20566)
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos (arXiv:2409.19603)
- EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis (arXiv:2410.01804)
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740)
- Loong: Generating Minute-level Long Videos with Autoregressive Language Models (arXiv:2410.02757)
- RATIONALYST: Pre-training Process-Supervision for Improving Reasoning (arXiv:2410.01044)
- PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation (arXiv:2410.01680)
- Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (arXiv:2410.02073)
- Baichuan-Omni Technical Report (arXiv:2410.08565)
- DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation (arXiv:2410.08159)
- Animate-X: Universal Character Image Animation with Enhanced Motion Representation (arXiv:2410.10306)
- Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices (arXiv:2410.11795)
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree (arXiv:2410.16268)
- SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes (arXiv:2410.17249)
- Movie Gen: A Cast of Media Foundation Models (arXiv:2410.13720)
- Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (arXiv:2410.13863)
- FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors (arXiv:2410.16271)
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (arXiv:2410.13861)
- Unbounded: A Generative Infinite Game of Character Life Simulation (arXiv:2410.18975)
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss (arXiv:2410.17243)
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think (arXiv:2410.06940)
- Addition is All You Need for Energy-efficient Language Models (arXiv:2410.00907)
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (arXiv:2410.13848)
- Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations (arXiv:2410.10792)
- CLEAR: Character Unlearning in Textual and Visual Modalities (arXiv:2410.18057)
- In-Context LoRA for Diffusion Transformers (arXiv:2410.23775)
- LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation (arXiv:2411.04997)
- Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models (arXiv:2411.07232)
- OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision (arXiv:2411.07199)
- Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models (arXiv:2411.07126)
- SAMPart3D: Segment Any Part in 3D Objects (arXiv:2411.07184)
- Large Language Models Can Self-Improve in Long-context Reasoning (arXiv:2411.08147)
- EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation (arXiv:2411.08380)
- LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models (arXiv:2411.09595)
- MagicQuill: An Intelligent Interactive Image Editing System (arXiv:2411.09703)
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arXiv:2411.10440)
- GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation (arXiv:2411.08033)
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement (arXiv:2411.06558)
- Generative World Explorer (arXiv:2411.11844)
- AnimateAnything: Consistent and Controllable Animation for Video Generation (arXiv:2411.10836)
- RedPajama: an Open Dataset for Training Large Language Models (arXiv:2411.12372)
- FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations (arXiv:2411.10818)
- SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration (arXiv:2411.10958)
- SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory (arXiv:2411.11922)
- Stylecodes: Encoding Stylistic Information For Image Generation (arXiv:2411.12811)
- Multimodal Autoregressive Pre-training of Large Vision Encoders (arXiv:2411.14402)
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (arXiv:2411.14432)
- Stable Flow: Vital Layers for Training-Free Image Editing (arXiv:2411.14430)
- Style-Friendly SNR Sampler for Style-Driven Generation (arXiv:2411.14793)
- Star Attention: Efficient LLM Inference over Long Sequences (arXiv:2411.17116)
- Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator (arXiv:2411.15466)
- Material Anything: Generating Materials for Any 3D Object via Diffusion (arXiv:2411.15138)
- OminiControl: Minimal and Universal Control for Diffusion Transformer (arXiv:2411.15098)
- World-consistent Video Diffusion with Explicit 3D Modeling (arXiv:2412.01821)
- PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos (arXiv:2412.01800)
- WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (arXiv:2411.17459)
- Art-Free Generative Models: Art Creation Without Graphic Art Knowledge (arXiv:2412.00176)
- VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation (arXiv:2412.02259)
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability (arXiv:2411.19943)
- PaliGemma 2: A Family of Versatile VLMs for Transfer (arXiv:2412.03555)
- SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance (arXiv:2412.02687)
- TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (arXiv:2412.03069)
- Imagine360: Immersive 360 Video Generation from Perspective Anchor (arXiv:2412.03552)
- Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion (arXiv:2412.03515)
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (arXiv:2412.01064)
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation (arXiv:2412.00927)
- Open-Sora Plan: Open-Source Large Video Generation Model (arXiv:2412.00131)
- SpotLight: Shadow-Guided Object Relighting via Diffusion (arXiv:2411.18665)
- Video Depth without Video Models (arXiv:2411.19189)
- TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models (arXiv:2411.18350)
- CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models (arXiv:2411.18613)
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity (arXiv:2411.15411)
- SketchAgent: Language-Driven Sequential Sketch Generation (arXiv:2411.17673)
- Pathways on the Image Manifold: Image Editing via Video Generation (arXiv:2411.16819)
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition (arXiv:2411.17440)
- ROICtrl: Boosting Instance Control for Visual Generation (arXiv:2411.17949)
- CleanDIFT: Diffusion Features without Noise (arXiv:2412.03439)
- LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting (arXiv:2412.00177)
- VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467)
- Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (arXiv:2412.04424)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (arXiv:2412.04454)
- Structured 3D Latents for Scalable and Versatile 3D Generation (arXiv:2412.01506)
- A Noise is Worth Diffusion Guidance (arXiv:2412.03895)
- AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models (arXiv:2412.04146)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (arXiv:2412.05271)
- SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion (arXiv:2412.04301)
- APOLLO: SGD-like Memory, AdamW-level Performance (arXiv:2412.05270)
- STIV: Scalable Text and Image Conditioned Video Generation (arXiv:2412.07730)
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics (arXiv:2412.07774)
- arXiv:2412.07583
- Video Motion Transfer with Diffusion Transformers (arXiv:2412.07776)
- LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation (arXiv:2412.05148)
- SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints (arXiv:2412.07760)
- StyleMaster: Stylize Your Video with Artistic Generation and Translation (arXiv:2412.07744)
- Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation (arXiv:2412.06016)
- Learning Flow Fields in Attention for Controllable Person Image Generation (arXiv:2412.08486)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions (arXiv:2412.09596)
- arXiv:2412.08905
- Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion (arXiv:2412.09593)
- Arbitrary-steps Image Super-resolution via Diffusion Inversion (arXiv:2412.09013)
- DisPose: Disentangling Pose Guidance for Controllable Human Image Animation (arXiv:2412.09349)
- Flowing from Words to Pixels: A Framework for Cross-Modality Evolution (arXiv:2412.15213)
- Parallelized Autoregressive Visual Generation (arXiv:2412.15119)
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation (arXiv:2412.13649)
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners (arXiv:2412.17256)
- arXiv:2412.15115
- LoRACLR: Contrastive Adaptation for Customization of Diffusion Models (arXiv:2412.09622)
- Apollo: An Exploration of Video Understanding in Large Multimodal Models (arXiv:2412.10360)
- GenEx: Generating an Explorable World (arXiv:2412.09624)
- SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding (arXiv:2412.09604)
- FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion (arXiv:2412.09626)
- InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
- FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers (arXiv:2412.09611)
- Byte Latent Transformer: Patches Scale Better Than Tokens (arXiv:2412.09871)
- BrushEdit: All-In-One Image Inpainting and Editing (arXiv:2412.10316)
- ColorFlow: Retrieval-Augmented Image Sequence Colorization (arXiv:2412.11815)
- Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces (arXiv:2412.14171)
- Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation (arXiv:2412.14015)