mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Paper • 2304.14178 • Published Apr 27, 2023
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model Paper • 2310.05126 • Published Oct 8, 2023
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding Paper • 2307.02499 • Published Jul 4, 2023
BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization Paper • 2307.08504 • Published Jul 17, 2023
Evaluation and Analysis of Hallucination in Large Vision-Language Models Paper • 2308.15126 • Published Aug 29, 2023
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training Paper • 2212.14546 • Published Dec 30, 2022
Learning Trajectory-Word Alignments for Video-Language Tasks Paper • 2301.01953 • Published Jan 5, 2023
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video Paper • 2302.00402 • Published Feb 1, 2023
mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model Paper • 2311.18248 • Published Nov 30, 2023
LLaVA-Critic: Learning to Evaluate Multimodal Models Paper • 2410.02712 • Published Oct 3, 2024
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training Paper • 2312.08846 • Published Dec 14, 2023
Classification Done Right for Vision-Language Pre-Training Paper • 2411.03313 • Published Nov 5, 2024
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning Paper • 2503.07906 • Published Mar 10, 2025
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? Paper • 2407.04842 • Published Jul 5, 2024
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration Paper • 2311.04257 • Published Nov 7, 2023
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks Paper • 2306.04362 • Published Jun 7, 2023