VDT: General-purpose Video Diffusion Transformers via Mask Modeling Paper • 2305.13311 • Published May 22, 2023
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training Paper • 2103.06561 • Published Mar 11, 2021
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism Paper • 2401.02954 • Published Jan 5, 2024
DeepSeek-VL: Towards Real-World Vision-Language Understanding Paper • 2403.05525 • Published Mar 8, 2024
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling Paper • 2302.06605 • Published Feb 13, 2023
Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs Paper • 2406.09367 • Published Jun 13, 2024
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining Paper • 2410.16166 • Published Oct 21, 2024
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization Paper • 2503.10615 • Published Mar 13, 2025