LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Abstract
LaViDa, a family of vision-language models built on discrete diffusion models, offers competitive performance on multimodal benchmarks with advantages in speed, controllability, and bidirectional reasoning.
Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these respects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text infilling. While effective in language-only settings, the potential of DMs for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multimodal benchmarks such as MMMU, while offering unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results establish LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.
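The abstract names complementary masking as the training-side technique. The snippet below is a minimal sketch of one way such masking can be realized, assuming a discrete-diffusion setup where hidden answer tokens are replaced by a dedicated [MASK] id: each answer sequence is masked twice with complementary token subsets so that every token is supervised in exactly one of the two views. This is an illustration of the idea, not the authors' released implementation; the helper name `complementary_mask` and the `MASK_ID` constant are assumptions.

```python
# Sketch of complementary masking for discrete-diffusion training.
# Illustrative only; MASK_ID and the helper name are assumptions.
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token in the vocabulary

def complementary_mask(answer_ids: torch.Tensor, mask_ratio: float = 0.5):
    """Return two masked copies of `answer_ids` whose masked positions are
    complementary, so every token is hidden (and hence supervised) in
    exactly one of the two views."""
    probs = torch.rand(answer_ids.shape, device=answer_ids.device)
    mask_a = probs < mask_ratio        # positions hidden in view A
    mask_b = ~mask_a                   # remaining positions hidden in view B
    view_a = answer_ids.masked_fill(mask_a, MASK_ID)
    view_b = answer_ids.masked_fill(mask_b, MASK_ID)
    return (view_a, mask_a), (view_b, mask_b)

# Usage: run the denoiser on both views and combine the losses computed on
# each view's masked positions, so all answer tokens receive gradient from
# a single paired pass over the example.
answer_ids = torch.randint(5, 100, (1, 16))
(view_a, mask_a), (view_b, mask_b) = complementary_mask(answer_ids)
assert bool((mask_a ^ mask_b).all())   # the two masks partition the sequence
```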
Community
We propose LaViDa, one of the first and fastest diffusion language models for multimodal understanding tasks.
Project Page: https://homepage.jackli.org/projects/lavida/index.html
Checkpoints and Data: https://huggingface.co/collections/jacklishufan/lavida-10-682ecf5a5fa8c5df85c61ded
Code: https://github.com/jacklishufan/LaViDa
The following papers were recommended by the Semantic Scholar API:
- MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models (2025)
- Aya Vision: Advancing the Frontier of Multilingual Multimodality (2025)
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (2025)
- Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs (2025)
- Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models (2025)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering (2025)
- TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning (2025)