Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Abstract
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding, and text-to-image generation. (ii) We then present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audience of this paper is researchers, graduate students, and professionals in the computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances of multimodal foundation models.
Community
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants (2023)
- Kosmos-2.5: A Multimodal Literate Model (2023)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data (2023)
- The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) (2023)
- Language as the Medium: Multimodal Video Classification through text only (2023)
Models citing this paper: 9
Datasets citing this paper: 0