Multilingual Vision Models
Papers I read to understand vision-language models and to add multilingual capabilities to them.
An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published • 87
Note: Great overview of vision-language modelling approaches
Visual Instruction Tuning
Paper • 2304.08485 • Published • 13
Note:
- Among the first works to apply instruction fine-tuning to vision-language models to improve multimodal chat capabilities
- Generated 158k synthetic visual instruction-following samples using GPT-4
- The original LLaVA model combined a pretrained Vicuna LM with a pretrained CLIP vision encoder, fine-tuned end-to-end on the generated vision-language instruction-following data (see the sketch after this entry)
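A minimal PyTorch sketch of the LLaVA recipe described above: visual features from a CLIP-style encoder are mapped into the LM embedding space by a projector and prefixed to the text embeddings. The Identity modules stand in for the real pretrained CLIP encoder and Vicuna LM, and all dimensions are illustrative assumptions, not the actual model sizes.

```python
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    """Sketch of the LLaVA architecture: vision encoder,
    projector into the LM embedding space, language model on top."""

    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.vision_encoder = nn.Identity()  # stand-in for the pretrained CLIP encoder
        # the original LLaVA used a single trainable linear projection
        self.projector = nn.Linear(vision_dim, lm_dim)
        self.language_model = nn.Identity()  # stand-in for the pretrained Vicuna LM

    def forward(self, image_feats, text_embeds):
        vis = self.projector(self.vision_encoder(image_feats))  # (B, P, lm_dim)
        # image tokens are prepended to the text token embeddings
        inputs = torch.cat([vis, text_embeds], dim=1)
        return self.language_model(inputs)

model = LlavaStyleModel()
out = model(torch.randn(1, 576, 1024), torch.randn(1, 32, 4096))
print(out.shape)  # torch.Size([1, 608, 4096])
```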
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744 • Published • 37
PALO: A Polyglot Large Multimodal Model for 5B People
Paper • 2402.14818 • Published • 23
Note:
- Develops a multilingual large multimodal model covering 10 languages, using an architecture similar to LLaVA's
- Uses pretrained CLIP and Vicuna, with a two-layer MLP (GELU activation) as the projector between modalities (see the sketch after this entry)
- Multilingual dataset curated via a semi-automated translation pipeline that translates the LLaVA dataset
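A minimal sketch of the two-layer GELU projector noted above, which maps CLIP visual features into the LM embedding space; the layer widths are illustrative assumptions, not PALO's actual sizes.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP with GELU, bridging a vision encoder and an LM."""

    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):
        # x: (batch, n_patches, vision_dim) -> (batch, n_patches, lm_dim)
        return self.net(x)

proj = MLPProjector()
print(proj(torch.randn(1, 576, 1024)).shape)  # torch.Size([1, 576, 4096])
```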
Aya 23: Open Weight Releases to Further Multilingual Progress
Paper • 2405.15032 • Published • 27
Note:
- Introduces Aya 23, a family of multilingual (text-only) language models supporting 23 languages, based on Cohere’s “Command” model; pre-trained on a data mixture that includes text from 23 languages and fine-tuned on the Aya multilingual instruction data
- Available in 8B and 35B sizes
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
Paper • 2402.07827 • Published • 45
Parrot: Multilingual Visual Instruction Tuning
Paper • 2406.02539 • Published • 35
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Paper • 2209.06794 • Published • 2
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper • 2412.07112 • Published • 25
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper • 2410.16153 • Published • 43
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 74
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 104
The Llama 3 Herd of Models
Paper • 2407.21783 • Published • 110