Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models
Abstract
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scales for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining (1) the number of training languages that can be included without degrading English performance and the optimal language distributions of (2) pre-training and (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
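To make the mixing recipe concrete, here is a minimal sketch of the kind of training mix the abstract describes: a fixed share of English data, with the remainder spread uniformly over the other training languages. This is an illustrative assumption, not the paper's exact sampling procedure; the function name, placeholder language codes, and the uniform non-English split are all hypothetical.

```python
# Hedged sketch: build a per-sample language assignment with a fixed
# English share and a uniform split over the remaining languages.
# All names and numbers below are illustrative, not the paper's code.
import random

def build_language_mix(samples, english_share=0.5, languages=None):
    """Assign a target language to each training sample.

    english_share: fraction of the mix kept in English (the paper finds
        that 50-75% English, i.e. 25-50% non-English, works well).
    languages: non-English language codes; defaults to 99 placeholders,
        giving 100 training languages in total including English.
    """
    languages = languages or [f"lang_{i}" for i in range(99)]  # placeholder codes
    assignments = []
    for _ in samples:
        if random.random() < english_share:
            assignments.append("en")
        else:
            assignments.append(random.choice(languages))  # uniform non-English split
    return assignments

# Example: 1000 samples, 50% English, the rest spread over 99 languages.
mix = build_language_mix(range(1000), english_share=0.5)
print(mix[:10])
```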
Community
We are presenting "Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models".
Multilingual large vision-language models have to be trained with multilingual data, but what should this data look like? How many languages should it cover? How much of it should be non-English? What about multilingual text in images?
In this work, we first extensively explore the design space for multilingual training data and then apply those lessons to train Centurio, two state-of-the-art multilingual LVLMs based on Aya-Expanse and Qwen 2.5.
For a summary of our results, see here. Our model checkpoints can be found in this HuggingFace collection.
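As a rough illustration of how such a checkpoint might be loaded from the Hub, a sketch using Hugging Face transformers follows. The model id below is an assumption; check the linked collection for the actual checkpoint names and any model-specific loading instructions.

```python
# Hedged sketch: loading a vision-language checkpoint with transformers.
# "WueNLP/centurio_qwen" is an assumed id, not confirmed by this page;
# see the HuggingFace collection linked above for the real checkpoints.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "WueNLP/centurio_qwen"  # assumption; replace with a real id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
```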
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining (2024)
- Maya: An Instruction Finetuned Multilingual Multimodal Model (2024)
- Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement (2024)
- SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment (2025)
- The Roles of English in Evaluating Multilingual Language Models (2024)
- The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model (2024)
- Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs (2024)
Check out a detailed summary of the paper: https://gyanendradas.substack.com/p/centurio-paper-explained