---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- trillionlabs/Trillion-7B-preview
pipeline_tag: visual-question-answering
---
## Introduction
We introduce Trillion-LLaVA-7B, a Vision Language Model (VLM) capable of understanding images.
To observe how multilinguality transfers to vision tasks under controlled conditions, we adopted the same dataset, two-stage training strategy, and model architecture as LLaVA. Although Trillion-LLaVA-7B was trained exclusively on English vision-language instruction pairs, it demonstrates strong performance on Korean visual reasoning tasks. These results indicate that the model's robust multilingual foundation enables effective transfer of visual reasoning capabilities across languages without requiring language-specific visual training data.
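Since the card states the model follows the LLaVA architecture and ships in the transformers format, a minimal inference sketch might look like the example below. The repository id `trillionlabs/Trillion-LLaVA-7B`, the use of `LlavaForConditionalGeneration`/`AutoProcessor`, and the availability of a chat template are assumptions based on that architecture note, not confirmed details of the release; adjust them to the published checkpoint.

```python
# Minimal inference sketch (assumptions: repo id, LLaVA classes, chat template).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "trillionlabs/Trillion-LLaVA-7B"  # assumed repository id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this COCO URL is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt; the processor's chat template (assumed to exist)
# inserts the image token. A Korean question probes cross-lingual transfer.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "이 이미지를 설명해 주세요."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```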
## Evaluation

Performance comparison (English and Korean) across vision-language models:
| Model | MMBench (En) | MMBench (Ko) | SEED-I (En) | SEED-I (Ko) | MMStar (En) | MMStar (Ko) | K-DTCB |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | 0.64 | 0.43 | 0.66 | 0.52 | 0.34 | 0.33 | 0.30 |
| LLaVA-1.6-Mistral-7B | 0.68 | 0.49 | 0.72 | 0.61 | 0.36 | 0.33 | 0.30 |
| Trillion-LLaVA-7B | 0.66 | 0.61 | 0.68 | 0.66 | 0.37 | 0.37 | 0.33 |
## Limitations
- Lack of multilingual visual instruction tuning data: The model was trained exclusively on English vision-language pairs, which leaves room for improvement in other languages.
- Inherited base-model limitations: Since no additional training was performed beyond vision-language understanding data, the model inherits the limitations of Trillion-7B-preview.
## License
This model repository is licensed under the Apache-2.0 License.