Introduction

We introduce Trillion-LLaVA-7B, a Vision Language Model (VLM) capable of understanding images.

To observe the transfer of multilinguality to vision tasks under controlled conditions, we adopted the same dataset, two-stage training strategy, and model architecture as LLaVA. Although Trillion-LLaVA-7B was trained exclusively on English vision-language instruction pairs, it demonstrates strong performance on Korean visual reasoning tasks. These results indicate that the model's robust multilingual foundation enables effective transfer of visual reasoning capabilities across languages without language-specific visual training data.
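Below is a minimal usage sketch. It assumes the released checkpoint follows the standard LLaVA architecture and loads with the transformers LLaVA classes; the repository id, prompt template, and image URL are illustrative assumptions, so consult the repository files for the exact interface.

```python
# Hypothetical usage sketch: assumes the checkpoint is compatible with
# transformers' LLaVA classes. Model id, prompt format, and dtype handling
# are assumptions, not confirmed details of this repository.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "trillionlabs/Trillion-LLaVA-7B"  # assumed repository id
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Any RGB image works here; the URL is a placeholder.
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)

# LLaVA-1.5-style prompt with a Korean question; the actual chat template may differ.
prompt = "USER: <image>\n이 이미지에 무엇이 보이나요? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```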

Evaluation

Performance comparison (English, Korean) across different vision-language models

| Model | MMBench (En) | MMBench (Ko) | SEED-I (En) | SEED-I (Ko) | MMStar (En) | MMStar (Ko) | K-DTCB |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | 0.64 | 0.43 | 0.66 | 0.52 | 0.34 | 0.33 | 0.30 |
| LLaVA-1.6-Mistral-7B | 0.68 | 0.49 | 0.72 | 0.61 | 0.36 | 0.33 | 0.30 |
| Trillion-LLaVA-7B | 0.66 | 0.61 | 0.68 | 0.66 | 0.37 | 0.37 | 0.33 |
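For reference, benchmarks such as MMBench, SEED-Bench (image), and MMStar are scored as multiple-choice accuracy. The snippet below is a minimal, hypothetical scoring sketch, not the official evaluation harness; the real benchmarks use their own prompt formatting and answer parsing.

```python
# Hypothetical multiple-choice scoring sketch (not the official MMBench/SEED/MMStar
# harness): accuracy is the fraction of predictions whose option letter matches gold.
def multiple_choice_accuracy(predictions: list[str], answers: list[str]) -> float:
    correct = sum(
        pred.strip().upper().startswith(gold.strip().upper())
        for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)

# Example: the model answered "B", "A.", "C" where the gold options are B, A, D.
print(multiple_choice_accuracy(["B", "A.", "C"], ["B", "A", "D"]))  # ≈ 0.67
```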

Limitations

  • Lack of multilingual visual instruction tuning data: the model was trained exclusively on English vision-language pairs, which leaves room for improvement in languages other than English.
  • The model inherits the limitations of Trillion-7B-preview, since no additional training was performed beyond vision-language understanding data.

License

This model repository is licensed under the Apache-2.0 License.
