---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- trillionlabs/Trillion-7B-preview
pipeline_tag: visual-question-answering
---
## Introduction
We introduce Trillion-LLaVA-7B, a Vision Language Model (VLM) capable of understanding images.
To observe how multilinguality transfers to vision tasks under controlled conditions, we adopted the same dataset, two-stage training strategy, and model architecture as LLaVA. Although Trillion-LLaVA-7B was trained exclusively on English vision-language instruction pairs, it demonstrates strong performance on Korean visual reasoning tasks. These results indicate that the model's robust multilingual foundation enables effective transfer of visual reasoning capabilities across languages without requiring language-specific visual training data.
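Since the card states the model follows the LLaVA architecture and ships in the transformers format, a minimal inference sketch might look like the example below. The repository id `trillionlabs/Trillion-LLaVA-7B`, the use of `LlavaForConditionalGeneration`/`AutoProcessor`, and the availability of a chat template are assumptions based on that architecture note, not confirmed details of the release; adjust them to the published checkpoint.

```python
# Minimal inference sketch (assumptions: repo id, LLaVA classes, chat template).
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "trillionlabs/Trillion-LLaVA-7B"  # assumed repository id
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this COCO URL is just a placeholder.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a chat-style prompt; the processor's chat template (assumed to exist)
# inserts the image token. A Korean question probes cross-lingual transfer.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "이 이미지를 설명해 주세요."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```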
## Evaluation

Performance comparison (English and Korean) across vision-language models:
| Model | MMBench (En) | MMBench (Ko) | SEED-I (En) | SEED-I (Ko) | MMStar (En) | MMStar (Ko) | K-DTCB |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5-7B | 0.64 | 0.43 | 0.66 | 0.52 | 0.34 | 0.33 | 0.30 |
| LLaVA-1.6-Mistral-7B | 0.68 | 0.49 | 0.72 | 0.61 | 0.36 | 0.33 | 0.30 |
| Trillion-LLaVA-7B | 0.66 | 0.61 | 0.68 | 0.66 | 0.37 | 0.37 | 0.33 |
## Limitations
- Lack of multilingual visual instruction tuning data: The model was trained exclusively on English vision-language pairs, which leaves room for improvement in other languages.
- Inherited base-model limitations: Since no additional training was performed beyond vision-language understanding data, the model inherits the limitations of Trillion-7B-preview.
## License
This model repository is licensed under the Apache-2.0 License.