---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- trillionlabs/Trillion-7B-preview
pipeline_tag: visual-question-answering
---
# Introduction
We introduce Trillion-LLaVA-7B, a Vision Language Model (VLM) capable of understanding images.
To study how multilinguality transfers to vision tasks under controlled conditions, we adopted the same dataset, two-stage training strategy, and model architecture as LLaVA. Although Trillion-LLaVA-7B was trained exclusively on English vision-language instruction pairs, it demonstrates strong performance on Korean visual reasoning tasks. These results indicate that the model's robust multilingual foundation enables effective transfer of visual reasoning capabilities across languages without requiring language-specific visual training data.
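Below is a minimal inference sketch using `transformers`. It assumes the repository ships a LLaVA-style processor and chat template, as is standard for LLaVA-architecture checkpoints on the Hub; the `model_id`, image URL, and prompt are placeholders, and the exact model class may differ from what this repository defines.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical repo id; replace with the actual Hub path of this model.
model_id = "trillionlabs/Trillion-LLaVA-7B"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder image; any RGB image works.
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)

# Korean prompt ("Please describe this image.") to exercise cross-lingual transfer.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "이 이미지를 설명해 주세요."},
    ]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

Because the multilingual transfer happens at the language-model level, the same code path serves English and Korean queries; only the prompt text changes.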
# Evaluation
### Performance comparison (English, Korean) across different vision-language models
| Model | **MMBENCH** En | **MMBENCH** Ko | **SEED-I** En | **SEED-I** Ko | **MMStar** En | **MMStar** Ko | **K-DTCB** |
|----------------------|----------------|----------------|---------------|---------------|---------------|---------------|------------|
| Llava-1.5-7b | 0.64 | 0.43 | 0.66 | 0.52 | 0.34 | 0.33 | 0.30 |
| Llava-1.6-mistral-7b | 0.68 | 0.49 | 0.72 | 0.61 | 0.36 | 0.33 | 0.30 |
| Trillion-LLaVA-7B | 0.66 | **0.61** | 0.68 | **0.66** | 0.37 | **0.37** | **0.33** |
# Limitations
- Lack of multilingual visual instruction tuning data: the model was trained exclusively on English vision-language pairs, which leaves room for improvement in other languages.
- The model inherits the limitations of Trillion-7B-preview, since no additional training was performed beyond the vision-language understanding data.
# License
This model repository is licensed under the Apache-2.0 License.