Llama-3.2-MAAL-11B-Vision-v0.1

Llama-3.2-MAAL-11B-Vision-v0.1 is a bilingual multimodal model trained for text and visual understanding in Korean and English. We are releasing the model, a subset of the training dataset, and a leaderboard to promote and accelerate the development of Korean Vision-Language Models (VLMs).

  • Developed by: maum.ai Brain NLP. Jaeyoon Jung, Yoonshik Kim, Yekyung Nah
  • Language(s) (NLP): Korean, English (currently, bilingual)

Model Description

Version 0.1 is fine-tuned on English and Korean VQA datasets, along with other datasets (OCR, math, etc.).

  • We trained this model on 8 H100-80G GPUs for 2 days using a multimodal fine-tuning dataset of image-text pairs.
  • maum-ai/General-Evol-VQA is one of the datasets that we used for fine-tuning.

Sample Inference Code (GPU)

With transformers >= 4.45.0, you can run inference to generate text from an image and a prompt that you supply.
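The version requirement can be checked before loading the model. The following is a minimal sketch, not part of the original model card: the helper names `version_tuple`, `meets_minimum`, and `installed_transformers_version` are hypothetical, and the comparison only looks at the leading numeric parts of the version string, so a pre-release such as `4.45.0rc1` is treated like `4.45.0`.

```python
import re
from importlib import metadata


def version_tuple(version, width=3):
    """Turn a version string such as '4.45.0.dev0' into (4, 45, 0)."""
    parts = []
    for piece in version.split(".")[:width]:
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    # Pad short versions such as '4.45' with zeros.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def meets_minimum(installed, minimum="4.45.0"):
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_transformers_version():
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


found = installed_transformers_version()
if found is None:
    print("transformers is not installed")
elif meets_minimum(found):
    print(f"transformers {found} meets the 4.45.0 requirement")
else:
    print(f"transformers {found} is older than 4.45.0; please upgrade")
```

The sample inference code itself follows.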

import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1"

# Load the model in bfloat16 and let device_map place it on the available GPU(s).
model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Download a sample image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One user turn with an image placeholder and a Korean text prompt
# (in English: "Write a poem about this image").
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "이 이미지에 대해서 시를 써줘"}
    ]}
]
# The chat template already inserts the special tokens, hence add_special_tokens=False below.
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

# Generate up to 200 new tokens and print the decoded sequence (prompt included).
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0]))
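The final `print` above decodes the whole output sequence, which also contains the prompt tokens. If only the generated answer is wanted, the prompt can be sliced off first. The sketch below is an illustration rather than part of the original example: `new_tokens_only` is a hypothetical helper, and the token IDs in the demonstration are made up so the snippet runs without loading the model.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return only the token IDs that were generated after the prompt.

    Works on anything sliceable: a Python list or a 1-D tensor such as output[0].
    """
    return output_ids[prompt_length:]


# Stand-in data (made-up IDs): three prompt tokens followed by two generated tokens.
fake_output = [101, 102, 103, 7, 8]
print(new_tokens_only(fake_output, prompt_length=3))

# With the variables from the inference example above, this would look like:
#   prompt_length = inputs["input_ids"].shape[-1]
#   answer_ids = new_tokens_only(output[0], prompt_length)
#   print(processor.decode(answer_ids, skip_special_tokens=True))
```

Passing `skip_special_tokens=True` when decoding additionally drops markers such as end-of-turn tokens from the printed text.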

Evaluation Results

Because the main goal of version 0.1 is to provide Korean VQA and OCR capabilities tailored to real-world business use cases, we chose KOFFVQA as the benchmark for assessing the model's Korean instruction-following skills.

Model                                      Params (B)   Average (↑)
NCSOFT/VARCO-VISION-14B                    15.2         66.69
Qwen/Qwen2-VL-7B-Instruct                  8.3          63.53
maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1     10.7         61.13
meta-llama/Llama-3.2-11B-Vision-Instruct   10.7         50.36
mistralai/Pixtral-12B-2409                 12.7         44.62
llava-onevision-qwen2-7b-ov                8            43.78
InternVL2-8b                               8.1          32.76
MiniCPM-V-2_6                              8.1          32.69

Our model achieves roughly a 20% relative improvement in average score over its base model (61.13 vs. 50.36 in the table above). More results are available on the leaderboard.
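The improvement figure can be reproduced from the table. The sketch below assumes that the base model is the meta-llama/Llama-3.2-11B-Vision-Instruct row (the text does not name it explicitly) and that the 20% refers to the relative change in the average score.

```python
# Average KOFFVQA scores copied from the table above.
base_score = 50.36   # meta-llama/Llama-3.2-11B-Vision-Instruct (assumed base model)
tuned_score = 61.13  # maum-ai/Llama-3.2-MAAL-11B-Vision-v0.1

absolute_gain = tuned_score - base_score
relative_gain = absolute_gain / base_score * 100

print(f"absolute gain: {absolute_gain:.2f} points")  # absolute gain: 10.77 points
print(f"relative gain: {relative_gain:.1f}%")        # relative gain: 21.4%
```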

We will release an enhanced model, v0.2, soon.
