---
license: apache-2.0
datasets:
- neulab/CulturalGround
language:
- am
- ar
- bg
- bn
- cs
- de
- el
- en
- es
- fa
- fr
- ga
- hi
- id
- ig
- it
- iw
- ja
- jv
- ko
- nl
- mn
- ms
- no
- pl
- pt
- ro
- ru
- si
- su
- sw
- ta
- te
- th
- tr
- uk
- ur
- vi
- zh
base_model:
- neulab/Pangea-7B
---

# CulturalPangea-7B Model Card

[Grounding Multilingual Multimodal LLMs With Cultural Knowledge](https://neulab.github.io/CulturePangea/)

🌍 🇩🇪 🇫🇷 🇬🇧 🇪🇸 🇮🇹 🇵🇱 🇷🇺 🇨🇿 🇯🇵 🇺🇦 🇧🇷 🇮🇳 🇨🇳 🇳🇴 🇵🇹 🇮🇩 🇮🇱 🇹🇷 🇬🇷 🇷🇴 🇮🇷 🇹🇼 🇲🇽 🇮🇪 🇰🇷 🇧🇬 🇹🇭 🇳🇱 🇪🇬 🇵🇰 🇳🇬 🇮🇩 🇻🇳 🇲🇾 🇸🇦 🇮🇩 🇧🇩 🇸🇬 🇱🇰 🇰🇪 🇲🇳 🇪🇹 🇹🇿 🇷🇼

[🏠 Homepage](https://neulab.github.io/CulturalGround/) | [🤖 CulturalPangea-7B](https://huggingface.co/neulab/CulturalPangea-7B) | [📊 CulturalGround](https://huggingface.co/datasets/neulab/CulturalGround) | [💻 Github](https://github.com/neulab/CulturalGround) | [📄 Arxiv](https://arxiv.org/abs/2508.07414)

[IMAGE]

## Model Details

- **Model:** `CulturalPangea-7B` is an open-source multilingual multimodal LLM (MLLM) fine-tuned to interpret and reason about long-tail cultural entities and concepts. It is designed to bridge the cultural gap often present in MLLMs.
- **Date:** `CulturalPangea-7B` was trained in 2025.
- **Training Dataset:** The model was fine-tuned on the [CulturalGround](https://huggingface.co/datasets/neulab/CulturalGround) dataset, using 14 million open-ended (OE) and 6 million multiple-choice (MCQ) culturally grounded VQA pairs sampled from the full set of 30M (22M OE, 8M MCQ). This data was interleaved with a substantial portion of the original Pangea instruction data to maintain general abilities.
- **Architecture:** `CulturalPangea-7B` is a fine-tuned version of [Pangea-7B](https://huggingface.co/neulab/Pangea-7B). It uses a frozen [CLIP-ViT](https://huggingface.co/openai/clip-vit-large-patch14) vision encoder with a [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) LLM backbone. During training, only the connector and the language model were fine-tuned.

## Uses

`CulturalPangea-7B` follows the same architecture and usage patterns as LLaVA-NeXT and Pangea-7B.

### Direct Use

First, clone and install the LLaVA-NeXT repository.
```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e ".[train]"
```

Then, you can load CulturalPangea-7B using the following code:

```python
from llava.model.builder import load_pretrained_model

model_path = 'neulab/CulturalPangea-7B'
model_name = 'CulturalPangea-7B-qwen'
args = {"multimodal": True}
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, **args)
```

Next, define helper functions for model inference:

```python
import re
from typing import Dict

import torch
import transformers
from PIL import Image
from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.utils import disable_torch_init

def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
    roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens
    _user = tokenizer("user").input_ids + nl_tokens
    _assistant = tokenizer("assistant").input_ids + nl_tokens
    input_ids = []
    source = sources
    # Drop the first turn if the conversation does not start with the human.
    if roles[source[0]["from"]] != roles["human"]:
        source = source[1:]
    input_id, target = [], []
    # Build the system prompt; its content tokens are masked in the target.
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)
    for j, sentence in enumerate(source):
        role = roles[sentence["from"]]
        if has_image and sentence["value"] is not None and "<image>" in sentence["value"]:
            # Count the image placeholders, then tokenize the text around them.
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
            texts = sentence["value"].split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i, text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i