
Maya: A Multilingual Vision Language Model

Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development.

Model Description

Model Details

Maya uses a lightweight architecture to provide a compact yet capable multimodal model, with several key features:

  • Built on the LLaVA framework with the Aya-23 8B language model
  • Uses a multilingual SigLIP vision encoder
  • Supports 8 languages with strong cultural understanding
  • Trained on a toxicity-filtered dataset for safer deployment

Model Architecture

  • Base Model: Aya-23 8B
  • Vision Encoder: SigLIP (multilingual)
  • Training Data: 558,000 images with multilingual annotations
  • Context Length: 8K tokens
  • Parameters: 8 billion

Intended Uses

Maya is designed for:

  • Multilingual visual question answering
  • Cross-cultural image understanding
  • Image captioning in multiple languages
  • Visual reasoning tasks
  • Document understanding

Usage

# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change the working directory
cd maya

# Run the following Python code from within the repository
from llava.eval.talk2maya import run_vqa_model

# Define inputs
question = "Try identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg" 

# Run the model and print its answer
answer = run_vqa_model(
    question=question,
    image_file=image_path
)
print(answer)
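
Because Maya is multilingual, the same entry point accepts questions in any of its eight supported languages. The short sketch below simply reuses run_vqa_model with a Hindi question; the image path is a placeholder, not a file shipped with the repository.

# Hypothetical example: query the model in Hindi ("What is shown in this picture?")
question_hi = "इस तस्वीर में क्या दिखाया गया है?"
answer_hi = run_vqa_model(
    question=question_hi,
    image_file="./my_image.jpg"  # replace with a path to your own image
)
print(answer_hi)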

Limitations

  • Limited to 8 languages currently
  • Requires high-quality images for optimal performance
  • May not capture nuanced cultural contexts in all cases
  • Performance varies across languages and tasks

Bias, Risks, and Limitations

Maya has been developed with attention to bias mitigation and safety:

  • Dataset filtered for toxic content
  • Cultural sensitivity evaluations performed
  • Regular bias assessments conducted
  • Limited to high-quality, vetted training data

However, users should be aware that:

  • Model may still exhibit biases present in training data
  • Performance may vary across different cultural contexts
  • Not suitable for critical decision-making applications

Training Details

Maya was trained using:

  • 558,000 curated images
  • Multilingual annotations in 8 languages
  • A toxicity-filtered dataset
  • 8×H100 GPUs with 80 GB of memory each
  • Per-device batch size of 32
  • Learning rate of 1e-3 with a cosine scheduler
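
As an illustration of this schedule (not the authors' training script), the sketch below shows one way to set up a cosine learning-rate schedule peaking at 1e-3 using PyTorch and the Transformers helper. The warmup ratio and step counts are assumptions derived from the figures above.

import torch
from transformers import get_cosine_schedule_with_warmup

# Stand-in module; in practice this would be the projector/LM being trained
model = torch.nn.Linear(1024, 4096)

# Peak learning rate of 1e-3, as listed in the training details above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Assumed single-epoch step count: 558K images / (32 per device x 8 GPUs)
num_training_steps = 558_000 // (32 * 8)
num_warmup_steps = int(0.03 * num_training_steps)  # assumed 3% warmup

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

for step in range(num_training_steps):
    # ... forward/backward pass on a training batch would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()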

Citation

@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model}, 
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112}, 
}

Contact

For questions or feedback about Maya, please open an issue on the GitHub repository (https://github.com/nahidalam/maya).
