MaLA-500: Massive Language Adaptation of Large Language Models

MaLA-500 is a large language model designed to cover 534 languages. It builds on LLaMA 2 7B and combines continued pretraining with vocabulary extension (to a vocabulary size of 260,164) and LoRA low-rank adaptation.

  • Continued Pretraining: Enhances the model's ability to adapt to a wide range of languages.
  • LoRA Low-Rank Adaptation: Adapts the model through low-rank LoRA modules instead of updating all base weights.
  • Vocabulary Extension: The vocabulary is extended to 260,164 tokens.
  • Multilingual Proficiency: Trained on Glot500-c, covering 534 languages.

With vocabulary extension and LoRA modules, MaLA-500 introduces an additional 2.1B trainable parameters, bringing the total parameter count to 10.7B.
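As a rough sanity check on the figure above, the share contributed by the vocabulary extension alone can be estimated from the embedding shapes. The sketch below assumes LLaMA 2 7B's published configuration (an original vocabulary of 32,000, a hidden size of 4,096, and untied input and output embeddings); these values are not stated in this card, and the resulting split between embedding and LoRA parameters is an estimate, not a number reported by the authors.

```python
# Assumed LLaMA 2 7B configuration values (not stated in this model card).
BASE_VOCAB_SIZE = 32_000
HIDDEN_SIZE = 4_096

# Extended vocabulary size, as given in this model card.
EXTENDED_VOCAB_SIZE = 260_164

# Number of tokens added by the vocabulary extension.
new_tokens = EXTENDED_VOCAB_SIZE - BASE_VOCAB_SIZE

# Each new token adds one row to the input embedding matrix and one row
# to the output projection (assuming the two matrices are not tied).
added_embedding_params = new_tokens * HIDDEN_SIZE * 2

print(f"{added_embedding_params:,}")  # 1,869,119,488
```

Under these assumptions, the new embedding rows would account for roughly 1.87B of the 2.1B additional parameters, leaving the remainder to the LoRA modules.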

Please refer to our paper for more details.

How to Get Started with the Model

Requirements:

transformers>=4.36.1
peft>=0.6.2

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the LLaMA 2 7B base model
base_model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
# Resize the embeddings to match MaLA-500's extended vocabulary
base_model.resize_token_embeddings(260164)
# Load the extended tokenizer and apply the MaLA-500 adapter weights
tokenizer = AutoTokenizer.from_pretrained('MaLA-LM/mala-500-10b')
model = PeftModel.from_pretrained(base_model, 'MaLA-LM/mala-500-10b')
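Once the model and tokenizer are loaded as above, text generation can be sketched as follows. The helper name `generate_text` and its defaults are hypothetical and are not part of the MaLA-500 release; the sketch only assumes that the tokenizer is callable and that the model exposes the standard `generate` method from transformers.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Return the prompt plus a greedy continuation (hypothetical helper)."""
    # Tokenize the prompt into PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding; sampling is disabled so the output is deterministic.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Decode the first (and only) sequence in the batch back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call, assuming `model` and `tokenizer` were loaded as shown above:
# print(generate_text(model, tokenizer, "The capital of Finland is"))
```

Greedy decoding is used here only to keep the example reproducible; sampling parameters can be passed to `generate` instead if more varied output is wanted.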

Citation

@misc{lin2024mala500,
      title={MaLA-500: Massive Language Adaptation of Large Language Models}, 
      author={Peiqin Lin and Shaoxiong Ji and Jörg Tiedemann and André F. T. Martins and Hinrich Schütze},
      year={2024},
      eprint={2401.13303},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}