Model Card for MADLAD-400-3B-CT2

Table of Contents

  1. TL;DR
  2. Model Details
  3. Usage
  4. Uses
  5. Bias, Risks, and Limitations
  6. Training Details
  7. Evaluation
  8. Environmental Impact
  9. Citation

TL;DR

MADLAD-400-3B-MT is a multilingual machine translation model based on the T5 architecture that was trained on 1 trillion tokens covering over 450 languages using publicly available data. It is competitive with models that are significantly larger.

Disclaimer: Santhosh Thottingal, who was not involved in this research, converted the original model to the CTranslate2 format and wrote this model card based on the one for google/madlad400-3b-mt.

Model Details

Model Description

Usage

Find below some example scripts on how to use the model:

Running the model on a CPU or GPU

First, install the required packages:

pip install ctranslate2 sentencepiece huggingface_hub

import ctranslate2
from huggingface_hub import snapshot_download
from sentencepiece import SentencePieceProcessor

# Download the converted model files from the Hugging Face Hub.
model_name = "santhosh/madlad400-3b-ct2"
model_path = snapshot_download(model_name)

tokenizer = SentencePieceProcessor()
tokenizer.load(f"{model_path}/sentencepiece.model")
# Runs on the CPU by default; pass device="cuda" to run on a GPU.
translator = ctranslate2.Translator(model_path)

# Target language code. "pt" (Portuguese) is assumed here from the sample output below.
target_language = "pt"
input_text = "I love pizza!"
# The <2xx> tag in front of the sentence selects the target language.
input_tokens = tokenizer.encode(f"<2{target_language}> {input_text}", out_type=str)
results = translator.translate_batch(
    [input_tokens],
    batch_type="tokens",
    max_batch_size=1024,
    beam_size=1,
    no_repeat_ngram_size=1,
    repetition_penalty=2,
)
translated_sentence = tokenizer.decode(results[0].hypotheses[0])
print(translated_sentence)
# Eu adoro pizza!
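
Because the target language is chosen by the tag alone, one sentence can be sent to several target languages in a single batch. The following is a minimal sketch, not part of the original card: the function name translate_to_many is hypothetical, and it assumes a translator and tokenizer created as in the script above.

```python
def translate_to_many(translator, tokenizer, text, target_languages):
    """Translate `text` into each language code in `target_languages`."""
    # One tokenized input per target language, each with its own <2xx> tag.
    batch = [
        tokenizer.encode(f"<2{lang}> {text}", out_type=str)
        for lang in target_languages
    ]
    results = translator.translate_batch(batch, beam_size=1)
    # Keep the top hypothesis of each result, keyed by language code.
    return {
        lang: tokenizer.decode(result.hypotheses[0])
        for lang, result in zip(target_languages, results)
    }
```

Calling translate_to_many(translator, tokenizer, "I love pizza!", ["pt", "fr", "de"]) would return a dictionary that maps each language code to its translation.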

Uses

Direct Use and Downstream Use

Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages. Primary intended users: Research community.

Out-of-Scope Use

These models are trained on general-domain data and are therefore not expected to work on domain-specific text out of the box. Moreover, these research models have not been assessed for production use cases.

Bias, Risks, and Limitations

We note that we evaluate on only 204 of the languages supported by these models, and only on machine translation and few-shot machine translation tasks. Users must consider use of this model carefully for their own use case.

Ethical considerations and risks

We trained these models with MADLAD-400 and publicly available data to create baseline models that support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora. Because these models were trained on web-crawled datasets that may contain sensitive, offensive or otherwise low-quality content despite extensive preprocessing, issues in the underlying training data may still cause differences in model performance and toxic (or otherwise problematic) output for certain domains. Moreover, large models are dual-use technologies that have specific risks associated with their use and development. We point the reader to surveys such as those written by Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling et al. for a thorough discussion of the risks of machine translation systems.

Known Limitations

More information needed

Sensitive Use:

More information needed

Training Details

We train models of various sizes: a 3B-parameter, 32-layer model; a 7.2B-parameter, 48-layer model; and a 10.7B-parameter, 32-layer model. We share all parameters of the model across language pairs and use a SentencePiece model with a 256k-token vocabulary shared between the encoder and decoder. Each source sentence is prefixed with a <2xx> token that indicates the target language.
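
As a small illustration of the <2xx> convention, the tag can be attached to the raw sentence before tokenization. This is only a sketch: the helper name add_target_tag and its empty-code check are assumptions for illustration, not part of the model or its tooling.

```python
def add_target_tag(sentence: str, target_language: str) -> str:
    """Prepend the <2xx> target-language token to a source sentence."""
    code = target_language.strip()
    if not code:
        raise ValueError("target_language must be a non-empty language code")
    return f"<2{code}> {sentence}"


print(add_target_tag("I love pizza!", "pt"))
# <2pt> I love pizza!
```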

See the research paper for further details.

Training Data

MADLAD-400 is used for both the machine translation and language models. For the machine translation model, a combination of parallel data sources covering 157 languages is also used. Further details are described in the paper.

Training Procedure

See the research paper for further details.

Evaluation

Testing Data, Factors & Metrics

For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the paper.

The translation quality of this model varies based on language, as seen in the paper, and likely varies on domain, though we have not assessed this.

Results

See the research paper for further details.

Environmental Impact

More information needed

Citation

BibTeX:

@misc{kudugunta2023madlad400,
      title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset}, 
      author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
      year={2023},
      eprint={2309.04662},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}