---
license: apache-2.0
language:
  - pt
  - vmw
datasets:
  - LIACC/Emakhuwa-Portuguese-News-MT
base_model:
  - facebook/nllb-200-distilled-600M
pipeline_tag: translation
new_version: felerminoali/ct2_nllb200_pt_vmw_bilingual_int8_ver1
---

# CTranslate2 NLLB-200 Translation Example

This guide demonstrates how to use a CTranslate2-quantized (int8) version of the NLLB-200 model for bilingual translation between Portuguese (`por_Latn`) and Emakhuwa (`vmw_Latn`).

## Prerequisites

Install the required packages:

```bash
pip install ctranslate2 sentencepiece
```
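
Optionally, verify the installation from Python before proceeding (a quick sanity check; `get_cuda_device_count()` reports how many GPUs CTranslate2 can see):

```python
import ctranslate2

# Print the installed version and the number of visible CUDA devices
print("CTranslate2 version:", ctranslate2.__version__)
print("CUDA devices visible:", ctranslate2.get_cuda_device_count())
```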

Download the model to a local folder:

```bash
git lfs install  # Install Git LFS if not already present
git clone https://huggingface.co/felerminoali/ct2_nllb200_pt_vmw_bilingual_int8_ver1
cd ct2_nllb200_pt_vmw_bilingual_int8_ver1
git lfs pull  # Download the large LFS-tracked model files
```
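
If you prefer not to use Git LFS, the same files can be fetched with the `huggingface_hub` library instead (a sketch, assuming `pip install huggingface_hub`; the `local_dir` mirrors the folder name produced by the clone step):

```python
from huggingface_hub import snapshot_download

# Download the CTranslate2 weights and the SentencePiece model into a
# local folder named like the git clone above
snapshot_download(
    repo_id="felerminoali/ct2_nllb200_pt_vmw_bilingual_int8_ver1",
    local_dir="./ct2_nllb200_pt_vmw_bilingual_int8_ver1",
)
```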

## Inference

```python
import os

import ctranslate2
import sentencepiece as spm

model_name = "ct2_nllb200_pt_vmw_bilingual_int8_ver1"
model_name_hf = f"felerminoali/{model_name}"  # Hugging Face repo id

# Assumes the script runs from the directory that contains the cloned folder
local_dir = f"./{model_name}"
src_lang = "por_Latn"
tgt_lang = "vmw_Latn"
sentence = "Olá mundo das línguas!"

print(f"Loading model from {local_dir}")

# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = local_dir
sp_model_path = os.path.join(local_dir, "sentencepiece.bpe.model")

device = "cpu"  # or "cuda"
beam_size = 4

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Load the CTranslate2 translator
translator = ctranslate2.Translator(ct_model_path, device=device)

source_sents = [sentence]

# NLLB expects the target language code as the decoding prefix
target_prefix = [[tgt_lang]] * len(source_sents)

# Subword the source sentences, prepending the source language tag
# and appending the end-of-sentence token
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(
    source_sents_subworded,
    batch_type="tokens",
    max_batch_size=2024,
    beam_size=beam_size,
    target_prefix=target_prefix,
)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences and strip the leading language tag
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):] for sent in translations_desubword]

print("Translations:", *translations_desubword, sep="\n")
```
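
The steps above can be wrapped into a small reusable helper. The `translate` function below is a hypothetical convenience wrapper (not part of the repository) that reuses the `sp`, `translator`, and `beam_size` objects defined above; since the model is bilingual, swapping the language codes should also cover the Emakhuwa-to-Portuguese direction:

```python
def translate(sentences, src_lang="por_Latn", tgt_lang="vmw_Latn"):
    """Hypothetical helper: translate a list of raw sentences."""
    # Subword and tag the inputs exactly as in the script above
    tokenized = sp.encode(sentences, out_type=str)
    tokenized = [[src_lang] + sent + ["</s>"] for sent in tokenized]
    prefixes = [[tgt_lang]] * len(sentences)
    results = translator.translate_batch(
        tokenized,
        batch_type="tokens",
        max_batch_size=2024,
        beam_size=beam_size,
        target_prefix=prefixes,
    )
    # Decode the best hypothesis and strip the leading language tag
    decoded = sp.decode([r.hypotheses[0] for r in results])
    return [sent[len(tgt_lang):].strip() for sent in decoded]

print(translate(["Bom dia!", "Como estás?"]))
# Reverse direction (the model card describes the model as bilingual):
# print(translate(["..."], src_lang="vmw_Latn", tgt_lang="por_Latn"))
```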