---
license: apache-2.0
language:
- pt
- vmw
datasets:
- LIACC/Emakhuwa-Portuguese-News-MT
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
new_version: felerminoali/ct2_nllb200_pt_vmw_bilingual_int8_ver1
---

# CTranslate2 NLLB-200 Translation Example

This guide demonstrates how to use a CTranslate2-quantized (int8) version of the NLLB-200 model for bilingual translation between Portuguese (`por_Latn`) and Emakhuwa (`vmw_Latn`).

## Prerequisites

- Install the required packages and download the model to a local folder (an alternative download method using `huggingface_hub` is sketched at the end of this guide):

```bash
pip install ctranslate2 sentencepiece

# Download the model to a local folder
git lfs install  # Install Git LFS if not already present
git clone https://huggingface.co/felerminoali/ct2_nllb200_pt_vmw_bilingual_int8_ver1
cd ct2_nllb200_pt_vmw_bilingual_int8_ver1
git lfs pull  # Download the LFS-tracked model files
```

## Inference

```python
import os

import ctranslate2
import sentencepiece as spm

model_name = "ct2_nllb200_pt_vmw_bilingual_int8_ver1"

src_lang = "por_Latn"
tgt_lang = "vmw_Latn"
sentence = "Olá mundo das línguas!"  # "Hello, world of languages!"

# [Modify] Paths to the CTranslate2 model folder and the SentencePiece model
ct_model_path = f"./{model_name}"
sp_model_path = os.path.join(ct_model_path, "sentencepiece.bpe.model")

device = "cpu"  # or "cuda"
beam_size = 4

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Load the CTranslate2 model
translator = ctranslate2.Translator(ct_model_path, device=device)

source_sents = [sentence]
target_prefix = [[tgt_lang]] * len(source_sents)

# Subword the source sentences and add the NLLB source language and EOS tokens
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(
    source_sents_subworded,
    batch_type="tokens",
    max_batch_size=2024,
    beam_size=beam_size,
    target_prefix=target_prefix,
)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences and strip the leading target language token
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):] for sent in translations_desubword]

print("Translations:", *translations_desubword, sep="\n")
```
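
The script above translates a single sentence. In practice you will usually want to reuse the loaded translator and tokenizer across many sentences; below is a minimal sketch that wraps the same steps in a helper function (the `translate` function and its defaults are illustrative, not part of the model release):

```python
def translate(texts, translator, sp, src_lang="por_Latn", tgt_lang="vmw_Latn", beam_size=4):
    """Translate a list of sentences with an already-loaded translator and tokenizer.

    Hypothetical convenience wrapper around the steps shown above.
    """
    # Subword and add the NLLB source language and EOS tokens
    subworded = sp.encode(texts, out_type=str)
    subworded = [[src_lang] + sent + ["</s>"] for sent in subworded]

    results = translator.translate_batch(
        subworded,
        batch_type="tokens",
        max_batch_size=2024,
        beam_size=beam_size,
        target_prefix=[[tgt_lang]] * len(texts),
    )

    # Desubword and strip the leading target language token
    decoded = sp.decode([r.hypotheses[0] for r in results])
    return [sent[len(tgt_lang):] for sent in decoded]


# Example: translate a batch of Portuguese sentences to Emakhuwa
print(translate(["Bom dia!", "Como estás?"], translator, sp))  # "Good morning!", "How are you?"
```

Since the model is described as bilingual, swapping `src_lang` and `tgt_lang` should give the reverse direction (Emakhuwa to Portuguese).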
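
As an alternative to the `git clone` step in the prerequisites, the model files can also be fetched with the `huggingface_hub` library. A minimal sketch, assuming `huggingface_hub` is installed (`pip install huggingface_hub`):

```python
from huggingface_hub import snapshot_download

# Download the full repository into a local folder
local_dir = snapshot_download(
    repo_id="felerminoali/ct2_nllb200_pt_vmw_bilingual_int8_ver1",
    local_dir="./ct2_nllb200_pt_vmw_bilingual_int8_ver1",
)
print(f"Model downloaded to {local_dir}")
```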