---
language:
- ace
- acm
- acq
- aeb
- af
- ajp
- ak
- als
- am
- apc
- ar
- ars
- ary
- arz
- as
- ast
- awa
- ayr
- azb
- azj
- ba
- bm
- ban
- be
- bem
- bn
- bho
- bjn
- bo
- bs
- bug
- bg
- ca
- ceb
- cs
- cjk
- ckb
- crh
- cy
- da
- de
- dik
- dyu
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fj
- fi
- fon
- fr
- fur
- fuv
- gaz
- gd
- ga
- gl
- gn
- gu
- ht
- ha
- he
- hi
- hne
- hr
- hu
- hy
- ig
- ilo
- id
- is
- it
- jv
- ja
- kab
- kac
- kam
- kn
- ks
- ka
- kk
- kbd
- kbp
- kea
- khk
- km
- ki
- rw
- ky
- kmb
- kmr
- knc
- kg
- ko
- lo
- lij
- li
- ln
- lt
- lmo
- ltg
- lb
- lua
- lg
- luo
- lus
- lvs
- mag
- mai
- ml
- mar
- min
- mk
- mt
- mni
- mos
- mi
- my
- nl
- nn
- nb
- npi
- nso
- nus
- ny
- oc
- ory
- pag
- pa
- pap
- pbt
- pes
- plt
- pl
- pt
- prs
- quy
- ro
- rn
- ru
- sg
- sa
- sat
- scn
- shn
- si
- sk
- sl
- sm
- sn
- sd
- so
- st
- es
- sc
- sr
- ss
- su
- sv
- swh
- szl
- ta
- taq
- tt
- te
- tg
- tl
- th
- ti
- tpi
- tn
- ts
- tk
- tum
- tr
- tw
- tzm
- ug
- uk
- umb
- ur
- uzn
- vec
- vi
- war
- wo
- xh
- ydd
- yo
- yue
- zh
- zsm
- zu
language_details: >-
  ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab,
  aka_Latn, amh_Ethi, apc_Arab, arb_Arab, ars_Arab, ary_Arab, arz_Arab,
  asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl,
  bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab,
  bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn,
  ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn,
  dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn,
  eus_Latn, ewe_Latn, fao_Latn, pes_Arab, fij_Latn, fin_Latn, fon_Latn,
  fra_Latn, fur_Latn, fuv_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn,
  guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn,
  hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn,
  jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab,
  kas_Deva, kat_Geor, knc_Arab, knc_Latn, kaz_Cyrl, kbd_Cyrl, kbp_Latn,
  kea_Latn, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kon_Latn,
  kor_Hang, kmr_Latn, lao_Laoo, lvs_Latn, lij_Latn, lim_Latn, lin_Latn,
  lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn,
  lus_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Latn, mkd_Cyrl,
  plt_Latn, mlt_Latn, mni_Beng, khk_Cyrl, mos_Latn, mri_Latn, zsm_Latn,
  mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn,
  nya_Latn, oci_Latn, gaz_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn,
  pol_Latn, por_Latn, prs_Arab, pbt_Arab, quy_Latn, ron_Latn, run_Latn,
  rus_Cyrl, sag_Latn, san_Deva, sat_Beng, scn_Latn, shn_Mymr, sin_Sinh,
  slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn,
  spa_Latn, als_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn,
  swh_Latn, szl_Latn, tam_Taml, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn,
  tha_Thai, tir_Ethi, taq_Latn, taq_Tfng, tpi_Latn, tsn_Latn, tso_Latn,
  tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl,
  umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn,
  xho_Latn, ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zul_Latn
tags:
- nllb
- translation
license: cc-by-nc-4.0
datasets:
- flores-200
metrics:
- bleu
- spbleu
- chrf++
inference: false
base_model:
- facebook/nllb-200-1.3B
---

# NLLB-200 1.3B Pre-trained for Kabardian Translation

## Model Details

- **Model Name:** nllb-200-1.3b-kbd-pretrain
- **Base Model:** NLLB-200 1.3B
- **Model Type:** Translation
- **Language(s):** Kabardian and others from NLLB-200 (200 languages)
- **License:** CC-BY-NC-4.0 (inherited from the base model)
- **Developer:** panagoa (fine-tuning), Meta AI (base model)
- **Last Updated:** January 23, 2025
- **Paper:** [NLLB Team et al., No Language Left Behind: Scaling Human-Centered Machine Translation, arXiv, 2022](https://arxiv.org/abs/2207.04672)

## Model Description

This model is a pre-trained adaptation of the NLLB-200 (No Language Left Behind) 1.3B-parameter model, optimized to improve translation quality for the Kabardian language (kbd). The base NLLB-200 model was developed by Meta AI and supports 200 languages; this variant is adapted specifically for Kabardian translation tasks.

## Intended Uses

- Machine translation to and from Kabardian
- NLP applications involving the Kabardian language
- Research on low-resource language translation
- Cultural and linguistic preservation efforts for the Kabardian language

## Training Data

This model continues pre-training from the original NLLB-200 model, which was trained on parallel multilingual data from a variety of sources and on monolingual data constructed from Common Crawl. The additional pre-training for Kabardian likely involved specialized Kabardian language resources. The original NLLB-200 model was evaluated on the Flores-200 dataset.

## Performance and Limitations

- As a pre-trained model, this version is intended to be further fine-tuned for specific translation tasks (a minimal fine-tuning sketch appears at the end of this card)
- Inherits limitations from the base NLLB-200 model:
  - Not intended for production deployment (research model)
  - Not optimized for domain-specific texts (medical, legal, etc.)
  - Not designed for document translation (optimized for single sentences)
  - Trained on input sequences not exceeding 512 tokens
  - Translations cannot be used as certified translations
- May have additional limitations when handling specific cultural contexts, dialectal variations, or specialized terminology in Kabardian

## Usage Example

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Example: translating to Kabardian
src_lang = "eng_Latn"  # English
tgt_lang = "kbd_Cyrl"  # Kabardian in Cyrillic script

# NLLB models expect the source language to be set on the tokenizer,
# which prepends the matching language token to the input.
tokenizer.src_lang = src_lang

text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
    max_length=30,  # short cap on output length; raise for longer sentences
)
translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translation)
```
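The example above translates into Kabardian; translating out of Kabardian only requires swapping the language codes. The snippet below is a minimal sketch of that reverse direction. It assumes, as the card's metadata indicates, that the adapted tokenizer recognizes the `kbd_Cyrl` code, and the placeholder string stands in for real Kabardian input.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"
# The source language can also be set at load time instead of via tokenizer.src_lang.
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="kbd_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

kbd_text = "..."  # replace with a Kabardian sentence in Cyrillic script
inputs = tokenizer(kbd_text, return_tensors="pt")
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=30,
)
print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
```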
## Ethical Considerations

As noted for the base NLLB-200 model:

- This work prioritizes human users and aims to minimize risks transferred to them
- Translation access for low-resource languages can improve education and information access, but could also make groups with lower digital literacy vulnerable to misinformation
- Despite extensive data cleaning, personally identifiable information may not be entirely eliminated from the training data
- Mistranslations could have adverse impacts on those who rely on translations for important decisions

## Caveats and Recommendations

- The base model was primarily tested on the Wikimedia domain, with limited investigation of other domains
- Supported languages may have variations that the model does not capture
- Users should make appropriate assessments for their specific use cases
- This pre-trained model is part of a series of models focused on Kabardian language translation
- For production use cases, consider the fully fine-tuned versions (v0.1, v0.2) rather than this pre-trained version

## Additional Information

This model is part of a collection of NLLB models fine-tuned for Kabardian language translation, developed by panagoa.
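## Fine-Tuning Sketch

As noted under Performance and Limitations, this checkpoint is intended as a starting point for further fine-tuning. The sketch below shows one plausible route using the Hugging Face `Seq2SeqTrainer`; it is not the recipe used to produce this model or its fine-tuned successors. The corpus file (`en_kbd_parallel.csv`), column names, and hyperparameters are illustrative placeholders, and the `kbd_Cyrl` code is assumed to be present in the adapted tokenizer.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "panagoa/nllb-200-1.3b-kbd-pretrain"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="eng_Latn", tgt_lang="kbd_Cyrl"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical parallel corpus with "en" and "kbd" text columns.
dataset = load_dataset("csv", data_files="en_kbd_parallel.csv")["train"]

def preprocess(batch):
    # text_target routes the Kabardian side through the target-language
    # settings; truncation keeps both sides within the 512-token limit.
    return tokenizer(
        batch["en"],
        text_target=batch["kbd"],
        truncation=True,
        max_length=512,
    )

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="nllb-kbd-finetuned",
    per_device_train_batch_size=8,
    learning_rate=1e-4,  # placeholder; tune for your corpus
    num_train_epochs=3,
    logging_steps=100,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The collator handles padding and label preparation. A 1.3B-parameter model is demanding for a single GPU; in practice you may need gradient accumulation, mixed precision, or parameter-efficient methods such as LoRA.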