opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul
Table of Contents
- Model Details
- Uses
- Risks, Limitations and Biases
- How to Get Started With the Model
- Training
- Evaluation
- Citation Information
- Acknowledgements
Model Details
Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to multiple target languages (mul). Note that many of the listed languages are not well supported: training data is very limited for the majority of them, translation performance varies a lot, and for a large number of language pairs the model will not work at all.
This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models are originally trained with Marian NMT, an efficient NMT implementation written in pure C++, and then converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and training pipelines follow the procedures of OPUS-MT-train.
Model Description:
- Developed by: Language Technology Research Group at the University of Helsinki
- Model Type: Translation (transformer-big)
- Release: 2024-05-30
- License: Apache-2.0
- Language(s):
- Source Language(s): deu eng fra por spa
- Target Language(s): aai aar aau abi abk acd ace acf ach acm acn acr ade adj ady aeu aey afb afh_Latn afr agd agn agu ahk aia aka akh akl_Latn akp alj aln alp alq alt alz ame amh ami ami_Latn amk amu amu_Latn ang_Latn ann anp anv aoz apc apr apu ara arc arg arq arz asm aso ast atg atj atq aui auy ava avk_Latn avn avu awa awb awx aze_Cyrl aze_Latn azg azz azz_Latn bak bal bal_Latn bam bam_Latn ban bar bas bav bba bbo bbr bcl bcw bef beh bel bem ben bep bex bfa bfd bfo bgr bhl bho bhz bib bik bim bis biv bjr bjv bku bkv blh blt blz bmh bmk bmq bmu bmv bnp bod boj bom_Latn bos_Cyrl bos_Latn bov box bpr bps bpy bqc bqj bqp bre bru brx bss btd bth bto bts btt btx bua bud bug buk bul bus bvy_Latn bwq bwu byn bzd bzh bzj bzt_Latn caa cab cac cak cak_Latn cat cay cbk_Latn cce cco ceb ces cfm cgc cha che chf chm chq chq_Latn chr chu chv chy chz cjk cjk_Latn cjo cjp cjp_Latn cjv cjy_Hans cjy_Hant ckb cko cle cme cmn cmn_Hans cmn_Hant cmo cmr cnh cnh_Latn cni cni_Latn cnl cnr cnr_Latn cnt cnw cok cop cop_Copt cor cos cot cpa cpu cre cre_Latn crh crn crs crx csb csb_Latn csk cso csy cta ctd ctp ctu cuc cui cuk cut cux cwe cwt cya cym czt daa dad dag_Latn dah dan ded deu dga dgi dig dik din diq div dje djk djk_Latn dng dni dnj dob dop dop_Latn drt_Latn dsb dsh dtp dty dug dws_Latn dww dyi dyo dyu dzo efi egl ell emi eng enm_Latn epo ess est eus ewe ext fai fal fao far fas fij fil fin fkv_Latn fon for fra frd frm_Latn fro_Latn frp frr fry fuc ful fur gag gah gaw gbm gcf gcf_Latn gde gej gfk ghs gil gkn gla gle glg glk glv gnd gng gog gor gos got got_Goth gqr grc grc_Grek grn gsw guc gud guh guj guo gur guw guw_Latn gux gvf gvl gwi gwr gym gyr hag hat hau hau_Latn haw hay hbo hbo_Hebr hbs hbs_Cyrl hbs_Latn hch heb heh her hif hif_Latn hig hil hin hin_Latn hla hlt hmn hne hnj hnn hns hoc hoc_Wara hot hrv hrx_Latn hsb hsn hui hun hus hus_Latn hvn hwc hye hyw hyw_Armn hyw_Latn iba ibo icr ido_Latn ifa ifb ife ifk ifu ify ign igs_Latn iii ike_Latn iku iku_Latn ile_Latn ilo imo ina_Latn ind inh inh_Latn ino iou ipi ipk iri irk iry isl ita itv ium ixl ixl_Latn izh izr jaa jaa_Bopo jaa_Hira jaa_Kana jaa_Yiii jac jak_Latn jam jav jav_Java jbo jbo_Cyrl jbo_Latn jbu jdt_Cyrl jmc jpa_Hebr jpn jun jvn kaa kab kac kal kam kan kao kas_Arab kas_Deva kat kau kaz kaz_Cyrl kbd kbm kbp kbp_Cans kbp_Ethi kbp_Geor kbp_Grek kbp_Hang kbp_Latn kbp_Mlym kbp_Yiii kdc kdj kdl kdn kea kek kek_Latn ken keo ker keu kew kez kgf kgk kha khm khz kia kik kin kir_Cyrl kjb kje kjh kjs kki kkj kle kma kmb kmg kmh kmo kmr kmu knc kne knj knk kno kog koi kok kom kon kpf kpg kpr kpv kpw kpz kqe kqf kqp kqw krc kri krj krl kru ksb ksh ksr ktb ktj kua kub kud kue kum kur_Arab kur_Cyrl kur_Latn kus kvn kwf kxc kxm kyc kyf kyg kyq kzf laa_Latn lac lad lad_Latn lah lao las lat lat_Latn lav law lbe lcm ldn_Latn lee lef lem leu lew lex lez lfn_Cyrl lfn_Latn lgg lhu lia lid lif lij lim lin lip lit liv_Latn ljp lkt lld_Latn lln lme lmo lnd lob lok lon lou_Latn lrc lsi ltz lua luc lug luo lus lut_Latn luy lzz_Latn maa mad mag mah mai maj mak mal mam mam_Latn maq mar mau maw max_Latn maz mbb mbf mbt mcb mcp mcu mda mdf med mee meh_Latn mek men meq mfe mfh mfi mfk mfq mfy mgd mgm_Latn mgo mhi mhl mhx mhy mib mic mie mif mig mih mil mio mit mix mix_Latn miy miz mjc mkd mks mlg mlh mlp mlt mmo mmx mna mnb mnf mnh mni mnr_Latn mnw moa mog moh mol mon mop mor mos mox mpg mpm mpt mpx mqb mqj mri mrj mrw msa msa_Arab msa_Latn msm mta muh mux muy mva mvp mvv_Latn mwc mwl mwm mwv mww mxb mxt mya myb myk myu myv myw myx mzk mzm mzn mzw mzz naf nak nap nas nau nav 
nbl nca nch ncj ncl ncu nde ndo nds ndz neb nep new nfr ngt_Latn ngu ngu_Latn nhe nhg nhg_Latn nhi nhn_Latn nhu nhw nhx nhy nia nif nii nij nim nin niu njm nlc nld nlv_Latn nmz nnb nnb_Latn nnh nno nnw nob nog non nop nor not nou nov_Latn npi npl npy nqo nsn nso nss nst_Latn nsu ntm ntp ntr nuj nus nuy nwb nwi nya nyf nyn nyo nyy nzi oar_Hebr oar_Syrc obo oci ofs_Latn oji_Latn oku okv old omw ood ood_Latn opm ori orm orv_Cyrl osp_Latn oss ota_Arab ota_Latn ota_Rohg ota_Syrc ota_Thaa ota_Yezi ote otk_Orkh otm otn otq ozm pab pad pag pai_Latn pal pam pan pan_Guru pao pap pau pbi pbl pcd pck_Latn pcm pdc pes pfl phn_Phnx pib pih pih_Latn pio pis pkb pli pls plt plw pmf pms pmy_Latn pne pnt_Grek poe poh pol por pot pot_Latn ppk ppk_Latn ppl_Latn prf prg_Latn prs ptp ptu pus pwg pww quc qya qya_Latn rai rap rav rej rhg_Latn rif_Latn rim rmy roh rom ron rop rro rue rug run rup rus rwo sab sag sah san san_Deva sas sat sat_Latn sba sbd sbl scn sco sda sdh seh ses sgb sgs sgw sgz shi shi_Latn shk shn shs_Latn shy_Latn sig sil sin sjn_Latn skr sld slk sll slv sma sme smk sml sml_Latn smn smo sna snc snd_Arab snp snw som sot soy spa spl spp sps sqi srd srm srn srp_Cyrl srq ssd ssw ssx stn stp stq sue suk sun sur sus suz swa swc swe swg swh swp sxb sxn syc syl_Sylo syr szb szl tab tac tah taj tam taq tat tbc tbl tbo tbz tcs tcy tel tem teo ter tet tfr tgk tgk_Cyrl tgk_Latn tgl tgl_Latn tgl_Tglg tgo tgp tha thk thv tig tik tim tir tkl tlb tlf tlh tlh_Latn tlj tlx tly_Latn tmc tmh tmr_Hebr tmw_Latn toh toi toi_Latn toj ton tpa tpi tpm tpw_Latn tpz trc trn trq trs trs_Latn trv tsn tso tsw ttc tte ttr tts tuc tuf tuk tuk_Latn tum tur tvl twb twi twu txa tyj_Latn tyv tzh tzj tzl tzl_Latn tzm_Latn tzm_Tfng tzo ubr ubu udm udu uig uig_Arab uig_Cyrl uig_Latn ukr umb urd usa usp usp_Latn uvl uzb_Cyrl uzb_Latn vag vec ven vep vie viv vls vmw vmy vol_Latn vot vot_Latn vro vun wae waj wal wap war wbm wbp wed wln wmt wmw wnc wnu wob wol wsk wuu wuv xal xcl_Armn xcl_Latn xed xho xmf xog xon xrb xsb xsi xsm xsr xtd xtm xuo yal yam yaq yaz yby ycl ycn yid yli yml yon yor yua yue_Hans yue_Hant yut yuw zam zap zea zgh zha zia zlm_Arab zlm_Latn zom zsm_Arab zsm_Latn zul zyp zza
- Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form >>id<< (where id is a valid target language ID) is required, e.g. >>aai<<.
Uses
This model can be used for translation and text-to-text generation.
Risks, Limitations and Biases
CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).
Also note that many of the listed languages are not well supported: training data is very limited for the majority of them, translation performance varies a lot, and for a large number of language pairs the model will not work at all.
How to Get Started With the Model
A short code example:
```python
from transformers import MarianMTModel, MarianTokenizer

# Each input sentence starts with the target-language token (>>id<<).
src_text = [
    ">>aai<< Replace this with text in an accepted source language.",
    ">>zza<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
```
You can also use OPUS-MT models with the transformers pipeline API, for example:
```python
from transformers import pipeline

# The pipeline wraps tokenization, generation and decoding in a single call.
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul")
print(pipe(">>aai<< Replace this with text in an accepted source language."))
```
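Because the model covers hundreds of target languages, it can be convenient to list the valid >>id<< tokens programmatically rather than consulting the table above. A minimal sketch, assuming the standard >>xxx<< token convention used by multilingual Marian models:
```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul"
)

# Target-language tokens follow the ">>id<<" convention; collect them from the vocabulary.
lang_tokens = sorted(
    tok for tok in tokenizer.get_vocab()
    if tok.startswith(">>") and tok.endswith("<<")
)
print(f"{len(lang_tokens)} language tokens, e.g. {lang_tokens[:5]}")
```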
Training
- Data: opusTCv20230926max50+bt+jhubc (source)
- Pre-processing: SentencePiece (spm32k,spm32k); see the tokenization sketch after this list
- Model Type: transformer-big
- Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
- Training Scripts: GitHub Repo
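The released Hugging Face tokenizer bundles the 32k SentencePiece models used for pre-processing, so the subword segmentation can be inspected directly. A minimal sketch (the exact pieces depend on the trained vocabulary):
```python
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-mul"
)

# Inspect how the source-side SentencePiece model segments a tagged input sentence.
print(tokenizer.tokenize(">>fra<< Subword segmentation example."))
```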
Evaluation
- Model scores at the OPUS-MT dashboard
- test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
- test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
- benchmark results: benchmark_results.txt
- benchmark output: benchmark_translations.zip
| langpair | testset | chr-F | BLEU | #sent | #words |
|----------|---------|-------|------|-------|--------|
| multi-multi | tatoeba-test-v2020-07-28-v2023-09-26 | 0.55024 | 29.2 | 10000 | 75838 |
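The chr-F and BLEU figures above come from standard MT evaluation tooling; a minimal sketch of how comparable scores can be computed with the sacrebleu package (the hypothesis and reference strings below are placeholders, not data from this benchmark):
```python
import sacrebleu  # pip install sacrebleu

# Placeholder hypotheses and references; sacrebleu expects one list per reference set.
hypotheses = ["Ceci est une phrase traduite."]
references = [["Ceci est une phrase traduite."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# sacrebleu reports chrF on a 0-100 scale; the table above uses 0-1.
print(f"BLEU = {bleu.score:.1f}, chr-F = {chrf.score / 100:.5f}")
```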
Citation Information
- Publications: Democratizing neural machine translation with OPUS-MT, OPUS-MT – Building open translation services for the World, and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT. Please cite these papers if you use this model.
```bibtex
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
journal={Language Resources and Evaluation},
volume={58},
pages={713--755},
year={2023},
publisher={Springer Nature},
issn={1574-0218},
doi={10.1007/s10579-023-09704-w}
}
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} {--} Building open translation services for the World",
author = {Tiedemann, J{\"o}rg and Thottingal, Santhosh},
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
month = nov,
year = "2020",
address = "Lisboa, Portugal",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2020.eamt-1.61",
pages = "479--480",
}
@inproceedings{tiedemann-2020-tatoeba,
title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
author = {Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Fifth Conference on Machine Translation",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.wmt-1.139",
pages = "1174--1182",
}
```
Acknowledgements
The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.
Model conversion info
- transformers version: 4.45.1
- OPUS-MT git hash: 0882077
- port time: Wed Oct 9 18:54:16 EEST 2024
- port machine: LM0-400-22516.local