Smugri-tuned NLLB-1.3b, v0.01

This is a fine-tune of NLLB-1.3b on parallel data for 29 Finno-Ugric languages. For some of these languages it can generate output in a specific dialect or variety; see the lists below.

Details on the training data and other specifics will be added soon. Training of this model is still in progress, and its quality has not been tested yet. So far only parallel data has been used for training; more dialects will be added once monolingual and synthetic data are included.

Usage in Python, to translate from English to Veps (New written Veps dialect/variety):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/nllb1.3-smugri4-v0.01")
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/nllb1.3-smugri4-v0.01")

# the target dialect/variety is given as a '<...>' tag at the start of the input
input_text = "<New written Veps> This is a short example sentence."
source_lang = "eng_Latn"
target_lang = "vep_Latn"

tokenizer.src_lang = source_lang

input_tokenized = tokenizer(input_text, return_tensors="pt")

output_raw = model.generate(**input_tokenized, forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang))

output = tokenizer.decode(output_raw[0], skip_special_tokens=True)

print(output) # should be 'Nece om lühüd ozutezsana.'

# for '<Central Eastern Veps>' the output becomes 'Nece om lühüd naverz’ sanond.'
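
For repeated use, the steps above can be wrapped in a small helper. This is a minimal sketch and not part of the released code: the names `tag_input` and `translate` are hypothetical, and it assumes that the variety tag is simply prepended to the source text in angle brackets, as in the example, and that it can be left out for languages without listed dialects. `model` and `tokenizer` are the objects loaded above.

```python
def tag_input(text, variety=None):
    """Prepend a '<variety>' tag to the source text, as in the example above.

    Assumption: with variety=None the text is passed through unchanged.
    """
    if variety is None:
        return text
    return f"<{variety}> {text}"


def translate(model, tokenizer, texts, source_lang, target_lang,
              variety=None, max_new_tokens=128):
    """Translate a list of sentences with an already loaded model and tokenizer."""
    tokenizer.src_lang = source_lang
    tagged = [tag_input(t, variety) for t in texts]
    # Pad so that sentences of different length fit into one batch.
    batch = tokenizer(tagged, return_tensors="pt", padding=True)
    output_raw = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_raw, skip_special_tokens=True)
```

A call would then look like `translate(model, tokenizer, ["This is a short example sentence."], "eng_Latn", "vep_Latn", variety="New written Veps")`.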

Supported languages

Languages marked with * support dialect/variety tags (see Supported dialects below).

  • est_Latn (Estonian), fin_Latn (Finnish), fkv_Latn (Kven), izh_Latn (Izhorian*), krl_Latn (Proper Karelian*), liv_Latn (Livonian), lud_Latn (Ludian*), olo_Latn (Livvi-Karelian*), vep_Latn (Veps*), vot_Latn (Votic*), vro_Latn (Võro)
  • sje_Latn (Pite Sami), sju_Latn (Ume Sami), sma_Latn (Southern Sami), sme_Latn (Northern Sami), smj_Latn (Lule Sami), smn_Latn (Inari Sami), sms_Latn (Skolt Sami), sjd_Cyrl (Kildin Sami*)
  • kpv_Cyrl (Komi-Zyrian), koi_Cyrl (Komi-Permyak), udm_Cyrl (Udmurt)
  • mdf_Cyrl (Moksha), myv_Cyrl (Erzya)
  • mhr_Cyrl (Meadow Mari), mrj_Cyrl (Hill Mari)
  • hun_Latn (Hungarian), kca_Cyrl (Khanty*), mns_Cyrl (Mansi)
  • eng_Latn (English), lvs_Latn (Latvian), rus_Cyrl (Russian), nor_Latn (Norwegian)
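
A mistyped language code can fail silently, because `tokenizer.convert_tokens_to_ids` typically falls back to the unknown-token id instead of raising an error. A minimal guard could look like the sketch below. The codes are copied from the list above; the names `SUPPORTED_LANGS` and `check_lang` are hypothetical and not part of the model or of `transformers`.

```python
# Language codes copied from the "Supported languages" list.
SUPPORTED_LANGS = {
    "est_Latn", "fin_Latn", "fkv_Latn", "izh_Latn", "krl_Latn", "liv_Latn",
    "lud_Latn", "olo_Latn", "vep_Latn", "vot_Latn", "vro_Latn",
    "sje_Latn", "sju_Latn", "sma_Latn", "sme_Latn", "smj_Latn", "smn_Latn",
    "sms_Latn", "sjd_Cyrl",
    "kpv_Cyrl", "koi_Cyrl", "udm_Cyrl",
    "mdf_Cyrl", "myv_Cyrl",
    "mhr_Cyrl", "mrj_Cyrl",
    "hun_Latn", "kca_Cyrl", "mns_Cyrl",
    "eng_Latn", "lvs_Latn", "rus_Cyrl", "nor_Latn",
}


def check_lang(code):
    """Return the code unchanged if it is in the list, else raise ValueError."""
    if code not in SUPPORTED_LANGS:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code
```

The check can be applied to both `source_lang` and `target_lang` before they are passed to the tokenizer.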

Supported dialects

  • for Izhorian: alal (Lower Luga), soik (Soikkola)
  • for Votic: I, J, Ja, K, , Ke, Ko, L, Li, Lu, M, P, Po, R, Ra, S, U, V (explanation: https://arhiiv.eki.ee/dict/vadja/lisad/v_lyhendid.pdf)
  • for Karelian Proper: Dyorzha, Ilomantsi, Keret, Kestenga, Kontokki, Korbiselga, Maslozero, Myandyselga, New written Tver, New written karelian, Oulanga, Padany, Panozero, Poduzhemye, Porosozero, Reboly, Rugozero, Suistamo, Suoyarvi, Tikhtozero, Tikhvin, Tolmachi, Tunguda, Uhta, Valdai, Vesyegonsk, Voknavolok, Vychetaibola, Yushkozero
  • for Ludian: Central Ludian (Munozero), Mikhailovskoye, New written Ludian, Northern Ludian (Kondopoga), Southern Ludian (Svjatozero), Miikul (Central Ludian)
  • for Livvi-Karelian: Impilahti, Kondushi, Kotkozero, Nekkula, New written Livvic, Rypushkalitsa, Salmi, Suoyarvi, Syamozero, Tulmozero, Vedlozero, Vidlitsa
  • for Veps: Central Eastern Veps, Central Western Veps, New written Veps, Northern Veps, Southern Veps
  • for Kildin Sami: orth1
  • for Khanty: kazym (Kazym), shuryshkary (Shuryshkar)
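
Since the dialect tag is free text inside the input string, a typo in it cannot be caught by the tokenizer. The sketch below checks a tag against the list above before building the input. It is deliberately partial: only Veps and Livvi-Karelian are included, because for the entries written with parentheses it is not stated which part is the actual tag. The names `VARIETIES` and `with_variety` are hypothetical.

```python
# Partial table copied from the list above; extend it as needed.
VARIETIES = {
    "vep_Latn": {
        "Central Eastern Veps", "Central Western Veps", "New written Veps",
        "Northern Veps", "Southern Veps",
    },
    "olo_Latn": {
        "Impilahti", "Kondushi", "Kotkozero", "Nekkula", "New written Livvic",
        "Rypushkalitsa", "Salmi", "Suoyarvi", "Syamozero", "Tulmozero",
        "Vedlozero", "Vidlitsa",
    },
}


def with_variety(text, target_lang, variety):
    """Return '<variety> text', checking the tag if the language is in VARIETIES."""
    known = VARIETIES.get(target_lang)
    # Languages missing from the partial table are not checked.
    if known is not None and variety not in known:
        raise ValueError(
            f"Unknown variety {variety!r} for {target_lang}; "
            f"expected one of {sorted(known)}"
        )
    return f"<{variety}> {text}"
```

Tags are compared exactly, including capitalisation, so `"Southern veps"` is rejected while `"Southern Veps"` is accepted.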

Model size: 1.37B params (Safetensors, tensor type F32)