Masoretic Hebrew to Yiddish (Yehoyesh) MarianMT Model

This model fine-tunes the Helsinki-NLP/opus-mt-mul-en MarianMT model for translation from the Masoretic Hebrew consonantal text of the Tanakh (Hebrew Bible) to the Yiddish translation by Yehoyesh (Yehoash Solomon Blumgarten, 1870-1927). The model is trained on a parallel corpus of the entire Tanakh, with Hebrew source and Yiddish target, both in Hebrew script.

Model Details

  • Model Name: johnlockejrr/marianmt-he2yid-tanakh
  • Base Model: Helsinki-NLP/opus-mt-mul-en
  • Language Pair: Masoretic Hebrew → Yiddish (he2yid)
  • Script: Both languages are written in Hebrew script (unpointed consonantal text for Hebrew, standard orthography for Yiddish)
  • Domain: Biblical texts (Tanakh)
  • License: MIT

Dataset

  • Hebrew Source: Masoretic Hebrew consonantal text of the Tanakh (Torah, Neviʼim, Ketuvim)
  • Yiddish Target: Yiddish translation of the Tanakh by Yehoyesh Shloyme (Yehoash Solomon) Blumgarten (1870-1927), as published in "Torah, Neviʼim, u-Khetuvim" (New York: Yehoʼash Farlag Gezelshaft, 1941)
  • Alignment: Verse-aligned, covering the entire Tanakh
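
Verse alignment of this kind can be sketched as a join on verse references. The example below is a minimal illustration, not the author's preprocessing: the `(book, chapter, verse)` key format, the `align_verses` helper, and the placeholder strings are all assumptions introduced here, and the actual corpus layout is not described in this card.

```python
from typing import Dict, List, Tuple

# A verse reference as (book, chapter, verse). This key format is hypothetical.
VerseRef = Tuple[str, int, int]


def align_verses(
    hebrew: Dict[VerseRef, str],
    yiddish: Dict[VerseRef, str],
) -> List[Dict[str, str]]:
    """Pair source and target text for every reference present on both sides."""
    pairs = []
    # Sorting by the tuple orders books alphabetically, not canonically;
    # that is enough for a sketch but not for a published corpus.
    for ref in sorted(hebrew.keys() & yiddish.keys()):
        pairs.append(
            {
                "ref": "{} {}:{}".format(*ref),
                "he": hebrew[ref],
                "yi": yiddish[ref],
            }
        )
    return pairs


# Placeholder data: the angle-bracket strings stand in for real verse text.
hebrew = {
    ("Genesis", 1, 1): "בראשית ברא אלהים את השמים ואת הארץ",
    ("Genesis", 1, 2): "<Hebrew text of Genesis 1:2>",
}
yiddish = {
    ("Genesis", 1, 1): "<Yiddish text of Genesis 1:1>",
}

pairs = align_verses(hebrew, yiddish)
print(len(pairs), pairs[0]["ref"])
```

Verses missing from either side are silently dropped here; a real pipeline would probably want to log them, since versification can differ between editions.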

Training Configuration

  • Base Model: Helsinki-NLP/opus-mt-mul-en
  • Batch Size: 4 per device, with gradient accumulation used to reach a larger effective batch size
  • Learning Rate: 1e-5
  • Epochs: 100
  • FP16: Enabled
  • Language Prefix: >>heb<< for Hebrew and >>yi<< for Yiddish; the source text is prefixed with >>heb<< at inference time
  • Tokenizer: MarianMT tokenizer with added special tokens for language direction
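
The settings above can be collected into a plain dictionary, and the effective batch size follows from simple arithmetic. This is a sketch under stated assumptions: the dictionary keys mirror parameter names from the `transformers` training arguments, but the card does not say which training loop was used, and the accumulation step count (`ACCUMULATION_STEPS`) is a hypothetical value because the card does not give one.

```python
def effective_batch_size(
    per_device: int, accumulation_steps: int, num_devices: int = 1
) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * accumulation_steps * num_devices


# Hypothetical: the model card does not state the accumulation step count.
ACCUMULATION_STEPS = 8

# Values taken from the model card, except where marked hypothetical.
training_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": ACCUMULATION_STEPS,  # hypothetical
    "learning_rate": 1e-5,
    "num_train_epochs": 100,
    "fp16": True,
}

print(
    effective_batch_size(
        training_config["per_device_train_batch_size"],
        training_config["gradient_accumulation_steps"],
    )
)
```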

Usage

Inference Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/marianmt-he2yid-tanakh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

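# Translate one Hebrew verse (Genesis 1:1) to Yiddish.
# The model was trained on consonantal text, so the verse is given without vowel points.
text = "בראשית ברא אלהים את השמים ואת הארץ"
# Prepend the >>heb<< language prefix and tokenize.
inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
# Generate with beam search, then decode the best hypothesis.
outputs = model.generate(**inputs, max_length=512, num_beams=4)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Yiddish: {translation}")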
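
Because the Hebrew source is described as consonantal, pointed input may need normalizing before it reaches the tokenizer. The following is a minimal sketch using only the standard library; `strip_marks` and `prepare_source` are hypothetical helpers, and removing every Unicode combining mark is an assumption about what "consonantal" means here. How the training data treated maqaf or other punctuation is not stated in this card, so the sketch leaves punctuation untouched.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks (vowel points, cantillation) from Hebrew text."""
    # NFD splits precomposed presentation forms into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers Hebrew points and accents; letters are "Lo".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def prepare_source(text: str, prefix: str = ">>heb<<") -> str:
    """Strip marks and prepend the language prefix used in the inference example."""
    return f"{prefix} {strip_marks(text)}"


pointed = "בְּרֵאשִׁית בָּרָא אֱלֹהִים"
print(prepare_source(pointed))
```

The result of `prepare_source` can be passed to the tokenizer in place of the f-string used in the inference example.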

Intended Use

  • Primary: Automatic translation of Masoretic Hebrew Tanakh verses to Yiddish (Yehoyesh translation style)
  • Research: Useful for digital humanities, comparative linguistics, and Jewish studies
  • Education: Can assist in language learning and textual analysis

Limitations

  • Context: The model is trained at the verse level and does not have document-level context
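  • Domain: Optimized for biblical text; may not generalize to Modern Hebrew input or to non-biblical Yiddish
  • Orthography: The Hebrew source is consonantal, so input carrying vowel points or cantillation marks falls outside the training data; Yiddish output is in standard Yiddish orthography (Hebrew script)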
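
Since the model works one verse at a time, a longer passage is best split into verses before translation. Here is a minimal sketch, assuming the passage marks verse ends with the sof pasuq character (U+05C3); `split_verses` is a hypothetical helper, and text without that mark would need another segmentation rule.

```python
from typing import List

SOF_PASUQ = "\u05c3"  # Hebrew punctuation mark that closes a verse


def split_verses(passage: str) -> List[str]:
    """Split a passage into verses at each sof pasuq, dropping empty pieces."""
    return [part.strip() for part in passage.split(SOF_PASUQ) if part.strip()]


# Genesis 1:1 followed by only the opening words of Genesis 1:2.
passage = (
    "בראשית ברא אלהים את השמים ואת הארץ"
    + SOF_PASUQ
    + " "
    + "והארץ היתה תהו ובהו"
    + SOF_PASUQ
)

for verse in split_verses(passage):
    print(verse)
```

Each returned verse can then be translated independently, which matches how the model was trained but also means no context carries over between verses.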

Citation

If you use this model, please cite:

@misc{marianmt-he2yid-tanakh,
  author = {John Locke Jr.},
  title = {Masoretic Hebrew to Yiddish (Yehoyesh) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2yid-tanakh}},
}

Acknowledgements

  • Yehoyesh Tanakh: Yiddish translation by Yehoyesh Shloyme (Yehoash Solomon) Blumgarten (1870-1927)
  • Masoretic Text: Public domain sources
  • Helsinki-NLP: For the base MarianMT model

License

MIT

Model size: 77.1M params (Safetensors, tensor type F32)