Masoretic Hebrew to Yiddish (Yehoyesh) MarianMT Model

This model fine-tunes the Helsinki-NLP/opus-mt-mul-en MarianMT model for translation from the Masoretic Hebrew consonantal text of the Tanakh (Hebrew Bible) to the Yiddish translation by Yehoyesh (Yehoash Solomon Blumgarten, 1870-1927). The model is trained on a parallel corpus of the entire Tanakh, with Hebrew source and Yiddish target, both in Hebrew script.

Model Details

Model Name: johnlockejrr/marianmt-he2yid-tanakh
Base Model: Helsinki-NLP/opus-mt-mul-en
Language Pair: Masoretic Hebrew → Yiddish (he2yid)
Script: Both languages are written in Hebrew script (unpointed consonantal text for Hebrew, standard orthography for Yiddish)
Domain: Biblical texts (Tanakh)
License: MIT

Dataset

Hebrew Source: Masoretic Hebrew consonantal text of the Tanakh (Torah, Neviʼim, Ketuvim)
Yiddish Target: Yiddish translation of the Tanakh by Yehoyesh Shloyme (Yehoash Solomon) Blumgarten (1870-1927), as published in "Torah, Neviʼim, u-Khetuvim" (New York: Yehoʼash Farlag Gezelshaft, 1941)
Alignment: Verse-aligned, covering the entire Tanakh
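Verse alignment can be sketched as a join on verse references. This is a minimal illustration, not the preprocessing actually used for this model: the `(book, chapter, verse)` key, the `align_verses` helper, and the placeholder strings are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical verse key: (book, chapter, verse)
VerseRef = Tuple[str, int, int]


def align_verses(
    hebrew: Dict[VerseRef, str],
    yiddish: Dict[VerseRef, str],
) -> List[Tuple[VerseRef, str, str]]:
    """Pair source and target text for every reference present on both sides.

    References missing from either side are skipped, so each returned
    triple is a complete (reference, Hebrew, Yiddish) training pair.
    The order follows the Hebrew dictionary.
    """
    return [(ref, hebrew[ref], yiddish[ref]) for ref in hebrew if ref in yiddish]


# Placeholder strings stand in for the real verse text.
hebrew = {("Genesis", 1, 1): "<Hebrew Gen 1:1>", ("Genesis", 1, 2): "<Hebrew Gen 1:2>"}
yiddish = {("Genesis", 1, 1): "<Yiddish Gen 1:1>"}

pairs = align_verses(hebrew, yiddish)
print(len(pairs))  # only Genesis 1:1 exists on both sides
```

Keying on the reference rather than on line position means a verse missing from one side drops a single pair instead of shifting every pair after it.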

Training Configuration

Base Model: Helsinki-NLP/opus-mt-mul-en
Batch Size: 4 per device, with gradient accumulation to reach a larger effective batch size
Learning Rate: 1e-5
Epochs: 100
FP16: Enabled
Language Prefix: Uses >>heb<< for Hebrew and >>yi<< for Yiddish
Tokenizer: MarianMT tokenizer with added special tokens for language direction
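The hyperparameters above can be collected as keyword arguments whose names match the fields of `Seq2SeqTrainingArguments` in transformers. This is a sketch, not the original training script: the card does not state the number of gradient accumulation steps, so the value `8` below is a placeholder, and `effective_batch_size` is a hypothetical helper added to show the arithmetic.

```python
# Values stated in this card; key names follow Seq2SeqTrainingArguments.
training_kwargs = {
    "per_device_train_batch_size": 4,
    "learning_rate": 1e-5,
    "num_train_epochs": 100,
    "fp16": True,
    # ASSUMPTION: the card does not give this value; 8 is a placeholder.
    "gradient_accumulation_steps": 8,
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Examples contributing to one optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # 4 * 8 * 1 = 32
```

With transformers installed, the dictionary could be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`. Note that `fp16=True` generally requires a GPU.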

Usage

Inference Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/marianmt-he2yid-tanakh"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate Hebrew to Yiddish
text = "בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ"
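# Prefix the input with the >>heb<< language token and truncate to 512 tokens
inputs = tokenizer(f">>heb<< {text}", return_tensors="pt", max_length=512, truncation=True)
# Decode with beam search (4 beams), capping the output at 512 tokens
outputs = model.generate(**inputs, max_length=512, num_beams=4)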
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Yiddish: {translation}")
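To translate several verses at once, the same calls can be wrapped in a small batch helper. The sketch below assumes the `tokenizer` and `model` objects loaded in the example above. `translate_verses` is a hypothetical name, and the defaults simply mirror the single-verse example. Each verse is still translated independently of its neighbours.

```python
from typing import List


def translate_verses(
    verses: List[str],
    tokenizer,
    model,
    prefix: str = ">>heb<<",
    num_beams: int = 4,
    max_length: int = 512,
) -> List[str]:
    """Translate a list of verses in one padded batch."""
    if not verses:
        return []
    # Same language prefix as in the single-verse example
    prefixed = [f"{prefix} {verse}" for verse in verses]
    # padding=True pads shorter verses to the longest one in the batch
    inputs = tokenizer(
        prefixed,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    outputs = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    # batch_decode returns one string per input verse, in input order
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

For a whole book, calling the helper on slices of a few dozen verses at a time keeps memory use bounded. A suitable slice size depends on the hardware and is not specified by this card.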

Intended Use

Primary: Automatic translation of Masoretic Hebrew Tanakh verses to Yiddish (Yehoyesh translation style)
Research: Useful for digital humanities, comparative linguistics, and Jewish studies
Education: Can assist in language learning and textual analysis

Limitations

Context: The model is trained at the verse level and does not have document-level context
Domain: Optimized for biblical text; may not generalize to Modern Hebrew or to contemporary Yiddish
Orthography: The Hebrew source is consonantal (no vowel points or cantillation marks); the Yiddish output is in standard Yiddish orthography, written in Hebrew script
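Because the Hebrew side of the training data is consonantal, pointed input such as the verse in the inference example may need its vowel points and cantillation marks removed before translation. Whether the model requires this is an assumption, since the card does not say how pointed input is handled. `strip_hebrew_points` is a hypothetical helper built only on the standard library.

```python
import unicodedata


def strip_hebrew_points(text: str) -> str:
    """Remove Hebrew vowel points and cantillation marks, keeping the letters.

    NFD first splits precomposed presentation forms into letter + mark.
    Every nonspacing mark (category Mn) in U+0591..U+05C7 is then dropped.
    Punctuation such as maqaf (U+05BE) is not a nonspacing mark and is kept.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch
        for ch in decomposed
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )


# The first word of Genesis 1:1 with points, written as escapes for portability
pointed = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05b4\u05c1\u05d9\u05ea"
print(strip_hebrew_points(pointed))
```

The helper leaves maqaf in place. Whether to keep it, replace it with a space, or drop it depends on how the training corpus was prepared, which this card does not document.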

Citation

If you use this model, please cite:

@misc{marianmt-he2yid-tanakh,
  author = {John Locke Jr.},
  title = {Masoretic Hebrew to Yiddish (Yehoyesh) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/marianmt-he2yid-tanakh}},
}

Acknowledgements

Yehoyesh Tanakh: Yiddish translation by Yehoyesh Shloyme (Yehoash Solomon) Blumgarten (1870-1927)
Masoretic Text: Public domain sources
Helsinki-NLP: For the base MarianMT model

License

MIT