Hebrew-Aramaic Translation Model

This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, designed specifically for translation between Hebrew (Samaritan) and Aramaic (Targum).

Overview

The pipeline consists of:

  1. Dataset Preparation (prepare_dataset.py) - Processes the aligned corpus and splits it into train/validation/test sets
  2. Model Training (train_translation_model.py) - Fine-tunes a pre-trained MarianMT model
  3. Inference (inference.py) - Provides translation functionality using the trained model

Data Format

The input data should be in TSV format with the following columns:

  • Book - Book identifier
  • Chapter - Chapter number
  • Verse - Verse number
  • Targum - Aramaic text
  • Samaritan - Hebrew text

Example (the actual file is tab-separated; | marks the column boundaries here for readability):

Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
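
Before running the pipeline, the corpus can be sanity-checked with pandas; a minimal sketch, assuming the file and column names above:

import pandas as pd

# Load the tab-separated corpus; keep_default_na=False keeps short strings from being read as NaN
df = pd.read_csv("aligned_corpus.tsv", sep="\t", encoding="utf-8", keep_default_na=False)

print(df.columns.tolist())   # expected: ['Book', 'Chapter', 'Verse', 'Targum', 'Samaritan']
print(len(df), "aligned verse pairs")
print(df[["Samaritan", "Targum"]].head())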

Installation

  1. Clone this repository:
git clone <repository-url>
cd sam-aram
  2. Install dependencies:
pip install -r requirements.txt
  3. (Optional) Install CUDA for GPU acceleration if a compatible GPU is available.

Usage

Step 1: Prepare the Dataset

First, prepare your aligned corpus for training:

python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1

This will:

  • Load the TSV file
  • Clean and filter the data
  • Split into train/validation/test sets
  • Save the processed dataset
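
For reference, the same preparation can be approximated with the Hugging Face datasets library; a minimal sketch, assuming the column names above (the exact cleaning and split logic in prepare_dataset.py may differ):

import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.read_csv("aligned_corpus.tsv", sep="\t", encoding="utf-8", keep_default_na=False)

# Drop pairs where either side is empty
df = df[(df["Samaritan"].str.strip() != "") & (df["Targum"].str.strip() != "")]

dataset = Dataset.from_pandas(df, preserve_index=False)

# Carve off 10% for test, then 10% of the remainder for validation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)

DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
}).save_to_disk("./hebrew_aramaic_dataset")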

Step 2: Train the Model

Train a translation model using the prepared dataset:

python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16

Key Parameters:

  • --model_name: Pre-trained model to fine-tune. Recommended options:
    • Helsinki-NLP/opus-mt-mul-en (multilingual)
    • Helsinki-NLP/opus-mt-he-en (Hebrew-English)
    • Helsinki-NLP/opus-mt-ar-en (Arabic-English)
  • --direction: Translation direction (he2arc or arc2he)
  • --batch_size: Training batch size (adjust based on GPU memory)
  • --learning_rate: Learning rate for fine-tuning
  • --num_epochs: Number of training epochs
  • --use_fp16: Enable mixed precision training (faster, less memory)
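
To make these parameters concrete, the core of a MarianMT fine-tuning run with the settings above looks roughly like the sketch below, assuming the dataset saved in Step 1 and the Samaritan/Targum column names; train_translation_model.py may structure this differently:

from datasets import load_from_disk
from transformers import (
    MarianMTModel, MarianTokenizer,
    Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq,
)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Hebrew (Samaritan) is the source, Aramaic (Targum) is the target
    model_inputs = tokenizer(batch["Samaritan"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["Targum"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,     # --batch_size
    learning_rate=2e-5,                # --learning_rate
    num_train_epochs=3,                # --num_epochs
    fp16=True,                         # --use_fp16
    evaluation_strategy="epoch",
    save_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./hebrew_aramaic_model")
tokenizer.save_pretrained("./hebrew_aramaic_model")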

Training with Weights & Biases (Optional):

python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
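
Logging to Weights & Biases requires installing the wandb package and running wandb login once. With the Trainer API the flag usually just maps to report_to; whether train_translation_model.py wires it this way is an assumption:

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    report_to="wandb",                  # send training/eval metrics to W&B
    run_name="opus-mt-mul-en-he2arc",   # hypothetical run name
)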

Step 3: Use the Trained Model

Interactive Translation:

python inference.py --model_path ./hebrew_aramaic_model

Translate a Single Text:

python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc

Batch Translation:

python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
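
Under the hood, all three modes come down to a standard generate call on the saved checkpoint; a minimal standalone sketch (it is an assumption that inference.py adds direction handling and batching on top of something like this):

from transformers import MarianMTModel, MarianTokenizer

model_path = "./hebrew_aramaic_model"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path)

text = "מפרי עץ הגן נאכל"  # Hebrew (Samaritan) source
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))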

Model Recommendations

Based on the information in info.txt, here are recommended pre-trained models for Hebrew-Aramaic translation:

1. Multilingual Models

  • Helsinki-NLP/opus-mt-mul-en - Good starting point for multilingual fine-tuning
  • facebook/m2m100_1.2B - Large multilingual model with Hebrew and Aramaic support

2. Hebrew-Related Models

  • Helsinki-NLP/opus-mt-he-en - Hebrew to English (can be adapted)
  • Helsinki-NLP/opus-mt-heb-ara - Hebrew to Arabic (Semitic language family)

3. Arabic-Related Models

  • Helsinki-NLP/opus-mt-ar-en - Arabic to English (Arabic and Aramaic are both Semitic languages)
  • Helsinki-NLP/opus-mt-ar-heb - Arabic to Hebrew

Training Tips

1. Data Quality

  • Ensure your parallel texts are properly aligned
  • Clean the data to remove noise and inconsistencies
  • Consider the length ratio between source and target texts
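
One simple way to act on the length-ratio point is to drop pairs whose character lengths diverge too much; a sketch, with the 3.0 threshold being an arbitrary assumption to tune per corpus:

def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    # Reject empty sides and pairs whose lengths differ by more than max_ratio
    if not src.strip() or not tgt.strip():
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio

# Example: filter a list of {"Samaritan": ..., "Targum": ...} records
pairs = [{"Samaritan": "מפרי עץ הגן נאכל", "Targum": "מן אילן גנה ניכל"}]
pairs = [p for p in pairs if keep_pair(p["Samaritan"], p["Targum"])]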

2. Model Selection

  • Start with a multilingual model if available
  • Consider the vocabulary overlap between your languages
  • Test different pre-trained models to find the best starting point

3. Hyperparameter Tuning

  • Use smaller batch sizes for limited GPU memory
  • Start with a lower learning rate (1e-5 to 5e-5)
  • Increase epochs if the model hasn't converged
  • Use early stopping to prevent overfitting
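
Early stopping is available out of the box through Trainer callbacks; a sketch of the usual wiring (reusing the model and tokenized dataset from the training sketch above; whether train_translation_model.py exposes this is an assumption):

from transformers import EarlyStoppingCallback, Seq2SeqTrainer, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Seq2SeqTrainer(
    model=model,                          # as defined in the training sketch above
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 stagnant evals
)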

4. Evaluation

  • Monitor BLEU score during training
  • Use character-level accuracy for Hebrew/Aramaic
  • Test on a held-out test set

File Structure

sam-aram/
├── aligned_corpus.tsv           # Input parallel corpus
├── prepare_dataset.py           # Dataset preparation script
├── train_translation_model.py   # Training script
├── inference.py                 # Inference script
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── info.txt                     # Project information
├── hebrew_aramaic_dataset/      # Prepared dataset (created by prepare_dataset.py)
└── hebrew_aramaic_model/        # Trained model (created by train_translation_model.py)

Troubleshooting

Common Issues:

  1. Out of Memory: Reduce batch size or use gradient accumulation
  2. Poor Translation Quality:
    • Check data quality and alignment
    • Try different pre-trained models
    • Increase training epochs
    • Adjust learning rate
  3. Tokenization Issues:
    • Ensure the tokenizer supports Hebrew/Aramaic scripts
    • Check for proper UTF-8 encoding
  4. Training Instability:
    • Reduce learning rate
    • Increase warmup steps
    • Use gradient clipping

Performance Optimization:

  • Use mixed precision training (--use_fp16)
  • Enable gradient accumulation for larger effective batch sizes
  • Use multiple GPUs if available
  • Consider model quantization for inference
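
The first two points are single arguments on Seq2SeqTrainingArguments, and quantization for inference can be as simple as dynamic int8 quantization of the Linear layers; a sketch under those assumptions:

import torch
from transformers import MarianMTModel, Seq2SeqTrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps (per GPU)
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    fp16=True,                       # mixed precision (--use_fp16)
)

# Dynamic int8 quantization of Linear layers for faster, smaller CPU inference
model = MarianMTModel.from_pretrained("./hebrew_aramaic_model")
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)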

Evaluation Metrics

The training script computes:

  • BLEU Score: Standard machine translation metric
  • Character Accuracy: Character-level accuracy for Hebrew/Aramaic text
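
A minimal sketch of both metrics, using sacrebleu for BLEU and a simple position-wise character comparison (the exact character-accuracy formula used by the training script is an assumption):

import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    # Fraction of matching characters, normalized by the longer string
    if not pred and not ref:
        return 1.0
    matches = sum(1 for a, b in zip(pred, ref) if a == b)
    return matches / max(len(pred), len(ref))

preds = ["מן אילן גנה ניכל"]
refs = ["מן אילן גנה ניכל"]

bleu = sacrebleu.corpus_bleu(preds, [refs])   # corpus-level BLEU over the predictions
print(f"BLEU: {bleu.score:.2f}")
print(f"Char accuracy: {sum(char_accuracy(p, r) for p, r in zip(preds, refs)) / len(preds):.3f}")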

Contributing

To improve the pipeline:

  1. Test with different pre-trained models
  2. Experiment with different data preprocessing techniques
  3. Add more evaluation metrics
  4. Optimize for specific use cases

License

This project is provided as-is for research and educational purposes.
