# Hebrew-Aramaic Translation Model
This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.
## Overview
The pipeline consists of:

- **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
- **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
- **Inference** (`inference.py`) - Provides translation functionality using the trained model
## Data Format
The input data should be in TSV format with the following columns:

- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text
Example:
```
Book	Chapter	Verse	Targum	Samaritan
1	3	2	מן אילן גנה ניכל	מפרי עץ הגן נאכל
```
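For reference, here is a minimal sketch of reading such a file with pandas, assuming tab delimiters and the column names above (not part of the repo's scripts):

```python
import pandas as pd

# Load the tab-separated corpus.
df = pd.read_csv("aligned_corpus.tsv", sep="\t")

# Drop rows where either side of the verse pair is missing.
df = df.dropna(subset=["Targum", "Samaritan"])
print(df[["Targum", "Samaritan"]].head())
```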
## Installation
1. Clone this repository:

```bash
git clone <repository-url>
cd sam-aram
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. (Optional) Install CUDA for GPU acceleration if available.
## Usage
### Step 1: Prepare the Dataset
First, prepare your aligned corpus for training:
```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```
This will:
- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset
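Conceptually, the split can be expressed with the Hugging Face `datasets` library; the following is a rough sketch of the approach, and the actual `prepare_dataset.py` may organize it differently:

```python
from datasets import load_dataset, DatasetDict

# Load the TSV corpus; the csv builder accepts a custom delimiter.
raw = load_dataset("csv", data_files="aligned_corpus.tsv", delimiter="\t")["train"]

# Carve out the test set first, then split the remainder into train/validation.
split = raw.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)

dataset = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
})
dataset.save_to_disk("./hebrew_aramaic_dataset")
```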
### Step 2: Train the Model
Train a translation model using the prepared dataset:
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```
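The setup behind these flags roughly corresponds to a standard `Seq2SeqTrainer` fine-tuning loop. A hedged sketch follows; the argument names mirror the CLI flags above, but the actual `train_translation_model.py` may be organized differently:

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Samaritan Hebrew is the source, Targum Aramaic the target.
    inputs = tokenizer(batch["Samaritan"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["Targum"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,  # requires a CUDA GPU
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```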
**Key Parameters:**

- `--model_name`: Pre-trained model to fine-tune. Recommended options:
  - `Helsinki-NLP/opus-mt-mul-en` (multilingual)
  - `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
  - `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)
**Training with Weights & Biases (Optional):**
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```
### Step 3: Use the Trained Model
**Interactive Translation:**
```bash
python inference.py --model_path ./hebrew_aramaic_model
```
**Translate a Single Text:**
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```
**Batch Translation:**
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```
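Since the model is a fine-tuned MarianMT checkpoint, you can also load it directly with `transformers` as an alternative to `inference.py`. A minimal sketch (the output path and generation settings are illustrative assumptions):

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("./hebrew_aramaic_model")
model = MarianMTModel.from_pretrained("./hebrew_aramaic_model")

text = "מפרי עץ הגן נאכל"
batch = tokenizer([text], return_tensors="pt", truncation=True)
generated = model.generate(**batch, max_new_tokens=128, num_beams=4)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```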
## Model Recommendations
Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:
### 1. Multilingual Models

- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
- `facebook/m2m100_1.2B` - Large multilingual model with Hebrew and Aramaic support
### 2. Hebrew-Related Models

- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family)
### 3. Arabic-Related Models

- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew
## Training Tips
### 1. Data Quality

- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts (see the sketch after this list)
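A hypothetical length-ratio filter, assuming the column names from the Data Format section; the 2.0 threshold is an illustrative assumption to tune against your own corpus:

```python
import pandas as pd

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs whose character-length ratio is implausibly skewed."""
    a, b = len(src), len(tgt)
    return min(a, b) > 0 and max(a, b) / min(a, b) <= max_ratio

df = pd.read_csv("aligned_corpus.tsv", sep="\t")
mask = df.apply(lambda r: length_ratio_ok(str(r["Samaritan"]), str(r["Targum"])), axis=1)
df = df[mask]
```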
### 2. Model Selection

- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages
- Test different pre-trained models to find the best starting point
### 3. Hyperparameter Tuning

- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting (see the sketch after this list)
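One way to add early stopping, assuming the `Seq2SeqTrainer` setup sketched in Step 2:

```python
from transformers import EarlyStoppingCallback

# Stop if the tracked eval metric fails to improve for 3 evaluations.
# Requires load_best_model_at_end=True and a metric_for_best_model in the
# training arguments, with matching evaluation/save strategies.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))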
### 4. Evaluation

- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set
## File Structure
```
sam-aram/
├── aligned_corpus.tsv           # Input parallel corpus
├── prepare_dataset.py           # Dataset preparation script
├── train_translation_model.py   # Training script
├── inference.py                 # Inference script
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── info.txt                     # Project information
├── hebrew_aramaic_dataset/      # Prepared dataset (created)
└── hebrew_aramaic_model/        # Trained model (created)
```
## Troubleshooting
**Common Issues:**

**Out of Memory:** Reduce batch size or use gradient accumulation.

**Poor Translation Quality:**
- Check data quality and alignment
- Try different pre-trained models
- Increase training epochs
- Adjust learning rate
**Tokenization Issues:**
- Ensure the tokenizer supports Hebrew/Aramaic scripts
- Check for proper UTF-8 encoding
**Training Instability:**
- Reduce learning rate
- Increase warmup steps
- Use gradient clipping (see the sketch after this list)
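All three knobs map onto standard `Seq2SeqTrainingArguments` fields; the values below are illustrative starting points, not tuned defaults:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    learning_rate=1e-5,   # lower learning rate
    warmup_steps=500,     # more warmup
    max_grad_norm=1.0,    # gradient clipping
)
```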
**Performance Optimization:**
- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes
- Use multiple GPUs if available
- Consider model quantization for inference (see the sketch after this list)
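One possible quantization route is PyTorch dynamic quantization of the linear layers for CPU inference; this is a sketch of that option, not something the repo ships:

```python
import torch
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("./hebrew_aramaic_model")

# Quantize linear-layer weights to int8 for faster, smaller CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```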
## Evaluation Metrics
The training script computes:
- BLEU Score: Standard machine translation metric
- Character Accuracy: Character-level accuracy for Hebrew/Aramaic text
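A rough sketch of how these two metrics can be computed; the training script's exact implementation may differ, and the simple positional character accuracy below is an assumption:

```python
import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    """Fraction of aligned character positions that match."""
    if not ref:
        return 0.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

preds = ["מן אילן גנה ניכל"]
refs = [["מן אילן גנה ניכל"]]  # one reference stream, one reference per prediction
print(sacrebleu.corpus_bleu(preds, refs).score)
print(char_accuracy(preds[0], refs[0][0]))
```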
## Contributing
To improve the pipeline:
- Test with different pre-trained models
- Experiment with different data preprocessing techniques
- Add more evaluation metrics
- Optimize for specific use cases
## License
This project is provided as-is for research and educational purposes.