# Hebrew-Aramaic Translation Model
This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.
## Overview
The pipeline consists of:

- **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
- **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
- **Inference** (`inference.py`) - Provides translation functionality using the trained model
## Data Format
The input data should be in TSV format with the following columns:

- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text
Example:
```
Book	Chapter	Verse	Targum	Samaritan
1	3	2	מן אילן גנה ניכל	מפרי עץ הגן נאכל
```
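For reference, here is a minimal sketch of reading such a file with pandas, assuming tab delimiters and the column names above (not part of the repo's scripts):

```python
import pandas as pd

# Load the tab-separated corpus.
df = pd.read_csv("aligned_corpus.tsv", sep="\t")

# Drop rows where either side of the verse pair is missing.
df = df.dropna(subset=["Targum", "Samaritan"])
print(df[["Targum", "Samaritan"]].head())
```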
## Installation
1. Clone this repository:

```bash
git clone <repository-url>
cd sam-aram
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. (Optional) Install CUDA for GPU acceleration if available.
## Usage
### Step 1: Prepare the Dataset
First, prepare your aligned corpus for training:
```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```
This will:
- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset
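Conceptually, the split can be expressed with the Hugging Face `datasets` library; the following is a rough sketch of the approach, and the actual `prepare_dataset.py` may organize it differently:

```python
from datasets import load_dataset, DatasetDict

# Load the TSV corpus; the csv builder accepts a custom delimiter.
raw = load_dataset("csv", data_files="aligned_corpus.tsv", delimiter="\t")["train"]

# Carve out the test set first, then split the remainder into train/validation.
split = raw.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)

dataset = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
})
dataset.save_to_disk("./hebrew_aramaic_dataset")
```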
### Step 2: Train the Model
Train a translation model using the prepared dataset:
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```
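The setup behind these flags roughly corresponds to a standard `Seq2SeqTrainer` fine-tuning loop. A hedged sketch follows; the argument names mirror the CLI flags above, but the actual `train_translation_model.py` may be organized differently:

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Samaritan Hebrew is the source, Targum Aramaic the target.
    inputs = tokenizer(batch["Samaritan"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["Targum"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    fp16=True,  # requires a CUDA GPU
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```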
**Key Parameters:**

- `--model_name`: Pre-trained model to fine-tune. Recommended options:
  - `Helsinki-NLP/opus-mt-mul-en` (multilingual)
  - `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
  - `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)
**Training with Weights & Biases (Optional):**
```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```
### Step 3: Use the Trained Model
**Interactive Translation:**
```bash
python inference.py --model_path ./hebrew_aramaic_model
```
**Translate a Single Text:**
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```
**Batch Translation:**
```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```
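Since the model is a fine-tuned MarianMT checkpoint, you can also load it directly with `transformers` as an alternative to `inference.py`. A minimal sketch (the output path and generation settings are illustrative assumptions):

```python
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("./hebrew_aramaic_model")
model = MarianMTModel.from_pretrained("./hebrew_aramaic_model")

text = "מפרי עץ הגן נאכל"
batch = tokenizer([text], return_tensors="pt", truncation=True)
generated = model.generate(**batch, max_new_tokens=128, num_beams=4)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```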
## Model Recommendations
Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:
### 1. Multilingual Models

- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
- `facebook/m2m100_1.2B` - Large multilingual model with Hebrew and Aramaic support
### 2. Hebrew-Related Models

- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family)
### 3. Arabic-Related Models

- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew
## Training Tips
### 1. Data Quality

- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts (see the sketch after this list)
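A hypothetical length-ratio filter, assuming the column names from the Data Format section; the 2.0 threshold is an illustrative assumption to tune against your own corpus:

```python
import pandas as pd

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs whose character-length ratio is implausibly skewed."""
    a, b = len(src), len(tgt)
    return min(a, b) > 0 and max(a, b) / min(a, b) <= max_ratio

df = pd.read_csv("aligned_corpus.tsv", sep="\t")
mask = df.apply(lambda r: length_ratio_ok(str(r["Samaritan"]), str(r["Targum"])), axis=1)
df = df[mask]
```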
### 2. Model Selection

- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages
- Test different pre-trained models to find the best starting point
### 3. Hyperparameter Tuning

- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting (see the sketch after this list)
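One way to add early stopping, assuming the `Seq2SeqTrainer` setup sketched in Step 2:

```python
from transformers import EarlyStoppingCallback

# Stop if the tracked eval metric fails to improve for 3 evaluations.
# Requires load_best_model_at_end=True and a metric_for_best_model in the
# training arguments, with matching evaluation/save strategies.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))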
### 4. Evaluation

- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set
## File Structure
```
sam-aram/
├── aligned_corpus.tsv           # Input parallel corpus
├── prepare_dataset.py           # Dataset preparation script
├── train_translation_model.py   # Training script
├── inference.py                 # Inference script
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── info.txt                     # Project information
├── hebrew_aramaic_dataset/      # Prepared dataset (created)
└── hebrew_aramaic_model/        # Trained model (created)
```
## Troubleshooting
**Common Issues:**

**Out of Memory:** Reduce batch size or use gradient accumulation.

**Poor Translation Quality:**
- Check data quality and alignment
- Try different pre-trained models
- Increase training epochs
- Adjust learning rate
**Tokenization Issues:**
- Ensure the tokenizer supports Hebrew/Aramaic scripts
- Check for proper UTF-8 encoding
**Training Instability:**
- Reduce learning rate
- Increase warmup steps
- Use gradient clipping (see the sketch after this list)
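All three knobs map onto standard `Seq2SeqTrainingArguments` fields; the values below are illustrative starting points, not tuned defaults:

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    learning_rate=1e-5,   # lower learning rate
    warmup_steps=500,     # more warmup
    max_grad_norm=1.0,    # gradient clipping
)
```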
**Performance Optimization:**
- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes
- Use multiple GPUs if available
- Consider model quantization for inference (see the sketch after this list)
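One possible quantization route is PyTorch dynamic quantization of the linear layers for CPU inference; this is a sketch of that option, not something the repo ships:

```python
import torch
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("./hebrew_aramaic_model")

# Quantize linear-layer weights to int8 for faster, smaller CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```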
## Evaluation Metrics
The training script computes:
- BLEU Score: Standard machine translation metric
- Character Accuracy: Character-level accuracy for Hebrew/Aramaic text
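A rough sketch of how these two metrics can be computed; the training script's exact implementation may differ, and the simple positional character accuracy below is an assumption:

```python
import sacrebleu

def char_accuracy(pred: str, ref: str) -> float:
    """Fraction of aligned character positions that match."""
    if not ref:
        return 0.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

preds = ["מן אילן גנה ניכל"]
refs = [["מן אילן גנה ניכל"]]  # one reference stream, one reference per prediction
print(sacrebleu.corpus_bleu(preds, refs).score)
print(char_accuracy(preds[0], refs[0][0]))
```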
## Contributing
To improve the pipeline:
- Test with different pre-trained models
- Experiment with different data preprocessing techniques
- Add more evaluation metrics
- Optimize for specific use cases
## License
This project is provided as-is for research and educational purposes.