---
license: mit
language:
- he
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---

# Hebrew-Aramaic Translation Model

This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.

## Overview

The pipeline consists of:

1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model

## Data Format

The input data should be in TSV format with the following columns:

- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text

Example:

```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```

## Installation

1. Clone this repository:

   ```bash
   git clone
   cd sam-aram
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. (Optional) Install CUDA for GPU acceleration if available.

## Usage

### Step 1: Prepare the Dataset

First, prepare your aligned corpus for training:

```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```

This will:

- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset

### Step 2: Train the Model

Train a translation model using the prepared dataset:

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```

#### Key Parameters:

- `--model_name`: Pre-trained model to fine-tune. Recommended options:
  - `Helsinki-NLP/opus-mt-mul-en` (multilingual)
  - `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
  - `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)

#### Training with Weights & Biases (Optional):

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```

### Step 3: Use the Trained Model

#### Interactive Translation:

```bash
python inference.py --model_path ./hebrew_aramaic_model
```

#### Translate a Single Text:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```

#### Batch Translation:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```
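#### Programmatic Use (Python):

If you prefer calling the model from Python rather than through `inference.py`, the checkpoint saved by `train_translation_model.py` can be loaded directly with the `transformers` library. The following is a minimal sketch, not the reference implementation: the output directory, the generation settings, and the assumption that no direction prefix is needed are all illustrative.

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed path: the --output_dir used during training.
MODEL_DIR = "./hebrew_aramaic_model"

tokenizer = MarianTokenizer.from_pretrained(MODEL_DIR)
model = MarianMTModel.from_pretrained(MODEL_DIR)

def translate(text: str, max_length: int = 128) -> str:
    """Translate a single sentence with the fine-tuned checkpoint."""
    # Tokenize the source text and generate a translation with beam search.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    outputs = model.generate(**inputs, num_beams=4, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Mirrors the single-text CLI example above (Hebrew -> Aramaic).
print(translate("מפרי עץ הגן נאכל"))
```

For batch translation or for switching direction (`he2arc` vs. `arc2he`), prefer `inference.py`, which wraps the same model with the project's own conventions.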
## Model Recommendations

Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:

### 1. Multilingual Models
- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
- `facebook/m2m100_1.2B` - Large multilingual model with Hebrew and Aramaic support

### 2. Hebrew-Related Models
- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family)

### 3. Arabic-Related Models
- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew

## Training Tips

### 1. Data Quality
- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts

### 2. Model Selection
- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages
- Test different pre-trained models to find the best starting point

### 3. Hyperparameter Tuning
- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting

### 4. Evaluation
- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set

## File Structure

```
sam-aram/
├── aligned_corpus.tsv            # Input parallel corpus
├── prepare_dataset.py            # Dataset preparation script
├── train_translation_model.py    # Training script
├── inference.py                  # Inference script
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── info.txt                      # Project information
├── hebrew_aramaic_dataset/       # Prepared dataset (created)
└── hebrew_aramaic_model/         # Trained model (created)
```

## Troubleshooting

### Common Issues:

1. **Out of Memory**: Reduce batch size or use gradient accumulation
2. **Poor Translation Quality**:
   - Check data quality and alignment
   - Try different pre-trained models
   - Increase training epochs
   - Adjust learning rate
3. **Tokenization Issues**:
   - Ensure the tokenizer supports Hebrew/Aramaic scripts
   - Check for proper UTF-8 encoding
4. **Training Instability**:
   - Reduce learning rate
   - Increase warmup steps
   - Use gradient clipping

### Performance Optimization:

- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes
- Use multiple GPUs if available
- Consider model quantization for inference

## Evaluation Metrics

The training script computes:

- **BLEU Score**: Standard machine translation metric
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text (one possible implementation is sketched in the appendix at the end of this README)

## Contributing

To improve the pipeline:

1. Test with different pre-trained models
2. Experiment with different data preprocessing techniques
3. Add more evaluation metrics
4. Optimize for specific use cases

## License

This project is provided as-is for research and educational purposes.

## References

- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian)
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP)
- [Transformers Library](https://huggingface.co/docs/transformers/)
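## Appendix: Character Accuracy (Sketch)

The character-accuracy metric listed under Evaluation Metrics is not a standard library metric; its exact definition lives in `train_translation_model.py`. The sketch below shows one plausible interpretation (per-character agreement after whitespace normalization) and should be treated as illustrative, not as the project's actual implementation.

```python
def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of aligned character positions that match, after collapsing whitespace.

    This is an assumed definition for illustration; the authoritative metric
    is whatever train_translation_model.py computes.
    """
    pred = " ".join(prediction.split())
    ref = " ".join(reference.split())
    if not pred and not ref:
        return 1.0
    # Compare aligned positions and penalize length mismatch by dividing
    # by the length of the longer string.
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

# Identical strings score 1.0; partial overlaps score proportionally.
print(char_accuracy("מן אילן גנה ניכל", "מן אילן גנה ניכל"))  # 1.0
print(char_accuracy("מן אילן", "מן אילן גנה ניכל"))           # < 1.0
```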