---
language:
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---
# Hebrew-Aramaic Translation Model

This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.

## Overview

The pipeline consists of:
1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model

## Data Format

The input data should be in TSV format with the following columns:
- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text

Example:
```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```

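As a quick way to inspect the corpus before running the pipeline (an illustrative snippet, not part of the project scripts; it assumes a tab-separated file as described above, so adjust `sep` if your copy uses a different delimiter):

```python
import pandas as pd

# Load the aligned corpus; the README describes it as tab-separated
df = pd.read_csv("aligned_corpus.tsv", sep="\t", encoding="utf-8")

# Drop rows where either side of the verse pair is missing
df = df.dropna(subset=["Targum", "Samaritan"])

print(df.head())
print(f"{len(df)} aligned verse pairs")
```
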
## Installation

1. Clone this repository:
```bash
git clone <repository-url>
cd sam-aram
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. (Optional) For GPU acceleration, install a CUDA-enabled PyTorch build if a compatible GPU is available.

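A quick way to confirm that PyTorch can see a GPU (an optional sanity check, independent of the project scripts):

```python
import torch

# True means CUDA is available and training can run on the GPU (e.g. with --use_fp16)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```
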
## Usage

### Step 1: Prepare the Dataset

First, prepare your aligned corpus for training:

```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```

This will:
- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset

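`prepare_dataset.py` implements these steps; the sketch below shows one plausible way to do the split with the Hugging Face `datasets` library and is only illustrative of the idea, not the script's actual code:

```python
import pandas as pd
from datasets import Dataset, DatasetDict

# Load and clean the corpus, then split roughly 80/10/10
# (matching --test_size 0.1 and --val_size 0.1 above)
df = pd.read_csv("aligned_corpus.tsv", sep="\t").dropna(subset=["Targum", "Samaritan"])
ds = Dataset.from_pandas(df, preserve_index=False)

split = ds.train_test_split(test_size=0.1, seed=42)                   # carve off the test set
train_val = split["train"].train_test_split(test_size=0.1, seed=42)   # then the validation set

DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
}).save_to_disk("./hebrew_aramaic_dataset")
```
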
### Step 2: Train the Model

Train a translation model using the prepared dataset:

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```

#### Key Parameters:

- `--model_name`: Pre-trained model to fine-tune. Recommended options:
  - `Helsinki-NLP/opus-mt-mul-en` (multilingual)
  - `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
  - `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)

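For orientation, the flags above correspond roughly to a standard `Seq2SeqTrainingArguments` configuration; this mapping is an assumption about how `train_translation_model.py` is set up, not its literal code:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate mapping of the CLI flags onto Hugging Face training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,   # --batch_size
    learning_rate=2e-5,              # --learning_rate
    num_train_epochs=3,              # --num_epochs
    fp16=True,                       # --use_fp16
    predict_with_generate=True,      # generate text during evaluation (needed for BLEU)
)
```
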
#### Training with Weights & Biases (Optional):

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```

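Using `--use_wandb` assumes you have a Weights & Biases account and have authenticated; one way to do that from Python (the `wandb` package must be installed) is:

```python
import wandb

# Prompts for an API key on first use and caches it for later runs
wandb.login()
```
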
### Step 3: Use the Trained Model

#### Interactive Translation:

```bash
python inference.py --model_path ./hebrew_aramaic_model
```

#### Translate a Single Text:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```

#### Batch Translation:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```

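The trained model can also be loaded directly with the Transformers library, independently of `inference.py`; a minimal sketch, assuming the model and tokenizer were saved with `save_pretrained` into the output directory:

```python
from transformers import MarianMTModel, MarianTokenizer

model_path = "./hebrew_aramaic_model"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path)

# Translate a single Hebrew (Samaritan) verse into Aramaic
inputs = tokenizer("מפרי עץ הגן נאכל", return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
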
## Model Recommendations

Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:

### 1. Multilingual Models
- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
- `facebook/m2m100_1.2B` - Large multilingual model with Hebrew support (Aramaic is not among its listed languages, but it can still be fine-tuned on Aramaic data)

### 2. Hebrew-Related Models
- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (same Semitic language family)

### 3. Arabic-Related Models
- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew

## Training Tips

### 1. Data Quality
- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts

### 2. Model Selection
- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages (see the sketch below)
- Test different pre-trained models to find the best starting point

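One quick, self-contained way to gauge vocabulary fit is to see how a candidate model's tokenizer segments your text; fewer, longer pieces generally indicate better coverage (the model names below are the ones already listed in this README):

```python
from transformers import AutoTokenizer

sample = "מפרי עץ הגן נאכל"
for name in ["Helsinki-NLP/opus-mt-mul-en", "Helsinki-NLP/opus-mt-he-en"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sample)
    print(f"{name}: {len(pieces)} pieces -> {pieces}")
```
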
### 3. Hyperparameter Tuning
- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting

### 4. Evaluation
- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set

## File Structure

```
sam-aram/
├── aligned_corpus.tsv           # Input parallel corpus
├── prepare_dataset.py           # Dataset preparation script
├── train_translation_model.py   # Training script
├── inference.py                 # Inference script
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── info.txt                     # Project information
├── hebrew_aramaic_dataset/      # Prepared dataset (created by Step 1)
└── hebrew_aramaic_model/        # Trained model (created by Step 2)
```

## Troubleshooting

### Common Issues:

1. **Out of Memory**: Reduce batch size or use gradient accumulation
2. **Poor Translation Quality**:
   - Check data quality and alignment
   - Try different pre-trained models
   - Increase training epochs
   - Adjust learning rate

3. **Tokenization Issues**:
   - Ensure the tokenizer supports Hebrew/Aramaic scripts
   - Check for proper UTF-8 encoding

4. **Training Instability**:
   - Reduce learning rate
   - Increase warmup steps
   - Use gradient clipping

### Performance Optimization:

- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes (see the sketch below)
- Use multiple GPUs if available
- Consider model quantization for inference

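As an illustration of the gradient accumulation point above (assuming the standard `Seq2SeqTrainingArguments` interface; the training script may expose this differently), an effective batch size of 32 can be reached while keeping only 8 examples in GPU memory at a time:

```python
from transformers import Seq2SeqTrainingArguments

# 8 examples per device step * 4 accumulation steps = effective batch size of 32
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True,  # mixed precision: faster and lighter on memory
)
```
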
## Evaluation Metrics

The training script computes:
- **BLEU Score**: Standard machine translation metric
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text

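For reference, BLEU can be computed with the `sacrebleu` package (install it if it is not already a dependency); the character-accuracy helper below is only one plausible definition and may differ from what the training script actually computes:

```python
import sacrebleu

def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of aligned character positions that match, over the longer string."""
    if not prediction and not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / max(len(prediction), len(reference))

prediction = "מן אילן גנה ניכל"
reference = "מן אילן גנה ניכל"

bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
print(f"BLEU: {bleu.score:.2f}")
print(f"Character accuracy: {char_accuracy(prediction, reference):.2%}")
```
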
## Contributing

To improve the pipeline:
1. Test with different pre-trained models
2. Experiment with different data preprocessing techniques
3. Add more evaluation metrics
4. Optimize for specific use cases

## License

This project is provided as-is for research and educational purposes.

## References

- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian)
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP)
- [Transformers Library](https://huggingface.co/docs/transformers/)