---
language:
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---
# Hebrew-Aramaic Translation Model

This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.

## Overview

The pipeline consists of:
1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model

## Data Format

The input data should be in TSV format with the following columns:
- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text

Example:
```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```

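As a quick way to inspect the corpus before running the pipeline (an illustrative snippet, not part of the project scripts; it assumes a tab-separated file as described above, so adjust `sep` if your copy uses a different delimiter):

```python
import pandas as pd

# Load the aligned corpus; the README describes it as tab-separated
df = pd.read_csv("aligned_corpus.tsv", sep="\t", encoding="utf-8")

# Drop rows where either side of the verse pair is missing
df = df.dropna(subset=["Targum", "Samaritan"])

print(df.head())
print(f"{len(df)} aligned verse pairs")
```
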
## Installation

1. Clone this repository:
```bash
git clone <repository-url>
cd sam-aram
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. (Optional) For GPU acceleration, install a CUDA-enabled PyTorch build if a compatible GPU is available.

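A quick way to confirm that PyTorch can see a GPU (an optional sanity check, independent of the project scripts):

```python
import torch

# True means CUDA is available and training can run on the GPU (e.g. with --use_fp16)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
```
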
## Usage

### Step 1: Prepare the Dataset

First, prepare your aligned corpus for training:

```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```

This will:
- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset

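`prepare_dataset.py` implements these steps; the sketch below shows one plausible way to do the split with the Hugging Face `datasets` library and is only illustrative of the idea, not the script's actual code:

```python
import pandas as pd
from datasets import Dataset, DatasetDict

# Load and clean the corpus, then split roughly 80/10/10
# (matching --test_size 0.1 and --val_size 0.1 above)
df = pd.read_csv("aligned_corpus.tsv", sep="\t").dropna(subset=["Targum", "Samaritan"])
ds = Dataset.from_pandas(df, preserve_index=False)

split = ds.train_test_split(test_size=0.1, seed=42)                   # carve off the test set
train_val = split["train"].train_test_split(test_size=0.1, seed=42)   # then the validation set

DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
}).save_to_disk("./hebrew_aramaic_dataset")
```
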
### Step 2: Train the Model

Train a translation model using the prepared dataset:

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```

#### Key Parameters:

- `--model_name`: Pre-trained model to fine-tune. Recommended options:
  - `Helsinki-NLP/opus-mt-mul-en` (multilingual)
  - `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
  - `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)

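For orientation, the flags above correspond roughly to a standard `Seq2SeqTrainingArguments` configuration; this mapping is an assumption about how `train_translation_model.py` is set up, not its literal code:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate mapping of the CLI flags onto Hugging Face training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,   # --batch_size
    learning_rate=2e-5,              # --learning_rate
    num_train_epochs=3,              # --num_epochs
    fp16=True,                       # --use_fp16
    predict_with_generate=True,      # generate text during evaluation (needed for BLEU)
)
```
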
#### Training with Weights & Biases (Optional):

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```

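Using `--use_wandb` assumes you have a Weights & Biases account and have authenticated; one way to do that from Python (the `wandb` package must be installed) is:

```python
import wandb

# Prompts for an API key on first use and caches it for later runs
wandb.login()
```
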
### Step 3: Use the Trained Model

#### Interactive Translation:

```bash
python inference.py --model_path ./hebrew_aramaic_model
```

#### Translate a Single Text:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```

#### Batch Translation:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```

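The trained model can also be loaded directly with the Transformers library, independently of `inference.py`; a minimal sketch, assuming the model and tokenizer were saved with `save_pretrained` into the output directory:

```python
from transformers import MarianMTModel, MarianTokenizer

model_path = "./hebrew_aramaic_model"
tokenizer = MarianTokenizer.from_pretrained(model_path)
model = MarianMTModel.from_pretrained(model_path)

# Translate a single Hebrew (Samaritan) verse into Aramaic
inputs = tokenizer("מפרי עץ הגן נאכל", return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
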
## Model Recommendations

Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:

### 1. Multilingual Models
- `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
- `facebook/m2m100_1.2B` - Large multilingual model with Hebrew support (Aramaic is not among its listed languages, but it can still be fine-tuned on Aramaic data)

### 2. Hebrew-Related Models
- `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
- `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (same Semitic language family)

### 3. Arabic-Related Models
- `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
- `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew

## Training Tips

### 1. Data Quality
- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts

### 2. Model Selection
- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages (see the sketch below)
- Test different pre-trained models to find the best starting point

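One quick, self-contained way to gauge vocabulary fit is to see how a candidate model's tokenizer segments your text; fewer, longer pieces generally indicate better coverage (the model names below are the ones already listed in this README):

```python
from transformers import AutoTokenizer

sample = "מפרי עץ הגן נאכל"
for name in ["Helsinki-NLP/opus-mt-mul-en", "Helsinki-NLP/opus-mt-he-en"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(sample)
    print(f"{name}: {len(pieces)} pieces -> {pieces}")
```
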
### 3. Hyperparameter Tuning
- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting

### 4. Evaluation
- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set

## File Structure

```
sam-aram/
├── aligned_corpus.tsv           # Input parallel corpus
├── prepare_dataset.py           # Dataset preparation script
├── train_translation_model.py   # Training script
├── inference.py                 # Inference script
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── info.txt                     # Project information
├── hebrew_aramaic_dataset/      # Prepared dataset (created by Step 1)
└── hebrew_aramaic_model/        # Trained model (created by Step 2)
```

## Troubleshooting

### Common Issues:

1. **Out of Memory**: Reduce batch size or use gradient accumulation
2. **Poor Translation Quality**:
   - Check data quality and alignment
   - Try different pre-trained models
   - Increase training epochs
   - Adjust learning rate

3. **Tokenization Issues**:
   - Ensure the tokenizer supports Hebrew/Aramaic scripts
   - Check for proper UTF-8 encoding

4. **Training Instability**:
   - Reduce learning rate
   - Increase warmup steps
   - Use gradient clipping

### Performance Optimization:

- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes (see the sketch below)
- Use multiple GPUs if available
- Consider model quantization for inference

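As an illustration of the gradient accumulation point above (assuming the standard `Seq2SeqTrainingArguments` interface; the training script may expose this differently), an effective batch size of 32 can be reached while keeping only 8 examples in GPU memory at a time:

```python
from transformers import Seq2SeqTrainingArguments

# 8 examples per device step * 4 accumulation steps = effective batch size of 32
args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True,  # mixed precision: faster and lighter on memory
)
```
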
## Evaluation Metrics

The training script computes:
- **BLEU Score**: Standard machine translation metric
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text

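For reference, BLEU can be computed with the `sacrebleu` package (install it if it is not already a dependency); the character-accuracy helper below is only one plausible definition and may differ from what the training script actually computes:

```python
import sacrebleu

def char_accuracy(prediction: str, reference: str) -> float:
    """Fraction of aligned character positions that match, over the longer string."""
    if not prediction and not reference:
        return 1.0
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / max(len(prediction), len(reference))

prediction = "מן אילן גנה ניכל"
reference = "מן אילן גנה ניכל"

bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
print(f"BLEU: {bleu.score:.2f}")
print(f"Character accuracy: {char_accuracy(prediction, reference):.2%}")
```
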
## Contributing

To improve the pipeline:
1. Test with different pre-trained models
2. Experiment with different data preprocessing techniques
3. Add more evaluation metrics
4. Optimize for specific use cases

## License

This project is provided as-is for research and educational purposes.

## References

- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian)
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP)
- [Transformers Library](https://huggingface.co/docs/transformers/)