johnlockejrr committed
Commit 48bff22 · verified · 1 Parent(s): d3d218f

Update README.md

Files changed (1): README.md (+231, -1)

README.md CHANGED

language:
- arc
base_model:
- Helsinki-NLP/opus-mt-mul-en
---
# Hebrew-Aramaic Translation Model

This project provides a complete pipeline for fine-tuning MarianMT models on Hebrew-Aramaic parallel texts, specifically designed for translating between Hebrew (Samaritan) and Aramaic (Targum) texts.

## Overview

The pipeline consists of:
1. **Dataset Preparation** (`prepare_dataset.py`) - Processes the aligned corpus and splits it into train/validation/test sets
2. **Model Training** (`train_translation_model.py`) - Fine-tunes a pre-trained MarianMT model
3. **Inference** (`inference.py`) - Provides translation functionality using the trained model

## Data Format

The input data should be in TSV format with the following columns:
- `Book` - Book identifier
- `Chapter` - Chapter number
- `Verse` - Verse number
- `Targum` - Aramaic text
- `Samaritan` - Hebrew text

Example:
```
Book|Chapter|Verse|Targum|Samaritan
1|3|2|מן אילן גנה ניכל|מפרי עץ הגן נאכל
```

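For illustration only (this helper is not part of the project's scripts), the corpus can be loaded into parallel sentence pairs along these lines. The file name and column names follow the table above; check your file's actual delimiter, since the example above shows columns separated by `|` while a true TSV uses tabs:

```python
# Hypothetical loading sketch, not part of the shipped scripts.
import pandas as pd

# Adjust sep to match your corpus: "\t" for a real TSV, "|" if pipe-separated.
df = pd.read_csv("aligned_corpus.tsv", sep="\t")

# Keep only rows with text on both sides of the alignment
df = df.dropna(subset=["Targum", "Samaritan"])

# (Hebrew source, Aramaic target) pairs for the he2arc direction
pairs = list(zip(df["Samaritan"], df["Targum"]))
print(f"Loaded {len(pairs)} aligned verses")
```
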
## Installation

1. Clone this repository:
```bash
git clone <repository-url>
cd sam-aram
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. (Optional) Install CUDA for GPU acceleration if available.

## Usage

### Step 1: Prepare the Dataset

First, prepare your aligned corpus for training:

```bash
python prepare_dataset.py \
    --input_file aligned_corpus.tsv \
    --output_dir ./hebrew_aramaic_dataset \
    --test_size 0.1 \
    --val_size 0.1
```

This will:
- Load the TSV file
- Clean and filter the data
- Split into train/validation/test sets
- Save the processed dataset

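For reference, the split can be approximated with the `datasets` library. This is a sketch that assumes the cleaned corpus is in the pandas DataFrame `df` from the earlier example; `prepare_dataset.py` itself may clean and split differently:

```python
# Hypothetical split sketch; prepare_dataset.py may differ in details.
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df[["Samaritan", "Targum"]], preserve_index=False)

# First hold out a test set, then carve a validation set out of the remainder
split = ds.train_test_split(test_size=0.1, seed=42)
train_val = split["train"].train_test_split(test_size=0.1, seed=42)

dataset = DatasetDict({
    "train": train_val["train"],
    "validation": train_val["test"],
    "test": split["test"],
})
dataset.save_to_disk("./hebrew_aramaic_dataset")
```

With this scheme the validation set is 10% of the remaining 90% (about 9% of the corpus); whether `--val_size` is measured against the full corpus or the remainder depends on the script.
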
### Step 2: Train the Model

Train a translation model using the prepared dataset:

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --direction he2arc \
    --batch_size 8 \
    --learning_rate 2e-5 \
    --num_epochs 3 \
    --use_fp16
```

#### Key Parameters:

- `--model_name`: Pre-trained model to fine-tune. Recommended options:
  - `Helsinki-NLP/opus-mt-mul-en` (multilingual)
  - `Helsinki-NLP/opus-mt-he-en` (Hebrew-English)
  - `Helsinki-NLP/opus-mt-ar-en` (Arabic-English)
- `--direction`: Translation direction (`he2arc` or `arc2he`)
- `--batch_size`: Training batch size (adjust based on GPU memory)
- `--learning_rate`: Learning rate for fine-tuning
- `--num_epochs`: Number of training epochs
- `--use_fp16`: Enable mixed precision training (faster, less memory)

#### Training with Weights & Biases (Optional):

```bash
python train_translation_model.py \
    --dataset_path ./hebrew_aramaic_dataset \
    --output_dir ./hebrew_aramaic_model \
    --model_name Helsinki-NLP/opus-mt-mul-en \
    --use_wandb
```

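Internally, fine-tuning a MarianMT checkpoint follows the standard `transformers` sequence-to-sequence recipe. The condensed sketch below shows the general shape for the `he2arc` direction, assuming the prepared dataset keeps the `Samaritan` and `Targum` columns; it is an approximation, not the exact contents of `train_translation_model.py` (which also handles argument parsing, the reverse direction, and metric computation):

```python
# Condensed fine-tuning sketch; train_translation_model.py is more complete.
from datasets import load_from_disk
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

dataset = load_from_disk("./hebrew_aramaic_dataset")

def preprocess(batch):
    # he2arc: Hebrew (Samaritan) source, Aramaic (Targum) target
    features = tokenizer(batch["Samaritan"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["Targum"], max_length=128, truncation=True)
    features["labels"] = labels["input_ids"]
    return features

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=8,   # --batch_size
    learning_rate=2e-5,              # --learning_rate
    num_train_epochs=3,              # --num_epochs
    fp16=True,                       # --use_fp16 (requires a GPU)
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("./hebrew_aramaic_model")
tokenizer.save_pretrained("./hebrew_aramaic_model")
```
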
### Step 3: Use the Trained Model

#### Interactive Translation:

```bash
python inference.py --model_path ./hebrew_aramaic_model
```

#### Translate a Single Text:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --text "מפרי עץ הגן נאכל" \
    --direction he2arc
```

#### Batch Translation:

```bash
python inference.py \
    --model_path ./hebrew_aramaic_model \
    --input_file input_texts.txt \
    --output_file translations.txt \
    --direction he2arc
```

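If you prefer to call the model directly from Python rather than through `inference.py`, a minimal sketch (assuming the fine-tuned checkpoint and tokenizer were saved to `./hebrew_aramaic_model`) looks like this:

```python
# Minimal direct-use sketch; inference.py adds direction handling and file I/O.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "./hebrew_aramaic_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = "מפרי עץ הגן נאכל"  # Hebrew (Samaritan) source
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
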
135
+ ## Model Recommendations
136
+
137
+ Based on the information in `info.txt`, here are recommended pre-trained models for Hebrew-Aramaic translation:
138
+
139
+ ### 1. Multilingual Models
140
+ - `Helsinki-NLP/opus-mt-mul-en` - Good starting point for multilingual fine-tuning
141
+ - `facebook/m2m100_1.2B` - Large multilingual model with Hebrew and Aramaic support
142
+
143
+ ### 2. Hebrew-Related Models
144
+ - `Helsinki-NLP/opus-mt-he-en` - Hebrew to English (can be adapted)
145
+ - `Helsinki-NLP/opus-mt-heb-ara` - Hebrew to Arabic (Semitic language family)
146
+
147
+ ### 3. Arabic-Related Models
148
+ - `Helsinki-NLP/opus-mt-ar-en` - Arabic to English (Aramaic is related to Arabic)
149
+ - `Helsinki-NLP/opus-mt-ar-heb` - Arabic to Hebrew
150
+
## Training Tips

### 1. Data Quality
- Ensure your parallel texts are properly aligned
- Clean the data to remove noise and inconsistencies
- Consider the length ratio between source and target texts

### 2. Model Selection
- Start with a multilingual model if available
- Consider the vocabulary overlap between your languages
- Test different pre-trained models to find the best starting point

### 3. Hyperparameter Tuning
- Use smaller batch sizes for limited GPU memory
- Start with a lower learning rate (1e-5 to 5e-5)
- Increase epochs if the model hasn't converged
- Use early stopping to prevent overfitting

### 4. Evaluation
- Monitor BLEU score during training
- Use character-level accuracy for Hebrew/Aramaic
- Test on a held-out test set

## File Structure

```
sam-aram/
├── aligned_corpus.tsv            # Input parallel corpus
├── prepare_dataset.py            # Dataset preparation script
├── train_translation_model.py    # Training script
├── inference.py                  # Inference script
├── requirements.txt              # Python dependencies
├── README.md                     # This file
├── info.txt                      # Project information
├── hebrew_aramaic_dataset/       # Prepared dataset (created by Step 1)
└── hebrew_aramaic_model/         # Trained model (created by Step 2)
```

## Troubleshooting

### Common Issues:

1. **Out of Memory**: Reduce batch size or use gradient accumulation
2. **Poor Translation Quality**:
   - Check data quality and alignment
   - Try different pre-trained models
   - Increase training epochs
   - Adjust learning rate

3. **Tokenization Issues**:
   - Ensure the tokenizer supports Hebrew/Aramaic scripts
   - Check for proper UTF-8 encoding

4. **Training Instability**:
   - Reduce learning rate
   - Increase warmup steps
   - Use gradient clipping

### Performance Optimization:

- Use mixed precision training (`--use_fp16`)
- Enable gradient accumulation for larger effective batch sizes (see the sketch after this list)
- Use multiple GPUs if available
- Consider model quantization for inference

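As a rough illustration of the memory-related settings (the exact defaults of `train_translation_model.py` may differ), a smaller per-device batch combined with gradient accumulation keeps the effective batch size constant while reducing peak memory:

```python
# Sketch: trading per-step batch size for gradient accumulation steps.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./hebrew_aramaic_model",
    per_device_train_batch_size=4,   # smaller per-step batch fits in less memory
    gradient_accumulation_steps=4,   # 4 x 4 = effective batch size of 16
    fp16=True,                       # mixed precision, --use_fp16 (requires a GPU)
)
```
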
## Evaluation Metrics

The training script computes:
- **BLEU Score**: Standard machine translation metric
- **Character Accuracy**: Character-level accuracy for Hebrew/Aramaic text

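The snippet below is a hedged sketch of how such metrics can be computed with `sacrebleu`; the exact BLEU settings and the precise definition of character accuracy used by the training script may differ, and the `character_accuracy` helper here is purely illustrative:

```python
# Illustrative metric sketch; the training script's implementation may differ.
import sacrebleu

def character_accuracy(pred: str, ref: str) -> float:
    """Fraction of matching characters, measured over the longer string (illustrative)."""
    if not pred and not ref:
        return 1.0
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / max(len(pred), len(ref))

predictions = ["מן אילן גנה ניכל"]
references = [["מן אילן גנה ניכל"]]  # one reference stream, parallel to predictions

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")
print(f"Character accuracy: {character_accuracy(predictions[0], references[0][0]):.2%}")
```
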
## Contributing

To improve the pipeline:
1. Test with different pre-trained models
2. Experiment with different data preprocessing techniques
3. Add more evaluation metrics
4. Optimize for specific use cases

## License

This project is provided as-is for research and educational purposes.

## References

- [MarianMT Documentation](https://huggingface.co/docs/transformers/en/model_doc/marian)
- [Helsinki-NLP Models](https://huggingface.co/Helsinki-NLP)
- [Transformers Library](https://huggingface.co/docs/transformers/)