Kurdish-English Machine Translation with Transformers

This repository provides code for fine-tuning a Kurdish-English machine translation model with Hugging Face's transformers library and MarianMT. The model is trained on a custom parallel corpus through a pipeline covering data preprocessing, bidirectional training, evaluation, and inference. The model was developed at the AI Center of Kurdistan University.

Table of Contents

  • Introduction
  • Requirements
  • Setup
  • Pipeline Overview
  • Evaluation and Metrics
  • Inference
  • Results
  • Model Details

Introduction

This project fine-tunes a MarianMT model for Kurdish-English translation on a custom parallel corpus. Training is configured for bidirectional translation, so a single model can translate in both language directions.

Requirements

  • Python 3.8+
  • Hugging Face Transformers
  • Datasets library
  • SentencePiece
  • PyTorch 1.9+
  • CUDA (for GPU support)

Setup

  1. Clone the repository and install dependencies.
  2. Ensure GPU availability.
  3. Prepare your Kurdish-English corpus in CSV format.
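A quick environment check before training can save debugging time later. A minimal sketch (the requirements above map to the PyPI packages transformers, datasets, sentencepiece, and torch):

```python
# Quick sanity check of the training environment.
import torch
import transformers
import datasets

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # should print True for GPU training
```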

Pipeline Overview

Data Preparation

  1. Corpus: A Kurdish-English parallel corpus in CSV format with columns Source (Kurdish) and Target (English).
  2. Path Definition: Specify the corpus path in the configuration.
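A minimal loading sketch with pandas; the path data/ku_en_corpus.csv is a hypothetical placeholder for your own corpus location:

```python
import pandas as pd

CORPUS_PATH = "data/ku_en_corpus.csv"  # hypothetical path; point at your own CSV

df = pd.read_csv(CORPUS_PATH)
# Drop empty and duplicated pairs; Source = Kurdish, Target = English.
df = df.dropna(subset=["Source", "Target"]).drop_duplicates()
print(len(df), "sentence pairs")
```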

Training SentencePiece Tokenizer

  • Vocabulary Size: 32,000
  • Source Data: The tokenizer is trained on both the Kurdish and English sides of the corpus to produce a shared subword vocabulary.
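A training sketch with the sentencepiece library; the input file name and model prefix below are illustrative assumptions:

```python
import sentencepiece as spm

# Train a shared subword model on the concatenated Kurdish and English text.
# "all_text.txt" (one sentence per line, both languages) is a hypothetical file.
spm.SentencePieceTrainer.train(
    input="all_text.txt",
    model_prefix="ku_en_sp",   # writes ku_en_sp.model and ku_en_sp.vocab
    vocab_size=32000,          # matches the configuration above
    character_coverage=1.0,    # keep full character coverage for the Kurdish script
)
```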

Model and Tokenizer Setup

  • Model: Helsinki-NLP/opus-mt-en-mul pre-trained MarianMT model.
  • Tokenizer: MarianMT tokenizer aligned with the model, with source and target languages set dynamically.
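Loading the checkpoint and its tokenizer is a two-liner. Note that for multilingual *-mul Marian checkpoints the target language is usually selected by prepending a >>lang<< token to the source text; check the checkpoint's model card for the exact Kurdish code:

```python
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-mul"

tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)
```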

Tokenization and Dataset Preparation

  • Train-Validation Split: 90% train, 10% validation.
  • Maximum Sequence Length: 128 tokens for both source and target sequences.
  • Bidirectional Tokenization: Prepare tokenized sequences for both Kurdish-English and English-Kurdish translation.
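A preprocessing sketch with the datasets library, continuing from the loading sketch above. Duplicating each pair in both directions is one illustrative way to realize bidirectional training; column names follow the CSV schema described earlier:

```python
from datasets import Dataset

MAX_LEN = 128

def build_examples(df):
    # Emit each sentence pair in both directions so one model learns
    # Kurdish -> English and English -> Kurdish.
    rows = []
    for src, tgt in zip(df["Source"], df["Target"]):
        rows.append({"src": src, "tgt": tgt})  # Kurdish -> English
        rows.append({"src": tgt, "tgt": src})  # English -> Kurdish
    return Dataset.from_list(rows)

def tokenize_fn(batch):
    # Truncate to 128 tokens; dynamic padding is left to the data collator.
    model_inputs = tokenizer(batch["src"], max_length=MAX_LEN, truncation=True)
    labels = tokenizer(text_target=batch["tgt"], max_length=MAX_LEN, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = build_examples(df).train_test_split(test_size=0.1, seed=42)  # 90/10 split
tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=["src", "tgt"])
train_ds, eval_ds = tokenized["train"], tokenized["test"]
```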

Training Configuration

  • Learning Rate: 2e-5
  • Batch Size: 4 (per device, for both training and evaluation)
  • Weight Decay: 0.01
  • Evaluation Strategy: Per epoch
  • Epochs: 3
  • Logging: Logs saved every 100 steps, with TensorBoard logging enabled
  • Output Directory: ./results
  • Device: GPU 1 explicitly set
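A minimal sketch of this configuration with Seq2SeqTrainingArguments (argument names follow transformers 4.x; train_ds and eval_ds come from the tokenization step above):

```python
import os
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

# Pin training to GPU 1, as configured above; must run before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_steps=100,
    report_to="tensorboard",
    predict_with_generate=True,  # an assumption: yields token ids for the generation metrics below
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),  # pads labels with -100
)
trainer.train()
```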

Evaluation and Metrics

The following metrics are computed on the validation dataset:

  • BLEU: Measures translation quality via modified n-gram precision with a brevity penalty.
  • METEOR: Considers synonymy and stem matches.
  • BERTScore: Evaluates semantic similarity with BERT embeddings.
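A sketch of computing these metrics with the evaluate library; pass the function as compute_metrics to the trainer above. The lang="en" argument for BERTScore is an assumption that fits the English-output direction:

```python
import evaluate
import numpy as np

bleu = evaluate.load("sacrebleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Replace label padding (-100) before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    refs = [[label] for label in decoded_labels]  # sacrebleu expects a list of references
    return {
        "bleu": bleu.compute(predictions=decoded_preds, references=refs)["score"],
        "meteor": meteor.compute(predictions=decoded_preds, references=decoded_labels)["meteor"],
        "bertscore_f1": float(np.mean(
            bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="en")["f1"]
        )),
    }
```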

Inference

Inference includes bidirectional translation capabilities:

  • Source to Target: Kurdish to English translation.
  • Target to Source: English to Kurdish translation, as sketched below.
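A minimal generate-based helper; it works in either direction for a model fine-tuned as above, and the example sentence is hypothetical:

```python
def translate(texts, model, tokenizer, max_length=128):
    # Tokenize, generate with beam search, and decode back to text.
    batch = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length
    ).to(model.device)
    generated = model.generate(**batch, max_length=max_length, num_beams=4)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Hypothetical usage:
print(translate(["Hello, how are you?"], model, tokenizer))
```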

Results

The fine-tuned model and tokenizer are saved to ./fine-tuned-marianmt, along with the evaluation results for BLEU, METEOR, and BERTScore.


Model Details

  • Model size: 77M parameters
  • Tensor type: F32 (Safetensors)