---
license: mit
---
# Uzbek-English Neural Machine Translation (Seq2Seq with Attention)
This repository contains an implementation of a **sequence-to-sequence (Seq2Seq)** model with **attention**, designed for **translating sentences between Uzbek and English** (in both directions).
The architecture is inspired by the 2015 paper:
📄 [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) by Luong et al.
---
## 🚀 Features
- Encoder-decoder model with **LSTM** layers
- **Luong-style attention mechanism** (global attention)
- Vocabulary size: **50,000**
- Embedding dimension: **1000**
- Hidden state dimension: **1000**
- Trained on **~50,000 Uzbek-English parallel sentence pairs**
- Word-level tokenization
- Built with **PyTorch**
- Achieved **BLEU score ~22** for both Uzbek→English and English→Uzbek translation tasks
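
The word-level tokenization and capped vocabulary listed above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing: the lowercasing and whitespace splitting, the helper names `tokenize`, `build_vocab`, and `encode`, and the `<unk>` and `<sos>` tokens are all assumptions (only `<eos>` is mentioned in this README). The two Uzbek sentences are toy inputs.

```python
from collections import Counter

# Hypothetical special tokens; this README only confirms that `<eos>` exists.
SPECIALS = ["<unk>", "<sos>", "<eos>"]


def tokenize(sentence):
    """Word-level tokenization (assumed): lowercase, then split on whitespace."""
    return sentence.lower().split()


def build_vocab(sentences, max_size=50_000):
    """Map the most frequent words to integer ids, capped at max_size entries."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    keep = max_size - len(SPECIALS)
    words = [w for w, _ in counts.most_common(keep)]
    return {tok: i for i, tok in enumerate(SPECIALS + words)}


def encode(sentence, vocab):
    """Convert a sentence to ids, mapping unseen words to <unk> and appending <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(sentence)]
    return ids + [vocab["<eos>"]]


# Toy corpus for illustration only.
corpus = ["Men kitob o'qiyman", "Men maktabga boraman"]
vocab = build_vocab(corpus, max_size=50_000)
print(encode("Men kitob sotaman", vocab))
```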
---
## 📚 Dataset
We use the bilingual dataset:
🔗 [SlimOrca-Dedup-English-Uzbek](https://huggingface.co/datasets/MLDataScientist/SlimOrca-Dedup-English-Uzbek)
Each entry in the dataset is an English-Uzbek sentence pair.
---
## 🧠 Model Architecture
- **Encoder:** LSTM that encodes the source sentence
- **Decoder:** LSTM with attention and input-feeding
- **Attention Layer:** Dot-product attention (Luong-style global attention)
- **Output Layer:** Concatenated decoder + context → Linear → Softmax
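
The attention and output layers listed above might look roughly like the PyTorch sketch below. It follows the global dot-product formulation from the Luong et al. paper: scores are dot products between the decoder state and each encoder state, and the context vector is concatenated with the decoder state before the output projection. The class name `LuongDotAttention`, the intermediate `tanh` projection (taken from the paper, not confirmed by this README), and the optional `src_mask` argument are assumptions. The returned `attn_hidden` is the vector that input-feeding would pass to the next decoder step. Calling the module with a decoder state of shape `(batch, hidden)` and encoder outputs of shape `(batch, src_len, hidden)` returns log-probabilities over the vocabulary, the attentional hidden state, and the attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LuongDotAttention(nn.Module):
    """Global dot-product attention followed by an attentional output layer."""

    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        # Projects [context; decoder state] back to hidden_dim (W_c in the paper).
        self.combine = nn.Linear(2 * hidden_dim, hidden_dim)
        # Projects the attentional hidden state to vocabulary scores (W_s in the paper).
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, dec_hidden, enc_outputs, src_mask=None):
        # dec_hidden: (batch, hidden); enc_outputs: (batch, src_len, hidden)
        # src_mask (optional): bool tensor (batch, src_len), True for real tokens.
        scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)
        if src_mask is not None:
            # Padded source positions receive zero attention weight.
            scores = scores.masked_fill(~src_mask, float("-inf"))
        weights = F.softmax(scores, dim=1)                                  # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)   # (batch, hidden)
        attn_hidden = torch.tanh(self.combine(torch.cat([context, dec_hidden], dim=1)))
        log_probs = F.log_softmax(self.out(attn_hidden), dim=1)             # (batch, vocab)
        return log_probs, attn_hidden, weights
```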
---
## 🏋️ Training
- Optimizer: `Adam`
- Loss function: `CrossEntropyLoss` with masking for padded tokens
- Batch size: configurable
- Training data size: ~50,000 samples
- The `<eos>` token is also used for padding
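
Because `<eos>` doubles as the padding token, masking by token id (for example with `ignore_index`) would also drop the real end-of-sentence token from the loss. One possible way around this, sketched below, is to mask by sentence length instead. This README does not say how the masking is actually implemented, so the function `masked_cross_entropy` and its `lengths` argument are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def masked_cross_entropy(logits, targets, lengths):
    """Mean cross-entropy over real tokens only.

    logits:  (batch, tgt_len, vocab) unnormalized scores
    targets: (batch, tgt_len) token ids, padded past each sentence's length
    lengths: (batch,) number of real tokens per sentence, including the final <eos>
    """
    batch, tgt_len, vocab = logits.shape
    # Per-token loss, reshaped back to (batch, tgt_len).
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1), reduction="none"
    ).view(batch, tgt_len)
    # mask[b, t] is True while position t lies inside sentence b.
    positions = torch.arange(tgt_len, device=targets.device).unsqueeze(0)
    mask = positions < lengths.unsqueeze(1)
    # Average over real tokens; padded positions contribute nothing.
    return (per_token * mask).sum() / mask.sum()
```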
---
## 📊 Evaluation
- Evaluation metric: **BLEU score**
- Average BLEU on validation set (~64 samples per direction):
  - **Uzbek → English:** ~22
  - **English → Uzbek:** ~22
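
For reference, a BLEU score on word-level tokens can be computed along the lines of the sketch below. It is a plain corpus-level BLEU with one reference per hypothesis, clipped n-gram precision up to 4-grams, a brevity penalty, and no smoothing. This README does not say which BLEU implementation, tokenization, or averaging scheme produced the reported scores, so this function is an assumption and its output is not guaranteed to match them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on the mat today".split()
print(corpus_bleu([hypothesis], [reference]))
```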
---
## 🌌 GUI
