TransformerNMT: English-to-Hindi Experimental Transformer Model

This repository contains a Transformer encoder-decoder model implemented from scratch in PyTorch for English-to-Hindi neural machine translation. The model and all training, preprocessing, and inference scripts are custom. They do not depend on Hugging Face Transformers, and they follow the original "Attention Is All You Need" architecture.

Model Details

Architecture: Transformer Encoder-Decoder (Vaswani et al., 2017)
Framework: PyTorch
Languages: English (source) → Hindi (target)
Vocabulary: 32,000 BPE tokens per language (trained with the tokenizers library)
Training Data: Parallel English-Hindi corpus (see repo for data details)
Intended Use: Research, experimentation, and educational purposes
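
The core operation of the Vaswani et al. architecture is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The following is a minimal PyTorch sketch of that formula with an optional causal mask, as used in decoder self-attention. It is an illustration only and is not taken from this repository's model code; the function name, tensor shapes, and mask convention (True = position may be attended to) are assumptions made for the example.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k). mask: bool tensor, True = keep."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked-out positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# Causal mask: position i may attend only to positions <= i.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

q = k = v = torch.randn(2, 4, seq_len, 16)
out, attn = scaled_dot_product_attention(q, k, v, causal_mask)
print(out.shape)  # same shape as v
```

Each row of attn sums to 1, and entries above the diagonal are zero because of the causal mask.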

Training

Trained from scratch using the scripts in this repository.
Supports distributed and mixed-precision training.
Checkpoints and tokenizer files are provided in the models/ and Data/bi_tokenizers_32k/ directories, respectively.
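
For readers unfamiliar with mixed-precision training, here is a minimal sketch of one training step using torch.autocast with bfloat16. It is not this repository's training script, which may be organized differently (for example, float16 with gradient scaling via torch.cuda.amp.GradScaler). The toy model, optimizer settings, and the train_step helper are all hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for the Transformer: 16 input features -> 8 "vocabulary" logits.
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)).to(device)
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)

def train_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # The forward pass runs matmul-heavy ops in bfloat16;
    # the parameters themselves stay in float32.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits = toy_model(inputs)
    # Compute the loss in float32 for numerical stability.
    loss = F.cross_entropy(logits.float(), targets)
    # bfloat16 keeps float32's exponent range, so no loss scaling is used here.
    loss.backward()
    optimizer.step()
    return loss.item()

inputs = torch.randn(4, 16, device=device)
targets = torch.randint(0, 8, (4,), device=device)
loss_value = train_step(inputs, targets)
print(f"loss: {loss_value:.4f}")
```

On GPUs without bfloat16 support, float16 autocast combined with a gradient scaler is the usual alternative.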

Intended Uses & Limitations

Intended for: Experimentation, research, and demonstration of custom Transformer implementations.
Not intended for: Production use or high-stakes applications.
Limitations: Translation quality is not expected to match state-of-the-art systems. Verify outputs before relying on them for real-world tasks.

Example Inference

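The script below translates English text to Hindi on CPU using the provided checkpoint and tokenizer. The tokenizer, model, and translator imports are this repository's own modules, so run it from a location where they are importable: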

import torch
from tokenizer import BilingualTokenizer as Tokenizer
from model import Transformer, TransformerConfig
from translator import TranslationInference

# 1. Load config and checkpoint
config = TransformerConfig(shared_embeddings=True)
checkpoint = torch.load('models/TNMT_v1_Beta_single.pt', map_location='cpu')

# 2. Build model and load weights
model = Transformer(config)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to('cpu')
model.eval()  # switch off dropout for inference

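# 3. Load tokenizer (files ship in Data/bi_tokenizers_32k/; adjust the
#    path below if load_tokenizers does not resolve it for you)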
tokenizer = Tokenizer(vocab_size=32000)
tokenizer_loaded = tokenizer.load_tokenizers('bi_tokenizers_32k')

# 4. Create inference helper
translator = TranslationInference(
    model=model,
    tokenizer=tokenizer_loaded,
    device='cpu'
)

# 5. Translate
source_text = "This is a test sentence."
translated_text = translator.translate_text(source_text)
print("Translated text:", translated_text)
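
The internals of translate_text are not shown here, and this README does not say whether it uses greedy decoding, beam search, or something else. As a rough illustration of how an encoder-decoder model produces output one token at a time, the following is a sketch of greedy decoding in plain Python. The next_token_scores callable stands in for a decoder forward pass, and the BOS_ID and EOS_ID values are hypothetical; none of these names come from this repository.

```python
from typing import Callable, List, Sequence

BOS_ID, EOS_ID = 1, 2  # hypothetical special-token ids

def greedy_decode(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    max_len: int = 50,
) -> List[int]:
    """Repeatedly append the highest-scoring next token until EOS or max_len."""
    tokens = [BOS_ID]
    for _ in range(max_len):
        scores = next_token_scores(tokens)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == EOS_ID:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS marker

# Toy scorer standing in for the decoder: emits 5, 6, 7, then EOS.
def toy_scores(prefix: Sequence[int]) -> List[float]:
    script = [5, 6, 7, EOS_ID]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(10)]

decoded = greedy_decode(toy_scores)
print(decoded)
```

In a real system the returned token ids would be passed to the target-language tokenizer to recover Hindi text.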

Citation

If you use this code or model, please cite:

Vaswani et al., "Attention Is All You Need", NeurIPS 2017.

Author: QuarkML
License: MIT