Nepali GPT Model
Overview
The `NepaliGPTModel` is a custom GPT-style pretrained transformer model designed for natural language processing tasks, with a focus on the Nepali language. It is built using PyTorch and made compatible with the Hugging Face `transformers` library for easy integration and deployment. This model is intended for text generation and other language modeling tasks, particularly for Nepali text.
This repository contains the model weights, tokenizer, and necessary files to load and use the model on Hugging Face.
Model Details
Architecture
- Model Type: `nep_gptv1`
- Vocabulary Size: 51,728
- Embedding Dimension: 768
- Context Length: 1,024
- Number of Attention Heads: 12
- Number of Layers: 9
- Dropout Rate: 0.1
- QKV Bias: `false` (attention layers do not use bias)
- Torch Data Type: `float32`
- Transformers Version: 4.51.0.dev0
The model follows a decoder-only transformer architecture, similar to GPT, with 9 transformer blocks, each containing multi-head attention and feedforward layers. It uses causal masking to ensure autoregressive generation.
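As a non-authoritative illustration of this structure, the sketch below implements one such block in plain PyTorch, using the dimensions listed above (768-dimensional embeddings, 12 attention heads, dropout 0.1, no QKV bias). The class and parameter names here are assumptions for the example; they are not the actual classes defined in `model_nepaligpt.py`.

```python
# Minimal sketch of a GPT-style decoder block with causal masking.
# Illustrative only: names do not match the repository's model_nepaligpt.py.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, emb_dim=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        # bias=False mirrors the "QKV Bias: false" setting above
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, dropout=dropout,
                                          bias=False, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.ln2(x))
        return x

block = DecoderBlock()
x = torch.randn(2, 16, 768)          # (batch, sequence length, embedding dim)
print(block(x).shape)                # torch.Size([2, 16, 768])
```

The full model stacks 9 such blocks on top of token and positional embeddings (vocabulary size 51,728, context length 1,024) and ends with a linear head over the vocabulary.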
Training Details
Training Configuration
The model was trained with the following hyperparameters:
- Batch Size: 1
- Learning Rate: 3.00e-4
- Weight Decay: 0.01
- Betas (for AdamW Optimizer): (0.9, 0.999)
- Number of Workers: 8
- Maximum Epochs: 1
- Warmup Rate: 0.15 (fraction of steps for learning rate warmup)
- Initial Learning Rate: 2.94e-4
- Minimum Learning Rate: 2.92e-4
- Maximum Gradient Norm: 1.0 (for gradient clipping)
- Evaluation Frequency: Every 8,000 steps
- No-Improvement Count: 5 (early stopping criterion: training stops after 5 evaluations without improvement in loss)
- Start Context: "रोगविज्ञान रोग वा चोटको कारण तथा प्रभावहरूको अध्ययन गर्ने" (roughly, "Pathology is the study of the causes and effects of disease or injury"; used as the prompt for sample generation during training)
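The training script itself is not part of this repository, and the exact learning-rate schedule is not documented beyond the warmup rate and the initial/minimum learning rates above. As a hedged sketch of how these hyperparameters could be wired together, the snippet below uses `transformers`' cosine-with-warmup schedule; the choice of cosine, the stand-in model, and the total step count (taken approximately from the loss plot) are assumptions.

```python
# Illustrative optimizer/scheduler setup using the hyperparameters listed above.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(768, 768)          # stand-in for the NepaliGPT model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                 # Learning Rate
    weight_decay=0.01,       # Weight Decay
    betas=(0.9, 0.999),      # AdamW betas
)

total_steps = 65_000                       # approximate, from the loss plot
warmup_steps = int(0.15 * total_steps)     # Warmup Rate: 0.15
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

# In the training loop, gradients are clipped to a maximum norm of 1.0
# before each optimizer step:
for step in range(3):                      # a few dummy steps for illustration
    loss = model(torch.randn(1, 768)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```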
Training and Evaluation Loss
The plot below shows the training and evaluation loss over the course of training:
- Training Loss (Blue Line): Logged over training steps, showing a steady decrease from around 11 to below 4 over roughly 65,000 global steps.
- Evaluation Loss (Orange Line with 'x' Markers): Measured per evaluation step (every 8,000 steps), starting at around 5 and decreasing to just below 4.
The plot indicates that the model is learning effectively, with both training and evaluation losses converging, suggesting good generalization and minimal overfitting.
Installation
To use this model, you’ll need to install the required dependencies:
```bash
pip install torch transformers
```
Usage
Loading the Model and Tokenizer
To load and use the model, you need to register the custom model class by importing the `model_nepaligpt.py` file included in this repository. Then you can use the `transformers` library to load the model and tokenizer.
```python
import model_nepaligpt  # from this repository; registers the custom NepaliGPT classes with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "dinesh-bk/nepali-gpt"  # Replace with your Hugging Face repo name
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate text
input_text = "नमस्ते, संसार!"  # "Hello, world!" in Nepali
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text:", generated_text)
```
Notes
- Generation Options: You can customize generation with parameters like `do_sample=True`, `top_k=50`, or `temperature=0.7` for more diverse outputs, for example:
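Continuing from the loading example above, a sampling-based call might look like this:

```python
# Sampling-based generation with the options mentioned above.
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,    # sample from the distribution instead of greedy decoding
    top_k=50,          # keep only the 50 most likely next tokens at each step
    temperature=0.7,   # controls the randomness of sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```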
Files in This Repository
- `config.json`: Model configuration.
- `model.safetensors`: Model weights in SafeTensors format.
- `configuration_nepaligpt.py`: Defines the `NepaliGPTConfig` configuration class for the model.
- `model_nepaligpt.py`: Defines the `TransformerBlock` and `NepaliGPTModel` classes, including registration with `transformers`.
- `special_tokens_map.json`: Special tokens for the tokenizer.
- `tokenizer_config.json`: Tokenizer settings.
- `tokenizer.json`: Core tokenizer configuration and vocabulary.
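Tying this back to the Architecture section, `config.json` stores the values listed there. The sketch below is a rough Python view of those values; the key names are assumptions for illustration, since the authoritative names are defined by `NepaliGPTConfig`.

```python
# Illustrative only: values come from the Architecture section above,
# but the key names are assumptions (the real ones are set by NepaliGPTConfig).
expected_config = {
    "model_type": "nep_gptv1",
    "vocab_size": 51_728,
    "emb_dim": 768,
    "context_length": 1_024,
    "n_heads": 12,
    "n_layers": 9,
    "drop_rate": 0.1,
    "qkv_bias": False,
    "torch_dtype": "float32",
    "transformers_version": "4.51.0.dev0",
}
print(expected_config["model_type"])
```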
Training Data
The model was trained on a corpus of Nepali text from Hugging Face (`IRIISNEPAL/Nepali-Text-Corpus`).
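To inspect or reuse the same corpus, it can be streamed with the `datasets` library (an extra dependency beyond those listed under Installation). The split name and record layout below are assumptions about that dataset.

```python
# Requires: pip install datasets
from datasets import load_dataset

# Stream the corpus from the Hub instead of downloading it in full;
# the "train" split name is an assumption about the dataset's layout.
corpus = load_dataset("IRIISNEPAL/Nepali-Text-Corpus", split="train", streaming=True)
print(next(iter(corpus)))
```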
Limitations
- Training Duration: The model was trained for only 1 epoch, which might limit its performance. Further training could improve results.
- Overfitting: Although the training and evaluation losses track each other closely, the small gap between them suggests the potential for slight overfitting, especially given training for only a single epoch.
- Language Specificity: The model is tailored for Nepali but may not generalize well to other languages without fine-tuning.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Contact
For questions or contributions, please contact