|
--- |
|
library_name: transformers |
|
tags: |
|
- gpt2 |
|
- assamese |
|
- language-model |
|
- text-generation |
|
- low-resource |
|
- educational |
|
- research |
|
- generated_from_trainer |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: Assamese GPT-2 |
|
results: [] |
|
--- |
|
|
|
# Assamese GPT-2 Model |
|
|
|
This is a GPT-2 language model trained from scratch on Assamese monolingual text, using data from **IndicCorpV2**. The model is developed for **educational and research purposes** to support natural language understanding and generation tasks in Assamese, a low-resource language.
|
|
|
## 📖 Model Description |
|
|
|
The Assamese GPT-2 model is based on the standard GPT-2 decoder-only transformer architecture, with 12 layers, 12 attention heads, and a hidden size of 768. It is capable of generating grammatically coherent and contextually relevant Assamese text and serves as a foundation for downstream NLP tasks such as the following (a fine-tuning sketch appears after the list):
|
|
|
- Language modeling |
|
- Text completion/generation |
|
- Fine-tuning for classification or summarization |
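
As an illustration of the fine-tuning use case, the sketch below loads this checkpoint behind a sequence-classification head. The two-label setup and the padding-token choice are assumptions made for the example, not part of the released model.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical downstream setup: fine-tune the checkpoint as a 2-class classifier.
tokenizer = AutoTokenizer.from_pretrained("BharatVLM/AssameseGPT2")
model = AutoModelForSequenceClassification.from_pretrained(
    "BharatVLM/AssameseGPT2", num_labels=2  # classification head is newly initialised
)

# GPT-2 defines no padding token; reuse EOS so batched fine-tuning works.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
```

From here, the model can be trained on a labeled Assamese dataset with the standard `Trainer` workflow shown later in this card.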
|
|
|
## ✅ Intended Uses |
|
|
|
- Academic research on Assamese NLP |
|
- Training and benchmarking in educational settings |
|
- Exploration of low-resource language modeling |
|
|
|
## 🚫 Limitations |
|
|
|
- Trained on general-domain monolingual data; it may not perform well on domain-specific texts (e.g., legal, medical).
|
- Might generate biased, incomplete, or hallucinated outputs. |
|
- Not suitable for production use or deployment in sensitive applications. |
|
|
|
## 📚 Training and Evaluation Data |
|
|
|
The model was trained using Assamese monolingual data collected from: |
|
|
|
- **IndicCorpV2**: A curated collection of web-crawled and processed data for Indic languages. |
|
|
|
Data preprocessing included the following steps (a minimal illustrative sketch follows the list):
|
- Unicode normalization |
|
- Removal of noisy characters and malformed tokens |
|
- Sentence segmentation using Assamese-specific heuristics |
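
The exact preprocessing scripts are not included in this card, so the sketch below is only illustrative: NFC normalization, a character-class noise filter, and danda-based sentence splitting are assumed stand-ins for the Assamese-specific heuristics.

```python
import re
import unicodedata

def preprocess(text: str) -> list[str]:
    # 1. Unicode normalization (NFC is assumed here).
    text = unicodedata.normalize("NFC", text)
    # 2. Drop characters outside the Assamese/Bengali block, digits, basic
    #    punctuation, and whitespace -- an illustrative noise filter.
    text = re.sub(r"[^\u0980-\u09FF0-9\s.,!?।]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Sentence segmentation on the danda (।) and sentence-final punctuation,
    #    a simple stand-in for the Assamese-specific heuristics.
    return [s for s in re.split(r"(?<=[।.!?])\s+", text) if s]

# "Assam is a state of India. Its capital is Dispur."
print(preprocess("অসম ভাৰতৰ এখন ৰাজ্য। ইয়াৰ ৰাজধানী দিছপুৰ।"))
```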
|
|
|
## 🧪 Training Procedure |
|
|
|
### Hyperparameters |
|
- Architecture: GPT2 (12 layers, 12 heads, 768 hidden size) |
|
- Tokenizer vocab size: 50,000 |
|
- Context window size: 1024 tokens |
|
- Learning rate: 5e-5 |
|
- Epochs: 20 |
|
- Batch size: 64 |
|
- Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8) |
|
- Scheduler: Linear |
|
- Mixed Precision: Native AMP |
|
- Seed: 42 |
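
Putting the hyperparameters above together, a training setup might look like the sketch below. The tokenized train/eval datasets and the output path are placeholders, and the `Trainer` API is used here as one plausible way to reproduce the stated configuration, not the exact training script.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Tokenizer released with this checkpoint; the tokenized train/eval splits of the
# IndicCorpV2 Assamese data are placeholders here (preparation not shown).
tokenizer = GPT2TokenizerFast.from_pretrained("BharatVLM/AssameseGPT2")
train_dataset, eval_dataset = ..., ...

# Architecture and tokenizer settings stated on this card.
model = GPT2LMHeadModel(
    GPT2Config(vocab_size=50_000, n_positions=1024, n_embd=768, n_layer=12, n_head=12)
)

args = TrainingArguments(
    output_dir="assamese-gpt2",            # placeholder output path
    num_train_epochs=20,
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    fp16=True,                             # native AMP
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```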
|
|
|
### Results |
|
- Final Evaluation Loss: -29.1890 |
|
- Accuracy: 0.3452 |
|
|
|
## 🚀 Example Usage |
|
|
|
```python |
|
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the checkpoint and its tokenizer from the Hugging Face Hub.
model = GPT2LMHeadModel.from_pretrained("BharatVLM/AssameseGPT2")
tokenizer = GPT2Tokenizer.from_pretrained("BharatVLM/AssameseGPT2")

# Sample a continuation for an Assamese prompt ("The history of Assam").
prompt = "অসমৰ ইতিহাস"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
``` |
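
Equivalently, the high-level `pipeline` API can be used; the sampling settings below are illustrative defaults, not tuned values.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="BharatVLM/AssameseGPT2")
result = generator("অসমৰ ইতিহাস", max_length=50, do_sample=True, top_k=50)
print(result[0]["generated_text"])
```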
|
|
|
## 📄 License |
|
|
|
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. |
|
Commercial use is not permitted. Use is allowed for academic and research purposes only. |
|
|
|
## 📬 Citation |
|
|
|
Please cite this model as: |
|
|
|
```bibtex
@misc{assamesegpt2,
  author       = {BharatVLM},
  title        = {Assamese GPT-2 Model},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/BharatVLM/AssameseGPT2}},
  note         = {Trained using IndicCorpV2 and OSCAR corpora}
}
```
|
|
|
## 🧰 Framework Versions |
|
|
|
- Transformers: 4.52.0.dev0 |
|
|
|
- PyTorch: 2.5.1+cu121 |
|
|
|
- Datasets: 3.6.0 |
|
|
|
- Tokenizers: 0.21.1 |
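
To check that a local environment roughly matches the versions listed above, something like the following can be run (exact pins are likely not required):

```python
import datasets
import tokenizers
import torch
import transformers

# Versions reported on this card, for comparison:
#   transformers 4.52.0.dev0, torch 2.5.1+cu121, datasets 3.6.0, tokenizers 0.21.1
for name, module in [("transformers", transformers), ("torch", torch),
                     ("datasets", datasets), ("tokenizers", tokenizers)]:
    print(f"{name}: {module.__version__}")
```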
|
|
|
|
|
## 📧 Contact Us
|
For questions or academic collaboration, please contact: [email protected]. |
|
|