|
--- |
|
library_name: transformers |
|
tags: |
|
- gpt2 |
|
- assamese |
|
- language-model |
|
- text-generation |
|
- low-resource |
|
- educational |
|
- research |
|
- generated_from_trainer |
|
metrics: |
|
- accuracy |
|
model-index: |
|
- name: Assamese GPT-2 |
|
results: [] |
|
--- |
|
|
|
# Assamese GPT-2 Model |
|
|
|
This is a GPT-2 language model trained from scratch on Assamese monolingual text, using data from **IndicCorpV2**. The model is developed for **educational and research purposes** to support natural language understanding and generation tasks in Assamese, a low-resource language.
|
|
|
## 📖 Model Description |
|
|
|
The Assamese GPT-2 model is based on the standard GPT-2 decoder-only transformer architecture, with 12 layers, 12 attention heads, and a hidden size of 768. It is capable of generating grammatically coherent and contextually relevant Assamese text and serves as a foundation for downstream NLP tasks such as the following (a fine-tuning sketch appears after the list):
|
|
|
- Language modeling |
|
- Text completion/generation |
|
- Fine-tuning for classification or summarization |
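
As an illustration of the fine-tuning use case, the sketch below loads this checkpoint behind a sequence-classification head. The two-label setup and the padding-token choice are assumptions made for the example, not part of the released model.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical downstream setup: fine-tune the checkpoint as a 2-class classifier.
tokenizer = AutoTokenizer.from_pretrained("BharatVLM/AssameseGPT2")
model = AutoModelForSequenceClassification.from_pretrained(
    "BharatVLM/AssameseGPT2", num_labels=2  # classification head is newly initialised
)

# GPT-2 defines no padding token; reuse EOS so batched fine-tuning works.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
```

From here, the model can be trained on a labeled Assamese dataset with the standard `Trainer` workflow shown later in this card.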
|
|
|
## ✅ Intended Uses |
|
|
|
- Academic research on Assamese NLP |
|
- Training and benchmarking in educational settings |
|
- Exploration of low-resource language modeling |
|
|
|
## 🚫 Limitations |
|
|
|
- Trained on general-domain monolingual data; it may not perform well on domain-specific texts (e.g., legal, medical).
|
- Might generate biased, incomplete, or hallucinated outputs. |
|
- Not suitable for production use or deployment in sensitive applications. |
|
|
|
## 📚 Training and Evaluation Data |
|
|
|
The model was trained using Assamese monolingual data collected from: |
|
|
|
- **IndicCorpV2**: A curated collection of web-crawled and processed data for Indic languages. |
|
|
|
Data preprocessing included the following steps (a minimal illustrative sketch follows the list):
|
- Unicode normalization |
|
- Removal of noisy characters and malformed tokens |
|
- Sentence segmentation using Assamese-specific heuristics |
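
The exact preprocessing scripts are not included in this card, so the sketch below is only illustrative: NFC normalization, a character-class noise filter, and danda-based sentence splitting are assumed stand-ins for the Assamese-specific heuristics.

```python
import re
import unicodedata

def preprocess(text: str) -> list[str]:
    # 1. Unicode normalization (NFC is assumed here).
    text = unicodedata.normalize("NFC", text)
    # 2. Drop characters outside the Assamese/Bengali block, digits, basic
    #    punctuation, and whitespace -- an illustrative noise filter.
    text = re.sub(r"[^\u0980-\u09FF0-9\s.,!?।]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # 3. Sentence segmentation on the danda (।) and sentence-final punctuation,
    #    a simple stand-in for the Assamese-specific heuristics.
    return [s for s in re.split(r"(?<=[।.!?])\s+", text) if s]

# "Assam is a state of India. Its capital is Dispur."
print(preprocess("অসম ভাৰতৰ এখন ৰাজ্য। ইয়াৰ ৰাজধানী দিছপুৰ।"))
```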
|
|
|
## 🧪 Training Procedure |
|
|
|
### Hyperparameters |
|
- Architecture: GPT2 (12 layers, 12 heads, 768 hidden size) |
|
- Tokenizer vocab size: 50,000 |
|
- Context window size: 1024 tokens |
|
- Learning rate: 5e-5 |
|
- Epochs: 20 |
|
- Batch size: 64 |
|
- Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8) |
|
- Scheduler: Linear |
|
- Mixed Precision: Native AMP |
|
- Seed: 42 |
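
Putting the hyperparameters above together, a training setup might look like the sketch below. The tokenized train/eval datasets and the output path are placeholders, and the `Trainer` API is used here as one plausible way to reproduce the stated configuration, not the exact training script.

```python
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Tokenizer released with this checkpoint; the tokenized train/eval splits of the
# IndicCorpV2 Assamese data are placeholders here (preparation not shown).
tokenizer = GPT2TokenizerFast.from_pretrained("BharatVLM/AssameseGPT2")
train_dataset, eval_dataset = ..., ...

# Architecture and tokenizer settings stated on this card.
model = GPT2LMHeadModel(
    GPT2Config(vocab_size=50_000, n_positions=1024, n_embd=768, n_layer=12, n_head=12)
)

args = TrainingArguments(
    output_dir="assamese-gpt2",            # placeholder output path
    num_train_epochs=20,
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    fp16=True,                             # native AMP
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```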
|
|
|
### Results |
|
- Final Evaluation Loss: -29.1890 |
|
- Accuracy: 0.3452 |
|
|
|
## 🚀 Example Usage |
|
|
|
```python |
|
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the checkpoint and its tokenizer from the Hugging Face Hub.
model = GPT2LMHeadModel.from_pretrained("BharatVLM/AssameseGPT2")
tokenizer = GPT2Tokenizer.from_pretrained("BharatVLM/AssameseGPT2")

# Sample a continuation for an Assamese prompt ("The history of Assam").
prompt = "অসমৰ ইতিহাস"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
``` |
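
Equivalently, the high-level `pipeline` API can be used; the sampling settings below are illustrative defaults, not tuned values.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="BharatVLM/AssameseGPT2")
result = generator("অসমৰ ইতিহাস", max_length=50, do_sample=True, top_k=50)
print(result[0]["generated_text"])
```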
|
|
|
## 📄 License |
|
|
|
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. |
|
Commercial use is not permitted. Use is allowed for academic and research purposes only. |
|
|
|
## 📬 Citation |
|
|
|
Please cite this model as: |
|
|
|
```bibtex
@misc{assamesegpt2,
  author       = {BharatVLM},
  title        = {Assamese GPT-2 Model},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/BharatVLM/AssameseGPT2}},
  note         = {Trained using IndicCorpV2 and OSCAR corpora}
}
```
|
|
|
## 🧰 Framework Versions |
|
|
|
- Transformers: 4.52.0.dev0 |
|
|
|
- PyTorch: 2.5.1+cu121 |
|
|
|
- Datasets: 3.6.0 |
|
|
|
- Tokenizers: 0.21.1 |
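
To check that a local environment roughly matches the versions listed above, something like the following can be run (exact pins are likely not required):

```python
import datasets
import tokenizers
import torch
import transformers

# Versions reported on this card, for comparison:
#   transformers 4.52.0.dev0, torch 2.5.1+cu121, datasets 3.6.0, tokenizers 0.21.1
for name, module in [("transformers", transformers), ("torch", torch),
                     ("datasets", datasets), ("tokenizers", tokenizers)]:
    print(f"{name}: {module.__version__}")
```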
|
|
|
|
|
## 📧 Contact Us
|
For questions or academic collaboration, please contact: [email protected]. |
|
|