---
library_name: transformers
tags:
- gpt2
- assamese
- language-model
- text-generation
- low-resource
- educational
- research
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: Assamese GPT-2
  results: []
---

# Assamese GPT-2 Model

This is a GPT-2 language model trained from scratch on Assamese monolingual text, using data from **IndicCorpV2**. The model is developed for **educational and research purposes** to support natural language understanding and generation tasks in Assamese, a low-resource language.

## 📖 Model Description

The Assamese GPT-2 model follows the standard GPT-2 decoder-only transformer architecture with 12 layers, 12 attention heads, and a hidden size of 768; a short sketch for checking this configuration follows the task list below. It generates grammatically coherent and contextually relevant Assamese text and serves as a foundation for downstream NLP tasks such as:

- Language modeling
- Text completion/generation
- Fine-tuning for classification or summarization
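
As a minimal sketch, the published configuration can be inspected directly; this assumes the repository id `BharatVLM/AssameseGPT2` from the usage example below and the standard GPT-2 config field names:

```python
from transformers import GPT2Config

# Load the checkpoint's configuration from the Hub (repository id assumed from the usage example).
config = GPT2Config.from_pretrained("BharatVLM/AssameseGPT2")

# Standard GPT-2 config fields; values should match the card
# (12 layers, 12 heads, 768 hidden size, 1024-token context, 50k vocab).
print("layers:         ", config.n_layer)
print("attention heads:", config.n_head)
print("hidden size:    ", config.n_embd)
print("context window: ", config.n_positions)
print("vocab size:     ", config.vocab_size)
```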

## ✅ Intended Uses

- Academic research on Assamese NLP
- Training and benchmarking in educational settings
- Exploration of low-resource language modeling

## 🚫 Limitations

- Trained on general-domain monolingual data, so it may not perform well on domain-specific texts (e.g., legal, medical).
- Might generate biased, incomplete, or hallucinated outputs.
- Not suitable for production use or deployment in sensitive applications.

## 📚 Training and Evaluation Data

The model was trained using Assamese monolingual data collected from:

- **IndicCorpV2**: A curated collection of web-crawled and processed data for Indic languages.

Data preprocessing included the following steps (an illustrative sketch follows the list):
- Unicode normalization
- Removal of noisy characters and malformed tokens
- Sentence segmentation using Assamese-specific heuristics
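
The exact normalization and segmentation rules are not published, so the snippet below is a hedged sketch only: the noise filter and the danda-based sentence splitting are illustrative assumptions, not the original pipeline.

```python
import re
import unicodedata

def preprocess(text: str) -> list[str]:
    """Illustrative cleaning pipeline: normalize, strip noise, split sentences."""
    # Unicode normalization (NFC keeps composed Assamese-script characters).
    text = unicodedata.normalize("NFC", text)

    # Replace control characters (category Cc) with spaces, then collapse whitespace.
    text = "".join(ch if unicodedata.category(ch) != "Cc" else " " for ch in text)
    text = re.sub(r"\s+", " ", text).strip()

    # Naive Assamese-specific heuristic: split on the danda (।), '?' and '!'.
    sentences = re.split(r"(?<=[।?!])\s+", text)
    return [s for s in sentences if s]

print(preprocess("অসম ভাৰতৰ এখন ৰাজ্য। ইয়াৰ ৰাজধানী দিছপুৰ।"))
```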

## 🧪 Training Procedure

### Hyperparameters
- Architecture: GPT2 (12 layers, 12 heads, 768 hidden size)
- Tokenizer vocab size: 50,000
- Context window size: 1024 tokens
- Learning rate: 5e-5  
- Epochs: 20  
- Batch size: 64  
- Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-8)  
- Scheduler: Linear  
- Mixed Precision: Native AMP  
- Seed: 42
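
As a hedged illustration (the original training script is not published), these hyperparameters map roughly onto the following `TrainingArguments`; the output directory is a placeholder and the effective batch size may differ across devices:

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameters listed above; not the original script.
training_args = TrainingArguments(
    output_dir="assamese-gpt2",        # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=64,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    adam_beta1=0.9,                    # AdamW is the default optimizer
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    fp16=True,                         # native AMP mixed precision
    seed=42,
)
```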

### Results
- Final Evaluation Loss: -29.1890  
- Accuracy: 0.3452  

## 🚀 Example Usage

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and its Assamese tokenizer from the Hub.
model = GPT2LMHeadModel.from_pretrained("BharatVLM/AssameseGPT2")
tokenizer = GPT2Tokenizer.from_pretrained("BharatVLM/AssameseGPT2")

# Generate a continuation for an Assamese prompt ("History of Assam").
prompt = "অসমৰ ইতিহাস"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
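
Sampling behaviour can be tuned with the standard `generate()` parameters; for example, nucleus sampling with a temperature (the values below are illustrative and not tuned for this model):

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_p=0.9,               # nucleus sampling
    temperature=0.8,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```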

## 📄 License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Commercial use is not permitted. Use is allowed for academic and research purposes only.

## 📬 Citation

Please cite this model as:

```bibtex
@misc{assamesegpt2,
  author       = {BharatVLM},
  title        = {Assamese GPT-2 Model},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/BharatVLM/AssameseGPT2}},
  note         = {Trained using IndicCorpV2 and OSCAR corpora}
}
```

## 🧰 Framework Versions

- Transformers: 4.52.0.dev0
- PyTorch: 2.5.1+cu121
- Datasets: 3.6.0
- Tokenizers: 0.21.1


## Contact Us
For questions or academic collaboration, please contact: [email protected].