|
--- |
|
license: apache-2.0 |
|
language: |
|
- kk |
|
- ru |
|
- en |
|
base_model: |
|
- google-bert/bert-base-uncased |
|
pipeline_tag: fill-mask |
|
tags: |
|
- pytorch |
|
- safetensors |
|
library_name: transformers |
|
paper: https://doi.org/10.5281/zenodo.15565394 |
|
datasets: |
|
- amandyk/kazakh_wiki_articles |
|
- Eraly-ml/kk-cc-data |
|
direct_use: true |
|
widget: |
|
- text: "KazBERT қазақ тілін [MASK] түсінеді." |
|
--- |
|
|
|
|
|
# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿 |
|
|
|
<details> |
|
<summary><span style="color:#4CAF50;"><strong>License & Metadata</strong></span></summary> |
|
|
|
- **License:** apache-2.0 |
|
- **Languages:** Kazakh (kk), Russian (ru), English (en) |
|
- **Base Model:** google-bert/bert-base-uncased |
|
- **Pipeline Tag:** fill-mask |
|
- **Tags:** pytorch, safetensors |
|
- **Library:** transformers |
|
- **Datasets:** |
|
- amandyk/kazakh_wiki_articles |
|
- Eraly-ml/kk-cc-data |
|
- **Direct Use:** ✅ |
|
- **Widget Example:** |
|
`"KazBERT қазақ тілін [MASK] түсінеді."` |
|
|
|
</details> |
|
|
|
## <span style="color:#4CAF50;">Model Overview</span>
|
|
|
**KazBERT** is a BERT-based model fine-tuned specifically for Kazakh using Masked Language Modeling (MLM). It is based on `bert-base-uncased` and uses a custom WordPiece tokenizer trained on Kazakh text.
|
|
|
### <span style="color:#4CAF50;">Model Details</span> |
|
|
|
- **Architecture:** BERT |
|
- **Tokenizer:** WordPiece trained on Kazakh |
|
- **Training Data:** Kazakh Wikipedia & Common Crawl |
|
- **Method:** Masked Language Modeling (MLM) |
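
The same objective can be set up with the standard 🤗 Transformers MLM components. The sketch below is a hypothetical reconstruction (the actual training script is not part of this repository), assuming the standard 15% masking ratio:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Custom Kazakh WordPiece tokenizer shipped with this repository
tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# Start from the English base model and resize its embedding matrix
# to fit the new Kazakh vocabulary
model = BertForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

# Dynamic token masking for MLM (0.15 is the standard BERT ratio;
# the exact value used for KazBERT is an assumption)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```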
|
|
|
**Erlanulu, Y. G. (2025). KazBERT: A Custom BERT Model for the Kazakh Language. Zenodo.** |
|
|
|
📄 [Read the paper](https://doi.org/10.5281/zenodo.15565394) |
|
|
|
--- |
|
|
|
|
|
## <span style="color:#4CAF50;">Files in Repository</span>
|
|
|
- `config.json` – Model config |
|
- `model.safetensors` – Model weights |
|
- `tokenizer.json` – Tokenizer data |
|
- `tokenizer_config.json` – Tokenizer config |
|
- `special_tokens_map.json` – Special tokens |
|
- `vocab.txt` – Vocabulary |
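
To inspect or download these files programmatically, the `huggingface_hub` client can be used (a minimal sketch; requires `pip install huggingface_hub`):

```python
from huggingface_hub import list_repo_files

# Print every file hosted in the KazBERT model repository
for filename in list_repo_files("Eraly-ml/KazBERT"):
    print(filename)
```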
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Training Configuration</span>
|
|
|
- **Epochs:** 20 |
|
- **Batch size:** 16 |
|
- **Learning rate:** Default |
|
- **Weight decay:** 0.01 |
|
- **FP16 Training:** Enabled |
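
Expressed as 🤗 Transformers `TrainingArguments`, the configuration corresponds roughly to the sketch below; the output path is a placeholder, and "Default" learning rate is read as the `Trainer` default of 5e-5:

```python
from transformers import TrainingArguments

# Hedged reconstruction of the reported hyperparameters
training_args = TrainingArguments(
    output_dir="./kazbert-mlm",      # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    fp16=True,                       # mixed-precision training
    # learning_rate left unset -> Trainer default (5e-5)
)
```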
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Usage</span>
|
|
|
Install 🤗 Transformers and load the model: |
|
|
|
```python |
|
from transformers import BertForMaskedLM, BertTokenizerFast |
|
|
|
model_name = "Eraly-ml/KazBERT" |
|
tokenizer = BertTokenizerFast.from_pretrained(model_name) |
|
model = BertForMaskedLM.from_pretrained(model_name) |
|
```
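
Because the tokenizer was trained on Kazakh text, Kazakh words split into far fewer subword pieces than they would under the original `bert-base-uncased` vocabulary. A quick check, continuing from the snippet above:

```python
text = "KazBERT қазақ тілін жетік түсінеді."
print(tokenizer.tokenize(text))  # Kazakh words map to compact WordPiece units
```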
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Example: Masked Token Prediction</span>
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT") |
|
output = pipe('KazBERT қазақ тілін [MASK] түсінеді.') |
|
``` |
|
|
|
**Output:** |
|
|
|
```json |
|
[ |
|
{"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."}, |
|
{"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."}, |
|
{"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."}, |
|
{"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."}, |
|
{"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."} |
|
] |
|
``` |
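
Each entry also carries a numeric `token` id (omitted above for brevity). Taking just the top-scoring candidate:

```python
best = output[0]         # predictions are sorted by score, highest first
print(best["sequence"])  # KazBERT қазақ тілін жетік түсінеді.
```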
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Bias and Limitations</span>
|
|
|
- Trained only on public Kazakh Wikipedia and Common Crawl data

- May miss informal speech and regional dialects

- May underperform on rare words or inputs requiring deep contextual understanding

- May reflect cultural or social biases present in the training data
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">License</span>
|
|
|
This model is released under the Apache 2.0 license.
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Citation</span>
|
|
|
```bibtex |
|
@misc{eraly_gainulla_2025, |
|
author = { Eraly Gainulla }, |
|
title = { KazBERT (Revision 15240d4) }, |
|
year = 2025, |
|
url = { https://huggingface.co/Eraly-ml/KazBERT }, |
|
doi = { 10.57967/hf/5271 }, |
|
publisher = { Hugging Face } |
|
} |
|
``` |