---
license: apache-2.0
language:
- kk
- ru
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: fill-mask
tags:
- pytorch
- safetensors
library_name: transformers
datasets:
- amandyk/kazakh_wiki_articles
---

# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿

This repository contains **KazBERT**, a BERT-based model fine-tuned for Kazakh language tasks. The model is trained with Masked Language Modeling (MLM) on a Kazakh text corpus.

## Model Details

- **Architecture:** BERT (based on `bert-base-uncased`)
- **Tokenizer:** WordPiece tokenizer trained on Kazakh texts
- **Training Data:** Custom Kazakh corpus
- **Training Method:** Masked Language Modeling (MLM)

For full details, see the [paper](KazakhBER\(5\).pdf).

## Files in this Repository

- `config.json` – Model configuration
- `model.safetensors` – Model weights in safetensors format
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer configuration
- `special_tokens_map.json` – Special token mappings
- `vocab.txt` – Vocabulary file

## Training Details

- **Number of epochs:** 20
- **Batch size:** 16
- **Learning rate:** default
- **Weight decay:** 0.01
- **Mixed precision training:** Enabled (FP16)

## Usage

To load the model with the 🤗 Transformers library:

```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```

## Example: Masked Token Prediction

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```

Output:

```python
[{'score': 0.19899696111679077, 'token': 25721, 'token_str': 'жетік', 'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892, 'token': 1722, 'token_str': 'де', 'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334, 'token': 4743, 'token_str': 'терең', 'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928, 'token': 5533, 'token_str': 'ерте', 'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132, 'token': 17340, 'token_str': 'жете', 'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
```

## Citation

If you use this model, please cite:

```bibtex
@misc{kazbert2025,
  title={KazBERT: A BERT-based Language Model for Kazakh},
  author={Gainulla Eraly},
  year={2025},
  publisher={Hugging Face Model Hub}
}
```

## License

This model is released under the Apache 2.0 License.