KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿
This repository contains KazBERT, a BERT-based model fine-tuned for Kazakh language tasks. The model is trained using Masked Language Modeling (MLM) on a Kazakh text corpus.
Model Details
- Architecture: BERT (based on `bert-base-uncased`)
- Tokenizer: WordPiece tokenizer trained on Kazakh texts (see the training sketch below this list)
- Training Data: Custom Kazakh corpus
- Training Method: Masked Language Modeling (MLM)
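The tokenizer training script is not included in this repository, but a Kazakh WordPiece tokenizer of this kind can be trained with the 🤗 `tokenizers` library roughly as sketched below. The corpus file name and vocabulary size are illustrative assumptions, not values taken from this model.

```python
# Minimal sketch: training a WordPiece tokenizer on a Kazakh corpus.
# `kazakh_corpus.txt` and vocab_size are illustrative assumptions.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["kazakh_corpus.txt"],   # hypothetical corpus file
    vocab_size=30_000,             # assumed size, not taken from this repo
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt, like the file listed below
```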
For full details, see the paper.
Files in this Repository
- `config.json` – Model configuration
- `model.safetensors` – Model weights in safetensors format
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer configuration
- `special_tokens_map.json` – Special token mappings
- `vocab.txt` – Vocabulary file
Training Details
- Number of epochs: 20
- Batch size: 16
- Learning rate: default
- Weight decay: 0.01
- Mixed Precision Training: Enabled (FP16)
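The training script itself is not part of this repository. The settings listed above map onto a standard 🤗 `Trainer` MLM setup roughly as sketched below; the starting checkpoint, placeholder corpus, and output directory are assumptions, and the learning rate is left at the `Trainer` default, as stated above.

```python
# Rough sketch of the MLM fine-tuning described above; the starting checkpoint,
# placeholder corpus, and output_dir are assumptions, not taken from this card.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")  # tokenizer shipped with this repo
model = BertForMaskedLM.from_pretrained("bert-base-uncased")       # assumed starting checkpoint
model.resize_token_embeddings(len(tokenizer))                      # align with the Kazakh vocabulary

# Placeholder corpus: in practice this would be the full Kazakh text corpus.
texts = ["KazBERT қазақ тілін жетік түсінеді."]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="kazbert-mlm",          # hypothetical output directory
    num_train_epochs=20,
    per_device_train_batch_size=16,
    weight_decay=0.01,                 # learning rate left at the Trainer default
    fp16=True,                         # mixed precision; requires a CUDA GPU
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```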
Usage
To use the model with the 🤗 Transformers library:
```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```
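With the model and tokenizer loaded as above, a masked token can also be predicted directly, without the pipeline helper used in the next example. This is a minimal sketch; the example sentence is the same one used below.

```python
import torch

text = "KazBERT қазақ тілін [MASK] түсінеді."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the top-5 candidate tokens.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```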
Example: Masked Token Prediction
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```
Output:
```python
[{'score': 0.19899696111679077,
  'token': 25721,
  'token_str': 'жетік',
  'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892,
  'token': 1722,
  'token_str': 'де',
  'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334,
  'token': 4743,
  'token_str': 'терең',
  'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928,
  'token': 5533,
  'token_str': 'ерте',
  'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132,
  'token': 17340,
  'token_str': 'жете',
  'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
```
Bias and Limitations
- The model is trained only on publicly available Kazakh Wikipedia articles, which might not represent the full diversity of the Kazakh language (e.g., regional dialects, informal speech, minority group expressions).
- Since Kazakh text resources are limited compared to English corpora, the model might underperform on complex tasks requiring deep understanding of context or rare words.
- Social or cultural biases present in the training data may be reflected in model predictions.
- The model has not been extensively evaluated on downstream tasks such as sentiment analysis, question answering, or text generation.
License
This model is released under the Apache 2.0 License.
Citation
If you use this model, please cite:
```bibtex
@misc{eraly_gainulla_2025,
  author    = {Eraly Gainulla},
  title     = {KazBERT (Revision 15240d4)},
  year      = 2025,
  url       = {https://huggingface.co/Eraly-ml/KazBERT},
  doi       = {10.57967/hf/5271},
  publisher = {Hugging Face}
}
```