
KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿

This repository contains KazBERT, a BERT-based model fine-tuned for Kazakh language tasks. The model is trained using Masked Language Modeling (MLM) on a Kazakh text corpus.

Model Details

  • Architecture: BERT (based on bert-base-uncased)
  • Tokenizer: WordPiece tokenizer trained on Kazakh texts
  • Training Data: Custom Kazakh corpus
  • Training Method: Masked Language Modeling (MLM)

For full details, see the paper.
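
To see how the custom WordPiece vocabulary segments Kazakh text, you can inspect the tokenizer directly. A minimal sketch (the sentence is an arbitrary illustration, not taken from the training corpus):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# WordPiece splits out-of-vocabulary words into subword units marked with "##"
print(tokenizer.tokenize("Қазақстан Республикасының астанасы"))
print(tokenizer.encode("Қазақстан Республикасының астанасы"))  # token ids incl. [CLS]/[SEP]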

Files in this Repository

  • config.json – Model configuration
  • model.safetensors – Model weights in safetensors format
  • tokenizer.json – Tokenizer data
  • tokenizer_config.json – Tokenizer configuration
  • special_tokens_map.json – Special token mappings
  • vocab.txt – Vocabulary file

Training Details

  • Number of epochs: 20
  • Batch size: 16
  • Learning rate: default
  • Weight decay: 0.01
  • Mixed Precision Training: Enabled (FP16)
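
The training script itself is not part of the repository, but the hyperparameters above map onto a standard 🤗 Trainer MLM setup. A minimal sketch under that assumption, with the Trainer left at its default learning rate; train_dataset stands in for a pre-tokenized Kazakh corpus and is hypothetical:

from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
# Assumption: embeddings are resized to match the custom Kazakh vocabulary
model.resize_token_embeddings(len(tokenizer))

# Dynamic masking for MLM (15% of tokens masked by default)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="kazbert-mlm",
    num_train_epochs=20,             # as listed above
    per_device_train_batch_size=16,  # batch size 16
    weight_decay=0.01,
    fp16=True,                       # mixed precision training
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,  # hypothetical pre-tokenized dataset
)
trainer.train()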

Usage

To use the model with the 🤗 Transformers library:

from transformers import BertForMaskedLM, BertTokenizerFast

# Load the Kazakh WordPiece tokenizer and the model with its MLM head
model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

Example: Masked Token Prediction

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")

output = pipe('KazBERT қазақ тілін [MASK] түсінеді.')
print(output)
Output:
[{'score': 0.19899696111679077,
  'token': 25721,
  'token_str': 'жетік',
  'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892,
  'token': 1722,
  'token_str': 'де',
  'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334,
  'token': 4743,
  'token_str': 'терең',
  'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928,
  'token': 5533,
  'token_str': 'ерте',
  'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132,
  'token': 17340,
  'token_str': 'жете',
  'sequence': 'kazbert қазақ тілін жете түсінеді.'}]

Bias and Limitations

The model is trained only on publicly available Kazakh Wikipedia articles, which might not represent the full diversity of the Kazakh language (e.g., regional dialects, informal speech, minority group expressions).

Because available Kazakh text data is limited compared to English corpora, the model may underperform on complex tasks that require deep contextual understanding or involve rare words.

Some social or cultural biases present in the training data may be reflected in model predictions.

The model has not been extensively evaluated on downstream tasks such as sentiment analysis, QA, or text generation.

License

This model is released under the Apache 2.0 License.

Citation

If you use this model, please cite:

@misc{eraly_gainulla_2025,
    author    = {Eraly Gainulla},
    title     = {KazBERT (Revision 15240d4)},
    year      = {2025},
    url       = {https://huggingface.co/Eraly-ml/KazBERT},
    doi       = {10.57967/hf/5271},
    publisher = {Hugging Face}
}