KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿
This repository contains KazBERT, a BERT-based model fine-tuned for Kazakh language tasks. The model is trained using Masked Language Modeling (MLM) on a Kazakh text corpus.
Model Details
- Architecture: BERT (based on `bert-base-uncased`)
- Tokenizer: WordPiece tokenizer trained on Kazakh texts (see the training sketch below this list)
- Training Data: Custom Kazakh corpus
- Training Method: Masked Language Modeling (MLM)
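The tokenizer training script is not included in this repository, but a Kazakh WordPiece tokenizer of this kind can be trained with the 🤗 `tokenizers` library roughly as sketched below. The corpus file name and vocabulary size are illustrative assumptions, not values taken from this model.

```python
# Minimal sketch: training a WordPiece tokenizer on a Kazakh corpus.
# `kazakh_corpus.txt` and vocab_size are illustrative assumptions.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["kazakh_corpus.txt"],   # hypothetical corpus file
    vocab_size=30_000,             # assumed size, not taken from this repo
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt, like the file listed below
```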
For full details, see the paper.
Files in this Repository
- `config.json` – Model configuration
- `model.safetensors` – Model weights in safetensors format
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer configuration
- `special_tokens_map.json` – Special token mappings
- `vocab.txt` – Vocabulary file
Training Details
- Number of epochs: 20
- Batch size: 16
- Learning rate: default
- Weight decay: 0.01
- Mixed Precision Training: Enabled (FP16)
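The training script itself is not part of this repository. The settings listed above map onto a standard 🤗 `Trainer` MLM setup roughly as sketched below; the starting checkpoint, placeholder corpus, and output directory are assumptions, and the learning rate is left at the `Trainer` default, as stated above.

```python
# Rough sketch of the MLM fine-tuning described above; the starting checkpoint,
# placeholder corpus, and output_dir are assumptions, not taken from this card.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")  # tokenizer shipped with this repo
model = BertForMaskedLM.from_pretrained("bert-base-uncased")       # assumed starting checkpoint
model.resize_token_embeddings(len(tokenizer))                      # align with the Kazakh vocabulary

# Placeholder corpus: in practice this would be the full Kazakh text corpus.
texts = ["KazBERT қазақ тілін жетік түсінеді."]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="kazbert-mlm",          # hypothetical output directory
    num_train_epochs=20,
    per_device_train_batch_size=16,
    weight_decay=0.01,                 # learning rate left at the Trainer default
    fp16=True,                         # mixed precision; requires a CUDA GPU
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
)
trainer.train()
```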
Usage
To use the model with the 🤗 Transformers library:
```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```
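With the model and tokenizer loaded as above, a masked token can also be predicted directly, without the pipeline helper used in the next example. This is a minimal sketch; the example sentence is the same one used below.

```python
import torch

text = "KazBERT қазақ тілін [MASK] түсінеді."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the top-5 candidate tokens.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```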
Example: Masked Token Prediction
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```
Output:
```python
[{'score': 0.19899696111679077,
  'token': 25721,
  'token_str': 'жетік',
  'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892,
  'token': 1722,
  'token_str': 'де',
  'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334,
  'token': 4743,
  'token_str': 'терең',
  'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928,
  'token': 5533,
  'token_str': 'ерте',
  'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132,
  'token': 17340,
  'token_str': 'жете',
  'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
```
Bias and Limitations
- The model is trained only on publicly available Kazakh Wikipedia articles, which might not represent the full diversity of the Kazakh language (e.g., regional dialects, informal speech, minority group expressions).
- Since Kazakh text resources are limited compared to English corpora, the model might underperform on complex tasks requiring deep understanding of context or rare words.
- Social or cultural biases present in the training data may be reflected in model predictions.
- The model has not been extensively evaluated on downstream tasks such as sentiment analysis, question answering, or text generation.
License
This model is released under the Apache 2.0 License.
Citation
If you use this model, please cite:
```bibtex
@misc{eraly_gainulla_2025,
  author    = {Eraly Gainulla},
  title     = {KazBERT (Revision 15240d4)},
  year      = 2025,
  url       = {https://huggingface.co/Eraly-ml/KazBERT},
  doi       = {10.57967/hf/5271},
  publisher = {Hugging Face}
}
```