---
license: apache-2.0
language:
- kk
- ru
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: fill-mask
tags:
- pytorch
- safetensors
library_name: transformers
datasets:
- amandyk/kazakh_wiki_articles
---

# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿

This repository contains **KazBERT**, a BERT-based model fine-tuned for Kazakh language tasks. The model is trained with Masked Language Modeling (MLM) on a Kazakh text corpus.

## Model Details

- **Architecture:** BERT (based on `bert-base-uncased`)
- **Tokenizer:** WordPiece tokenizer trained on Kazakh texts
- **Training Data:** Custom Kazakh corpus
- **Training Method:** Masked Language Modeling (MLM)

For full details, see the [paper](KazakhBER\(5\).pdf).

## Files in this Repository

- `config.json` – Model configuration
- `model.safetensors` – Model weights in safetensors format
- `tokenizer.json` – Tokenizer data
- `tokenizer_config.json` – Tokenizer configuration
- `special_tokens_map.json` – Special token mappings
- `vocab.txt` – Vocabulary file

## Training Details

- **Number of epochs:** 20
- **Batch size:** 16
- **Learning rate:** default
- **Weight decay:** 0.01
- **Mixed precision training:** Enabled (FP16)

## Usage

To load the model with the 🤗 Transformers library:

```python
from transformers import BertForMaskedLM, BertTokenizerFast

model_name = "Eraly-ml/KazBERT"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)
```

## Example: Masked Token Prediction

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT")
output = pipe("KazBERT қазақ тілін [MASK] түсінеді.")
```

Output:

```python
[{'score': 0.19899696111679077, 'token': 25721, 'token_str': 'жетік', 'sequence': 'kazbert қазақ тілін жетік түсінеді.'},
 {'score': 0.0383591502904892, 'token': 1722, 'token_str': 'де', 'sequence': 'kazbert қазақ тілін де түсінеді.'},
 {'score': 0.0325467586517334, 'token': 4743, 'token_str': 'терең', 'sequence': 'kazbert қазақ тілін терең түсінеді.'},
 {'score': 0.029968073591589928, 'token': 5533, 'token_str': 'ерте', 'sequence': 'kazbert қазақ тілін ерте түсінеді.'},
 {'score': 0.0264473594725132, 'token': 17340, 'token_str': 'жете', 'sequence': 'kazbert қазақ тілін жете түсінеді.'}]
```

## Citation

If you use this model, please cite:

```bibtex
@misc{kazbert2025,
  title={KazBERT: A BERT-based Language Model for Kazakh},
  author={Gainulla Eraly},
  year={2025},
  publisher={Hugging Face Model Hub}
}
```

## License

This model is released under the Apache 2.0 License.