|
--- |
|
license: apache-2.0 |
|
language: |
|
- kk |
|
- ru |
|
- en |
|
base_model: |
|
- google-bert/bert-base-uncased |
|
pipeline_tag: fill-mask |
|
tags: |
|
- pytorch |
|
- safetensors |
|
library_name: transformers |
|
paper: https://doi.org/10.5281/zenodo.15565394 |
|
datasets: |
|
- amandyk/kazakh_wiki_articles |
|
- Eraly-ml/kk-cc-data |
|
direct_use: true |
|
widget: |
|
- text: "KazBERT қазақ тілін [MASK] түсінеді." |
|
--- |
|
|
|
|
|
# KazBERT: A Custom BERT Model for the Kazakh Language 🇰🇿 |
|
|
|
<details> |
|
<summary><span style="color:#4CAF50;"><strong>License & Metadata</strong></span></summary> |
|
|
|
- **License:** apache-2.0 |
|
- **Languages:** Kazakh (kk), Russian (ru), English (en) |
|
- **Base Model:** google-bert/bert-base-uncased |
|
- **Pipeline Tag:** fill-mask |
|
- **Tags:** pytorch, safetensors |
|
- **Library:** transformers |
|
- **Datasets:** |
|
- amandyk/kazakh_wiki_articles |
|
- Eraly-ml/kk-cc-data |
|
- **Direct Use:** ✅ |
|
- **Widget Example:** |
|
`"KazBERT қазақ тілін [MASK] түсінеді."` |
|
|
|
</details> |
|
|
|
## <span style="color:#4CAF50;">Model Overview</span>
|
|
|
**KazBERT** is a BERT-based model fine-tuned specifically for Kazakh using Masked Language Modeling (MLM). It is based on `bert-base-uncased` and uses a custom WordPiece tokenizer trained on Kazakh text.
|
|
|
### <span style="color:#4CAF50;">Model Details</span> |
|
|
|
- **Architecture:** BERT |
|
- **Tokenizer:** WordPiece trained on Kazakh |
|
- **Training Data:** Kazakh Wikipedia & Common Crawl |
|
- **Method:** Masked Language Modeling (MLM) |
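
The same objective can be set up with the standard 🤗 Transformers MLM components. The sketch below is a hypothetical reconstruction (the actual training script is not part of this repository), assuming the standard 15% masking ratio:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Custom Kazakh WordPiece tokenizer shipped with this repository
tokenizer = BertTokenizerFast.from_pretrained("Eraly-ml/KazBERT")

# Start from the English base model and resize its embedding matrix
# to fit the new Kazakh vocabulary
model = BertForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))

# Dynamic token masking for MLM (0.15 is the standard BERT ratio;
# the exact value used for KazBERT is an assumption)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```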
|
|
|
**Erlanulu, Y. G. (2025). KazBERT: A Custom BERT Model for the Kazakh Language. Zenodo.** |
|
|
|
📄 [Read the paper](https://doi.org/10.5281/zenodo.15565394) |
|
|
|
--- |
|
|
|
|
|
## <span style="color:#4CAF50;">Files in Repository</span>
|
|
|
- `config.json` – Model config |
|
- `model.safetensors` – Model weights |
|
- `tokenizer.json` – Tokenizer data |
|
- `tokenizer_config.json` – Tokenizer config |
|
- `special_tokens_map.json` – Special tokens |
|
- `vocab.txt` – Vocabulary |
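
To inspect or download these files programmatically, the `huggingface_hub` client can be used (a minimal sketch; requires `pip install huggingface_hub`):

```python
from huggingface_hub import list_repo_files

# Print every file hosted in the KazBERT model repository
for filename in list_repo_files("Eraly-ml/KazBERT"):
    print(filename)
```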
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Training Configuration</span>
|
|
|
- **Epochs:** 20 |
|
- **Batch size:** 16 |
|
- **Learning rate:** Default |
|
- **Weight decay:** 0.01 |
|
- **FP16 Training:** Enabled |
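
Expressed as 🤗 Transformers `TrainingArguments`, the configuration corresponds roughly to the sketch below; the output path is a placeholder, and "Default" learning rate is read as the `Trainer` default of 5e-5:

```python
from transformers import TrainingArguments

# Hedged reconstruction of the reported hyperparameters
training_args = TrainingArguments(
    output_dir="./kazbert-mlm",      # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    weight_decay=0.01,
    fp16=True,                       # mixed-precision training
    # learning_rate left unset -> Trainer default (5e-5)
)
```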
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Usage</span>
|
|
|
Install 🤗 Transformers and load the model: |
|
|
|
```python |
|
from transformers import BertForMaskedLM, BertTokenizerFast |
|
|
|
model_name = "Eraly-ml/KazBERT" |
|
tokenizer = BertTokenizerFast.from_pretrained(model_name) |
|
model = BertForMaskedLM.from_pretrained(model_name) |
|
```
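
Because the tokenizer was trained on Kazakh text, Kazakh words split into far fewer subword pieces than they would under the original `bert-base-uncased` vocabulary. A quick check, continuing from the snippet above:

```python
text = "KazBERT қазақ тілін жетік түсінеді."
print(tokenizer.tokenize(text))  # Kazakh words map to compact WordPiece units
```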
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Example: Masked Token Prediction</span>
|
|
|
```python |
|
from transformers import pipeline |
|
|
|
pipe = pipeline("fill-mask", model="Eraly-ml/KazBERT") |
|
output = pipe('KazBERT қазақ тілін [MASK] түсінеді.') |
|
``` |
|
|
|
**Output:** |
|
|
|
```json |
|
[ |
|
{"score": 0.198, "token_str": "жетік", "sequence": "KazBERT қазақ тілін жетік түсінеді."}, |
|
{"score": 0.038, "token_str": "де", "sequence": "KazBERT қазақ тілін де түсінеді."}, |
|
{"score": 0.032, "token_str": "терең", "sequence": "KazBERT қазақ тілін терең түсінеді."}, |
|
{"score": 0.029, "token_str": "ерте", "sequence": "KazBERT қазақ тілін ерте түсінеді."}, |
|
{"score": 0.026, "token_str": "жете", "sequence": "KazBERT қазақ тілін жете түсінеді."} |
|
] |
|
``` |
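
Each entry also carries a numeric `token` id (omitted above for brevity). Taking just the top-scoring candidate:

```python
best = output[0]         # predictions are sorted by score, highest first
print(best["sequence"])  # KazBERT қазақ тілін жетік түсінеді.
```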
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Bias and Limitations</span>
|
|
|
- Trained only on public Kazakh Wikipedia and Common Crawl data

- May miss informal speech and regional dialects

- May underperform on rare words or inputs requiring deep contextual understanding

- May reflect cultural or social biases present in the training data
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">License</span>
|
|
|
This model is released under the Apache 2.0 license.
|
|
|
--- |
|
|
|
## <span style="color:#4CAF50;">Citation</span>
|
|
|
```bibtex |
|
@misc{eraly_gainulla_2025, |
|
author = { Eraly Gainulla }, |
|
title = { KazBERT (Revision 15240d4) }, |
|
year = 2025, |
|
url = { https://huggingface.co/Eraly-ml/KazBERT }, |
|
doi = { 10.57967/hf/5271 }, |
|
publisher = { Hugging Face } |
|
} |
|
``` |