---
license: apache-2.0
datasets:
- Arailym-tleubayeva/KazakhTextDuplicates
language:
- kk
- en
- ru
base_model:
- Eraly-ml/KazBERT
pipeline_tag: sentence-similarity
library_name: transformers
tags:
- pytorch
- Accelerate
---

# Eraly-ml/KazBERT-Duplicates-BETA_TEST

KazBERT-Duplicates is a Kazakh language model fine-tuned to classify the type of textual duplication between sentence pairs. It predicts whether two sentences are **exact**, **partial**, **paraphrase**, or **contextual** duplicates.

## Model Description

* **Base Model**: [KazBERT (BERT-based)](https://huggingface.co/Eraly-ml/KazBERT)
* **Language**: Kazakh 🇰🇿
* **Task**: Sentence Pair Classification (Duplicate Detection)
* **Labels**:
  * `exact`: Sentences are identical
  * `partial`: One sentence partially overlaps with the other
  * `paraphrase`: Sentences convey the same meaning in different wording
  * `contextual`: Sentences are topically similar but semantically different

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model = AutoModelForSequenceClassification.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

nlp = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    return_all_scores=True,  # return scores for all classes (newer transformers versions use top_k=None)
    device=0,  # remove if not using GPU
    batch_size=2,
)

examples = [
    {"text": "Менің атым Ералы", "text_pair": "Менің есімім — Ералы"},
    {"text": "Бүгін ауа‑райы жақсы", "text_pair": "Кеше жаңбыр жауды"},
]

results = nlp(examples, truncation=True, padding=True, max_length=512)

for i, (ex, res) in enumerate(zip(examples, results), 1):
    print(f"\n[{i}] \"{ex['text']}\" ↔ \"{ex['text_pair']}\"")
    top = max(res, key=lambda x: x['score'])
    print(f"  → Top prediction: **{top['label']}** ({top['score']:.2%})")
    print("  All scores:")
    for r in res:
        print(f"    - {r['label']}: {r['score']:.2%}")
```

Output:

```text
[1] "Менің атым Ералы" ↔ "Менің есімім — Ералы"
  → Top prediction: **partial** (55.15%)
  All scores:
    - contextual: 2.04%
    - exact: 6.70%
    - paraphrase: 36.11%
    - partial: 55.15%

[2] "Бүгін ауа‑райы жақсы" ↔ "Кеше жаңбыр жауды"
  → Top prediction: **contextual** (99.59%)
  All scores:
    - contextual: 99.59%
    - exact: 0.01%
    - paraphrase: 0.04%
    - partial: 0.36%
```

---

## Evaluation Metrics

The model was evaluated on a held-out test set using macro-averaged metrics:

| Metric           | Value  | Description                          |
| ---------------- | ------ | ------------------------------------ |
| `eval_loss`      | 0.21   | Cross-entropy loss on the test set   |
| `eval_accuracy`  | 91.05% | Overall classification accuracy      |
| `eval_f1`        | 91.05% | Macro-averaged F1 score              |
| `eval_precision` | 92.36% | Macro-averaged precision             |
| `eval_recall`    | 91.21% | Macro-averaged recall                |
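
These figures can be reproduced with a minimal sketch along the following lines. The split and column names (`test`, `text1`, `text2`, `label`) are assumptions about the dataset schema, and the gold labels are assumed to be stored as strings matching the model's `id2label`; adjust them to the actual dataset fields.

```python
# Minimal evaluation sketch; the split/column names below are assumptions, not the exact evaluation code.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import pipeline

clf = pipeline(
    task="text-classification",
    model="Eraly-ml/KazBERT-Duplicates",
    batch_size=32,
)

# Assumed schema: a "test" split with "text1", "text2", and a string "label" column.
ds = load_dataset("Arailym-tleubayeva/KazakhTextDuplicates", split="test")
inputs = [{"text": r["text1"], "text_pair": r["text2"]} for r in ds]

preds = [p["label"] for p in clf(inputs, truncation=True, max_length=512)]
gold = [r["label"] for r in ds]

precision, recall, f1, _ = precision_recall_fscore_support(gold, preds, average="macro")
print(f"accuracy={accuracy_score(gold, preds):.4f}")
print(f"macro_f1={f1:.4f}  macro_precision={precision:.4f}  macro_recall={recall:.4f}")
```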

---

## Training Details

* **Framework**: Hugging Face Transformers + Accelerate
* **Dataset**: [KazakhTextDuplicates](https://huggingface.co/datasets/Arailym-tleubayeva/KazakhTextDuplicates)
* **Batch Size**: 16
* **Epochs**: 6
* **Learning Rate**: 2e-5
* **Optimizer**: AdamW
* **Max Seq Length**: 512
* **Loss Function**: CrossEntropyLoss

Training was launched with multi-GPU support via Accelerate's notebook launcher:

```python
from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=2)
```

---

## Intended Uses

* Kazakh duplicate sentence classification
* Plagiarism detection in Kazakh
* NLP pre-processing for deduplication tasks

---

## Limitations

* Limited to the Kazakh language
* May not generalize well to domain-specific text (e.g., legal or medical)
* Sensitive to long or noisy inputs

---

## Contact

For questions or collaborations, reach out via the [Home Page](https://eraly-ml.github.io/).