---
license: apache-2.0
datasets:
- Arailym-tleubayeva/KazakhTextDuplicates
language:
- kk
- en
- ru
base_model:
- Eraly-ml/KazBERT
pipeline_tag: sentence-similarity
library_name: transformers
tags:
- pytorch
- Accelerate
---

# Eraly-ml/KazBERT-Duplicates-BETA_TEST

KazBERT-Duplicates is a Kazakh language model fine-tuned to classify the type of textual duplication between sentence pairs. It predicts whether two sentences are **exact**, **partial**, **paraphrase**, or **contextual** duplicates.

## Model Description

* **Base Model**: [KazBERT (BERT-based)](https://huggingface.co/Eraly-ml/KazBERT)
* **Language**: Kazakh 🇰🇿
* **Task**: Sentence Pair Classification (Duplicate Detection)
* **Labels**:
  * `exact`: Sentences are identical
  * `partial`: One sentence partially overlaps with the other
  * `paraphrase`: Sentences convey the same meaning in different wording
  * `contextual`: Sentences are topically similar but semantically different

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model = AutoModelForSequenceClassification.from_pretrained("Eraly-ml/KazBERT-Duplicates")
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

nlp = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    return_all_scores=True,  # return scores for all classes (newer transformers versions use top_k=None)
    device=0,  # remove if not using GPU
    batch_size=2,
)

examples = [
    {"text": "Менің атым Ералы", "text_pair": "Менің есімім — Ералы"},
    {"text": "Бүгін ауа‑райы жақсы", "text_pair": "Кеше жаңбыр жауды"},
]

results = nlp(examples, truncation=True, padding=True, max_length=512)

for i, (ex, res) in enumerate(zip(examples, results), 1):
    print(f"\n[{i}] \"{ex['text']}\" ↔ \"{ex['text_pair']}\"")
    top = max(res, key=lambda x: x['score'])
    print(f"  → Top prediction: **{top['label']}** ({top['score']:.2%})")
    print("  All scores:")
    for r in res:
        print(f"    - {r['label']}: {r['score']:.2%}")
```

Output:

```text
[1] "Менің атым Ералы" ↔ "Менің есімім — Ералы"
  → Top prediction: **partial** (55.15%)
  All scores:
    - contextual: 2.04%
    - exact: 6.70%
    - paraphrase: 36.11%
    - partial: 55.15%

[2] "Бүгін ауа‑райы жақсы" ↔ "Кеше жаңбыр жауды"
  → Top prediction: **contextual** (99.59%)
  All scores:
    - contextual: 99.59%
    - exact: 0.01%
    - paraphrase: 0.04%
    - partial: 0.36%
```

---

## Evaluation Metrics

The model was evaluated on a held-out test set using macro-averaged metrics:

| Metric           | Value  | Description                          |
| ---------------- | ------ | ------------------------------------ |
| `eval_loss`      | 0.21   | Cross-entropy loss on the test set   |
| `eval_accuracy`  | 91.05% | Overall classification accuracy      |
| `eval_f1`        | 91.05% | Macro-averaged F1 score              |
| `eval_precision` | 92.36% | Macro-averaged precision             |
| `eval_recall`    | 91.21% | Macro-averaged recall                |
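
These figures can be reproduced with a minimal sketch along the following lines. The split and column names (`test`, `text1`, `text2`, `label`) are assumptions about the dataset schema, and the gold labels are assumed to be stored as strings matching the model's `id2label`; adjust them to the actual dataset fields.

```python
# Minimal evaluation sketch; the split/column names below are assumptions, not the exact evaluation code.
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import pipeline

clf = pipeline(
    task="text-classification",
    model="Eraly-ml/KazBERT-Duplicates",
    batch_size=32,
)

# Assumed schema: a "test" split with "text1", "text2", and a string "label" column.
ds = load_dataset("Arailym-tleubayeva/KazakhTextDuplicates", split="test")
inputs = [{"text": r["text1"], "text_pair": r["text2"]} for r in ds]

preds = [p["label"] for p in clf(inputs, truncation=True, max_length=512)]
gold = [r["label"] for r in ds]

precision, recall, f1, _ = precision_recall_fscore_support(gold, preds, average="macro")
print(f"accuracy={accuracy_score(gold, preds):.4f}")
print(f"macro_f1={f1:.4f}  macro_precision={precision:.4f}  macro_recall={recall:.4f}")
```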

---

## Training Details

* **Framework**: Hugging Face Transformers + Accelerate
* **Dataset**: [KazakhTextDuplicates](https://huggingface.co/datasets/Arailym-tleubayeva/KazakhTextDuplicates)
* **Batch Size**: 16
* **Epochs**: 6
* **Learning Rate**: 2e-5
* **Optimizer**: AdamW
* **Max Seq Length**: 512
* **Loss Function**: CrossEntropyLoss

Training was launched with multi-GPU support via Accelerate's notebook launcher:

```python
from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=2)
```

---

## Intended Uses

* Kazakh duplicate sentence classification
* Plagiarism detection in Kazakh
* NLP pre-processing for deduplication tasks

---

## Limitations

* Limited to the Kazakh language
* May not generalize well to domain-specific text (e.g., legal or medical)
* Sensitive to long or noisy inputs

---

## Contact

For questions or collaborations, reach out via the [Home Page](https://eraly-ml.github.io/).