Update README.md
README.md
@@ -15,52 +15,61 @@ pipeline_tag: text-classification
tags:
- spam
- detection
- classification
- russian
library_name: transformers
---
# russian_spam_detector

The **russian_spam_detector** model is designed for binary classification of texts into two categories:
- **LABEL_0**: spam message
- **LABEL_1**: normal message (not spam)

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "corall88/russian_spam_detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

detector = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Sample spam text: "Congratulations! You have won 1000000 rubles, follow the link - ..."
message = "Поздравляем! Вы выиграли 1000000 рублей, пройдите по ссылке - ..."
prediction = detector(message)
print(prediction)  # a list with one dict holding "label" (LABEL_0 or LABEL_1) and "score"
```

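The pipeline reports the generic label names, so downstream code usually maps them to readable ones. A minimal sketch, assuming the output has the usual `text-classification` shape (a list of dicts with `label` and `score`); the helper name `describe_prediction` and the readable names `spam` / `not_spam` are illustrative and not part of the model:

```python
# Mapping follows the LABEL_0 / LABEL_1 description in this README;
# the readable names are illustrative.
LABEL_NAMES = {"LABEL_0": "spam", "LABEL_1": "not_spam"}


def describe_prediction(prediction):
    """Convert pipeline output into (readable_label, score) tuples.

    `prediction` is assumed to be a list of dicts with "label" and "score" keys.
    """
    return [(LABEL_NAMES[item["label"]], item["score"]) for item in prediction]


# Hand-written sample output (not a real model result), used only to show the shape.
sample = [{"label": "LABEL_0", "score": 0.98}]
print(describe_prediction(sample))
```

In real use, `sample` would be replaced by the value returned from `detector(message)`.
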
## 📊 Dataset

The model was fine-tuned on the [alt-gnome/telegram-spam](https://huggingface.co/datasets/alt-gnome/telegram-spam) dataset of spam messages.

## 🧠 Architecture

The model is based on **[RuModernBERT-base](https://huggingface.co/ModernBERT-base)** and fine-tuned for binary classification.

## ⚙️ Training parameters

- **Epochs**: 4
- **Batch size**: 16
- **Optimizer**: AdamW
- **Learning rate**: 2e-5
- **Loss**: CrossEntropyLoss
- **Max sequence length**: 256

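To make these hyperparameters concrete, the sketch below collects them in a plain dataclass and computes how many optimizer steps they imply for a given training-set size. The `TrainingConfig` class and the dataset size of 10,000 are hypothetical (the README does not state the number of training examples), and no gradient accumulation or multi-device training is assumed:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters listed in this README; the class itself is illustrative."""
    epochs: int = 4
    batch_size: int = 16
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    loss: str = "CrossEntropyLoss"
    max_seq_length: int = 256


def total_optimizer_steps(config, num_examples):
    """Steps over all epochs, counting the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(num_examples / config.batch_size)
    return steps_per_epoch * config.epochs


config = TrainingConfig()
num_examples = 10_000  # hypothetical training-set size
print(total_optimizer_steps(config, num_examples))
```
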
## 📈 Results

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.99  |
| F1-score  | 0.99  |
| Precision | 0.99  |
| Recall    | 0.99  |

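For reference, the four metrics are related through confusion-matrix counts. A minimal sketch, assuming spam (`LABEL_0`) is treated as the positive class; the README does not say which class or averaging scheme was used, and the counts below are made up for illustration rather than taken from the actual evaluation:

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 99 spam caught, 1 false alarm, 1 spam missed, 99 normal passed.
metrics = binary_metrics(tp=99, fp=1, fn=1, tn=99)
print({name: round(value, 2) for name, value in metrics.items()})
```
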
```
@misc{russian_spam_detector,
  title={russian_spam_detector: modern model for spam detection},
  author={corall88},
  url={https://huggingface.co/corall88/russian_spam_detector},
  publisher={Hugging Face},
  year={2025},
}
```