Update README.md
README.md
CHANGED
@@ -11,4 +11,79 @@ pipeline_tag: text-classification
tags:
- russian
- suicide
---

# Presuicidal RuBERT base

A [ruBert](https://huggingface.co/ai-forever/ruBert-base) model fine-tuned on the presuicidal dataset. It aims to help psychologists find texts that contain useful information about a person's suicidal behavior.

The model distinguishes two categories:
* category 1: texts with useful information about a person's suicidal behavior, such as attempts and incidents of rape, problems with parents, a stay in a psychiatric hospital, self-harm, etc. This category also includes messages that express a subjective negative attitude towards oneself or others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, or rage directed at oneself or others.
* category 0: normal texts that do not contain the information described above.

# How to use

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert")
model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert")
model.eval()  # inference mode: disables dropout

text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"]

tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    prediction = model(**tokenized_text).logits

# The index of the largest logit is the predicted category
print(prediction.argmax(dim=1).numpy())
# >>> [1 0]
```

# Training procedure

## Data preprocessing

Before training, the text was transformed as follows:
* all emojis were removed (in the dataset they are marked as `<emoji>emoja_name</emoji>`);
* punctuation was removed;
* the text was lowercased;
* all line breaks were replaced with spaces;
* runs of several spaces were collapsed into one.
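
The steps above could be sketched roughly as follows. This is a minimal illustration, not the original preprocessing code: the function name `preprocess`, the exact punctuation set, and the choice to delete punctuation rather than replace it with a space are all assumptions.

```python
import re
import string

# Emoji markup as described in the list above: <emoji>name</emoji>
EMOJI_TAG = re.compile(r"<emoji>.*?</emoji>")
# Assumption: ASCII punctuation plus a few marks common in Russian text
PUNCT = re.compile("[" + re.escape(string.punctuation + "«»…") + "]")


def preprocess(text: str) -> str:
    """Hypothetical helper that mirrors the listed preprocessing steps."""
    text = EMOJI_TAG.sub("", text)        # remove emojis (before punctuation, so the tags are still intact)
    text = PUNCT.sub("", text)            # remove punctuation
    text = text.lower()                   # lowercase
    text = re.sub(r"[\r\n]+", " ", text)  # line breaks -> spaces
    text = re.sub(r" {2,}", " ", text)    # collapse repeated spaces
    return text


print(preprocess("Мне так ПЛОХО,\nя хочу умереть<emoji>cry</emoji>!!!"))
```

Applying the same transformation to input text at inference time keeps it close to what the model saw during training.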

Because the dataset is heavily imbalanced, the normal texts in the training split were randomly downsampled to 22% of their original volume.
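
A downsampling step of this kind might look like the sketch below. It assumes the training split is a list of `(text, label)` pairs with label `0` for normal texts; the function name `downsample_normal` and the fixed seed are illustrative and are not taken from the original training code.

```python
import random


def downsample_normal(samples, keep_fraction=0.22, seed=0):
    """Keep all category-1 samples and a random fraction of category-0 samples.

    `samples` is assumed to be a list of (text, label) pairs.
    """
    rng = random.Random(seed)
    normal = [s for s in samples if s[1] == 0]
    presuicidal = [s for s in samples if s[1] == 1]
    kept_normal = rng.sample(normal, round(len(normal) * keep_fraction))
    result = presuicidal + kept_normal
    rng.shuffle(result)  # avoid having the two classes in separate blocks
    return result


train = [(f"normal {i}", 0) for i in range(100)] + [(f"signal {i}", 1) for i in range(10)]
balanced = downsample_normal(train)
print(len(balanced))
```

Only the training split is downsampled here, so evaluation data would keep its original class distribution.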

## Training

The model was trained with the `Trainer` class using the following parameters:
```python
TrainingArguments(evaluation_strategy="epoch",
                  per_device_train_batch_size=16,
                  per_device_eval_batch_size=32,
                  learning_rate=1e-5,
                  num_train_epochs=5,
                  weight_decay=1e-3,
                  load_best_model_at_end=True,
                  save_strategy="epoch")
```

# Metrics

| F1-micro | F1-macro | F1-weighted |
|----------|----------|-------------|
| 0.811926 | 0.726722 | 0.831000    |
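
For readers unfamiliar with the three averaging schemes, the sketch below shows how they differ on a toy example. It is not the evaluation code used for the table above; the function name `f1_scores` and the toy labels are made up for illustration. Libraries such as scikit-learn offer the same averages through the `average` parameter of `f1_score`.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for two label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)  # pooled counts over all classes
    macro = sum(per_class.values()) / len(labels)        # unweighted mean of per-class F1
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)  # mean weighted by support
    return micro, macro, weighted


micro, macro, weighted = f1_scores([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1])
print(f"{micro:.4f} {macro:.4f} {weighted:.4f}")
```

On imbalanced data the macro average gives the minority class the same weight as the majority class, which is why it can be noticeably lower than the micro and weighted averages.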

# Citation

```bibtex
@article{Buyanov2022TheDF,
  title={The dataset for presuicidal signals detection in text and its analysis},
  author={Igor Buyanov and Ilya Sochenkov},
  journal={Computational Linguistics and Intellectual Technologies},
  year={2022},
  month={June},
  number={21},
  pages={81--92},
  url={https://api.semanticscholar.org/CorpusID:253195162},
}
```