astromis committed
Commit d155313 · 1 Parent(s): 1c4c4f0

Update README.md

  1. README.md +76 -1
README.md CHANGED
@@ -11,4 +11,79 @@ pipeline_tag: text-classification
  tags:
  - russian
  - suicide
- ---
+ ---
+
+ # Presuicidal RuBERT base
+
+ A [ruBert](https://huggingface.co/ai-forever/ruBert-base) model fine-tuned on the presuicidal dataset. It aims to help psychologists find texts that contain useful information about a person's suicidal behavior.
+
+ The model distinguishes two categories:
+ * category 1 - texts with useful information about a person's suicidal behavior, such as attempts and facts of rape, problems with parents, a stay in a psychiatric hospital, facts of self-harm, etc. This category also includes messages that display a subjective negative attitude towards oneself or others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, and rage directed at oneself or others.
+ * category 0 - normal texts that do not contain the above-mentioned information.
+
+ # How to use
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, BertForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert")
+ model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert")
+ model.eval()
+
+ # English: "I feel so bad, I want to die"; "yesterday I was at a meetup with friends, it was really cool"
+ text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"]
+
+ tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
+
+ # Inference only, so gradient tracking is disabled
+ with torch.no_grad():
+     prediction = model(**tokenized_text).logits
+ print(prediction.argmax(dim=1).numpy())
+ # >>> [1 0]
+ ```
+
+ # Training procedure
+
+ ## Data preprocessing
+
+ Before training, the text was transformed as follows:
+ * all emojis were removed. In the dataset, they are marked as `<emoji>emoji_name</emoji>`;
+ * punctuation was removed;
+ * the text was lowercased;
+ * all line breaks were replaced with spaces;
+ * runs of multiple spaces were collapsed into one.
+
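The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function name `preprocess`, the exact punctuation set (ASCII punctuation plus the guillemets `«»`), replacing emoji markup with a space rather than an empty string, and the final `strip()` are all assumptions.

```python
import re
import string

# Assumption: emojis appear as <emoji>name</emoji> spans, as the card describes.
EMOJI_RE = re.compile(r"<emoji>.*?</emoji>")
# Assumption: "punctuation" means ASCII punctuation plus Russian guillemets.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + "«»") + "]")


def preprocess(text: str) -> str:
    """Hypothetical re-implementation of the preprocessing list above."""
    text = EMOJI_RE.sub(" ", text)                      # remove emoji markup
    text = PUNCT_RE.sub("", text)                       # remove punctuation
    text = text.lower()                                 # lowercase
    text = text.replace("\r", " ").replace("\n", " ")   # line breaks -> spaces
    text = re.sub(r" {2,}", " ", text)                  # collapse repeated spaces
    return text.strip()                                 # assumption: trim the ends
```

For example, `preprocess("Привет!\nМне <emoji>smile</emoji>  ПЛОХО...")` returns `"привет мне плохо"`. Emoji markup is removed before punctuation so that the angle brackets are still present when the emoji pattern is matched.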
+ Because the dataset is heavily imbalanced, the normal texts in the training split were randomly downsampled to 22% of their original volume.
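One way such a downsampling step might look is sketched below. The helper name `downsample_normal`, the fixed seed, and the use of rounding to pick the sample count are assumptions made for illustration; the card only states that 22% of the normal training texts were kept.

```python
import random


def downsample_normal(texts, labels, keep_fraction=0.22, seed=0):
    """Keep every category-1 sample and a random share of category-0 samples.

    Hypothetical helper: keep_fraction=0.22 mirrors the 22% stated in the card,
    the seed value is arbitrary.
    """
    rng = random.Random(seed)
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = round(len(normal_idx) * keep_fraction)
    kept = set(rng.sample(normal_idx, n_keep))
    # Preserve the original order of the surviving samples
    pairs = [
        (t, y)
        for i, (t, y) in enumerate(zip(texts, labels))
        if y == 1 or i in kept
    ]
    return [t for t, _ in pairs], [y for _, y in pairs]
```

Only the training split would be passed through such a function; the evaluation data keeps its natural class balance.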
+
+ ## Training
+
+ Training was done with the `Trainer` class, using the following parameters:
+ ```
+ TrainingArguments(evaluation_strategy="epoch",
+                   per_device_train_batch_size=16,
+                   per_device_eval_batch_size=32,
+                   learning_rate=1e-5,
+                   num_train_epochs=5,
+                   weight_decay=1e-3,
+                   load_best_model_at_end=True,
+                   save_strategy="epoch")
+ ```
+
+ # Metrics
+
+ | F1-micro | F1-macro | F1-weighted |
+ |----------|----------|-------------|
+ | 0.811926 | 0.726722 | 0.831000 |
+
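For readers unfamiliar with the three averages in the table, the sketch below shows how they differ. It assumes the conventional definitions (micro pools the counts over classes, macro takes the unweighted mean of per-class F1, weighted takes the support-weighted mean); the card does not say which implementation produced the reported numbers, and the function name `f1_averages` is hypothetical.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return (micro, macro, weighted) F1 under the conventional definitions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return micro, macro, weighted


# Toy labels for illustration only, not taken from the dataset
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
micro, macro, weighted = f1_averages(y_true, y_pred)
```

On imbalanced data the macro average is pulled down by the weaker minority class, which is one plausible reading of why F1-macro is the lowest of the three reported values.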
+ # Citation
+
+ ```bibtex
+ @article{Buyanov2022TheDF,
+   title={The dataset for presuicidal signals detection in text and its analysis},
+   author={Igor Buyanov and Ilya Sochenkov},
+   journal={Computational Linguistics and Intellectual Technologies},
+   year={2022},
+   month={June},
+   number={21},
+   pages={81--92},
+   url={https://api.semanticscholar.org/CorpusID:253195162},
+ }
+ ```