astromis/presuisidal_rubert

Text Classification

astromis committed on Jan 5, 2024

Commit d155313 · 1 Parent(s): 1c4c4f0

Update README.md

Files changed (1)

README.md +76 -1

@@ -11,4 +11,79 @@ pipeline_tag: text-classification
 tags:
 - russian
 - suicide
----

+---
+# Presuicidal RuBERT base
+A [ruBert](https://huggingface.co/ai-forever/ruBert-base) model fine-tuned on the presuicidal dataset. It aims to help psychologists find texts that contain useful information about a person's suicidal behavior.
+The model distinguishes two categories:
+* category 1 - texts with useful information about a person's suicidal behavior, such as attempted or actual rape, problems with parents, a stay in a psychiatric hospital, self-harm, etc. This category also includes messages that express a subjective negative attitude towards oneself or others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, or rage directed at oneself or others.
+* category 0 - normal texts that don't contain the above-mentioned information.
+# How to use
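+```python
+import torch
+from transformers import AutoTokenizer, BertForSequenceClassification
+
+tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert")
+model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert")
+model.eval()  # switch off dropout for inference
+
+# English: "I feel so bad, I want to die", "yesterday I was at a meetup with friends, it was really cool"
+text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"]
+tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
+
+with torch.no_grad():
+    prediction = model(**tokenized_text).logits
+
+# The index of the largest logit is the predicted category (1 = useful information, 0 = normal text)
+print(prediction.argmax(dim=1).tolist())
+# >>> [1, 0]
+```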
+# Training procedure
+## Data preprocessing
+Before training, the text was preprocessed as follows:
+* all emojis were removed. In the dataset, they are marked as `<emoji>emoji_name</emoji>`;
+* the punctuation was removed;
+* the text was lowercased;
+* all line breaks were replaced with spaces;
+* runs of several spaces were collapsed into one.
+As the dataset is heavily imbalanced, the normal texts in the training split were randomly downsampled to 22% of their original number.
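The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's original code: the function names `preprocess` and `downsample_normal` are hypothetical, punctuation is assumed to mean ASCII punctuation only, and the final `strip()` is an extra convenience that the list above does not mention.

```python
import random
import re
import string

# Emojis are marked in the dataset as <emoji>emoji_name</emoji>.
EMOJI_TAG = re.compile(r"<emoji>.*?</emoji>")
# Assumption: "punctuation" means ASCII punctuation only.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text: str) -> str:
    """Apply the listed steps in order (hypothetical helper)."""
    text = EMOJI_TAG.sub(" ", text)          # remove emoji markup first, before "<" and ">" are stripped
    text = text.translate(PUNCT_TABLE)       # remove punctuation
    text = text.lower()                      # lowercase
    text = text.replace("\r", " ").replace("\n", " ")  # line breaks -> spaces
    text = re.sub(r" {2,}", " ", text)       # collapse runs of spaces
    return text.strip()                      # extra step, not in the original list


def downsample_normal(texts, labels, keep_fraction=0.22, seed=0):
    """Keep a random `keep_fraction` of category-0 texts and all other texts."""
    rng = random.Random(seed)
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    keep = set(rng.sample(normal_idx, round(len(normal_idx) * keep_fraction)))
    kept = [i for i, y in enumerate(labels) if y != 0 or i in keep]
    return [texts[i] for i in kept], [labels[i] for i in kept]
```

Because the model was trained on text normalized this way, applying the same normalization to new inputs before tokenization is likely to matter for prediction quality.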
+## Training
+The training was done with the `Trainer` class using the following parameters:
+```python
+TrainingArguments(
+    evaluation_strategy="epoch",
+    per_device_train_batch_size=16,
+    per_device_eval_batch_size=32,
+    learning_rate=1e-5,
+    num_train_epochs=5,
+    weight_decay=1e-3,
+    load_best_model_at_end=True,
+    save_strategy="epoch",
+)
+```
+# Metrics
+| F1-micro | F1-macro | F1-weighted |
+|----------|----------|-------------|
+| 0.811926 | 0.726722 | 0.831000 |
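For readers unfamiliar with the three averages, here is a minimal pure-Python sketch of their standard definitions. The README does not state which tool or data split produced the numbers above, so the function name `f1_averages` and the toy labels are illustrative assumptions only.

```python
def f1_averages(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class_f1, support = {}, {}
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1[c] = 2 * tp / denom if denom else 0.0
        support[c] = sum(1 for t in y_true if t == c)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    # Micro: pool the counts of all classes, then compute a single F1.
    micro = 2 * tp_sum / micro_denom if micro_denom else 0.0
    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(per_class_f1.values()) / len(classes) if classes else 0.0
    # Weighted: per-class F1 weighted by the number of true samples per class.
    total = len(y_true)
    weighted = sum(per_class_f1[c] * support[c] for c in classes) / total if total else 0.0
    return micro, macro, weighted


# Toy, made-up labels with more category-0 than category-1 samples.
micro, macro, weighted = f1_averages([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
print(micro, macro, weighted)
```

On imbalanced data, the macro average gives the rarer category the same weight as the frequent one, so a macro score lower than the micro and weighted scores usually means the model does worse on the rarer category.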
+# Citation
+```bibtex
+@article {Buyanov2022TheDF,
+title={The dataset for presuicidal signals detection in text and its analysis},
+author={Igor Buyanov and Ilya Sochenkov},
+journal={Computational Linguistics and Intellectual Technologies},
+year={2022},
+month={June},
+number={21},
+pages={81--92},
+url={https://api.semanticscholar.org/CorpusID:253195162},
+}
+```