---
datasets:
- Paul/hatecheck-spanish
language:
- es
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: text-classification
library_name: transformers
tags:
- bert
- transformer
- beto
- sequence-classification
- text-classification
- hate-speech-detection
- sentiment-analysis
- spanish
- nlp
- content-moderation
- social-media-analysis
- fine-tuned
---
# IMPORTANT: Update and Model Information
A newer, enhanced version of this model is now available: [HateSpeech-BETO-cased-v2](https://huggingface.co/delarosajav95/HateSpeech-BETO-cased-v2). The updated version offers significantly improved performance and addresses real-world challenges with greater accuracy and reliability.
**We strongly recommend using the newer fine-tuned version**, which has been specifically optimized to better detect nuanced forms of hate speech in Spanish.
Explore the improved model here: 👉 [HateSpeech-BETO-cased-v2](https://huggingface.co/delarosajav95/HateSpeech-BETO-cased-v2).
# HateSpeech-BETO-cased-v1
This model is a fine-tuned version of [dccuchile/bert-base-spanish-wwm-cased](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) for hate speech detection related to racism, homophobia, sexism, and other forms of discrimination in Spanish text.
It was trained on the [Paul/hatecheck-spanish](https://huggingface.co/Paul/hatecheck-spanish) dataset.
The model performs binary text classification with two labels: *hateful* and *non-hateful*.
## Full classification example:
```python
from transformers import pipeline

pipe = pipeline(model="delarosajav95/HateSpeech-BETO-cased-v1")

inputs = [
    "El camarero a cargo del bar era demasiado afeminado, se notaba que la manera de moverse no era la propia de un hombre.",
    "La manera en que me miraba tan desagradable era porque era una persona del sur.",
    "Estoy cansado de que las mujeres se pongan falda y vayan provocando por ahí a toda persona normal y corriente.",
]

# return_all_scores=True returns the probability of both labels for each input
result = pipe(inputs, return_all_scores=True)

# Map the raw model labels to human-readable names
label_mapping = {"LABEL_0": "Hateful", "LABEL_1": "Non-hateful"}

for i, predictions in enumerate(result):
    print("==================================")
    print(f"Text {i + 1}: {inputs[i]}")
    for pred in predictions:
        label = label_mapping.get(pred["label"], pred["label"])
        score = pred["score"]
        print(f"{label}: {score:.2%}")
```
Output:
```
==================================
Text 1: El camarero a cargo del bar era demasiado afeminado, se notaba que la manera de moverse no era la propia de un hombre.
Hateful: 99.98%
Non-hateful: 0.02%
==================================
Text 2: La manera en que me miraba tan desagradable era porque era una persona del sur.
Hateful: 99.98%
Non-hateful: 0.02%
==================================
Text 3: Estoy cansado de que las mujeres se pongan falda y vayan provocando por ahí a toda persona normal y corriente.
Hateful: 99.84%
Non-hateful: 0.16%
```
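For finer control over tokenization and batching, the same predictions can be reproduced without the `pipeline` helper. The following is a minimal sketch using the standard `AutoTokenizer`/`AutoModelForSequenceClassification` API (it is not part of the original card); it simply applies a softmax over the two logits.
```python
# Minimal sketch: the same prediction without the pipeline helper,
# using AutoTokenizer and AutoModelForSequenceClassification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "delarosajav95/HateSpeech-BETO-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

text = "La manera en que me miraba tan desagradable era porque era una persona del sur."
encoded = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**encoded).logits

# Softmax converts the two logits into probabilities for
# LABEL_0 (hateful) and LABEL_1 (non-hateful).
probs = torch.softmax(logits, dim=-1).squeeze()
label_mapping = {0: "Hateful", 1: "Non-hateful"}
for idx, prob in enumerate(probs):
    print(f"{label_mapping[idx]}: {prob:.2%}")
```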
## Metrics and results:
It achieves the following results on the *evaluation set* (last epoch):
- eval_loss: 0.03607647866010666
- eval_accuracy: 0.9933244325767691
- eval_precision_per_label: [1.0, 0.9905123339658444]
- eval_recall_per_label: [0.9779735682819384, 1.0]
- eval_f1_per_label: [0.9888641425389755, 0.9952335557673975]
- eval_precision_weighted: 0.9933877681310691
- eval_recall_weighted: 0.9933244325767691
- eval_f1_weighted: 0.9933031728530427
- eval_runtime: 1.7545
- eval_samples_per_second: 426.913
- eval_steps_per_second: 53.578
- epoch: 4.0
It achieves the following results on the *test set*:
- eval_loss: 0.052769944071769714
- eval_accuracy: 0.9933244325767691
- eval_precision_per_label: [0.9956140350877193, 0.9923224568138196]
- eval_recall_per_label: [0.9826839826839827, 0.9980694980694981]
- eval_f1_per_label: [0.9891067538126361, 0.9951876804619827]
- eval_precision_weighted: 0.9933376164683867
- eval_recall_weighted: 0.9933244325767691
- eval_f1_weighted: 0.993312254486016
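The per-label and weighted figures above follow the usual scikit-learn conventions (one value per label, and a support-weighted average). Below is a minimal sketch of a `compute_metrics` function that would produce metrics in this shape; the function name and its use with the `Trainer` are assumptions, not taken from the original training script.
```python
# Hypothetical compute_metrics sketch: produces per-label and weighted
# precision/recall/F1 in the same shape as the figures reported above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    # average=None yields one value per label (index 0 = hateful, 1 = non-hateful)
    p_lab, r_lab, f1_lab, _ = precision_recall_fscore_support(
        labels, preds, average=None, labels=[0, 1]
    )
    # average="weighted" weights each label's score by its support
    p_w, r_w, f1_w, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision_per_label": p_lab.tolist(),
        "recall_per_label": r_lab.tolist(),
        "f1_per_label": f1_lab.tolist(),
        "precision_weighted": p_w,
        "recall_weighted": r_w,
        "f1_weighted": f1_w,
    }
```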
## Training Details and Procedure
### Main Hyperparameters:
- evaluation_strategy: "epoch"
- learning_rate: 1e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- num_train_epochs: 4
- optimizer: AdamW
- weight_decay: 0.01
- save_strategy: "epoch"
- lr_scheduler_type: "linear"
- warmup_steps: 449
- logging_steps: 10
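These hyperparameters map directly onto `TrainingArguments`; the sketch below shows one way to express them, assuming the standard `Trainer` setup. The output directory is a placeholder, and AdamW is the `Trainer` default optimizer, so it is not set explicitly.
```python
# Illustrative TrainingArguments mirroring the hyperparameters listed above;
# the output directory is a placeholder, not the original training path.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hatespeech-beto-cased-v1",  # placeholder
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,  # applied by the default AdamW optimizer
    lr_scheduler_type="linear",
    warmup_steps=449,
    logging_steps=10,
    seed=42,
)
```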
#### Preprocessing and Postprocessing:
- The dataset was manually split into train (60%), validation (20%), and test (20%) sets (see the preprocessing sketch after this list).
- Seed = 42
- Number of labels = 2
- The dataset's string labels ("hateful", "non-hateful") were manually mapped to integers (0, 1) so that tensors could be created properly.
- Dynamic padding was applied through a data collator.
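A minimal sketch of how such a 60/20/20 split, label mapping, and dynamic padding can be set up with the `datasets` and `transformers` libraries is shown below; the split calls and the column names (`test_case`, `label_gold`) are assumptions about the Paul/hatecheck-spanish schema, not the original preprocessing script.
```python
# Hypothetical preprocessing sketch: 60/20/20 split, string-to-int label
# mapping, and dynamic padding. Column and split names are assumed.
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw = load_dataset("Paul/hatecheck-spanish")
dataset = raw["test"]  # split name assumed

# 60% train, then the remaining 40% split in half for validation and test
split = dataset.train_test_split(test_size=0.4, seed=42)
valid_test = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, valid_ds, test_ds = split["train"], valid_test["train"], valid_test["test"]

# Map string labels to integers so tensors can be built
label2id = {"hateful": 0, "non-hateful": 1}
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

def preprocess(example):
    enc = tokenizer(example["test_case"], truncation=True)
    enc["labels"] = label2id[example["label_gold"]]
    return enc

train_ds = train_ds.map(preprocess, remove_columns=train_ds.column_names)
valid_ds = valid_ds.map(preprocess, remove_columns=valid_ds.column_names)
test_ds = test_ds.map(preprocess, remove_columns=test_ds.column_names)

# Pads each batch to its longest sequence at collation time (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```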
### Framework versions
- Transformers 4.47.0
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0
## Citation:
```bibtex
@inproceedings{CaneteCFP2020,
title={Spanish Pre-Trained BERT Model and Evaluation Data},
author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
booktitle={PML4DC at ICLR 2020},
year={2020}
}
```
## More Information
- Fine-tuned by Javier de la Rosa Sánchez.
- [email protected]
- https://www.linkedin.com/in/delarosajav95/ |