---
datasets:
- Paul/hatecheck-spanish
language:
- es
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- dccuchile/bert-base-spanish-wwm-cased
pipeline_tag: text-classification
library_name: transformers
tags:
- bert
- transformer
- beto
- sequence-classification
- text-classification
- hate-speech-detection
- sentiment-analysis
- spanish
- nlp
- content-moderation
- social-media-analysis
- fine-tuned
---

# IMPORTANT: Update and Model Information

A newer, enhanced version of this model is now available: [HateSpeech-BETO-cased-v2](https://huggingface.co/delarosajav95/HateSpeech-BETO-cased-v2). This updated version offers significantly improved performance, addressing real-world challenges with greater accuracy and reliability.

**We strongly recommend using this second fine-tuned version**, which has been specifically optimized to better detect nuanced forms of hate speech in Spanish.
Explore the improved model here: 👉 [HateSpeech-BETO-cased-v2](https://huggingface.co/delarosajav95/HateSpeech-BETO-cased-v2).

# HateSpeech-BETO-cased-v1

This model is a fine-tuned version of [dccuchile/bert-base-spanish-wwm-cased](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) for hate speech detection related to racism, homophobia, sexism, and other forms of discrimination in Spanish text.
It was trained on the [Paul/hatecheck-spanish](https://huggingface.co/datasets/Paul/hatecheck-spanish) dataset.

The model performs binary text classification with two labels: *hateful* and *non-hateful*.

## Full classification example:

```python
from transformers import pipeline

pipe = pipeline("text-classification", model="delarosajav95/HateSpeech-BETO-cased-v1")

inputs = [
    "El camarero a cargo del bar era demasiado afeminado, se notaba que la manera de moverse no era la propia de un hombre.",
    "La manera en que me miraba tan desagradable era porque era una persona del sur.",
    "Estoy cansado de que las mujeres se pongan falda y vayan provocando por ahí a toda persona normal y corriente.",
]

# top_k=None returns the scores for all labels
# (return_all_scores=True is deprecated in recent Transformers releases)
result = pipe(inputs, top_k=None)

# The model exposes generic LABEL_0/LABEL_1 ids; map them to readable names
label_mapping = {"LABEL_0": "Hateful", "LABEL_1": "Non-hateful"}

for i, predictions in enumerate(result):
    print("==================================")
    print(f"Text {i + 1}: {inputs[i]}")
    for pred in predictions:
        label = label_mapping.get(pred["label"], pred["label"])
        print(f"{label}: {pred['score']:.2%}")
```

Output:

```text
==================================
Text 1: El camarero a cargo del bar era demasiado afeminado, se notaba que la manera de moverse no era la propia de un hombre.
Hateful: 99.98%
Non-hateful: 0.02%
==================================
Text 2: La manera en que me miraba tan desagradable era porque era una persona del sur.
Hateful: 99.98%
Non-hateful: 0.02%
==================================
Text 3: Estoy cansado de que las mujeres se pongan falda y vayan provocando por ahí a toda persona normal y corriente.
Hateful: 99.84%
Non-hateful: 0.16%
```

## Metrics and results:

It achieves the following results on the *evaluation set* (last epoch); per-label values presumably follow the label ids, i.e. [hateful, non-hateful]:
- `eval_loss`: 0.03607647866010666
- `eval_accuracy`: 0.9933244325767691
- `eval_precision_per_label`: [1.0, 0.9905123339658444]
- `eval_recall_per_label`: [0.9779735682819384, 1.0]
- `eval_f1_per_label`: [0.9888641425389755, 0.9952335557673975]
- `eval_precision_weighted`: 0.9933877681310691
- `eval_recall_weighted`: 0.9933244325767691
- `eval_f1_weighted`: 0.9933031728530427
- `eval_runtime`: 1.7545
- `eval_samples_per_second`: 426.913
- `eval_steps_per_second`: 53.578
- `epoch`: 4.0

It achieves the following results on the *test set*:
- `eval_loss`: 0.052769944071769714
- `eval_accuracy`: 0.9933244325767691
- `eval_precision_per_label`: [0.9956140350877193, 0.9923224568138196]
- `eval_recall_per_label`: [0.9826839826839827, 0.9980694980694981]
- `eval_f1_per_label`: [0.9891067538126361, 0.9951876804619827]
- `eval_precision_weighted`: 0.9933376164683867
- `eval_recall_weighted`: 0.9933244325767691
- `eval_f1_weighted`: 0.993312254486016
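
The evaluation code was not published, but these per-label and weighted values match what a `compute_metrics` function built on scikit-learn would report; the sketch below is a hedged reconstruction under that assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Hypothetical metrics function reproducing the keys reported above."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average=None yields per-label arrays in label-id order: [hateful, non-hateful]
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average=None)
    pw, rw, f1w, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision_per_label": p.tolist(),
        "recall_per_label": r.tolist(),
        "f1_per_label": f1.tolist(),
        "precision_weighted": pw,
        "recall_weighted": rw,
        "f1_weighted": f1w,
    }
```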

## Training Details and Procedure

### Main Hyperparameters:

- evaluation_strategy: "epoch"
- learning_rate: 1e-5
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- num_train_epochs: 4
- optimizer: AdamW
- weight_decay: 0.01
- save_strategy: "epoch"
- lr_scheduler_type: "linear"
- warmup_steps: 449
- logging_steps: 10
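
For reference, here is a minimal sketch of how these hyperparameters could be passed to `TrainingArguments`. The original training script was not published, so `output_dir` is an assumption, and `eval_strategy` is the current name of what the list above calls `evaluation_strategy`:

```python
# Hypothetical reconstruction from the hyperparameter list above;
# not the author's actual script.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="HateSpeech-BETO-cased-v1",  # assumed output path
    eval_strategy="epoch",          # "evaluation_strategy" in older releases
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    weight_decay=0.01,              # applied by the default AdamW optimizer
    lr_scheduler_type="linear",
    warmup_steps=449,
    logging_steps=10,
    seed=42,
)
```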


### Preprocessing and Postprocessing:

- The dataset had to be split manually into train (60%), validation (20%), and test (20%) sets.
- Seed = 42
- Number of labels = 2
- The dataset's string labels ("hateful", "non-hateful") had to be mapped manually to integers (0, 1) so that tensors could be created properly.
- Dynamic padding through a data collator was used.
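
A minimal sketch of these preprocessing steps, assuming the dataset follows the standard HateCheck schema (a single `test` split with a `test_case` text column and a `label_gold` label column); the actual script was not published:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

SEED = 42

# Assumption: the suite ships as a single split with HateCheck-style columns
raw = load_dataset("Paul/hatecheck-spanish", split="test")

# Map string labels to integers (0 = hateful, 1 = non-hateful) so tensors can be built
label2id = {"hateful": 0, "non-hateful": 1}
raw = raw.map(lambda ex: {"label": label2id[ex["label_gold"]]})

# Manual 60/20/20 train/validation/test split with a fixed seed
split = raw.train_test_split(test_size=0.4, seed=SEED)
holdout = split["test"].train_test_split(test_size=0.5, seed=SEED)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]

# Tokenize without padding; padding is deferred to the collator
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

def encode(batch):
    return tokenizer(batch["test_case"], truncation=True)

train_ds = train_ds.map(encode, batched=True)
val_ds = val_ds.map(encode, batched=True)
test_ds = test_ds.map(encode, batched=True)

# Dynamic padding: each batch is padded to its own longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```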


### Framework versions

- Transformers 4.47.0
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0

## Citation:

```bibtex
@inproceedings{CaneteCFP2020,
  title={Spanish Pre-Trained BERT Model and Evaluation Data},
  author={Cañete, José and Chaperon, Gabriel and Fuentes, Rodrigo and Ho, Jou-Hui and Kang, Hojin and Pérez, Jorge},
  booktitle={PML4DC at ICLR 2020},
  year={2020}
}
```

## More Information

- Fine-tuned by Javier de la Rosa Sánchez.
- Contact: [email protected]
- LinkedIn: https://www.linkedin.com/in/delarosajav95/