cardiffnlp/twitter-xlm-roberta-base-hate-spanish

This model is a version of cardiffnlp/twitter-xlm-roberta-base fine-tuned for hate speech detection on the HaterNet dataset and the Spanish subset of SemEval-2019 Task 5.

The model achieves the following results.

On the test split of SemEval-2019 Task 5:
- F1 (weighted): 0.7866
- F1 (macro): 0.7935
- Accuracy: 0.7937

On a custom test split of HaterNet:
- F1 (weighted): 0.7815
- F1 (macro): 0.6981
- Accuracy: 0.7933

On HaterNet and SemEval-2019 Task 5:
- F1 (weighted): 0.7908
- F1 (macro): 0.7657
- Accuracy: 0.7936
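
The weighted and macro F1 scores above can differ noticeably (for example 0.7815 versus 0.6981 on HaterNet). One common reason for such a gap is class imbalance, although this card does not state the class distribution of any split. The following is a minimal sketch of how the two averages are usually defined, using made-up toy labels rather than data from either dataset; the function names are illustrative and not part of tweetnlp.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, label):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, weighted F1) over the classes present in y_true."""
    labels = sorted(set(y_true))
    support = Counter(y_true)  # number of true examples per class
    scores = {label: f1_per_class(y_true, y_pred, label) for label in labels}
    # Macro: every class counts equally, regardless of its size.
    macro = sum(scores.values()) / len(labels)
    # Weighted: each class counts in proportion to its support.
    weighted = sum(scores[label] * support[label] for label in labels) / len(y_true)
    return macro, weighted


# Toy, imbalanced example (hypothetical labels, not taken from the datasets).
y_true = ["NOT-HATE"] * 8 + ["HATE"] * 2
y_pred = ["NOT-HATE"] * 8 + ["NOT-HATE", "HATE"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

In this toy case the minority class is predicted less well, so the macro average falls below the weighted one, because macro F1 gives the small class the same influence as the large one.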

Usage

Install tweetnlp via pip.

pip install tweetnlp

Load the model in Python and classify a tweet.

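import tweetnlp

# Load the fine-tuned Spanish hate speech classifier from the Hugging Face Hub.
model = tweetnlp.Classifier("cardiffnlp/twitter-xlm-roberta-base-hate-spanish")

# Classify a single tweet; the result is a dict holding the predicted label.
model.predict('Ismael es egocentrico porque se vuelve loca si le dicen que tiene el pelo bonito😂😂😂😂 eso se define con otro objetivo #FirstDates251')
# >> {'label': 'NOT-HATE'}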
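
To apply the classifier to several tweets, one option is to tally the returned labels. The sketch below assumes only what the example above shows, namely that `predict` returns a dict with a `'label'` key. The helper `count_labels` and the stand-in `fake_predict` are hypothetical names introduced so the snippet runs without downloading the model.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def count_labels(predict: Callable[[str], Dict[str, str]],
                 texts: Iterable[str]) -> Counter:
    """Tally the 'label' field that `predict` returns for each text."""
    return Counter(predict(text)["label"] for text in texts)


def fake_predict(text: str) -> Dict[str, str]:
    """Stand-in predictor (an assumption for offline use, not the real model)."""
    return {"label": "NOT-HATE"}


counts = count_labels(fake_predict, ["hola", "buenos días"])
print(counts)
```

With tweetnlp installed, `model.predict` from the usage example could be passed in place of `fake_predict`.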

Datasets

@inproceedings{basile-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter",
    author = "Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/S19-2007",
    doi = "10.18653/v1/S19-2007",
    pages = "54--63",
    abstract = "The paper describes the organization of the SemEval 2019 Task 5 about the detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter. The task is organized in two related classification subtasks: a main binary subtask for detecting the presence of hate speech, and a finer-grained one devoted to identifying further features in hateful contents such as the aggressive attitude and the target harassed, to distinguish if the incitement is against an individual rather than a group. HatEval has been one of the most popular tasks in SemEval-2019 with a total of 108 submitted runs for Subtask A and 70 runs for Subtask B, from a total of 74 different teams. Data provided for the task are described by showing how they have been collected and annotated. Moreover, the paper provides an analysis and discussion about the participant systems and the results they achieved in both subtasks."
}

@article{quijano2019haternet,
    title={HaterNet a system for detecting and analyzing hate speech in Twitter (Version 1.0)[Data set]},
    author={Quijano-Sanchez, Lara and Kohatsu, Juan Carlos Pereira and Liberatore, Federico and Camacho-Collados, Miguel},
    journal={Zenodo},
    year={2019}
}