Model Card for Model ID (work in progress)

This model is a fine-tuned version of uncased BETO that detects offensive and discriminatory language against the LGBT community. It could be used as a moderation service in forums and digital spaces.

Model Card Contact

[[email protected]]

Model Details

Model description process

- Discriminatory phrases targeting the LGBTQIA+ community were first collected from X/Twitter, Instagram and TikTok (197 phrases).
- Three raters labelled each phrase as non-lgbtphobic (0) or lgbtphobic (1).
- Text augmentation was applied using backtranslation and random synonym replacement.
- Part of the McGiff, J., & Nikolov, N. S. (2024) dataset was translated into Spanish and added (under licence CC-BY-4.0).
- The result is 1234 tagged phrases for version 1.0.1 of LGBTQIAphobia_augmented.

Please cite the data set as:

Martínez-Araneda, C., Maldonado Montiel, D., Gutiérrez Valenzuela, M., Gómez Meneses, P., Segura Navarrete, A., & Vidal-Castro, C. (2024). LGBTQIAphobia dataset (augmented) (1.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14563166
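The random synonym replacement step listed in the process above can be illustrated with a minimal sketch. This is not the authors' augmentation code: the `SYNONYMS` table, the function names `replace_synonyms` and `augment`, and the `rate` and `n_variants` parameters are all hypothetical, and a real pipeline would draw synonyms from a Spanish thesaurus rather than a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (assumption); not taken from the dataset.
SYNONYMS = {
    "feliz": ["contento", "alegre"],
    "casa": ["hogar", "vivienda"],
}


def replace_synonyms(phrase, synonyms, rate=0.5, rng=None):
    """Swap each word that has known synonyms with probability `rate`."""
    rng = rng or random.Random()
    out = []
    for word in phrase.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def augment(phrase, label, synonyms, n_variants=2, seed=0):
    """Return up to `n_variants` new (phrase, label) pairs.

    The label of the source phrase is carried over unchanged.
    """
    rng = random.Random(seed)
    variants = set()
    # Try a bounded number of draws, since some draws repeat or change nothing.
    for _ in range(n_variants * 5):
        candidate = replace_synonyms(phrase, synonyms, rate=0.5, rng=rng)
        if candidate != phrase:
            variants.add(candidate)
        if len(variants) == n_variants:
            break
    return [(variant, label) for variant in sorted(variants)]
```

Calling `augment("estoy feliz en casa", 0, SYNONYMS)` would yield a few reworded copies of the phrase that keep the original label, which is how augmentation grows a small labelled set without new annotation work.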

Developed by: [Martínez-Araneda, C.; Segura Navarrete, A.; Gutiérrez Valenzuela, Mariella; Maldonado Montiel, Diego; Gómez Meneses, P.; Vidal-Castro, Christian]
Model type: [text-classification]
Language(s) (NLP): [Spanish]
License: [CC-BY-4.0]
Finetuned from model: [dccuchile/bert-base-spanish-wwm-uncased]. More information about the base model: [https://github.com/dccuchile/beto]

Model Sources [optional]

Uses

This model can be used to detect offensive and discriminatory language against the LGBT community. It could be used as a moderation service in forums and digital spaces.

Direct Use

[More Information Needed]

Out-of-Scope Use

[More Information Needed]

Bias, Risks, and Limitations

This model carries biases of its own because it was fine-tuned on a small data set (1234 tagged phrases).

[More Information Needed]

Recommendations

How to Get Started with the Model

# Libraries
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Path to the directory the model will be loaded from
load_directory = "./lgbetO"

# Load the fine-tuned model
model = AutoModelForSequenceClassification.from_pretrained(load_directory)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(load_directory)
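Once the model and tokenizer load, classifying text might look like the sketch below. It assumes PyTorch is installed and that the saved model's output indices follow the labelling scheme described in this card (0 = non-lgbtphobic, 1 = lgbtphobic); the names `ID2LABEL`, `softmax`, `decode` and `classify` are hypothetical helpers, not part of the released model.

```python
import math

# Assumed label order, taken from the labelling scheme in this card.
ID2LABEL = {0: "non-lgbtphobic", 1: "lgbtphobic"}


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode(logits):
    """Map one row of logits to a label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


def classify(texts, load_directory="./lgbetO"):
    """Run the fine-tuned model on a list of Spanish texts."""
    # Imported here so the helpers above work without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(load_directory)
    model = AutoModelForSequenceClassification.from_pretrained(load_directory)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return [decode(row) for row in logits.tolist()]
```

A moderation service would typically call `classify` on incoming posts and route those labelled `lgbtphobic` with a high score to human review rather than removing them automatically.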

Training Details

Training begins by collecting phrases related to the LGBT community from Twitter, Instagram and TikTok, covering both offensive or discriminatory and non-offensive or non-discriminatory language. The phrases are preprocessed, labelled by 3 raters, and augmented with backtranslation and synonym replacement. The BETO base model (dccuchile/bert-base-spanish-wwm-uncased) is then fine-tuned to detect discriminatory phrases against the LGBT community.
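This card does not state how the three raters' labels were combined into a single tag. One common option, shown here purely as an assumption, is a majority vote; `majority_label` is a hypothetical helper and the example ratings are invented for illustration.

```python
from collections import Counter


def majority_label(ratings):
    """Return the label chosen by most raters.

    Requires an odd number of ratings so that a binary vote cannot tie.
    """
    if len(ratings) % 2 == 0:
        raise ValueError("an odd number of ratings is required")
    label, _count = Counter(ratings).most_common(1)[0]
    return label


# Illustrative ratings only: 0 = non-lgbtphobic, 1 = lgbtphobic.
example_ratings = [1, 0, 1]
final_label = majority_label(example_ratings)
```

With three raters and two classes, at least two raters always agree, so this rule yields a label for every phrase.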

Training Data

Citation Martínez-Araneda, C., Maldonado Montiel, D., Gutiérrez Valenzuela, M., Gómez Meneses, P., Segura Navarrete, A., & Vidal-Castro, C. (2024). LGBTQIAphobia dataset (augmented) (1.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14563166

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: T4 GPU (TDP of 70W)
Hours used: 10
Cloud Provider: Google Cloud Platform
Compute Region: southamerica-east1
Carbon Emitted: 0.14 kgCO$_2$eq

Experiments were conducted using Google Cloud Platform in region southamerica-east1, which has a carbon efficiency of 0.2 kgCO$_2$eq/kWh. A cumulative 10 hours of computation was performed on hardware of type T4 (TDP of 70W).

Total emissions are estimated to be 0.14 kgCO$_2$eq, of which 100 percent was directly offset by the cloud provider.
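The estimate follows from multiplying the figures above: power draw times hours gives energy, and energy times the regional carbon efficiency gives emissions. The sketch below reproduces that arithmetic, assuming the GPU ran at its full TDP for the whole period; the function name `estimate_emissions_kg` is hypothetical.

```python
def estimate_emissions_kg(power_watts, hours, kg_co2eq_per_kwh):
    """Estimate emissions as energy used (kWh) times carbon efficiency."""
    energy_kwh = power_watts / 1000 * hours  # 70 W for 10 h is 0.7 kWh
    return energy_kwh * kg_co2eq_per_kwh


# Figures reported in this card: T4 at 70 W TDP, 10 hours, 0.2 kgCO2eq/kWh.
total = estimate_emissions_kg(power_watts=70, hours=10, kg_co2eq_per_kwh=0.2)
print(f"{total:.2f} kgCO2eq")  # prints "0.14 kgCO2eq"
```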

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

Python 3 Google Compute Engine backend (GPU)

Hardware

RAM: 3.87 GB / 12.67 GB; Disk: 33.96 GB / 112.64 GB

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]