somosnlp-hackathon-2022/readability-es-sentences

pcuenq (HF staff) committed on Apr 4, 2022

Commit d03e979 · 1 Parent(s): e0cb9cc

Model card.

Files changed (1)

README.md +50 -0

README.md ADDED

	@@ -0,0 +1,50 @@

+---
+language: es
+license: cc-by-4.0
+tags:
+- spanish
+- roberta
+- bertin
+pipeline_tag: text-classification
+widget:
+- text: La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.
+---
+# Readability ES Sentences for two classes
+Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.
+## Description and performance
+This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model performs binary classification between the following classes:
+- Simple.
+- Complex.
+It achieves a macro-averaged F1 score of 0.8923, measured on the validation set.
+## Datasets
+- [`readability-es-sentences`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences), composed of:
+  * coh-metrix-esp corpus.
+  * Various text resources scraped from websites.
+- Other non-public datasets: newsela-es, simplext.
+## Training details
+Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/3rgvwps0/overview?workspace=user-pcuenq) for full details on the hyperparameters and training regime.
+## Biases and Limitations
+- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
+- One of the datasets involved is the Spanish version of newsela, which is frequently used as a reference. However, it was created by translating previous datasets, and therefore it may contain somewhat unnatural phrases.
+- Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
+- The language might be biased towards the Spanish dialect spoken in Spain. Other regional variants might be underrepresented.
+- In-depth examinations of other limitations and biases have not yet been carried out.
+## Authors
+- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
+- [Pedro Cuenca](https://twitter.com/pcuenq)
+- [Sergio Morales](https://www.fireblend.com/)
+- [Fernando Alva-Manchego](https://feralvam.github.io/)
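
For readers who want to try the classifier described in this model card, the following is a minimal sketch rather than the authors' own code. It assumes that the checkpoint is available on the Hugging Face Hub under the repository name shown in this page's header (`somosnlp-hackathon-2022/readability-es-sentences`) and that the `transformers` library is installed. The helper names `load_classifier` and `pair_with_labels` are hypothetical. The card does not document the label strings the model returns, so the code passes them through unchanged instead of assuming particular names.

```python
from typing import Dict, List

# Assumption: the Hub repository id matches this page's header;
# the README itself does not state it.
MODEL_ID = "somosnlp-hackathon-2022/readability-es-sentences"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily so pair_with_labels can be used without transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def pair_with_labels(sentences: List[str], predictions: List[Dict]) -> List[Dict]:
    """Attach each sentence to its predicted label and a rounded score."""
    return [
        {"text": s, "label": p["label"], "score": round(p["score"], 4)}
        for s, p in zip(sentences, predictions)
    ]


# Example usage (needs network access to download the model):
#   clf = load_classifier()
#   sentences = ["El gato duerme en el sofá.", "Hoy hace sol."]
#   for row in pair_with_labels(sentences, clf(sentences)):
#       print(row)
```

Because the model was trained mostly at sentence level, it is probably best to split longer texts into sentences before classifying them. That is an inference from the card's description, not a documented requirement.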