Model card.
README.md
ADDED
@@ -0,0 +1,50 @@
---
language: es
license: cc-by-4.0
tags:
- spanish
- roberta
- bertin
pipeline_tag: text-classification
widget:
- text: La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.
---

# Readability ES Sentences for two classes

Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.
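A minimal usage sketch, assuming the model is loaded through the `transformers` text-classification pipeline (the card's `pipeline_tag`). The repository id in `MODEL_ID` is a placeholder, not the real one, and the helper names `load_classifier` and `label_sentences` are hypothetical. The label strings returned depend on the model's configuration and are not documented in this card.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: placeholder repository id. Replace it with the id under which
# this model is actually published on the Hub.
MODEL_ID = "your-namespace/your-readability-model"


def load_classifier(model_id: str = MODEL_ID) -> Callable[[List[str]], List[Dict]]:
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def label_sentences(
    sentences: List[str],
    classifier: Callable[[List[str]], List[Dict]],
) -> List[Tuple[str, str, float]]:
    """Pair each sentence with the label and score returned by the classifier."""
    predictions = classifier(sentences)
    return [
        (sentence, pred["label"], float(pred["score"]))
        for sentence, pred in zip(sentences, predictions)
    ]
```

With a real repository id, `label_sentences(["El gato duerme."], load_classifier())` would return one `(sentence, label, score)` tuple per input. Since the model was trained mostly on sentences, passing one sentence per item is likely the closest match to the training setup.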

## Description and performance

This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model performs binary classification between the following classes:
- Simple.
- Complex.

It achieves a macro-averaged F1 score of 0.8923, measured on the validation set.
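For reference, the macro average is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of how many examples each has. The sketch below illustrates that definition on toy data. It is not the evaluation code used for this model, and the label strings `"simple"` and `"complex"` are illustrative assumptions.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("simple", "complex"),
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy example: one "simple" sentence is misclassified as "complex".
gold = ["simple", "simple", "complex", "complex"]
pred = ["simple", "complex", "complex", "complex"]
score = macro_f1(gold, pred)  # (2/3 + 4/5) / 2 = 11/15
```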

## Datasets

- [`readability-es-sentences`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences), composed of:
  * coh-metrix-esp corpus.
  * Various text resources scraped from websites.
- Other non-public datasets: newsela-es, simplext.

## Training details

Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/3rgvwps0/overview?workspace=user-pcuenq) for full details on the hyperparameters and training regime.

## Biases and Limitations

- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
- One of the datasets involved is the Spanish version of newsela, which is frequently used as a reference. However, it was created by translating previous datasets, and therefore it may contain somewhat unnatural phrases.
- Some of the datasets used cannot be publicly disseminated, which makes it more difficult to assess the existence of biases or mistakes.
- The language may be biased towards the variety of Spanish spoken in Spain. Other regional variants may be underrepresented.
- In-depth examinations of other limitations and biases have not yet been carried out.

## Authors

- [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
- [Pedro Cuenca](https://twitter.com/pcuenq)
- [Sergio Morales](https://www.fireblend.com/)
- [Fernando Alva-Manchego](https://feralvam.github.io/)