pcuenq HF staff committed on
Commit d03e979 · 1 Parent(s): e0cb9cc

Model card.

Files changed (1)
  1. README.md +50 -0
README.md ADDED
@@ -0,0 +1,50 @@
+ ---
+ language: es
+ license: cc-by-4.0
+ tags:
+ - spanish
+ - roberta
+ - bertin
+ pipeline_tag: text-classification
+ widget:
+ - text: La ciencia nos enseña, en efecto, a someter nuestra razón a la verdad y a conocer y juzgar las cosas como son, es decir, como ellas mismas eligen ser y no como quisiéramos que fueran.
+ ---
+
+ # Readability ES Sentences for two classes
+
+ Model based on the RoBERTa architecture, fine-tuned from [BERTIN](https://huggingface.co/bertin-project/bertin-roberta-base-spanish) for readability assessment of Spanish texts.
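
A minimal inference sketch, assuming the model is published on the Hugging Face Hub and loaded through the `transformers` text-classification pipeline (matching the `pipeline_tag` in the front matter). The card does not state the repository ID, so `MODEL_ID` below is a hypothetical placeholder, and `classify_readability` is an illustrative helper name rather than part of any library.

```python
from typing import Dict, List

# Assumption: the model card does not name the Hub repository, so this is a
# placeholder that must be replaced with the real repository ID.
MODEL_ID = "your-namespace/your-readability-model"  # hypothetical


def classify_readability(texts: List[str], model_id: str = MODEL_ID) -> List[Dict]:
    """Return one {'label': ..., 'score': ...} dict per input text."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(texts)
```

Calling `classify_readability(["Una frase en español."])` should yield a list with one prediction per sentence. The exact label strings depend on the model configuration, which the card does not specify.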
+
+ ## Description and performance
+
+ This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model performs binary classification between the following classes:
+ - Simple.
+ - Complex.
+
+ It achieves a macro-averaged F1 score of 0.8923, measured on the validation set.
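
For readers unfamiliar with the metric, the macro-averaged F1 reported above is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python on toy data. The label strings `"simple"` and `"complex"` and the helper names are illustrative assumptions, and this is not the evaluation code used for the reported score.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    # Convention: F1 is 0 when the class never appears in labels or predictions.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("simple", "complex"),  # illustrative label names
) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Toy data, not taken from the validation set.
toy_true = ["simple", "simple", "simple", "complex", "complex", "complex"]
toy_pred = ["simple", "simple", "complex", "complex", "complex", "complex"]
toy_score = macro_f1(toy_true, toy_pred)
```

On the toy data, the per-class F1 scores are 0.8 for `simple` and 6/7 for `complex`, so the macro average is about 0.829. Because both classes carry equal weight, a model cannot reach a high score by favoring the more frequent class.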
+
+ ## Datasets
+
+ - [`readability-es-sentences`](https://huggingface.co/datasets/hackathon-pln-es/readability-es-sentences), composed of:
+   * coh-metrix-esp corpus.
+   * Various text resources scraped from websites.
+ - Other non-public datasets: newsela-es, simplext.
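
The public part of the training data listed above can be fetched with the `datasets` library. This is a minimal sketch. The repository ID is taken from the dataset link in the list, while the helper name is illustrative and nothing is assumed about split or column names, which the card does not describe.

```python
# Repository ID taken from the dataset URL in the list above.
DATASET_ID = "hackathon-pln-es/readability-es-sentences"


def load_public_readability_data(dataset_id: str = DATASET_ID):
    """Download the public dataset from the Hugging Face Hub and return it."""
    # Imported lazily so this module can be loaded without datasets installed.
    from datasets import load_dataset

    return load_dataset(dataset_id)
```

Printing the returned object shows the available splits and columns, which is a reasonable first step before any preprocessing. The non-public datasets (newsela-es, simplext) cannot be obtained this way.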
+
+ ## Training details
+
+ Please refer to [this training run](https://wandb.ai/readability-es/readability-es/runs/3rgvwps0/overview?workspace=user-pcuenq) for full details on hyperparameters and training regime.
+
+ ## Biases and Limitations
+
+ - Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
+ - One of the datasets involved is the Spanish version of Newsela, which is frequently used as a reference. However, it was created by translating previous datasets, so it may contain somewhat unnatural phrases.
+ - Some of the datasets used cannot be publicly disseminated, which makes it more difficult to assess the existence of biases or mistakes.
+ - Language might be biased towards the Spanish dialect spoken in Spain. Other regional variants might be underrepresented.
+ - In-depth examinations of other limitations and biases have not yet been carried out.
+
+ ## Authors
+
+ - [Laura Vásquez-Rodríguez](https://lmvasque.github.io/)
+ - [Pedro Cuenca](https://twitter.com/pcuenq)
+ - [Sergio Morales](https://www.fireblend.com/)
+ - [Fernando Alva-Manchego](https://feralvam.github.io/)
+