---
license: bigscience-openrail-m
---
## Model description
An xlm-roberta-large model fine-tuned on ~1.6 million annotated statements from the [Manifesto Corpus](https://manifesto-project.wzb.eu/information/documents/corpus) (version 2023a).
The model can be used to categorize any type of text into 56 different political topics according to the Manifesto Project's coding scheme ([Handbook 4](https://manifesto-project.wzb.eu/coding_schemes/mp_v4)).
It works for all languages that the xlm-roberta model was pretrained on ([overview](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr#introduction)), but it will perform best for the 38 languages contained in the Manifesto Corpus:
||||||
|------|------|------|------|------|
|armenian|bosnian|bulgarian|catalan|croatian|
|czech|danish|dutch|english|estonian|
|finnish|french|galician|georgian|german|
|greek|hebrew|hungarian|icelandic|italian|
|japanese|korean|latvian|lithuanian|macedonian|
|montenegrin|norwegian|polish|portuguese|romanian|
|russian|serbian|slovak|slovenian|spanish|
|swedish|turkish|ukrainian| | |
The context model variant additionally incorporates the sentences surrounding a statement to improve classification results for ambiguous sentences (see Training Procedure for details).
**Important**
We slightly modified the classification head of the `XLMRobertaForSequenceClassification` model (removing the tanh activation and the intermediate linear layer), as this considerably improved model performance for this task.
To load the full model correctly, pass the `trust_remote_code=True` argument to the `from_pretrained` method.
## How to use
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
sentence = "These principles are under threat."
context = "Human rights and international humanitarian law are fundamental pillars of a secure global system. These principles are under threat. Some of the world's most powerful states choose to sell arms to human-rights abusing states."
# For sentences without additional context, just use the sentence itself as the context.
# Example: context = "These principles are under threat."
inputs = tokenizer(sentence,
                   context,
                   return_tensors="pt",
                   max_length=300,  # we limited the input to 300 tokens during finetuning
                   padding="max_length",
                   truncation=True
                   )
logits = model(**inputs).logits
probabilities = torch.softmax(logits, dim=1).tolist()[0]
probabilities = {model.config.id2label[index]: round(probability * 100, 2) for index, probability in enumerate(probabilities)}
probabilities = dict(sorted(probabilities.items(), key=lambda item: item[1], reverse=True))
print(probabilities)
# {'201 - Freedom and Human Rights': 90.76, '107 - Internationalism: Positive': 5.82, '105 - Military: Negative': 0.66...
predicted_class = model.config.id2label[logits.argmax().item()]
print(predicted_class)
# 201 - Freedom and Human Rights
```
## Training Procedure
The model was trained on all quasi-sentences of the Manifesto Corpus (version 2023a), except for 10% that were held out for the final test and evaluation results.
This results in a training dataset of 1,601,329 quasi-sentences.
Because our context-including model input carries a risk of data leakage between train and test data, we refrained from randomly splitting quasi-sentences into train and test data.
Instead, we randomly split the dataset at the manifesto level, so that 1,779 manifestos and all their quasi-sentences were assigned to the train set and 198 to the test set.
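
A manifesto-level split of this kind can be sketched as follows. This is a minimal illustration, not the code used for training: the record layout (a `manifesto_id` key), the function name `split_by_manifesto`, the 10% test fraction at manifesto level, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_manifesto(records, test_fraction=0.1, seed=0):
    """Split quasi-sentences so that each manifesto lands entirely in train or test.

    `records` is a list of dicts with a hypothetical "manifesto_id" key.
    """
    by_manifesto = defaultdict(list)
    for record in records:
        by_manifesto[record["manifesto_id"]].append(record)

    # Shuffle manifesto ids (not sentences), then cut off the test share.
    manifesto_ids = sorted(by_manifesto)
    random.Random(seed).shuffle(manifesto_ids)
    n_test = max(1, round(len(manifesto_ids) * test_fraction))

    test = [r for m in manifesto_ids[:n_test] for r in by_manifesto[m]]
    train = [r for m in manifesto_ids[n_test:] for r in by_manifesto[m]]
    return train, test


# Toy data: 20 manifestos with 5 quasi-sentences each.
records = [
    {"manifesto_id": f"m{m}", "text": f"sentence {s} of manifesto {m}"}
    for m in range(20)
    for s in range(5)
]
train, test = split_by_manifesto(records)
print(len(train), len(test))
```

Because whole manifestos are assigned to one side, no context window in the test set can contain a sentence that was seen during training.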
As training parameters, we used the following settings: learning rate: 1e-5, weight decay: 0.01, epochs: 1, batch size: 4, gradient accumulation steps: 4 (effective batch size: 16).
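
The effective batch size and the approximate number of optimizer updates follow from these settings. The arithmetic below is a sketch that assumes a single device and that a final, incomplete accumulation window still triggers an update; neither detail is stated above.

```python
import math

train_examples = 1_601_329       # training quasi-sentences
batch_size = 4
gradient_accumulation_steps = 4
epochs = 1

effective_batch_size = batch_size * gradient_accumulation_steps
forward_passes = math.ceil(train_examples / batch_size) * epochs
optimizer_updates = math.ceil(forward_passes / gradient_accumulation_steps)

print(effective_batch_size)  # 16
print(forward_passes)        # 400333
print(optimizer_updates)     # 100084
```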
### Context
To adapt the model to the task of classifying statements in manifestos, we made some modifications to the traditional training setup.
Given that human annotators in the Manifesto Project are encouraged to use surrounding sentences to interpret ambiguous statements, we combined statements with their context for our model's input.
Specifically, we used a sentence-pair input: the statement to be classified comes first, followed by the separator tokens and then a larger context of 200 tokens in which that statement is embedded.
Here is an example:
*"`<s>` We must right the wrongs in our democracy, `</s>` `</s>` To turn this crisis into a crucible, from which we will forge a stronger, brighter, and more equitable future. We must right the wrongs in our democracy, redress the systemic injustices that have long plagued our society, throw open the doors of opportunity for all Americans and reinvent our institutions at home and our leadership abroad. `</s>`".*
The second part, which contains the context, is greedily filled until it contains 200 tokens.
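
One way such a greedy fill could work is sketched below. The exact procedure is not specified here, so several things are assumptions: neighbouring sentences are added alternately before and after the statement, a side is closed as soon as its next sentence no longer fits, and tokens are counted by whitespace splitting instead of with the XLM-RoBERTa tokenizer. The name `build_context` is hypothetical.

```python
def build_context(sentences, index, max_tokens=200,
                  count_tokens=lambda s: len(s.split())):
    """Greedily grow a window around sentences[index] within a token budget.

    Assumption: neighbours are added alternately (previous, next, ...), and a
    side is closed once its next sentence would exceed the budget.
    """
    left = right = index                      # window is sentences[left:right + 1]
    used = count_tokens(sentences[index])
    can_left = can_right = True
    take_left = True

    while can_left or can_right:
        if take_left and can_left:
            fits = left > 0 and used + count_tokens(sentences[left - 1]) <= max_tokens
            if fits:
                left -= 1
                used += count_tokens(sentences[left])
            else:
                can_left = False
        elif not take_left and can_right:
            fits = (right < len(sentences) - 1
                    and used + count_tokens(sentences[right + 1]) <= max_tokens)
            if fits:
                right += 1
                used += count_tokens(sentences[right])
            else:
                can_right = False
        take_left = not take_left

    return " ".join(sentences[left:right + 1])


sentences = [
    "Human rights and international humanitarian law are fundamental pillars of a secure global system.",
    "These principles are under threat.",
    "Some of the world's most powerful states choose to sell arms to human-rights abusing states.",
]
statement = sentences[1]
context = build_context(sentences, index=1)
print(context)
```

The resulting `statement` and `context` strings can then be passed to the tokenizer as a sentence pair, as shown in the How to use section.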
Our tests showed that including the context considerably improved the performance of the classification model (by about 7 percentage points in accuracy).
We also tried other approaches, such as a pair of XLM-RoBERTa models in which one receives the sentence and the other the context, and a shared-layer model in which both inputs are fed separately through the same model.
Both variants performed similarly to our sentence-pair approach but led to higher complexity and computing costs, which is why we ultimately opted for the sentence-pair input to include the surrounding context.
## Model Performance
The model was evaluated on a test set of 199,046 annotated manifesto statements.
### Overall
| Model | Accuracy | Top2_Acc | Top3_Acc | Precision | Recall | F1_Macro | MCC | Cross-Entropy |
|-------|:--------:|:--------:|:--------:|:---------:|:------:|:--------:|:---:|:-------------:|
| [Sentence Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-sentence-2023-1-1) | 0.57 | 0.73 | 0.81 | 0.49 | 0.43 | 0.45 | 0.55 | 1.5 |
| [Context Model](https://huggingface.co/manifesto-project/manifestoberta-xlm-roberta-56policy-topics-context-2023-1-1) | 0.64 | 0.81 | 0.88 | 0.54 | 0.52 | 0.53 | 0.62 | 1.15 |
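
The Top2_Acc and Top3_Acc columns are presumably top-k accuracies, where a prediction counts as correct if the true category is among the k highest-scoring categories. The helper below is a sketch under that assumption; the function name and the toy scores are made up for illustration and are not taken from the evaluation.

```python
def top_k_accuracy(scores, true_labels, k):
    """Share of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_labels)


# Toy example with 4 classes and 4 statements (made-up scores).
scores = [
    [0.70, 0.20, 0.05, 0.05],  # true class 0: hit at k=1
    [0.30, 0.40, 0.20, 0.10],  # true class 0: hit at k=2
    [0.10, 0.50, 0.15, 0.25],  # true class 2: hit at k=3
    [0.40, 0.30, 0.20, 0.10],  # true class 3: miss even at k=3
]
true_labels = [0, 0, 2, 3]

for k in (1, 2, 3):
    print(k, top_k_accuracy(scores, true_labels, k))
# 1 0.25
# 2 0.5
# 3 0.75
```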
### Categories
|Category|Precision|Recall|F1|n_test(%)|n_predicted(%)|
|:------|:-----------:|:----:|:----:|:-----:|:-----:|
|101|0.50|0.48|0.49|0.30%|0.29%|
|102|0.56|0.61|0.58|0.09%|0.10%|
|103|0.51|0.36|0.42|0.28%|0.20%|
|104|0.78|0.81|0.79|1.57%|1.64%|
|105|0.69|0.70|0.69|0.34%|0.34%|
|106|0.59|0.57|0.58|0.33%|0.32%|
|107|0.68|0.66|0.67|2.24%|2.17%|
|108|0.66|0.68|0.67|1.20%|1.24%|
|109|0.52|0.39|0.45|0.17%|0.13%|
|110|0.63|0.68|0.65|0.36%|0.38%|
|201|0.58|0.59|0.59|2.16%|2.20%|
|202|0.62|0.63|0.62|3.25%|3.28%|
|203|0.46|0.47|0.47|0.19%|0.19%|
|204|0.61|0.37|0.46|0.25%|0.15%|
|301|0.66|0.71|0.68|2.13%|2.29%|
|302|0.38|0.25|0.30|0.17%|0.11%|
|303|0.58|0.60|0.59|5.12%|5.31%|
|304|0.67|0.65|0.66|1.38%|1.34%|
|305|0.59|0.57|0.58|2.32%|2.22%|
|401|0.45|0.36|0.40|1.50%|1.21%|
|402|0.61|0.58|0.59|2.73%|2.60%|
|403|0.56|0.51|0.53|3.59%|3.25%|
|404|0.30|0.15|0.20|0.58%|0.28%|
|405|0.43|0.51|0.47|0.18%|0.21%|
|406|0.38|0.46|0.42|0.26%|0.31%|
|407|0.56|0.52|0.54|0.40%|0.38%|
|408|0.28|0.17|0.21|1.34%|0.79%|
|409|0.37|0.21|0.27|0.24%|0.14%|
|410|0.53|0.50|0.52|2.22%|2.08%|
|411|0.73|0.75|0.74|8.32%|8.53%|
|412|0.26|0.20|0.22|0.58%|0.45%|
|413|0.49|0.63|0.55|0.29%|0.37%|
|414|0.58|0.55|0.56|1.38%|1.32%|
|415|0.14|0.23|0.18|0.05%|0.07%|
|416|0.52|0.49|0.50|2.45%|2.35%|
|501|0.69|0.78|0.73|4.77%|5.35%|
|502|0.78|0.84|0.81|3.08%|3.32%|
|503|0.61|0.63|0.62|5.96%|6.11%|
|504|0.71|0.76|0.74|10.05%|10.76%|
|505|0.46|0.37|0.41|0.69%|0.55%|
|506|0.78|0.82|0.80|5.42%|5.72%|
|507|0.45|0.26|0.33|0.14%|0.08%|
|601|0.52|0.46|0.49|1.79%|1.57%|
|602|0.35|0.34|0.34|0.24%|0.24%|
|603|0.65|0.68|0.67|1.36%|1.42%|
|604|0.62|0.48|0.54|0.57%|0.44%|
|605|0.72|0.74|0.73|4.22%|4.33%|
|606|0.56|0.48|0.51|1.45%|1.23%|
|607|0.57|0.67|0.62|1.08%|1.25%|
|608|0.48|0.48|0.48|0.41%|0.41%|
|701|0.62|0.66|0.64|3.35%|3.59%|
|702|0.42|0.30|0.35|0.08%|0.06%|
|703|0.75|0.87|0.80|2.65%|3.07%|
|704|0.43|0.32|0.37|0.57%|0.43%|
|705|0.38|0.33|0.35|0.80%|0.69%|
|706|0.43|0.37|0.39|1.35%|1.16%|
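
The F1 column can be related to the precision and recall columns through the usual harmonic mean. The snippet below recomputes it for three categories as a sanity check. It is only a sketch: because the table reports rounded precision and recall, a recomputed value may differ from the reported one in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for a few categories from the table above.
rows = {
    "104": (0.78, 0.81, 0.79),
    "404": (0.30, 0.15, 0.20),
    "502": (0.78, 0.84, 0.81),
}

for category, (precision, recall, reported) in rows.items():
    print(category, round(f1_score(precision, recall), 2), reported)
```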
## Citation
Please cite the model as follows:
Burst, Tobias / Lehmann, Pola / Franzmann, Simon / Al-Gaddooa, Denise / Ivanusch, Christoph / Regel, Sven / Riethmüller, Felicia / Weßels, Bernhard / Zehnter, Lisa (2023): manifestoberta. Version 56topics.context.2023.1.1. Berlin: Wissenschaftszentrum Berlin für Sozialforschung (WZB) / Göttingen: Institut für Demokratieforschung (IfDem). https://doi.org/10.25522/manifesto.manifestoberta.56topics.context.2023.1.1
```bib
@misc{Burst:2023,
Address = {Berlin / Göttingen},
Author = {Burst, Tobias AND Lehmann, Pola AND Franzmann, Simon AND Al-Gaddooa, Denise AND Ivanusch, Christoph AND Regel, Sven AND Riethmüller, Felicia AND Weßels, Bernhard AND Zehnter, Lisa},
Publisher = {Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung},
Title = {manifestoberta. Version 56topics.context.2023.1.1},
doi = {10.25522/manifesto.manifestoberta.56topics.context.2023.1.1},
url = {https://doi.org/10.25522/manifesto.manifestoberta.56topics.context.2023.1.1},
Year = {2023},
}
```