matous-volf committed b0ede6c (verified, parent f411f04): docs: add a readme
---
language:
  - "en"
license: "cc-by-nc-4.0"
library_name: "transformers"
pipeline_tag: text-classification
tags:
  - "text"
  - "politics"
  - "political"
  - "leaning"
  - "bias"
  - "politicalness"
base_model: "microsoft/deberta-v3-large"
datasets:
  - "mlburnham/dem_rep_party_platform_topics"
  - "cajcodes/political-bias"
  - "JyotiNayak/political_ideologies"
  - "Jacobvs/PoliticalTweets"
widget:
  - example_title: "Taxes 1"
    text: "The government should raise taxes on the rich so it can give more money to the homeless."
    output:
      - label: left
        score: 1.00
      - label: center
        score: 0.00
      - label: right
        score: 0.00
  - example_title: "Taxes 2"
    text: "The government should cut taxes because it is not using them efficiently anyway."
    output:
      - label: left
        score: 0.00
      - label: center
        score: 0.00
      - label: right
        score: 1.00
  - example_title: "Abortion 1"
    text: "Opting for abortion is an inalienable right of every individual."
    output:
      - label: left
        score: 1.00
      - label: center
        score: 0.00
      - label: right
        score: 0.00
  - example_title: "Abortion 2"
    text: "Terminating a pregnancy is equivalent to committing homicide."
    output:
      - label: left
        score: 0.42
      - label: center
        score: 0.00
      - label: right
        score: 0.58
  - example_title: "Immigration 1"
    text: "Mass detention of undocumented persons is an unjust practice that disproportionately harms vulnerable populations and must end."
    output:
      - label: left
        score: 1.00
      - label: center
        score: 0.00
      - label: right
        score: 0.00
  - example_title: "Immigration 2"
    text: "Immigration must be strictly controlled to protect national security, as it increases the risk of terrorism."
    output:
      - label: left
        score: 0.00
      - label: center
        score: 0.00
      - label: right
        score: 1.00
model-index:
  - name: "political-leaning-deberta-large"
    results:
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "Article bias prediction"
        metrics:
          - type: "f1"
            value: 89
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "BIGNEWSBLN"
        metrics:
          - type: "f1"
            value: 88.6
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "CommonCrawl news articles"
        metrics:
          - type: "f1"
            value: 88.9
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "Dem., rep. party platform topics"
        metrics:
          - type: "f1"
            value: 85.6
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "cajcodes/political-bias"
          name: "GPT-4 political bias"
        metrics:
          - type: "f1"
            value: 86.9
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "JyotiNayak/political_ideologies"
          name: "GPT-4 political ideologies"
        metrics:
          - type: "f1"
            value: 99.6
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "Media political stance"
        metrics:
          - type: "f1"
            value: 93.1
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "Political podcasts"
        metrics:
          - type: "f1"
            value: 99.8
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "Jacobvs/PoliticalTweets"
          name: "Political tweets"
        metrics:
          - type: "f1"
            value: 82.1
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
      - task:
          type: "text-classification"
          name: "text political leaning classification"
        dataset:
          type: "-"
          name: "Qbias"
        metrics:
          - type: "f1"
            value: 57.9
            name: "F1 score"
            args:
              average: "weighted"
        source:
          name: "the paper"
          url: "https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf"
---

# Text political leaning classifier based on DeBERTa V3 large

This model classifies text by its political leaning into three classes: left, center and right. It has been trained on
news articles, social network posts and LLM-generated political statements. The training data comes from the context of
the United States, so the left class is mostly defined by liberal ideology and Democratic Party views, while the right
class is closely tied to conservative and Republican views.

The model is part of the research presented in the paper
[Predicting political leaning and politicalness of text using transformer models](https://github.com/matous-volf/political-leaning-prediction/blob/main/paper.pdf).
The paper covers predicting political leaning as well as politicalness, a binary label indicating whether a text is
about politics at all. We have benchmarked the existing models for politicalness and shown that one of them,
[Political DEBATE](https://huggingface.co/mlburnham/Political_DEBATE_large_v1.0), achieves an \\(F_1\\) score of over
90 %. This makes it suitable for filtering out non-political texts in front of a political leaning classifier like this
one. We recommend doing so if the input to this model is not guaranteed to be about politics.
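
The recommended filtering step amounts to a simple gate: only run the leaning classifier on texts that a politicalness model accepts. A minimal sketch with hypothetical stand-in predictors (the real setup would plug in Political DEBATE and this model):

```python
from typing import Callable, Optional


def classify_leaning_if_political(
    text: str,
    is_political: Callable[[str], bool],
    predict_leaning: Callable[[str], str],
) -> Optional[str]:
    """Run the leaning classifier only on texts judged to be political."""
    if not is_political(text):
        return None  # non-political input: no leaning prediction
    return predict_leaning(text)


# Hypothetical stand-ins for the actual models, for illustration only.
def stub_is_political(text: str) -> bool:
    return "tax" in text.lower()


def stub_leaning(text: str) -> str:
    return "left"


print(classify_leaning_if_political(
    "The government should raise taxes.", stub_is_political, stub_leaning
))  # left
print(classify_leaning_if_political(
    "I love hiking.", stub_is_political, stub_leaning
))  # None
```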

Our paper addresses the challenge of automatically classifying text according to political leaning and politicalness
using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding
that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this
limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a
new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive
benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train
new ones with enhanced generalization capabilities.

Alongside the paper, we release the complete
[source code and results](https://github.com/matous-volf/political-leaning-prediction). This model is deployed in
a [demo web app](https://political-leaning.matousvolf.cz).
A [second, smaller model](https://huggingface.co/matous-volf/political-leaning-politics) has also been produced.

## Usage

The model outputs 0 for the left, 1 for the center and 2 for the right leaning. The score of the predicted class is
between \\(\frac{1}{3}\\) and 1.
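
The lower bound of \\(\frac{1}{3}\\) follows from taking a softmax over three logits: the largest of three probabilities summing to 1 can never fall below one third. A minimal pure-Python illustration (not part of the model's API):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Even with three near-identical logits, the winning class's
# probability only barely exceeds 1/3 -- it can never drop below it.
probs = softmax([0.01, 0.0, 0.0])
assert abs(sum(probs) - 1.0) < 1e-9
assert max(probs) >= 1 / 3
print(max(probs))
```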

To use the model, you can either utilize the high-level Hugging Face
[pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):

```py
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="matous-volf/political-leaning-deberta-large",
    tokenizer="microsoft/deberta-v3-large",
)

text = "The government should raise taxes on the rich so it can give more money to the homeless."

output = pipe(text)
print(output)
```

Or load it [directly](https://huggingface.co/docs/transformers/en/models):

```py
from torch import argmax
from torch.nn.functional import softmax
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained("matous-volf/political-leaning-deberta-large")

text = "The government should cut taxes because it is not using them efficiently anyway."

tokens = tokenizer(text, return_tensors="pt")
output = model(**tokens)
logits = output.logits

political_leaning = argmax(logits, dim=1).item()
probabilities = softmax(logits, dim=1)
score = probabilities[0, political_leaning].item()
print(political_leaning, score)
```

## Evaluation

The following table displays the performance of the model on the test sets (15 %) of the datasets used for training.

| dataset                          | accuracy | \\(F_1\\) score |
|:---------------------------------|:---------|:----------------|
| Article bias prediction          | 89       | 89              |
| BIGNEWSBLN                       | 88.6     | 88.6            |
| CommonCrawl news articles        | 88.9     | 88.9            |
| Dem., rep. party platform topics | 85.5     | 85.6            |
| GPT-4 political bias             | 87       | 86.9            |
| GPT-4 political ideologies       | 99.6     | 99.6            |
| Media political stance           | 91.6     | 93.1            |
| Political podcasts               | 99.8     | 99.8            |
| Political tweets                 | 82.1     | 82.1            |
| Qbias                            | 58       | 57.9            |
| **average**                      | **87**   | **87.2**        |
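
The reported \\(F_1\\) scores are weighted averages over the three classes: each class's \\(F_1\\) is weighted by its share of the test set. A minimal pure-Python sketch of the metric on hypothetical labels (a real evaluation would use a library such as scikit-learn):

```python
def weighted_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Per-class F1 averaged with weights proportional to class support."""
    total = len(y_true)
    score = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        support = tp + fn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * support / total
    return score


# Hypothetical labels: 0 = left, 1 = center, 2 = right.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.678
```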

The following is an example of a confusion matrix, obtained by evaluating the model on a test set from the CommonCrawl
news articles dataset.

<img src="confusion_matrix.svg" alt="a confusion matrix example" height="350"/>

The complete results of all our measurements are available in the source code repository.

## Training

This model is based on [DeBERTa V3 large](https://huggingface.co/microsoft/deberta-v3-large). All the datasets used for
fine-tuning are listed in the paper, along with a detailed description of the preprocessing, training and evaluation
methodology. In summary, we manually tuned the hyperparameters in a setup designed to maximize performance on unseen
types of text (out-of-distribution), increasing the model's generalization abilities. In this setup, we left one dataset
at a time out of the training sample and used it as the validation set. We then took the resulting optimal
hyperparameters and trained this model on all the available datasets.
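
The leave-one-out procedure described above can be sketched as a loop over the dataset names, holding one out for validation in each round (the names below are illustrative, not the full list from the paper):

```python
def leave_one_out_splits(names):
    """Yield (train, validation) splits, holding out one dataset per round."""
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield train, held_out


# Illustrative subset of dataset names.
datasets = ["article_bias_prediction", "bignewsbln", "qbias"]

for train, validation in leave_one_out_splits(datasets):
    print(f"train on {train}, validate on {validation}")
```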

## Authors

- Matous Volf ([[email protected]](mailto:[email protected])),
  [DELTA – High school of computer science and economics](https://www.delta-skola.cz), Pardubice, Czechia
- Jakub Simko ([[email protected]](mailto:[email protected])),
  [Kempelen Institute of Intelligent Technologies](https://kinit.sk), Bratislava, Slovakia

## Citation

### BibTeX

```bibtex
@article{volf-simko-2025-political-leaning,
  title = {Predicting political leaning and politicalness of text using transformer models},
  author = {Volf, Matous and Simko, Jakub},
  year = 2025,
  institution = {DELTA – High school of computer science and economics, Pardubice, Czechia; Kempelen Institute of Intelligent Technologies, Bratislava, Slovakia}
}
```
356
+
357
+ ### APA
358
+
359
+ Volf, M. and Simko, J. (2025). Predicting political leaning and politicalness of text using transformer models. DELTA –
360
+ High school of computer science and economics, Pardubice, Czechia; Kempelen Institute of Intelligent Technologies,
361
+ Bratislava, Slovakia.