study_news_detection_german

This model is a fine-tuned version of deepset/gelectra-large that identifies texts containing a reference to a research result. It was trained on a dataset of 4200 annotated German print news articles created specifically for this task.

Model description

The model was trained to detect texts that mention at least one individual research result (a scientific study, e.g. a journal paper). Given (journalistic) texts, it produces a binary classification: texts are classified as either 1 (mentions a research result) or 0 (does not mention a research result). The model was developed as part of a research project at Karlsruhe Institute of Technology, which investigated journalistic coverage of individual research results. In the same project, a similar model was trained for the same task on English news articles (study_news_detection_english), along with two models fine-tuned to identify journal names in German and English news articles (journal_identification_german and journal_identification_english).

  • Model type: text classification
  • Language: German
  • Finetuned from: deepset/gelectra-large
  • Supported by: The author acknowledges support by the state of Baden-Württemberg through bwHPC.

Intended uses & limitations

The intended use of this model is to enable large-scale analyses of public communication about scientific research. While it has some utility on its own, it is primarily intended to be part of a larger analysis pipeline in which it serves as the first filtering step. For example, it was used to identify texts that contain a reference to a research result in a large corpus of more than 600k German news articles. These texts were the basis for several subsequent analyses that examined patterns in the journalistic coverage of research results (e.g. with regards to source or event selection).

How to use

You can use this model with a Transformers pipeline for text classification:

>>> from transformers import pipeline
>>> classifier = pipeline('text-classification', model='nikoprom/study_news_detection_german')
>>> text = ['''Gute Freunde sind nicht nur das Schönste, was es gibt auf der Welt,
... sie schenken uns sogar ein längeres Leben:
... Forscher fanden heraus, dass ältere Menschen besonders lange gesund waren,
... wenn sie im Alltag Freunde um sich rum hatten.''']
>>> classifier(text)
[{'label': 'LABEL_1', 'score': 0.9879398345947266}]

Texts passed to the model should be complete news articles, or at least full paragraphs, since this is the setting in which the model was fine-tuned.
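Since the model expects article- or paragraph-level input, longer documents can be pre-split into paragraphs before classification. A minimal sketch of such a helper (the function name and the blank-line convention are assumptions, not part of the original pipeline):

```python
def split_paragraphs(article: str) -> list[str]:
    """Split an article into paragraphs on blank lines (hypothetical helper)."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]
```

Each resulting paragraph can then be passed to the classifier individually.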

Limitations

The model was developed for a very narrow use case in a research project and fine-tuned on a rather small dataset with texts from a very specific context (see below). As a consequence, its performance could be much worse when applied to texts from other domains (e.g. types of texts other than news articles, texts from other periods of time).

In addition, a very specific definition of research results was used in the creation of the training data. This definition excludes "studies" that were conducted by institutions that are not part of the science system and/or primarily out of political or economic interest (e.g. political polls, consumer surveys by market research companies).

Training data

The training data was created as part of a larger manual content analysis that investigated the coverage of research results in print media from three countries (Germany, UK, US). The dataset used for this model contained 4200 articles retrieved from a press database with a broad search string that included several research- or science-related terms. The articles were published in 31 different German media outlets over six years (2001, 2010, 2017-2020). Each article was classified by four human coders as containing no (label 0) or at least one (label 1) reference to a research result. Intercoder reliability was satisfactory (average pairwise agreement: 93.9 %, Krippendorff's alpha: 0.72). The label distribution is rather imbalanced: only 14.3 % of all articles were classified as containing at least one reference to a research result.
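Average pairwise agreement, as reported above, is computed by comparing every pair of coders and averaging their per-item agreement. A small sketch with made-up codings (not the original data; Krippendorff's alpha would additionally require a chance correction, e.g. via the `krippendorff` package):

```python
from itertools import combinations

def avg_pairwise_agreement(codings: list[list[int]]) -> float:
    """Mean share of items on which each pair of coders assigned the same label."""
    pairs = list(combinations(codings, 2))
    per_pair = [
        sum(a == b for a, b in zip(c1, c2)) / len(c1)
        for c1, c2 in pairs
    ]
    return sum(per_pair) / len(pairs)

# three hypothetical coders labelling four articles
coders = [[0, 1, 0, 0], [0, 1, 0, 1], [0, 1, 0, 0]]
agreement = avg_pairwise_agreement(coders)  # 5/6 for this toy data
```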

Training procedure

All texts were cleaned to remove some frequent formatting errors present in the original articles (e.g. Ã¤ instead of ä). 64 % of the texts (2688) were used for training, 16 % (672) for validation, and 20 % (840) for testing. The texts were tokenized using the WordPiece tokenizer corresponding to the model (vocabulary size 31,102, no lowercasing, with padding and truncation). The model was then fine-tuned using TensorFlow on two NVIDIA Tesla V100-SXM2-32GB GPUs on the bwUniCluster 2.0. The learning rate was chosen after comparing three values (5e-6, 1e-5, 2e-5) and selecting the one that maximized accuracy on the validation set. For the final model, all texts from the training and validation sets (3360 texts) were used for training.
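The 64/16/20 split can be sketched as a simple shuffled index split (the seed is arbitrary; the original assignment of articles to splits is not published):

```python
import random

random.seed(42)  # arbitrary seed, not the one used in the project

indices = list(range(4200))
random.shuffle(indices)

n_train, n_val = 2688, 672            # 64 % and 16 % of 4200
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]  # remaining 20 % (840 articles)
```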

Training hyperparameters

The following hyperparameters were used during training:

  • Batch size: 8
  • Number of epochs: 5
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate: 5e-6
  • Warmup steps: 294
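For the final model (3360 texts, batch size 8, 5 epochs), the 294 warmup steps correspond to 14 % of the total number of training steps; reading the value as a fraction of total steps is an inference from the reported numbers, not stated in the original card:

```python
import math

n_train, batch_size, epochs = 3360, 8, 5
steps_per_epoch = math.ceil(n_train / batch_size)  # 420
total_steps = steps_per_epoch * epochs             # 2100
warmup_steps = 294
warmup_fraction = warmup_steps / total_steps       # 0.14
```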

Framework versions

  • Transformers 4.32.0
  • TensorFlow 2.14.0
  • Datasets 2.12.0
  • Tokenizers 0.13.3

Evaluation

The model was evaluated with a test set of 840 articles. To get a binary classification from the model output (which consists of a value between 0 and 1 for each text), a decision threshold has to be chosen. A threshold of 0.5 gives the following results:

  • Accuracy: 0.9167
  • Precision: 0.7107
  • Recall: 0.7107
  • F1: 0.7107

Confusion matrix:

               Predicted label 0   Predicted label 1
True label 0                 684                  35
True label 1                  35                  86
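The reported metrics follow directly from the confusion matrix:

```python
# cell values from the confusion matrix above
tn, fp, fn, tp = 684, 35, 35, 86

accuracy = (tp + tn) / (tp + tn + fp + fn)          # ≈ 0.9167
precision = tp / (tp + fp)                          # ≈ 0.7107
recall = tp / (tp + fn)                             # ≈ 0.7107
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.7107
```

Precision, recall, and F1 coincide here because the numbers of false positives and false negatives happen to be equal (35 each).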

Other threshold values give slightly different results. If we vary the threshold in steps of 0.05 between 0.1 and 0.9, the maximum F1 score is achieved with a value of 0.8:

  • Precision: 0.7757
  • Recall: 0.6860
  • F1: 0.7281
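Such a threshold sweep can be sketched as follows. It assumes `probs` holds class-1 probabilities (e.g. obtained by calling the pipeline with `top_k=None` and reading the LABEL_1 score); the function names and the toy data are hypothetical:

```python
def f1_at_threshold(probs: list[float], labels: list[int], threshold: float) -> float:
    """F1 score for class 1 when probabilities are binarized at `threshold`."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs: list[float], labels: list[int]) -> float:
    """Threshold between 0.1 and 0.9 (steps of 0.05) that maximizes F1."""
    thresholds = [round(i * 0.05, 2) for i in range(2, 19)]  # 0.10 … 0.90
    return max(thresholds, key=lambda t: f1_at_threshold(probs, labels, t))
```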
Model size

  • 336M parameters (F32, Safetensors format)