Aleph-Alpha-GermanWeb-Quality-Classifier-BERT
Aleph-Alpha-GermanWeb-Quality-Classifier-BERT is a model that was used in the creation of Aleph-Alpha-GermanWeb, a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art performance on German-language benchmarks.
Here we provide one of our quality classification models, based on a BERT backbone, along with inference code. This model is released as part of a collection of four text quality classification models.
To train Aleph-Alpha-GermanWeb-Quality-Classifier-BERT, we used an LLM-as-a-judge to annotate a random set of 600,000 documents from German FineWeb2 according to three criteria: (1) content quality, assessing coherence, informativeness and overall quality of the content, (2) language quality, evaluating the use of language, including formality, objectivity, and the presence of errors such as slang, and (3) orthography, assessing the correctness of grammar, spelling, and punctuation, including errors such as typos, incorrect verb conjugation, and incorrect declension.
For each document, we calculated a combined educational quality score by taking the minimum over the three criteria rated by the LLM-as-a-judge. We then used these educational quality scores as the training signal for the quality classification model. The Aleph-Alpha-GermanWeb-Quality-Classifier-BERT model was tasked with predicting the educational quality scores given the first 512 tokens of the document's text.
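The score combination described above can be illustrated with a minimal sketch (not the released annotation code; the ratings below are made up for illustration):

```python
# Hypothetical LLM-as-a-judge ratings for one document, each on a 1-5 scale.
judge_ratings = {
    "content_quality": 4,   # coherence, informativeness, overall quality
    "language_quality": 3,  # formality, objectivity, absence of slang
    "orthography": 5,       # grammar, spelling, punctuation
}

# The combined educational quality score is the minimum over the three
# criteria, so a document only scores highly if it does well on all of them.
educational_quality_score = min(judge_ratings.values())
print(educational_quality_score)  # → 3
```

Taking the minimum rather than the mean makes the label conservative: a single weak criterion caps the overall score.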
We trained Aleph-Alpha-GermanWeb-Quality-Classifier-BERT using up to 75,000 documents from each class, training the model on 95% of this dataset to predict scores from one to five. The model achieved an overall accuracy of 42% and a macro-average accuracy of 46% when evaluated on the remaining 5% of the data, which served as the validation set.
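The dataset construction above (per-class cap, then a 95/5 train/validation split) can be sketched as follows; this is a hypothetical reconstruction, not the released training code, and `build_splits` is an illustrative helper:

```python
import random

MAX_PER_CLASS = 75_000      # cap: up to 75,000 documents per score class
HOLDOUT_FRACTION = 0.05     # 5% of each class held out for validation


def build_splits(documents, rng=random.Random(0)):
    """documents: list of (text, score) pairs with scores in 1..5."""
    by_class = {score: [] for score in range(1, 6)}
    for text, score in documents:
        by_class[score].append((text, score))

    train, val = [], []
    for score, docs in by_class.items():
        rng.shuffle(docs)
        docs = docs[:MAX_PER_CLASS]                  # apply per-class cap
        n_val = int(len(docs) * HOLDOUT_FRACTION)
        val.extend(docs[:n_val])                     # 5% validation
        train.extend(docs[n_val:])                   # 95% training
    return train, val
```

Splitting within each class keeps the validation set's class balance close to the (capped) training distribution.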
Further details, including our LLM judging prompt, can be found in our accompanying paper (link to paper coming soon).
Example Snippet
import torch
from transformers import BertTokenizer, BertForSequenceClassification

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Load the quality classifier (5 output classes, one per quality score)
model = BertForSequenceClassification.from_pretrained(
    "Aleph-Alpha/Aleph-Alpha-GermanWeb-Quality-Classifier-BERT", num_labels=5
).to(device)
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# disclaimer: short text is not in the model distribution
text = 'Das ist ein Beispieltext, um die Qualität zu überprüfen.'

target_names = ['Quality Score 1', 'Quality Score 2', 'Quality Score 3', 'Quality Score 4', 'Quality Score 5']

with torch.no_grad():
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
    prediction = torch.argmax(model(**inputs).logits).item()

print(target_names[prediction])
Model tree for Aleph-Alpha/Aleph-Alpha-GermanWeb-Quality-Classifier-BERT
Base model: google-bert/bert-base-uncased