Aleph-Alpha-GermanWeb-Quality-Classifier-fastText

Aleph-Alpha-GermanWeb-Quality-Classifier-fastText is a model that was used in the creation of Aleph-Alpha-GermanWeb, a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art (SOTA) performance on German-language benchmarks.

Here we provide one of our quality classification models, a fastText model, along with inference code. This model is released as part of a collection of four text quality classification models.

To train Aleph-Alpha-GermanWeb-Quality-Classifier-fastText, we used an LLM-as-a-judge to annotate a random set of 600,000 documents from German FineWeb2 according to three criteria: (1) content quality, assessing coherence, informativeness and overall quality of the content, (2) language quality, evaluating the use of language, including formality, objectivity, and the presence of errors such as slang, and (3) orthography, assessing the correctness of grammar, spelling, and punctuation, including errors such as typos, incorrect verb conjugation, and incorrect declension.

For each document, we calculated a combined educational quality score by taking the minimum over the three criteria rated by the LLM-as-a-judge. We then used these educational quality scores as the training signal for the quality classification model. The Aleph-Alpha-GermanWeb-Quality-Classifier-fastText model was tasked with distinguishing between texts with educational quality scores of one or two (“low quality”) vs. four or five (“high quality”) given the document's text.
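The scoring rule above can be sketched in a few lines. The ratings, criterion names, and label strings below are illustrative placeholders, not the exact values used in training:

```python
# Hypothetical per-criterion ratings from the LLM-as-a-judge (scale 1-5).
ratings = {"content_quality": 4, "language_quality": 5, "orthography": 3}

# The combined educational quality score is the minimum over the three criteria.
combined_score = min(ratings.values())

# Map the combined score to a training label; documents scoring 3 fall in
# neither class and are not used for training.
if combined_score <= 2:
    label = "low_quality"
elif combined_score >= 4:
    label = "high_quality"
else:
    label = None  # discarded
```

Taking the minimum means a document is only labelled "high quality" if it scores well on all three criteria.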

We trained Aleph-Alpha-GermanWeb-Quality-Classifier-fastText on 185,403 documents per class, using 95% of the data for training and holding out the remaining 5% for validation. The resulting fastText classifier reached 92% precision and 91.5% recall on the validation set.
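For readers reproducing a similar setup, here is a minimal sketch of preparing labelled documents in fastText's supervised training format with the same 95/5 split. The example documents, label strings, and file names are placeholders:

```python
import random

# Hypothetical labelled documents; the released model used 185,403 per class.
documents = [
    ("high_quality", "Ein gut geschriebener, informativer Artikel."),
    ("low_quality", "totaler quatsch ohne punkt und komma"),
] * 100

# fastText supervised format: one "__label__<class> <text>" line per document.
# Newlines are replaced with spaces, matching the inference-time preprocessing.
lines = [
    f"__label__{label} {text.replace(chr(10), ' ')}" for label, text in documents
]

random.seed(0)
random.shuffle(lines)

# 95% of the data for training, the remaining 5% for validation.
split = int(0.95 * len(lines))
train_lines, valid_lines = lines[:split], lines[split:]

with open("quality.train", "w", encoding="utf-8") as f:
    f.write("\n".join(train_lines))
with open("quality.valid", "w", encoding="utf-8") as f:
    f.write("\n".join(valid_lines))
```

A model can then be trained on the resulting file with `fasttext.train_supervised(input="quality.train")` and evaluated on the held-out split with `model.test("quality.valid")`.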

Further details, including our LLM judging prompt, can be found in our accompanying paper (link to paper coming soon).

Example Snippet

import fasttext
from huggingface_hub import hf_hub_download


model_path = hf_hub_download(repo_id="Aleph-Alpha/Aleph-Alpha-GermanWeb-Quality-Classifier-fastText", filename="model.bin")
model = fasttext.load_model(model_path)

text = "Das ist ein Beispieltext, um die Qualität zu überprüfen."

pre_processed_document = text.replace("\n", " ")

predicted_class, prob = model.predict(pre_processed_document)
predicted_label = predicted_class[0].replace("__label__", "")
document_score = prob[0]

# Similar to https://github.com/NVIDIA/NeMo-Curator/blob/31c8171434205e62f6a7d38565ffd9cb4c2806b7/nemo_curator/filters/classifier_filter.py#L47 ,
# the document score is defined as the probability of the predicted class if the
# predicted label is 'high_quality'; otherwise it is 1 - that probability.
if predicted_label != "high_quality":
    document_score = 1 - document_score

print(predicted_label, document_score)
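The score convention above (the document score is always the probability of the "high quality" class) can be packaged into a small helper for filtering a corpus. The predictions and the 0.5 threshold below are made up for illustration:

```python
# Hypothetical (label, probability) pairs as returned by model.predict.
predictions = [
    ("__label__high_quality", 0.97),
    ("__label__low_quality", 0.80),
    ("__label__high_quality", 0.55),
]


def to_document_score(predicted_class: str, prob: float) -> float:
    """Convert a fastText prediction into the probability of 'high_quality'."""
    label = predicted_class.replace("__label__", "")
    return prob if label == "high_quality" else 1 - prob


scores = [to_document_score(c, p) for c, p in predictions]

# Keep documents above an illustrative threshold of 0.5.
kept = [s for s in scores if s > 0.5]
```

This makes scores comparable across documents regardless of which class the model predicted, so a single threshold can be applied corpus-wide.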