Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT
Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT is a model that was used in the creation of Aleph-Alpha-GermanWeb, a new German-language dataset that combines heuristic and model-based filtering techniques with synthetic data generation to achieve state-of-the-art performance on German-language benchmarks.
Here we provide one of our quality classification models, based on a BERT backbone, along with inference code. This model is released as part of a collection of four text quality classification models.
To build the training data for Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT, we used LanguageTool to annotate a random subset of 400,000 German FineWeb2 documents with the DE_AGREEMENT rule, which identifies text passages with grammatical agreement errors. From the annotated documents, we randomly selected 75,000 without any identified grammar mistakes as high-quality examples and 75,000 containing at least one identified grammar error as low-quality examples.
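The selection step described above can be illustrated with a minimal sketch. It assumes the LanguageTool annotation has already been reduced to a count of DE_AGREEMENT matches per document. The function name `build_balanced_training_set` and the `(text, n_errors)` input format are hypothetical and are not taken from the authors' pipeline. Labels follow the card's inference snippet, where index 0 is "Low Quality" and index 1 is "High Quality".

```python
import random


def build_balanced_training_set(docs, n_per_class, seed=0):
    """Sample an equal number of clean and flawed documents.

    docs: list of (text, n_errors) pairs, where n_errors is the assumed
          number of DE_AGREEMENT matches found in the document.
    Returns a shuffled list of (text, label) pairs with
    label 1 = high quality (no matches) and 0 = low quality (>= 1 match).
    """
    rng = random.Random(seed)
    clean = [text for text, n_errors in docs if n_errors == 0]
    flawed = [text for text, n_errors in docs if n_errors >= 1]

    # random.sample raises ValueError if a class has fewer than
    # n_per_class documents, which surfaces an undersized pool early.
    high = rng.sample(clean, n_per_class)
    low = rng.sample(flawed, n_per_class)

    examples = [(text, 1) for text in high] + [(text, 0) for text in low]
    rng.shuffle(examples)
    return examples
```

With the numbers from this card, `n_per_class` would be 75,000, giving 150,000 labelled documents in total.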
We trained Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT on 95% of the data to distinguish the high-quality from the low-quality examples and used the remaining 5% for validation, reaching a precision of 67% and a recall of 66% on the validation set.
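For readers who want to reproduce this kind of evaluation, here is a minimal sketch of a 95/5 split and a precision/recall computation using only the standard library. The card does not state which class is treated as positive or whether the reported figures are averaged over classes, so the sketch assumes the high-quality label (1) is the positive class. Both function names are illustrative.

```python
import random


def train_validation_split(examples, validation_fraction=0.05, seed=0):
    """Shuffle the examples and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class (assumed: 1 = high quality)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Applied to the 150,000 labelled documents described above, a 5% hold-out corresponds to 7,500 validation documents and 142,500 training documents.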
Further details can be found in our accompanying paper (link to paper coming soon).
Example Snippet
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Load the fine-tuned classifier and the tokenizer of its base model
model = BertForSequenceClassification.from_pretrained("Aleph-Alpha/Aleph-Alpha-GermanWeb-Grammar-Classifier-BERT", num_labels=2).to(device)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Note: short texts like this one are outside the model's training distribution
text = 'Das ist ein Beispieltext, um die Grammatik zu überprüfen.'
target_names = ['Low Quality', 'High Quality']  # class index 0 and 1

# Tokenize, run a forward pass without gradients, and take the highest-scoring class
inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True).to(device)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()
print(target_names[prediction])
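The snippet above prints only the predicted label. Because the classifier reaches a validation precision of 67%, it may help to inspect how confident a prediction is. The following is a minimal sketch, assuming the two logits have been converted to a plain Python list (for example with `logits[0].tolist()` on the output of `model(**inputs).logits`). The helper names and the example logit values are illustrative and are not part of the released inference code.

```python
import math

TARGET_NAMES = ['Low Quality', 'High Quality']


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the predicted label and its softmax probability."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return TARGET_NAMES[index], probs[index]


# Made-up logits for illustration only, not real model output
label, confidence = label_with_confidence([-1.0, 2.0])
print(f'{label} ({confidence:.2f})')
```

A softmax probability is not a calibrated estimate of correctness, so any threshold applied to it would need to be checked on held-out data.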
Base model: google-bert/bert-base-uncased