German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
Abstract
A large-scale German dataset and model for readability-controlled paraphrasing are introduced, achieving state-of-the-art performance in text simplification.
The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored to diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned, readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
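As a rough illustration of how five-level readability control might be driven at inference time, here is a minimal sketch. The level names and the prompt format below are assumptions for illustration only; they are not taken from the paper or the released model, which may use its own control scheme.

```python
# Hypothetical sketch of readability-level prompt construction.
# The level labels and prompt wording are assumptions, not the
# actual control scheme used by the German4All model.
from typing import Dict

READABILITY_LEVELS: Dict[int, str] = {
    1: "easy-to-read language for people with reading difficulties",
    2: "plain, simplified German",
    3: "everyday standard German",
    4: "formal written German",
    5: "academic-level German",
}

def build_prompt(paragraph: str, level: int) -> str:
    """Compose a readability-controlled paraphrasing prompt (illustrative only)."""
    if level not in READABILITY_LEVELS:
        raise ValueError(f"level must be one of {sorted(READABILITY_LEVELS)}, got {level}")
    return (
        f"Paraphrase the following German paragraph at readability level "
        f"{level} ({READABILITY_LEVELS[level]}):\n\n{paragraph}"
    )
```

A prompt built this way could then be fed to any seq2seq or instruction-tuned model; the actual German4All model may instead rely on dedicated control tokens or a fixed instruction template.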
Community
The paper introduces German4All, the first large-scale German dataset for readability-controlled paraphrasing. It contains over 25,000 Wikipedia-based paragraph samples paraphrased by GPT-4 into five distinct complexity levels, ranging from easy-to-read language for people with reading difficulties to academic-level German.
I have accidentally pretrained a new German T5 model from scratch (see https://huggingface.co/GermanT5/occiglot5), so maybe it is also worth trying out :)
Stark
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai (2025)
- MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models (2025)
- Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions (2025)
- MATA (m=ata): Mindful Assessment of the Telugu Abilities of Large Language Models (2025)
- WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia (2025)
- JUDGEBERT: Assessing Legal Meaning Preservation Between Sentences (2025)
- Evaluating LLMs on Chinese Idiom Translation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend