German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
Abstract
A large-scale German dataset and model for readability-controlled paraphrasing are introduced, achieving state-of-the-art performance in text simplification.
The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored to diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned, readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.
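As a rough illustration of how five-level readability control might be driven at inference time, here is a minimal sketch. The level names and the prompt format below are assumptions for illustration only; they are not taken from the paper or the released model, which may use its own control scheme.

```python
# Hypothetical sketch of readability-level prompt construction.
# The level labels and prompt wording are assumptions, not the
# actual control scheme used by the German4All model.
from typing import Dict

READABILITY_LEVELS: Dict[int, str] = {
    1: "easy-to-read language for people with reading difficulties",
    2: "plain, simplified German",
    3: "everyday standard German",
    4: "formal written German",
    5: "academic-level German",
}

def build_prompt(paragraph: str, level: int) -> str:
    """Compose a readability-controlled paraphrasing prompt (illustrative only)."""
    if level not in READABILITY_LEVELS:
        raise ValueError(f"level must be one of {sorted(READABILITY_LEVELS)}, got {level}")
    return (
        f"Paraphrase the following German paragraph at readability level "
        f"{level} ({READABILITY_LEVELS[level]}):\n\n{paragraph}"
    )
```

A prompt built this way could then be fed to any seq2seq or instruction-tuned model; the actual German4All model may instead rely on dedicated control tokens or a fixed instruction template.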
Community
The paper introduces German4All, the first large-scale German dataset for readability-controlled paraphrasing. It contains over 25,000 Wikipedia-based paragraph samples paraphrased by GPT-4 into five distinct complexity levels, ranging from easy-to-read language for people with reading difficulties to academic-level German.
I have accidentally pretrained a new German T5 model from scratch (see https://huggingface.co/GermanT5/occiglot5), so maybe it is also worth trying out :)
Stark
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai (2025)
- MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models (2025)
- Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions (2025)
- MATA (m=ata): Mindful Assessment of the Telugu Abilities of Large Language Models (2025)
- WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia (2025)
- JUDGEBERT: Assessing Legal Meaning Preservation Between Sentences (2025)
- Evaluating LLMs on Chinese Idiom Translation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend