arxiv:2505.03052

Teaching Models to Understand (but not Generate) High-risk Data

Published on May 5 · Submitted by mattf1n on May 7

Abstract

Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing their generation of it (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.

Community

Paper author and submitter

When training a language model, you can either throw out toxic text in your data and get worse performance, or keep it and get undesirable behavior at deployment time. But there is a third option: we introduce a training method (SLUNG) that teaches models to understand high-risk text (like toxicity) without teaching them to generate it. Our method uses loss masking to prevent toxic predictions while still allowing the model to condition non-toxic generations on toxic context.
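
For intuition, here is a minimal sketch of what this kind of loss masking could look like in PyTorch. It is not the paper's code: the function name `slung_style_loss` and the `high_risk_mask` argument are illustrative, and token-level risk labels are assumed to come from some upstream annotator or classifier, which is orthogonal to this sketch.

```python
import torch
import torch.nn.functional as F

def slung_style_loss(logits, input_ids, high_risk_mask):
    """
    Next-token prediction loss that skips high-risk target tokens.

    logits:         (batch, seq_len, vocab) model outputs
    input_ids:      (batch, seq_len) token ids
    high_risk_mask: (batch, seq_len) bool, True where a token is high-risk
                    (assumed to be produced by an external annotator/classifier)
    """
    # Standard causal shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_risk = high_risk_mask[:, 1:]

    # Per-token cross-entropy, no reduction yet.
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    ).view(shift_labels.shape)

    # Zero out positions whose *target* token is high-risk: the model is
    # never rewarded for generating those tokens, but they remain in the
    # input, so predicting the low-risk tokens that follow still requires
    # attending to (i.e., understanding) the high-risk context.
    keep = (~shift_risk).float()
    return (loss * keep).sum() / keep.sum().clamp(min=1.0)
```

The key point is that high-risk tokens are removed only from the loss, not from the context window, so gradients from later low-risk targets still flow through the model's representations of the high-risk span.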
