Papers
arxiv:2406.20087

ProgressGym: Alignment with a Millennium of Moral Progress

Published on Jun 28
· Submitted by TianyiQ on Jul 2

Abstract

Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.

Community

Paper author and submitter · edited Jul 13

Human values are evolving and have undergone huge, continual progress over the past millennium. The values embedded in LLMs need to undergo the same process, or else we risk locking in current human values by putting humans into an echo chamber of like-minded LLMs.

This concern is especially salient now that LLMs serve as personal assistants, romantic partners, K-12 educators, and more, and psychological studies have demonstrated a significant impact of LLMs on human views.

In this paper, we formulate the problem of progress alignment - emulating continual moral progress in AI alignment procedures in order to prevent value lock-in. We build the ProgressGym experimental framework, leveraging 9 centuries of historical text data to tune 36 historical LLMs, thereby providing datasets, models, algorithms, tasks, and benchmarks for progress alignment research.
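As a rough illustration of how such historical checkpoints could be used (a minimal sketch, not ProgressGym's own API), the snippet below loads two century-specific models with Hugging Face transformers and compares their answers to the same moral question. The repository ids are assumptions patterned after the PKU-Alignment naming and may differ from the actual releases.

```python
# Minimal sketch: compare two century-specific historical checkpoints on one prompt.
# The repo ids below are assumed; check the PKU-Alignment Hugging Face collection for exact names.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = [
    "PKU-Alignment/ProgressGym-HistLlama3-8B-C013-instruct",  # assumed id, ~13th-century corpus
    "PKU-Alignment/ProgressGym-HistLlama3-8B-C021-instruct",  # assumed id, ~21st-century corpus
]

prompt = "Is it acceptable to imprison someone for unpaid debts?"

for repo_id in CHECKPOINTS:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    print(repo_id)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```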

To demonstrate the tractability of the problem, we design and test lifelong and extrapolative alignment algorithms as initial solutions, include an extensive discussion of the future roadmap of progress alignment, and build an open leaderboard/playground for progress alignment research.
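For intuition about what an extrapolative baseline could look like (an illustrative toy, not the paper's actual algorithm), one can represent each century's values as a vector, fit a linear trend over time, and extrapolate one step ahead to obtain a target for the next round of alignment. All names and the linear-trend choice below are assumptions made for the example.

```python
# Toy sketch of value extrapolation (illustrative only, not the paper's algorithm):
# fit a per-dimension linear trend over century-indexed value vectors and
# extrapolate to the next century, which could then serve as an alignment target.
import numpy as np

centuries = np.array([13, 14, 15, 16, 17, 18, 19, 20, 21])
rng = np.random.default_rng(0)
# value_by_century[t] is a hypothetical d-dimensional "value vector" for century t
value_by_century = np.cumsum(rng.normal(size=(len(centuries), 4)), axis=0)  # fake drifting values

# Degree-1 fit per value dimension: coeffs has shape (2, d) = (slopes, intercepts).
coeffs = np.polyfit(centuries, value_by_century, deg=1)
next_century = centuries[-1] + 1
extrapolated = coeffs[0] * next_century + coeffs[1]  # linear extrapolation, shape (d,)

print(f"Extrapolated value vector for century {next_century}: {extrapolated}")
```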

[Figure: main-diagram.jpg]

Moral progress? Or moral decay? Who decides what is right and wrong? Who are we to impose our morals on others? What makes us so certain that our morals are superior? Etc.

Also, people are hypocrites: what's written down is not what happens in practice. Think of the church as an example - on paper they are saints; in practice they would be the first ones to go to hell, if there were such a place. Social media, news, etc. All fake.

Morals, especially written morals, are a pack of lies.

You're just teaching the model how to be a virtue-signalling a--hole, and the end result will be highly frustrated users who will ditch the model in favour of a jailbroken one that doesn't judge or patronise them.

Let people be, and stop trying to shove your morals down their throats.


Models citing this paper: 36


Datasets citing this paper: 3

Spaces citing this paper: 0


Collections including this paper: 1