MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
Abstract
MiCRo, a two-stage framework, improves personalized preference learning for large language models by leveraging binary preference datasets and dynamically adapting mixture weights based on context, effectively capturing diverse human preferences.
Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function and fails to capture the inherently diverse and heterogeneous nature of human preferences. This oversimplification prevents LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution over diverse subgroups, a single BT model suffers an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts the mixture weights based on the specific context to resolve ambiguity, enabling efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.
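As a rough illustration of the two-stage idea described in the abstract, the sketch below trains a mixture of Bradley-Terry reward heads on binary preference pairs, with mixture weights produced by a context-conditioned router. This is a minimal sketch under assumptions of our own (a shared encoder providing embeddings, a linear router, K scalar reward heads); class and variable names such as MixtureRewardModel, reward_heads, and router are illustrative and not taken from the paper.

```python
# Hedged sketch of a context-aware Bradley-Terry mixture (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureRewardModel(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int = 4):
        super().__init__()
        # K scalar reward heads on top of shared encoder embeddings.
        self.reward_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_heads)]
        )
        # Router maps a prompt/context embedding to mixture weights over the heads.
        self.router = nn.Linear(hidden_dim, num_heads)

    def rewards(self, response_emb: torch.Tensor) -> torch.Tensor:
        # Returns (batch, K) per-head rewards for one response embedding.
        return torch.cat([head(response_emb) for head in self.reward_heads], dim=-1)

    def forward(self, context_emb, chosen_emb, rejected_emb):
        weights = F.softmax(self.router(context_emb), dim=-1)           # (batch, K)
        margin = self.rewards(chosen_emb) - self.rewards(rejected_emb)  # (batch, K)
        # Per-head Bradley-Terry probability that the chosen response is preferred.
        head_probs = torch.sigmoid(margin)
        # Mixture likelihood of the observed binary preference; train by minimizing its NLL.
        mix_prob = (weights * head_probs).sum(dim=-1).clamp_min(1e-8)
        return -mix_prob.log().mean()
```

In a second stage, one could freeze the reward heads and update only the router on a small amount of user- or context-specific preference data, which corresponds to the abstract's description of adapting mixture weights online with minimal additional supervision.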
Community
The following papers were recommended by the Semantic Scholar API
- LoRe: Personalizing LLMs via Low-Rank Reward Modeling (2025)
- Energy-Based Reward Models for Robust Language Model Alignment (2025)
- Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment (2025)
- PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model (2025)
- Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment (2025)
- Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes (2025)
- MOSLIM: Align with diverse preferences in prompts through reward classification (2025)