Role-Playing Evaluation for Large Language Models
Abstract
A benchmark called Role-Playing Eval assesses Large Language Models in role-playing across emotional understanding, decision-making, moral alignment, and in-character consistency.
Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at https://github.com/yelboudouri/RPEval.
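As a rough illustration of what scoring a model per dimension could look like, the sketch below shows one possible evaluation loop. The item fields, the example prompt, and the query_model stub are hypothetical placeholders, not RPEval's actual schema or scoring rules; the real implementation is in the repository linked above.

```python
# Minimal sketch of a per-dimension scoring loop (hypothetical item format).
from collections import defaultdict

DIMENSIONS = ("emotional_understanding", "decision_making",
              "moral_alignment", "in_character_consistency")

# Hypothetical items for illustration only; RPEval defines its own format.
items = [
    {"dimension": "decision_making",
     "persona": "A cautious museum night guard",
     "prompt": ("An alarm goes off in the east wing. Do you (A) investigate "
                "alone or (B) call for backup first? Answer with A or B."),
     "reference": "B"},
]

def query_model(persona: str, prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    return "B"  # stubbed response for illustration

# Score each item as correct/incorrect and aggregate by dimension.
scores = defaultdict(list)
for item in items:
    answer = query_model(item["persona"], item["prompt"]).strip().upper()
    scores[item["dimension"]].append(float(answer.startswith(item["reference"])))

for dim in DIMENSIONS:
    if scores[dim]:
        print(f"{dim}: {sum(scores[dim]) / len(scores[dim]):.2f}")
```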
Community
Hey everyone,
We've put together a benchmark that evaluates LLMs on their role-playing capabilities. We're now building a leaderboard that covers both open-source and proprietary models. So far, we've evaluated 8 different models using the RPEval method introduced in this paper.
If there's a specific model you'd like us to include, or if you have suggestions to improve the evaluation, feel free to share them!
This is an automated message from the Librarian Bot. I found the following papers similar to this one.
The following papers were recommended by the Semantic Scholar API:
- ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents (2025)
- PsyMem: Fine-grained psychological alignment and Explicit Memory Control for Advanced Role-Playing LLMs (2025)
- KnowsLM: A framework for evaluation of small language models for knowledge augmentation and humanised conversations (2025)
- MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators (2025)
- Sample-Efficient Language Model for Hinglish Conversational AI (2025)
- Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts (2025)
- IMPersona: Evaluating Individual Level LM Impersonation (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend