Are Today's LLMs Ready to Explain Well-Being Concepts?
Abstract
With Supervised Fine-Tuning and Direct Preference Optimization, LLMs can generate high-quality, audience-tailored explanations of well-being concepts.
Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal that: (1) the proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-fine-tuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.
Community
As people increasingly turn to LLMs for guidance on mental, physical, and social well-being, we ask: Can these models explain well-being concepts accurately and appropriately for different audiences?
What We Did:
- Built the first large-scale dataset of 43,880 explanations for 2,194 well-being concepts from 10 diverse LLMs.
- Designed a principle-guided LLM-as-a-judge framework with audience-specific evaluation criteria (see the judging sketch after this list).
- Fine-tuned an open-source model using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to boost explanation quality (see the DPO sketch after this list).
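Below is a minimal sketch of what such a dual-judge evaluation loop could look like. The judge model names, the prompt wording, and the criteria beyond Utility and Depth are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of a principle-guided, dual-judge evaluation loop.
# Judge models, prompt wording, and criteria other than utility/depth are assumptions.
import json
from statistics import mean
from openai import OpenAI

client = OpenAI()

CRITERIA = ["correctness", "clarity", "utility", "depth"]  # utility/depth appear in the paper; the rest are assumed
JUDGES = ["gpt-4o", "gpt-4o-mini"]  # placeholder judge models, not the paper's actual judges

def judge_explanation(concept: str, audience: str, explanation: str) -> dict:
    """Ask each judge to score one explanation per criterion (1-5), then average across judges."""
    prompt = (
        f"You are evaluating an explanation of the well-being concept '{concept}' "
        f"written for a {audience} audience.\n\n"
        f"Explanation:\n{explanation}\n\n"
        f"Score it from 1 to 5 on each criterion: {', '.join(CRITERIA)}. "
        "Return a JSON object mapping each criterion to its score."
    )
    per_judge = []
    for judge in JUDGES:
        resp = client.chat.completions.create(
            model=judge,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        per_judge.append(json.loads(resp.choices[0].message.content))
    # Aggregate the two judges by simple per-criterion averaging.
    return {c: mean(scores[c] for scores in per_judge) for c in CRITERIA}

scores = judge_explanation("resilience", "general public", "Resilience is the ability to ...")
print(scores)
```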
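And here is a minimal sketch of the DPO stage using Hugging Face TRL, assuming preference pairs derived from the judge scores. The base model, hyperparameters, and example pair are placeholders rather than the paper's actual configuration.

```python
# A minimal DPO sketch with Hugging Face TRL; model, hyperparameters, and data are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-source base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: a judge-preferred explanation as "chosen", a lower-scored one as "rejected".
pairs = Dataset.from_list([
    {
        "prompt": "Explain 'mindfulness' to a domain expert.",
        "chosen": "Mindfulness refers to non-judgmental, present-moment attention ...",
        "rejected": "Mindfulness is just relaxing and not thinking about anything ...",
    },
])

config = DPOConfig(
    output_dir="wellbeing-dpo",
    beta=0.1,                        # strength of the preference regularization
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
# Older TRL versions take tokenizer= instead of processing_class=.
trainer = DPOTrainer(model=model, args=config, train_dataset=pairs, processing_class=tokenizer)
trainer.train()
```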
Key Findings:
- Large models generally perform better, but fine-tuned smaller models can match or outperform them on specialized tasks.
- LLMs are more reliable when explaining to the general public than to domain experts.
- All models share weaknesses in practical advice (Utility) and depth of analysis (Depth).
This is an automated message from Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations (2025)
- Distilling Empathy from Large Language Models (2025)
- InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating (2025)
- Automating Expert-Level Medical Reasoning Evaluation of Large Language Models (2025)
- An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models (2025)
- Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study (2025)
- Teaching Language Models To Gather Information Proactively (2025)