arXiv:2508.03990

Are Today's LLMs Ready to Explain Well-Being Concepts?

Published on Aug 6 · Submitted by Bohan-Jiang on Aug 8

Abstract

AI-generated summary: LLMs can be fine-tuned to generate high-quality, audience-tailored explanations of well-being concepts using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).

Well-being encompasses mental, physical, and social dimensions essential to personal growth and informed life decisions. As individuals increasingly consult Large Language Models (LLMs) to understand well-being, a key challenge emerges: Can LLMs generate explanations that are not only accurate but also tailored to diverse audiences? High-quality explanations require both factual correctness and the ability to meet the expectations of users with varying expertise. In this work, we construct a large-scale dataset comprising 43,880 explanations of 2,194 well-being concepts, generated by ten diverse LLMs. We introduce a principle-guided LLM-as-a-judge evaluation framework, employing dual judges to assess explanation quality. Furthermore, we show that fine-tuning an open-source LLM using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) can significantly enhance the quality of generated explanations. Our results reveal: (1) the proposed LLM judges align well with human evaluations; (2) explanation quality varies significantly across models, audiences, and categories; and (3) DPO- and SFT-fine-tuned models outperform their larger counterparts, demonstrating the effectiveness of preference-based learning for specialized explanation tasks.
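
The evaluation code is not attached to this page, so the snippet below is only a rough sketch of what a principle-guided, dual-judge scoring loop can look like in Python. The judge models (gpt-4o / gpt-4o-mini), the rubric dimensions beyond Depth and Utility, and the prompt wording are illustrative assumptions, not the authors' protocol.

```python
# Sketch of a dual-judge, rubric-based scoring loop (not the authors' code).
import json
import statistics
from openai import OpenAI

client = OpenAI()

# Hypothetical principle-guided rubric; Depth and Utility appear in the paper's
# findings, the remaining dimensions are placeholders.
DIMENSIONS = ["Accuracy", "Clarity", "Depth", "Utility", "Audience Fit"]
JUDGE_MODELS = ["gpt-4o", "gpt-4o-mini"]  # stand-ins for the two judge LLMs

def judge_explanation(concept: str, audience: str, explanation: str) -> dict:
    """Score one explanation with both judges and average each dimension."""
    prompt = (
        f"You are judging an explanation of the well-being concept '{concept}' "
        f"written for a {audience} audience.\n\n"
        f"Explanation:\n{explanation}\n\n"
        f"Rate each dimension from 1 (poor) to 5 (excellent): {', '.join(DIMENSIONS)}. "
        "Answer with a JSON object mapping each dimension name to an integer score."
    )
    per_judge_scores = []
    for model in JUDGE_MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            response_format={"type": "json_object"},
        )
        per_judge_scores.append(json.loads(resp.choices[0].message.content))
    # Average the two judges per dimension (the "dual judges" idea).
    return {d: statistics.mean(s[d] for s in per_judge_scores) for d in DIMENSIONS}
```

Swapping in the paper's actual principles and judge prompts would only change the rubric and prompt text; the dual-judge averaging structure stays the same.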

Community

As people increasingly turn to LLMs for guidance on mental, physical, and social well-being, we ask: Can these models explain well-being concepts accurately and appropriately for different audiences?

What We Did:

  1. Built the first large-scale dataset of 43,880 explanations for 2,194 well-being concepts from 10 diverse LLMs.
  2. Designed a principle-guided LLM-as-a-judge framework with audience-specific evaluation criteria.
  3. Fine-tuned an open-source model using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to boost explanation quality (a minimal training sketch follows this list).
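
Training code is not linked from this page, so the following is only a minimal sketch of the SFT-then-DPO recipe using Hugging Face TRL (assumed ≥ 0.12). The base model name, the JSONL file names, the dataset columns, and the hyperparameters are placeholders, not the paper's configuration.

```python
# Minimal SFT -> DPO sketch with Hugging Face TRL; the model, files, and
# hyperparameters are placeholders, not the paper's released setup.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-source base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stage 1: supervised fine-tuning on high-quality explanations
# (hypothetical JSONL with a single "text" column).
sft_data = load_dataset("json", data_files="sft_explanations.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-wellbeing", num_train_epochs=1),
    train_dataset=sft_data,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: DPO on judge-ranked preference pairs
# (hypothetical JSONL with "prompt", "chosen", "rejected" columns).
dpo_data = load_dataset("json", data_files="dpo_preferences.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    ref_model=None,  # TRL builds a frozen reference copy when None
    args=DPOConfig(output_dir="dpo-wellbeing", beta=0.1, num_train_epochs=1),
    train_dataset=dpo_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```

The key design point is that the DPO stage starts from the SFT checkpoint and learns only from judge-preferred versus rejected explanation pairs, which is how a smaller open model can close the gap to larger ones on this task.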

Key Findings:

  1. Large models generally perform better, but fine-tuned smaller models can match or outperform them on specialized tasks.
  2. LLMs produce more reliable explanations for the general public than for domain experts.
  3. All models share weaknesses in practical advice (Utility) and depth of analysis (Depth).

[Overview figure: overview.png]
