Abstract
A formal framework and no-regret algorithm are introduced for learning from language feedback, addressing challenges in interactive learning with large language models.
Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, a principled framing of these decision problems is still lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, identify sufficient assumptions that enable learning despite latent rewards, and introduce the transfer eluder dimension as a complexity measure to characterize the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward alone. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
Community
Decision-making with LLMs can be studied with RL! Can an agent efficiently solve a task given only text feedback (from an OS terminal, a compiler, a person)? How can we understand the difficulty? We propose a new notion of learning complexity for studying learning from language feedback alone.
Based on the Eluder Dimension (Russo and Van Roy, 2013) 🎓, we propose the Transfer Eluder Dimension, which captures how efficiently language feedback can reduce uncertainty about rewards. A smaller dim_TE means a single piece of language feedback carries more information about the reward.
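For reference, the standard eluder dimension of Russo and Van Roy (2013) is sketched below in generic notation; the transfer variant builds on it by accounting for what language feedback reveals about the reward class, so please see the paper for the precise definition of dim_TE.

```latex
% Eluder dimension (Russo & Van Roy, 2013); generic notation, not the paper's.
% An action a is \epsilon-dependent on a_1,\dots,a_n w.r.t. a class \mathcal{F} if
% every pair f, f' \in \mathcal{F} that is close on the history,
%     \sum_{i=1}^{n} \bigl(f(a_i) - f'(a_i)\bigr)^2 \le \epsilon^2,
% also agrees at a:  |f(a) - f'(a)| \le \epsilon.  Otherwise a is \epsilon-independent.
% The eluder dimension is the longest chain of actions that keep "surprising" the class:
\dim_E(\mathcal{F}, \epsilon)
  = \max\bigl\{\, d : \exists\, a_1,\dots,a_d \ \text{and}\ \epsilon' \ge \epsilon
      \ \text{s.t. each } a_i \text{ is } \epsilon'\text{-independent of } a_1,\dots,a_{i-1} \,\bigr\}.
```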
Building on this concept, we develop HELiX 🧬 (Hypothesis Elimination using Language-informed Exploration), which achieves a regret bound that scales gracefully with the time horizon T, establishing the first formal connection between no-regret learning and language feedback.
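Here, no-regret is meant in the standard cumulative-regret sense; the sketch below uses a generic bandit-style formulation with a latent reward r*, and the exact formulation in the paper (expectations, policies, feedback protocol) may differ.

```latex
% Cumulative regret after T rounds; r^\star is the latent reward the agent never
% observes directly: it only receives language feedback about its actions.
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Bigl( \max_{a \in \mathcal{A}} r^{\star}(a) - r^{\star}(a_t) \Bigr),
\qquad \text{no-regret:}\ \ \mathrm{Regret}(T)/T \to 0 \ \text{as}\ T \to \infty.
```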
We introduce a meta-algorithm that implements HELiX with LLMs via thinking tokens 🤔💭. The LLM samples parallel thoughts as plausible hypotheses about the world 🌍. We perform pessimistic exploitation via thought consensus and optimistic exploration via action self-rating, as sketched below.
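A minimal, runnable sketch of one such decision step; the helper names (`sample_thoughts`, `rate_action`), the consensus threshold, and the toy stubs are placeholders, not the actual prompts or code from the paper.

```python
"""
Sketch of one HELiX-style decision step with an LLM: sample parallel "thoughts"
as hypotheses, exploit pessimistically when the thoughts agree on a good action
(thought consensus), otherwise explore optimistically via action self-ratings.
"""
from typing import Callable, List


def helix_step(
    sample_thoughts: Callable[[str, int], List[str]],  # prompt, k -> k thoughts (hypotheses)
    rate_action: Callable[[str, str], float],          # thought, action -> self-rating in [0, 1]
    prompt: str,
    actions: List[str],
    k: int = 4,
    consensus_threshold: float = 0.8,
) -> str:
    """Pick one action for the current interaction round."""
    thoughts = sample_thoughts(prompt, k)

    # ratings[i][j] = how good action j looks under thought i.
    ratings = [[rate_action(t, a) for a in actions] for t in thoughts]

    # Pessimistic exploitation: score each action by its worst rating across thoughts.
    # If some action is good under every sampled hypothesis, play it (thought consensus).
    pessimistic = [min(col) for col in zip(*ratings)]
    best_safe = max(range(len(actions)), key=lambda j: pessimistic[j])
    if pessimistic[best_safe] >= consensus_threshold:
        return actions[best_safe]

    # Optimistic exploration: otherwise play the action some thought rates highest,
    # so the resulting language feedback can eliminate wrong hypotheses.
    optimistic = [max(col) for col in zip(*ratings)]
    best_bold = max(range(len(actions)), key=lambda j: optimistic[j])
    return actions[best_bold]


if __name__ == "__main__":
    # Toy stand-ins for an LLM so the sketch runs end to end.
    def toy_thoughts(prompt: str, k: int) -> List[str]:
        return [f"hypothesis-{i}" for i in range(k)]

    def toy_rating(thought: str, action: str) -> float:
        return 0.9 if action == "make" and thought != "hypothesis-0" else 0.3

    print(helix_step(toy_thoughts, toy_rating, "fix the failing build", ["ls", "make", "rm build"]))
```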
We have preliminary results showing that, by leveraging thinking tokens, our algorithm helps LLMs make better decisions: it evolves a set of thoughts and, after evaluating them, conducts efficient exploration.
The connection we establish here between thinking (reasoning) and exploration (RL) only scratches the surface of learning from language via LLMs.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Toward Efficient Exploration by Large Language Model Agents (2025)
- Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits (2025)
- Reinforcement Learning from Multi-level and Episodic Human Feedback (2025)
- Reward Is Enough: LLMs Are In-Context Reinforcement Learners (2025)
- NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning (2025)
- LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities (2025)
- Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds (2025)