Abstract
A formal framework and no-regret algorithm are introduced for learning from language feedback, addressing challenges in interactive learning with large language models.
Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. While impressive empirical demonstrations have been shown, a principled framing of these decision problems is still lacking. In this paper, we formalize the Learning from Language Feedback (LLF) problem, identify sufficient assumptions that enable learning despite latent rewards, and introduce the transfer eluder dimension as a complexity measure to characterize the hardness of LLF problems. We show that the transfer eluder dimension captures the intuition that information in the feedback changes the learning complexity of the LLF problem. We demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward alone. We develop a no-regret algorithm, called HELiX, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension of the problem. Across several empirical domains, we show that HELiX performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark a first step towards designing principled interactive learning algorithms from generic language feedback.
Community
Decision-making with LLMs can be studied with RL! Can an agent efficiently solve a task given only text feedback (from an OS terminal, a compiler, a person)? How can we understand the difficulty? We propose a new notion of learning complexity for studying learning from language feedback alone.
Based on the Eluder Dimension (Russo and Van Roy, 2013) 🎓, we propose the Transfer Eluder Dimension, which captures how efficiently language feedback can reduce uncertainty about rewards. A smaller dim_TE means a single piece of language feedback carries more information about the reward.
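For reference, the standard eluder dimension of Russo and Van Roy (2013) is sketched below in generic notation; the transfer variant builds on it by accounting for what language feedback reveals about the reward class, so please see the paper for the precise definition of dim_TE.

```latex
% Eluder dimension (Russo & Van Roy, 2013); generic notation, not the paper's.
% An action a is \epsilon-dependent on a_1,\dots,a_n w.r.t. a class \mathcal{F} if
% every pair f, f' \in \mathcal{F} that is close on the history,
%     \sum_{i=1}^{n} \bigl(f(a_i) - f'(a_i)\bigr)^2 \le \epsilon^2,
% also agrees at a:  |f(a) - f'(a)| \le \epsilon.  Otherwise a is \epsilon-independent.
% The eluder dimension is the longest chain of actions that keep "surprising" the class:
\dim_E(\mathcal{F}, \epsilon)
  = \max\bigl\{\, d : \exists\, a_1,\dots,a_d \ \text{and}\ \epsilon' \ge \epsilon
      \ \text{s.t. each } a_i \text{ is } \epsilon'\text{-independent of } a_1,\dots,a_{i-1} \,\bigr\}.
```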
Building on this concept, we develop HELiX 🧬 (Hypothesis Elimination using Language-informed Exploration), which achieves a regret bound that scales gracefully with the time horizon T, establishing the first formal connection between no-regret learning and language feedback.
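Here, no-regret is meant in the standard cumulative-regret sense; the sketch below uses a generic bandit-style formulation with a latent reward r*, and the exact formulation in the paper (expectations, policies, feedback protocol) may differ.

```latex
% Cumulative regret after T rounds; r^\star is the latent reward the agent never
% observes directly: it only receives language feedback about its actions.
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Bigl( \max_{a \in \mathcal{A}} r^{\star}(a) - r^{\star}(a_t) \Bigr),
\qquad \text{no-regret:}\ \ \mathrm{Regret}(T)/T \to 0 \ \text{as}\ T \to \infty.
```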
We introduce a meta-algorithm that implements HELiX with LLMs via thinking tokens 🤔💭. The LLM samples parallel thoughts as plausible hypotheses about the world 🌍. We perform pessimistic exploitation via thought consensus and optimistic exploration via action self-rating, as sketched below.
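A minimal, runnable sketch of one such decision step; the helper names (`sample_thoughts`, `rate_action`), the consensus threshold, and the toy stubs are placeholders, not the actual prompts or code from the paper.

```python
"""
Sketch of one HELiX-style decision step with an LLM: sample parallel "thoughts"
as hypotheses, exploit pessimistically when the thoughts agree on a good action
(thought consensus), otherwise explore optimistically via action self-ratings.
"""
from typing import Callable, List


def helix_step(
    sample_thoughts: Callable[[str, int], List[str]],  # prompt, k -> k thoughts (hypotheses)
    rate_action: Callable[[str, str], float],          # thought, action -> self-rating in [0, 1]
    prompt: str,
    actions: List[str],
    k: int = 4,
    consensus_threshold: float = 0.8,
) -> str:
    """Pick one action for the current interaction round."""
    thoughts = sample_thoughts(prompt, k)

    # ratings[i][j] = how good action j looks under thought i.
    ratings = [[rate_action(t, a) for a in actions] for t in thoughts]

    # Pessimistic exploitation: score each action by its worst rating across thoughts.
    # If some action is good under every sampled hypothesis, play it (thought consensus).
    pessimistic = [min(col) for col in zip(*ratings)]
    best_safe = max(range(len(actions)), key=lambda j: pessimistic[j])
    if pessimistic[best_safe] >= consensus_threshold:
        return actions[best_safe]

    # Optimistic exploration: otherwise play the action some thought rates highest,
    # so the resulting language feedback can eliminate wrong hypotheses.
    optimistic = [max(col) for col in zip(*ratings)]
    best_bold = max(range(len(actions)), key=lambda j: optimistic[j])
    return actions[best_bold]


if __name__ == "__main__":
    # Toy stand-ins for an LLM so the sketch runs end to end.
    def toy_thoughts(prompt: str, k: int) -> List[str]:
        return [f"hypothesis-{i}" for i in range(k)]

    def toy_rating(thought: str, action: str) -> float:
        return 0.9 if action == "make" and thought != "hypothesis-0" else 0.3

    print(helix_step(toy_thoughts, toy_rating, "fix the failing build", ["ls", "make", "rm build"]))
```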
We have preliminary results showing that, by leveraging thinking tokens, our algorithm helps LLMs make better decisions: it evolves a set of thoughts and, after evaluating them, conducts efficient exploration.
The connection we establish here between thinking (reasoning) and exploration (RL) only scratches the surface of learning from language via LLMs.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Toward Efficient Exploration by Large Language Model Agents (2025)
- Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits (2025)
- Reinforcement Learning from Multi-level and Episodic Human Feedback (2025)
- Reward Is Enough: LLMs Are In-Context Reinforcement Learners (2025)
- NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning (2025)
- LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities (2025)
- Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds (2025)