
ernie-research/TLDR-Gemma-2B-MA-PPO-Fixed5
3B
[ICLR'25] [MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions](https://openreview.net/forum?id=WWXjMYZxfH)