Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715
A batched on-policy algorithm that iteratively self-improves a language model via contrastive learning
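
To make the one-line summary concrete, here is a minimal sketch of a single contrastive update on a batch of on-policy samples. It is an illustration under stated assumptions, not the paper's implementation: the best-versus-worst pairing rule, the DPO-style logistic loss, the `beta` value, and the names `build_pair` and `contrastive_pair_loss` are all hypothetical choices made for this example.

```python
import math
from typing import List, Tuple


def build_pair(samples: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Return (preferred, dispreferred) from a batch of (response, score) samples.

    Assumption: each response was sampled from the current policy and scored
    by some preference signal; we simply contrast the best against the worst.
    """
    ranked = sorted(samples, key=lambda s: s[1])
    return ranked[-1][0], ranked[0][0]


def contrastive_pair_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,  # assumed scale, not taken from the source
) -> float:
    """DPO-style logistic loss on one (preferred, dispreferred) pair.

    Computes -log(sigmoid(margin)), where margin compares how much the policy
    has shifted toward the preferred response relative to a reference policy.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    batch = [("answer A", 0.2), ("answer B", 0.9), ("answer C", 0.5)]
    print(build_pair(batch))
    print(contrastive_pair_loss(-1.0, -3.0, -2.0, -2.0))
```

In an iterative scheme, the updated policy would generate the next batch and the same pairing and loss would be applied again; how the scores are obtained and which policy serves as the reference in each round are left open here.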