Iterative Reasoning Preference Optimization
Self-Rewarding Language Models
Diffusion Model Alignment Using Direct Preference Optimization
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Researchers From Stanford And DeepMind Come Up With The Idea of Using Large Language Models (LLMs) as a Proxy Reward Function