Direct Preference Optimization: Your Language Model is Secretly a Reward Model · #Large Language Models, #Preferences, #Reward, #Training, #Paper, #PDF · arxiv.org · Jun 4, 2023
Improving Mathematical Reasoning with Process Supervision · #OpenAI, #Process Supervision, #Reward, #Machine Learning, #Paper · openai.com · May 31, 2023
Reward Design with Language Models · #Reward, #Paper, #PDF, #Large Language Models · arxiv.org · Mar 9, 2023