Executive summary
RLHF is a popular approach for aligning language models with human values, as in ChatGPT. Direct Preference Optimization (DPO) dispenses with the reward model and with reinforcement learning, but it applies only to the Bradley-Terry model, which underlies one particular RLHF formulation. There may be other ways to model human preferences, and there may also be non-preference-based ways to handle value alignment. Still, as a general approach, RL(HF) helps language models (LMs).
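For reference, the Bradley-Terry model and the DPO objective it induces (notation as in the DPO paper) can be written as

p^*(y_1 \succ y_2 \mid x) = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

where y_w and y_l are the preferred and dispreferred responses, \sigma is the logistic function, and \beta sets the strength of the KL regularization toward \pi_{\mathrm{ref}}. Preferences that are not consistent with some latent reward r^* fall outside this model, which is the limitation noted above.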
RL(HF) Helps LMs
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
https://arxiv.org/abs/2307.15217
Is RLHF More Difficult than Standard RL?
https://arxiv.org/pdf/2306.14111.pdf
“This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games under a restricted set of policies.”
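For the general-preference case, the von Neumann winner quoted above is (in the standard dueling-bandits formulation) a mixed policy \pi^\star that, in expectation, is preferred to every competing policy at least half the time:

\min_{\pi}\; \mathbb{E}_{\tau_1 \sim \pi^\star,\, \tau_2 \sim \pi}\big[P(\tau_1 \succ \tau_2)\big] \ge \tfrac{1}{2}

Equivalently, \pi^\star is a Nash (maximin) strategy of the symmetric zero-sum game with payoff 2\,P(\tau_1 \succ \tau_2) - 1, which is why finding it can be handed to multiagent reward-based RL that computes Nash equilibria, as the quote describes.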