Executive summary
RLHF is a popular approach for aligning language models with human values, as in ChatGPT. Direct Preference Optimization (DPO) dispenses with both an explicit reward model and reinforcement learning. However, DPO applies only under the Bradley-Terry preference model, which underlies one particular RLHF formulation; there may be other ways to model human preference, and there may be non-preference routes to value alignment altogether. As a general approach, RL(HF) remains useful for language models (LMs).
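To make the Bradley-Terry connection concrete, here is a minimal sketch of the DPO loss, assuming per-sequence log-probabilities from the policy and a frozen reference model have already been computed; the function and argument names are illustrative, not taken from any of the papers below.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective under the Bradley-Terry model."""
    # Log-ratios of policy vs. reference for the preferred (chosen)
    # and dispreferred (rejected) responses.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Under Bradley-Terry, the preference probability is a sigmoid of
    # the implicit reward difference beta * (pi_logratios - ref_logratios);
    # DPO maximizes the log-likelihood of the observed preferences.
    logits = pi_logratios - ref_logratios
    return -F.logsigmoid(beta * logits).mean()
```

The sigmoid-of-reward-difference form is exactly the Bradley-Terry assumption; if human preferences do not follow that model, this loss is no longer the exact maximum-likelihood objective, which is the limitation the summary above points to.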
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Is RLHF More Difficult than Standard RL?
“This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games under a restricted set of policies.”
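For reference, the von Neumann winner mentioned in the quote can be stated as follows; this is the standard definition from the dueling-bandits literature, not a formula taken from the quoted paper.

```latex
% A (possibly mixed) policy \pi^* is a von Neumann winner if it is
% preferred over every other policy at least half the time:
\[
  \mathbb{P}\bigl(\pi^* \succ \pi\bigr) \;\ge\; \tfrac{1}{2}
  \quad \text{for every policy } \pi .
\]
% Equivalently, \pi^* is a maximin (Nash) strategy of the symmetric
% two-player zero-sum game with payoff
% \mathbb{P}(\pi \succ \pi') - \tfrac{1}{2},
% which exists by the minimax theorem; this is the Nash-equilibrium
% reduction the quoted abstract refers to.
```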