RL(HF) Helps LMs

Jul 23, 2023

Executive summary

RLHF is a popular approach to human value alignment, as in ChatGPT. Direct Preference Optimization (DPO) needs neither a reward model nor reinforcement learning. However, DPO applies only to the Bradley-Terry model, which underlies one particular RLHF approach. There may be other ways to handle human preferences. Moreover, there may be non-preference ways to handle value alignment. As a general approach, RL(HF) helps language models (LMs).

Human preferences and values are complex, which makes it hard, if not impossible, to define a perfect function for them. In reinforcement learning parlance, such a function is a reward function.

Language is a major medium of communication among humans. Achieving human value alignment is thus indispensable for a language model that interacts with humans.

Reinforcement learning from human feedback (RLHF) was initially proposed in Deep Reinforcement Learning from Human Preferences (NIPS 2017) as a way to overcome the difficulty of defining a reward function. InstructGPT, in Training language models to follow instructions with human feedback, borrowed the idea for human value alignment in language modelling. ChatGPT extended InstructGPT, and thus incorporated RLHF.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model proposed to optimize preference directly, by solving a classification problem on the human preference data, without reward modelling and reinforcement learning.
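To make the "classification problem" concrete, here is a minimal sketch of the DPO loss for a single preference pair, following the objective given in the DPO paper. The function names, the example log-probabilities, and the value beta=0.1 are illustrative assumptions, not taken from this post. A real implementation would compute sequence log-probabilities from the policy and a frozen reference model and average over a batch.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_pair_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference model
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the log-probability of a full response given the prompt,
    under the trained policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    # Binary classification loss: push the chosen log-ratio above the rejected one.
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    print(dpo_pair_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2, the same as an uninformed binary classifier. The loss falls as the policy raises the chosen response's log-probability relative to the rejected one.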

On the one hand, DPO is great work that builds on the work discussed above. It follows the Bradley-Terry model (R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.) for estimating score functions from pairwise preferences (see the figure below).

Figure source: Deep Reinforcement Learning from Human Preferences (NIPS 2017)
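As a rough illustration of the Bradley-Terry model in this setting, the sketch below computes the probability that one trajectory segment is preferred over another from the sums of their predicted rewards, as in Deep Reinforcement Learning from Human Preferences. The function name and the example reward values are assumptions made for illustration.

```python
import math
from typing import Sequence


def segment_preference_prob(
    rewards_a: Sequence[float], rewards_b: Sequence[float]
) -> float:
    """Bradley-Terry probability that segment A is preferred over segment B.

    P(A > B) = exp(sum r_A) / (exp(sum r_A) + exp(sum r_B)),
    which equals sigmoid(sum r_A - sum r_B).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Evaluate the sigmoid in a way that avoids overflow for large |diff|.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Hypothetical per-step predicted rewards for two segments.
    seg_a = [0.5, 1.0, 0.2]
    seg_b = [0.1, 0.3, 0.4]
    print(segment_preference_prob(seg_a, seg_b))
```

Only the difference between the two summed scores matters, so adding a constant to both leaves the predicted preference unchanged. In RLHF, a reward model is fit by minimizing the cross-entropy between these predicted probabilities and the human labels.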

On the other hand, DPO solves the human preference problem under the particular Bradley-Terry model, i.e., it applies only to that model.

However, there may be other ways to handle human preferences. Moreover, there may be non-preference ways to handle human value alignment. One example, which uses ratings to transmit human knowledge to an RL agent, is TAMER: Training an Agent Manually via Evaluative Reinforcement.

How to achieve human value alignment for a language model is a big topic. How to improve language models is an even bigger topic. See several recent blogs:

That is why the title is RL(HF) Helps LMs.

Aug 8, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2307.15217


Jul 29, 2023

Is RLHF More Difficult than Standard RL?

https://arxiv.org/pdf/2306.14111.pdf

“This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games under a restricted set of policies.”

Yuxi’s Substack
