RL(HF) Helps LMs
RLHF is a popular approach for human value alignment, as in ChatGPT. Direct Preference Optimization (DPO) needs neither a reward model nor reinforcement learning. However, DPO applies only to the Bradley-Terry model, which underlies one particular RLHF approach. There may be other ways to handle human preference. Moreover, there may be non-preference ways to handle value alignment. As a general approach, RL(HF) helps language models (LMs).
Human preferences and values are complex, which makes it hard, if not impossible, to define a perfect function for them. In reinforcement learning parlance, such a function is a reward function.
Language is a major medium for communication among humans. Human value alignment is thus indispensable for a language model that interacts with humans.
Reinforcement learning from human feedback (RLHF) was initially proposed as a way to overcome the problem with defining a reward function, in Deep Reinforcement Learning from Human Preferences (NIPS 2017). InstructGPT in Training language models to follow instructions with human feedback borrowed the idea for human value alignment in language modelling. ChatGPT extended InstructGPT, and thus incorporated RLHF.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model proposed to optimize preference directly, by solving a classification problem on the human preference data, without reward modelling and reinforcement learning.
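To make the "classification problem" concrete, here is a minimal sketch of the per-example DPO loss as given in the DPO paper: a logistic loss on the difference of log-probability ratios between the policy and a frozen reference model. The function and argument names, the default `beta=0.1`, and the use of plain floats instead of batched tensors are assumptions made for illustration, not the authors' implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, not prescribed by the post
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    # Binary classification: the chosen response should score higher.
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))
```

When the policy equals the reference model, both log-ratios are zero and the loss is log 2, the loss of a classifier that cannot tell the two responses apart. The loss falls as the policy shifts probability toward the chosen response relative to the reference.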
On the one hand, DPO is a great piece of work that builds on the previous work discussed above. It follows the Bradley-Terry model (R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.) for estimating score functions from pairwise preferences (see the figure below).
On the other hand, Direct Preference Optimization solves the human preference problem with the particular Bradley-Terry model, i.e., DPO applies only to the Bradley-Terry model.
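As a rough illustration of what the Bradley-Terry assumption means, the sketch below computes the probability that one item is preferred over another from two real-valued scores, and the negative log-likelihood of a set of pairwise comparisons. The exponential parameterization (preference probability as a sigmoid of the score difference) is the form commonly used in RLHF reward modelling. The function names and the toy data layout are hypothetical.

```python
import math


def bt_preference_prob(score_a: float, score_b: float) -> float:
    """P(a is preferred over b) = exp(s_a) / (exp(s_a) + exp(s_b))."""
    # Equivalent to sigmoid(score_a - score_b).
    return 1.0 / (1.0 + math.exp(score_b - score_a))


def bt_negative_log_likelihood(scores: dict, comparisons: list) -> float:
    """Negative log-likelihood of observed (winner, loser) pairs.

    `scores` maps an item name to its score; `comparisons` is a list
    of (winner, loser) name pairs. Both are toy, hypothetical inputs.
    """
    return -sum(
        math.log(bt_preference_prob(scores[winner], scores[loser]))
        for winner, loser in comparisons
    )
```

Under this model a preference depends only on the difference between two scalar scores. Fitting scores by minimizing this likelihood is what a reward model does in this style of RLHF, and it is this specific functional form that DPO builds on.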
However, there may be other ways to handle human preference. Moreover, there may be non-preference ways to handle human value alignment. One example, which uses ratings to transmit human knowledge to an RL agent, is TAMER: Training an Agent Manually via Evaluative Reinforcement.
How to achieve human value alignment for a language model is a big topic. How to improve language models is an even bigger topic. See several recent blogs:
That is why the title is RL(HF) Helps LMs.