Human alignment is very hard
Should physicist J. Robert Oppenheimer have developed the atomic bomb?
Should nuclear weapons have been used on zero, one, or two cities?
This is a big moral dilemma.
Had nuclear weapons not been developed, what would the world have been like? Would there have been a World War III, or more major wars, already? Nobel laureates Robert Aumann and Thomas Schelling applied game theory to discuss such issues.
This is about human alignment.
We can recall the Trolley Problem: five people on one track or one on another.
Is there a perfect solution? Perfect in what sense? What is the meaning of a solution?
Throughout the history of civilization, people have worked hard to design social norms, rules and laws, attempting to “solve” the human alignment problem. Can we say they have solved it, considering the wars that have accompanied humanity all along, in particular the one in progress?
Narrowing the problem to a single individual, recommendation and personalization techniques are popular, and we are enjoying their fast progress. However, recommender systems still struggle with issues like topic diversity and adaptation to users’ changing preferences.
Human alignment is subjective, not objective.
Human alignment is multifaceted.
Language modelling is a very complex optimization problem, with many objectives, e.g., correctness, groundedness, unbiasedness, sensibleness, interestingness, factuality, safety and specificity. Many factors, like culture, history, psychology and philosophy, play a role. Different users have different or even contradictory perspectives.
Should we expect a general language model to be optimal for all objectives and for all users, approximately or in a Pareto sense?
There are two basic optimization principles. First, multi-objective optimization usually cannot optimize all objectives at once, because improving one tends to cost another. Second, the more constraints there are and the tighter they are, the smaller the chance of finding a feasible solution. To respect these principles, we should employ modularity and benefit from the collaboration of specialized, expert-level modules.
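The two principles can be illustrated with a toy example. The sketch below is only an illustration under invented assumptions: the candidates, their scores on a 0 to 10 scale, and the helper names pareto_front and feasible_fraction are all hypothetical, and the two objectives are made to conflict by construction. It shows that no single candidate is best on both objectives, and that the share of feasible candidates shrinks as constraints are added or tightened.

```python
def dominates(q, p):
    """True if q is at least as good as p everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(q, p)) and any(a > b for a, b in zip(q, p))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives maximized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def feasible_fraction(points, thresholds):
    """Fraction of points whose every score meets its minimum threshold."""
    ok = [p for p in points if all(v >= t for v, t in zip(p, thresholds))]
    return len(ok) / len(points)


# Hypothetical candidates scored on two conflicting objectives (0 to 10 scale):
# gaining on the first objective costs the same amount on the second.
trade_off = [(i, 10 - i) for i in range(11)]
# Some clearly worse candidates, each dominated by a point on the trade-off line.
worse = [(i // 2, (10 - i) // 2) for i in range(11)]
candidates = trade_off + worse

front = pareto_front(candidates)
best_on_first = max(candidates, key=lambda p: p[0])
best_on_second = max(candidates, key=lambda p: p[1])

# Principle 1: the best candidate for one objective is not the best for the other.
print("Pareto front:", front)
print("Best on objective 1:", best_on_first)
print("Best on objective 2:", best_on_second)

# Principle 2: adding or tightening constraints shrinks the feasible set.
# A threshold of 0 means "no constraint" on that objective.
for thresholds in [(2, 2), (4, 4), (6, 0), (6, 6)]:
    print(thresholds, feasible_fraction(candidates, thresholds))
```

In this toy setting the Pareto front is the whole trade-off line, so choosing one point from it is a value judgment rather than an optimization result, which is the situation a general language model faces when users disagree.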
“Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback” is a recent survey, well done and with deep insights.
The authors “divide challenges with RLHF into three main types: challenges with obtaining quality human feedback, challenges with learning a good reward model, and challenges with policy optimization”.
However, the authors misattribute some of these challenges: the challenges with human feedback and with the reward model are inherent to natural language processing (NLP), not specific to RLHF. RLHF provides a data-driven approach to approximating the reward model. RL holds promise to revolutionize language models.
Human alignment is very hard, and likely impossible to “solve”. People are doing their best to balance trade-offs among many factors.