David Silver and Richard Sutton recently shared a paper: Welcome to the Era of Experience. It is highly recommended, especially for people pursuing true agents and AI.
This is their first written perspective on LLMs.
Some of their talks and interviews:
Is Human Data Enough? With David Silver;
David Silver - Towards Superhuman Intelligence - RLC 2024;
A Perspective on Intelligence by Rich Sutton;
DeepSeek (The Derby Mill Series ep 02);
Rich Sutton Brings Reinforcements - 72nd Conversation.
Short videos by me:
Rich Sutton on Intelligence, LLMs, Scaling laws;
Bitter lesson implies scaling "laws"?.
The clear takeaways of the paper are:
Experience is critical for further progress in AI, including LLMs.
Reinforcement learning is a natural framework for learning from experience.
Here are my comments.
We may need to highlight the importance of ground truth and the difficulties of generalizing/transferring achievements in games, maths, and coding to other problems.
1. Ground truth
2. Objective vs subjective reward/objective functions; the role of humans
3. Model-based RL vs. inverse RL, in particular, RLHF
4. Digital vs physical
5. Superhuman AI is hard even for maths and coding
6. A brief summary
1. Ground truth
There are three sources of ground truth data:
1) perfect rules/laws/world models,
2) perfect verifiers with formal methods, and
3) experience by interacting with the world.
Sources 1 (e.g., AlphaZero) and 2 (e.g., AlphaProof) can guarantee the correctness of data. They can be regarded as special types of experience from interacting with special environments: game AI like AlphaZero has perfect game rules and hence a perfect game engine; maths and coding have precise syntax and semantics and hence a perfect prover/verifier.
Source 3 (experience) may come with noise/uncertainty, although the world can be viewed as a perfect model of itself. We may need to deal with sampling errors, partial observability, imperfect/incomplete information, strategic/adversarial scenarios, and/or weak specifications like code unit tests.
Before the LLM era, ground truth data was a prerequisite for AI and machine learning. However, many LLM papers do not have true ground truth data, even for maths (matching a scalar final result) or coding (passing several tests or successful execution), not to mention many other problems. LLMs are not perfect world models. There are also many self-* approaches (self-training, self-correction, etc.) that in effect attempt to build "information perpetual motion machines".
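As a toy illustration of the weak-specification point (entirely my own hypothetical example, not from the paper), here is a program that is wrong in general yet passes a small unit-test suite; a reward defined as "passes the tests" would score it perfectly.

```python
# Hypothetical example: a few unit tests are a weak specification, not
# ground truth. This "solution" only sorts lists of length <= 2, yet it
# passes every test below.

def sort_list(xs):
    if len(xs) <= 1:
        return xs
    if len(xs) == 2:
        return xs if xs[0] <= xs[1] else [xs[1], xs[0]]
    return xs  # longer lists are silently returned unsorted

# Weak test suite: every input happens to have length <= 2.
tests = [([], []), ([3], [3]), ([2, 1], [1, 2]), ([1, 2], [1, 2])]
for inp, expected in tests:
    assert sort_list(inp) == expected

print("all tests passed, yet sort_list([3, 1, 2]) ->", sort_list([3, 1, 2]))
```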
2. Objective vs subjective reward/objective functions; the role of humans
Many problems in natural science, formal science and engineering have objective rewards/objective functions. We may collect experience from such problems and rely on it for learning. Sources 1 and 2 for ground truth above are two special cases.
Many problems in social science, the arts and the humanities do not have objective objective functions; their objectives are subjective. In such cases, experts' feedback may be the best signal we can get.
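A minimal sketch of the contrast, with a toy verifier of my own: for a maths question whose final answer is known (Source 1 or 2 above), an objective reward is just exact match against that answer; for subjective tasks, no analogous function exists, and expert feedback stands in for it.

```python
# Toy objective reward: exact match of a model's final numeric answer
# against a known, verified result. (Assumes the ground-truth value
# itself is correct.)

def objective_math_reward(model_answer: str, ground_truth: float) -> float:
    try:
        return 1.0 if abs(float(model_answer) - ground_truth) < 1e-9 else 0.0
    except ValueError:
        return 0.0  # unparseable answer gets zero reward

print(objective_math_reward("42", 42.0))     # 1.0
print(objective_math_reward("41.99", 42.0))  # 0.0

# For subjective objectives (essay quality, translation style, policy
# advice), there is no such function to write down; human/expert
# feedback is the signal instead.
```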
3. Model-based RL vs. inverse RL, in particular, RLHF
For a problem without a reward function, model-based RL and inverse RL, in particular RLHF, provide principled approaches to learning a reward function.
Imitation learning is not enough, as the contrast between AlphaGo and AlphaZero shows. However, *iterative* inverse RL can be regarded as model-based RL (for the reward model).
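A minimal sketch (toy features and synthetic preferences of my own, not any particular system's recipe) of the reward-model step at the heart of RLHF: fit a scalar reward so that preferred responses score higher than rejected ones, via a Bradley-Terry style pairwise loss. In an *iterative* scheme, fresh preference data would be collected under the current policy and the reward model refit, which is where the model-based-RL view comes in.

```python
# Sketch of learning a reward model from pairwise human preferences.
# Features, dimensions, and data are toy placeholders.

import torch
import torch.nn as nn

torch.manual_seed(0)

dim = 16
reward_model = nn.Linear(dim, 1)          # r(x) = w^T x + b
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Synthetic preferences: "chosen" responses have higher value along a
# hidden "true" reward direction than "rejected" ones.
true_w = torch.randn(dim)
x = torch.randn(256, dim)
y = torch.randn(256, dim)
prefer_x = (x @ true_w > y @ true_w).unsqueeze(1)
chosen = torch.where(prefer_x, x, y)
rejected = torch.where(prefer_x, y, x)

for step in range(200):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Bradley-Terry / pairwise logistic loss:
    # maximize log sigma(r_chosen - r_rejected).
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.3f}")
```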
4. Digital vs physical
Some problems can be fully digital, e.g., many games, maths, and coding.
Many problems have to be physical, e.g., robotics and human-centric problems.
One factor is about the cost of collecting experience.
5. Superhuman AI is hard even for maths and coding
Game AI is quite special: perfect game rules and, for many games, a fully digital setting.
We have many superhuman game-playing AIs, such as AlphaZero.
Maths and coding come close, but the goal of fully autonomous and always-correct agents is likely infeasible, due to impossibility results such as Gödel's incompleteness theorems and the halting problem. Even so, there are plenty of academic and business opportunities, and we should try our best.
Many science and engineering problems have physical and/or human components. Collecting data for them can thus be costly, and there is usually a simulation-to-reality gap.
For human-centric problems, "superhuman" may not be well defined. Consider translation as an example:
- no objective objective function; NLP metrics such as BLEU and perplexity are heuristics (see the sketch after this list)
- experts' evaluation: likely the best data we can get, and it may be irreplaceable
- seemingly no way to collect superhuman data, nor to judge whether a given output is superhuman
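To make the first bullet concrete, here is a small sketch with made-up sentences of my own (it assumes NLTK is installed; the exact numbers are incidental): BLEU is n-gram overlap, so a faithful paraphrase can score far below a near-copy even though a human expert might judge both acceptable.

```python
# Toy illustration that BLEU is a heuristic overlap metric, not an
# objective measure of translation quality. Requires `pip install nltk`.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
near_copy = ["the", "cat", "sat", "on", "the", "mat"]
paraphrase = ["a", "cat", "was", "sitting", "on", "the", "mat"]

smooth = SmoothingFunction().method1
print("near copy :", sentence_bleu(reference, near_copy, smoothing_function=smooth))
print("paraphrase:", sentence_bleu(reference, paraphrase, smoothing_function=smooth))
# The paraphrase may be a perfectly acceptable translation to a human
# judge, yet its n-gram overlap (and hence its BLEU score) is much lower.
```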
Games, maths and coding are special problems, with special, perfect environments. The achievements in such problems may not be straightforwardly generalizable / transferable to other problems.
6. A brief summary
+ Ground truth is a first principle.
+ Expect breakthroughs in maths and coding, and likely in science and engineering.
+ Achievements in games, maths and coding may not be straightforwardly generalizable / transferable to other problems.
+ Experience from humans may be irreplaceable for human-centric problems. (Iterative) inverse RL, such as RLHF, provides a principled approach to learning a reward function.
+ Study how to collect ground truth experience efficiently for problems with physical and/or human components.
+ As always, a call for efficient (reinforcement) learning algorithms.