Building information perpetual motion machines?
From Wikipedia: a perpetual motion machine is a hypothetical machine that can do work indefinitely without an external energy source.
By analogy, an information perpetual motion machine would be one that improves indefinitely (until it is perfect) without an external information source.
AlphaZero is an information perpetual motion machine.
The components of AlphaZero: deep learning, reinforcement learning, policy iteration, Monte Carlo tree search (MCTS), self-play, and a perfect model.
Board games like chess and Go have perfect rules, thus perfect models.
With a perfect model, we can build a perfect simulator that generates unlimited, perfectly labelled data.
The perfect model is the key.
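A toy sketch of the point above (this is not AlphaZero; the game here is a trivial stand-in): when the rules themselves are the model, a self-play simulator can label every recorded position with the exact game outcome, so the generated data is perfect by construction.

```python
import random

# Toy game: players alternately add 1 or 2 to a running total;
# whoever brings the total to exactly 10 wins.
# The rules below ARE the perfect model; no approximation is involved.

def legal_moves(total):
    return [m for m in (1, 2) if total + m <= 10]

def self_play_episode(rng):
    """Play one random game; return (state, mover, move, outcome) rows.

    Every recorded position gets the true game result (+1 win / -1 loss
    for the player who moved), i.e. a perfect label.
    """
    total, player, records = 0, 0, []
    while total < 10:
        move = rng.choice(legal_moves(total))
        records.append((total, player, move))
        total += move
        if total == 10:
            winner = player
        player = 1 - player
    return [(state, mover, move, +1 if mover == winner else -1)
            for state, mover, move in records]

rng = random.Random(0)
dataset = [row for _ in range(1000) for row in self_play_episode(rng)]
```

Because the simulator is exact, this loop can run forever and never inject a single wrong label: the "information perpetual motion" rests entirely on the perfect model.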
LLMs are very different from AlphaZero.
LLMs may be regarded as an approximate model/simulator, i.e., not perfect.
LLMs can help with many tasks, e.g., data generation, evaluation, judgement, reflection, improvement, etc.
However, a verifier is required to guarantee correctness.
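A minimal sketch of verifier-gated generation, assuming two hypothetical callables: `generate(prompt)` (an LLM, i.e. an approximate model) and `verify(candidate)` (a fully reliable checker). Only verified outputs are kept; without `verify`, errors would feed back into the training data.

```python
import random

def verified_samples(prompt, generate, verify, n_attempts=10):
    """Keep only candidates that pass the reliable verifier."""
    accepted = []
    for _ in range(n_attempts):
        candidate = generate(prompt)
        if verify(candidate):  # the verifier, not the generator, guarantees correctness
            accepted.append(candidate)
    return accepted

# Toy stand-ins: "generate" guesses integers, "verify" checks divisibility exactly.
rng = random.Random(0)
out = verified_samples(
    12,
    generate=lambda n: rng.randint(1, n),
    verify=lambda d: 12 % d == 0,
)
```

The correctness of `out` depends only on `verify`; the generator can be arbitrarily unreliable.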
Self-play with LLMs won’t lead to an information perpetual motion machine.
Reinforcement learning has enjoyed a renaissance since around September 2024, when OpenAI launched o1.
Reinforcement learning! Why not in 2023?
Reinforcement learning is a general framework for sequential decision making.
An implicit assumption: reliable feedback.
Maths and coding are similar to games, with formal logic and perfect syntax.
However, there is no AlphaZero-like success for these domains yet.
AlphaProof maybe?
For coding, unit tests and executions provide valuable feedback, but not fully reliable feedback: passing tests cannot guarantee correctness.
For a maths problem, matching a scalar final answer does not guarantee the correctness of the reasoning process.
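A toy illustration of why passing unit tests is not full verification: the (deliberately broken) function below passes every test in its small suite, yet is wrong in general.

```python
def is_even(n):
    # Deliberate bug: only a few hard-coded cases are handled.
    return n in (0, 2, 4)

# This entire suite passes, but the function is incorrect.
assert is_even(2)
assert is_even(4)
assert not is_even(3)

print(is_even(6))  # False — the tests gave no guarantee
```

Execution-based feedback of this kind is informative, but it only samples behaviour; it cannot certify correctness the way a perfect model can.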
We should stop building information perpetual motion machines.
We should seek fully reliable signals first.