Reinforcement learning! Why not in 2023?
ChatGPT took the world by storm.
End of 2022.
Obviously, the way ChatGPT was trained was believed to be the right way, in particular, GPT with next token prediction. Abundant “theories” emerged to explain the success: scaling laws, emergent abilities, compression is intelligence, etc.
Scaling up GPT with next token prediction is all you need!
That was the consensus, for most people.
Many people may still think so.
There were more and more voices about hitting the wall, culminating in Ilya Sutskever’s remark, “Pre-training as we know it will end,” during his NeurIPS 2024 Test of Time Award talk.
OpenAI o1 and o3 made RL mainstream. More and more people now care about RL.
If RL is the right way, why not earlier? In 2023?
Tons of resources invested in scaling up GPT with next token prediction.
Is it necessary?
Any reflections?
Especially from those promoting GPT with next token prediction, but not RL.
I posted a blog on May 4, 2023.
Reinforcement learning is all you need, for next generation language models.
https://yuxili.substack.com/p/reinforcement-learning-is-all-you
I understand that “XXX is all you need” is fun, but not so scientific.
Moreover, the title is not quite the right thesis.
So I changed it and posted a draft on July 7, 2023.
Iterative improvements from feedback for language models, 2023.
There were two arXiv papers in January 2023.
Human-Timescale Adaptation in an Open-Ended Task Space
https://arxiv.org/abs/2301.07608
SMART: Self-supervised Multi-task pretrAining with contRol Transformers
https://arxiv.org/abs/2301.09816
It seems they did not receive enough attention (81 and 49 citations as of writing), cf. CoT (10319) and ReAct (2100).
What if (much) more resources had been invested to push forward these ideas?
Anyway, I am happy to see the renaissance of RL.
Will RL be the right way?
In particular, the current approaches?
It is a big question.
It requires more exploration and exploitation.
One thing for sure:
Reinforcement learning is a general framework for sequential decision making.
An implicit assumption: reliable feedback.
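To make the sequential decision making framing concrete, here is a minimal sketch of the agent-environment loop in plain Python. The ToyEnvironment and the random policy are hypothetical placeholders, not any real task; the point is where the reward (the feedback) enters the loop, and why its reliability matters.

```python
# A minimal sketch of the RL framing: an agent interacts with an
# environment over time steps and learns from reward feedback.
# The environment below is a toy placeholder, not a real task.
import random

class ToyEnvironment:
    """A hypothetical two-action environment: action 1 is better on average."""
    def reset(self):
        return 0  # a single dummy state

    def step(self, action):
        # Reward feedback: this is the implicit assumption the blog points to.
        # If this signal is unreliable (noisy, gameable), RL learns the wrong thing.
        reward = 1.0 if action == 1 else 0.0
        return 0, reward, True  # next_state, reward, done

def run_episode(env, policy):
    state = env.reset()
    total, done = 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total += reward
    return total

# A trivial random policy; a real agent would update its policy from the rewards.
policy = lambda state: random.choice([0, 1])
env = ToyEnvironment()
print(sum(run_episode(env, policy) for _ in range(100)) / 100)
```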
Many people like to refer to Sutton’s The Bitter Lesson as support for the scaling laws.
Let me close the blog with Rich Sutton’s perspective on large language models.
Rich Sutton mentioned that current LLMs are “disappointing” and “superficial”, an “enormous distraction” that is “sucking the oxygen out of the room”.
PS 1.
Building an information perpetual motion machine?
PS 2.
RL did appear in the training pipeline of ChatGPT: pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). RLHF did receive considerable attention. But RLHF is different from RL.
RLHF is inverse RL, so it is imitation learning. It is different from “learning from demonstration” and “behaviour cloning” though, which are supervised learning, or in the parlance of LLMs, “supervised fine-tuning” (SFT). RLHF is not supervised learning. RLHF is a principled approach to sequential decision problems without a reward function. For example, most NLP problems, like translation and summarization, do not have objective reward functions, and performance metrics like BLEU, ROUGE, and perplexity are heuristic and approximate.
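To make the distinction concrete, here is a minimal sketch, in PyTorch, of the pairwise preference loss commonly used to train a reward model in RLHF (Bradley-Terry style). The reward_model, the feature dimensions, and the data are hypothetical placeholders; a real implementation would score (prompt, response) pairs with a language model head.

```python
# A minimal sketch of the pairwise preference loss typically used to train
# the reward model in RLHF. The reward_model here is a hypothetical
# placeholder; in practice it would be a language model scoring
# (prompt, response) pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def preference_loss(chosen_features, rejected_features):
    """Encourage the model to score the human-preferred response higher."""
    r_chosen = reward_model(chosen_features)      # scalar score per preferred response
    r_rejected = reward_model(rejected_features)  # scalar score per rejected response
    # -log sigmoid(r_chosen - r_rejected): the learned reward replaces a
    # hand-written reward function, which is the inverse-RL flavor of RLHF.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical batch of 8 preference pairs with 16-dim placeholder features.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)
loss = preference_loss(chosen, rejected)
loss.backward()
print(loss.item())
```

The learned reward model then provides the feedback signal for an RL step (e.g., policy optimization) over the language model, which is where the “reliable feedback” assumption comes back in.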
PS 3.
Data scarcity is becoming an issue. This may be true for GPT with next token prediction, trained on data from the Internet.
However, pre-training is still at a very early stage, considering automatic generation of reliable data, alternative neural network architectures, alternative abstractions (token vs. sentence vs. paragraph, etc.), alternative algorithms, and valuable domain knowledge. Note that reliable data is an implicit requirement for RL.