Will AGI Emerge from Large Language Models?
What is AGI? What is a language model? SOTA. Emergent abilities, language vs thought, scaling up, prompting, augmented LM, goal. Potential approaches to AGI. With an executive summary.
Artificial general intelligence (AGI) has become a popular topic again, after the launch of ChatGPT in November 2022 and its quick accumulation of millions of users. The last time was after AlphaGo in 2016, “the first to defeat a Go world champion”. Here we attempt to discuss whether AGI will emerge from large language models (LLMs). Such a grand topic! Omissions and errors are inevitable. Comments and criticisms are welcome.
Executive Summary
What is AGI? What is human-level intelligence?
What is a language model?
The state of the art and discussions:
a candid statement from OpenAI: many issues remain, requiring significant improvements
“LLMs are good models of language but incomplete models of human thought”; two fallacies: “good at language -> good at thought” and “bad at thought -> bad at language”
emergent abilities, e.g., multi-step reasoning, require more studies
“Scaling is all you need” is a misreading of the Bitter Lesson.
small language models and traditional NLP may help
prompting and in-context learning may be temporary
augmented language models are promising
the goal of a language model is likely hard to define
Yann LeCun’s opinion on current (auto-regressive) LLMs
Potential approaches to AGI
modular architecture, neurosymbolic AI, importance of experience, reasoning, planning, world model, tests of intelligence
What is AGI?
AGI is the Holy Grail of AI, of computer science, of science.
From Wikipedia, “Artificial general intelligence (AGI) is the ability of an intelligent agent to understand or learn any intellectual task that human beings or other animals can.”
AGI is also known as strong AI or human-level intelligence.
Human-level intelligence
What is human intelligence? In the paper Building machines that learn and think like people, published shortly after AlphaGo, MIT professor Joshua B. Tenenbaum and colleagues argue that we should build machines toward human-like learning and thinking. In particular, machines should build causal world models that support understanding and explanation, perceiving entities rather than just raw inputs or features, instead of performing mere pattern recognition; they should ground and enrich learned knowledge in intuitive physics and intuitive psychology; and they should represent, acquire, and generalize knowledge by leveraging compositionality and learning to learn, rapidly adapting to new tasks and scenarios by recombining representations, without retraining from scratch.
Causality
Judea Pearl is a professor at UCLA and a Turing Award Laureate, “For fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning.” In a Communications of the ACM article, The seven tools of causal inference, with reflections on machine learning, Pearl argues that there are three fundamental obstacles for current machine learning systems to exhibit human-level intelligence: adaptability or robustness, explainability, and understanding of cause-effect connections. He describes a three-layer causal hierarchy: association, intervention, and counterfactual. Association invokes statistical relationships, with typical questions like “What is?” and “How would seeing X change my belief in Y?” Intervention considers not only seeing what is, but also changing what we see, with typical questions like “What if?” and “What if I do X?” Counterfactual requires imagination and retrospection, with typical questions like “Why?” and “What if I had acted differently?” Counterfactual questions subsume interventional and associational questions, and interventional questions subsume associational questions.
What is a language model?
Here we briefly discuss some basic concepts, based on lecture notes from the Stanford course CS324 - Large Language Models.
A language model is about the probability distribution of a sequence of tokens, i.e.,
Probability(a sequence of tokens).
An auto-regressive language model is about the probability distribution of the next token, given the previous tokens, i.e., the following conditional probability distribution:
Probability(next token | previous tokens).
Note: A token usually corresponds to a word; however, they may not be the same, e.g., the word “saying” may be decomposed into two tokens: “say” and “ing”.
Such conditional probability distributions can be computed efficiently.
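As a minimal sketch of this factorization (the next_token_probs function below is a hypothetical stand-in for a trained model), the probability of a whole sequence is the product of the conditional probabilities of each token given its prefix:

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """Log-probability of a token sequence under an auto-regressive model.

    `next_token_probs(prefix)` is assumed to return a dict mapping each
    candidate next token to its conditional probability given `prefix`.
    """
    log_prob = 0.0
    for t in range(len(tokens)):
        prefix = tokens[:t]                      # previously seen tokens
        probs = next_token_probs(prefix)         # P(next token | previous tokens)
        log_prob += math.log(probs[tokens[t]])   # chain rule: sum of log conditionals
    return log_prob

# Toy "model" that always predicts uniformly over a tiny vocabulary.
vocab = ["I", "am", "say", "ing"]
uniform = lambda prefix: {tok: 1.0 / len(vocab) for tok in vocab}
print(sequence_log_prob(["I", "am", "say", "ing"], uniform))  # 4 * log(1/4)
```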
Claude Shannon, the founder of information theory, first used n-gram models in 1948; in an n-gram model, the probability of the next token depends only on the last n-1 tokens, rather than on all previous tokens. People have developed 3-gram, 4-gram, and 5-gram models. A limitation of n-gram models is that they cannot capture long-range dependencies; and when n becomes large, most n-grams never appear in the training data, so their estimated probabilities go to zero, making estimation statistically infeasible.
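As a toy illustration of an n-gram model (here a bigram model; the corpus is made up and there is no smoothing):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate P(next token | previous token) from raw counts (no smoothing)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for prev, nxts in counts.items()}

model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(model["the"])   # {'cat': 0.666..., 'dog': 0.333...}
print(model["cat"])   # {'sat': 0.5, 'ran': 0.5}
```

Any bigram unseen in the training corpus gets probability zero, which is the sparsity problem that worsens quickly as n grows.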
Yoshua Bengio and colleagues pioneered neural language models in the paper A Neural Probabilistic Language Model in 2003. In such models, an n-gram model or an auto-regressive language model is represented by a neural network.
Neural network architectures have evolved from feed-forward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent units (GRUs), to Transformers, which are the backbone of recent large language models. Neural networks have strong representational capacity; e.g., GPT-3 allows a 2048-token context, and ChatGPT reportedly around 4096 tokens.
When generating a sequence of tokens with an auto-regressive language model, the conditional probabilities are usually rescaled by a temperature, a hyperparameter tuned manually or automatically, an idea borrowed from simulated annealing, which in turn takes its name from the metallurgical process of cooling hot materials gradually.
A language model is probabilistic by nature. Settings like the sampling temperature control how much randomness is introduced.
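A minimal sketch of temperature sampling (the logits below are made up): dividing the scores by a temperature below 1 sharpens the distribution toward the most likely token, while a temperature above 1 flattens it and introduces more randomness.

```python
import math, random

def sample_with_temperature(logits, temperature=1.0):
    """Sample a token index from logits rescaled by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    probs = [e / sum(exps) for e in exps]            # softmax of the rescaled logits
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]    # hypothetical next-token scores
print(sample_with_temperature(logits, temperature=0.7))   # low T: close to greedy
print(sample_with_temperature(logits, temperature=1.5))   # high T: more random
```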
Pre-trained Transformers
Transformers pre-trained on large corpora with self-supervised learning are powerful.
OpenAI open-sourced GPT in June 2018 and GPT-2 in February 2019, and released GPT-3 APIs in June 2020, with Microsoft acquiring an exclusive license in September 2020. GPT stands for Generative Pre-trained Transformer, or Generative Pre-Training.
Google open-sourced BERT: Bidirectional Encoder Representations from Transformers in November 2018, and T5: Text-To-Text Transfer Transformer in February 2020.
GPT models are decoders, natural for language generation. BERT models are encoders that learn from bidirectional context with masked language modeling, making them good at language understanding. T5 is an encoder-decoder model.
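For concreteness, here is a sketch using the Hugging Face transformers library to contrast the two styles; the model names are just examples and the library needs to be installed separately:

```python
from transformers import pipeline

# Decoder (GPT-style): auto-regressive generation, token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("Artificial general intelligence is", max_new_tokens=20))

# Encoder (BERT-style): masked language modeling, oriented toward understanding tasks.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Artificial general intelligence is the [MASK] of AI."))
```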
The state of the art and discussions
Large language models, e.g., OpenAI ChatGPT, DeepMind Gopher, DeepMind Chinchilla, Google PaLM, and Meta LLaMA, have shown impressive results, albeit with issues.
A candid statement from OpenAI
OpenAI published a blog post, How should AI systems behave, and who should decide?, on February 16, 2023. OpenAI confesses that ChatGPT sometimes “falls short of our intent (producing a safe and useful tool) and the user’s intent (getting a helpful output in response to a given input)”. OpenAI plans to improve ChatGPT with respect to bias, and to improve it with advances from the community, e.g., DeepMind’s rule-based rewards and Anthropic’s Constitutional AI. OpenAI proposes to “improve default behavior”, let users “define your AI’s values, within broad bounds”, and seek “public input on defaults and hard bounds”.
Formal vs functional competence
MIT professor Joshua Tenenbaum and colleagues study LLMs in the paper, Dissociating language and thought in large language models: a cognitive perspective. The authors “review the capabilities of LLMs by considering their performance on two different aspects of language use: ‘formal linguistic competence’, which includes knowledge of rules and patterns of a given language, and ‘functional linguistic competence’, a host of cognitive abilities required for language understanding and use in the real world.” and conclude that “LLMs show impressive (although imperfect) performance on tasks requiring formal linguistic competence, but fail on many tests requiring functional competence.” The authors argue that “(1) contemporary LLMs should be taken seriously as models of formal linguistic skills; (2) models that master real-life language use would need to incorporate or develop not only a core language module, but also multiple non-language-specific cognitive capacities required for modeling thought.” The authors also discuss two fallacies: “good at language -> good at thought” and “bad at thought -> bad at language”.
Emergent abilities
There are recent studies about Emergent Abilities of Large Language Models. “Emergence is when quantitative changes in a system result in qualitative changes in behavior.” “An ability is emergent if it is not present in smaller models but is present in larger models.” Emergent abilities have been observed from in-context learning such as few-shot prompting and augmented prompting, in particular, as in the paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. See a blog about tracing the origin of emergent abilities in ChatGPT.
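As a rough illustration of augmented prompting, a chain-of-thought prompt includes worked-out reasoning steps in the few-shot exemplars, in the style of the examples in that paper:

```python
# A few-shot chain-of-thought prompt: the exemplar shows intermediate reasoning,
# which the model is expected to imitate before giving its final answer.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. How many apples does it have?
A:"""
# The prompt would be sent to an LLM for completion.
```

With such exemplars, sufficiently large models tend to produce step-by-step reasoning before the final answer, while smaller models generally do not; this is the kind of emergent-ability observation the paper reports.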
Why do abilities like reasoning emerge from LLMs? One may wonder: can we expect reasoning as an emergent ability of maximum-likelihood language models? Can we expect intervention and counterfactuals as emergent abilities of a computing/learning architecture built merely for association?
One hypothesis is that an LLM, through correlation, somehow recovers causal patterns present in large language corpora, somewhat like an imitation learning method attempts, to some extent, to learn an optimal sequential decision policy. Expert moves helped AlphaGo; however, tabula rasa learning turned out to be the right way for computer Go, as shown by AlphaZero, although for real-life problems, reincarnating RL would be helpful. Another hypothesis is that LLMs attempt to approximate a world model.
The following is basically a copy from the paper Dissociating language and thought in large language models: a cognitive perspective. We argue that the approach that has dominated the field for the last five years—training LLMs on large “naturalistic” text corpora from the web with a words-in-context prediction objective—is insufficient to induce the emergence of functional linguistic competence. First, this approach is biased toward low-level input properties, leading to unstable model behavior that depends on a particular way the prompt is phrased. Second, information contained in regular text corpora does not faithfully reflect the world: for instance, it is biased toward unusual events and contains little commonsense knowledge. Third, and perhaps most crucially, it incentivizes the models to learn patterns in the text (at various levels of abstraction) but limits their ability to generalize out-of-distribution. Finally, even in cases where LLMs succeed, the amount of naturalistic data required for non-linguistic capacities to emerge is ridiculously large, making this approach vastly inefficient (and environmentally irresponsible).
Even so, there will likely be significant efforts to push the limits of ability emergence by scaling models up, especially by believers in empiricism.
Scaling up is all you need?
People tend to scale up the size of the network and of the training dataset to achieve better performance and even more emergent abilities, and may refer to The Bitter Lesson to support scaling up.
Rich Sutton states in the blog The Bitter Lesson: “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.” “One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.” Rich highlights the importance of meta methods. See also, A Better Lesson by Rodney Brooks and Engineering AI by Leslie Kaelbling.
In the paper Training Compute-Optimal Large Language Models, Hoffmann et al. from DeepMind find that “for compute-optimal training, the model size and the number of training tokens should be scaled equally”. The authors propose Chinchilla, an LLM “that uses the same compute budget as Gopher but with 70B parameters and 4× more data”, and observe that “Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks”.
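As a rough back-of-the-envelope sketch of that finding (using the common approximations that training compute C ≈ 6·N·D and that compute-optimal training uses roughly 20 tokens per parameter; the budget below is illustrative):

```python
def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into model size N and training tokens D.

    Uses the common approximations C ~= 6 * N * D and D ~= tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly a Gopher/Chinchilla-scale compute budget (illustrative).
n, d = compute_optimal(5.76e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e9:.0f}B")  # on the order of 70B params, ~1.4T tokens
```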
In a similar vein, Touvron et al. from Meta introduce LLaMA: Open and Efficient Foundation Language Models, “a collection of foundation language models ranging from 7B to 65B parameters”. The authors train their models “on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets” and show that LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.
Specialized LLMs usually have much smaller sizes, e.g., Stanford 2.7B BioMedLM, Microsoft 1.5B BioGPT, and 1.2B ProGen (ProGen2 up to 6.4B).
There is still ample room to improve current LLMs, or foundation models in general, in particular with multi-modal data, e.g., image, voice, and video. See, e.g., Multimodal Chain-of-Thought Reasoning in Language Models.
The classic experiment by Held and Hein showed the importance of interactive perception: “only the active kittens developed meaningful visually-guided behavior that was tested in separate tasks”. Google SayCan studies grounding language in robotic affordances. Pierre-Yves Oudeyer and colleagues study functional grounding of LLMs with RL. There is more recent work in robotics leveraging “Bitter Lesson 2.0”. Sergey Levine discusses the purpose of a language model, how RL can help fulfill it, and the role of data and optimization for emergence.
“Scaling up is all you need” is a misreading of the Bitter Lesson. For example, scaling up heuristic search algorithms like A* and IDA* won’t achieve superhuman Go. The combination of excellent achievements in deep learning, reinforcement learning, and Monte Carlo tree search (MCTS), together with powerful computing, set that landmark.
Small language models
How about small language models and traditional natural language processing (NLP)? These constitute the majority of work in NLP and may shed light on further progress in LLMs. See, e.g., Compositional Attention Networks for Machine Reasoning by Drew A. Hudson and Christopher D. Manning, and the talk David V.S. Goliath: the Art of Leaderboarding in the Era of Extreme-Scale Neural Models by Yejin Choi.
Prompting and in-context learning
Prompting and in-context learning are effective at eliciting the full potential of current LLMs.
In the short or medium term, while prompting remains necessary, it can be automated (see the sketch below) and may be wrapped as part of an LLM, e.g., hidden right behind the interface. In the long term, in the context of AGI, prompting is ad hoc, temporary, or even unnecessary: the more powerful the LLM, the less important, or the simpler, the prompting.
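As a minimal sketch of what automating prompting could look like, here is a hypothetical prompt search over a handful of candidate templates; llm and score are made-up stand-ins for a model call and a task-specific quality metric:

```python
def best_prompt(task_inputs, candidates, llm, score):
    """Pick the candidate prompt template that scores best on a small dev set.

    `llm(text)` returns a model output; `score(output, input)` rates it.
    Both are hypothetical stand-ins here.
    """
    def avg_score(template):
        outputs = [llm(template.format(x=x)) for x in task_inputs]
        return sum(score(o, x) for o, x in zip(outputs, task_inputs)) / len(task_inputs)
    return max(candidates, key=avg_score)

candidates = [
    "Summarize the following text: {x}",
    "Explain the following text to a ten-year-old: {x}",
    "TL;DR: {x}",
]
# best = best_prompt(dev_inputs, candidates, llm=my_model, score=my_metric)
```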
Such an argument may extend to in-context learning, even though it enables few-shot learning. Again, we are discussing AGI. Imagine you are talking with a sufficiently smart and knowledgeable person who is capable of effective and efficient communication. She will understand and fulfill your intent, possibly by initiating a dialogue with you, rather than requiring you to figure out how to talk to her.
Prompting and in-context learning will likely stay for a while. For how long? How important will they be for next-generation LLMs, especially AGI-level LLMs? Those are the questions.
Augmented language models
As discussed by Yoav Goldberg, a “traditional” language model is trained with natural text data alone, while ChatGPT is no longer traditional: it is augmented with instruction tuning, programming-language code data, and reinforcement learning from human feedback.
As discussed earlier, ChatGPT and most general LLMs are good at linguistic skills like essay writing, but not competent at functional uses like factuality, commonsense, arithmetic, and reasoning. It is natural to investigate how to integrate current LLMs with external components such as a search engine, an external knowledge base, or a symbolic AI solver. See, e.g., a recent paper by Yann LeCun and colleagues, Augmented Language Models: a Survey.
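A minimal, hypothetical sketch of such an integration: the model may emit a tool call (here a calculator, using a made-up CALC[...] convention), the tool result is appended to the context, and generation continues:

```python
import re

def answer_with_calculator(question, llm):
    """Let a (hypothetical) LLM delegate arithmetic to an external tool."""
    context = question
    for _ in range(5):                          # cap the number of tool calls
        response = llm(context)
        match = re.search(r"CALC\[(.+?)\]", response)
        if not match:
            return response                     # model answered directly
        expression = match.group(1)
        result = eval(expression, {"__builtins__": {}})   # toy calculator; unsafe in general
        context += f"\n{response}\nTool result: {result}\n"
    return response
```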
The goal of a language model
Shane Legg and Marcus Hutter define intelligence in Universal Intelligence: A Definition of Machine Intelligence as “Intelligence measures an agent’s ability to achieve goals in a wide range of environments.” Then, what is the goal of a language model? Is such a goal what we want? Can we define it precisely? How about the current performance metrics?
Reward is closely related to goal. David Silver, Satinder Singh, Doina Precup, and Richard Sutton propose the reward-is-enough hypothesis: “Intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.” Sutton later acknowledged in the talk The Increasing Role of Sensorimotor Experience in AI that “But still, for many, reward is not enough” and “Enough for animals maybe, enough for engineering okay, but not enough for people, not enough for intelligence”.
Language modelling is a complex multi-objective optimization problem, with many objectives, e.g., groundedness, safety, unbiasedness, sensibleness, specificity, and interestingness, and different users have different or even contradictory preferences. Should we expect a general language model to be optimal for all objectives and for all users?
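To make the multi-objective nature concrete, here is a toy sketch that collapses several hypothetical per-objective scores into one scalar with a weighted sum; the weights are arbitrary, which is exactly why a single goal is hard to pin down:

```python
def scalarized_score(scores, weights):
    """Collapse per-objective scores into one number via a weighted sum.

    Any particular choice of weights encodes one trade-off among objectives;
    a different user may prefer different weights.
    """
    return sum(weights[k] * scores[k] for k in weights)

scores = {"groundedness": 0.9, "safety": 0.8, "sensibleness": 0.7, "interestingness": 0.4}
weights_a = {"groundedness": 0.4, "safety": 0.4, "sensibleness": 0.1, "interestingness": 0.1}
weights_b = {"groundedness": 0.1, "safety": 0.1, "sensibleness": 0.3, "interestingness": 0.5}
print(scalarized_score(scores, weights_a))   # 0.79 under one set of preferences
print(scalarized_score(scores, weights_b))   # 0.58 under another
```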
We pose many questions here, yet leave them open for now.
Yann LeCun’s opinion on current (auto-regressive) LLMs
The following is from Yann LeCun’s Tweet on February 13, 2023:
My unwavering opinion on current (auto-regressive) LLMs
1. They are useful as writing aids.
2. They are "reactive" & don't plan nor reason.
3. They make stuff up or retrieve stuff approximately.
4. That can be mitigated but not fixed by human feedback.
5. Better systems will come
6. Current LLMs should be used as writing aids, not much more.
7. Marrying them with tools such as search engines is highly non trivial.
8. There *will* be better systems that are factual, non toxic, and controllable. They just won't be auto-regressive LLMs.
I have been consistent while:
9. defending Galactica as a scientific writing aid.
10. Warning folks that AR-LLMs make stuff up and should not be used to get factual advice.
11. Warning that only a small superficial portion of human knowledge can ever be captured by LLMs.
12. Being clear that better system will be appearing, but they will be based on different principles. They will not be auto-regressive LLMs.
13. Why do LLMs appear much better at generating code than generating general text? Because, unlike the real world, the universe that a program manipulates (the state of the variables) is limited, discrete, deterministic, and fully observable. The real world is none of that.
14. Unlike what the most acerbic critics of Galactica have claimed
- LLMs *are* being used as writing aids.
- They *will not* destroy the fabric of society by causing the mindless masses to believe their made-up nonsense.
- People will use them for what they are helpful with.
More discussions
Have AI Chatbots Gone Off The Rails? Michael Littman Answers.
Noam Chomsky on ChatGPT: It’s “Basically High-Tech Plagiarism” and “a Way of Avoiding Learning”
ChatGPT is not all you need. A State of the Art Review of large Generative AI models
GitHub Copilot may be perfect for cheating CompSci programming exercises
One way to keep up-to-date with such discussions is to follow experts on Twitter, e.g., Yann LeCun, Gary Marcus, and Melanie Mitchell.
Potential approaches to AGI
There are long-standing debates about nature versus nurture, empiricism versus rationalism, and connectionism versus symbolism.
Judea Pearl presents in a blog, Radical Empiricism and Machine Learning Research, “three arguments why empiricism should be balanced with the principles of model-based science, in which learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated”, and the three arguments are expediency, transparency and explainability.
To approach AGI from language models, the authors of Dissociating language and thought in large language models: a cognitive perspective suggest that, “instead of or in addition to scaling up the size of the models, more promising solutions will come in the form of modular architectures …, like the human brain, integrate language processing with additional systems that carry out perception, reasoning, and planning”. The authors believe that “a model that succeeds at real-world language use would include—in addition to the core language component—a successful problem solver, a grounded experiencer, a situation modeler, a pragmatic reasoner, and a goal setter”.
Quote from Artificial Intelligence—The Revolution Hasn’t Happened Yet by Michael Jordan, “These problems include the need to bring meaning and reasoning into systems that perform natural language processing, the need to infer and represent causality, the need to develop computationally-tractable representations of uncertainty and the need to develop systems that formulate and pursue long-term goals.”
Yoshua Bengio, Yann LeCun, and Geoffrey Hinton are Turing Award Laureates “For conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.” The following is a copy from their article Deep learning for AI in Communications of the ACM: “How are the directions suggested by these open questions related to the symbolic AI research program from the 20th century? Clearly, this symbolic AI program aimed at achieving system 2 abilities, such as reasoning, being able to factorize knowledge into pieces which can easily recombined in a sequence of computational steps, and being able to manipulate abstract variables, types, and instances. We would like to design neural networks which can do all these things while working with real-valued vectors so as to preserve the strengths of deep learning which include efficient large-scale learning using differentiable computation and gradient-based adaptation, grounding of high-level concepts in low-level perception and action, handling uncertain data, and using distributed representations.”
From Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report, “The burgeoning area of neurosymbolic AI, which unites classical symbolic approaches to AI with the more data-driven neural approaches, may be where the most progress towards the AI dream is seen over the next decade.”
Yann LeCun proposes, in a position paper A Path Towards Autonomous Machine Intelligence, a system architecture for autonomous intelligence with differentiable configurator, perception, world model, cost, short-term memory, and actor modules.
Richard Sutton states in the talk The Increasing Role of Sensorimotor Experience in AI, “Over AI’s seven decades, experience has played an increasing role; I see four major steps in this progression: Step 1: Agenthood (having experience), Step 2: Reward (goals in terms of experience), Step 3: Experiential state (state in terms of experience), and Step 4: Predictive knowledge (to know is to predict experience). For each step, AI has reluctantly moved toward experience in order to be more grounded, learnable, and scalable.” Sutton presents The Quest for a Common Model of the Intelligent Decision Maker.
Shane Legg and Marcus Hutter in Universal Intelligence: A Definition of Machine Intelligence compare tests of intelligence: Turing Test, Total Turing Test, Inverted Turing Test, Toddler Turing Test, Linguistic Complexity Test, Compression Test, Turing Ratio, Psychometric AI, Smith’s Test, C-Test, and Universal Intelligence, with respect to the properties: valid, informative, wide range, general, dynamic, unbiased, fundamental, formal, objective, fully defined, universal, practical, and test vs. definition.
Francois Chollet provides the definition: “The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.”, and the Abstraction and Reasoning Corpus (ARC) benchmark, in the paper On the Measure of Intelligence.
Epilogue
Emergent abilities of large language models are astonishing yet brittle, calling for more studies of reliability and interpretability. In the short term, augmenting current large language models is a promising way to unlock powerful functionality. In the medium to long term, large language models are a potential route to AGI, after injecting missing ingredients, in particular a world model, planning, and reasoning. Current large language models, like OpenAI ChatGPT, Google LaMDA, DeepMind Sparrow, Anthropic Claude, Meta LLaMA, and Hugging Face BLOOM, set a good starting point toward AGI.