
LLMs Are Not As Smart As You Think

Transformers struggle with intellectual tasks that require multiple steps of combining information, such as solving multiplication or logic puzzles, and often resort to finding patterns and shortcuts instead of truly understanding the composition of the problem.


A recent paper from MIT claimed that GPT-4 scored 100% on MIT’s curriculum, but further investigation revealed incomplete questions and biased evaluation methods, with the true accuracy turning out to be significantly lower, undermining the paper’s claims. Over time, several researchers have jumped onto the bandwagon of publishing papers on LLMs, especially ones about ChatGPT passing the US medical exam, the bar exam and so on. However, when the same LLM-based chatbots are asked to solve simple maths problems or spell words like ‘lollipop’ backwards, they fail terribly. LLMs like GPT-3.5, GPT-4, LLaMA and PaLM 2 have all proved to be terrible at these seemingly easy tasks.

But why does this happen?

Most of the papers published in recent times are full of fluff. But finally, we have “Faith and Fate: Limits of Transformers on Compositionality”, a paper from the Allen Institute for AI that discusses the limitations of these transformer-based models. Authored by researchers from the University of Washington, the University of Southern California and the University of Chicago, the paper examines the fundamental limits of transformer language models by focusing on compositional problems that require multi-step reasoning. The study investigates three representative compositional tasks: long-form multiplication, logic grid puzzles (e.g., Einstein’s puzzle), and a classic dynamic programming problem.

According to Microsoft’s research paper, ‘Sparks of AGI: Early experiments with GPT-4’, such language models represent an early version of artificial general intelligence (AGI). The scientific community, however, remains divided on the true capabilities of LLMs, and the Faith and Fate paper sheds some light on how they actually work.

Getting Dumber by the Day

To gain a better understanding of how LLMs compare to human thought processes, the researchers used a graph structure. In this approach, human problem-solving skills can be thought of as a graph structure, where each vertex represents a partial solution and the edges signify operators that modify these solutions. This conceptual framework is then extrapolated, providing a basis for understanding the reasoning abilities of transformers. 
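To make the framing concrete, here is a minimal sketch in Python of how long-form multiplication can be laid out as such a graph. The function names and structure below are our own illustration under that framing, not code from the paper:

```python
# A minimal sketch of the computation-graph framing (our own toy code, not the
# authors'): each vertex is a partial result, each edge the operation that
# produced it. Long-form multiplication, e.g. 47 x 36, decomposes into
# single-digit products, shifts and a final sum, so the answer is only right
# if every intermediate vertex along the graph is right.

def multiplication_graph(a: int, b: int) -> dict:
    """Build a toy computation graph for a * b via long multiplication."""
    graph = {}  # vertex name -> (value, names of the vertices it depends on)
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]

    rows = []
    for i, bd in enumerate(b_digits):
        row = 0
        for j, ad in enumerate(a_digits):
            graph[f"mul_{i}_{j}"] = (ad * bd, [f"a[{j}]", f"b[{i}]"])  # one-digit product
            row += ad * bd * 10 ** j
        graph[f"row_{i}"] = (row * 10 ** i, [f"mul_{i}_{j}" for j in range(len(a_digits))])
        rows.append(row * 10 ** i)

    graph["answer"] = (sum(rows), [f"row_{i}" for i in range(len(b_digits))])  # final sum
    return graph

g = multiplication_graph(47, 36)
print(g["answer"][0])  # 1692 -- wrong if any upstream vertex is wrong
```

Because every edge must be traversed correctly, a single wrong intermediate vertex propagates to the final answer, which is exactly the failure mode the paper probes.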

The researchers then put popular LLMs like ChatGPT, GPT-3 and GPT-4 to the test on multi-step compositional tasks. They found that across zero-shot, few-shot and fine-tuned settings, transformer models show a drop in performance as task complexity increases. While fine-tuning with task-specific data improves performance within the trained domain, it fails to generalise to unseen examples. Even explicit training with scratchpads does not enable the models to learn component operations effectively.

The autoregressive nature of transformers presents a fundamental challenge in understanding tasks comprehensively. These findings underscore the pressing need for advancements in transformer architecture and training methods.

According to Yann LeCun, the chief AI scientist at Meta, “Auto-regressive LLMs are like processes that keep getting away from the correct answers exponentially”.

When you generate a response with these models, each generated word carries some probability of being incorrect, and as more words are generated, the probability of the entire response being correct decreases exponentially because errors accumulate.
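As a rough, hedged illustration of the arithmetic behind this argument (our simplification, not LeCun’s exact formulation), suppose each token independently carries an error probability e; the chance that an n-token answer comes out fully correct is then (1 − e)^n:

```python
# A back-of-the-envelope sketch of the exponential-divergence argument (our
# simplification, not LeCun's exact model): if each generated token
# independently has probability e of derailing the answer, the chance an
# n-token response stays fully correct is (1 - e) ** n.

def prob_fully_correct(e: float, n: int) -> float:
    return (1 - e) ** n

for n in (10, 100, 1000):
    print(n, prob_fully_correct(0.01, n))
# 10   ~0.90
# 100  ~0.37
# 1000 ~0.00004 -- even a 1% per-token error rate collapses long answers
```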

Reinforcement Learning from Human Feedback (RLHF) may decrease the probability of errors, but it does not change the fact that token generation is still auto-regressive and subject to exponential divergence. LeCun believes the problem cannot be completely eliminated, because each token is still generated based on the previous ones.

Transformers excel at single-step reasoning but struggle to extend their capabilities to more complex scenarios. However, the scientists behind the paper also mentioned some training methods that might help LLMs push past this seemingly unbreakable boundary. 

The Way Forward

Researchers have tried different approaches to improve the performance of transformers in compositional tasks, such as fine-tuning the models or teaching them explicit reasoning steps. However, these approaches have not achieved 100% accuracy, especially in out-of-domain settings where the models encounter new types of problems.

Transformers sometimes produce partially correct answers even when the overall response is incorrect, as the models can learn specific patterns within the task distribution. This allows them to make educated guesses without understanding the task’s requirements. The concept of relative information gain helps predict which of these patterns transformers are likely to learn.
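As a toy illustration of such a pattern (our own example, not one taken from the paper’s code), the last digit of a product depends only on the last digits of the two operands, so a model can learn to get that digit right without ever carrying out the full multiplication:

```python
# A toy illustration (our own example, not from the paper) of why partially
# correct answers are cheap: the last digit of a product depends only on the
# last digits of the operands, so a shallow local pattern gets that one digit
# right every time without performing the full multiplication.

import random

random.seed(0)
trials, hits = 1000, 0
for _ in range(trials):
    a, b = random.randint(100, 999), random.randint(100, 999)
    shortcut = (a % 10) * (b % 10) % 10  # look only at the operands' last digits
    hits += shortcut == (a * b) % 10     # compare against the true product's last digit
print(hits / trials)  # 1.0 -- the shortcut never misses this one digit
```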

The main issue is that transformers tend to reduce multi-step reasoning to linearised subgraph matching, relying on pattern matching rather than genuine compositional reasoning, which makes them poor at tasks that demand planning and multiple steps of correct reasoning. Thus, it can be said that transformers often memorise specific operations during training, producing correct-looking outputs despite incorrect computations.

Should LLMs be Replaced?

Although transformers perform well in single-step reasoning tasks, they face difficulties when it comes to combining multiple steps effectively. The models also struggle to generalise their knowledge, including easy-to-hard generalisation and generalisation on mathematical integration, and achieving full mastery with accurate generalisation still remains out of reach.

Transformers, while powerful language models, exhibit limitations in their ability to perform complex compositional reasoning. Their reliance on patterns, memorisation, and single-step operations impedes their effectiveness in tackling challenging tasks. 

The research paper highlights the importance of advancing transformer architecture and training methods to address these limitations and enable future breakthroughs in compositional reasoning. Further exploration in this domain holds the key to unlocking the full potential of AGI.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.