Faith and Fate: Limits of Transformers on Compositionality

Nouha Dziri
Published in AI2 Blog
7 min read · Nov 17, 2023


Contributors: Nouha Dziri*, Ximing Lu*, Melanie Sclar*, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, Yejin Choi

Large-scale Transformers like ChatGPT and GPT-4 have taken the world by storm with their incredible capabilities, even hailed as “sparks of AGI”. However, they often fail at surprisingly trivial problems. For instance, humans can solve 3-digit by 3-digit multiplication after learning basic calculation rules, yet off-the-shelf GPT-4 achieves only 59% accuracy on this task.

This raises the question: under what conditions do Transformers succeed or fail, and why? What types of errors do they make? Can Transformers uncover implicit problem-solving rules, or be taught to follow reasoning paths?

There is a great deal of hype in the field around the capabilities of Transformers, to the point that some believe they can handle any task effortlessly. In this context, our work is a timely attempt to address potential misconceptions: we explore whether pushing Transformers to their limits can lead to complete mastery of such tasks.

Through extensive experiments, our NeurIPS 2023 paper provides novel insights and a nuanced understanding of why and when Transformers succeed or struggle in compositional problems that require strict multi-hop reasoning to derive correct solutions.

We study three tasks: multiplication, logic grid puzzles, and a classic dynamic programming problem. We formulate these tasks as computation graphs which break down problem-solving into smaller, functional steps.

Transformation of an algorithm A into its computation graph GA(x). The depicted example is the long-form multiplication algorithm A, for inputs x = [7, 49] (i.e., computing 7 x 49).
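To make the graph formulation concrete, here is a minimal Python sketch (not the paper's implementation, and deliberately simplified: it tracks only one-digit partial products and their sum, omitting carries and digit-level nodes) that builds such a graph for 7 x 49 using networkx:

```python
# Minimal, simplified sketch of a computation graph for long-form multiplication.
import networkx as nx

def multiplication_graph(a: int, b: int) -> nx.DiGraph:
    """Build a toy computation graph for a * b from one-digit partial products."""
    g = nx.DiGraph()
    digits_b = [int(d) for d in str(b)][::-1]          # least-significant digit first
    partials = []
    for place, d in enumerate(digits_b):
        node = f"partial_{place}"                      # e.g. 7 * 9, then 7 * 4 (shifted)
        g.add_node(node, op="one_digit_multiply", value=a * d * 10**place)
        partials.append(node)
    g.add_node("sum", op="add_partials",
               value=sum(g.nodes[p]["value"] for p in partials))
    for p in partials:
        g.add_edge(p, "sum")                           # the final sum depends on every partial
    return g

g = multiplication_graph(7, 49)
print(g.nodes["sum"]["value"])   # 343, i.e. 7 x 49
# The depth and width of the graph grow with the number of digits, which is how
# task complexity is quantified in the analyses below.
```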

These are our findings:

1. Model Performance Decreases as Graph Complexity Increases

To investigate the inherent problem-solving capabilities of LLMs, we pushed transformer LLMs to their limits through zero-shot, few-shot, and fine-tuning experiments. Across all tasks, we observed a significant decline in performance from nearly perfect to zero as complexity increased. For instance, GPT-4 achieves only 59% accuracy on 3-digit x 3-digit multiplication, dropping to 4% for 4-digit x 4-digit multiplication.

(a) Zero-shot accuracy. Axes refer to problem sizes (number of digits in multiplication, number of houses and attributes in the puzzle, and sequence length in the DP task). Transformers’ accuracy decreases to near zero as task complexity, measured by problem size, increases. (b) Average parallelism negatively correlates with accuracy.

What if we introduced a step-by-step solution, also known as a “scratchpad,” to the models? Testing this approach with a few-shot scratchpad notably boosted performance, for instance increasing GPT-4’s accuracy on 3-digit by 3-digit multiplication from 59% to 92%. However, the improvement didn’t extend to more complex problems, where accuracy remained near zero. Perhaps this is due to insufficient exposure to task-specific data during pretraining. To address this, we decided to extensively train the models on a huge amount of task-specific data.
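For illustration, a few-shot scratchpad prompt might look like the sketch below (the exact format used in the paper is more detailed; this is only meant to convey the idea of spelling out every intermediate step before the final answer):

```python
# Illustrative prompt only; the paper's scratchpad format differs.
scratchpad_prompt = """Question: What is 7 times 49?
Scratchpad:
7 * 9 = 63 -> write 3, carry 6
7 * 4 = 28, 28 + 6 = 34 -> write 34
Answer: 343

Question: What is 64 times 87?
Scratchpad:
"""
# The model is asked to continue the text, producing its own intermediate
# steps before emitting the final line "Answer: 5568".
```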

Even after exhaustive fine-tuning on multiplication, dynamic programming, and logic puzzles, the models couldn’t fully master the tasks. While they excelled on in-distribution cases, they utterly failed to generalize to out-of-distribution (OOD) cases.

GPT-3 fine-tuned exhaustively on task-specific data up to a certain problem size. The blue region represents in-distribution examples and the red region OOD examples.

These results reveal that the autoregressive nature of Transformers poses an inherent challenge that cannot be resolved by instructing the model to generate a step-by-step solution. Models fundamentally rely on a greedy process, predicting the next word without a comprehensive global understanding of the task.

2. Does Grokking Solve Multiplication?

Next, we explore whether extended training beyond overfitting leads to improved generalization abilities, a phenomenon known as grokking. We fine-tune GPT-3 with question-answer pairs and with question-scratchpad pairs, training both models far beyond the point at which in-domain accuracy plateaus. Results show no improvement in OOD generalization beyond the overfitting point, even after extensive training. We speculate that the increased difficulty of the task impedes learning the kind of well-structured representation associated with grokking.

Results of training GPT-3 beyond the overfitting point on the multiplication task, with and without a scratchpad, to explore whether OOD generalization capabilities (i.e., grokking) arise.

3. Information Gain Explains Where Transformers Partially Excel

When Transformers fail to produce the correct answer, we observed that they often predict parts of the response accurately even though the overall answer is wrong. For instance, in multiplication, the model may correctly guess the first and last digits of the product while getting the rest wrong. To investigate this phenomenon, we use relative information gain to predict the surface patterns the model is likely to learn. The analysis shows a high correlation between the first digits of the output and the first digits of the input numbers, suggesting that the model is likely picking up this spurious pattern.

This suggests that if an output element heavily relies on a single input feature or a small set of them, Transformers are likely to pick up that correlation during training and make partial guesses at test time without executing the full multi-step reasoning the task requires. There is nothing wrong with learning shortcuts; humans commonly use them to deliver answers swiftly. The key difference lies in our ability to discern when and how to use shortcuts, a skill LLMs seem to lack.

Transformers partially excel at predicting the answer by learning shortcuts.
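As a rough illustration of the kind of signal this analysis picks up (this uses plain empirical mutual information over random samples, a simplification of the relative information gain used in the paper), one can check how much the leading digits of the operands reveal about the leading digit of the product versus its last digit:

```python
# Sketch: how predictive are the inputs' first digits for different output digits?
import random
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Empirical I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy, px, py = Counter(pairs), Counter(x for x, _ in pairs), Counter(y for _, y in pairs)
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

samples = [(random.randint(100, 999), random.randint(100, 999)) for _ in range(50_000)]
first_digits_in = [(str(a)[0], str(b)[0]) for a, b in samples]
first_digit_out = [str(a * b)[0] for a, b in samples]
last_digit_out = [str(a * b)[-1] for a, b in samples]

print("I(first input digits ; first output digit):",
      round(mutual_information(list(zip(first_digits_in, first_digit_out))), 3))
print("I(first input digits ; last output digit): ",
      round(mutual_information(list(zip(first_digits_in, last_digit_out))), 3))
# The first quantity is far larger than the second, consistent with models
# guessing leading digits from a shallow surface pattern rather than computing them.
```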

4. Transformers Reduce Multi-Step Compositional Reasoning into Linearized Subgraph Matching

We now explore whether models’ correct predictions on unseen test data result from learning the underlying algorithm or are instead explained by exposure to similar training examples. For each test example’s computation graph, we computed the average frequency with which the partial computations needed to solve the task appear in the training data, separately for correctly and incorrectly predicted examples.

We found that Transformers’ successes are heavily linked to having seen significant portions of the required computation graph during training — suggesting that compositional behaviors may truly be pattern matching.

Although the final prediction may look highly compositional, the true compositionality on display is much less impressive, because the solutions can often be pieced together from input-output sequences present in the training data. This type of learning can be very effective when the compositional complexity of a task is low, but it becomes far less effective as tasks grow more complex.

Transformers collapse multi-step reasoning into analogical sub-graph matching.
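A rough sketch of this measurement, reusing the simplified partial-product representation from earlier (this is an illustration of the idea, not the paper's code): collect every intermediate computation that occurs in the training examples, then ask what fraction of a test example's intermediate computations were already seen.

```python
# Sketch: fraction of a test example's partial computations seen during training.

def partial_computations(a: int, b: int) -> set[tuple]:
    """Toy set of intermediate steps for long-form a * b: one-digit partial products."""
    return {("partial", a, d, place) for place, d in enumerate(str(b)[::-1])}

train_pairs = [(7, 49), (7, 94), (8, 49)]                 # stand-in training data
seen = set().union(*(partial_computations(a, b) for a, b in train_pairs))

def overlap(a: int, b: int) -> float:
    steps = partial_computations(a, b)
    return len(steps & seen) / len(steps)

print(overlap(7, 49))   # 1.0 -- every intermediate step appeared in training
print(overlap(6, 53))   # 0.0 -- an unseen region of the computation graph
```

In the paper's analysis, correctly predicted test examples tend to have much higher overlap of this kind than incorrectly predicted ones.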

5. What Types of Errors Do Transformers Make at Different Reasoning Depths?

For a clearer picture of where Transformers fall short, we analyze four types of errors that Transformers make at nodes of different depths in the computation graph. Our analyses show that models can correctly perform single-step reasoning, potentially because they memorize such single-step operations during training, but fail to plan and compose several of these steps into an overall correct chain of reasoning.

Errors that Transformers Make at Different Reasoning Depths
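A hedged sketch of this per-node error taxonomy (the category names mirror the paper's local, propagation, and restoration errors, but the exact checks here are our own simplification): compare a predicted node value against what the node's operation yields from the ground-truth inputs and from the model's own, possibly erroneous, inputs.

```python
# Sketch: classify a single graph node's error given the model's predicted inputs/output.

def classify_node(pred_value, pred_inputs, true_inputs, op):
    true_value = op(*true_inputs)
    value_from_pred_inputs = op(*pred_inputs)
    if pred_value == true_value and pred_inputs == true_inputs:
        return "correct"
    if pred_inputs == true_inputs:
        return "local error"            # the single step itself was computed wrong
    if pred_value == value_from_pred_inputs:
        return "propagation error"      # step is right, but it consumed earlier mistakes
    if pred_value == true_value:
        return "restoration error"      # right answer despite wrong inputs
    return "local + propagation error"

add = lambda x, y: x + y
print(classify_node(12, (5, 7), (5, 7), add))   # correct
print(classify_node(13, (5, 7), (5, 7), add))   # local error
print(classify_node(13, (6, 7), (5, 7), add))   # propagation error
```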

6. Error Propagation: The Theoretical Limits

So far, we have highlighted the limitations of current Transformers in handling complex, multi-step reasoning tasks: errors escalate rapidly as the problem size grows. Algorithms for compositional tasks often involve multiple independent applications of a function (width) and/or iterated applications of the same function (depth), and in executing such algorithms, Transformers act as estimators of these functions. They struggle in particular with problems of large graph width and depth. Our theoretical analysis explains why these models perform significantly worse on compositional tasks as the problem size increases: analyzing the probability of a Transformer reaching the correct answer as the problem grows, we show that, under reasonable assumptions, the probability of an incorrect prediction converges exponentially fast to 1 for abstract compositional tasks.
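A back-of-the-envelope illustration of the exponential argument (assuming, for simplicity, that each node of the computation graph is estimated correctly and independently with probability p, which is a cruder assumption than the paper's actual analysis):

```python
# Toy illustration, not the paper's theorem: with per-node success probability p
# and independent errors, the chance that an n-node computation graph is solved
# with no mistakes is p ** n, which vanishes exponentially as graphs grow.
for p in (0.99, 0.95, 0.90):
    for n_nodes in (10, 100, 1000):
        print(f"p={p:.2f}, nodes={n_nodes:4d}: P(all nodes correct) = {p ** n_nodes:.2e}")
# Equivalently, the probability of at least one error, 1 - p ** n, converges to 1
# exponentially fast in the size of the graph.
```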

Conclusions

Collapsed Compositionality and Robustness Implications

Transformers today demonstrate undeniably powerful empirical results. Yet, our study suggests that they may have fundamental weaknesses in intellectual tasks that require true multi-step compositional operations. Our careful analysis shows that Transformers often solve multi-step compositional problems by collapsing the depth of the compositional operations via analogical pattern matching. This suggests that their strong performance should be taken with a grain of salt: certain tasks that initially appear challenging may not possess the inherent compositionality they seem to have. As our study shows, however, such an approach ultimately results in poor generalization.

Avenues for Improving Transformer Performance on Compositional Tasks

Building on these findings, we suggest several empirical strategies for harnessing the potential of transformer LLMs. First, Transformers may be best suited for compositional tasks whose evaluation metrics can afford some leniency, for example finding approximate solutions that do not require executing the whole graph, such as identifying the most significant digit of a multiplication. Second, we suggest augmenting Transformers with planning modules, as well as with refinement methods that can iteratively improve their generations.

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
