The Evolution of Interpretability: Angelica Chen’s Exploration of “Sudden Drops in the Loss”

NYU Center for Data Science
4 min read · Oct 10, 2023
CDS PhD Student Angelica Chen

Most machine learning interpretability research analyzes the behavior of models after their training is complete, and is often correlational, or even anecdotal. However, limiting analysis to post-training behavior restricts our understanding of training dynamics and the development of the model, which are key to establishing a mature, reliable science of AI. This is why, in a recent paper on the interpretation of masked language models (MLMs) titled “Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs,” CDS PhD student Angelica Chen and her co-authors delved into the intricate dynamics during the training process itself. Their paper not only challenges prevailing notions in AI interpretability but also sets a new precedent for future research.

In a recent interview, Chen explained the importance of studying interpretability artifacts not just at the end of a model’s training but throughout its entire learning process. “A lot happens to these interpretability artifacts during training,” said Chen, who believes that by only focusing on the end result, we might be missing out on understanding the entire journey of the model’s learning.

The paper is a case study of syntax acquisition in BERT (Bidirectional Encoder Representations from Transformers), an MLM that gained significant attention around 2018–2019 and is now often used as a base model fine-tuned for various tasks, such as classification. Chen’s study delves into the attention mechanism within BERT, focusing in particular on the phenomenon in which different “heads,” or instances of the attention matrix, show patterns corresponding to syntactic relations between tokens.

The groundbreaking element of Chen’s methodology is the introduction of interventions during training, which can be used to establish causal relationships between differences in training and the model’s later behavior. Chen and her team applied this kind of intervention to Syntactic Attention Structure (SAS), which, in a model like BERT, directly precedes a spike in the Unlabeled Attachment Score (UAS) and can be interpreted as the stage of the model’s language acquisition in which it learns syntax. The team then used BLiMP (the Benchmark of Linguistic Minimal Pairs) to assess the development of more complex linguistic abilities following the onset of SAS, finding a direct correlation between the emergence of SAS and the model’s improvement in performance.
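To make the UAS-style measurement concrete, here is a minimal sketch of how one might score attention heads against a gold dependency parse. It illustrates the general idea rather than the authors’ evaluation code; the function name implicit_uas, the word-aligned attention tensor, and the gold_heads dictionary are assumptions made for the example. Each head “predicts” a parent for every word by pointing at the token it attends to most, and the accuracy of the best head serves as the score.

```python
# Illustrative sketch (not the paper's code): score each attention head as a
# dependency "parser" by treating the most-attended token as a word's parent,
# then report the accuracy of the best head against a gold parse.
import torch

def implicit_uas(attentions: torch.Tensor, gold_heads: dict[int, int]) -> float:
    """UAS-style score of the single best attention head.

    attentions: [num_layers, num_heads, seq_len, seq_len], assumed to be
        already aligned to word positions (subword handling omitted here).
    gold_heads: maps each word index to the index of its syntactic head.
    """
    num_layers, num_heads, _, _ = attentions.shape
    best_score = 0.0
    for layer in range(num_layers):
        for head in range(num_heads):
            att = attentions[layer, head]          # [seq_len, seq_len]
            predicted_parent = att.argmax(dim=-1)  # most-attended token per word
            correct = sum(int(predicted_parent[child] == parent)
                          for child, parent in gold_heads.items())
            best_score = max(best_score, correct / len(gold_heads))
    return best_score
```

Running a probe like this on attention maps saved at successive checkpoints would trace how syntactic structure in the attention rises over the course of training, which is the kind of trajectory the paper studies.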

Furthermore, the research establishes a causal relationship between SAS and the model’s linguistic capabilities. When SAS is temporarily suppressed early in training, the model instead adopts an alternative learning strategy; acquiring that alternative strategy before SAS allows the model to develop better representations and improves its overall performance.
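As a rough illustration of what such an intervention could look like, the sketch below adds a penalty on attention mass assigned to syntactically related tokens to the masked-language-modeling loss for an initial window of training steps, then switches it off. This is a hedged sketch under stated assumptions, not the paper’s training code; the function name regularized_loss, the syntax_mask input, and the suppress_steps and lam values are hypothetical.

```python
# Schematic sketch (assumptions, not the paper's implementation) of an
# early-training intervention that discourages SAS, then lets training proceed.
import torch

def regularized_loss(mlm_loss: torch.Tensor,
                     attentions: torch.Tensor,    # [layers, heads, seq, seq]
                     syntax_mask: torch.Tensor,   # [seq, seq], 1.0 on gold arcs
                     step: int,
                     suppress_steps: int = 3000,  # hypothetical schedule
                     lam: float = 1.0) -> torch.Tensor:  # hypothetical strength
    """MLM loss plus a temporary penalty on syntactically aligned attention."""
    if step >= suppress_steps:
        return mlm_loss                           # intervention switched off
    # Total attention mass any head places along gold syntactic arcs.
    syntactic_attention = (attentions * syntax_mask).sum()
    return mlm_loss + lam * syntactic_attention   # discourage SAS early on
```

Flipping the sign of the penalty would instead encourage attention along syntactic arcs, offering a way to probe what happens when SAS is promoted rather than suppressed.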

Chen also draws parallels between her findings and the information bottleneck theory, which divides training into a memorization phase and a compression phase. Her research indicates that as SAS develops, the model’s complexity increases, reflecting the memorization phase. This is followed by a decrease in complexity, representing the compression phase when syntax is acquired.
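For background, the standard information bottleneck objective (a general formulation, not an equation from this paper) seeks a representation $T$ of the input $X$ that remains predictive of the target $Y$ while discarding as much about $X$ as possible:

\[
\min_{p(t \mid x)} \; I(X; T) \; - \; \beta \, I(T; Y)
\]

In this picture, growth in $I(X;T)$ corresponds to the memorization phase and its later decline to the compression phase, mirroring the rise and fall in model complexity that Chen observes around the acquisition of syntax.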

There are multiple motivations for improving interpretability. While developing safe systems is a concern, Chen believes that, even more fundamentally, rigorous scientific interpretability is crucial for the advancement of deep learning in general. “Understanding the evolution of models during the course of training is an important part of the science of deep learning,” she stated, emphasizing the need for a thorough understanding of models to improve future iterations.

Chen’s research was conducted in collaboration with Kyunghyun Cho, CDS Professor of Computer Science and Data Science, who served in an advisory role; Matthew L. Leavitt, who advised the project and provided insights on training dynamics; CDS Faculty Fellow Ravid Shwartz-Ziv, who contributed his knowledge of information theory; and Naomi Saphra, who advised and contributed valuable knowledge of past work on training dynamics, interpretability, and the science of deep learning.

The project began with a conversation between Chen and Saphra, a postdoc in the CILVR group working with Cho, in which the two discovered that they had both independently observed the SAS pattern emerging early in training. Saphra’s prior work on the MultiBERTs paper, which studied BERT throughout its training, was unusual in this respect: most companies release only fully trained models, but the MultiBERTs research provided access to intermediate checkpoints, making it possible to discover such temporal patterns. (Saphra also wrote a blog post related to “Sudden Drops in the Loss,” called “The Parable of the Prinia’s Egg: An Allegory for AI Science,” emphasizing the importance of understanding the developmental process of models rather than focusing solely on the end result.)

Angelica Chen and her co-authors’ research offers a fresh perspective on the world of interpretability, emphasizing the importance of understanding the entire learning journey of machine learning models. As the field of data science continues to evolve, such rigorous and in-depth interpretability research will pave the way for a deeper, more thorough understanding of the machine learning models reshaping today’s world.

By Stephen Thomas

