I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation

Chandra Bhagavatula
Published in AI2 Blog · Jul 11, 2023


Paper | Webpage

An illustration of an elephant and a mouse on a see-saw, where the mouse is weighing down its side of the see-saw.

Massive scale has enabled mind-blowing capabilities in modern language models. But it raises an intriguing question: Is scale the only way?

In this work, we investigate whether smaller models, which are far more accessible and efficient, can compete with the largest models available today.

Our encouraging results show that scale is not the only way. Smaller models can outperform 100X larger models if they are empowered with novel distillation, constrained decoding, and self-imitation learning algorithms. So, how does this work?

Overview

Overview of the I2D2 Framework: I2D2 radically improves the quality of GPT2-XL generations through multiple iterations of constrained decoding and self-imitation.

Smaller LMs are known to produce generations of much lower quality than their larger counterparts. Our I2D2 framework addresses this challenge through two key innovations:

  1. First, we perform constrained generation through NeuroLogic decoding. This yields modest improvements in generation quality. Further improvements are unlocked by a small critic model that filters out low-quality generations.
  2. Then, in the self-imitation step, the language model is fine-tuned on its own high-quality generations obtained after critic filtering.

Moreover, these steps can be applied iteratively to keep improving the smaller LM's performance, as sketched below.
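To make the loop concrete, here is a minimal sketch of one I2D2 iteration. The helpers `generate`, `critic_score`, and `fine_tune` are hypothetical stand-ins for NeuroLogic decoding, the critic model, and supervised fine-tuning; this is a sketch of the overall structure, not the paper's actual implementation.

```python
from typing import Callable, List

def i2d2_iteration(
    generate: Callable[[str], List[str]],    # stand-in for NeuroLogic-constrained decoding
    critic_score: Callable[[str], float],    # stand-in for the small critic model
    fine_tune: Callable[[List[str]], None],  # stand-in for fine-tuning the LM
    prompts: List[str],
    threshold: float = 0.5,
) -> List[str]:
    """One I2D2 round: constrained generation -> critic filtering -> self-imitation."""
    # 1. Generate candidate statements under lexical constraints.
    candidates = [g for p in prompts for g in generate(p)]
    # 2. Keep only the generations the critic judges to be high quality.
    accepted = [c for c in candidates if critic_score(c) >= threshold]
    # 3. Self-imitation: fine-tune the language model on its own filtered generations.
    fine_tune(accepted)
    return accepted

# Repeating the loop means each round is trained on better data than the last:
# for _ in range(num_iterations):
#     corpus = i2d2_iteration(generate, critic_score, fine_tune, prompts)
```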

Specifically, in our work, we apply I2D2 to the task of generating commonsense knowledge about everyday concepts. I2D2 is able to generate a high-quality corpus of generic commonsense knowledge.

Unlike other recent work, I2D2 does not rely on GPT-3 generations, which are commonly used in knowledge distillation, to improve its performance.
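As a toy illustration of the task, the snippet below shows the kind of input and output involved. The prompt wording and constraint format are assumptions made for illustration, not the paper's exact templates.

```python
# Illustrative only: the prompt template and constraint format below are
# assumptions, not the exact setup used in the paper.
concept = "umbrella"
prompt = f"Write a short, true statement about an {concept}. An {concept}"

# NeuroLogic decoding lets us impose lexical constraints at decode time,
# e.g. that the output mentions the concept and stays short:
constraints = {"must_include": [concept], "max_new_tokens": 15}

# A desirable generation is a generic commonsense statement such as:
# "An umbrella is used to stay dry in the rain."
```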

I2D2 generations are more accurate than GPT-3's

We compare the accuracy of generics from the static resource GenericsKB with generations from GPT-3 and I2D2. I2D2's generations are more accurate than GPT-3's, despite coming from a model that is 100X smaller.

Accuracy of generations from the static resource GenericsKB and from the generators GPT-3 and I2D2

How well can I2D2 identify true generic statements vs. false?

We use I2D2's critic model to judge whether a given commonsense statement is true or false, and compare it against using GPT-3's perplexity on the same statements. I2D2 is much better at identifying true commonsense statements.

A graph showing how I2D2 identifies true statements from false ones more accurately than GPT-3.
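For intuition, here is a rough sketch of the two scoring strategies being compared. The thresholds and scoring interfaces are illustrative assumptions, not the paper's exact setup.

```python
import math
from typing import Callable

def classify_with_critic(statement: str,
                         critic_prob: Callable[[str], float],
                         threshold: float = 0.5) -> bool:
    """I2D2-style: a trained critic outputs P(statement is true)."""
    return critic_prob(statement) >= threshold

def classify_with_perplexity(statement: str,
                             nll_per_token: Callable[[str], float],
                             max_perplexity: float = 50.0) -> bool:
    """Baseline: treat low LM perplexity as a proxy for the statement being true."""
    perplexity = math.exp(nll_per_token(statement))
    return perplexity <= max_perplexity
```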

I2D2 generations are more diverse, and the model gets better with successive iterations

Compared to GenericsKB, I2D2 generations are 10X more diverse. And diversity improves over iterations of self-imitation.

Comparing the diversity of GenericsKB and Gen-A-Tomic (the corpus of generic statements generated by I2D2); I2D2 improves over successive iterations
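As a sketch, one common proxy for corpus diversity is the ratio of distinct n-grams to total n-grams. The function below is illustrative and not necessarily the exact diversity metric reported in the paper.

```python
from typing import Iterable

def distinct_ngram_ratio(statements: Iterable[str], n: int = 3) -> float:
    """Fraction of n-grams in the corpus that are unique (higher = more diverse)."""
    total = 0
    distinct = set()
    for s in statements:
        tokens = s.lower().split()
        for i in range(len(tokens) - n + 1):
            distinct.add(tuple(tokens[i:i + n]))
            total += 1
    return len(distinct) / total if total else 0.0

# Example usage: compare samples from GenericsKB and Gen-A-Tomic.
# distinct_ngram_ratio(genericskb_sample), distinct_ngram_ratio(gen_a_tomic_sample)
```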

Our findings show that I2D2 produces more accurate and diverse generic statements than GPT-3 while using a 100X smaller model, and that critic-informed self-imitation iteratively improves the generation quality of weaker models.

Key Results and Implications of our work

Smaller, more efficient LMs have significant room for improvement. Employing novel algorithmic techniques can enable smaller LMs to be comparable to the largest LMs available today for some tasks.

Smaller LMs can also self-improve, a capability usually attributed only to LLMs.

Authors

The team members who authored this paper: Chandra Bhagavatula, Jena Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, and Yejin Choi.

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
