What are they thinking?

Do language models have coherent mental models of everyday things?

Yuling Gu
AI2 Blog

--

Why are mental models important?

The concept of mental models is not new. Many years ago, Kenneth Craik (1943) proposed that thinking itself is the manipulation of internal representations of the world. Coherent mental models have been seen as crucial to the processes that underlie human reasoning (Johnson-Laird, 2006).

Our ACL 2023 paper focuses on mental models in the context of ordinary objects we encounter in our everyday lives. Such commonsense knowledge helps us understand how these everyday things work and how to interact with them. For example, when someone sets out to fry an egg, they know that the egg has a shell and that the shell can be cracked open to reveal the egg white and yolk inside. However, if a system does not have a coherent picture of such everyday things, for example, thinking that the egg yolk surrounds the shell, then it might resort to ridiculous approaches such as trying to scrape the egg yolk off the shell into the pan.

While coherent internal representations of spatial layouts are crucial for human reasoning, their role, coherence, and even existence in language models (LMs) have not been systematically explored. In our work, we bridge this gap by proposing a benchmark dataset (ParRoT) and methodology to compare human internal representations of spatial layouts of everyday things with those of LMs.

A diagram of how humans create mental models, versus how LLMs might construct them.

Dataset on everyday things: ParRoT (Parts and Relations of Things)

Do language models have a similarly coherent picture of everyday things? To investigate this, we propose ParRoT (Parts and Relations of Things), a benchmark dataset of parts and relations of everyday things.

We compiled a list of everyday objects (like an egg, a tree, and a bicycle) and asked human subjects to sketch a mental model for each of them in the form of a graph. In such graphs (see box A in the image above), (1) each node represents a part of the object and (2) each edge shows a relationship that holds between two parts. For example, for an egg, one annotator indicated that there is a shell that surrounds the egg white, and the egg white in turn surrounds the yolk. By collecting such annotations, we construct the ParRoT dataset, which comprises 300 mental models across 100 everyday things, covering over 2,000 parts and over 11,000 relationships.
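To make the graph representation concrete, here is a minimal Python sketch of the egg example above, with parts as nodes and directed, labeled edges as relations. The triple-based encoding and names are illustrative assumptions only, not ParRoT's actual file format.

from collections import defaultdict

# Illustrative encoding of one annotated mental model as (part, relation, part)
# triples, mirroring the egg example in the text.
egg_model = [
    ("shell", "surrounds", "egg white"),
    ("egg white", "surrounds", "yolk"),
]

def build_graph(triples):
    # Nodes are parts; each directed edge carries a relation label.
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

print(dict(build_graph(egg_model)))
# {'shell': [('surrounds', 'egg white')], 'egg white': [('surrounds', 'yolk')]}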

Do LMs have similar coherent pictures of everyday things?

To find out whether LMs have coherent pictures of everyday things, we query an LM with a large number of true/false questions. For instance, we ask, “Judge whether this statement is true or false: In an egg, the shell surrounds the yolk.” From these answers, we assemble the LM’s picture of what an egg is. In this case (see box B in the image above), we get a rather confused picture of what the LM thinks the structure of an egg is, with some obvious contradictions, e.g., two parts cannot each surround the other.
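As a rough sketch of this querying step (not the paper’s exact prompt set), the code below enumerates true/false statements for every ordered pair of parts over an assumed handful of relations; ask_lm is a hypothetical placeholder for whatever LM API is used.

from itertools import permutations

RELATIONS = ["surrounds", "contains", "is above"]  # assumed subset, for illustration

def make_queries(entity, parts):
    # One true/false statement per ordered pair of parts and candidate relation.
    for p1, p2 in permutations(parts, 2):
        for rel in RELATIONS:
            statement = f"In {entity}, the {p1} {rel} the {p2}."
            yield (p1, rel, p2), (
                "Judge whether this statement is true or false: " + statement
            )

def predict_model(entity, parts, ask_lm):
    # ask_lm(prompt) -> (is_true, confidence); a stand-in for any LM API.
    return {triple: ask_lm(prompt) for triple, prompt in make_queries(entity, parts)}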

Experimenting with SOTA LMs like GPT-3 (Brown et al., 2020) and Macaw (Tafjord and Clark, 2021), we find that mental models derived from these LMs’ predictions are significantly inconsistent, with 19–43% conditional violation. Compared against the gold mental models in our ParRoT dataset, the LMs’ predictions are only 54–59% accurate, which is at best on par with the majority-class baseline (59%) and only modestly better than random chance (50%).

Can we improve LMs’ mental models of everyday things?

We propose ParRoT-Con, a neuro-symbolic method that applies constraint reasoning on top of raw LM predictions to obtain more consistent and more accurate mental models.

The input to the system is an entity and a list of its parts. The first component of ParRoT-Con sends an exhaustive list of relation queries to an LM, asking about the relationship between every pair of parts; the raw predictions form a mental model of the object that is typically an incoherent picture. The second component, constraint reasoning, then applies a reasoning layer on top of these raw predictions using commonsense constraints about the relationships, e.g., that “surrounds” is an asymmetric relationship. The result is a more coherent mental picture of the object (also see box C in the first image).
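To give a feel for the constraint-reasoning step, here is a deliberately simplified sketch that enforces just one constraint, the asymmetry of “surrounds”: whenever the raw predictions assert both directions, only the higher-confidence one is kept. The actual ParRoT-Con reasoner resolves all constraints jointly; this greedy pairwise version is for illustration only.

def enforce_asymmetry(predictions, relation="surrounds"):
    # predictions maps (part_a, relation, part_b) -> (is_true, confidence).
    # If both directions of an asymmetric relation are predicted true,
    # flip the lower-confidence one to False.
    fixed = dict(predictions)
    for (a, rel, b), (true_ab, conf_ab) in predictions.items():
        if rel != relation or not true_ab or a >= b:  # visit each pair once
            continue
        true_ba, conf_ba = predictions.get((b, rel, a), (False, 0.0))
        if true_ba:  # contradiction: drop the weaker direction
            if conf_ab >= conf_ba:
                fixed[(b, rel, a)] = (False, conf_ba)
            else:
                fixed[(a, rel, b)] = (False, conf_ab)
    return fixed

raw = {
    ("shell", "surrounds", "yolk"): (True, 0.8),
    ("yolk", "surrounds", "shell"): (True, 0.6),  # contradicts the line above
}
print(enforce_asymmetry(raw))
# the lower-confidence direction is flipped to False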

As well as removing inconsistencies, we find that ParRoT-Con also significantly improves mental model accuracy (by 16–20%).

The next steps and outlook

This work tries to bridge a gap in the field by systematically exploring the question, “Do language models have coherent mental models of everyday things?”

Our ParRoT dataset, to the best of our knowledge, is the first resource of its kind for researchers to study canonical mental models for a wide variety of everyday things, focusing on relationships between parts of objects.

ParRoT-Con, our proposed method for improving the accuracy and consistency of LMs’ mental models, combines an LM with a reasoning layer that sits on top and pulls the pieces together in a coherent fashion. This approach suggests a broader cognitive architecture (LM + reasoner) for future systems, one that constructs better mental models than the LM can alone. ParRoT-Con is also easily extensible to other applications, such as spatial relations in other domains (e.g., for geographical distances, we can similarly impose constraints on inverse relations like closer and further) and temporal relations (e.g., on a timeline, if event A occurred before event B, then event B cannot have occurred before event A); a small sketch of this follows below. We encourage other researchers to build on our work and to apply the approach introduced here to other areas.
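As a tiny example of that extensibility, the asymmetry check sketched earlier could be reused unchanged for temporal relations; the event names and confidences below are invented for illustration, and the snippet assumes the enforce_asymmetry sketch defined above.

timeline = {
    ("event A", "before", "event B"): (True, 0.9),
    ("event B", "before", "event A"): (True, 0.4),  # contradiction
}
print(enforce_asymmetry(timeline, relation="before"))
# keeps "event A before event B" and flips the lower-confidence reverse to False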

Our work also raises various follow-up research directions that we believe would be valuable to pursue. For instance, common everyday things change over the years, affecting which relationships are most prominent in an average person’s mental model. It would be interesting to use our ParRoT dataset as a point of comparison in future studies of mental models of everyday things, to reveal how humans’ mental models evolve over time.

Other important future directions include exploring how more coherent mental models can help in complex reasoning tasks about everyday things, combining these mental models of parts of everyday things with mental models along other dimensions (e.g., Gu et al., 2022a,b), and using our dataset of commonsense queries about everyday things as a source of follow-up questions for existing QA tasks, e.g., PIQA (Bisk et al., 2020) and CSQA (Talmor et al., 2019).

References

Kenneth James Williams Craik. 1943. The nature of explanation, volume 445. Cambridge University Press.

P. Johnson-Laird. 2006. How we reason. Oxford University Press.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Oyvind Tafjord and Peter Clark. 2021. General-purpose question-answering with Macaw. arXiv preprint arXiv:2109.02593.

Yuling Gu, Bhavana Dalvi, and Peter Clark. 2022a. DREAM: Improving situational QA by first elaborating the situation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1115–1127, Seattle, United States. Association for Computational Linguistics.

Yuling Gu, Yao Fu, Valentina Pyatkin, Ian Magnusson, Bhavana Dalvi Mishra, and Peter Clark. 2022b. Just-DREAM-about-it: Figurative language understanding with DREAM-FLUTE. In Proceedings of the 3rd Workshop on Figurative Language Processing (FLP), pages 84–93, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

To read more, see our paper “Do language models have coherent mental models of everyday things?”

We make our data and code publicly available at https://github.com/allenai/everyday-things.

You can also watch a presentation of the paper at https://youtu.be/cMpYOEoAVjY.

Learn more about the Aristo project and follow the team @ai2_aristo.

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.
