AI2 at ACL 2023

Highlighted work from our institute appearing at this year’s ACL conference


The 61st Annual Meeting of the Association for Computational Linguistics logo

The 61st Annual Meeting of the Association for Computational Linguistics (ACL) is the premier conference in the field of computational linguistics, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language. AI2 is thrilled to have multiple researchers represented at this year’s conference.

Highlighted Papers from AI2

Selected works involving AI2 researchers appearing at this year’s conference (* denotes AI2 affiliation):

Schematic of the three tasks formulated using over a decade of New Yorker caption contests: models must 1) recognize a caption written about a cartoon (vs. options that were not); 2) evaluate the quality of that caption by scoring it more highly than a lower quality option from the same contest; and 3) explain why the joke is funny. Cartoon by Drew Dernavich, winning caption by Bennett Ellenbogen.

Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest

Jack Hessel*, Ana Marasović, Jena D. Hwang*, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, Yejin Choi*

🏆 Best Paper Award — We are delighted that this paper was selected for a Best Paper Award for ACL 2023!

Do LLMs have a sense of humor? Researchers from The Allen Institute for AI, University of Utah, Cornell University, OpenAI, and the Paul G. Allen School of Computer Science challenge AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, they develop three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide varieties of human experience; these are the hallmarks of a New Yorker-caliber cartoon. The team investigated vision-and-language models that take as input the cartoon pixels and caption directly, as well as language-only models for which they circumvented image-processing by providing textual descriptions of the image.
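
As a rough illustration of how the first, caption-matching task can be framed, the sketch below scores candidate captions against a textual description of the cartoon and picks the highest-scoring one. The word-overlap scorer and the example cartoon are invented placeholders; in the paper, vision-and-language models see the actual pixels and language-only models see human-written descriptions.

```python
# A minimal sketch of the matching task as multiple choice over a textual
# description of the cartoon. The overlap scorer is only a runnable stand-in
# for a real model's compatibility score (e.g., a log-likelihood).
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap_score(description: str, caption: str) -> float:
    d, c = tokens(description), tokens(caption)
    return len(d & c) / max(len(c), 1)

def pick_caption(description: str, candidates: list[str]) -> str:
    # Task 1: recognize the caption actually written for this cartoon,
    # rather than captions drawn from other contests.
    return max(candidates, key=lambda cap: overlap_score(description, cap))

# Invented example for illustration only (not from the dataset):
description = "Two office workers stare at a giant hamster wheel in the corner."
candidates = [
    "We're still waiting on approval for the second wheel.",
    "My cat ate my homework again.",
]
print(pick_caption(description, candidates))
```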

Example Scenario. Carl from the U.S. and Aditya from India both want to use Perspective API, but it works better for Carl than it does for Aditya. This is because toxicity researchers’ positionalities lead them to make design choices that cause toxicity datasets, and thus Perspective API, to have positionalities that are Western-centric.

NLPositionality: Characterizing Design Biases of Datasets and Models

Sebastin Santy, Jenny Liang, Ronan Le Bras*, Katharina Reinecke, Maarten Sap*

🏆 Outstanding Paper Award — We are honored to count this paper among the Outstanding Paper Award recipients for ACL 2023!

Design biases in NLP systems, such as performance differences for different populations, often stem from their creators’ positionality, i.e., views and lived experiences shaped by identity and background. Despite the prevalence and risks of design biases, they are hard to quantify because researcher, system, and dataset positionality is often unobserved. This paper introduces NLPositionality, a framework for characterizing design biases and quantifying the positionality of NLP datasets and models. NLPositionality is applied to existing datasets and models for two tasks: social acceptability and hate speech detection. The researchers find that datasets and models align predominantly with Western, White, college-educated, and younger populations. Additionally, certain groups, such as nonbinary people and non-native English speakers, are further marginalized by datasets and models, ranking lowest in alignment across all tasks.
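
The core quantitative step can be sketched roughly as follows: compare a dataset’s or model’s labels against the average label given by annotators from each demographic group, and report a per-group alignment score. The record format, field names, and the choice of a plain Pearson correlation here are illustrative assumptions rather than the paper’s exact procedure.

```python
# Hedged sketch: per-demographic-group alignment between a dataset's or
# model's labels and crowd annotations. Requires Python 3.10+ for
# statistics.correlation (Pearson's r).
from collections import defaultdict
from statistics import correlation, mean

def group_alignment(system_labels, annotations):
    """system_labels: {item_id: numeric label from the dataset or model}
    annotations: iterable of (item_id, group, numeric label) tuples."""
    per_group = defaultdict(lambda: defaultdict(list))
    for item_id, group, label in annotations:
        per_group[group][item_id].append(label)
    alignment = {}
    for group, by_item in per_group.items():
        shared = [i for i in by_item if i in system_labels]
        if len(shared) < 2:
            continue  # correlation needs at least two shared items
        alignment[group] = correlation(
            [system_labels[i] for i in shared],
            [mean(by_item[i]) for i in shared],
        )
    return alignment  # higher values = closer alignment with that group
```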

A simple story requiring theory of mind. Note that Alice’s belief of the celery’s location differs from reality (i.e. Alice holds a false belief). Readers must reason that Alice will look for the celery where she left it, and that Bob will make that same assumption. Questions shown require different depths of mental state modeling.

Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker

Melanie Sclar, Sachin Kumar, Peter West, Alane Suhr*, Yejin Choi*, Yulia Tsvetkov

🏆 Outstanding Paper Award — We are honored to count this paper among the Outstanding Paper Award recipients for ACL 2023!

Theory of Mind (ToM), the ability to reason about the mental states of other people, is a key element of our social intelligence. Yet, despite their ever more impressive performance, large-scale neural language models still lack basic theory of mind capabilities out-of-the-box. This research posits that simply scaling up models will not imbue them with theory of mind due to the inherently symbolic and implicit nature of the phenomenon, and instead investigates an alternative: can we design a decoding-time algorithm that enhances the theory of mind of off-the-shelf neural language models without explicit supervision? The researchers present SymbolicToM, a plug-and-play approach to reason about the belief states of multiple characters in reading comprehension tasks via explicit symbolic representation.
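
A drastically simplified sketch of the underlying idea, explicit per-character (and per character-pair) belief states that are updated only for witnessed events, is shown below. The event format, witness logic, and toy story details are invented for illustration and omit the graph construction and question-answering machinery of the actual method.

```python
# Simplified belief tracking in the spirit of SymbolicToM: keep an explicit
# world state for each character and each nested pair ("what Bob thinks
# Alice believes"), updating a state only when all of its characters
# witnessed the event.

class BeliefTracker:
    def __init__(self, characters):
        self.beliefs = {}
        for c1 in characters:
            self.beliefs[(c1,)] = {}                # what c1 believes
            for c2 in characters:
                self.beliefs[(c1, c2)] = {}         # what c1 thinks c2 believes

    def observe(self, fact, value, witnesses):
        """Apply an event such as ("celery", "location") -> "cupboard"."""
        for perspective, state in self.beliefs.items():
            if all(character in witnesses for character in perspective):
                state[fact] = value

    def query(self, perspective, fact):
        return self.beliefs[perspective].get(fact)

# Toy false-belief story (containers invented for illustration):
t = BeliefTracker(["Alice", "Bob"])
t.observe(("celery", "location"), "fridge", witnesses={"Alice", "Bob"})
t.observe(("celery", "location"), "cupboard", witnesses={"Bob"})  # Alice has left
print(t.query(("Alice",), ("celery", "location")))        # fridge (false belief)
print(t.query(("Bob", "Alice"), ("celery", "location")))  # fridge: Bob models Alice
```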

Overview of HINT. (1) We feed an instruction into a HyperEncoder to produce an encoded instruction, use it to generate prefix and adapter weights, and then insert them into the underlying model. (2) We run the underlying model encoder as usual and optionally concatenate the encoded input with the previously encoded instruction, before running the underlying model decoder to generate the answer. We only use the hypernetwork once per task.

HINT: Hypernetwork Instruction Tuning for Efficient Zero-Shot Generalisation

Hamish Ivison*, Akshita Bhagia*, Yizhong Wang, Hannaneh Hajishirzi*, Matthew E. Peters*

Recent NLP models have shown the remarkable ability to effectively generalise ‘zero-shot’ to new tasks using only natural language instructions as guidance. However, many of these approaches suffer from high computational costs due to their reliance on concatenating lengthy instructions with every input example, resulting in costly reprocessing of the instruction. To avoid this, this paper introduces Hypernetworks for INstruction Tuning (HINT), which convert task instructions and examples into parameter-efficient modules inserted into an underlying model using a pretrained text encoder, eliminating the need to include instructions in the model input. The hypernetwork in HINT also produces an encoded instruction, which is concatenated with encoded inputs during decoding to further improve performance. HINT models outperform strong state-of-the-art baselines by over 10% when controlling for compute (measured in FLOPs).
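
A schematic PyTorch sketch of the central idea follows: encode the instruction once, use small heads to generate prefix and adapter parameters from that encoding, and reuse those modules (plus the encoded instruction itself) for every example of the task. The dimensions, pooling, and module shapes are illustrative choices, not the paper’s architecture.

```python
# Illustrative hypernetwork: one instruction encoding is mapped to a prefix
# and a single adapter's weights, which can then be inserted into a frozen
# underlying model and reused for all inputs of that task.
import torch
import torch.nn as nn

class InstructionHypernet(nn.Module):
    def __init__(self, d_model=512, prefix_len=8, adapter_dim=64):
        super().__init__()
        self.instruction_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Heads mapping the pooled instruction encoding to module weights.
        self.to_prefix = nn.Linear(d_model, prefix_len * d_model)
        self.to_adapter = nn.Linear(d_model, d_model * adapter_dim * 2)
        self.d_model, self.prefix_len, self.adapter_dim = d_model, prefix_len, adapter_dim

    def forward(self, instruction_embeds):            # (1, seq_len, d_model)
        enc = self.instruction_encoder(instruction_embeds)
        pooled = enc.mean(dim=1)                       # (1, d_model)
        prefix = self.to_prefix(pooled).view(self.prefix_len, self.d_model)
        down, up = self.to_adapter(pooled).chunk(2, dim=-1)
        adapter = (down.view(self.d_model, self.adapter_dim),
                   up.view(self.adapter_dim, self.d_model))
        # enc doubles as the "encoded instruction" concatenated at decode time.
        return prefix, adapter, enc

# Run the hypernetwork once per task, then reuse (prefix, adapter, enc)
# for every input of that task instead of re-encoding the instruction.
```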

Submissions to EMNLP 2021 binned by count of YES responses to the NLP Reproducibility Checklist items. The ACCEPT rate is given for each bin. Papers with more YES responses are more likely to be accepted, except those that mark YES to all checklist items, which we hypothesize contain responses which do not accurately represent the associated paper.

Reproducibility in NLP: What Have We Learned from the Checklist?

Ian H. Magnusson*, Noah A. Smith*, Jesse Dodge*

Scientific progress in NLP rests on the reproducibility of researchers’ claims. The *CL conferences created the NLP Reproducibility Checklist in 2020 to be completed by authors at submission to remind them of key information to include. Here, researchers provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, they found evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist’s introduction. Further, they showed that the acceptance rate grows for submissions with more YES responses. The 44% of submissions that gather new data are 5% less likely to be accepted than those that do not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. Only 46% of submissions claim to open-source their code, though submissions that do have an 8% higher reproducibility score relative to those that do not, the largest gap for any item. The researchers discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing submission of code and appendices one week after the deadline, and b) measuring dataset reproducibility by a checklist of data collection practices.
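
For intuition, the binning behind the figure above can be reproduced with a few lines of aggregation code; the record format here is an invented stand-in for the anonymized response data.

```python
# Sketch of the aggregation: bin submissions by their count of YES checklist
# responses and compute the acceptance rate per bin.
from collections import defaultdict

def accept_rate_by_yes_count(submissions):
    """submissions: iterable of dicts like
    {"responses": ["YES", "NO", "N/A", ...], "accepted": True}"""
    bins = defaultdict(lambda: [0, 0])          # yes_count -> [accepted, total]
    for sub in submissions:
        yes = sum(r == "YES" for r in sub["responses"])
        bins[yes][0] += sub["accepted"]
        bins[yes][1] += 1
    return {k: acc / tot for k, (acc, tot) in sorted(bins.items())}
```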

Two examples for action planning (Tandon et al., 2021) and summarization (Saunders et al., 2022) tasks showcase a scenario where initial predictions by a learned model (ŷ) are incorrect. Human-written critiques (c) indicate errors in model outputs. While humans can reliably critique each other, machines lack such ability. This paper studies a multiagent collaborative framework where one language model can generate critiques to improve its peer’s performance.

RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan*, Peter Clark*, Derry Wijaya, Niket Tandon*

Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics, while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited-access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient, as it results in multiple copies of the network. In this work, researchers introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework where the critique generator is trained to maximize the end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs.
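
At a high level, the training loop rewards the small critique generator for critiques that lead the frozen large model to better revisions. The sketch below shows that loop with stand-in callables and a plain REINFORCE-style update; the actual RL4F training setup and objective are more involved.

```python
# Schematic RL4F-style step: the small critique policy is trainable, the
# large model is frozen, and the reward is the end-task quality of the
# revision produced from the critique.

def rl4f_step(critic_policy, frozen_lm, task_metric, batch, learn):
    """critic_policy(input, draft) -> (critique, logprob)    [trainable, small]
    frozen_lm(input, draft, critique) -> revised_output       [fixed, large]
    task_metric(revised, reference) -> float in [0, 1]
    learn(logprob, advantage) -> None                          [optimizer step]"""
    rewards, logprobs = [], []
    for ex in batch:
        critique, logprob = critic_policy(ex["input"], ex["draft"])
        revised = frozen_lm(ex["input"], ex["draft"], critique)
        rewards.append(task_metric(revised, ex["reference"]))
        logprobs.append(logprob)
    baseline = sum(rewards) / len(rewards)       # simple batch baseline
    for r, lp in zip(rewards, logprobs):
        learn(lp, r - baseline)                   # reinforce critiques that help
```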

While humans appear to have coherent mental pictures of everyday things (e.g., an egg, A), researchers’ question-asking probes suggest that LMs do not (e.g., one LM answered that the egg white both surrounds and is surrounded by the shell, B). This model incoherence can be reduced by applying commonsense constraints (e.g., surrounds is asymmetric), resulting in a more coherent parts model (C).

Do language models have coherent mental models of everyday things?

Yuling Gu*, Bhavana Dalvi Mishra*, Peter Clark*

When people think of everyday things like an egg, they typically have a mental image associated with it. This allows them to correctly judge, for example, that “the yolk surrounds the shell” is a false statement. Do language models similarly have a coherent picture of such everyday things? To investigate this, researchers proposed a benchmark dataset consisting of 100 everyday things, their parts, and the relationships between these parts, expressed as 11,720 “X relation Y?” true/false questions. Using these questions as probes, they observed that state-of-the-art pre-trained language models (LMs) like GPT-3 and Macaw have fragments of knowledge about these everyday things, but do not have fully coherent “parts mental models” (54–59% accurate, 19–43% conditional constraint violation). They proposed an extension in which a constraint satisfaction layer is added on top of the LM’s raw predictions to apply commonsense constraints. As well as removing inconsistencies, they found that this also significantly improves accuracy (by 16–20%), suggesting how the incoherence of the LM’s pictures of everyday things can be significantly reduced.
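
As a toy illustration of the constraint-satisfaction idea, the sketch below repairs violations of a single commonsense constraint, the asymmetry of “surrounds”, by keeping only the more confident direction; the confidences and the greedy repair are invented for illustration, whereas the paper’s layer applies a richer set of constraints jointly.

```python
# Greedy repair of asymmetry violations in the LM's raw true/false
# judgments: if both "a surrounds b" and "b surrounds a" are predicted true,
# keep only the higher-confidence direction.

def enforce_asymmetry(beliefs):
    """beliefs: {(relation, part_a, part_b): (is_true, confidence)}"""
    repaired = dict(beliefs)
    for (rel, a, b), (truth, conf) in beliefs.items():
        if rel != "surrounds" or not truth:
            continue
        flipped = (rel, b, a)
        if flipped in beliefs and beliefs[flipped][0]:
            # Both directions predicted true: drop the less confident one.
            if beliefs[flipped][1] > conf:
                repaired[(rel, a, b)] = (False, conf)
            else:
                repaired[flipped] = (False, beliefs[flipped][1])
    return repaired

# Invented confidences for the egg example from the figure:
raw = {("surrounds", "shell", "egg white"): (True, 0.9),
       ("surrounds", "egg white", "shell"): (True, 0.6)}
print(enforce_asymmetry(raw))   # only shell-surrounds-egg-white survives
```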

Read more on the blog here.

View a presentation of this paper here.

A high-level overview of SELF-INSTRUCT. The process starts with a small seed set of tasks as the task pool. Random tasks are sampled from the task pool, and used to prompt an off-the-shelf LM to generate both new instructions and corresponding instances, followed by filtering low-quality or similar generations, and then added back to the initial repository of tasks. The resulting data can be used for the instruction tuning of the language model itself later to follow instructions better. Tasks shown in the figure are generated by GPT3.

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith*, Daniel Khashabi, Hannaneh Hajishirzi*

Large “instruction-tuned” language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. This paper introduces SELF-INSTRUCT, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. The pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying this method to the vanilla GPT3, researchers demonstrate a 33% absolute improvement over the original model on SUPER-NATURALINSTRUCTIONS, on par with the performance of InstructGPT-001, which was trained with private user data and human annotations. SELF-INSTRUCT provides an almost annotation-free method for aligning pretrained language models with instructions, and includes a large synthetic dataset to facilitate future studies on instruction tuning.
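
A condensed sketch of the bootstrapping loop follows. The generate_tasks callable is a hypothetical stand-in for prompting an off-the-shelf LM with the sampled in-context tasks, and the difflib similarity filter is a simple substitute for the ROUGE-L filtering used in the paper.

```python
# Condensed Self-Instruct-style loop: sample tasks from the pool, prompt an
# LM to produce new tasks, filter near-duplicates, and grow the pool.
import random
from difflib import SequenceMatcher

def too_similar(new_instruction, pool, threshold=0.7):
    return any(SequenceMatcher(None, new_instruction, t["instruction"]).ratio()
               > threshold for t in pool)

def self_instruct(seed_tasks, generate_tasks, rounds=3, k=8):
    """seed_tasks: [{"instruction": ..., "instances": [...]}, ...]
    generate_tasks(prompt_tasks) -> list of new task dicts (assumed LM call)."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        prompt_tasks = random.sample(pool, min(k, len(pool)))
        for task in generate_tasks(prompt_tasks):
            if task["instruction"] and not too_similar(task["instruction"], pool):
                pool.append(task)
    return pool   # used afterwards to finetune the same LM to follow instructions
```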

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.


Our mission is to contribute to humanity through high-impact AI research and engineering. We are a Seattle-based non-profit founded in 2014 by Paul G. Allen.