AI2 at ACL 2023
Highlighted work from our institute appearing at this year’s ACL conference
The 61st Annual Meeting of the Association for Computational Linguistics (ACL) is the premier conference in the field of computational linguistics, covering a broad spectrum of research areas concerned with computational approaches to natural language. AI2 is thrilled to have multiple researchers represented at this year’s conference.
Highlighted Papers from AI2
Selected works involving AI2 researchers appearing at this year’s conference (* denotes AI2 affiliation):
Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest
Jack Hessel*, Ana Marasović, Jena D. Hwang*, Lillian Lee, Jeff Da, Rowan Zellers, Robert Mankoff, Yejin Choi*
🏆 Best Paper Award — We are delighted that this paper was selected for a Best Paper Award for ACL 2023!
Do LLMs have a sense of humor? Researchers from The Allen Institute for AI, University of Utah, Cornell University, OpenAI, and the Paul G. Allen School of Computer Science challenge AI models to “demonstrate understanding” of the sophisticated multimodal humor of The New Yorker Caption Contest. Concretely, they develop three carefully circumscribed tasks for which it suffices (but is not necessary) to grasp potentially complex and unexpected relationships between image and caption, and similarly complex and unexpected allusions to the wide varieties of human experience; these are the hallmarks of a New Yorker-caliber cartoon. The team investigates vision-and-language models that take as input the cartoon pixels and caption directly, as well as language-only models for which they circumvent image processing by providing textual descriptions of the image.
NLPositionality: Characterizing Design Biases of Datasets and Models
Sebastin Santy, Jenny Liang, Ronan Le Bras*, Katharina Reinecke, Maarten Sap*
🏆 Outstanding Paper Award — We are honored to count this paper among the Outstanding Paper Award recipients for ACL 2023!
Design biases in NLP systems, such as performance differences across populations, often stem from their creators’ positionality, i.e., the views and lived experiences shaped by identity and background. Despite the prevalence and risks of design biases, they are hard to quantify because researcher, system, and dataset positionality is often unobserved. This paper introduces NLPositionality, a framework for characterizing design biases and quantifying the positionality of NLP datasets and models. NLPositionality is applied to existing datasets and models for two tasks: social acceptability and hate speech detection. The researchers find that datasets and models align predominantly with Western, White, college-educated, and younger populations. Additionally, certain groups, such as nonbinary people and non-native English speakers, are further marginalized by datasets and models, ranking least in alignment across all tasks.
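At its core, the framework asks how well each demographic group’s annotations agree with a dataset’s original labels. The sketch below is only an illustration of that idea with invented labels; the paper collects annotations from diverse demographic groups via the LabintheWild platform and measures alignment statistically, not with this simple agreement proxy:

```python
from statistics import mean

def group_alignment(group_labels, dataset_labels):
    """Fraction of items where a demographic group's label matches the
    dataset's original label (a simplified agreement proxy, not the
    paper's actual alignment statistic)."""
    return mean(g == d for g, d in zip(group_labels, dataset_labels))

# Toy example: two hypothetical groups annotating the same 5 items.
dataset = [1, 0, 1, 1, 0]
group_a = [1, 0, 1, 1, 1]   # agrees on 4 of 5 items
group_b = [0, 1, 0, 1, 0]   # agrees on 2 of 5 items

print(group_alignment(group_a, dataset))  # 0.8
print(group_alignment(group_b, dataset))  # 0.4
```

A group whose judgments the dataset tracks closely scores near 1.0; in the paper’s analysis, marginalized groups score markedly lower.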
Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker
Melanie Sclar, Sachin Kumar, Peter West, Alane Suhr*, Yejin Choi*, Yulia Tsvetkov
🏆 Outstanding Paper Award — We are honored to count this paper among the Outstanding Paper Award recipients for ACL 2023!
Theory of Mind (ToM), the ability to reason about the mental states of other people, is a key element of our social intelligence. Yet, despite their ever more impressive performance, large-scale neural language models still lack basic theory of mind capabilities out of the box. This research posits that simply scaling up models will not imbue them with theory of mind due to the inherently symbolic and implicit nature of the phenomenon, and instead investigates an alternative: can we design a decoding-time algorithm that enhances the theory of mind of off-the-shelf neural language models without explicit supervision? The researchers present SYMBOLICTOM, a plug-and-play approach to reason about the belief states of multiple characters in reading comprehension tasks via explicit symbolic representation.
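The symbolic flavor of the approach can be illustrated with a toy belief tracker. This is a drastic simplification with invented names and events; SYMBOLICTOM actually maintains nested belief graphs per character and consults the relevant one at question time:

```python
def track_beliefs(events, characters):
    """Maintain each character's belief about where objects are,
    updating a belief only for characters who witness the event
    (a drastic simplification of nested belief graphs)."""
    beliefs = {c: {} for c in characters}
    for actor, obj, loc, witnesses in events:
        for c in witnesses:
            beliefs[c][obj] = loc
    return beliefs

# A Sally-Anne-style toy story:
events = [
    ("Sally", "marble", "basket", {"Sally", "Anne"}),  # both watch
    ("Anne",  "marble", "box",    {"Anne"}),           # Sally has left
]
b = track_beliefs(events, ["Sally", "Anne"])
print(b["Sally"]["marble"])  # basket  (Sally holds a false belief)
print(b["Anne"]["marble"])   # box
```

Answering “Where does Sally think the marble is?” from Sally’s belief state, rather than from the full story, is what lets a plain language model handle false-belief questions.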
HINT: Hypernetwork Instruction Tuning for Efficient Zero-Shot Generalisation
Hamish Ivison*, Akshita Bhagia*, Yizhong Wang, Hannaneh Hajishirzi*, Matthew E. Peters*
Recent NLP models have shown the remarkable ability to effectively generalise ‘zero-shot’ to new tasks using only natural language instructions as guidance. However, many of these approaches suffer from high computational costs due to their reliance on concatenating lengthy instructions with every input example, resulting in costly reprocessing of the instruction. To avoid this, this paper introduces Hypernetworks for INstruction Tuning (HINT), which converts task instructions and examples into parameter-efficient modules inserted into an underlying model using a pretrained text encoder, eliminating the need to include instructions in the model input. The hypernetwork in HINT also produces an encoded instruction, which is concatenated with encoded inputs during decoding to further improve performance. HINT models outperform strong state-of-the-art baselines by over 10% when controlling for compute (measured in FLOPs).
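The efficiency argument is easy to see in token counts. A back-of-the-envelope sketch (all numbers invented) comparing instruction concatenation against encoding the instruction once:

```python
def tokens_processed_concat(instr_len, input_lens):
    # Baseline: the instruction is concatenated with every input,
    # so its tokens are re-encoded once per example.
    return sum(instr_len + n for n in input_lens)

def tokens_processed_hint(instr_len, input_lens):
    # HINT-style: the instruction is encoded a single time (the
    # hypernetwork turns it into inserted modules), after which
    # only the inputs themselves are encoded.
    return instr_len + sum(input_lens)

inputs = [30] * 100  # 100 examples of ~30 tokens each (invented)
print(tokens_processed_concat(500, inputs))  # 53000 tokens processed
print(tokens_processed_hint(500, inputs))    # 3500 tokens processed
```

With a 500-token instruction, amortizing its encoding over 100 examples cuts the processed-token count by over 15x in this toy accounting, which is the intuition behind the FLOPs savings.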
Reproducibility in NLP: What Have We Learned from the Checklist?
Ian H. Magnusson*, Noah A. Smith*, Jesse Dodge*
Scientific progress in NLP rests on the reproducibility of researchers’ claims. The *CL conferences created the NLP Reproducibility Checklist in 2020, to be completed by authors at submission time as a reminder of key information to include. Here, researchers provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, they found evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist’s introduction. Further, they showed that acceptance rate grows for submissions with more “yes” responses. The 44% of submissions that gather new data are 5% less likely to be accepted than those that do not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. Only 46% of submissions claim to open-source their code, though submissions that do have an 8% higher reproducibility score relative to those that do not, the largest effect for any checklist item. The researchers discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing authors to submit code and appendices one week after the deadline, and b) measuring dataset reproducibility with a checklist of data collection practices.
RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs
Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan*, Peter Clark*, Derry Wijaya, Niket Tandon*
Despite their unprecedented success, even the largest language models make mistakes. Similar to how humans learn and improve using feedback, previous work proposed providing language models with natural language feedback to guide them in repairing their outputs. Because human-generated critiques are expensive to obtain, researchers have devised learned critique generators in lieu of human critics, while assuming one can train downstream models to utilize generated feedback. However, this approach does not apply to black-box or limited-access models such as ChatGPT, as they cannot be fine-tuned. Moreover, in the era of large general-purpose language agents, fine-tuning is neither computationally nor spatially efficient, as it results in multiple copies of the network. In this work, researchers introduce RL4F (Reinforcement Learning for Feedback), a multi-agent collaborative framework in which the critique generator is trained to maximize the end-task performance of GPT-3, a fixed model more than 200 times its size. RL4F produces critiques that help GPT-3 revise its outputs.
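The training signal for the critique generator can be pictured as a reward tied to how much a revision improves the end task. This is an illustrative reduction only; the paper optimizes a policy with reinforcement learning, and the `exact_match` scorer below is a hypothetical stand-in for a real task metric:

```python
def critique_reward(task_metric, output_before, output_after, target):
    # RL4F-style reward signal: the improvement in end-task score after
    # the fixed downstream model revises its output using the critique.
    return task_metric(output_after, target) - task_metric(output_before, target)

def exact_match(pred, target):
    # Stand-in scorer for illustration only.
    return 1.0 if pred.strip() == target.strip() else 0.0

r = critique_reward(exact_match,
                    "Paris is in Germany",   # output before the critique
                    "Paris is in France",    # revised output
                    "Paris is in France")    # reference answer
print(r)  # 1.0 -- the critique led to a fully corrected output
```

Because the reward depends only on the downstream model’s outputs, the downstream model itself can stay frozen, which is what makes the approach applicable to limited-access models.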
Do language models have coherent mental models of everyday things?
Yuling Gu*, Bhavana Dalvi Mishra*, Peter Clark*
When people think of everyday things like an egg, they typically have a mental image associated with it. This allows them to correctly judge, for example, that “the yolk surrounds the shell” is a false statement. Do language models similarly have a coherent picture of such everyday things? To investigate this, the researchers propose a benchmark dataset consisting of 100 everyday things, their parts, and the relationships between these parts, expressed as 11,720 “X relation Y?” true/false questions. Using these questions as probes, they observe that state-of-the-art pre-trained language models (LMs) like GPT-3 and Macaw have fragments of knowledge about these everyday things, but do not have fully coherent “parts mental models” (54–59% accurate, 19–43% conditional constraint violation). They propose an extension that adds a constraint satisfaction layer on top of the LM’s raw predictions to apply commonsense constraints. As well as removing inconsistencies, they find that this also significantly improves accuracy (by 16–20%), suggesting that the incoherence of an LM’s picture of everyday things can be significantly reduced.
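A minimal sketch of what a constraint layer over raw LM scores might look like, using the egg example from above. The data is invented and the greedy pairwise repair is only illustrative; the paper solves the constraints with a proper constraint-satisfaction formulation rather than this shortcut:

```python
def apply_constraints(preds, antisymmetric_pairs):
    """preds: {(x, rel, y): probability_true} raw LM scores.
    For an antisymmetric relation like 'surrounds', (x, y) and (y, x)
    cannot both be true; keep only the higher-confidence prediction."""
    out = {q: p >= 0.5 for q, p in preds.items()}
    for q1, q2 in antisymmetric_pairs:
        if out[q1] and out[q2]:
            # Flip the less confident prediction to restore coherence.
            weaker = q1 if preds[q1] < preds[q2] else q2
            out[weaker] = False
    return out

preds = {
    ("shell", "surrounds", "yolk"): 0.9,
    ("yolk", "surrounds", "shell"): 0.6,   # incoherent with the above
}
pairs = [(("shell", "surrounds", "yolk"),
          ("yolk", "surrounds", "shell"))]
print(apply_constraints(preds, pairs))
```

The raw scores answer “true” to both directions of “surrounds”; the constraint layer keeps the confident one and rejects its contradiction, which is how inconsistency removal can also raise accuracy.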
Read more on the blog here.
View a presentation of this paper here.
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith*, Daniel Khashabi, Hannaneh Hajishirzi*
Large “instruction-tuned” language models (i.e., models finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, hindering the generality of the tuned model. This paper introduces SELF-INSTRUCT, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. The pipeline generates instruction, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying this method to vanilla GPT-3, researchers demonstrate a 33% absolute improvement over the original model on SUPER-NATURALINSTRUCTIONS, on par with the performance of InstructGPT001, which was trained with private user data and human annotations. SELF-INSTRUCT provides an almost annotation-free method for aligning pretrained language models with instructions, and includes a large synthetic dataset to facilitate future studies on instruction tuning.
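The filtering step of the pipeline can be sketched in a few lines. The paper filters generated instructions by ROUGE-L overlap with the existing pool; `SequenceMatcher` is used below purely as a stand-in similarity measure, and the example strings and threshold are invented:

```python
from difflib import SequenceMatcher

def too_similar(candidate, pool, threshold=0.7):
    # Stand-in for the paper's ROUGE-L overlap filter: drop a generated
    # instruction if it closely matches one already in the pool.
    return any(SequenceMatcher(None, candidate, kept).ratio() > threshold
               for kept in pool)

def self_instruct_filter(generated, seed_pool):
    # Grow the instruction pool, keeping only novel, non-empty samples.
    pool = list(seed_pool)
    for instr in generated:
        if instr and not too_similar(instr, pool):
            pool.append(instr)
    return pool

seed = ["Translate the sentence into French."]
generated = [
    "Translate the sentence into French.",      # duplicate: filtered
    "Summarize the article in one sentence.",   # novel: kept
    "",                                          # invalid: filtered
]
print(self_instruct_filter(generated, seed))
```

Each surviving instruction (with its generated input/output pair) joins the pool, and the growing pool both seeds the next round of generation and forms the finetuning set.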
Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.