
Systematic Reviews in NLP

**Updated 7-Feb-24: Link added for Balloccu et al**

Over the past year I have on several occasions encouraged NLP researchers to do systematic reviews of the research literature. I describe the concept below; I think it is a very useful tool in many contexts!

What is a systematic review?

In AI and NLP, most literature surveys are like “previous work” sections in papers: the authors review related work which they believe is important and relevant to what they are trying to do. Such surveys can be very useful. For example, I really liked Gehrmann et al’s survey of NLG evaluation (https://www.jair.org/index.php/jair/article/view/13715), and have recommended it to other people on many occasions.

So conventional surveys can be very useful, but they are inherently subjective and hence not repeatable. In other words, if I were to write a survey of NLG evaluation, it would be very different from Gehrmann et al’s.

Systematic reviews are a methodology for doing surveys in a more objective and repeatable fashion, and usually focus on gathering what is known about specific research questions. They are most prominent in medicine, where meta-analyses are widely used to integrate findings from multiple research papers, for example on the effectiveness of a medical intervention. They are based on clear specifications of the following:

  • Which papers are considered: this is usually specified as a (repeatable) search criterion, for example papers in the ACL Anthology whose title includes “NLG” and “Evaluation”.
  • Inclusion and exclusion criteria: screening criteria, for example that papers have to present new results and cannot be position papers.
  • Data extracted: what information we extract from each paper, for example which evaluation techniques were used (a minimal code sketch of the search-and-screening steps follows this list).
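
To make this concrete, here is a minimal Python sketch of what the search-and-screening steps might look like, assuming the Anthology metadata has already been exported as a list of title strings. The search phrases, helper names, and example titles are my own illustrative assumptions, not taken from any published review.

```python
# A minimal sketch of a repeatable paper-selection pipeline:
# a title-based search followed by a simple screening check.

# Each phrase is a set of words that must all appear in the title
SEARCH_PHRASES = [("nlg", "evaluation"), ("correlation", "human")]

def matches_search(title: str) -> bool:
    """Retrieve a paper if its title contains every word of some phrase."""
    lowered = title.lower()
    return any(all(word in lowered for word in phrase)
               for phrase in SEARCH_PHRASES)

def passes_screening(title: str) -> bool:
    """Stand-in for inclusion/exclusion checks; real screening requires
    reading the paper (e.g. to reject position papers)."""
    return "position paper" not in title.lower()

titles = [
    "Structured Evaluation of NLG Systems",  # retrieved and included
    "A Position Paper on NLG Evaluation",    # retrieved but excluded
    "Neural Parsing with Transformers",      # not retrieved
]
selected = [t for t in titles if matches_search(t) and passes_screening(t)]
print(selected)  # ['Structured Evaluation of NLG Systems']
```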

Medical systematic reviews usually follow the PRISMA methodology; I recommend that AI/NLP researchers do likewise.

Example: Validity of BLEU

To give a concrete example, in 2018 I published a structured (systematic) review of how well BLEU scores correlate with human evaluations (paper). This was done as follows:

  • Which papers are considered: all papers present in the ACL Anthology in June 2017, whose title included one of 15 phrases (such as “correlation human”).
  • Inclusion and exclusion criteria: The main criterion was that papers had to present correlations of a standard version of BLEU with human evaluations. Other criteria included being written in English, measuring correlation on at least five items, and not re-presenting an experiment which had been described in a previous Anthology paper.
  • Data extracted: task being evaluated (eg, NLG or MT), BLEU details (eg, where reference texts came from), human evaluation details (eg, participants), and results (eg, correlations). A sketch of what such an extraction record might look like follows this list.
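
For illustration, the data-extraction step can be thought of as filling in one structured record per paper. Below is a hypothetical Python sketch of such a record; the field names loosely follow the categories above and are my own invention, not the actual coding scheme used in the paper.

```python
# A hypothetical record type for the data-extraction step. The field
# names are illustrative assumptions, not the paper's actual scheme.
from dataclasses import dataclass

@dataclass
class ExtractedData:
    anthology_id: str      # which paper the data came from
    task: str              # e.g. "MT" or "NLG"
    reference_source: str  # where the BLEU reference texts came from
    human_eval: str        # e.g. participant type ("crowdworkers", "experts")
    correlation: float     # reported human-BLEU correlation

# Made-up example values, purely for illustration
row = ExtractedData("P16-1234", "MT", "professional translators",
                    "crowdworkers", 0.62)
print(row)
```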

Having done the survey, I presented the correlations as box plots which showed the distribution of human-BLEU correlations in various contexts. I also summarised key insights from the other data I collected.
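
For readers who want to produce similar plots, here is a minimal matplotlib sketch of this presentation step. All of the numbers below are made up for illustration; see the paper for the real distributions.

```python
# A minimal sketch: box plots of extracted human-BLEU correlations,
# grouped by task. The values are fabricated, purely illustrative.
import matplotlib.pyplot as plt

correlations = {
    "MT":  [0.60, 0.45, 0.70, 0.55, 0.80],
    "NLG": [0.10, 0.30, -0.20, 0.25, 0.40],
}
plt.boxplot(list(correlations.values()))
plt.xticks([1, 2], list(correlations.keys()))
plt.ylabel("Human-BLEU correlation")
plt.title("Reported correlations by task (illustrative data)")
plt.show()
```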

Please see the paper for more details. It is relatively short, and a concrete and fairly simple example of a systematic review in NLP.

Other examples

Howcroft et al used a systematic review to analyse a number of aspects of how NLG systems had been evaluated, including which quality criteria were used (eg, accuracy). Amongst other things, they found that there was a huge range of different quality criteria, many of which seemed to mean similar things.

Valizadeh and Parde used a systematic review to analyse task-oriented dialogue systems in healthcare, recording the area of medicine, use case, and user (eg, doctor or patient). Amongst other things, they found that most of the papers did not evaluate usability and user experience, which of course are essential for success.

Balloccu et al used a systematic review to analyse how common data contamination problems are in recent papers about GPT models; they also assessed the frequency of some other common evaluation and reproducibility problems. Amongst other things, they found that many datasets were almost completely leaked to the GPT models.

Of course systematic reviews can be used to investigate many other issues! My PhD students, for example, are working on systematic reviews on reproducibility issues and on how NLP is being used in specific application domains.

We can also use the ideas from systematic reviews in other contexts. For example, in the ReproHum project on the reproducibility of human evaluations, we selected candidate papers to reproduce in part using a PRISMA-like process (paper).

Final Thoughts

Conventional literature surveys can of course be very useful! But they are inherently biased, in the sense that authors describe papers they are familiar with. This is probably fine in a context where the survey is giving a broad-brush qualitative review of an area.

However, if the goal of the survey is to answer a scientific research question in a repeatable manner, then it makes much more sense to perform a systematic review. I am seeing more systematic reviews now than I did back in 2018, which is great, but I still find that many NLP researchers are not familiar with the concept. This is a shame, since in many cases it is a very useful research methodology.
