
A bad way to measure hallucination

The Human evaluation workshop (website) included a shared task where participants repeated a published experiment. Some experiments were reproduced well, and others were not. The experiment with the worst replicability (i.e., the highest “mean CV*” in Table 4 of the shared task summary paper) attempted to measure hallucination by asking Turkers to read sentences and count the number of errors in them. It is described on page 520 of Puduppully and Lapata 2021 (note that this methodology has been used in many papers, not just this one!).

You can see an example of the task and user interface in Figure 1 of Gonzalez-Corbelle et al (2023) (one of the two papers at the workshop which replicated this study). Essentially, Turkers are shown data about a basketball game and then shown sentences extracted from a sports story about the game (either a corpus text or a text generated by an NLG system).

For example, the Turkers were asked how many correct and incorrect facts are in the following sentence:

The Warriors ( 46 – 9 ) were able to pull away in the end , however , as they outscored the Nuggets ( 25 – 30 ) by a 42 – 25 margin over the final 12 minutes .

Example sentence from Figure 1 of Gonzalez-Corbelle et al (2023)

I will focus here on the counts of incorrect facts (i.e., hallucinations), since measuring hallucination is a very important evaluation task. The above example sentence in fact contains three errors:

  • The score in the final 12 minutes was 25-25, not 42-25
  • The Warriors did not outscore the Nuggets in this period
  • The Warriors did not “pull away”, in fact they lost the game
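The first two of these errors can, in principle, be checked mechanically against the game data the Turkers were shown. Below is a minimal sketch of such a check; the box-score structure and field names (e.g. q4_points) are hypothetical stand-ins rather than the actual Rotowire schema, and checking the “pull away” error would additionally need the final score.

```python
# Minimal sketch: checking the sentence's fourth-quarter claims against
# (hypothetical) box-score data. Field names and values are illustrative
# stand-ins, not the actual Rotowire schema.

box_score = {
    "Warriors": {"q4_points": 25},   # assumed actual value, per the error list above
    "Nuggets":  {"q4_points": 25},
}

claimed_q4 = {"Warriors": 42, "Nuggets": 25}   # what the sentence claims

errors = []
for team, claimed in claimed_q4.items():
    actual = box_score[team]["q4_points"]
    if claimed != actual:
        errors.append(f"Number error: {team} scored {actual} in Q4, not {claimed}")

# "outscored" only holds if the Warriors' actual Q4 total exceeds the Nuggets'
if box_score["Warriors"]["q4_points"] <= box_score["Nuggets"]["q4_points"]:
    errors.append("Word error: the Warriors did not outscore the Nuggets in Q4")

print(errors)   # catches two of the three errors; "pull away" needs the final score
```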

Replication data

Two papers at the workshop, Gonzalez-Corbelle et al (2023) and Watson and Gkatzia (2023), replicated this experiment. Both papers repeated the experiment as published; Watson and Gkatzia also did an additional study in which they asked local colleagues and students (instead of Turkers) to evaluate a randomly-selected subset of the texts.

Results (number of incorrect facts in a 4-sentence extract from the sports story) are presented below for human-written corpus texts. In other words, this is the number of errors that were found in human-written stories.

Paper                          | Note                                               | Mean number of incorrect facts in a 4-sentence extract from corpus text
Puduppully and Lapata 2021     | Original paper                                     | 0.07
Gonzalez-Corbelle et al (2023) | Replication                                        | 0.66
Watson and Gkatzia (2023)      | Replication                                        | 1.525
Watson and Gkatzia (2023)      | Replication with academic evaluators (not Turkers) | 0.0625
Errors found in 4-sentence extracts from human-written sports stories

As a comparison point, Thomson et al 2023 counted the number of errors in human-written stories in the same corpus (Rotowire), using a much more rigorous methodology, and found 1.58 errors on average. This is for the entire story, not an extract. I do not know the average length (in sentences) of the corpus texts, but I believe it is around 12 sentences, which suggests that a 4-sentence extract should contain around 0.5 errors on average. Note that the mean of the scores in the table, 0.58, is fairly close to this number.
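For transparency, here is the back-of-the-envelope arithmetic behind these estimates; the figure of roughly 12 sentences per story is my guess, as noted above.

```python
# Back-of-the-envelope check of the numbers above.
# The ~12 sentences-per-story figure is an assumption, as noted in the text.

errors_per_story = 1.58      # Thomson et al 2023, errors in an entire story
sentences_per_story = 12     # assumed average story length
extract_length = 4           # sentences shown to annotators

expected_errors_per_extract = errors_per_story * extract_length / sentences_per_story
print(f"expected errors per 4-sentence extract: {expected_errors_per_extract:.2f}")  # ~0.53

replication_scores = [0.07, 0.66, 1.525, 0.0625]   # from the table above
print(f"mean of reported scores: {sum(replication_scores) / len(replication_scores):.2f}")  # ~0.58
```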

Anyway, the key point is that this human-evaluation protocol for counting errors produces extremely varied results, ranging from 0.0625 (8x smaller than our best estimate of actual error counts, derived from Thomson et al) to 1.525 (3x higher than our best estimate). In short, this is not a good way to measure errors and hallucination! We see similar results when the protocol is used to assess the output of neural NLG systems (instead of corpus texts).

My personal suspicion is that the problems with this protocol include the following:

  • Annotators may not be clear about what counts as an error. The above example, for instance, contains only one numerical error; the other two are what Thomson et al refer to as Word errors, and perhaps some Turkers did not know whether to count Word errors.
  • Annotators may not have been very careful (a known problem with Turkers), and some errors require a bit of effort to detect.
  • Asking annotators to simply count errors is less reliable than asking them to explicitly annotate each error and then counting the annotations; explicit annotation also provides better data on what went wrong (see the sketch after this list).
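To illustrate the last point, the sketch below contrasts the two designs: asking annotators for a bare count versus deriving counts from explicit annotations. The category labels and data structures here are my own illustrative choices, not the actual Thomson et al annotation schema or tooling.

```python
# Illustrative sketch of "annotate, then count" versus "just count".
# Categories and data structures are illustrative, not the Thomson et al schema.

from collections import Counter
from dataclasses import dataclass

@dataclass
class ErrorAnnotation:
    sentence_id: int
    span: str          # the text that is wrong
    category: str      # e.g. "Number", "Word"
    correction: str    # what it should have said

# Design A (count-based): annotator just reports a number per sentence.
count_based = {0: 3}   # "sentence 0 has 3 errors" - no record of what or why

# Design B (annotation-based): each error is marked explicitly, counts are derived.
annotations = [
    ErrorAnnotation(0, "42", "Number", "25"),
    ErrorAnnotation(0, "outscored", "Word", "tied"),
    ErrorAnnotation(0, "pull away in the end", "Word", "lost the game"),
]

derived_counts = Counter(a.sentence_id for a in annotations)
by_category = Counter(a.category for a in annotations)

print(dict(derived_counts))   # {0: 3} - same total, but now auditable
print(dict(by_category))      # {'Word': 2, 'Number': 1} - shows what went wrong
```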

In my humble opinion, the Thomson et al protocol (which I helped to develop) is a far better way to detect errors and hallucinations in such texts.

Where did this evaluation protocol come from?

The replicated paper, Puduppully and Lapata 2021, did not invent this protocol; they essentially adapted a human evaluation protocol which was first presented in the paper that introduced the Rotowire corpus and task, Wiseman et al 2017. Many other papers that use the Rotowire corpus have done likewise, and indeed the above-mentioned evaluation protocol seems to have become something of a standard in this space. Which is pretty depressing, since the above findings suggest that it is a bad way to measure hallucinations.

To me, this is perhaps the most worrying thing about this protocol. It was first proposed in 2017, and subsequent researchers seem to have blindly copied it despite all of its problems (and very few have switched to the better Thomson et al protocol, which was first proposed in 2020). In all honesty, I suspect that some researchers in this space simply don't care; their interest is in proposing new generation algorithms, and it is easier (not least from a leaderboard perspective) to copy existing evaluation practices, no matter how flawed they are. Which is not a good way to do meaningful scientific research!

Of course this sort of thing is not new; it's also why the BLEU metric kept being widely used long after much better alternatives became available.

Human evaluations are not the same!

In short, human evaluations are not the same! Some evaluation protocols give much more meaningful results than others, and replication studies are one technique for identifying protocols which are not robust. We need more such studies and (even more importantly) researchers need to move away from protocols which are shown to be unreliable.
