
There are many types of human evaluation!

We had the Human Evaluation workshop last week. Interesting as always, but also (as usual) the focus was on evaluations based on asking subjects to rate or rank outputs. Indeed, I’ve had a few discussions recently about human evaluation, where the people I’m talking to assume that this means ratings or rankings.

But there are many other types of human evaluation, most of which usually give more meaningful results than ratings or rankings! I thought I’d describe some of them here, using examples from Francesco Moramarco’s PhD work on evaluating a system which summarises doctor-patient consultations. As part of his (almost completed) PhD, Francesco tried several types of human evaluation, which I describe below. He also tried metric evaluation, which I won’t describe here; see Moramarco et al 2022 for details.

Task-based evaluation

We can evaluate an NLG system by measuring how it impacts users; does it help them perform a task, change their behaviour, etc? Most NLG systems have a purpose, and we can directly measure how well the system achieves its purpose. I have done a number of task-based evaluations in the past, including measuring whether an NLG system improves medical decision making (Portet et al 2009), and measuring whether a system helps people drive more safely (Braun et al 2018).

In Francesco’s case, the summarisation system was going to be used in a “human-in-the-loop” workflow, where doctors check and edit summaries before they are saved into the medical record. So Francesco measured how long it took doctors to post-edit the computer-generated summaries (Moramarco et al 2022), and how this compared to the time needed to write a summary manually; this is probably the most important real-world KPI (key performance indicator) for such a system.
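To make the timing comparison concrete, here is a minimal Python sketch of the kind of analysis this enables. The numbers are invented, and the paired design (timing the same consultations under both conditions) is an assumption of the sketch rather than a description of Francesco’s actual study; it simply shows post-editing times and write-from-scratch times being compared with a Wilcoxon signed-rank test.

```python
# Hypothetical sketch of a task-based timing comparison (invented numbers,
# not Francesco's data): each consultation is timed both when the doctor
# post-edits the generated summary and when they write one from scratch,
# and we test whether post-editing is reliably faster.
from statistics import median
from scipy.stats import wilcoxon

# Times in seconds per consultation note (illustrative only).
postedit_times = [210, 180, 250, 160, 300, 190, 220, 240, 170, 205]
scratch_times = [330, 290, 310, 280, 400, 260, 350, 320, 300, 310]

time_saved = [s - p for s, p in zip(scratch_times, postedit_times)]
stat, p_value = wilcoxon(scratch_times, postedit_times)

print(f"Median time saved per note: {median(time_saved):.0f} s")
print(f"Wilcoxon signed-rank test: statistic={stat:.1f}, p={p_value:.4f}")
```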

Annotation-based evaluation

We can also evaluate an NLG system by asking people to annotate individual errors and mistakes. Craig Thomson and I did this with errors in computer-generated sports stories, and the MQM annotation methodology has been used by a number of researchers in machine translation.

In Francesco’s case, he asked clinicians to read computer-generated summaries and annotate mistakes; his annotation scheme included generic error types such as hallucination and omission, but also domain-specific errors such as non-standard medical acronyms. He also distinguished between critical and non-critical errors, which again is important in this domain. The annotation-based evaluation gave results which correlated with the task-based evaluation mentioned above, and also provided good insights about where the system did not work well. Francesco also developed an annotation protocol based on checklists (Savkov et al 2022); this protocol had better inter-annotator agreement.
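As an illustration only, here is a small Python sketch of two analyses this kind of annotation supports: correlating per-summary error counts with the post-editing times from the task-based evaluation, and checking inter-annotator agreement with Cohen’s kappa. The error categories, labels and numbers below are invented, not Francesco’s actual protocol or results.

```python
# Hypothetical sketch (not Francesco's protocol or results): count annotated
# errors per summary, correlate the counts with post-editing times, and
# check agreement between two annotators on whether a summary contains a
# critical error.
from collections import Counter
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# (summary_id, error_type, is_critical) -- invented annotations; the error
# types loosely follow the categories mentioned above.
annotations = [
    ("s1", "hallucination", True),
    ("s1", "omission", False),
    ("s2", "omission", False),
    ("s3", "hallucination", True),
    ("s3", "omission", True),
    ("s3", "nonstandard_acronym", False),
    ("s4", "omission", False),
]

summaries = ["s1", "s2", "s3", "s4", "s5"]
errors_per_summary = Counter(sid for sid, _, _ in annotations)
error_counts = [errors_per_summary.get(s, 0) for s in summaries]  # 2, 1, 3, 1, 0
postedit_secs = [240, 180, 310, 200, 150]                         # invented

rho, p = spearmanr(error_counts, postedit_secs)
print(f"Spearman correlation (errors vs post-edit time): rho={rho:.2f}, p={p:.3f}")

# Two annotators label each summary for "contains a critical error" (1/0).
annotator_a = [1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 1, 0]
print(f"Cohen's kappa on critical-error labels: "
      f"{cohen_kappa_score(annotator_a, annotator_b):.2f}")
```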

Real-world evaluation

Finally, we can deploy a system in the real world and assess its impact in real-world usage. For example, we developed a system to help people stop smoking, and evaluated it by asking real smokers to use it and seeing whether this impacted cessation rates (Reiter et al 2003).
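For illustration, here is a hedged Python sketch of how such an outcome comparison might look; the counts are invented and this is not the analysis from Reiter et al 2003. It simply compares cessation rates between a group that received the NLG system and a control group using a chi-square test.

```python
# Hypothetical sketch of a real-world outcome comparison; counts are
# invented, not results from Reiter et al 2003.
from scipy.stats import chi2_contingency

# Rows: quit / did not quit; columns: received NLG letters / control group.
contingency = [[30, 22],
               [470, 478]]

chi2, p, dof, expected = chi2_contingency(contingency)
print(f"Cessation rates: NLG {30 / 500:.1%} vs control {22 / 500:.1%}")
print(f"Chi-square test: chi2={chi2:.2f}, p={p:.3f}")
```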

Francesco measured what happened when the note generator system was deployed and used by real doctors. Some initial findings are reported in Knoll et al 2022; these are primarily qualitative, and for example show that the system was not effective for some types of consultations (usually more complex ones). Francesco’s thesis will include more quantitative data from real-world usage.

Final Thoughts

Asking subjects for subjective rankings or ratings of output texts is not the only way to evaluate texts with people! There are many other ways to do human evaluation, and in my opinion the techniques described above (task-based, annotation-based, real-world) usually give better insights into how well a system works than subjective rankings or ratings. I strongly encourage readers to consider using such techniques if their goal is high-quality evaluation.
