
NLG texts should not upset people

** UPDATED Mar 24: Added paper link for Balloccu’s work

I gave an invited talk at the Counterspeech workshop a few weeks ago on “Generated texts should not worsen users’ emotional state”. A few people asked me for copies of my slides, but I thought I’d instead write a blog post based on my talk; I think that works better than reading a slide deck.

Problem

The core problem is that NLG systems can produce texts which are accurate and true, but which make people feel bad about themselves; this most often happens when the texts give personal information about the recipients’ health, education, behaviours, etc. Sometimes some kind of negative reaction is inevitable (eg, it’s hard not to feel depressed if you find out that an operation was not successful), but I’ve seen many cases where NLG texts were much worse in this respect than necessary. I discussed one example (from Google MedPaLM) in an earlier blog post.

Another problem I’ve seen is that texts can recommend actions which the recipient cannot perform; this can make the recipient feel bad about their self-efficacy and reduce their self-confidence. To give a personal example (based on human interaction, not NLG): around 2005 I was very stressed by dealing with my autistic son (who had many challenging behaviours at that time), and was sometimes struggling to get even basic tasks done for my family. Well-meaning friends and relatives suggested that I try some very complex and demanding interventions with my son; I did not have the capacity to do these, and being reminded of such things just made me feel depressed and inadequate.

Examples

The workshop organisers asked me to ground my talk in concrete examples from real NLG systems, so I mentioned the following.

Crying users: In the early 2000s, we developed the SkillSum system (paper) to give feedback to adults with poor literacy or numeracy. Essentially people took an online test to assess their skills, and were given feedback both on test results and also how this impacted career goals (eg, “you’ll need to improve your math skills if you want to be a plumber”). We evaluated this with 230 people, and although most subjects found the system to be useful, 2 subjects (ie, 0.9% of 230) got very upset when they read SkillSum texts; one started crying. Basically the texts were telling them upsetting news about their skills and career hopes, and the subjects told us that they wanted such news to come from a caring and compassionate person. Our commercial partners told us that they could not commercialise an NLG system that made even a small number of users really upset.

Bad news could kill someone: A few years after SkillSum, we developed Babytalk, which generates texts about sick babies in neonatal intensive care. Different texts went to doctors, nurses, parents, and relatives. Because of the above SkillSum experiences, we were very aware of emotional issues, and did our best to minimise them (paper). This worked reasonably well, but there were still problems. In particular, for the texts aimed at relatives, parents expressed concern about the impact of bad news on older and frailer people; one mother said she was concerned that bad news could trigger a heart attack in the baby’s 99-year-old great-grandmother (paper). In short, a text which is truthful and accurate could potentially trigger a heart attack and kill someone!

Don’t criticise people: Even earlier, we developed a system which generated personalised smoking cessation advice (paper). We tried it out in a large-scale clinical trial with 2500 smokers. Generally the system was not very effective. Also, in a few cases people got angry because we were criticising them for behaviour (smoking) which was legal and which they enjoyed; what business did we have criticising them if they had made an informed choice to continue smoking? One could argue that our criticism was justified on health grounds, but criticising people is unlikely to work, and I have avoided this in subsequent projects.

Lying users: Recently one of my PhD students, Francesco Moramarco, developed (with colleagues) a system which summarised doctor-patient consultations (paper) and which was deployed and used by real doctors (paper). One context where the system struggled was when users were dishonest and (for example) withheld information or exaggerated symptoms. It’s not the NLG system’s fault that they lied, but it made the system’s task much harder.

ChatGPT upsets people: Another student, Simone Balloccu, recently did an experiment (paper) where he asked ChatGPT to respond to inquiries about dietary struggles, and then asked nutritionists to evaluate the quality of these responses. In 15% of cases the experts thought the responses were unsafe, primarily because of adverse emotional impact, such as telling users to take actions which they might not have the capacity to carry out (similar to the above example with my autistic son).

Discussion

I ended my talk by discussing some general issues, which included the following.

Unaware of these issues: The first point is that I suspect many (most?) NLG researchers and developers are not aware of these issues. We tend to be highly educated, well paid, and self-confident, and may not have much personal experience interacting with people who cannot add, people with very stressful lives, etc. Indeed, one person commented to me after my talk that this was especially true of many LLM developers, who are paid huge amounts of money and effectively live in bubbles.

Training data: LLMs of course are trained on the Internet, and the Internet contains much more health-related content in forums (eg Reddit) than in carefully designed and vetted health websites. In Simone’s experiments, some of the nutritionists who evaluated ChatGPT output commented that the texts followed an Internet perspective and style rather than best-practice health guidelines, which was especially problematic for deprived people or people from unusual backgrounds.

Ethics: Doing research in this area raises research ethics concerns, since it may involve upsetting subjects. Many of the experiments I described above are old, and it would be difficult to do some of them in 2023 because of stricter research ethics controls (which are a good idea, I’m not complaining!).

Final Thoughts

Most discussions of AI safety which I have seen focus on hallucinations, offensive language, and dangerous content (eg, bomb-making instructions). And these are very important problems and challenges! But I think upsetting people is also a real risk in many contexts, and I encourage researchers and developers to take this risk seriously.
