
My MSc students evaluate chatGPT

Aberdeen University runs a one-year MSc course on AI. Students spend the last three months of the course doing a project, and this year 4 students did projects with me on evaluating chatGPT (and other LLMs) in different use cases. All of this was done with human subjects. Below I summarise what they did, and what I thought were especially interesting findings and insights. I should say that I focus on insights rather than numbers below, but all of the students’ reports contain extensive quantitative evaluation data.

Developer’s assistant

Kwopnan Dikwal looked at using chatGPT as an assistant to software developers. The main part of his work was an experiment where he asked developers (different skill levels and backgrounds) to do development tasks with and without chatGPT; these tasks included testing and documentation as well as coding.

Of course such results need to be interpreted cautiously, because the effectiveness of a development tool depends heavily on UI, IDE integration, workflow, and users’ familiarity with the tool. With that caveat in mind, Dikwal found that chatGPT was overall helpful and useful in most cases, especially with coding tasks (it was less useful for testing and documentation tasks).

However, an important caveat is that when chatGPT produced buggy code, the developers found it much harder and more time-consuming to debug chatGPT’s code than their own. This is not surprising, since most developers find it much harder to debug other people’s code than code they wrote themselves.

Dikwal’s findings made me wonder what would happen in an agile development workflow, where software is constantly being updated based on client feedback; would developers find it difficult to regularly modify code from chatGPT?

Machine translation

Duanyan Xu looked at how well chatGPT and iFLYTEK Spark translated texts between English and Chinese (in both directions), across several text genres ranging from medical essays to poetry. He recruited 29 people to evaluate the output of the systems; evaluators in specialist domains (such as medical essays) had relevant domain expertise. He found that translation quality was generally good overall. It was better for factual content than for literary content; however, in specialist factual domains there were problems in accurately translating specialised terminology.

Xu also tried using chatGPT to evaluate translated texts; there is of course a lot of interest in using LLMs to evaluate generated texts. The correlation with Xu’s human evaluations was not great (correlation coefficient less than 0.5 for all genres). He also found that chatGPT was biased towards its own outputs; in other words, chatGPT’s evaluation of its own outputs was more positive (compared to human evaluations) than its evaluation of iFLYTEK Spark outputs.
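To illustrate the kind of check Xu ran, the sketch below computes a Pearson correlation between human and LLM ratings of the same texts. The scores here are made up purely for illustration; the real study used Xu’s human evaluations and chatGPT’s own scores.

```python
import numpy as np

# Hypothetical ratings (1-5 scale) for ten translated texts.
# In the actual study these would be the human evaluators' scores
# and chatGPT's scores for the same texts.
human_scores = np.array([4, 3, 5, 2, 4, 3, 5, 2, 3, 4])
llm_scores = np.array([5, 4, 4, 4, 5, 3, 5, 4, 4, 5])

# Pearson correlation coefficient between the two sets of ratings;
# np.corrcoef returns a 2x2 correlation matrix, so take the off-diagonal.
r = np.corrcoef(human_scores, llm_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```

With these toy numbers the correlation comes out a little above 0.5, i.e. around the level Xu found; a coefficient this weak means the LLM’s ratings only loosely track human judgements.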

I think Xu’s last finding is really important. It makes sense that an LLM would evaluate its own output “generously” compared to the output of other LLMs, but this means we need to be very careful in using LLM evaluation to compare systems.

Exam performance

Yaqing Hao examined how well chatGPT and ErnieBot did at answering questions on the Gaokao exam, which is taken in China by people who want to go to university. She adapted questions from the 2023 Gaokao exam, asked chatGPT and ErnieBot to respond, and then asked 7 subjects to score the responses, using a modified version of the official scoring system. The modified version was more detailed, and excluded irrelevant criteria such as “neat handwriting”.

Overall the systems did well. Compared to chatGPT, ErnieBot did better at argumentative essays and worse at narrative essays; it also included more cultural references such as proverbs. That said, there were problems, including hallucinations and a more general lack of persuasiveness in argumentation.

Hao also found that her subjects sometimes disagreed quite strongly about how to score the quality criteria of a text; in a few cases one subject would score a criterion of a text as fourth class (worst), while another would score the same criterion for the same text as first class (best). Hao worked with the subjects to try to refine the criteria and reduce these differences.
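One standard way to quantify this kind of disagreement is Cohen’s kappa, which corrects raw agreement between two raters for the agreement expected by chance. The sketch below uses made-up class labels purely for illustration; Hao’s study used the (modified) Gaokao scoring criteria, and I am not suggesting she computed kappa specifically.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label distribution
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical class labels (1 = first class ... 4 = fourth class)
# for eight essays, illustrating the kind of disagreement described above.
rater_1 = [1, 2, 2, 3, 1, 4, 2, 3]
rater_2 = [1, 3, 2, 3, 4, 4, 1, 2]
print(cohens_kappa(rater_1, rater_2))
```

A kappa near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance; reporting such a statistic alongside raw scores makes it easier to judge how reliable exam-style criteria are when repurposed for evaluating LLMs.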

I think this last point is really interesting. Exams (and scoring criteria) such as Gaokao are of course designed to evaluate people, not LLMs. So perhaps it’s not surprising that it’s difficult to interpret how well LLMs do at such exams…

Emotional interactions

Jidapa Soymat looked at how well chatGPT responded to emotional sentences taken from the SAD dataset, such as “I feel so alone. I don’t know what to do.”, using a prompt which asked chatGPT to respond like a therapist. Soymat worked with a psychologist who assessed 609 such responses; she also asked 38 non-specialist subjects to assess a subset of 20 responses.

The assessment included several criteria, sometimes with very different results. For example, the psychologist rated most responses as having high emotional sensitivity but low empathy. Overall appropriateness was mixed, in part because the responses were not client-centred. Given this profile, chatGPT could be used to assist therapists, but not as a stand-alone tool to support people; similar comments have of course been made in many other domains.

The psychologist also expressed worries about bias; I have heard similar concerns expressed about other patient-oriented uses of chatGPT. Because chatGPT is trained on the Internet, its training data does not, for example, contain many contributions from homeless people, prisoners, members of the UK Traveller community, and other under-represented groups. Hence it may not respond appropriately and helpfully to such people.

Final Thoughts

I think the overall message from these projects is that chatGPT and other LLMs can do amazing things, but they have limitations which impact their utility in the real-world applications discussed above. Of course the technology is getting better, but (as with any other tool) we need to carefully evaluate strengths and understand limitations.
