
I’m very worried about data contamination

I’m a strong believer in rigorous experiments, and I am really worried by the lack of attention paid to (and even awareness of) data contamination. A lot of the claims made by academics and researchers about LLMs are dubious because insufficient attention has been paid to data contamination issues.

What is data contamination?

One of the fundamental principles of machine learning is that systems should not be tested and evaluated on the data they were trained on. In the context of evaluating LLMs, which are effectively trained on large portions of the Internet, this means that LLMs should not be evaluated on the portion of the Internet used to train the system. If this happens, this is called data contamination.

For example, in the past, summarisation systems were often evaluated using the CNN/DailyMail corpus (essentially a set of highlights published alongside news stories on CNN or in the Daily Mail) and/or XSum (where the task is to generate the first sentence of a BBC news article from the remainder of the article) (blog). However, this is not a good way to evaluate modern LLMs, because their training data includes the relevant CNN, Daily Mail, and BBC articles (the corpora themselves are also on the Internet). Hence an LLM could get perfect scores on the evaluation simply by memorising the articles, which would tell us nothing about its ability to summarise new articles, which is what we actually care about in the real world.

LLMs can also learn from prompts and other user inputs. For example, if a researcher asks an LLM to evaluate the quality of a generated text with regard to a reference text, the LLM can memorise the reference text, which also counts as data contamination. Similar problems occur when examples are added to the prompt (few-shot prompting); the LLM can memorise the examples.

A further complication is that some LLMs are integrated with web search. This is very helpful for users, since it makes it easier for the LLM to give up-to-date information even if the model was trained a few years ago. But from an evaluation perspective, such models have access to the complete Internet (not just the portion they were trained on), which means that test data should not be on the Internet at all.

For a more detailed analysis of the types of data contamination, I suggest reading Sainz et al. (2023).

How widespread is data contamination?

I’ve not seen good data on how often data contamination degrades evaluation quality, in part because it’s difficult to detect (especially with closed models such as GPT) and even to define. Also, if two researchers use model M on data set X, and researcher A is super-careful about data contamination but researcher B is sloppy (e.g., puts the test data in a GitHub repo or reveals it to the LLM via prompts), then researcher A’s results still suffer from data contamination. Even though A was very conscientious, their research is still compromised by B’s actions.

Balloccu et al. (2024) use a structured survey to quantify how often data is leaked (which is a different issue from how often experiments are compromised). They surveyed 216 recent papers which used ChatGPT, and discovered that 90 of them (42%) leaked data to ChatGPT. Which is not good!!

On a qualitative level, an excellent blog from AI Snake Oil shows that the performance of ChatGPT on a coding competition is heavily inflated by the model memorising published solutions.

On a personal level, two things really scare me. One is that many researchers seem completely unaware of (or uninterested in) this problem; I see this when I read papers, and also when I review them. The other is that we don’t know the scale of the problem; I personally suspect it could be huge (with large numbers of NLP papers having flawed evaluations), not least because so many researchers don’t seem to care.

Solutions

I usually recommend the following to my students:

  • If possible, use test data which is not on the Internet. For example, if the goal is to evaluate an NLG sportswriter, we could evaluate it on game data from school matches (or indeed fantasy leagues) which have never been written up by human sportswriters. Another possibility is to use anonymised versions of confidential data which have never been publicly released.
  • Try to determine whether models have memorised your test data, e.g., by asking them to regurgitate it in some form (see the sketch after this list). This can lead to unexpected results, where data we think is “clean” turns out to be known to the model, and data we think is “leaked” is not in fact known to the model.
  • Be very careful of exposing test data to models; “clean” test data is a precious resource which must be guarded!
  • Be honest about data contamination issues when publishing papers.
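
As a rough illustration of the second recommendation above, here is a minimal sketch of a memorisation check in Python. It is only a sketch: complete() is a placeholder for whatever call you use to query your model, and the 50/50 split and the 0.5 threshold are arbitrary choices for illustration, not established practice.

```python
# Rough memorisation check: show the model the first half of each test
# reference and measure how much of the held-back second half it
# reproduces verbatim. complete() is a placeholder; swap in your own client.

from difflib import SequenceMatcher


def complete(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM and return its continuation."""
    raise NotImplementedError


def memorisation_score(reference: str, split: float = 0.5) -> float:
    """Fraction of the held-back part of `reference` that the model
    reproduces verbatim when shown only the first part."""
    cut = int(len(reference) * split)
    prefix, held_back = reference[:cut], reference[cut:]
    continuation = complete(
        "Continue this text exactly as it was originally written:\n\n" + prefix
    )
    # Longest verbatim overlap between the model's continuation and the
    # held-back text, as a fraction of the held-back text.
    match = SequenceMatcher(None, held_back, continuation).find_longest_match(
        0, len(held_back), 0, len(continuation)
    )
    return match.size / max(len(held_back), 1)


# Flag test items the model may have memorised:
# suspicious = [r for r in test_references if memorisation_score(r) > 0.5]
```

Exact regurgitation is only one signal, of course; a model can have seen a text during training without being able to reproduce it verbatim, which is one reason such checks can give surprising results.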

Many more sophisticated and detailed recommendations have been made in the literature, including Jacovi et al. (2023) and Balloccu et al. (2024). There will also be a workshop on data contamination at ACL 2024; it will be interesting to see what ideas emerge from it.

Final thoughts

Data contamination scares me. In most respects, the quality of evaluations in NLP has been slowly getting better over the past five years, which is great, but I fear that evaluation quality may have nosedived over the past year because of data contamination. Outside the research community, I worry about extravagant claims made about LLMs which are dubious because of data contamination (such as the coding claims mentioned above).

I would love to see the NLP research community take the lead on resolving this problem and enhancing evaluation quality, and then using this knowledge to alert the media and public to unjustified claims about LLMs!
