
Ten tips on doing a good evaluation

I have written a lot of papers and blogs about aspects of evaluation. In this blog I want to try to pull out a few high-level messages and “tips” for good evaluation, based on mistakes I have seen people make. This is basically a “summary” blog which refers to many of my more detailed blogs on evaluation.

A: Experimental design

An evaluation is an experiment, and as such it must be well designed. There are plenty of papers which give detailed advice about experimental design for specific types of evaluations, such as van der Lee et al.'s advice on designing experiments which solicit human rankings or ratings. Here I will focus on a few high-level generic points which apply to many types of evaluation.

1 Evaluate what is important

Many things can be measured in an evaluation, and researchers should focus on what is important scientifically or practically. For example, in NLG we can evaluate fluency, accuracy, or utility of texts (blog); we can evaluate average or worst-case quality of generated texts (blog); we can evaluate non-functional aspects such as response time; etc. What is important depends on our hypotheses and/or application context. For example, in medical contexts worst-case accuracy is extremely important: we need to know if there is even a small chance that an NLG system could produce a text which leads to someone being harmed or even killed (blog).

Unfortunately, when I look at the research literature, I see a *lot* of evaluations which focus on measuring stuff that is relatively easy to measure (e.g., average-case fluency) even in contexts where other things seem more important.
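
To make the average-versus-worst-case distinction concrete, here is a minimal sketch with made-up per-text accuracy scores (the numbers and the 0-1 scoring scale are illustrative assumptions, not from any real evaluation):

```python
# Illustrative per-text accuracy scores (made-up numbers, not from any
# real evaluation). Four texts are near-perfect; one is badly wrong.
accuracies = [0.98, 0.97, 0.99, 0.35, 0.96]

mean_acc = sum(accuracies) / len(accuracies)   # average-case view
worst_acc = min(accuracies)                    # worst-case view

print(f"average-case accuracy: {mean_acc:.2f}")  # 0.85 -- looks OK
print(f"worst-case accuracy:   {worst_acc:.2f}")  # 0.35 -- a red flag
```

A system that looks acceptable on average can still contain an output that would be dangerous in a medical setting; that is exactly what a worst-case analysis is designed to surface.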

2 Do not use obsolete evaluation techniques

There are many different evaluation techniques, which are appropriate in different circumstances. However, there is no excuse for using obsolete evaluation techniques which are known to produce inferior results. For example, no one in 2024 should be using BLEU (blog) or ROUGE (blog) metrics to evaluate texts. In human evaluation, hallucinations and errors should be measured with annotation protocols (blog), instead of asking people to count errors (blog).

I appreciate that it may not be instantly clear which evaluation techniques are obsolete, and often it's easier to just repeat what other people have done; this is one reason why obsolete techniques are still used. But researchers are responsible for keeping up to date on experimental methodology; this is part of being a good scientist.

3 Use good test data

Test data should be real data which is representative of real-world usage. Do not test a system on synthetic data (blog), do not test it on data from a different task (blog), and do not test a system on data which is completely unrepresentative (blog).

Of course getting good test data is a lot of work, and it is often much easier to just grab a dataset from Kaggle or Github without worrying about whether it is appropriate. But please keep in mind that experiments done with poor data are meaningless and not good scientific contributions.

4 Use strong baselines

It is depressing how often I read papers which make grand claims for a system after comparing it to an obsolete baseline (blog). For example, if you are working with language models, I expect to see comparisons with state-of-the-art models such as GPT4 or LLAMA2; comparisons with GPT2 and BERT (both of which I have seen recently) tell me nothing.

I do not understand this. A few people have told me that they avoid GPT4 for monetary reasons, but there are plenty of free alternatives. Also in many such cases the actual cost of using GPT4 would not be high, and in particular would be less than the cost of attending a major conference to present the results.

5 Avoid data contamination and testing on training data

One of the fundamental principles of machine learning is not to test a system on its training data. Unfortunately, this principle is widely ignored in NLP, especially with regard to data contamination (blog), i.e., testing a language model on internet data which the model was trained on.

Another problem in this space is the common practice of evaluating a model on test data, analysing where the model did poorly, updating models/prompts/parameters accordingly, and then re-doing the evaluation (blog). This effectively amounts to tuning on the test data, which is also very poor practice.
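
As an illustration of the discipline this implies, here is a minimal sketch assuming a generic supervised-style setup; the data, schema, and split ratios are all hypothetical placeholders:

```python
import random

random.seed(0)

# Hypothetical labelled examples; in practice this would be real,
# representative task data (see tip 3).
data = [{"text": f"example {i}", "label": i % 2} for i in range(1000)]
random.shuffle(data)

# Split ONCE, up front; the 80/10/10 ratios are a common but arbitrary choice.
n = len(data)
train = data[: int(0.8 * n)]
dev = data[int(0.8 * n) : int(0.9 * n)]
test = data[int(0.9 * n) :]

# Error analysis, prompt tweaks, and hyper-parameter search all use
# `dev` only. `test` is evaluated exactly once, at the very end;
# re-running it after every change amounts to tuning on the test set.
```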

6 Compute statistical significance

When analysing experimental data, it is essential in most cases to check for statistical significance. Unfortunately, I regularly see papers that make no attempt to do this, which is annoying.

Lakens provides a free online general introduction to statistical inference, and Dror et al. look specifically at statistical significance testing in NLP. I have written blogs on specific issues in statistical testing, such as using two-tailed p-values and regression to the mean.
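
For comparing two systems on a shared test set, a paired bootstrap test is one widely used option in NLP (see Dror et al. for alternatives and caveats). Below is a minimal sketch using only NumPy; the per-item scores in the usage example are made up:

```python
import numpy as np

def paired_bootstrap_p(scores_a, scores_b, n_samples=10_000, seed=0):
    """Two-tailed paired bootstrap test for a difference in mean scores.

    scores_a and scores_b are per-item scores for two systems on the
    SAME test items (so the comparison is paired).
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = diffs.mean()
    n = len(diffs)

    # Centre the differences so the null hypothesis (no difference in
    # means) holds, then resample items with replacement.
    centred = diffs - observed
    boot = np.array([centred[rng.integers(0, n, n)].mean()
                     for _ in range(n_samples)])

    # Two-tailed p-value: fraction of resampled mean differences at
    # least as extreme as the observed one.
    return float(np.mean(np.abs(boot) >= abs(observed)))

# Hypothetical per-item scores for two systems on the same eight items.
sys_a = [0.9, 0.8, 1.0, 0.7, 0.95, 0.85, 0.9, 0.6]
sys_b = [0.7, 0.75, 0.9, 0.65, 0.8, 0.8, 0.85, 0.55]
print(f"p = {paired_bootstrap_p(sys_a, sys_b):.4f}")
```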

7 Make your experiment replicable

Scientific experiments should be replicable: if I do an experiment and you repeat the same experiment, we should get similar results. Unfortunately this does not always happen in NLP (blog). In any case, to support replication researchers should release full details of their experiments, perhaps in an associated GitHub repository. It is best to set this up at the time of the experiment; it is much harder to find the information years later (blog).

Released information should include the full experimental design, including code and details of the code libraries used. For human evaluations, screenshots and subject details should be included. Raw data should also be released, unless this is prohibited by data-privacy concerns. Authors also need to respond to questions from people trying to replicate their work (see point 10 below) (blog).
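
One low-effort way to set this up at experiment time is to write a machine-readable record alongside the results. Here is a minimal sketch assuming a Python-based pipeline; the model name, file paths, and config fields are hypothetical placeholders:

```python
import json
import platform
import random
import subprocess
import sys

SEED = 42
random.seed(SEED)  # also seed numpy/torch/etc. if they are used

record = {
    "python": sys.version,
    "platform": platform.platform(),
    "seed": SEED,
    # The entries below are hypothetical placeholders:
    "model": "example-model-v1",
    "prompt_file": "prompts/v3.txt",
    "test_set": "data/test_2024.jsonl",
    # Exact versions of every installed package:
    "packages": subprocess.check_output(
        [sys.executable, "-m", "pip", "freeze"], text=True
    ).splitlines(),
}

# Commit this record (plus code and, where permitted, raw outputs)
# to the experiment's repository at the time of the run.
with open("experiment_record.json", "w") as f:
    json.dump(record, f, indent=2)
```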

B: Experimental process

8 Carefully execute and report your experiment

Experiments must be carefully executed; otherwise their results cannot be trusted. Unfortunately, we have found that many (perhaps most) NLP papers suffer from execution flaws such as code bugs and incorrectly reported results (blog; see also blog). This is depressing, to put it mildly.

There is no magic solution to this problem. Researchers need to be very careful in their experimental work, and also look for signs (such as inconsistent results) which may indicate experimental problems. For example, if you are running an experiment with ten subjects and only see data for eight of them, investigate why this is happening!
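
A cheap way to catch the kind of problem just described is an automated sanity check on the collected data before any analysis. Here is a minimal sketch, assuming responses are stored as dicts with a subject_id field (a hypothetical schema, not from the original post):

```python
# Expected subjects (ten of them, with hypothetical IDs).
EXPECTED_SUBJECTS = {f"S{i:02d}" for i in range(1, 11)}

# Hypothetical collected responses: note that S09 and S10 are absent.
responses = [
    {"subject_id": f"S{i:02d}", "item": j, "rating": 3}
    for i in range(1, 9)
    for j in range(20)
]

seen = {r["subject_id"] for r in responses}
missing = EXPECTED_SUBJECTS - seen
if missing:
    # Do not silently analyse the remaining data: find out what happened.
    raise RuntimeError(f"no data for subjects: {sorted(missing)}")
```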

9 Submit your work to peer review

Papers should be checked by a peer review process; this is perhaps the most important “quality assurance” process in scientific research (blog). The peer review process can also improve papers (blog). Of course the peer review process is not perfect and cannot detect all problems (blog), and some venues do it better than others (blog). But our goal should be to improve peer review, not discard it.

From this perspective I am concerned that I am increasingly seeing papers bypass peer review, usually by being uploaded to Arxiv. I am also concerned that many people cite older Arxiv versions of papers instead of published versions which have been improved with peer review comments (blog). Sometimes Arxiv-only publication makes sense, but it should be the exception, not the rule.

10 Respond to questions from other researchers

Authors are still responsible for scientific papers after they are published, and in particular need to answer questions, support replications, and fix mistakes (by publishing a correction). Unfortunately, our experience has been that most authors do not respond to questions about published papers (blog). Corrections are also rare in NLP (blog); I know of many papers with mistakes (which the authors are aware of) that have never been corrected.

When I was a PhD student and young researcher, I was always very excited when another researcher asked me about my research and did my best to answer, even if this meant trying to dig up details from a 5-year-old experiment. In 2024, however, I find that many PhD students and early career researchers ignore inquiries from me, perhaps because they regard papers as CV-enhancers instead of as scientific contributions. Which is very sad.

Final comments

I hope the above “tips” are easy to understand and make sense. They are not new, which makes it frustrating (at least to me) how often they are ignored…
