
Common Flaws in NLP Evaluation Experiments

The ReproHum project (where I am working with Anya Belz (PI) and Craig Thomson (RF) as well as many partner labs) is looking at the reproducibility of human evaluations in NLP. So far our findings have been pretty depressing. We discovered early on in the project that none of the papers we considered replicating had sufficient information for replicability and that only 13% of authors were willing and able to provide the missing information (paper) (blog). Once replication results started coming in, they also showed that some NLP papers used difficult-to-replicate experimental methodologies which I suspect do not give meaningful results (blog).

Our latest paper, Common Flaws in Running Human Evaluation Experiments in NLP (journal link), discusses another depressing fact: **every** paper that we tried to replicate had experimental flaws. Examples of the problems we saw include:

  • Code bugs: Showing wrong texts to Turkers, incorrect aggregation of results, code in GitHub repo different from the actual code used in the experiment (see the sketch after this list)
  • Data collection: Poor user interface design causing data-entry errors
  • Inappropriate exclusion: Unjustified removal of outliers, ad-hoc exclusion of subjects
  • Reporting errors: Numbers in paper different from experimental results
  • Ethical errors: Revealing IDs of Turkers
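
To make the "incorrect aggregation of results" bug class concrete, here is a minimal sketch of how such a bug can slip in and how a cheap consistency check catches it. This is an invented illustration, not code from any of the papers we examined; the system names, worker IDs and ratings are all made up.

```python
# Hypothetical human-evaluation results: (system, item_id, worker_id, rating).
from collections import defaultdict
from statistics import mean

ratings = [
    ("baseline", "item1", "w1", 3), ("baseline", "item1", "w2", 4),
    ("baseline", "item2", "w1", 2), ("baseline", "item2", "w2", 3),
    ("ours",     "item1", "w1", 5), ("ours",     "item1", "w2", 4),
    ("ours",     "item2", "w1", 4),
    # Bug: this rating was accidentally exported twice,
    # so "ours" has 5 rows instead of the expected 4.
    ("ours",     "item2", "w2", 5), ("ours",     "item2", "w2", 5),
]

# Naive aggregation: average every row per system. The duplicated row
# silently inflates the mean for "ours".
by_system = defaultdict(list)
for system, item, worker, rating in ratings:
    by_system[system].append(rating)
naive_means = {s: mean(vals) for s, vals in by_system.items()}
print(naive_means)

# Sanity check: every system should have exactly
# (number of items) x (number of workers) ratings.
n_items = len({item for _, item, _, _ in ratings})
n_workers = len({worker for _, _, worker, _ in ratings})
for system, vals in by_system.items():
    assert len(vals) == n_items * n_workers, (
        f"{system}: expected {n_items * n_workers} ratings, got {len(vals)}")
```

Running this raises an AssertionError for "ours", flagging the duplicated row before any inflated mean reaches the paper; simple checks of this kind are part of the good software development practices discussed below.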

These issues are not rocket science; what they show is that researchers are not careful enough in creating experimental resources, conducting experiments, analysing data, reporting results, and following best-practice ethical procedures.

If people are interested in the details and/or want to see specific examples, I suggest they read our paper, which is fairly short; we’ve tried to make it accessible and understandable. We also give practical suggestions for researchers who want to avoid the sort of problems we found in ReproHum, such as piloting UIs, following good software development practices, and automatically generating LaTeX for tables (etc) from data sets; again, this isn’t rocket science. I especially recommend our paper to students; I hope it will make them more aware of these issues, and hence both more careful experimenters and more sceptical readers.
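
As an illustration of the last suggestion, here is a minimal sketch, assuming the ratings live in a pandas DataFrame (the column names, numbers and file name are all invented), of generating the LaTeX for a results table directly from the data rather than typing the numbers in by hand:

```python
# Hypothetical example: build the LaTeX results table from the data itself.
import pandas as pd

# In practice these scores would come from the experiment's output files,
# e.g. pd.read_csv("ratings.csv"); the values below are invented.
scores = pd.DataFrame({
    "system":   ["baseline", "baseline", "ours", "ours"],
    "fluency":  [3.2, 3.6, 4.1, 4.3],
    "accuracy": [2.9, 3.1, 3.8, 4.0],
})

# Aggregate and round in one place, then emit the LaTeX table.
table = scores.groupby("system")[["fluency", "accuracy"]].mean().round(2)
latex = table.to_latex(caption="Mean human ratings per system.",
                       label="tab:human-eval")

with open("results_table.tex", "w") as f:
    f.write(latex)
```

The paper then pulls the table in with \input{results_table.tex}, so rerunning the script after any data fix updates the paper automatically, which removes one common source of reporting errors.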

Reviewing practices and experimental errors

One point I do want to discuss in this blog is that current reviewing practices in NLP do **not** detect the sort of experimental problems we found.

  • Code bugs: Cannot be checked, since reviewers usually do not have access to code (GitHub) repos when they review a paper. Even if they did, it’s unlikely that the sort of bugs we have seen could be detected unless reviewers spent much more time on reviewing than they currently do.
  • User interface problems: Very few NLP papers give enough information about UIs to enable reviewers to check these for problems, especially since the best way to check a UI is to run it, which is not possible without access to the relevant code or apps.
  • Inappropriate statistical decisions, such as exclusions: These perhaps could be detected at review time, if sufficient information is provided. Relevant details are often omitted, however, and when they are given they may be buried in the paper and easy for reviewers to miss.
  • Reporting errors: Similar to code errors, these are impossible to detect without access to detailed data files. Some submissions do include data files as supplementary material, but even here reviewers usually do not have sufficient time to understand the data and check that it is accurately reported.
  • Ethical errors: These are hopefully becoming easier to check since many venues insist on ethical information, although again detailed information (e.g., on exactly who had access to personal data) is often not present in submissions.

The fact that reviewers do not detect such problems may in some cases lessen the urgency of fixing them. In other words, if addressing the above problems takes time and does not increase the chance of acceptance, some (not all!) researchers may decide it is not worth doing.

In all honesty it’s hard to see the big xACL conferences doing much about the above problems; they have too many papers to review in a tight timeframe. However, I think journals such as Computational Linguistics and TACL could adjust their reviewing procedures to check some of the above. I suspect the best approach would be to get specialists to check UIs, stats, ethics, etc. Medical journals, for example, often get specialists to review stats; perhaps NLP journals could do likewise?

Reviewing in medicine also has problems, however. Carlisle (2020) analysed papers submitted to a medical journal in order to identify worthless “zombie” papers. When he just read the papers, he classified 1% as zombies. However, when he had access to detailed experimental data, he discovered that **26%** were worthless zombie papers. In all honesty, I suspect that a careful analysis of NLP papers (looking at code, data, etc.) would find at least this proportion (quite likely more) of fatally flawed experiments.

Anyway, another approach is to flag papers as problematic after they are published, if readers raise issues (including data and code not being available) which the authors cannot address (blog). Top medical journals do this, via paper-specific discussion boards which are monitored by journal staff. I don’t know of any NLP venue which supports post-publication discussion boards. I guess readers are meant to contact the authors directly, but this doesn’t do much good if only 13% respond with the requested information. Again, this is an area where I think NLP journals could take the lead.

Final thoughts

If NLP is a science, then it must be based on carefully designed and executed experiments! Unfortunately, our work suggests that many NLP papers suffer from experimental flaws which are generally not detected by current reviewing practices. We only looked at human evaluations, but I suspect the problem may be just as bad with metric evaluations (e.g., see Arvan et al. (2022)).

Fundamentally I believe that this is a problem with NLP’s research culture, which does not place enough emphasis on carefully executed and reported experiments. It is of course hard and time-consuming to change culture, but not impossible. One first step would be to start changing reviewing practices, at least in journals (as mentioned above); this would send a clear signal to the community about the importance of high-quality experiments.

Reference

C Thomson, E Reiter, A Belz (2024). Common Flaws in Running Human Evaluation Experiments in NLP. Computational Linguistics. DOI 10.1162/coli_a_00508
