
Evaluation: Plan ahead, details matter, keep it simple, pilot, be careful

A PhD student recently asked me for advice about evaluation; he was looking for something shorter than a 20-page “best practices” paper. I tried to help, but afterwards I asked myself whether I could give useful advice in just 10 words. I came up with “Plan ahead, details matter, keep it simple, pilot, be careful”. I appreciate that this is very high level and more about process than about the details of evaluation protocols, but I think there is something useful here.

Plan ahead

Plan your evaluation in advance! Your plan should include:

  • (human evaluation only) how you will recruit subjects; e.g., Prolific with filters XXX.
  • what material will be evaluated; e.g., outputs of the MyAmazing and BoringBaseline NLG systems, on scenarios YYY.
  • how material is assessed; e.g., BLEURT scores for metric evaluation, or 7-point Likert ratings on Accuracy and Fluency for human evaluation.
  • what statistical significance tests you will perform on the data; e.g., a t-test comparing Likert ratings of texts from the two systems (see the sketch after this list).
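Planning the analysis in advance can be as concrete as writing the analysis script before any data arrives. Below is a minimal sketch of the t-test from the last bullet, using scipy; the ratings and variable names are made-up placeholders, not real data or a prescribed recipe.

```python
# Minimal sketch of the planned significance test, assuming the ratings
# end up as two lists of 7-point Likert scores (placeholder numbers below).
from scipy import stats

my_amazing_ratings = [6, 5, 7, 6, 4, 6, 5]       # hypothetical ratings for MyAmazing
boring_baseline_ratings = [4, 5, 5, 3, 4, 5, 4]  # hypothetical ratings for BoringBaseline

t_stat, p_value = stats.ttest_ind(my_amazing_ratings, boring_baseline_ratings)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```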

The key point is to plan this in advance, **before** you start the experiment; what you don't want to do is figure out the above while you are doing the experiment! One concrete way of doing this is to pre-register your experiment; this process will force you to explain in detail what you plan to do.

Details matter

It's important to be detailed in your plan. Details include:

  • Which system versions will you use, especially for baseline and SOTA comparisons?
  • If you are working with a dataset, how will you split it into test, validation, and train partitions, and how will you ensure that there is no contamination between these partitions? (See the sketch after this list.)
  • Which versions of BLEURT (etc) will you use?
  • Will human subjects be trained and/or given explanatory material?
  • What UI will human subjects see?
  • If you plan on excluding outliers, how will you do this?
  • etc.
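On the dataset point, the split itself can be pinned down in a few lines of code, with a fixed seed and deduplication so that no item can appear in two partitions. This is only a minimal sketch (the function name and split fractions are my own placeholders):

```python
# Minimal sketch of a fixed, documented data split (assumes items are strings).
import random

def split_dataset(items, seed=42, train_frac=0.8, val_frac=0.1):
    """Deterministically split items into train/validation/test partitions."""
    items = sorted(set(items))   # deduplicate, so no item appears in two partitions
    rng = random.Random(seed)    # fixed seed, so the split is reproducible
    rng.shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```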

The key point is that details such as the above really matter. We know that details are very important in experiments in physics and medicine; they are also important in experiments in AI and computing!

Keep it simple

The KISS principle (Keep it simple, stupid!) absolutely applies to experimental design. A complex and/or non-standard experiment is much more likely to go wrong than a simple experiment.

For example, if you’re writing code to support your evaluation, remember that software engineering tells us to expect at least 1 bug per 100 lines of code in software which has not gone through a high-quality software test and quality-assurance process (which is essentially unheard of for academic research code). So writing more code increases the likelihood of your experiment being disrupted by software bugs.

Likewise, if you’re running a human evaluation and ask subjects to answer lots of complicated questions, there is a good chance that the subjects will get confused, bored, and/or tired; it is much better to ask just a few simple questions! For example, you may be keen to get ratings of generated texts on 10 different quality criteria, but you’re far more likely to get good data if you only ask about 2-3 criteria.

Pilot

Before you do your main experiment, always do a small-scale pilot first! In the pilot, you go through the entire process, but on a much smaller scale; the goal is to identify problems in the experiment, not to gather data.

Piloting almost always makes sense for human experiments; I have seen many problems which could have been detected by a pilot. These include UI problems, data-capture failures (we once lost a lot of data because of a database problem), outlier issues, crowdworker problems, and code bugs. I think pilots are also useful in metric-based evaluations; they can detect code problems and issues with strange edge cases, amongst other things.

Because the goal of a pilot is to identify problems, it's essential to check the “results” of the pilot to ensure that they accurately reflect what happened in the experiment.
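Checking the pilot “results” can itself be a small script. Here is a minimal sketch, assuming the collected ratings land in a CSV file; the file name and column names are hypothetical, not a standard format.

```python
# Minimal sanity-check sketch for pilot data (file and column names are hypothetical).
import pandas as pd

pilot = pd.read_csv("pilot_ratings.csv")

# Did every system x scenario combination actually get rated?
print(pilot.groupby(["system", "scenario"]).size())

# Any missing or out-of-range Likert ratings (expected range 1-7)?
print(pilot["accuracy"].isna().sum(), "missing accuracy ratings")
print(pilot[~pilot["accuracy"].between(1, 7)])
```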

Be Careful

Last but not least, be careful when you do experiments! Sloppy experiments are worthless in medicine and physics, and they are also worthless in computing and AI. Be careful, don't take shortcuts, and investigate anomalies.

For example, if you see near-identical results for the MyAmazing and BoringBaseline systems, check whether this is due to a bug in how the experiment was done (maybe the wrong data was sent to BLEURT?) or in how the data was analysed (maybe you set an incorrect parameter in your R/Python analytics code?). It's exciting to see surprising and unexpected results in an evaluation, but check that they are genuine!
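One cheap check when two systems score suspiciously alike is to confirm that the texts being scored actually differ. A tiny sketch of this idea, where the output lists and function name are hypothetical:

```python
def count_identical_outputs(my_amazing_outputs, baseline_outputs):
    """Count how many output pairs are identical after trimming whitespace."""
    return sum(a.strip() == b.strip()
               for a, b in zip(my_amazing_outputs, baseline_outputs))

# Hypothetical usage: if most pairs are identical, the wrong files were
# probably sent to the metric, or the same file was loaded twice.
# n = count_identical_outputs(my_amazing_outputs, baseline_outputs)
```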

Final thoughts

Obviously there is a lot more to doing good experiments than “Plan ahead, details matter, keep it simple, pilot, be careful”. On the other hand, I’ve seen a lot of experiments which suffered because these principles were not followed, so they are important.

Also, please let me know if you have your own ideas for how to give useful evaluation advice in 10 words!
