
Peer Review Has Improved My Papers

There is a lot of discussion in academic circles about how to improve peer review. I think one key issue is convincing authors to take a more positive view of it, so that they see it as a way of improving the quality of their papers, instead of just as an accept/reject “gate”.

In other words, peer review is a quality assurance process, like software testing. But while almost all software developers I have worked with are very enthusiastic about software testing (ie, they want their code to be thoroughly tested, because testing makes their code more useful), many (most?) academics are not very enthusiastic about peer review (ie, they see it as a necessary evil instead of as a way of improving their work). This negative attitude makes peer review much less helpful.

Anyways, instead of talking about this abstractly, I thought I’d present a few concrete cases where peer review has really helped me improve my papers. I hope this will encourage people to take a more positive view of peer review!

SumTime paper: more experiments

One example which I have briefly mentioned before is my 2005 paper on our SumTime system, which generated weather forecasts. This was submitted to a special issue of the Artificial Intelligence journal on “Connecting Language to the World”; in fact I was one of the guest editors of the special issue. Of course I was not involved in reviewing my own paper!

Anyways, in this paper we described our work on choosing words in weather forecasts based on data. We basically created a corpus of human-written forecasts, aligned them with the numerical forecast data which they were based on, and then built classifiers to choose words such as time phrases (eg, “by evening”) and verbs (eg, “easing”). We also analysed the classifiers (which were white box, eg decision trees) in order to understand the impact of different factors. This led to some surprising insights, such as major differences in how individual forecasters used words.
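
For readers who are curious what this looks like in practice, here is a minimal sketch of the idea using scikit-learn. The features, toy data points, and verb labels are invented purely for illustration; they are not the actual SumTime corpus or models.

```python
# A toy sketch (not the actual SumTime models): learn a white-box classifier
# that picks a forecast verb from numerical forecast data.
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: change in wind speed (knots) and wind direction
# (degrees) between two forecast times, paired with the verb a (hypothetical)
# human forecaster used for that transition.
X = [[-8, 10], [-12, 5], [6, 20], [10, 15], [0, 90], [-2, 120]]
y = ["easing", "easing", "freshening", "freshening", "veering", "veering"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Because the tree is white box, its rules can be inspected directly,
# which is the kind of analysis that shows how word choice depends on the
# underlying data (and on the individual forecaster).
print(export_text(clf, feature_names=["speed_change", "direction_change"]))
print(clf.predict([[-5, 0]]))  # a falling wind speed should come out as "easing"
```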

When we submitted this paper, the reviewers said it looked interesting, but needed a user evaluation; in other words, they wanted us to ask forecast users to evaluate forecasts built using our word-choice models. We had in fact been considering doing this, but had mixed feelings because the evaluation was not easy to run. But when the reviewers insisted on a better evaluation, we decided we had to go ahead and do it. And the results were amazing; in some cases forecast users preferred our computer-generated forecasts over human-written forecasts, because they used more appropriate words!

In short, by insisting that we do a proper evaluation, the reviewers massively improved our paper.

BLEU survey: better presentation of results

Another example is my 2018 paper which presented a structured survey of the validity of BLEU; this was published in the Computational Linguistics journal. This paper arose from my continuing frustration with the widespread use of the BLEU metric to evaluate text generation. Motivated by the structured surveys I had seen in medicine, I decided to do the same thing in NLP. That is, I used the PRISMA methodology to find all papers which assessed how well BLEU correlated with human evaluations (structured search meant finding all relevant papers in the ACL Anthology, not just the ones I happened to be aware of), extracted the BLEU-human correlations reported in these papers, and presented a summary of these correlations.

Anyways, when I submitted this to CL, the reviewers responded that they liked what I was doing, but did not like the presentation of results (which was basically a table in my original version). They made a number of suggestions, including using a box plot to summarise the data. I did this, and it really improved the paper, by making the results much easier to understand; indeed whenever I talk about this work (which I still occasionally do even today), I always use the box plots.
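
To give a flavour of what this looks like, the sketch below draws the kind of box plot the reviewers suggested, with one box per group of studies. The groupings and correlation values are made up for illustration; they are not the numbers from the actual survey.

```python
# An illustrative sketch with invented numbers (not the survey's actual data):
# summarise metric-human correlations with one box per group of studies,
# so the spread of reported correlations is visible at a glance.
import matplotlib.pyplot as plt

# Hypothetical BLEU-human correlation coefficients, grouped by task.
correlations = {
    "MT (system level)": [0.85, 0.92, 0.78, 0.95, 0.60],
    "MT (segment level)": [0.30, 0.45, 0.22, 0.50, 0.38],
    "NLG": [0.10, 0.35, -0.05, 0.48, 0.25],
}

fig, ax = plt.subplots()
ax.boxplot(list(correlations.values()))
ax.set_xticks(range(1, len(correlations) + 1), labels=list(correlations.keys()))
ax.set_ylabel("Correlation with human evaluation")
plt.show()
```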

In short, by telling me to present my data better, the reviewers made my paper much better and more impactful. Indeed it is one of my top-ten most cited papers, which I doubt would have happened without this change.

Final thoughts

Readers may note that both of the above examples were from high-quality journals. I don’t think I have ever had conference reviews which led to major improvements in my papers. This is largely because there simply isn’t enough time (between getting the reviews and submitting the final “camera-ready” version of the paper) to make major changes in a conference reviewing context. So conference reviewing really is primarily an accept/reject “gate” process.

But journal reviewing, especially for high-end journals, does have the potential to seriously improve the quality of papers (I could have given many other examples, including from papers which are currently in the review process). It doesn’t always happen, of course, but it can happen, and this is one reason why I prefer publishing in journals.
