TY - JOUR
T1 - Peer reviews of peer reviews
T2 - A randomized controlled trial and other experiments
AU - Goldberg, Alexander
AU - Stelmakh, Ivan
AU - Cho, Kyunghyun
AU - Oh, Alice
AU - Agarwal, Alekh
AU - Belgrave, Danielle
AU - Shah, Nihar B.
N1 - Publisher Copyright:
© 2025 Goldberg et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2025/4
Y1 - 2025/4
AB - Is it possible to reliably evaluate the quality of peer reviews? We study this question driven by two primary motivations – incentivizing high-quality reviewing using assessed quality of reviews and measuring changes to review quality in experiments. We conduct a large scale study at the NeurIPS 2022 conference, a top-tier conference in machine learning, in which we invited (meta)-reviewers and authors to voluntarily evaluate reviews given to submitted papers. First, we conduct a randomized controlled trial to examine bias due to the length of reviews. We generate elongated versions of reviews by adding substantial amounts of non-informative content. Participants in the control group evaluate the original reviews, whereas participants in the experimental group evaluate the artificially lengthened versions. We find that lengthened reviews are scored (statistically significantly) higher quality than the original reviews. Additionally, in analysis of observational data we find that authors are positively biased towards reviews recommending acceptance of their own papers, even after controlling for confounders of review length, quality, and different numbers of papers per author. We also measure disagreement rates between multiple evaluations of the same review of 28% – 32%, which is comparable to that of paper reviewers at NeurIPS. Further, we assess the amount of miscalibration of evaluators of reviews using a linear model of quality scores and find that it is similar to estimates of miscalibration of paper reviewers at NeurIPS. Finally, we estimate the amount of variability in subjective opinions around how to map individual criteria to overall scores of review quality and find that it is roughly the same as that in the review of papers. Our results suggest that the various problems that exist in reviews of papers – inconsistency, bias towards irrelevant factors, miscalibration, subjectivity – also arise in reviewing of reviews.
UR - http://www.scopus.com/inward/record.url?scp=105001984765&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105001984765&partnerID=8YFLogxK
U2 - 10.1371/journal.pone.0320444
DO - 10.1371/journal.pone.0320444
M3 - Article
C2 - 40173178
AN - SCOPUS:105001984765
SN - 1932-6203
VL - 20
JO - PLOS ONE
JF - PLOS ONE
IS - 4
M1 - e0320444
ER -