Abstract
Peer review is a key ingredient in evaluating the quality of scientific work. Based on the review scores that individual reviewers assign to papers, conference program committees and journal editors decide which papers to accept for publication and which to reject. A similar procedure is part of the selection process for grant applications and, among other fields, in sports. It is well known that the reviewing process suffers from measurement errors due to a lack of agreement among multiple reviewers of the same paper. Moreover, if not all papers are reviewed by all reviewers, the naive approach of averaging the scores is biased. Several statistical methods for aggregating review scores are proposed, all of which can be realized with standard statistical software. The simplest method uses the well-known fixed-effects two-way classification with identical variances, while a more advanced method assumes different variances. As alternatives, a mixed linear model and a generalized linear model are employed. The application of these methods implies an evaluation of the reviewers themselves, which may help to improve reviewing processes. An application to real conference data shows the potential of these statistical methods.
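The bias of naive score averaging under incomplete assignments, and its correction by a fixed-effects two-way classification, can be illustrated with a small sketch. The toy data below are hypothetical (not from the paper's conference dataset): three papers, two reviewers with opposite biases, and two papers seen by only one reviewer each. Modeling each score as paper effect plus reviewer effect (reviewer 0 as baseline, for identifiability) recovers the true quality differences.

```python
import numpy as np

# Hypothetical toy data: each tuple is (paper index, reviewer index, score).
# Reviewer 0 is generous (+1), reviewer 1 is harsh (-1); the true paper
# qualities are 5, 3, 4.  Papers 0 and 1 are each seen by only one reviewer.
obs = [(0, 0, 6.0), (1, 1, 2.0), (2, 0, 5.0), (2, 1, 3.0)]
n_papers, n_reviewers = 3, 2

# Naive per-paper averages inherit the bias of whichever reviewer was assigned.
naive = [np.mean([s for p, r, s in obs if p == i]) for i in range(n_papers)]

# Two-way fixed-effects model: score = paper_effect + reviewer_effect,
# with reviewer 0 as the baseline to make the design matrix full rank.
X = np.zeros((len(obs), n_papers + n_reviewers - 1))
y = np.zeros(len(obs))
for k, (p, r, s) in enumerate(obs):
    X[k, p] = 1.0
    if r > 0:
        X[k, n_papers + r - 1] = 1.0
    y[k] = s
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
calibrated = beta[:n_papers]        # paper effects = calibrated scores
reviewer_effect = beta[n_papers:]   # harshness relative to reviewer 0

print(naive)        # [6.0, 2.0, 4.0] -> true gap of 2 inflated to 4
print(calibrated)   # [6.0, 4.0, 5.0] -> gap between papers 0 and 1 is 2 again
```

The naive averages rank the papers correctly here but exaggerate their quality gap by exactly the difference in reviewer biases; the calibrated paper effects restore the true gap of 2 and simultaneously estimate reviewer 1 as two points harsher than reviewer 0.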

Notes
The goodness-of-fit measure AIC depends on the maximized log likelihood and a penalty for an increasing number of estimated parameters. A model with a lower AIC should be preferred.
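For reference, the standard definition of the AIC combines the maximized log-likelihood $\hat{\ell}$ with a penalty on the number $k$ of estimated parameters:

```latex
\mathrm{AIC} = 2k - 2\hat{\ell}
```

Adding a parameter lowers the AIC only if it raises the maximized log-likelihood by more than 1, which is why the criterion guards against overfitting when comparing the models above.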
References
Alvo M, Yu PLH (2014) Statistical methods for ranking data. Springer, Berlin
Anastasi A, Urbina S (1997) Psychological testing. Prentice Hall, Englewood Cliffs
Baumeister D, Erdélyi G, Hemaspaandra E, Hemaspaandra L, Rothe J (2010) Computational aspects of approval voting. In: Laslier J, Sanver R (eds) Handbook on approval voting. Springer, Berlin, pp 199–251
Bornmann L, Mutz R, Marx W, Schier H, Daniel H-D (2011) A multilevel modelling approach to investigating the predictive validity of editorial decisions: Do the editors of a high profile journal select manuscripts that are highly cited after publication? J R Stat Soc A 174(4):857–879
Brams S, Fishburn P (2002) Voting procedures. In: Arrow K, Sen A, Suzumura K (eds) Handbook of social choice and welfare. North-Holland, Amsterdam, pp 173–236
Brandt F, Conitzer V, Endriss U (2013) Computational social choice. In: Weiß G (ed) Multiagent systems, 2nd edn. MIT Press, Cambridge, pp 213–283
Chubin DE (1994) Grants peer review in theory and practice. Eval Rev 18(1):20–30
Cicchetti DV (1980) Reliability of reviews for the American Psychologist: A biostatistical assessment of the data. Am Psychol 35(3):300–303
Cohen RJ, Swerdlik ME (2005) Psychological testing and assessment. McGraw Hill, New York
Conitzer V, Rothe J (eds) (2010) Proceedings of the 3rd international workshop on computational social choice. Universität Düsseldorf
Critchlow DE, Fligner MA, Verducci JS (1991) Probability models on rankings. J Math Psychol 35:294–318
Draper N, Smith H (1998) Applied regression analysis, 3rd edn. Wiley series in probability and statistics. Wiley, New York
Faliszewski P, Hemaspaandra E, Hemaspaandra L, Rothe J (2009) A richer understanding of the complexity of election systems. In: Ravi S, Shukla S (eds) Fundamental problems in computing: essays in honor of Professor Daniel J. Rosenkrantz. Springer, Berlin, pp 375–406
Fligner MA, Verducci JS (1990) Posterior probabilities for a consensus ordering. Psychometrika 55(1):53–63
Gao X, Alvo M (2005) A unified nonparametric approach for unbalanced factorial designs. J Am Stat Assoc 100:926–941
Jayasinghe UW, Marsh HW, Bond NW (2003) A multilevel cross-classified modelling approach to peer review of grant proposals: the effects of assessor and researcher attributes on assessor ratings. J R Stat Soc A 166(3):279–300
Jayasinghe UW, Marsh HW, Bond NW (2006) A new reader trial approach to peer review in funding research grants: an Australian experiment. Scientometrics 69(3):591–605
Koch K (1999) Parameter estimation and hypothesis testing in linear models, 2nd edn. Springer, Berlin
Lauw H, Lim E, Wang K (2007) Summarizing review scores of “unequal” reviewers. In: Proceedings of the 7th SIAM international conference on data mining. SIAM, pp 539–544
Lauw H, Lim E, Wang K (2008) Bias and controversy in evaluation systems. IEEE Trans Knowl Data Eng 20(11):1490–1504
Lee CJ, Sugimoto CR, Zhang G, Cronin B (2013) Bias in peer review. J Am Soc Inf Sci Technol 64(1):2–17
Marden JI (1995) Analyzing and modelling rank data. Chapman and Hall, London
Marsh HW, Ball S (1981) Interjudgmental reliability of reviews for the Journal of Educational Psychology. J Educ Psychol 73(6):872–880
Marsh HW, Jayasinghe UW, Bond NW (2008) Improving the peer-review process for grant applications. Am Psychol 63(3):160–168
Marsh HW, Bond NW, Jayasinghe UW (2007) Peer review process: assessments by application-nominated referees are biased, inflated, unreliable and invalid. Aust Psychol 42:33–38
Pedhazur E, Pedhazur Schmelkin L (1991) Measurement, design, and analysis: an integrated approach. Lawrence Erlbaum Associates, London
Poon WY, Chan W (2002) Influence analysis of ranking data. Psychometrika 67(3):421–436
Rothe J, Baumeister D, Lindner C, Rothe I (2011) Einführung in Computational Social Choice: Individuelle Strategien und kollektive Entscheidungen beim Spielen, Wählen und Teilen. Spektrum Akademischer Verlag, Heidelberg
Scheuermann B, Kiess W, Roos M, Jarre F, Mauve M (2009) On the time synchronization of distributed log files in networks with local broadcast media. IEEE/ACM Trans Netw 17(2):431–444
Sokal R, Rohlf F (2012) Biometry: the principles and practice of statistics in biological research, 4th edn. W. H. Freeman, San Francisco
West BT, Welch KB, Galecki AT (2007) Linear mixed models: a practical guide using statistical software. Chapman and Hall/CRC, London
Yu PLH (2000) Bayesian analysis of order-statistics models for ranking data. Psychometrika 65(3):281–299
Acknowledgments
The authors cordially thank an anonymous referee for valuable constructive comments on an earlier version of the paper and Mayer Alvo for information on the use of model (2) in statistical ranking. This work was supported in part by DFG grant RO 1202/15-1.
Additional information
Joachim Rudolph: Deceased on November 17, 2014.
Cite this article
Kuhlisch, W., Roos, M., Rothe, J. et al. A statistical approach to calibrating the scores of biased reviewers of scientific papers. Metrika 79, 37–57 (2016). https://doi.org/10.1007/s00184-015-0542-z