Abstract
Peer review is a key ingredient in evaluating the quality of scientific work. Based on the review scores that individual reviewers assign to papers, conference program committees and journal editors decide which papers to accept for publication and which to reject. A similar procedure is part of the selection process for grant applications and, among other fields, in sports. It is well known that the reviewing process suffers from measurement errors due to a lack of agreement among multiple reviewers of the same paper. Moreover, if not all papers are reviewed by all reviewers, the naive approach of averaging the scores is biased. Several statistical methods for aggregating review scores are proposed, all of which can be realized with standard statistical software. The simplest method uses the well-known fixed-effects two-way classification with identical variances, while a more advanced method assumes different variances. As alternatives, a mixed linear model and a generalized linear model are employed. The application of these methods implies an evaluation of the reviewers themselves, which may help to improve reviewing processes. An application to real conference data shows the potential of these statistical methods.
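The bias of naive score averaging under incomplete assignments, and its correction by a fixed-effects two-way classification, can be illustrated with a small sketch. The toy data below are hypothetical (not from the paper's conference dataset): three papers, two reviewers with opposite biases, and two papers seen by only one reviewer each. Modeling each score as paper effect plus reviewer effect (reviewer 0 as baseline, for identifiability) recovers the true quality differences.

```python
import numpy as np

# Hypothetical toy data: each tuple is (paper index, reviewer index, score).
# Reviewer 0 is generous (+1), reviewer 1 is harsh (-1); the true paper
# qualities are 5, 3, 4.  Papers 0 and 1 are each seen by only one reviewer.
obs = [(0, 0, 6.0), (1, 1, 2.0), (2, 0, 5.0), (2, 1, 3.0)]
n_papers, n_reviewers = 3, 2

# Naive per-paper averages inherit the bias of whichever reviewer was assigned.
naive = [np.mean([s for p, r, s in obs if p == i]) for i in range(n_papers)]

# Two-way fixed-effects model: score = paper_effect + reviewer_effect,
# with reviewer 0 as the baseline to make the design matrix full rank.
X = np.zeros((len(obs), n_papers + n_reviewers - 1))
y = np.zeros(len(obs))
for k, (p, r, s) in enumerate(obs):
    X[k, p] = 1.0
    if r > 0:
        X[k, n_papers + r - 1] = 1.0
    y[k] = s
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
calibrated = beta[:n_papers]        # paper effects = calibrated scores
reviewer_effect = beta[n_papers:]   # harshness relative to reviewer 0

print(naive)        # [6.0, 2.0, 4.0] -> true gap of 2 inflated to 4
print(calibrated)   # [6.0, 4.0, 5.0] -> gap between papers 0 and 1 is 2 again
```

The naive averages rank the papers correctly here but exaggerate their quality gap by exactly the difference in reviewer biases; the calibrated paper effects restore the true gap of 2 and simultaneously estimate reviewer 1 as two points harsher than reviewer 0.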

Notes
The goodness-of-fit measure AIC depends on the maximized log likelihood and a penalty for an increasing number of estimated parameters. A model with a lower AIC should be preferred.
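For reference, the standard definition of the AIC combines the maximized log-likelihood $\hat{\ell}$ with a penalty on the number $k$ of estimated parameters:

```latex
\mathrm{AIC} = 2k - 2\hat{\ell}
```

Adding a parameter lowers the AIC only if it raises the maximized log-likelihood by more than 1, which is why the criterion guards against overfitting when comparing the models above.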
References
Alvo M, Yu PLH (2014) Statistical methods for ranking data. Springer, Berlin
Anastasi A, Urbina S (1997) Psychological testing. Prentice Hall, Englewood Cliffs
Baumeister D, Erdélyi G, Hemaspaandra E, Hemaspaandra L, Rothe J (2010) Computational aspects of approval voting. In: Laslier J, Sanver R (eds) Handbook on approval voting. Springer, Berlin, pp 199–251
Bornmann L, Mutz R, Marx W, Schier H, Daniel H-D (2011) A multilevel modelling approach to investigating the predictive validity of editorial decisions: Do the editors of a high profile journal select manuscripts that are highly cited after publication? J R Stat Soc A 174(4):857–879
Brams S, Fishburn P (2002) Voting procedures. In: Arrow K, Sen A, Suzumura K (eds) Handbook of social choice and welfare. North-Holland, Amsterdam, pp 173–236
Brandt F, Conitzer V, Endriss U (2013) Computational social choice. In: Weiß G (ed) Multiagent systems, 2nd edn. MIT Press, Cambridge, pp 213–283
Chubin DE (1994) Grants peer review in theory and practice. Eval Rev 18(1):20–30
Cicchetti DV (1980) Reliability of reviews for the American Psychologist: A biostatistical assessment of the data. Am Psychol 35(3):300–303
Cohen RJ, Swerdlik ME (2005) Psychological testing and assessment. McGraw Hill, New York
Conitzer V, Rothe J (eds) (2010) Proceedings of the 3rd international workshop on computational social choice. Universität Düsseldorf
Critchlow DE, Fligner MA, Verducci JS (1991) Probability models on rankings. J Math Psychol 35:294–318
Draper N, Smith H (1998) Applied regression analysis, 3rd edn. Wiley series in probability and statistics. Wiley, New York
Faliszewski P, Hemaspaandra E, Hemaspaandra L, Rothe J (2009) A richer understanding of the complexity of election systems. In: Ravi S, Shukla S (eds) Fundamental problems in computing: essays in honor of Professor Daniel J. Rosenkrantz. Springer, Berlin, pp 375–406
Fligner MA, Verducci JS (1990) Posterior probabilities for a consensus ordering. Psychometrika 55(1):53–63
Gao X, Alvo M (2005) A unified nonparametric approach for unbalanced factorial designs. J Am Stat Assoc 100:926–941
Jayasinghe UW, Marsh HW, Bond NW (2003) A multilevel cross-classified modelling approach to peer review of grant proposals: the effects of assessor and researcher attributes on assessor ratings. J R Stat Soc A 166(3):279–300
Jayasinghe UW, Marsh HW, Bond NW (2006) A new reader trial approach to peer review in funding research grants: an Australian experiment. Scientometrics 69(3):591–605
Koch K (1999) Parameter estimation and hypothesis testing in linear models, 2nd edn. Springer, Berlin
Lauw H, Lim E, Wang K (2007) Summarizing review scores of “unequal” reviewers. In: Proceedings of the 7th SIAM international conference on data mining. SIAM, pp 539–544
Lauw H, Lim E, Wang K (2008) Bias and controversy in evaluation systems. IEEE Trans Knowl Data Eng 20(11):1490–1504
Lee CJ, Sugimoto CR, Zhang G, Cronin B (2013) Bias in peer review. J Am Soc Inf Sci Technol 64(1):2–17
Marden JI (1995) Analyzing and modelling rank data. Chapman and Hall, London
Marsh HW, Ball S (1981) Interjudgmental reliability of reviews for the Journal of Educational Psychology. J Educ Psychol 73(6):872–880
Marsh HW, Jayasinghe UW, Bond NW (2008) Improving the peer-review process for grant applications. Am Psychol 63(3):160–168
Marsh HW, Bond NW, Jayasinghe UW (2007) Peer review process: assessments by application-nominated referees are biased, inflated, unreliable and invalid. Aust Psychol 42:33–38
Pedhazur E, Pedhazur Schmelkin L (1991) Measurement, design, and analysis: an integrated approach. Lawrence Erlbaum Associates, London
Poon WY, Chan W (2002) Influence analysis of ranking data. Psychometrika 67(3):421–436
Rothe J, Baumeister D, Lindner C, Rothe I (2011) Einführung in Computational Social Choice: Individuelle Strategien und kollektive Entscheidungen beim Spielen, Wählen und Teilen. Spektrum Akademischer Verlag, Heidelberg
Scheuermann B, Kiess W, Roos M, Jarre F, Mauve M (2009) On the time synchronization of distributed log files in networks with local broadcast media. IEEE/ACM Trans Netw 17(2):431–444
Sokal R, Rohlf F (2012) Biometry: the principles and practice of statistics in biological research, 4th edn. W. H. Freeman, San Francisco
West BT, Welch KB, Galecki AT (2007) Linear mixed models: a practical guide using statistical software. Chapman and Hall/CRC, London
Yu PLH (2000) Bayesian analysis of order-statistics models for ranking data. Psychometrika 65(3):281–299
Acknowledgments
The authors cordially thank an anonymous referee for valuable constructive comments on an earlier version of the paper and Mayer Alvo for information on the use of model (2) in statistical ranking. This work was supported in part by DFG grant RO 1202/15-1.
Additional information
Joachim Rudolph: Deceased on November 17, 2014.
Cite this article
Kuhlisch, W., Roos, M., Rothe, J. et al. A statistical approach to calibrating the scores of biased reviewers of scientific papers. Metrika 79, 37–57 (2016). https://doi.org/10.1007/s00184-015-0542-z