Volume 79, Issue 1, pp 37–57

A statistical approach to calibrating the scores of biased reviewers of scientific papers

  • Wiltrud Kuhlisch
  • Magnus Roos
  • Jörg Rothe
  • Joachim Rudolph
  • Björn Scheuermann
  • Dietrich Stoyan


Peer reviewing is the key ingredient of evaluating the quality of scientific work. Based on the review scores assigned by individual reviewers to papers, program committees of conferences and journal editors decide which papers to accept for publication and which to reject. A similar procedure is used in the selection of grant applications and in other fields, such as sports. It is well known that the reviewing process suffers from measurement errors due to a lack of agreement among multiple reviewers of the same paper. Moreover, if not all papers are reviewed by all reviewers, the naive approach of averaging the scores is biased. Several statistical methods are proposed for aggregating review scores, all of which can be realized with standard statistical software. The simplest method uses the well-known fixed-effects two-way classification with identical variances, while a more advanced method assumes different variances. As alternatives, a mixed linear model and a generalized linear model are employed. Applying these methods also yields an evaluation of the reviewers, which may help to improve reviewing processes. An application example with real conference data shows the potential of these statistical methods.
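The bias of plain averaging under an incomplete reviewer assignment, and the idea behind the fixed-effects two-way classification, can be illustrated with a small sketch. This is not the authors' implementation: the function name `calibrate`, the toy scores, and the sum-to-zero constraints on the effects are assumptions chosen for illustration, and the model fitted here is the generic additive form score = overall mean + paper effect + reviewer effect with identical variances.

```python
import numpy as np

def calibrate(scores, n_papers, n_reviewers):
    """Least-squares fit of score = mu + paper_effect + reviewer_effect.

    scores: list of (paper_index, reviewer_index, score) triples; not every
    reviewer has to score every paper.
    Returns (calibrated_paper_scores, reviewer_effects).
    """
    n_params = 1 + n_papers + n_reviewers
    # Two extra rows encode the (assumed) identifiability constraints.
    X = np.zeros((len(scores) + 2, n_params))
    y = np.zeros(len(scores) + 2)
    for k, (paper, reviewer, score) in enumerate(scores):
        X[k, 0] = 1.0                         # overall mean
        X[k, 1 + paper] = 1.0                 # paper effect
        X[k, 1 + n_papers + reviewer] = 1.0   # reviewer effect
        y[k] = score
    X[-2, 1:1 + n_papers] = 1.0   # paper effects sum to zero
    X[-1, 1 + n_papers:] = 1.0    # reviewer effects sum to zero
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    mu = theta[0]
    paper_effects = theta[1:1 + n_papers]
    reviewer_effects = theta[1 + n_papers:]
    return mu + paper_effects, reviewer_effects

# Hypothetical, noise-free toy data: three papers, three reviewers,
# each paper scored by only two of the reviewers.
scores = [
    (0, 0, 5.0), (0, 1, 6.0),
    (1, 1, 5.4), (1, 2, 6.4),
    (2, 0, 3.0), (2, 2, 5.0),
]

# Naive aggregation: average the scores each paper happened to receive.
naive = np.array([
    np.mean([s for p, _, s in scores if p == i]) for i in range(3)
])

calibrated, reviewer_effects = calibrate(scores, n_papers=3, n_reviewers=3)
print("naive averages:   ", naive)
print("calibrated scores:", calibrated)
print("reviewer effects: ", reviewer_effects)
```

In this toy data set, reviewer 0 scores systematically lower and reviewer 2 systematically higher than reviewer 1. Plain averaging therefore ranks paper 1 above paper 0 simply because paper 1 drew the more lenient pair of reviewers, whereas the fitted paper effects place paper 0 first. The estimated reviewer effects are the kind of by-product the abstract refers to as an evaluation of the reviewers. The fit is only well defined when the assignment connects all papers and reviewers, directly or through shared reviews.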


Analysis of variance · Model with main effects · Design matrix · Peer reviewing · Review scores

Mathematics Subject Classification

62J10 91B82 



The authors cordially thank an anonymous referee for valuable constructive comments on an earlier version of the paper and Mayer Alvo for information on the use of model (2) in statistical ranking. This work was supported in part by DFG grant RO 1202/15-1.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Wiltrud Kuhlisch (1)
  • Magnus Roos (2)
  • Jörg Rothe (3)
  • Joachim Rudolph (4)
  • Björn Scheuermann (5)
  • Dietrich Stoyan (6)

  1. Institut für Mathematische Stochastik, TU Dresden, Dresden, Germany
  2. Institut für Sprache und Information, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany
  3. Institut für Informatik, Heinrich-Heine-Universität Düsseldorf, Düsseldorf, Germany
  4. Institut für Sozialwissenschaften, Humboldt-Universität zu Berlin, Berlin, Germany
  5. Institut für Informatik, Humboldt-Universität zu Berlin, Berlin, Germany
  6. Institut für Stochastik, TU Bergakademie Freiberg, Freiberg, Germany
